Hi again, just popping in to share a little follow-up to what has been happening lately. I've copied the broken archive to an external SSD I had lying around and connected it to an unused, elderly MacBook Pro. This way I could make sure there isn't some weird hardware issue with my work Mac, and I wouldn't have all that extra I/O going on all the time.

Here's what I observed: QRecallHelper will not exceed its allowance of »wired« or »reserved« memory, but it will take on more and more »virtual« memory over time. As htop's VIRT column includes mapped files, I'm not worried about this reaching huge values. iStat Menus' memory dropdown shows QRecallHelper growing wildly as well, though. For comparison, Safari reaches some 4 GiB after a few days. After around 24 hours, iStat Menus will list QRecallHelper using some 50 GiB of RAM (which can't be true, the sum of physical and swap memory is less than that, but what does it count then?), and some time later QRecallHelper will just be gone from the process list. The qrecall command line utility will then just stop counting time and stall forever until killed (a friendly kill is enough). It will complain about being unable to reach its (long dead) QRecallHelper.

I tried manipulating the clock and QRecall happily adopted my changes, counting run times of several thousand hours, so it isn't dying upon rolling over a full day even though it does die after about a day's worth of runtime. The logs don't mention any reason for the process ending; there's just the same chatter on and on for hours and then it just stops. A caught error leading to termination would be logged and could then be found near the end of the logfile; this is not the case, though. There's also no core dump or any meaningful entry in macOS' Console.app on the matter.

I then put my hopes into trying an incremental approach instead of doing all the work at once, so I set up a repair action to be run every ten hours with a condition to stop after eight hours. It does stop after eight hours, and it will start anew for another eight hours every time. Unfortunately, it starts from scratch every time and does not continue where it left off. I can't find log entries saying that the existing data had to be scrapped or anything, so I assume that while a compact operation can run incrementally, a repair action cannot.

Finally I looked around the various options to the commands, hoping to find a way to have it just extract individual files or layers, or some way to force reindexing while ignoring errors to create an index that would allow me to get something instead of nothing, but without any luck.

I am readying to formally declare this archive beyond repair. I'll still keep it around to try it on any other Macs I find myself possessing and/or any new versions of QRecall as they come out; maybe someday some combination of hardware and software will perform a miracle cure. For my daily driver Mac I've had to start a new archive, though. I'm not feeling safe with no system backup at all. Better to lose ten years' worth of history than to not prepare for the future.

Here's a thing that came to my mind: a repair action will first dig through the entirety of an archive, analyzing it, before starting to take action (with the exception of a reindex, but it refuses that here). Couldn't we have some kind of scavenge mode, especially with the --recover option, that would start writing a new archive immediately, adding files and layers at the pace they are discovered in the source archive?
This way, when the process dies/fails, there's at least everything up to that point. Knowing far too little about the inner workings of the repository, this may of course range anywhere from a quick'n'easy one-hour hack to something utterly impossible on your end. Anyhow - I'm ready to put this case to rest. If my observations gave you any new ideas, just dictate actions and I will take them and report back. Do stay healthy, Norbert
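P.S.: In case it's useful, the memory numbers above could be reproduced with a plain shell sampling loop along these lines; nothing QRecall-specific, it just polls ps every five minutes for a process named QRecallHelper and logs its resident and virtual sizes:

while sleep 300; do
    printf '%s ' "$(date '+%F %T')"
    # rss and vsz are reported in KiB; the [Q] keeps grep from matching itself
    ps axo rss,vsz,comm | grep -m1 '[Q]RecallHelper' || echo 'QRecallHelper gone'
done >> helper-memory.log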
|
|
|
Hi again, thanks a lot for the quick reply, as always I've learned a lot.

So the repair slides through the file, one word at a time, trying to find the start of an actual data block. What I'm witnessing could then be one or several large files that were purged by a rolling merge some time in the past and are no longer part of the actual archive, and as the repair operation doesn't know (or doesn't trust the index to reliably know) where a data block starts, it just has to keep trying until the next actual block of data starts. I agree that logging all of that does seem a little obsessive. Now that I know that, I'll just let it finish. At some point there has to be actual data again, and then the rest of the operation should finish in a more timely manner.

Memory leak: it is possible that some other application had been causing the runaway swapping and QRecall was just the victim. I can imagine scenarios like overlap with capture/merge actions, automatic updates of other applications, Photos looking for new faces to recognize, load peaks from some of my own projects etc., which, when happening simultaneously, can accumulate to ludicrous levels of RAM use. With a user in front of the screen, macOS will ask for an application to be closed and/or space on the system volume to be freed. While I'm asleep, though, processes will be shot. If there haven't been any other hints at a possible memory leak recently, you really shouldn't start dissecting your product over a rumor.

Command line tools: ever since QRecall 2.0 I've been using the GUI only for browsing and restoring. I need the CLI to manage exclusions anyway, and I love not having to leave the shell (which might seem counter to the Mac experience, but I'm actually a unix dev abusing Macs as tools, so…).

Timeout an action: thanks for this suggestion, this seems very much worth trying. I do see the »Stop after« condition in the UI action editor; is there an equivalent for the command line? If not, don't bother writing one just for me, I can configure an action once graphically and then, instead of composing the whole operation on the command line, run that action in the shell by name.

Customers: what customers? You've been refusing to take money for upgrades for as long as I've known you, and that's about a full decade now. During that time I've received more, quicker and better support than with some paid maintenance plans I've been subscribed to. I'm not sure whether you're aware, but that's hardly a proper vendor-customer relationship. It might even be slightly irresponsible: your business has to stay afloat even after everyone already has a QRecall license. So when 3.0 goes live I'm just going to buy an extra license again like I did for 2.0.

Stay safe, Norbert

post scriptum: speaking of staying afloat while completely reeling off topic: Dawn to Dusk isn't just you, is it? There are other people who know the code base and understand the inner workings of QRecall, right? I've been a company's bus factor of 1.0 for too long not to be scared by the thought that there might be only one person in the world holding mission-critical knowledge.
|
|
|
Hi there James, hope to see you in good health.

I've recently tried to run a compact operation for my main OS archive. It ran well into the night until macOS decided to shoot it down because the system volume had filled up with swap (that would have been 16~17 G of swap, plus all the physical memory not given to other applications, so likely another ~20 G). After it had been killed during compacting, I thought it would surely be best to have a look at it before trying to add more data to it, so I asked the QRecall GUI to try and open it. Sure enough, it told me the archive was damaged and needed repairs. I agreed, and worriedly watched the system volume fill up again. Moved some apps to external storage this time, cleaned out the Downloads folder etc., so now I have not 17 but 30 gigs of free space again. Was relieved to see swap usage stagnate at some 20 G. In iStat Menus the QRecallHelper was shown using almost 200 G of RAM (can't be true, but still, wow) when, after some three days, I decided to ask the attempt to stop because the system volume was again dangerously low on space.

I then decided to give this the best shot I can. Copied the archive from its storage box (which runs on SSD-cached spinning rust) to a spare SSD for more IO/s, created a named pipe so logging would not eat I/O or storage space, and used the command line to start a repair action like

qrecall repair /path/to/broken.quanta --recover /path/to/recovered.quanta --logfile /path/to/named_pipe --monitor --options noautorepair

Then I watched. The named pipe I constantly drain using a »tail -f«, so there's no memory leak there. What I found watching the log scroll by was that at first it looked fairly normal, but after a few minutes, at around 42% during »Examining archive data«, it started writing a »Details at 'position' [numbers]« line for every eight bytes (!) of 'position'. At this pace, writing a ~80 B log line for every 8 B of repository data, I can understand very well how that might fail in my storage situation. repository.data is about 180 G in size, and we're just shy of 80 G now, so if it doesn't pick up pace again at all I'm looking at several weeks until this is done. Also, there were lines reading »#debug# 2712 consecutive envelope exceptions«, and many blocks like
2020-04-06 12:29:11.956 +0200 #debug# -[DiagnosticCommand packagePump] caught exception at scan position 79413639240 [2.73752138.46183.30357]
2020-04-06 12:29:11.956 +0200 #debug# invalid envelope type [2.73752138.46183.30357.1]
2020-04-06 12:29:11.956 +0200 Subrosa .Exception: Data [2.73752138.46183.30357.1.1]
2020-04-06 12:29:11.956 +0200 #debug# Class: BufferBlasterSource [2.73752138.46183.30357.1.2]
2020-04-06 12:29:11.956 +0200 Details Type: 0x07000000 [2.73752138.46183.30357.1.3]
2020-04-06 12:29:11.956 +0200 Details Position: 79413639240 [2.73752138.46183.30357.1.4]
2020-04-06 12:29:11.956 +0200 Subrosa .DataBad: 1 [2.73752138.46183.30357.1.5]
2020-04-06 12:29:11.956 +0200 Details File: repository.data [2.73752138.46183.30357.1.6]
2020-04-06 12:29:11.956 +0200 Details Length: 132864 [2.73752138.46183.30357.1.7]
2020-04-06 12:29:11.956 +0200 #debug# backtrace 28001dafb9bec [2.73752138.46183.30357.1.8]
, and just now there are lines reading
2020-04-06 12:41:21.550 +0200 Details [0]: { position=14920574123313796, length=4, <IO> re-read past eof { Position=14920574123245568, .EOF=1, Path=/path/to/broken.quanta/repository.data, Length=131072 } } [2.73752138.46183.801.4472.2.
which leave me a bit worried. Why would something point to a position I could never afford to buy enough storage for? The log file currently grows at about a megabyte per minute. Given that there's 80 B of log for 8 B of repository data, we're currently examining archive data at a pace of about a megabyte per ten minutes. If things stay like this, we're looking at a million minutes, translating to almost 700 days. For the examining alone, not the actual rebuilding which comes after.

Questions:
• Is this operation likely to finish at all, or does the archive seem to be absolutely broken? Why/why not?
• Is logging expensive? If so, can I give a command line argument asking for those many, many »Details at« lines not to be logged at all?
• Is it normal that those logged Details are only eight bytes in size, and that there are this many of them?
• I found an »available memory« option in the Advanced preferences and set it to 8192. Will this help future runs? (Right now, QRecallHelper seems to keep its resident memory below 8 G.)
• Is it possible to run a Compact operation in an extra safe way, like when it is running on a network share that might disappear anytime?

Please do stay at home and healthy, we can ill afford to lose you
Olfan
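P.S.: For completeness, the named-pipe logging setup mentioned above boils down to nothing more than this (the paths are placeholders; the qrecall switches are exactly the ones from the command quoted earlier):

mkfifo /path/to/named_pipe
tail -f /path/to/named_pipe &   # keep draining the pipe so the repair never blocks on logging
qrecall repair /path/to/broken.quanta --recover /path/to/recovered.quanta \
    --logfile /path/to/named_pipe --monitor --options noautorepair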
|
|
|
A little follow-up: I've noticed while checking the settings dialogs that for all of my archives data redundancy was disabled. I'm very sure that I've always had this set to 1:16, and I take that confidence from the fact that upon creation of a new archive 1:8 is recommended as the default and I always manually lowered it to 1:16 instead. Is it possible that the same upgrade that trashed the exclude filters in the settings.plists also clobbered the correction code level? I've re-set the redundancy option everywhere and had QRecall re-create the data, then retried going through the motions. No effect. The two "cursed" archives are still being flagged as broken every time they're verified if a compact action had been performed after the last repair, even if that compact action didn't actually do anything. Don't know if that piece of info is worth anything, but I didn't want to risk having found some clue and not sharing it.
|
|
|
Here I am again, sorry for having taken my time. I've removed any ExcludeFilter settings from all my archives and switched to using 'qrecall captureprefs exclude [?]' instead of setting exclusions inside the archives for all exclusions on my own volumes. It kind of worked, yet still didn't. Anyhow, this looks increasingly strange.

After deleting the exclusion data I could access the archive again, which is a good thing. I was able to perform a couple of captures, a rolling merge, another couple of captures, and after the next compact action the archive wouldn't verify anymore. It was easy (if lengthy) to repair, but ever since, the state of things has been the same: captures and merges work fine, compact actions will reliably cause the next verify to fail and mark the archive as broken, even though without running a verify action I can keep capturing, merging and compacting without any trouble. To phrase the same thing from a different angle: after a repair the archive will verify fine until the first compact action is run. After that, it can be used without issue as long as no verify action is run, as that will flag the archive as broken. I'm not sure if compact actually breaks something that nobody but verify cares about, if verify is over-zealous about something it shouldn't concern itself with, or if each of the two is doing right but they just don't have the same understanding of what an archive should look like.

I have tried moving my volume history to a new archive using a repair action, but the "curse" was copied over as well. Here's what I do, using a short-term dev repository as the example:
qrecall repair work.medium.quanta --recover work.repaired.quanta -o noautorepair -o lostfiles --monitor
mv work.repaired.quanta/ work.medium.quanta
vi work.medium.quanta/settings.plist   # remove any ExcludeFilter entries possibly present
qrecall capture work.medium.quanta/ --deep --monitor /Volumes/work/
qrecall verify work.medium.quanta      # will be fine
qrecall merge work.medium.quanta --monitor --layers keep=30d
qrecall verify work.medium.quanta      # will be fine
qrecall compact work.medium.quanta --monitor
qrecall verify work.medium.quanta --monitor   # this is going to fail
* verify process failed: unhashed data package missing from index
I can live with remembering to always run a verify-repair loop after every compact action, I do them only once every three months anyway. Even this I could speed up by just "repair --recover"ing, deleting the old and renaming the new archive so it reads through the whole thing only once; see the sketch below. What bugs me the most is that this only affects two of my archives. All the others behave just fine. I can't find the difference between these two and the rest for the life of me. Am I to leave these archives dormant and start new ones with the same settings, splitting volume history over two repositories? Or should I turn to using repair actions instead of verify actions? Any ideas?
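For the record, the compact-then-verify-then-repair cycle described above is essentially this (same placeholder archive name as in the listing; it assumes qrecall exits non-zero when a verify fails, otherwise the failure would have to be spotted in the output instead):

qrecall compact work.medium.quanta --monitor
if ! qrecall verify work.medium.quanta --monitor; then
    # rebuild via repair --recover, then swap the recovered archive into place
    qrecall repair work.medium.quanta --recover work.repaired.quanta -o noautorepair -o lostfiles --monitor
    rm -rf work.medium.quanta
    mv work.repaired.quanta work.medium.quanta
fi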
|
|
|
Hi James, we've had a similar headline back in late 2012 during a merge operation done by Robby Pählig (http://forums.qrecall.com/posts/list/429.page#2023). Now I'm encountering this message while trying to run a capture action. I'm unsure if the two problems are related but wanted to point it out in case it might help your analysis. It makes no difference whether I start the action through the scheduler or on the command line, or type a direct command mimicking the capture action's work. --deep and/or --nodedup don't make a difference either. The target archive was successfully verified last weekend; it has never been merged or compacted because it's still trying to finish for the first time. (The capture action is set to stop after ten hours so it's done before the NFS lease expires.) It will only say "Starting", "Opening archive", "Stopping" and abort saying "* capture process failed: -[NSConcreteMutableData filter]: unrecognized selector sent to instance 0x?" I've sent a report, hoping I've put enough info there and here so you can link the two.
|
|
|
Quick addition: I'm restoring without writing a logfile now because it kept refusing, saying "? cannot bind socket to path". I created a directory on the backup drive and touched a logfile to write to, all to no avail. Restoring without logging seems to work (still underway); I really would have liked a log file, though. Did I miss a switch? Could I have set permissions "better"? My full command line was like this:

/Volumes/Drobo02/QRecall.app/Contents/Resources/qrecall \
  restore /Volumes/Drobo02/osx.temp.quanta/ "imac02:OS X" \
  --tovolume /Volumes/macOS \
  --helper /Volumes/Drobo02/QRecall.app/Contents/Resources/QRecallHelper \
  --logfile /Volumes/Drobo02/log/qrecall.logfile --monitor
|
|
|
Hi James, thanks for your quick response. I'm still waiting for repairs to complete (the repair guys are in turn waiting for the spare drive; installation itself would take just about 20 min if only they had the part), so I've had ample time to prepare. The archive is copied, the copy verified, the delete operation is underway and done soon, and over the weekend I'll even manage a merge --layers=all and a compact action (all on the command line, thank you so much for that utility). The backup unit has to work over USB 2.0 as my old MBP doesn't yet have a Thunderbolt or USB 3.0 port, which adds a little viscosity to everything.

In my code quote I have indeed left out a lot of switches (-l …, -m, --helper …, -k …, -p …) in order to show just my basic intention. I very much live on the command line, so I really can't see there being any work at all, let alone "a lot of work". On the contrary, it all seems a lot easier for me if I can type rather than click. You cannot split up a fusion drive using Disk Utility at all, so a shell is needed in any case, and if there's an open shell already, why bother firing up a GUI tool? The help section "Restoring a volume using qrecall and a recovery partition" and the complete $(man qrecall) output lie ready here on real paper. I may be crazy, but I'm no fool.

I plan on altering my partition layout and data storage habits in such a way that this problem will never arise again, so I won't have a use for such a feature unless it already exists now. If there's nobody except myself ever asking for this, please don't bother building a feature for an empty list of active users. If it feels "more complete" that way, don't let me keep you from working on it anyway, but be in no hurry about it, I'm all good here. Thanks again, Norbert

post scriptum: Thank you for a tool that's actually helping its users get stuff done and not its creators get rich. But do make a paid upgrade sometime. I'd so hate it if you had to abandon this fine piece of software because it doesn't pay for itself. I know you can't stop me from buying extra licenses I don't need, but you really should look out for yourself more. Just $5 for registered users, even per user if per licence seems too much for you.
|
|
|
Hi fellow forum folks,

After getting back my Mac with a replaced hard drive (you can't open new iMacs yourself anymore) I'm planning on ripping the fusion drive apart and using it as SSD + HDD instead, so next time one of them crashes I can still use the other and avoid maintenance entirely, or at least postpone it to a more convenient date. My QRecall archive is a full system snapshot. One folder in my user home has all the media files, photos, documents etc. inside a collection of sparse disk bundles which clearly don't need to reside on the SSD and also won't fit there. I'm aware that after booting into internet recovery using ⌘⌥R I can use a shell to dissolve the fusion drive, format my SSD as HFS+, mount my backup box and use QRecall's command line utility conveniently waiting there with

qrecall restore osx.longterm.quanta ':OSX' --tovolume /Volumes/SSD

to pull everything from the archive to the newly created volume without having to install a temporary OS first. But how would I go about restoring everything except that one big folder? Is there a hidden switch or option to achieve this? The man page doesn't say so, but I hope that's just because it tries not to drown people in text. If QRecall by itself doesn't offer such an option I could still make a copy of the archive, delete the big folder from it and restore the result. That seems an awful lot of I/O for a fairly intuitive transaction request, though. Thanks in advance, Norbert
|
|
|
James Bucanek wrote: …a command-line front-end to the existing scheduler and helper architecture. This will allow much more flexibility in what QRecall actions you can perform, won't introduce new security vectors, and opens the possibility of scheduler related commands (i.e. qrecall schedule --holdall 20min).
May I follow up on this by asking if there has been any progress already? I'm in the process of setting up a second Mac with QRecall (shared archives, finally) and that topic popped back into my mind. Best regards, Norbert
|
|
|
Why does it have to be like that anyway? I mean, why is NAS over WiFi (even 800 MBit/s) so excruciatingly slow in comparison with that very same NAS over CAT-5, FW-800 or USB 3.0? Can we do anything to improve QRecall performance when backing up via WiFi? Maybe downloading some index data just once and using it locally? I would gladly spend a few GiB on index caches if it meant I'd be able to make my backup runs quicker.
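To at least tell raw bandwidth from per-request latency as the bottleneck, a very crude comparison can be made with nothing but dd against a file on the NAS mount (the mount point and sizes here are made-up placeholders, nothing QRecall-specific):

# rough sequential write throughput to the share
dd if=/dev/zero of=/Volumes/NAS/ddtest bs=1m count=512
# rough small-block read pace, closer to index-style access
# (remount the share first if local caching should be ruled out)
dd if=/Volumes/NAS/ddtest of=/dev/null bs=4k count=65536
rm /Volumes/NAS/ddtest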
|
|
|
James wrote:
Norbert wrote: Prioritize Capture Actions …{explain}
{upcoming feature to limit actions to one instance per type} … I would suggest you experiment with fewer concurrent actions …{explain}
A toast to the mentioned feature! As my backup chains usually stack merge/compact/verify actions, I'm expecting a great deal of effect from this. My concurrency issue is with memory consumption, not disk performance at all. If I had a more modern Mac supporting 16 GiB of RAM instead of only 8 I wouldn't have to impose a limit at all. Restricting QRecall to even fewer concurrent actions would make individual actions wait even longer. If I could reserve one action for captures only and two more for anything else, my snapshots could be written more regularly. It's not that anything is broken now, it's just my personal taste not being happy with 20-minute snapshots not being taken exactly every 20 minutes. What really hurts is that once a week, when the archive volume connects and the compact and verify actions all tend to trigger at once, no new snapshots are taken at all for hours on end.
James wrote:
Norbert wrote:• Arbitrary timeouts for event triggered actions • Non-Capture Actions should be able to be triggered by Capture Actions
…in the works… …on the to-do list…
Make this a paid upgrade, it will totally be worth it (to me, at least).
James wrote:
Norbert wrote: Multiple event triggers should be supported …{explain}
To some degree, they already are.{explain}
I totally missed that. Thank you for pointing it out, I've already changed my actions to use this.
James wrote:
Norbert wrote: …to have one Capture Action use several backup targets…
…a "waterfall backup" or "cascading archive"…
That pretty much nails it. Have a local archive with very frequent snapshots to undo editing errors or accidental deletes/overwrites of recent work. Have a much larger archive on an external drive with daily/weekly snapshots to recover from disk disaster or harmful OS upgrades or similar. Have a monthly snapshot on tape in the vault, just in case the office gets robbed or burns down, so you lose only a month's worth of progress. I can see how such a topic can need a whole lot of design work, so do take your time on this. Thank you for your quick and helpful feedback, I wish I still knew anyone who doesn't already use QRecall so I could recommend it. ;) Best regards, Norbert
|
|
|
Hi James, I'm using QRecall 1.2.2.5 with much greater pleasure and success than I ever had with Time Machine, so let me start with a capitalized Thank You. ;) Forgive me the length of this post, I tried to keep it short but failed.

For background: I've set up my system to hold iPhoto and iTunes libraries inside .sparsebundle images; personal data, virtual machines, client files and their test databases reside in separate encrypted .sparsebundle images, most of which are mounted at login, some only when needed. Hence I have created a plethora of Capture etc. actions to keep everything safely backed up: OS X and the contents of the personal and client files images go to a secondary internal hard drive at short intervals. Everything goes onto an external backup drive once a week. Client files additionally go to a corporate network location once a day. Summarizing, I'm currently using QRecall to back up nine data sources into a total of 21 archives; every new client would mean two more data sources and six more backup targets.

Juggling all these backup chains leaves me with a few ideas for optimizing the scheduling of actions. I've read a few hints on the upcoming QRecall 1.3 in other posts and I'm looking forward to seeing some of the features in action, so I left out any points that I already found addressed. I've ordered the points by my personal "pain level" from high (top) to low (bottom).

• Prioritize Capture Actions: As the QRecallHelpers can be quite memory-hungry, I had to limit QRecall to three concurrent actions. At times this can lead to a long queue of waiting actions, especially in the mornings after connecting to the backup drive and network location. I would greatly appreciate being able to tell QRecall to reserve one action slot exclusively for Capture Actions. A Compact or Verify Action can easily run for an hour, and there may be a handful of them waiting. Actual data snapshot Captures shouldn't need to wait for the backup application's internal data management.

• Arbitrary timeouts for event triggered actions would be great: The external backup drive is always connected while my machine is in the office. Unfortunately, OS X recognizes it very early during wakeup and then deals with external displays and all the USB gimmicks before letting me log in. By the time I decide that I need a full reboot today, the first Capture etc. actions have already started, and more are queued. Being able to use not just 1-5 min but maybe 10-20 min timeouts, things would be much more relaxed.

• Non-Capture Actions should be able to be triggered by Capture Actions: Being able to schedule time-consuming actions during nighttime is great for stationary machines. For mobile workstations which only have access to their backup media during office hours it's not quite as helpful, especially with flexible working hours. For example, I don't want a Verify run every two days or every Thursday at 15:00, I want one after every sixth Capture run. I want a Compact run only after every three Merge runs that actually did merge something.

• Multiple event triggers should be supported: When I need to mount a .sparsebundle image as well as connect an external backup drive for a Capture Action to work, it would be nice to be able to configure the action accordingly. Right now the action just fails and is re-scheduled, which doesn't do any harm, but it would be "more right" to watch both trigger events.

• This isn't actually a scheduling feature, but just let me add it here: It would be pretty cool to have one Capture Action use several backup targets instead of configuring several Capture Actions for the same data source. This way the same data would need to be read only once, and QRecall could more intelligently keep knowledge about which files to capture. An hourly local snapshot, a daily one to a secondary disk and a weekly one to a remote location might just copy the merged result of the hourly/daily Captures instead of traversing the whole disk again.

I do hope I came across as friendly and polite yet understandable. If I didn't, don't hold it against me but rather ask for clarification. Bear in mind that English is not my mother tongue and I'm far away from having mastered all of its subtleties. ;) Best of regards from sunny-yet-cold Germany, Norbert