Message |
|
Paul, Things are actually going pretty well, just not on schedule. The XPC and memory management changes are all done and appear to be working well. A bunch of fixes and some small new features have been added. The work now is entrenched/mired/submerged in the UI changes, which continue to progress nicely but the scope of which has grown to be much larger than I had originally planned. It's like pulling the proverbial thread: redesigning one feature of the UI opens the possibility of redesigning two others, and so on, until I've found myself reimplementing most of the interface. I think you'll like the changes; it's just taking up a lot of time and energy. Once the UI is settled, I'll probably launch a beta to ferret out any problems with the aforementioned changes while simultaneously adding some of the new remote archive features.
|
 |
|
Ralph, You didn't say which archive you were having problems with, but I'm going to assume (from your description) that it was "5th Backup". Here's what I see in the logs: A whole bunch of successful captures on Feb 28, followed by a merge of a dozen layers. On March 2, a capture was started at 11:07. At 11:57 a stop (cancel) request was made by the scheduler. (I assume this was a leftover stop condition from setting up the incremental capture.) The capture successfully stopped, wrapped up, and committed the items that had been captured up to that point:
2017-03-02 11:07:41.990 -0800 Capture to 5th backup.quanta
2017-03-02 11:57:41.809 -0800 Received cancel request
2017-03-02 11:57:41.890 -0800 Capture stopped by scheduler
2017-03-02 11:57:47.540 -0800 Captured 12154 items, 2.77 GB (53% duplicate)
2017-03-02 11:58:51.121 -0800 Capture incomplete (50:51) Now if this is the "accidentally interrupted" backup you were referring to, then you should note that this was not an unsuccessful capture—it just wasn't allowed to finish. From QRecall's perspective, the capture was successful; all captured files were successfully added to the archive, and the archive was properly closed. The archive is then opened again and someone deletes the folder "Dont backup":
2017-03-02 12:04:33.030 -0800 Delete items in 5th backup.quanta
2017-03-02 12:04:38.465 -0800 Deleting Dont backup
2017-03-02 12:04:38.465 -0800 Ralph Strauch(#XXXX):iMacHD:Users:Shared:Dont backup You then started a verify of the archive, but stopped it before it could finish.
2017-03-02 12:06:11.227 -0800 Verify 5th backup.quanta
2017-03-02 12:10:40.199 -0800 Action canceled
2017-03-02 12:10:40.389 -0800 No problems were found up to the point that the verify was stopped
2017-03-02 12:10:40.389 -0800 Verify incomplete (04:23) Now here's where things get a little crazy. The archive is opened again and someone deletes the owner "Ralph Strauch"
2017-03-02 12:11:33.487 -0800 Delete items in 5th backup.quanta
2017-03-02 12:11:37.803 -0800 Deleting Ralph Strauch Since "Ralph Strauch" is the owner, every volume, folder, and file ever captured by that owner (identity key) was deleted from the archive. If "Ralph Strauch" is the only owner that's been captured to this archive, then everything in the archive would have been deleted. On the other hand, if other owners (identity keys) have captured items to this archive, then those items should still be in the archive.
but when I open the archive there's nothing it -- the window is completely blank
If "Ralph Strauch" was the only owner, then the archive is now, indeed, empty. A blank browser window is exactly what you'd expect to see. If you expect other owners to be there, first try to use the navigation bar at the bottom of the browser window to navigate to the top level of the archive, where you will see the other owners. If the navigation bar is malfunctioning for some reason, you can reset it: (1) close the archive. (2) in the Finder, select the archive and "Show Package Contents" by right-clicking on the archive. (3) inside the archive, discard the view.plist file. (4) reopen the archive in QRecall. That should reset any assumptions the browser window was making about where the view should be in the archive hierarchy.
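For the command-line inclined, steps (1)-(4) boil down to deleting one file inside the archive package. A minimal sketch, using a mock archive package so it runs anywhere; substitute the real path to your .quanta archive (e.g. '/Volumes/volume2/5th backup.quanta'):

```shell
# Sketch of the view.plist reset. A mock archive package stands in for
# the real one; replace ARCHIVE with your actual archive's path.
ARCHIVE="$(mktemp -d)/5th backup.quanta"   # stand-in for the real archive
mkdir -p "$ARCHIVE"
touch "$ARCHIVE/view.plist"                # the cached browser view state
# Step (3): discard the view.plist file inside the package.
rm -f "$ARCHIVE/view.plist"
```

With view.plist gone, the next time QRecall opens the archive it rebuilds the browser view from scratch.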
while Finder shows the file size as 798gb
When you delete items from an archive, using either the delete or merge commands, the records of those items are erased but the space that data occupied remains. Use the compact action to identify, and reclaim, these empty records. Finally, the archive was repaired several times. I assume this was an attempt to recover the missing items, but I suspect they had already been deleted (see above) and no amount of repairing can recover them as QRecall always securely erases file and directory records.
|
 |
|
Ralph,
Sorry to hear you're having drive problems. A hard drive spontaneously, and inappropriately, unmounting can usually be traced to one of five problems:
Software (operating system)
Bus (computer/interface)
Bus (cable)
Hard Drive Controller
Power
Sometimes, very rarely, this can be an operating system or filesystem plugin issue. For example, OS X 10.8, 10.9, and 10.10 all had various problems with external drives using FireWire; the system would occasionally eject volumes when the computer went to sleep. But again, these issues are atypical.
Bus problems are the most common. This can usually be traced to a flaky (USB, FireWire, Thunderbolt, eSATA, Ethernet) controller, the cable, or the physical connection. It might be the controller chip itself or the computer's motherboard.
Drive controller problems are also a possibility, although less common.
Finally, power can be the cause. Most external drives use simple transformers (unlike the switched power supply inside your computer) and are much more sensitive to drops in voltage. The likelihood that this is the problem would depend on how "clean" the power in your neighborhood is.
The typical technique for diagnosing these problems is "divide and conquer." In other words, you try to take each of these possible causes and eliminate or substitute them to identify the weak link.
For example, if your drives are connected via Thunderbolt, try switching to USB. Many external drives these days support multiple interfaces, so switching interface can help identify a bus problem. If you find the drive does not unmount when connected via USB, then you can go back to Thunderbolt and try a different cable.
If it's a motherboard issue, or if you don't have an alternative bus to try, the next step is to connect the drive to a different computer system. For example, you might set aside a weekend and relocate the drive from the desktop to the laptop and perform the captures in the other direction, intensively, for a day or overnight.
(If your external drive is unmounting for both local processes and shared computers via Ethernet, then the problem is probably local. Logically, a drive that unmounts locally will be unmounted for the remote user too.)
If it's a physical hard drive or HD controller issue, then swapping it out for another drive is the simplest test, which it sounds like you've already done.
Finally, there's power, which can be difficult to diagnose without specialized equipment. If you have a UPS you can borrow, you might try that.
Additional notes:
QRecall does not control volume mounting or unmounting. The most any application (like QRecall) can do is to make mount and unmount requests to the operating system. It's up to the operating system to mount and unmount the volume, and it should never unmount a volume that still has open files. So if a volume that's currently in use is getting unmounted, there is a (physical) problem somewhere.
QRecall does write temporary files during capture and those files won't get deleted until the next action or repair. So you may see an archive that has 90GB of data in it, but open it to find almost nothing. The next action (including verify or even just opening the archive) will invoke an auto-repair that will delete/overwrite these temporary files and recover that disk space. (You can open the archive package in the Finder and watch this happen.)
But just to be clear, the space used by QRecall during a capture would never be the cause of a volume being unmounted. It would result in a "disk full" error, and the capture stopping gracefully.
As a workaround, you might consider setting up an incremental capture: create a capture action that runs every hour, but stops after 50 minutes. Every hour the capture will start, picking up from where it left off last time, and capture for 50 minutes. If it hasn't finished, it will then wrap up and commit whatever has been successfully captured so far. The next capture starts where the last one left off, but if it is interrupted only the latest capture will be lost when the archive is auto-repaired. You'll probably end up with a lot of unfinished layers (which can later be merged), but at least you'll have successfully captured your files.
|
 |
|
Ralph Strauch wrote:My iMac has started to refuse to do captures, giving me a "Capture requires an identity key" error message.
You are most likely encountering a bug in macOS that occasionally prevents one process (the capture) from accessing the preferences of another app (the QRecall client app). Usually a system restart will fix this.
I've tried re-entering my key but when I do I get a warning that I'm entering a new key and that will create a new owner for all subsequent captures.
You're probably entering a different identity key without realizing it. Here's how to find out. First, get the serial number of the key you have installed now. Open QRecall > Preferences... and go to the Identity Key pane. While holding down the Option key, click the Enter Key... button. When the key entry dialog appears, the "key" field will show the serial number of the currently installed key. Make a note of it before canceling the dialog. Now log into your www.qaccount.com account using this Magic Account URL. This link will cause your account page to display the serial number of each purchased identity key. Compare those numbers with the serial number found in the Identity Key pane.
|
 |
|
Following up: all of your repairs that I've looked at fail because the volume containing the archive spontaneously unmounts during the repair process. Here's an example of a repair that started at 09:49:
2017-01-21 09:49:40.052 -0800 Repair 5th backup.quanta
2017-01-21 09:49:40.052 -0800 archive: /Volumes/volume2/5th backup.quanta Four hours later the volumes mysteriously, and spontaneously, unmounted. (Note that this doesn't appear to be associated with a sleep event.)
2017-01-21 13:34:42.401 -0800 unmounted volume /Volumes/volume2
2017-01-21 13:34:42.402 -0800 unmounted volume /Volumes/volume1 Not surprisingly, the repair starts encountering problems, about a million of them:
2017-01-21 13:34:41.999 -0800 problem getting volume statfs
2017-01-21 13:34:41.999 -0800 IO exception
2017-01-21 13:34:41.999 -0800 POSIXErr: 2
2017-01-21 13:34:41.999 -0800 Path: /Volumes/volume2/5th backup.quanta/repository.data
2017-01-21 13:34:41.999 -0800 ErrDescription: No such file or directory This is most often caused by some intermittent hardware problem (failing drive controller, brief power interruption, flaky USB connection, and so on). QRecall can actually be helpful in diagnosing this. Because the QRecall scheduler watches for volume mount and unmount events, it logs those when they occur. Filter out everything except the scheduler messages in the log, and then look for unmount events that shouldn't be happening. If you're a command-line geek, this will also do the trick:
fgrep ' [3.' ~/Library/Logs/QRecall/QRecall.log* | fgrep mount
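Building on that one-liner, here's a small variant that tallies unmount events by day; the here-doc sample mirrors the log excerpt above (date first, then the message, which is inferred from the excerpts in this thread). To run it on real logs, pipe `fgrep -h 'unmounted volume' ~/Library/Logs/QRecall/QRecall.log*` into the same function.

```shell
# Tally "unmounted volume" events per day. The sample lines stand in for
# real log output; on a live system, feed the function with:
#   fgrep -h 'unmounted volume' ~/Library/Logs/QRecall/QRecall.log* | tally
tally() { fgrep 'unmounted volume' | awk '{print $1}' | sort | uniq -c; }
result=$(tally <<'EOF'
2017-01-21 13:34:42.401 -0800 unmounted volume /Volumes/volume2
2017-01-21 13:34:42.402 -0800 unmounted volume /Volumes/volume1
2017-01-22 08:01:02.000 -0800 unmounted volume /Volumes/volume2
EOF
)
echo "$result"
```

A day with a suspiciously high count is a good place to start correlating against sleep events, power outages, or heavy bus activity.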
|
 |
|
Ralph, I think it's the drive, but I need some more information to connect the dots. The log included in your reports contains thousands of errors like this one:
2017-01-26 04:34:30.152 -0800 IO exception
2017-01-26 04:34:30.152 -0800 POSIXErr: 2
2017-01-26 04:34:30.152 -0800 Path: /Volumes/BUD2/2nd backup.quanta/repository.data
2017-01-26 04:34:30.152 -0800 ErrDescription: No such file or directory This would indicate that the volume spontaneously unmounted while the repair was in progress. The repair is designed to tolerate I/O errors, and will keep plowing away, trying to read whatever data it can, so this just goes on, and on, and on. The built-in Send Report function only includes the latest log records, so to confirm my suspicions I'd need to see more of the log. If you can, compress the files in ~/Library/Logs/QRecall and send them to me. They're likely to be very large, so I'll email you a Dropbox upload request separately.
|
 |
|
Norbert Karls wrote:Is it possible that the same upgrade that thrashed the exclude filters in the settings.plist's also clobbered the correction code level?
Not likely. While it's true that the redundancy preference is stored in a property list file, QRecall refers to the actual redundancy companion files when determining if redundancy is available and how much. So once the redundant data files are created, it doesn't really matter what the settings file says they should be. Conversely, if the data redundancy files are missing or malformed somehow, QRecall may determine that redundant data isn't available and will run with that assumption, even if the settings disagree.
|
 |
|
Norbert Karls wrote:
qrecall verify work.medium.quanta --monitor # this is going to fail
|- -|
* verify process failed: unhashed data package missing from index
OK, there's probably a bug in the compact action. I'm not sure what, but I need to start with the details, so please send a diagnostic report (in the QRecall app choose Help > Send Report). I might need more info later, but I need that to get started.
From what you've posted, I can at least tell you what's going wrong, even though I don't know why. The "unhashed data package missing from index" error means that you have a data quantum that has been captured but not yet de-duplicated (using the "defer de-duplication" capture option), yet for some reason that data record is not included in the un-hashed quanta index table. That's very weird, because the repair should fix that and any subsequent compact should then de-dup the quanta in that table. But clearly something is going wrong here, so the diagnostic will help me start an investigation. This probably also explains why it only happens on some archives: only some of your capture actions defer de-duplication.
For the record, this is a fairly benign problem[1]. It just means that the next compact action won't know to de-duplicate some of the captured data, so your archive might contain duplicate quanta. The only side effect is that your archive may be larger than it needs to be. The capture and merge actions still work because they don't touch un-de-duplicated data (that's been set aside for the compact to deal with). As a rule, specific actions verify only the data records and indexes they need to accomplish their task. The verify action verifies everything. That's why capture and merge still work, even when verify fails.
[1] QRecall has an extremely low tolerance for any kind of inconsistency. It might be a curse when dealing with seemingly trivial issues like this, but I prefer the "better correct than sorry" philosophy.
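To illustrate the principle at work (not QRecall's internal quanta format): de-duplication hinges on identical blocks of data hashing to the same digest, so an archive only needs to store one copy and index it by digest; deferring de-duplication just postpones that hashing-and-indexing step until compact runs. A toy sketch, using `cksum` as a stand-in for a real content hash:

```shell
# Toy de-duplication demo: two files with identical contents produce
# identical checksums, so an index keyed by checksum stores the data once.
# (cksum stands in for a cryptographic content hash.)
work=$(mktemp -d)
printf 'the same block of data' > "$work/a"
printf 'the same block of data' > "$work/b"
sum_a=$(cksum < "$work/a")
sum_b=$(cksum < "$work/b")
[ "$sum_a" = "$sum_b" ] && echo "duplicate quanta: store once"
```

If a captured block never makes it into that index, nothing compares against it later, which is exactly why a missing index entry leaves duplicate data in place but breaks nothing else.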
|
 |
|
So the mystery is that you can open and capture files to testbackup.quanta, but you can't open and capture files to 3rd backup.quanta.
I don't see any differences in the ownership or permissions for the two archives, and I don't see any stale .lock or .share files that might be blocking access to it.
So, the only thing I can think of at this point is that the file server supports file locking and/or advisory locks and is holding an orphaned lock on one of the files in 3rd backup.quanta. This can happen when a network client obtains a lock on a file and then gets disconnected from the server.
This can usually be solved by restarting the server. If this is a network device that runs 24/7, it's easy for orphaned locks to stay around for weeks. (Note that in cases like this, restarting the clients won't have any effect on the problem.)
If that doesn't work, you can try repairing the 3rd backup archive (presuming you have enough free space), choosing the "Copy recovered content to new archive" option (say 4th backup archive). QRecall will extract all of the data in the original archive and use it to create a brand new one. When finished, you can discard the old archive.
|
 |
|
On deck for QRecall 2.1 is the ability to run a script (or any POSIX executable) before, and again after, an action runs. A script could, potentially, take steps to prepare either the archive (such as mounting a NAS device) or the items (like backing up a database server or shutting down a VM) for capture. Once the action finishes, a second script can perform any desired cleanup or post-processing (disconnect a NAS, resume a VM, ...). Your situation would require some tools (shell commands, AppleScript, etc.) that could control the running VMs.
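As a sketch of what such hook scripts might look like (the hook mechanism itself is the planned 2.1 feature; the NAS share, mount point, VM path, and VMware's `vmrun` CLI below are illustrative assumptions, not anything QRecall provides):

```shell
#!/bin/sh
# Hypothetical pre/post hook scripts for the planned QRecall 2.1 action
# hooks. The SMB share, mount point, and 'vmrun' invocation are examples;
# adapt them to your own setup.
pre_action() {
    # Prepare the archive volume: mount the NAS share if it isn't mounted.
    if [ ! -d /Volumes/Backups ]; then
        mkdir -p /Volumes/Backups
        mount -t smbfs //user@nas.local/Backups /Volumes/Backups
    fi
}
post_action() {
    # Post-processing: resume the VM that was suspended for the capture.
    vmrun start "$HOME/VMs/Dev.vmwarevm/Dev.vmx" nogui
}
```

The pre-script runs before the action touches the archive, and the post-script runs after it commits, so side effects like a mounted share or a suspended VM are cleaned up even on long captures.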
|
 |
|
Ralph, Thanks for the info, but it wasn't quite what I was looking for. I'm interested in seeing the insides of the archive packages. These commands should do the trick:
ls -lna@e '/Volumes/volume2/testbackup.quanta'
ls -lna@e '/Volumes/volume2/3rd backup.quanta' I'm interested in the ownership, permissions, and existence of the various .lock and .share semaphore files inside the archive package.
|
 |
|
Ralph, Looking at the logs, QRecall is still stuck trying to obtain (and later break) the shared file semaphore. I suspect a permissions problem, but it's hard to tell from the logs. I'd be very much interested in knowing the ownership and permissions of all of the files in both the archive that is stuck and the one that is working. If you have the time, open a Terminal window and issue this 'ls' command for each archive:
ls -lna@e /Volumes/Backup/PathToArchive.quanta email the results, or post them here.
|
 |
|
Alexandre Takacs wrote:My problem is that my system - and those VM - run pretty much 24/7. I might consider suspending them for capture but I'd really like to avoid actually shutting down (if at all possible).
The big question: does "suspending" a VM flush all of its data to disk? If it does, then you could take a manual approach: create a capture action for just your VM folder, occasionally suspend your VMs, start the capture, and then resume them when the capture is finished. (I have a much better solution for this kind of problem in the next major release of QRecall...)
|
 |
|
Alexandre, You are absolutely correct. Making a backup (with any software) of a virtual machine image while that VM is running is probably going to result in an incomplete/corrupt copy. This is a general problem with software that doesn't immediately write all of its data to disk, and it affects databases, video editing software, and so on.
QRecall 2.0 has a new set of action events specifically designed to address this. You can start a capture action when an application quits, and you can ignore or stop a capture action if an application is running. This encourages the capture to run only when the application that might be modifying those files is dormant.
If you run captures regularly and your VMware usage is infrequent, you might consider just skipping backups that occur when that app is running by adding a new condition to your capture action:
- Stop If Application [ VMware Fusion ] Is Open
If you run VMware regularly, and want to capture complete, and stable, copies of your VM images, run the capture using an Event Schedule:
- Event: Run when [ VMware Fusion Quits ]
My solution was to split off my (Parallels) VM images into their own archive:
- My primary archive, which captures my startup volume, excludes the Parallels folder from all captures (set up in Archive > Settings...)
- I have a second archive that just captures the Parallels folder.
- The capture action for the second archive runs 1 minute after Parallels Desktop quits.
- The action also stops capturing if the application starts again.
Using this scheme, QRecall captures my VM images immediately after I quit, and never while Parallels is running. I always get a "clean" copy of the VM images. If you want to make occasional copies during the day, just suspend your VMs and quit the app when you take a break (I do it before going to get coffee). By the time you get back, QRecall will have captured your VMs.
Another important note: QRecall depends on File System Change Events (a macOS service that tracks changes made to your filesystem) to quickly determine what items have changed and need to be considered for re-capturing. A few software applications, most notably VM apps, make changes in a way that eludes File System Change Events. This means that QRecall won't notice that certain files within your VM package have changed, possibly for weeks. The capture action has a new "Deep Scan" option that ignores File System Change Events and exhaustively examines every file for changes. As you can imagine, this is much slower. Another advantage of splitting off your VM captures into a separate archive and action is that you can perform a "Deep Scan" on your VM captures (which is a fairly shallow folder tree) and continue to use File System Change Events for your regular captures.
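If you'd rather drive this by hand (or from your own script) instead of waiting for the quit event, something like the following could suspend and resume the VM around a capture. `prlctl` is Parallels Desktop's command-line tool; the VM name and the capture step are illustrative assumptions:

```shell
#!/bin/sh
# Hypothetical wrapper: suspend a Parallels VM, capture, then resume.
# 'prlctl' is Parallels Desktop's CLI; "Dev VM" is an example VM name.
capture_vm() {
    prlctl suspend "Dev VM"
    # ...trigger or wait for the QRecall capture of the Parallels folder...
    prlctl resume "Dev VM"
}
```

Since a suspended VM has flushed its state to disk, the capture that runs in between sees a consistent image.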
|
 |
|
Ralph Strauch wrote:I think I've that sorted out now, and am back to one key per computer. I assume I can just delete the extra layer that I created unintentionally without affecting the rest of the backup
A more surgical approach would be to find the "new" volume in the other owner and simply delete that from the archive. That way, you don't have to delete any subsequent layers.
I had set the Current Encryption Limit to 2 after you suggested 1 or 2, but I'll take it down to 1 now.
Let me know how that goes, or just send another diagnostic report when you get back.
I seldom even log onto the iMac, but it looks like Qrecall could run scheduled backups of the whole iMac from my account without my being logged in. Is that correct, and should that satisfy the router's desire for a single common user? (I'm uid 501 on both machines.)
Here's the important concept: the account on your computer is independent of the account you use on the file server. This is the concept that is most confusing when working with networked volumes. It does NOT matter what your local user account is. The files written to your file server will belong to the (server) account you use to authenticate when you connect to the server.
If you and your wife can get set up so that both of your local accounts connect to the file server using the same server account, then the files (archives) on the shared volume will belong to both of you, and it doesn't matter what your local account is or what UID you're using.
The reason I'm short on practical advice is that different file servers, NAS devices, and so on handle this differently. For example, Apple's AirPort Time Capsule has (basically) two different authentication modes for its shared disk: shared and per-account. The "shared" mode allows all network users to access the files on the Time Capsule as if they were all the same user. This is the effect you need, and this is the mode I use with my Time Capsules. All QRecall users can connect to the Time Capsule and use the same archives, since (from the Time Capsule's perspective) they are all the same user. If I switched my Time Capsule to the per-account mode, I'd have the same problem you're experiencing.
Other devices handle accounts and ownership differently. Some server/NAS devices, for example, deliberately extend file ownership to the shared volume using your local account ID, effectively emulating an external drive. So YMMV and you'll need to find the magic combination that works for you.
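One quick way to see which account your files actually land under is to create a file on the shared volume from each Mac and look at its numeric owner. Here a temp directory stands in for the mount point so the sketch runs anywhere; on the real setup you'd point SHARE at something like /Volumes/Backup:

```shell
# Check which numeric UID owns files you create on a volume. SHARE is a
# temp-directory stand-in; point it at your mounted share (e.g.
# /Volumes/Backup) to test the server-side ownership.
SHARE=$(mktemp -d)
touch "$SHARE/uid-test"
owner=$(ls -ln "$SHARE/uid-test" | awk '{print $3}')
echo "files created here belong to UID $owner"
```

If both Macs report the same UID on the shared volume, you've achieved the "same user" effect; if the UIDs differ, the server is mapping each of you to a different account.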
|
 |
|