Message |
|
Everyone, I had hoped to have the new beta up a few days ago, but some last-minute bug fixes and UI changes took it down to the wire. The new version (1.2.0b37) is up. Choose QRecall > Check for Updates... if it doesn't prompt you automatically. Sorry for cutting it so close to the expiration date. P.S. For those using permanent identity keys, you can ignore the "Beta expired" message. Only a beta identity key expires with the application. Permanent keys continue to work with whatever version you are running.
|
 |
|
Neil, Here's the short lesson: merging does, indeed, throw stuff away. A merge keeps the new stuff and throws the old stuff away. But over time that's a good thing; otherwise your archive grows forever (or gets unmanageably complex). Ignore the plain merge action, that's for geeks.

A rolling merge action works on the idea that you want to keep recent fine-grained changes (the file you changed Tuesday morning vs. the version you saved Monday afternoon), but don't care about fine-grained deltas that are months old (two report documents saved a day apart, some six months ago). The rolling merge action lets you choose blocks of time relative to today. These are, in order, a number of days in which to keep only the last version of each item captured that day, then a number of weeks in which to keep only the last version of each item that week, and so on with a number of fortnights, months, and finally years.

Here's an example. If you choose to keep 7 day layers, 8 week layers, and 12 month layers, here's what happens when the rolling merge runs:
- All layers captured today are left alone. (This is the minimum "ignore" period described later.)
- The layers for the previous 7 days are organized into 7 groups. All of the layers within a single day are merged into one layer, keeping only the last version of each item captured that day.
- The layers for the previous 8 weeks are then organized into 8 groups. All of the layers within each week are merged into one layer, keeping only the last version of each item that week.
- Finally, the layers for the previous 12 months are organized into 12 groups ... and you get the idea.

You can choose whatever time periods you want, and make them as small or as big as you like. The action is very flexible. There's a special "ignore" time period that extends the period of time during which layers are not merged. So if you want to keep every hourly capture for the past two weeks, set the ignore setting to 14 days.
The rolling merge will then start grouping layers beginning 15 days ago. Your first merge will take some time, because you've got a lot of layers to merge. But after that, the rolling merges run pretty quickly, as there are usually only a few layers to merge on any given day. I typically schedule my rolling merge to run once a week, followed by a compact action. The compact action recovers the disk space freed by the merge.
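The grouping scheme described above can be sketched in code. This is purely my own illustration, not QRecall's actual implementation: the function and bucket names are made up, months are approximated as 30 days, and the week key ignores year boundaries.

```python
from datetime import date

def rolling_merge_plan(layer_dates, today, ignore_days=0,
                       day_periods=7, week_periods=8, month_periods=12):
    """Assign each layer's capture date to a merge bucket (a sketch).
    Layers that share a bucket would be merged into a single layer,
    keeping only the last version of each item."""
    day_limit = ignore_days + day_periods
    week_limit = day_limit + 7 * week_periods
    month_limit = week_limit + 30 * month_periods  # months approximated
    plan = {}
    for d in layer_dates:
        age = (today - d).days
        if age <= ignore_days:
            key = ("keep", d)                   # ignore period: left alone
        elif age <= day_limit:
            key = ("day", d)                    # one merged layer per day
        elif age <= week_limit:
            key = ("week", d.isocalendar()[1])  # one merged layer per week
        elif age <= month_limit:
            key = ("month", (d.year, d.month))  # one merged layer per month
        else:
            key = ("old", d.year)               # beyond the schedule
        plan.setdefault(key, []).append(d)
    return plan
```

With the defaults, today's layers land in a "keep" bucket, the previous week's layers get one bucket per day, and so on; raising `ignore_days` to 14 reproduces the "keep everything for two weeks" example, with grouping starting 15 days back.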
|
 |
|
Neil, Another random thought: I noticed from your screen shot that you have a lot of layers in your archive. And I mean a lot of layers. Even some of my biggest, oldest archives only approach 300 layers, and your archive has almost four times that. I did some early stress testing of QRecall with archives of up to 1,000 layers, but have largely used much shallower archives in performance and regression testing.

The number of layers in an archive adversely affects its performance, and the number of layers in your archive could be causing some serious performance degradation. For example, reading the file records for a particular folder requires a lot of short reads from the archive. For an external hard drive this is really fast, but for many network volumes (e.g. a Time Capsule), a lot of short reads over the network can be really slow. For a handful of files in a dozen or so layers, the added time might not be noticeable. But for 1,000 layers this could add up to huge delays.

So I have two questions. First, when you verify the archive, is the data transfer rate what you would expect? I ask because verify uses a special DMA mode that reads data as fast as the source can supply it, regardless of individual record size. So if verify is fast, but recalling and browsing are really slow, then it might be the number of layers that's the problem. Second, do you really need all of those layers? The idea behind the rolling merge is that you keep fine-grained changes for the past week or two, but then merge those into more compact and efficient deltas as time goes on: first into daily layers, then weekly layers, and finally keeping only one layer per month.
|
 |
|
Eric, Since you've seen the earlier post, I won't repeat all that. To answer your specific questions: when QRecall can't read an extended attribute, it simply notes that in the log and keeps going. QRecall considers this a minor problem, which is why it's a "caution" message; it's not even a warning. If there were something else about that item that couldn't be captured (its directory information or any of its data), QRecall would have logged a much stronger message. In your particular case, the system is returning BSD error 22 (invalid argument). This is a nonsensical error in this context, which often means that there's something wrong with the volume's directory structure. Try some of the remedies in the earlier post, such as repairing the volume or making a copy of the item and stripping it of its extended attributes.
|
 |
|
Neil, 338GB over 25 hours (yikes!) is about 3.8 MB/second. This is miserable throughput for a gigabit ethernet connection. It is, however, pretty close to the transfer speed you'd get using AirPort. Are you absolutely certain that you don't also have AirPort enabled on your MacBook Pro? With both an AirPort and an ethernet connection active, it's not that hard for the system to connect to the file server over AirPort. It will then stick with that interface until the volume is unmounted. If you're certain AirPort isn't the issue, then the problem could be with the Time Capsule itself. There's no reason a Time Capsule can't read or write a volume at nearly 10x the speed you're seeing.

Some other random thoughts: The fact that the QRecall monitor numbers are not changing doesn't (conclusively) mean that QRecall is completely stuck. It's possible that it's reading a very large empty region of the archive, which at your throughput speeds could take a while. Activity Monitor is the easiest way to see if the QRecallHelper process is still using CPU time and if there's steady activity over the network.

I would suggest (a) repairing the volume and then (b) repairing the QRecall archive. If the Time Capsule volume is internal, see Repairing your backup disk. If it's an external volume, disconnect the drive from the Time Capsule and connect it directly to the MacBook Pro. Repairing will be vastly faster and more reliable with a direct connection.

Finally, if the MacBook Pro is losing its connection with the Time Capsule and the volume is unmounting during a QRecall action, this is a "very bad thing" (from QRecall's perspective). It also makes me think this is an AirPort problem; wired ethernet connections are typically pretty solid. Disconnects will play havoc with data integrity, and could lead to corruption of the volume structure, which opens another can of worms.
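The numbers above can be sanity-checked with a little arithmetic (a rough sketch using 1 GB = 1024 MB; the posted figures are approximate):

```python
# Back-of-the-envelope throughput from the figures in the post.
gb_transferred = 338
hours = 25
mb_per_sec = gb_transferred * 1024 / (hours * 3600)
print(f"{mb_per_sec:.1f} MB/s")  # about 3.8 MB/s

# Gigabit ethernet can move roughly 125 MB/s raw (1000 Mb/s / 8),
# so this link is running at only a few percent of nominal capacity.
```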
|
 |
|
Hello Neil, Good timing. I just spent over an hour iChatting with another user about QRecall's UI and how it might be made more accessible.
The timelines impart a lot of information (which is, in itself, part of the problem). A timeline shows you, for a given item, how many different versions of that item exist in the archive and when they were captured. That's pretty important information if you want to know how many versions of a file you have and when they were captured.
All of those lines are meaningless, at least to the untrained eye. I see the idea behind what they're supposed to signify, but at this scale, with this many connections, they don't actually add anything meaningful to the experience - it's infojunk.
I have to disagree that it's "infojunk," as you say. I do agree that there's a problem if you don't know what the interface is trying to communicate. Basically, I'm trying to display all of the details of a third dimension (different versions captured over time) in a two-dimensional interface. No other backup system that I know of tries to do this. Time Machine, for all its snazzy UI goodness, doesn't even try. It merely shows you the items as they existed at a particular time, but won't give you the history of any individual item. It's doubly compounded by the fact that you're in column view. When there was only list view, the timelines were manageable. But in column view, you can put considerably more items on the screen at once. Each timeline imparts a lot of information, and a lot of timelines start to overwhelm the interface. But, as you pointed out, there's always the option of turning off the timelines if you're not interested in the individual history of every item.
Finding a file within the archive is similarly overcomplex: the search field seems to imply I can search my archive, but searches there do nothing.
Ah, that's because search is currently unimplemented. See the Known Issues section of the QRecall 1.2.0(35) beta release notes.
It's possible I completely misunderstand how the archive view is supposed to work, but I guess that's specifically my point - the use is opaque to the user.
That's a very valid point, and something I've struggled with since day one. The basic concept of multiple versions of items over time is really hard to convey in an interface. That's why I came up with the Time View, which is as close as I've been able to come to graphically presenting the actual structure of the archive.
Constraining the time span of the view is also confusing, specifically because the "handles" that you drag to narrow down the time span are out of view. You have to scroll to find them, and if I didn't know they existed, I'd never find them.
That's a good point, and something I want to address.
Overall I'm happy with the mechanics of how QRecall works - it's restored a lost Users folder and restored a full install perfectly, albeit very, very slowly -- the Users folder restore took 2 days, for example[!].
That's very strange. Recalling is almost always faster than capturing. I'd be very curious to know why your recalls are taking so long.
I hate criticizing a UI without offering some suggestions, and I'm happy to collate a list of stuff I've noticed.
I don't mind criticism, and I don't expect my users to design the UI.
My basic suggestion is that the default UI should be optimized for the primary use case for each task.
I agree.
- Chances are if you open an archive it's to find or restore files. This is difficult with the current UI - If you need to constrain the time span, it shouldn't require physical motion (scrolling and dragging) - give me a date picker or something more effective
I agree that finding the shader handles in the current UI is a bit awkward. But more to the point, the most typical task is to identify an item to recall and then rewind the archive to a specific version of that item. Previous versions of QRecall had a set of "VCR" buttons that would allow you to move backwards and forwards in time, but stopping only at specific versions of that particular file. The current rework of the UI has lost this feature and I'm working on something to replace it.
- Search should provide more feedback - when you perform a search it looks like nothing is happening - which from my tests can sometimes be true?
Well, when the search is working again you can tell me if the feedback is sufficient.
- The view when you have multiple backups in a single archive is confusing. In the screenshot above, it's not clear why I have two instances of my system, and the presence of "Unknown" is unnerving. Where did that come from? What does that mean?
You have captured to this archive using multiple identities (identity keys). Each identity key you use creates a unique owner, and everything belonging to that owner is kept separate from all of the items belonging to other owners. This allows you to safely store the backups of two computer systems in the same archive; nothing will get confused, even if the hard drive name and every file name are the same. You have an "Unknown" owner because (at some point) you repaired the archive and QRecall recovered files but couldn't determine which owner they belonged to; these recovered files are assigned to a special "Unknown" owner. If one of these owners is now really old or obsolete, or you want to get rid of the files belonging to "Unknown", you can select one of them and use the Archive > Delete Item... command. This will delete all of the items that belong to that owner from the archive.
These thoughts are still pretty scattered, but I wanted to at least get the gist of them out. I'm happy to spend a bit more time diagramming how I think the userflows should work but wanted to see what everyone's thoughts were.
I really appreciate your thoughts and the feedback. I'm acutely aware of some of QRecall's UI deficiencies, and I'm determined to correct them in this release.
|
 |
|
Gary, Please send a diagnostic report and let me know the approximate date/time that this happened. It would also be helpful if you could go into the Console application and see if there are any console messages from QRecall around that same time (just enter "QRecall" in the search field, select the relevant messages, copy, and paste them into an e-mail). Thanks!
|
 |
|
Ralph Strauch wrote:I'm trying to search my archive and the search function just isn't responding.
I apologize for the inconvenience. The search feature in the current beta is not working. See the Known Issues section of the QRecall 1.2.0(35) beta release notes.
|
 |
|
Gary K. Griffey wrote:I was just wondering if you were interested in receiving any bug reports for the current QRecall beta running on 10.7 Lion.
Absolutely. The more problem reports I get, the better. I've been testing QRecall on Lion and have been addressing compatibility issues as I encounter them. Most of these fixes will appear in the next beta. Later this month I'll start regression tests to make sure that QRecall can successfully capture and restore a bootable Lion system, and so on.
|
 |
|
If you're really curious, and have nothing better to do, you can open a terminal window and use the iosnoop tool to see what QRecall is up to:
sudo iosnoop

My guess is that when it's stuck, it's writing to the hash_scribble.index file. A bunch of tiny writes to this file indicates that RAM has filled up with newly created (or previously cached) record entries that now must be written to disk before any new records can be added. This is probably the most time-consuming, and I/O-intensive, periodic chore that QRecall has to do. If the archive is on a network volume, this could be even worse (performance-wise). If you're running a virtual machine at the same time, there could also be some good old VM thrashing going on. QRecall bases its table and cache sizes on the physical RAM installed in the machine. There's no rational way to determine what the RAM demands of other processes are, so two or more RAM-hungry tasks (which would definitely include QRecall) could be fighting for resources.
|
 |
|
Bruce Giles wrote:When this happens, is it because QRecall is doing some kind of time-consuming housekeeping in the middle of an archive?
More than likely, this is the case. QRecall maintains an entire menagerie of caches, indexes, and lookup tables, almost all of which require periodic maintenance. They need to be occasionally collated, sorted, written to disk, and so on. And, as you guessed, quite often this needs to happen at unpredictable times, usually in the middle of capturing some item. Your system's ratio of RAM to archive size will greatly affect the frequency of this kind of thing. If you have a lot of physical RAM, QRecall will allocate really big caches for many of these tables and indexes, which reduces how often they need to be flushed, sorted, or read back from disk.
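The RAM-to-flush-frequency relationship can be illustrated with a toy calculation (hypothetical numbers and function; nothing here reflects QRecall's real cache sizing):

```python
import math

def flush_count(total_records, cache_capacity):
    """How many times an in-RAM table must be flushed to disk while
    processing total_records entries, given room for cache_capacity
    entries in RAM. A toy model, not QRecall's actual bookkeeping."""
    return math.ceil(total_records / cache_capacity)

# Doubling the cache (i.e., more physical RAM) halves the number of
# expensive flush/sort/reload cycles for the same archive:
records = 10_000_000
print(flush_count(records, 500_000))    # 20 flushes
print(flush_count(records, 1_000_000))  # 10 flushes
```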
If not, do you have any idea what *is* happening?
Sometimes QRecall is just looking for the next thing to capture. QRecall displays the item that it's currently capturing, but once that item is done it goes looking for the next one. Until it finds the next item, the status line continues to display the previous one. In the future I might change this so that a prolonged scan for the next item is a little more transparent.
|
 |
|
Prion wrote:I noticed there was an incomplete layer, which I merged with the one that followed. The result was that I now have a more inclusive, larger layer which is still incomplete. I repeated with the next layer, same result. I am not totally sure that it wasn't there before the error occurred; it may well have been without me taking notice.
Diagnosis of this situation is complicated because the layer is an old one. Recent changes in QRecall have altered why layers are marked as incomplete, and how subsequent captures treat them. I'll explain what's going on, and a couple of scenarios in which this can occur.

A layer can be incomplete for a number of reasons. Typically, it's just because the capture was interrupted and didn't finish. When this happens, the layer is marked as "incomplete." Items lost because of data corruption can also cause a layer to be marked as "incomplete" during the repair process. But the repair is more likely to mark the layer as "damaged," which means the good portions of that layer were reconstructed by the repair, but some items were likely lost. Since your layer is marked as "incomplete" and not "damaged," let's assume that's exactly what it is.

When a layer is incomplete, it just means that QRecall didn't finish the capture, so there are likely items that didn't get captured. The next capture will start over and recapture any changed items, which will include any of those missed by the incomplete layer.

The first wrinkle is that this is the situation today. Older versions of QRecall were not as good about forcing missed items to be recaptured. They would be recaptured eventually, but it might be weeks before QRecall got back around to them. The newer versions of QRecall are very aggressive and force every item to be scanned (and recaptured, if needed) whenever they're capturing on top of an incomplete layer. So, today, if you capture an incomplete layer, followed by a complete layer, and then merge those two layers, the merged layer will be complete. That's because anything missing from the incomplete layer will have been captured in the subsequent layer. However, if these layers are old, that might not be the case.

The second wrinkle is that there may be a bug in QRecall that incorrectly marks a merged layer as incomplete or damaged.
Earlier versions of QRecall were too lax about merging incomplete and damaged layers. The resulting merged layer could still be incomplete (or even damaged), yet the layer was marked as whole. Newer versions of QRecall are much more conservative about making sure that merged layers that have inherited any inconsistencies are marked as such. However, I suspect it's a little too conservative. I've now had two user reports where a damaged layer was merged with a complete layer, which should result in a merged layer without any missing data, yet the merged layer was still marked as "damaged." I'm currently investigating this. The problem doesn't cause any actual data loss, and I prefer QRecall to err on the side of caution, but it still appears to be a bug.
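The conservative flag-propagation rule described here can be sketched as follows (an illustration of the rule only, not QRecall's code; the status names mirror the ones used above):

```python
def merged_layer_flag(layer_statuses):
    """Conservative rule: a merged layer inherits the worst status of
    any input layer, so incomplete + complete => incomplete.
    A sketch for illustration, not QRecall's implementation."""
    severity = {"complete": 0, "incomplete": 1, "damaged": 2}
    return max(layer_statuses, key=lambda s: severity[s])

# The intended behavior described above: if a later *complete* layer
# follows the incomplete one, everything missed was recaptured, so the
# merge should really be "complete". The conservative rule instead
# reports "incomplete", which matches the overly cautious marking
# being investigated.
print(merged_layer_flag(["incomplete", "complete"]))  # incomplete
```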
Any ideas what to try next? Ultimately, I am not too worried because the layer is quite old, but I would like to make sure that I have at least one copy of each item inside the incomplete layer that resides in apparently intact layers.
My suggestion is to ignore it. If you have any captures that follow that layer, particularly ones made with a newer version of QRecall, then any items that weren't captured in that layer have almost certainly been recaptured in some subsequent layer. You could continue to merge layers until you have a complete layer, but that strikes me as excessive.
|
 |
|
Adam Horne wrote:I have a 2TB external HD as my recovery drive. I partitioned this drive into (2) sections with 1TB each. Would there be any problem using QRecall on 1 section and Time Machine on the other?
There shouldn't be any inherent problems, as long as 1TB is sufficient for your backup needs.
Would there be any potential issues?
You could experience performance degradation if QRecall and Time Machine are both copying data to the drive at the same time, so you might want to schedule QRecall to run when Time Machine is normally idle.
Perhaps I'm being a little too cautious, also...
Cautious is good ... 
|
 |
|
Neil Lee wrote:I just did some AJA speed tests on the drive, and the average write speed was 50+MB/s. Reading, however, was barely 10MB/s.
That's very weird. I don't have any explanation for why the Time Capsule would be transferring this slowly, or why the read speed is so much slower than the write speed. You might try resetting the Time Capsule (via AirPort Admin). If it's an external drive, you might also consider plugging it into your Mac to repair the volume, and possibly defragmenting it. All just shots in the dark, but sometimes they help.
At any rate, after about 6 hours the restore operation finished (145G of data in total) and everything was exactly as it was. This is the first time I've tried restoring this much data and it's reassuring that QRecall did what it was supposed to.
Well I'm glad it all worked, even if it wasn't as fast as it could have been.
|
 |
|
Neil Lee wrote: I launched QRecall and started the restore process. I think there was around 100G of data. Four hours later, it's barely 1/2 done.
That's slower than I would expect. That works out to a transfer rate of about 3.5MB/second.
My primary machine is connected to the Time Capsule via ethernet, so I assume it's at least a 100Mb connection. It could be gigabit ethernet, too
As a test, I connected a MacBook via 100Mb ethernet and performed a recall from an Xserve-hosted AFP volume. As one would expect, the recall transferred at about 7-8MB/second. After 1 hour, it had recalled about 22GB of data.
I'm not sure if that's the default protocol between the Time Capsule and a 2010 MBP.
3.5MB/second sounds suspiciously like WiFi. That's about the rate you get with a solid 802.11g connection. Are you sure your MacBook Pro isn't connecting to the Time Capsule via WiFi? This has happened to me a number of times with my MacBook Pro (a loose ethernet cable that I never noticed because the laptop switches immediately to WiFi).
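For reference, the raw ceilings of the links discussed work out with simple division (approximate nominal figures; real-world throughput is always lower because of protocol overhead):

```python
# Nominal link-speed ceilings: divide megabits/s by 8 for megabytes/s.
links_mbps = {
    "100Mb ethernet": 100,
    "gigabit ethernet": 1000,
    "802.11g WiFi": 54,
}
for name, mbps in links_mbps.items():
    print(f"{name}: at most {mbps / 8:.2f} MB/s before overhead")

# 802.11g tops out at 6.75 MB/s before overhead, so a sustained
# 3.5 MB/s is entirely consistent with a real-world WiFi link.
```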
|
 |
|
|
|