QRecall Community Forum

Shifted Quanta Settings
Gary K. Griffey
Joined: Mar 21, 2009
Messages: 156
Greetings James,

I have a question concerning shifted quanta detection.

In reading the details concerning the benefits/drawbacks of using shifted quanta detection in your help file, it would seem to me that during the initial capture of a file to a newly created archive, shifted quanta would not be relevant at all. From my understanding, shifted quanta detection would only be relevant during subsequent captures of the same file.

This, however, does not appear to be the case. I created a new archive...and began to capture a single large virtual disk file that is being housed on a network share...this file is roughly 150 GB. At first, I set the shifted quanta in the archive to its maximum setting. After allowing the initial capture to run for nearly 60 hours...it was only 50% complete.

I cancelled the capture...created another new archive...and ran the same initial capture with shifted quanta turned off. The capture completed in just over an hour.

What am I missing? (A whole lot, I'm sure...but I always learn a lot from your responses.)

Thanks

GKG
James Bucanek
Joined: Feb 14, 2007
Messages: 1568
Gary K. Griffey wrote:In reading the details concerning the benefits/drawbacks of using shifted quanta detection in your help file, it would seem to me that during the initial capture of a file to a newly created archive, shifted quanta would not be relevant at all. From my understanding, shifted quanta detection would only be relevant during subsequent captures of the same file.

It's true that it would be "most relevant" when applied to a previously captured file, but the mechanics of shift-quanta detection are applied to every block of data no matter what files have been captured. Specifically, data de-duplication is performed at the data block level; it has no knowledge of files. It simply compares each newly captured block of data against the corpus of previously captured blocks.

Gary K. Griffey wrote:This, however, does not appear to be the case. I created a new archive...and began to capture a single large virtual disk file that is being housed on a network share...this file is roughly 150 GB. At first, I set the shifted quanta in the archive to its maximum setting. After allowing the initial capture to run for nearly 60 hours...it was only 50% complete.

60 hours? Wow, you are really committed to this experiment.

Here, in a nutshell, is how QRecall performs data de-duplication.

First, let's consider how de-duplication works when shifted-quanta detection is turned off:

Files are divided into 32K blocks. When you capture a file, the first 32K of data (the bytes at offset 0 through 32,767) are read. QRecall then determines if there's an identical block of data already in the archive. If there is, then this is a duplicate block and it does not need to be added. If not, the new block is unique and is added to the archive.

The next 32K of data (the bytes at offset 32,768 through 65,535) are read and the process repeats.

The work to capture data increases as the size of the archive increases, but through various data tricks and optimizations it goes pretty fast, because each 32K block of data only needs a single up/down vote on whether it's unique or not.
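
To make that loop concrete, here's a rough sketch of the fixed-block process in Python. This is not QRecall's actual code (which isn't public); BLOCK_SIZE, archive_blocks, and capture_file are made-up names, and a real archive would record a reference for each duplicate rather than storing nothing at all:

```python
import hashlib

BLOCK_SIZE = 32 * 1024  # 32K blocks, as described above

def capture_file(path, archive_blocks):
    """Capture one file, adding only blocks not already in the archive.

    archive_blocks is a dict mapping a block's digest to its data,
    standing in for the archive's real index of captured blocks.
    """
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break  # end of file
            digest = hashlib.sha256(block).digest()
            if digest in archive_blocks:
                continue                    # duplicate: nothing new to store
            archive_blocks[digest] = block  # unique: add it to the archive
```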

Now let's look at what happens when you turn shifted-quanta detection all the way to the maximum.

The process starts out the same: the first 32K (bytes 0 through 32,767) are read and checked for uniqueness. If the block is a duplicate, then it is not added to the archive, and QRecall skips immediately to the next block.

If the block is unique, however, the process changes. When shifted-quanta detection is on, QRecall then shifts 1 byte and considers the 32K block of data at offset 1 (bytes 1 through 32,768). If that block also turns out to be unique, it shifts one more byte and sees if the block formed by bytes 2 through 32,769 is unique, and so on:

Are bytes 0 through 32,767 contained in the archive?
No?
Are bytes 1 through 32,768 contained in the archive?
No?
Are bytes 2 through 32,769 contained in the archive?
No?
Are bytes 3 through 32,770 contained in the archive?
No?
(repeat 32,764 more times)

If any of those blocks turn out to be a duplicate, then QRecall found a "shifted duplicate". The fragment before the shifted duplicate is captured as a separate block and the whole thing repeats, starting with the next block immediately after the shifted duplicate block.

However, it's also likely that all 32,768 tests will be false. If so, then QRecall knows that the block is unique and it doesn't overlap any shifted duplicate block. The block is added to the archive and the process starts over with the next block in the file.
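
Continuing the sketch above, the shifted search looks roughly like this. Again the names are invented, and this naive version recomputes a full digest at every shift, ignoring the rolling-checksum trick described in the footnote below:

```python
def find_shifted_duplicate(data, start, archive_blocks):
    """Slide a 32K window one byte at a time from 'start', looking for
    a block already in the archive. Returns the offset of the first
    shifted duplicate, or None if all 32,768 positions are unique."""
    for shift in range(BLOCK_SIZE):
        window = data[start + shift : start + shift + BLOCK_SIZE]
        if len(window) < BLOCK_SIZE:
            break  # ran off the end of the file
        if hashlib.sha256(window).digest() in archive_blocks:
            return start + shift  # found a shifted duplicate
    return None  # the block is unique and overlaps no shifted duplicate
```

When this returns an offset, the fragment from the original position up to that offset is captured as its own block, exactly as described above.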

While shifted-quanta detection is pretty ruthless about finding duplicate data, you can also see that if the data is all unique the process takes 30,000 times longer[1] to come to that conclusion. That's why a 1-hour capture with no shifted-quanta detection takes 120 times longer when you max it out.

That explains "no" shifted-quanta detection and "full" shifted-quanta detection. The settings in-between simply toggle between the two modes based on how effective they are.

When shifted-quanta detection is set to "low" or "medium," the capture starts out performing full shifted-quanta detection. But after a few dozen blocks, if this work doesn't find any shifted duplicates, shifted-quanta detection is temporarily turned off and QRecall switches to straight full-block de-duplication. This is controlled by a CPU budget. If shifted-quanta detection doesn't find any shifted duplicate blocks, the budget runs out and it stops performing shifted-quanta detection. If shifted-quanta detection does find shifted duplicates, the budget is increased, so it spends more time searching for duplicates.

The basic idea is that shifted-quanta detection is most productive when there are large amounts of shifted, duplicate data, so QRecall doesn't have to test every single block to discover one. If shifted-quanta detection is producing results, QRecall does it more. If it's turning out to be a huge waste of effort, it does it less.
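
In code, that feedback loop might look something like the following. The constants and names are pure invention for illustration; QRecall's real budget accounting isn't documented:

```python
class ShiftBudget:
    """Illustrative CPU budget: shrink it on fruitless searches,
    grow it when shifted duplicates are actually found."""

    def __init__(self, initial=50):
        self.budget = initial

    def should_search(self):
        # While the budget lasts, perform full shifted-quanta detection;
        # once it's exhausted, fall back to plain full-block de-duplication.
        return self.budget > 0

    def record(self, found_shifted_duplicate):
        if found_shifted_duplicate:
            self.budget += 10  # productive: spend more time searching
        else:
            self.budget -= 1   # fruitless: drift toward turning it off
```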

Let me know if that adequately explains things.


[1] Full shifted-quanta detection doesn't, literally, take 30,000 times longer than de-duplicating a single block because of a few mathematical tricks and other optimizations. For example, QRecall uses so-called "rolling" checksums to create a "fingerprint" of the data in a block. The work required to calculate the first checksum, for bytes 0 through 32,767, requires about 64,000 shift and add operations. But the checksum for bytes 1 through 32,768 can then be calculated with just three more operations?essentially "removing" the value of the first byte and "adding" in the value of the next byte. Thus, the checksums for all of those intermediate blocks of data can be calculated very quickly, the total work not being that much more than calculating the checksum of a second block. Of course, that's just the checksum; each block still has to be tested against those blocks already captured, but it does save a lot of time.
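
For the curious, here's what such a rolling checksum can look like in Python. This is the classic rsync-style two-sum scheme, shown purely for illustration rather than as QRecall's actual fingerprint:

```python
def initial_checksum(block):
    """Full pass over one block: two running sums, roughly two
    operations per byte (the ~64,000 operations mentioned above)."""
    a = b = 0
    for byte in block:
        a += byte
        b += a
    return a, b

def roll(a, b, out_byte, in_byte, n=32 * 1024):
    """Slide the window one byte: 'remove' the departing byte and
    'add' the arriving one in just a handful of operations."""
    a = a - out_byte + in_byte
    b = b - n * out_byte + a
    return a, b
```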

- QRecall Development -
Gary K. Griffey
Joined: Mar 21, 2009
Messages: 156
James,

Thanks for the thorough explanation. I have a much better understanding of the options, the process, etc.

Your answer does raise one additional question. It would seem from your explanation that the vast majority of the workload during a capture using the highest shifted quanta setting would fall on the archive and its disk.

The capture that I described is targeting a large single file located on an SMB share. The archive, however, is being housed on a locally attached Thunderbolt 2 disk array. This Thunderbolt 2 disk is connected to a 2013 Mac Pro with a 6 core Xeon CPU, 32 GB of RAM, etc. (i.e., plenty of horsepower).

When I looked at Activity Monitor during the initial 60 hour capture test using the highest shifted quanta setting, the Thunderbolt 2 drive was reading at 20-25 MB/s, certainly far less than its performance capacity. I would think that this drive would be literally "screaming" as QRecall was being forced to search every block in the archive for a quanta match. This did not appear, however, to be the case. Any thoughts?

Thanks again for all the info and a great product.

GKG
James Bucanek
Joined: Feb 14, 2007
Messages: 1568
Gary,

Your I/O isn't maxed out, but I bet the CPU for the QRecallHelper process is running at 100% (or more).

If you want performance, you have to avoid physical I/O as much as possible. I/O is slow. Memory and the CPU are fast.

So QRecall uses massive in-memory maps, tables, indexes, and caches to first determine whether a data block is unique or not, before any reading or writing of the archive begins. In fact, most (around 80%) of the memory used by QRecall during a capture is these lookup tables and caches.

When your archive is relatively small, QRecall can determine whether a block is unique 99.9% of the time without having to read any data directly from the archive. This process is typically 200 to 300 times faster than writing a single new block to the archive, which is why shifted-quanta detection can test 30,000 variations of a single block so quickly.
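
As an illustration of the kind of structure that makes this possible (QRecall's actual tables aren't public), a Bloom-style filter can answer "definitely not in the archive" from memory alone, so the disk is only touched for the rare possible match:

```python
import hashlib

class BlockFilter:
    """Toy Bloom-style filter over block digests. A negative answer is
    definitive, so most uniqueness checks never touch the disk."""

    def __init__(self, bits=1 << 24):  # a 2 MB bitmap
        self.bits = bits
        self.table = bytearray(bits // 8)

    def _positions(self, digest):
        # Derive three bit positions from a 32-byte SHA-256 digest.
        for i in range(3):
            chunk = digest[i * 8:(i + 1) * 8]
            yield int.from_bytes(chunk, "big") % self.bits

    def add(self, digest):
        for p in self._positions(digest):
            self.table[p // 8] |= 1 << (p % 8)

    def might_contain(self, digest):
        # False means "certainly unique"; True means "check the archive".
        return all(self.table[p // 8] & (1 << (p % 8))
                   for p in self._positions(digest))
```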

- QRecall Development -
Gary K. Griffey
Joined: Mar 21, 2009
Messages: 156
James,

That is interesting...it would appear that you are indeed correct. The QRecall Helper process is running at 15.5%-16.5% of my 6 core Xeon CPU...which means it is maxing out 1 core.

So, I take it that QRecall cannot spread its work out over multiple CPU cores?

Thanks,

GKG
James Bucanek
Joined: Feb 14, 2007
Messages: 1568
Gary K. Griffey wrote:So, I take it that QRecall cannot spread its work out over multiple CPU cores?

QRecall is heavily multi-threaded. You'll see that some actions, like verify, can keep 6 or more CPU cores running at 100%.

But the de-duplication process is, inherently, a sequential process. Consider just two blocks. If you check both for uniqueness simultaneously, each can be found unique (no similar blocks in the archive), yet the two blocks could be duplicates of each other. Furthermore, if both blocks are unique, they have to be added to the cache before checking the next block. But two processes can't modify a structure like a cache simultaneously, and you can't use the cache until the add operation is complete. So while there are a few operations that could be performed in parallel, there are so many intermediate steps that have to occur sequentially that it's actually a waste of time trying to turn it into a multi-tasking problem.

Now if you happen to have encryption, compression, or data redundancy turned on, those processes are independent of the de-duplication logic and are all performed on separate threads.
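
The shape of that split, in a rough sketch (this is how such a pipeline is commonly structured, not QRecall's actual threading code):

```python
import queue, threading, zlib

def dedup_stage(blocks, work, seen):
    """Sequential stage: each block's verdict may depend on the block
    decided just before it, so this runs on a single thread."""
    for digest, block in blocks:
        if digest not in seen:
            seen.add(digest)
            work.put(block)  # hand unique blocks to the worker thread
    work.put(None)           # sentinel: no more blocks

def compress_stage(work, results):
    """Independent stage: compressing one block never depends on
    another, so it can run concurrently on its own thread."""
    while True:
        block = work.get()
        if block is None:
            break
        results.append(zlib.compress(block))

# Usage: run de-duplication and compression concurrently.
work, results = queue.Queue(), []
worker = threading.Thread(target=compress_stage, args=(work, results))
worker.start()
dedup_stage([], work, set())  # blocks would come from the file reader
worker.join()
```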

- QRecall Development -
Gary K. Griffey
Joined: Mar 21, 2009
Messages: 156
That makes perfect sense.

Thanks again for taking the time to provide such detailed answers to my questions.

GKG

 