Gary K. Griffey wrote: In reading the details concerning the benefits/drawbacks of using shifted quanta detection in your help file, it would seem to me that during the initial capture of a file to a newly created archive, shifted quanta would not be relevant at all. It would seem, at least from my understanding, that shifted quanta detection would only be relevant during subsequent captures of the same file.
It's true that it would be "most relevant" when applied to a previously captured file, but the mechanics of shifted-quanta detection are applied to every block of data, no matter what files have been captured. Specifically, data de-duplication is performed at the data-block level; it has no knowledge of files. It simply compares each newly captured block of data against the corpus of previously captured blocks.
This, however, does not appear to be the case. I created a new archive...and began to capture a single large virtual disk file that is being housed on a network share...this file is roughly 150 GB. At first, I set the shifted quanta in the archive to its maximum setting. After allowing the initial capture to run for nearly 60 hours...it was only 50% complete.
60 hours? Wow, you are really committed to this experiment.
Here, in a nutshell, is how QRecall performs data de-duplication.
First, let's consider how de-duplication works when shifted-quanta detection is turned off:
Files are divided into 32K blocks. When you capture a file, the first 32K of data (the bytes at offset 0 through 32,767) are read. QRecall then determines if there's an identical block of data already in the archive. If there is, then this is a duplicate block and it does not need to be added. If not, the new block is unique and is added to the archive.
The next 32K of data (the bytes at offset 32,768 through 65,535) are read and the process repeats.
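Here's a minimal sketch of that loop in Python. It's illustrative only: the hash set and SHA-256 digests stand in for whatever index and fingerprint scheme QRecall actually uses internally.

import hashlib

BLOCK_SIZE = 32 * 1024  # 32K blocks, as described above

def capture_file(path, archive_index, archive_blocks):
    # archive_index: set of digests for blocks already in the archive
    # archive_blocks: stand-in for the archive's block store
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).digest()
            if digest in archive_index:
                continue  # duplicate block: nothing to add
            archive_index.add(digest)  # unique block: add it
            archive_blocks.append(block)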
The work to de-duplicate each block increases as the archive grows, but through various data structures and optimizations it goes pretty fast, because each 32K block of data only needs a single up/down vote on whether it's unique or not.
Now let's look at what happens when you turn shifted-quanta detection all the way to the maximum.
The process starts out the same: the first 32K (bytes 0 through 32,767) are read and checked for uniqueness. If the block is a duplicate, it is not added to the archive, and QRecall skips immediately to the next block.
If the block is unique, however, the process changes. When shifted-quanta detection is on, QRecall then shifts 1 byte and considers the 32K block of data at offset 1 (bytes 1 through 32,768). If that block also turns out to be unique, it shifts one more byte and sees if the block formed by bytes 2 through 32,769 is unique, and so on:
Are bytes 0 through 32,767 contained in the archive?
No?
Are bytes 1 through 32,768 contained in the archive?
No?
Are bytes 2 through 32,769 contained in the archive?
No?
Are bytes 3 through 32,770 contained in the archive?
No?
(repeat 32,764 more times)
If any of those blocks turns out to be a duplicate, then QRecall has found a "shifted duplicate". The fragment before the shifted duplicate is captured as a separate block and the whole thing repeats, starting with the next block immediately after the shifted duplicate block.
However, it's also likely that all 32,768 tests will be false. If so, then QRecall knows that the block is unique and doesn't overlap any shifted duplicate block. The block is added to the archive and the process starts over with the next block in the file.
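In code, the shifted search might look something like the sketch below. It's a naive rendering that recomputes a digest at every shift; as the footnote explains, QRecall uses a rolling checksum so each one-byte slide costs almost nothing. The function name and digest choice are mine, not QRecall's.

import hashlib

BLOCK_SIZE = 32 * 1024

def find_shifted_duplicate(data, offset, archive_index):
    # Called after the block at `offset` (shift 0) tested unique.
    # Slides the 32K window forward one byte at a time.  Returns the
    # shift at which a duplicate block was found, or None if every
    # shifted position is unique.
    for shift in range(1, BLOCK_SIZE):
        window = data[offset + shift : offset + shift + BLOCK_SIZE]
        if len(window) < BLOCK_SIZE:
            break  # window ran past the end of the file
        if hashlib.sha256(window).digest() in archive_index:
            # Shifted duplicate: the `shift` bytes before it are
            # captured as a separate, smaller fragment.
            return shift
    return None  # block is unique and overlaps no shifted duplicate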
While shifted-quanta detection is pretty ruthless about finding duplicate data, you can also see that if the data is all unique, the process takes roughly 30,000 times longer[1] to come to that conclusion. That's why a capture that takes 1 hour with no shifted-quanta detection can take 120 times longer when you max it out. (And 60 hours at 50% complete works out to roughly 120 hours total, which fits.)
That explains "no" shifted-quanta detection and "full" shifted-quanta detection. The settings in between simply toggle between those two modes based on how effective the shifted search is proving to be.
When shifted-quanta detection is set to "low" or "medium," the capture starts out performing full shifted-quanta detection. But after a few dozen blocks, if this work doesn't find any shifted duplicates, shifted-quanta detection is temporarily turned off and QRecall switches to straight full-block de-duplication. This is controlled by a CPU budget. If shifted-quanta detection doesn't find any shifted duplicate blocks, the budget runs out and QRecall stops performing shifted-quanta detection. If it does find shifted duplicates, the budget is increased, so QRecall spends more time searching for them.
The basic idea is that shifted-quanta detection is most productive when there are large amounts of shifted, duplicate data, so QRecall doesn't have to test every single block to discover some. If shifted-quanta detection is producing results, QRecall does it more. If it's turning out to be a huge waste of effort, it does it less.
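Sketched in the same style, that feedback loop might look roughly like this. The constants and the bookkeeping are invented for illustration; only the shape of the loop, spend budget on each search and refund it on a hit, is taken from the description above.

import hashlib

INITIAL_BUDGET = 50  # hypothetical values; the real tuning is internal
REWARD = 50          # budget granted each time a search pays off

def capture_adaptive(blocks, archive_index, shift_search):
    # shift_search(block) -> bool performs the byte-by-byte search
    # sketched earlier and reports whether it found a shifted duplicate.
    budget = INITIAL_BUDGET
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest in archive_index:
            continue  # plain duplicate: skip it
        if budget > 0:
            budget -= 1  # shifted-quanta detection is still "on"
            if shift_search(block):
                budget += REWARD  # productive: keep searching
                continue  # (fragment capture omitted for brevity)
        archive_index.add(digest)  # capture the whole unique block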
Let me know if that adequately explains things.
[1] Full shifted-quanta detection doesn't literally take 30,000 times longer than de-duplicating a single block, thanks to a few mathematical tricks and other optimizations. For example, QRecall uses so-called "rolling" checksums to create a "fingerprint" of the data in a block. The work required to calculate the first checksum, for bytes 0 through 32,767, requires about 64,000 shift and add operations. But the checksum for bytes 1 through 32,768 can then be calculated with just three more operations: essentially "removing" the value of the first byte and "adding" in the value of the next byte. Thus, the checksums for all of those intermediate blocks of data can be calculated very quickly, the total work being not much more than calculating the checksum of a second block. Of course, that's just the checksum; each block still has to be tested against those blocks already captured, but it does save a lot of time.
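For the curious, here's what a rolling checksum of that flavor looks like. This is the rsync-style weak checksum, which may or may not match QRecall's exact formula: computing the first window costs a couple of additions per byte, while each subsequent one-byte slide costs only a few operations regardless of block size.

BLOCK_SIZE = 32 * 1024
MOD = 1 << 16

def initial_checksum(window):
    # Two running sums over the whole window: roughly two adds per
    # byte, on the order of 64,000 operations for a 32K block.
    a = b = 0
    n = len(window)
    for i, byte in enumerate(window):
        a = (a + byte) % MOD
        b = (b + (n - i) * byte) % MOD
    return a, b

def roll(a, b, old_byte, new_byte, n=BLOCK_SIZE):
    # Slide the window one byte: "remove" the departing byte and
    # "add" the arriving one.  Constant work, independent of n.
    a = (a - old_byte + new_byte) % MOD
    b = (b - n * old_byte + a) % MOD
    return a, b

# Sliding from bytes 0..32,767 to bytes 1..32,768:
data = bytes(range(256)) * 256
a, b = initial_checksum(data[:BLOCK_SIZE])
a, b = roll(a, b, data[0], data[BLOCK_SIZE])
assert (a, b) == initial_checksum(data[1:BLOCK_SIZE + 1])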