QRecall

john g

Hi, I have an rMBP that uses an external Thunderbolt 2 drive. I want to maintain a backup of this external drive, which currently holds about 3TB of data to back up. This is my initial backup of the drive, and I have 1.76 TB backed up after running for 46 hours. At this point, I am more than 50% complete, with 12% duplication.

Equipment
rMBP Thunderbolt 2 external drive
Synology DS414 with 4 x 3TB WD Red drives in RAID10 configuration (50% capacity at this point)
I have both ethernet lines bonded with an HP switch that works with this type of connection. From the switch, I have one 1000 Mbps ethernet connection to the rMBP.

The DS414 is reporting both LAN lines bonded for 2000 Mbps Full Duplex and an MTU of 1500.

Should 3TB take so long? At this point, it looks like I'll need about 100 hours to back up the entire 3TB.

James Bucanek

john g wrote: Should 3TB take so long? At this point, it looks like I'll need about 100 hours to back up the entire 3TB.

It's possible. 3TB is a lot of data to de-duplicate.

The computations needed to de-duplicate data increase exponentially as the corpus of data grows. Especially once your archive is past a TB or two, it requires a massive number of small data reads to check each new block against what's already in the archive. Even though you've got a very fast network configuration, this is still going to have a higher transaction latency than a direct bus connection (SATA, eSATA, Thunderbolt, and so on).
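To make the "check each new block against what's already in the archive" step concrete, here is a minimal sketch of block-level de-duplication. It is only an illustration, not QRecall's actual code: the block size, the use of SHA-256, and the in-memory `index` and `store` are all assumptions made for the example. In a real archive the index lives on the (network) volume, so each lookup can turn into a small remote read.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def capture(data: bytes, index: dict, store: list) -> int:
    """Add data to a toy archive and return the number of new blocks stored.

    index maps a block digest to its position in store (a stand-in for the
    archive's lookup database); store holds the unique blocks themselves.
    """
    new_blocks = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        if digest not in index:  # one lookup per incoming block
            index[digest] = len(store)
            store.append(block)
            new_blocks += 1
    return new_blocks


index, store = {}, []
data = b"a" * BLOCK_SIZE + b"b" * BLOCK_SIZE + b"a" * BLOCK_SIZE
first_pass = capture(data, index, store)   # the repeated "a" block is stored once
second_pass = capture(data, index, store)  # everything is already in the archive
```

Every incoming block costs at least one lookup whether or not it turns out to be a duplicate, which is why lookup latency matters so much more than raw bandwidth here.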

First tip would be to turn off shifted quanta detection. Shifted quanta detection performs numerous lookups into the archive database for every new block, instead of simply checking it once to see if it's a duplicate. Especially for an initial backup, shifted quanta detection won't save you much. (You're free to turn it back on once your initial capture is finished.)
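As a rough illustration of why this costs more, the sketch below counts lookups for a simple aligned check versus a naive sliding-window search that also tries shifted offsets. This is a toy model under stated assumptions, not how QRecall implements shifted quanta detection: the function name `count_lookups`, the 16-byte block size, and the byte-by-byte sliding strategy are all made up for the example.

```python
import hashlib


def count_lookups(data: bytes, known: set, block_size: int, shifted: bool):
    """Return (lookups, duplicates_found) for a toy duplicate search.

    With shifted=False, only block-aligned offsets are checked.
    With shifted=True, a miss advances by one byte instead of a whole block,
    so duplicates that moved by a few bytes can still be found.
    """
    lookups = found = offset = 0
    while offset + block_size <= len(data):
        lookups += 1
        digest = hashlib.sha256(data[offset:offset + block_size]).digest()
        if digest in known:
            found += 1
            offset += block_size
        elif shifted:
            offset += 1
        else:
            offset += block_size
    return lookups, found


BLOCK = 16  # hypothetical block size for the example
original = bytes(range(64))
known = {hashlib.sha256(original[i:i + BLOCK]).digest() for i in range(0, 64, BLOCK)}

# Old data with 3 bytes inserted in front: only the shifted search finds it.
moved = b"\xff\xff\xff" + original
moved_aligned = count_lookups(moved, known, BLOCK, shifted=False)
moved_shifted = count_lookups(moved, known, BLOCK, shifted=True)

# Entirely new data, as in an initial backup: many more lookups, nothing gained.
fresh = bytes(range(100, 164))
fresh_aligned = count_lookups(fresh, known, BLOCK, shifted=False)
fresh_shifted = count_lookups(fresh, known, BLOCK, shifted=True)
```

In this model the shifted search pays off when old data has merely moved, but on brand-new data it performs roughly one lookup per byte instead of one per block and finds nothing extra, which matches the situation in an initial capture.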

Be patient. It's a lot of data to de-dup, and it's just going to take time, memory, and bandwidth. You might consider scheduling your backup with an action and adding a condition so the capture stops if it's taking longer than, say, 10 hours to finish. Every day it will do another 10 hours, picking up where it left off yesterday. Eventually, it should catch up. At that point you might want to merge all of those incomplete layers into a single baseline layer.
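For a back-of-the-envelope idea of how many 10-hour sessions that would take, here is a small calculation using the figures from the original post (3 TB total, 1.76 TB captured in 46 hours). The helper name `sessions_needed` is hypothetical, and the estimate assumes the capture rate stays constant; since de-duplication slows as the archive grows, treat the result as an optimistic lower bound.

```python
import math


def sessions_needed(total_tb, done_tb, hours_so_far, hours_per_session):
    """Estimate remaining hours and sessions, assuming a constant capture rate."""
    rate_tb_per_hour = done_tb / hours_so_far
    remaining_hours = (total_tb - done_tb) / rate_tb_per_hour
    sessions = math.ceil(remaining_hours / hours_per_session)
    return remaining_hours, sessions


remaining_hours, sessions = sessions_needed(
    total_tb=3.0, done_tb=1.76, hours_so_far=46, hours_per_session=10
)
print(f"about {remaining_hours:.1f} more hours, or {sessions} sessions of 10 hours")
```

With those numbers, the remaining 1.24 TB works out to a little over 32 hours, so at least four daily 10-hour captures.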

If you're desperate to reduce the de-duplication overhead, you might also consider splitting up your archive. For example, you might capture all of your virtual machine images to one archive while capturing all of your multi-media files to a second archive. Unless your virtual machine images contain copies of your multi-media content, it's unlikely that they would have much in common.

Your post is also timely, inasmuch as I've been writing code all week to add a new feature to QRecall. In QRecall 2.0, you'll be able to schedule a capture that copies just the raw data to the archive, without performing any de-duplication. This should be nearly as fast as a simple file copy. The de-duplication work is then deferred until the next compact action is run. This should make short captures to large (>2TB) archives much more efficient.