Combining Chunk Boundary Calculations and Signature Calculation for Deduplication

Witold Litwin, Université Paris Dauphine
Darrell Long, University of California Santa Cruz
Thomas Schwarz, Universidad Católica del Uruguay

10th International Information and Telecommunication Technologies Conference (I2TS), Dec. 2011, Florianópolis, Brazil.
IEEE Latin America Transactions, vol. 10(1), 2011.
Deduplication
How not to store the same data twice
Breaks data into chunks, calculates a signature per chunk, and uses the signature to check whether the chunk is already stored
If the chunk is already stored, creates a file manifest that allows reconstructing the file
The chunk is replaced by a pointer to the chunk stored elsewhere
Can have poor read performance, as files need to be reconstructed from scattered chunks
Is very appropriate for write-heavy loads (backup, archival)
Is used for web-based storage
Deduplication
Is part of storage solutions from all major providers
Has impressive compression ratios for backup workloads
Up to 20:1 (Zhu, Li, and Patterson, 2008)
Up to 30:1 (Mandagere, Zhou, Smith, and Uttamchandani, 2008)
Deduplication
Can operate on streams (backup workloads) or on files (archival)
In the file case, uses a whole-file hash to discover identical files
Breaks incoming data into chunks
Typical chunk size is 4 KB or 8 KB
Calculates the signature (hash) of each chunk
Looks up chunk signatures in a database
Creates a file manifest that describes where the chunks are
Stores new chunks in the system, stores the manifest, and updates the chunk-signature information
Deduplication
[Diagram: the incoming stream or file goes to the Chunker, which emits chunks; signatures are calculated for each chunk; the chunk signatures are looked up and a file manifest is created. Outputs: the File Print (file signature, chunk sig 1, chunk sig 2, …) and the File Manifest (chunk 1 is at …, chunk 2 is at …).]
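The write path shown in the diagram and on the previous slide fits in a few lines of code. The following is a minimal illustration only, assuming SHA-1 chunk signatures, fixed-size 8 KB chunks (content-defined boundaries come later), and in-memory dictionaries standing in for the chunk store and the signature database; the names dedup_store, dedup_read, chunk_index, and chunk_store are invented for the example.

import hashlib

CHUNK_SIZE = 8192          # typical 4 KB or 8 KB chunk size from the slides
chunk_index = {}           # chunk signature -> storage location (stands in for the database)
chunk_store = {}           # storage location -> chunk bytes (stands in for the chunk store)

def dedup_store(data: bytes) -> list:
    """Chunk the data, sign each chunk, and build the file manifest.
    New chunks are stored; known chunks become pointers to existing copies."""
    manifest = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        sig = hashlib.sha1(chunk).hexdigest()       # chunk signature
        if sig not in chunk_index:                  # signature lookup
            location = len(chunk_store)             # pretend storage address
            chunk_store[location] = chunk
            chunk_index[sig] = location
        manifest.append(chunk_index[sig])           # pointer to the stored chunk
    return manifest

def dedup_read(manifest: list) -> bytes:
    """Reconstruct a file from its manifest (the potentially slow read path)."""
    return b"".join(chunk_store[location] for location in manifest)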
Deduplication
Suffers from two bottlenecks
I/O bottleneck: needs to look up chunk signatures in a database
Proposals: Bloom filters, adaptive caching, extreme binning
Needs to scan each file twice:
To calculate chunk boundaries
To calculate chunk signatures
Suffers from a leap of faith:
There is no time to verify chunk identity byte-by-byte
Relies on chunk signature identity to identify identical chunks
The advent of MD5 and SHA-1 convinced people that the risk of falsely identifying chunks is acceptable
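To illustrate the first proposal, the sketch below places a minimal Bloom filter in front of the signature database: a definite "not seen" answer lets a chunk be stored without the expensive lookup, while a "maybe" still goes to the database. The filter size, hash count, and class name are arbitrary assumptions, not parameters from any of the cited schemes.

import hashlib

class BloomFilter:
    """Minimal Bloom filter used as a pre-check before the signature database."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions from salted SHA-1 digests of the key.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely new": the database lookup can be skipped entirely.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))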
Our proposal
Use the calculation of chunk boundaries to strengthen the collision resilience of the chunk signatures
Chunk Boundary Calculation
Why not fixed-size chunks:
Small changes to the file cannot be found, because an insertion or deletion shifts the contents of every later chunk
[Diagram: a fixed-size sliding window moves over the data following the previous chunk; while the window signature sig ≠ 0 no boundary is set, and a chunk boundary is declared where sig = 0.]
Chunk Boundary Calculation
Context-defined chunking:
A small, local change affects in all likelihood only one chunk
If it is located on a chunk boundary, it affects two chunks
The deduplication ratio is much higher
Chunk Boundary Calculation
Need to calculate a signature of a sliding window
Can use "rolling hashes"
When moving the window one letter to the right, the signature of the new window can be calculated (as sketched below) from:
1. The signature of the old sliding window
2. The character on the left (leaving the sliding window)
3. The character on the right (entering the sliding window)
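A minimal sketch of this update rule, using a simple polynomial rolling hash in the Rabin-Karp style rather than the algebraic signature the paper actually uses; the window width, base, modulus, and boundary mask are illustrative assumptions. A boundary is declared where the low bits of the window signature are zero, matching the sig = 0 condition in the earlier diagram.

WINDOW = 48                      # sliding-window width in bytes (assumed)
BASE = 257                       # polynomial base (assumed)
MOD = 1 << 61                    # modulus (assumed)
BOUNDARY_MASK = (1 << 13) - 1    # low 13 bits zero -> expected chunk size around 8 KB

def find_boundaries(data: bytes) -> list:
    """Slide a window over the data and declare a chunk boundary wherever
    the rolling hash of the window has its low bits equal to zero."""
    boundaries = []
    window_hash = 0
    leaving_weight = pow(BASE, WINDOW - 1, MOD)    # weight of the byte that leaves the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # (2) remove the character leaving the window on the left
            window_hash = (window_hash - data[i - WINDOW] * leaving_weight) % MOD
        # (1) + (3) shift the old signature and add the character entering on the right
        window_hash = (window_hash * BASE + byte) % MOD
        if i >= WINDOW and (window_hash & BOUNDARY_MASK) == 0:
            boundaries.append(i + 1)               # chunk ends after this byte
    return boundaries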
Chunk Boundary Calculation
Can use Rabin fingerprints (M. O. Rabin, 1981)
Or algebraic signatures (Litwin, Schwarz, 2004)
We use the latter
Because we invented them
Because they are marginally better than Rabin fingerprints for our purpose
Both allow cumulative calculation of a signature of the chunk seen so far
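The cumulative property is what the proposal exploits: as the chunker scans bytes it keeps extending a running signature of the chunk seen so far. The sketch below shows the idea for an algebraic-signature-style sum sig(b0 … bn-1) = b0 + b1·α + … + bn-1·α^(n-1) over GF(2^8); the field polynomial and the generator α are assumed choices for illustration, not the parameters from the paper.

GF_POLY = 0x11B    # x^8 + x^4 + x^3 + x + 1, an assumed GF(2^8) reduction polynomial
ALPHA = 0x03       # x + 1, a generator of the multiplicative group of this field

def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
    return result

class CumulativeSignature:
    """Running one-symbol algebraic-style signature sig = sum of b_i * alpha^i
    (addition in GF(2^8) is XOR), extended byte by byte as the chunker advances."""

    def __init__(self):
        self.sig = 0
        self.alpha_power = 1          # alpha^i for the next byte

    def append(self, byte: int) -> None:
        self.sig ^= gf_mul(byte, self.alpha_power)            # add b_i * alpha^i
        self.alpha_power = gf_mul(self.alpha_power, ALPHA)    # move to alpha^(i+1)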
Adding to the Chunk Signature
The chunk signature is MD5 or SHA-1
There are attacks using artificial collisions, but they are theoretical so far
There is a small but positive collision probability:
Two different chunks share the same signature value
Dedup then destroys / alters the later file
To keep dedup acceptable, the resulting data loss must be orders of magnitude less than losses from other sources
Adding to the Chunk Signature
[Figure slide; the figure is not reproduced in this transcript. Its conditions are described on the next slide.]
Adding to the Chunk Signature
Conditions:
Want x nines of assurance against having any collision in a storage system of size N
The number of nines is given by the other failure sources
The figure uses x = 6
Conclusion:
Adding one byte to the chunk signature increases the possible size of the data set by a factor of 10^1.20412 ≈ 16 (for big enough x)
Example: MD5 has 16 B
At six nines, the maximum number of chunks is 8.692 × 10^11
With two bytes more, it is 2.225 × 10^14
This changes the capacity from hundreds of petabytes to tens of exabytes
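The per-byte factor is easy to check. Assuming the usual birthday-bound model (the slide does not spell out its exact formula), the tolerable number of chunks grows with the square root of the signature space, so each extra signature byte multiplies it by sqrt(2^8) = 16 = 10^1.20412. The snippet below verifies the factor and applies it to the MD5 figure quoted above.

import math

# One extra byte adds 8 bits of signature; under a birthday-bound model the
# tolerable number of chunks scales with the square root of the signature space.
factor_per_byte = math.sqrt(2 ** 8)
print(factor_per_byte, math.log10(factor_per_byte))    # 16.0  1.20412...

md5_limit_six_nines = 8.692e11        # slide's figure for 16-byte MD5 at six nines
print(md5_limit_six_nines * factor_per_byte ** 2)      # ~2.225e14, the "two bytes more" figure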
Flatness
A signature is flat if every signature value is equally likely, i.e., the probability of a text having a given signature is the same for all values
Measuring flatness is difficult
No results are known for MD5 or SHA-1, though there is support for almost perfect flatness
Algebraic signatures:
Are perfectly flat for completely random input
Are very flat in the experiments undertaken (short words taken from a password list)
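One simple way to probe flatness empirically (a sketch under assumed parameters, not the experiment reported here) is to compute a one-byte algebraic-style signature of many inputs and compare the histogram of values against the uniform expectation, for example with a chi-squared statistic:

import os
from collections import Counter

GF_POLY, ALPHA = 0x11B, 0x03      # same assumed GF(2^8) parameters as the earlier sketch

def gf_mul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
    return result

def one_byte_signature(data: bytes) -> int:
    """One-byte cumulative signature sig = XOR of b_i * alpha^i."""
    sig, power = 0, 1
    for b in data:
        sig ^= gf_mul(b, power)
        power = gf_mul(power, ALPHA)
    return sig

def chi_squared_flatness(samples: int = 50_000, length: int = 16) -> float:
    """Compute the signature of random inputs and measure the deviation
    of the value histogram from uniform."""
    counts = Counter(one_byte_signature(os.urandom(length)) for _ in range(samples))
    expected = samples / 256
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))

# For a flat signature the statistic stays near 255, the mean of a
# chi-squared distribution with 256 - 1 degrees of freedom.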
Flatness of Algebraic Signatures
[Three figure slides showing the measured flatness of algebraic signatures; the figures are not reproduced in this transcript.]
Conclusions
Can reuse chunk boundary calculations to strengthen the collision resistance of chunk signatures
Causes no additional calculation costs