Fast File Clone in ZFS Design Proposal Overview
This document details a proposal for implementing fast file clone functionality in ZFS, allowing for nearly instant file copying through referencing. The motivation behind this proposal includes support for VMware VAAI, NAS Full File Clone, and Fast File Clone to save memory and disk space. The proposed method involves direct references for file cloning, avoiding duplication of data blocks and optimizing data storage. Diagrams and explanations illustrate how the reference node concept works to enable efficient file cloning in ZFS.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Fast File Clone in ZFS Design Proposal Pavel Zakharov 10/11/2024
Introduction Description: Copy files almost instantly by copying by reference. Motivations: VMware VAAI support: NAS Full File Clone and Fast File Clone. Save memory and disk space Existing alternatives: dataset clone deduplication. 2
Regular Copy Copy dnode 2 dnode 1 Copied blocks Data blocks L2' L2 L1' L1' L1 L1 A' B' C' A B C File copy is currently a very costly operation that has to duplicate every data block of the original file. 3
Direct References File Clone: Method 1
Clone: fast file copy Clone dnode 2 dnode 1 Data blocks L2' L2 update blkptr s L1 L1' L1 propagate changes A B C 1 2 3 C' 3 modify data block 3 of dnode 2 Clone is similar to a snapshot. Clone references same blocks as original file. Only modified blocks are written out. 5
Diagrams Clone dnode 2 dnode 1 data blocks both nodes point to the same data A A B private data shared data A A A A A B A A A A A A A A B modify data A cleaner way to represent shared and private data. 6
Reference node: reference original data reference dnode (hidden) dnode 2 (clone) Clone dnode 1 data A A A A A A In order to make the original file writable, we use an approach similar to a dataset clone. A dataset clone is performed on top of a read-only snapshot. Likewise, when a file is cloned, a hidden read-only dnode is created; it references all the original blocks. 7
Reference node: avoid refcount reference dnode dnode 2 (clone) dnode 1 garbage data file 1 file 2 original data A B A C A A A B A C A A A A A A B A A A C A A A A A modify data modify data The extra dnode is used to keep references to the original data even if it is not used anymore. As long as clones exist, original data blocks are not freed. This avoids having to implement any kind of complicated refcount algorithm. Original data is freed only when the refnode is destroyed. 8
System Attributes file 1 dnode 34 ____________ pflags: clone refnode: 55 file 2 (clone) dnode 56 ____________ pflags: clone refnode: 55 reference dnode dnode 55 ____________ birth: txg 200 pflags: clone_ref clones: 34, 56 For cloned files: a flag is set indicating it is a clone. a new attribute is created: refnode. It points to the reference dnode. For the reference dnode: a flag is set indicating it is a the reference dnode for other clones. new attribute: clones. It is an array of dnode numbers representing all the clones. new attribute: birth. Txg of when the reference was created. 9
System Attributes file 1 dnode 34 ____________ pflags: clone refnode: 55 file 2 (clone) dnode 56 ____________ pflags: clone refnode: 55 reference dnode dnode 55 ____________ birth: txg 200 pflags: clone_ref reference dnode dnode 55 ____________ birth: txg 200 pflags: clone_ref clones: 34, 56 clones: 57 ZAP dnode 57 ________ 34 56 New Attributes Refnode: object number of reference dnode. Birth: txg when the reference node was created. Clones: array of all dnodes that are clones of the refnode. Alternatively, clones could point to a ZAP object storing the clones list. pflags: new flags for pflags attribute: clone and clone_ref . 10
Freeing Blocks When overwriting or destroying data, only free blocks that are born after the reference node. reference dnode dnode 55 file 1 dnode 34 birth: txg 200 birth: txg = 177 A Keep when replaced file 1 A A B B C Refnode birth: txg = 200 A A B B C A txg <= 200 KEEP FREE Free when replaced txg > 200 birth: txg = 202 B A A B B C A A birth: txg = 205 C Old data. txg = 177 modify data. txg 202 old data. txg 202 modify data. txg 205 Original data. txg = 177 11
Freeing Blocks Blocks that were born after the reference node are treated the same way as regular blocks. Blocks born before the reference node are only freed when the reference node is destroyed. Any writes sent to a file right after it has been cloned cannot be assigned the same txg as the reference node. The reference node is destroyed when: Option 1: All clones are destroyed. Option 2: All but one clone is destroyed (harder). 12
Multiple Clones clone update clone list file 2 (clone) dnode 56 ____________ pflags: clone refnode: 55 file 1 dnode 34 ____________ pflags: clone refnode: 55 file 3 (clone) dnode 63 ____________ pflags: clone refnode: 55 reference dnode dnode 55 birth: txg 200 ____________ pflags: clone_ref clones: 34, 56 clones: 34, 56, 63 reference dnode dnode 55 birth: txg 200 ____________ pflags: clone_ref file 1 was not modified after the cloning operation, thus file 3 can link to the same reference node. B A B A A A A B A A A A modify file 2 13
Nested Clones clone If a clone is modified and then cloned again, a new reference node will be created. file 2 (clone) dnode 56 ____________ pflags: clone refnode: 63 file 3 (clone) dnode 64 ____________ pflags: clone refnode: 63 modify data reference dnode 2 dnode 63 birth: txg 240 ____________ pflags: clone_ref refnode: 55 clones: 56, 64 update clone list file 1 dnode 34 ____________ pflags: clone refnode: 55 file 2 (clone) dnode 56 ____________ pflags: clone clone_obj: 55 refnode: 55 file 2 (clone) dnode 56 ____________ pflags: clone reference dnode dnode 55 birth: txg 200 ____________ pflags: clone_ref reference dnode dnode 55 birth: txg 200 ____________ pflags: clone_ref clones: 34, 56 clones: 34, 63 Data 2 Data 1 14
Integration with other ZFS features Problem Traversing blocks in-order without going twice over the same block now becomes problematic. ZFS Send/Receive ZFS send/receive loses all information about block pointers and txgs. Traversal must be done in multiple passes: first send the reference nodes, then send the clones Several passes required for nested clones. One extra pass for each clone depth. dataset dnode 1 regular dnode 4 clone of 9 dnode 7 regular dnode 9 refnode dnode 12 regular dnode 13 clone of 9 First Pass: dnodes 1, 7, 9, 12 Second Pass: dnodes 4, 13 Solution: implicit references 15
Implicit References File Clone: Method 2
Implicit Block Pointers Shared data is only accessible/linked from the reference node. The clones have embedded block pointers indicating to look for data in the reference node Implicit Direct References References clone clone clone refnode B B B A A B B B B L1 A A A B B B B A 1 2 3 A A A 1 2 3 return block 2 of refnode read data block 2 overwrite data block 1 replace references to shared blocks by special hole 17
Nested Clones: performance issue Whe data is not found the refnodes have to be traversed one by one. This can cause performance degradation when multiple refnodes are nested. Clone refnode 2 C refnode 1 B C B C 1 2 3 A B 1 2 3 read data block 1 A A A A A 1 2 3 return block 1 18
Nested Clones: improvements One solution to improve performance is to have a reference to the dnode of the appropriate refnode in the embedded blkptr. Clone dnode 13 refnode 2 dnode 9 C refnode 1 dnode 7 7 B C 7 B C 1 2 3 A B 1 2 3 read data block 1 A A A A A 1 2 3 return block 1 19
Analysis Integration and comparison
Compare referencing methods Direct References Access clones at same speed as regular files. ZFS send/receive becomes non-trivial and potentially slower. Requires more changes in DMU layer. Implicit References Accessing clones is potentially slower for each nesting level. Higher arc footprint. Requires more changes in ZPL layer. 21
Integration with other ZFS features ZIL A new record type must be implemented. Snapshots File clone should not interfere with current snapshot logic. Special care has to be taken so that unreferenced clone-related data is destroyed when a snapshot is destroyed. Scrubbing Do not scrub cloned data multiple times. Easy with implicit references. Send / Receive Do not send cloned data multiple times. Easy with implicit references. ZFS features Clone feature should be downgraded from active to enabled when all clones are deleted. 22
Space Quotas Space quotas can be tricky. It is a similar situation as with Linux reflinks. If we treat clone as a copy on the user level: Full ZPL size of each clone (shared + private data) is accounted to owner s userquota. Full ZPL size of each clone is accounted to dataset s refquota and refreservation. Shared data of refnode plus private data of each clone is accounted to dataset s quota and reservation. 23
Accessing Reference Nodes $refnodes A ZAP object in the dataset links to all refnodes. ZPL layer can access this ZAP object as a special read-only folder. Inside this folder, each refnode is displayed as a directory. Each refnode directory contains one file to view the refnode s contents and another file that contains the relative paths to all its clones. refnode 7 contents clones refnode 13 contents 24
Integration with OS NFS NFS Fast File Clone support Fast File Clone libc fclone userland system call fclone A new system call is required kernel File clone support within the same dataset zfs vnode ops vop_fclone zfs znode ops zfs_clone_node File clone workhorse 25
Other thoughts Ditto blocks for highly cloned files. Ability to unlink clone: obtain a hard copy. When cloning a clone, avoid nesting existing refnode if changes between the clone and its refnode are minor. Alternative clone designs Use deduplication Link to dataset clones Work to do. 26
Compare clone alternatives Direct/Implicit References Linked dataset clones Deduplication Instant cloning. Instant cloning. Slow cloning: need to traverse data. Scalable. Scalable. Not scalable as of now. Affects pool import times. Space wasted by snapshot if shared data no longer referenced. Space quotas must be modified for dataset clones that represent files. Space wasted by refnode if shared data no longer referenced. Space quotas must be implemented. No space wasted. Space quotas already implemented. 27
Q & A 28