* [RFC PATCH v5 0/5] Online data deduplication
@ 2013-07-31 15:37 Liu Bo
  2013-07-31 15:37 ` [RFC PATCH v5 1/5] Btrfs: skip merge part for delayed data refs Liu Bo
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Liu Bo @ 2013-07-31 15:37 UTC (permalink / raw)
  To: linux-btrfs
Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]
This patch set is also related to "Content based storage" in project ideas[2].
PATCH 1 is a hang fix with deduplication on, but it's also useful without
dedup in practice use.
PATCH 2 and 3 are targetting delayed refs' scalability problems, which are
uncovered by the dedup feature.
PATCH 4 is a speed-up improvement, which is about dedup and quota.
PATCH 5 is full of real things, all details about implementation of dedup.
Plus, there is also a btrfs-progs patch which helps to enable/disable dedup
feature.
TODO:
* a bit-to-bit comparison callback.
All comments are welcome!
[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage
v4->v5:
- go back to one dedup key with a special backref for dedup tree because
  the disk format understands backref well.
- fix a fsync hang with dedup enabled.
- rebase onto the latest btrfs.
Liu Bo (5):
  Btrfs: skip merge part for delayed data refs
  Btrfs: improve the delayed refs process in rm case
  Btrfs: introduce a head ref rbtree
  Btrfs: disable qgroups accounting when quata_enable is 0
  Btrfs: online data deduplication
 fs/btrfs/backref.c         |    9 +
 fs/btrfs/ctree.h           |   59 ++++
 fs/btrfs/delayed-ref.c     |  141 +++++++----
 fs/btrfs/delayed-ref.h     |    8 +
 fs/btrfs/disk-io.c         |   30 ++
 fs/btrfs/extent-tree.c     |  196 ++++++++++++--
 fs/btrfs/extent_io.c       |   29 ++-
 fs/btrfs/extent_io.h       |   16 ++
 fs/btrfs/file-item.c       |  217 +++++++++++++++
 fs/btrfs/inode.c           |  637 ++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/ioctl.c           |   93 +++++++
 fs/btrfs/ordered-data.c    |   36 ++-
 fs/btrfs/ordered-data.h    |   11 +-
 fs/btrfs/qgroup.c          |    6 +
 fs/btrfs/relocation.c      |    3 +
 fs/btrfs/super.c           |   27 ++-
 fs/btrfs/transaction.c     |    4 +-
 include/uapi/linux/btrfs.h |    5 +
 18 files changed, 1356 insertions(+), 171 deletions(-)
-- 
1.7.7
^ permalink raw reply	[flat|nested] 30+ messages in thread* [RFC PATCH v5 1/5] Btrfs: skip merge part for delayed data refs 2013-07-31 15:37 [RFC PATCH v5 0/5] Online data deduplication Liu Bo @ 2013-07-31 15:37 ` Liu Bo 2013-07-31 15:37 ` [RFC PATCH v5 2/5] Btrfs: improve the delayed refs process in rm case Liu Bo ` (5 subsequent siblings) 6 siblings, 0 replies; 30+ messages in thread From: Liu Bo @ 2013-07-31 15:37 UTC (permalink / raw) To: linux-btrfs When we have data deduplication on, we'll hang on the merge part because it needs to verify every queued delayed data refs related to this disk offset but we may have millions refs. And in the case of delayed data refs, we don't usually have too much data refs to merge. So it's safe to shut it down for data refs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/delayed-ref.c | 7 +++++++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index f7be9f7..fc4ce8b 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -320,6 +320,13 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans, struct rb_node *node; u64 seq = 0; + /* + * We don't have too much refs to merge in the case of delayed data + * refs. + */ + if (head->is_data) + return; + spin_lock(&fs_info->tree_mod_seq_lock); if (!list_empty(&fs_info->tree_mod_seq_list)) { struct seq_list *elem; -- 1.7.7 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v5 2/5] Btrfs: improve the delayed refs process in rm case 2013-07-31 15:37 [RFC PATCH v5 0/5] Online data deduplication Liu Bo 2013-07-31 15:37 ` [RFC PATCH v5 1/5] Btrfs: skip merge part for delayed data refs Liu Bo @ 2013-07-31 15:37 ` Liu Bo 2013-07-31 16:45 ` Stefan Behrens 2013-07-31 15:37 ` [RFC PATCH v5 3/5] Btrfs: introduce a head ref rbtree Liu Bo ` (4 subsequent siblings) 6 siblings, 1 reply; 30+ messages in thread From: Liu Bo @ 2013-07-31 15:37 UTC (permalink / raw) To: linux-btrfs While removing a file with dedup extents, we could have a great number of delayed refs pending to process, and these refs refer to droping a ref of the extent, which is of BTRFS_DROP_DELAYED_REF type. But in order to prevent an extent's ref count from going down to zero when there still are pending delayed refs, we first select those "adding a ref" ones, which is of BTRFS_ADD_DELAYED_REF type. So in removing case, all of our delayed refs are of BTRFS_DROP_DELAYED_REF type, but we have to walk all the refs issued to the extent to find any BTRFS_ADD_DELAYED_REF types and end up there is no such thing, and then start over again to find BTRFS_DROP_DELAYED_REF. This is really unnecessary, we can improve this by tracking how many BTRFS_ADD_DELAYED_REF refs we have and search by the right type. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/delayed-ref.c | 10 ++++++++++ fs/btrfs/delayed-ref.h | 3 +++ fs/btrfs/extent-tree.c | 17 ++++++++++++++++- 3 files changed, 29 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index fc4ce8b..15ca775 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -543,6 +543,10 @@ update_existing_head_ref(struct btrfs_delayed_ref_node *existing, * update the reference mod on the head to reflect this new operation */ existing->ref_mod += update->ref_mod; + + WARN_ON(update->ref_mod > 0 && update->ref_mod != 1); + if (update->ref_mod == 1) + existing_ref->add_cnt++; } /* @@ -604,6 +608,12 @@ static noinline void add_delayed_ref_head(struct btrfs_fs_info *fs_info, head_ref->must_insert_reserved = must_insert_reserved; head_ref->is_data = is_data; + /* track added ref, more comments in select_delayed_ref() */ + if (count_mod == 1) + head_ref->add_cnt = 1; + else + head_ref->add_cnt = 0; + INIT_LIST_HEAD(&head_ref->cluster); mutex_init(&head_ref->mutex); diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 70b962c..9377b27 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -84,6 +84,9 @@ struct btrfs_delayed_ref_head { struct list_head cluster; struct btrfs_delayed_extent_op *extent_op; + + int add_cnt; + /* * when a new extent is allocated, it is just reserved in memory * The actual extent isn't inserted into the extent allocation tree diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 70002ea..f40f0d9 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2260,6 +2260,16 @@ select_delayed_ref(struct btrfs_delayed_ref_head *head) struct rb_node *node; struct btrfs_delayed_ref_node *ref; int action = BTRFS_ADD_DELAYED_REF; + + /* + * track the count of BTRFS_ADD_DELAYED_REF, + * in the case that there's no BTRFS_ADD_DELAYED_REF but there're a + * a great number of BTRFS_DROP_DELAYED_REF, + * it'll waste time on searching BTRFS_ADD_DELAYED_REF, ususally this + * happens with dedup enabled. + */ + if (head->add_cnt == 0) + action = BTRFS_DROP_DELAYED_REF; again: /* * select delayed ref of type BTRFS_ADD_DELAYED_REF first. @@ -2274,8 +2284,11 @@ again: rb_node); if (ref->bytenr != head->node.bytenr) break; - if (ref->action == action) + if (ref->action == action) { + if (action == BTRFS_ADD_DELAYED_REF) + head->add_cnt--; return ref; + } node = rb_prev(node); } if (action == BTRFS_ADD_DELAYED_REF) { @@ -2351,6 +2364,8 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans, * there are still refs with lower seq numbers in the * process of being added. Don't run this ref yet. */ + if (ref->action == BTRFS_ADD_DELAYED_REF) + locked_ref->add_cnt++; list_del_init(&locked_ref->cluster); btrfs_delayed_ref_unlock(locked_ref); locked_ref = NULL; -- 1.7.7 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 2/5] Btrfs: improve the delayed refs process in rm case 2013-07-31 15:37 ` [RFC PATCH v5 2/5] Btrfs: improve the delayed refs process in rm case Liu Bo @ 2013-07-31 16:45 ` Stefan Behrens 0 siblings, 0 replies; 30+ messages in thread From: Stefan Behrens @ 2013-07-31 16:45 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs On Wed, 31 Jul 2013 23:37:42 +0800, Liu Bo wrote: > + WARN_ON(update->ref_mod > 0 && update->ref_mod != 1); WARN_ON(update->ref_mod > 1) ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH v5 3/5] Btrfs: introduce a head ref rbtree 2013-07-31 15:37 [RFC PATCH v5 0/5] Online data deduplication Liu Bo 2013-07-31 15:37 ` [RFC PATCH v5 1/5] Btrfs: skip merge part for delayed data refs Liu Bo 2013-07-31 15:37 ` [RFC PATCH v5 2/5] Btrfs: improve the delayed refs process in rm case Liu Bo @ 2013-07-31 15:37 ` Liu Bo 2013-07-31 21:19 ` Zach Brown 2013-07-31 15:37 ` [RFC PATCH v5 4/5] Btrfs: disable qgroups accounting when quota is off Liu Bo ` (3 subsequent siblings) 6 siblings, 1 reply; 30+ messages in thread From: Liu Bo @ 2013-07-31 15:37 UTC (permalink / raw) To: linux-btrfs The way how we process delayed refs is 1)get a bunch of head refs, 2)pick up one head ref, 3)go one node back for any delayed ref updates. The head ref is also linked in the same rbtree as the delayed ref is, so in 1) stage, we have to walk one by one including not only head refs, but delayed refs. When we have a great number of delayed refs pending to process, this'll cost time a lot. Here we introduce a head ref specific rbtree, it only has head refs, so troubles go away. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/delayed-ref.c | 124 ++++++++++++++++++++++++++++-------------------- fs/btrfs/delayed-ref.h | 5 ++ fs/btrfs/disk-io.c | 3 + fs/btrfs/extent-tree.c | 21 +++++--- fs/btrfs/transaction.c | 4 +- 5 files changed, 98 insertions(+), 59 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 15ca775..8aad3fb 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -161,35 +161,61 @@ static struct btrfs_delayed_ref_node *tree_insert(struct rb_root *root, return NULL; } +/* insert a new ref to head ref rbtree */ +static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root, + struct rb_node *node) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent_node = NULL; + struct btrfs_delayed_ref_head *entry; + struct btrfs_delayed_ref_head *ins; + u64 bytenr; + + ins = rb_entry(node, struct btrfs_delayed_ref_head, href_node); + bytenr = ins->node.bytenr; + while (*p) { + parent_node = *p; + entry = rb_entry(parent_node, struct btrfs_delayed_ref_head, + href_node); + + if (bytenr < entry->node.bytenr) + p = &(*p)->rb_left; + else if (bytenr > entry->node.bytenr) + p = &(*p)->rb_right; + else + return entry; + } + + rb_link_node(node, parent_node, p); + rb_insert_color(node, root); + return NULL; +} + /* * find an head entry based on bytenr. This returns the delayed ref * head if it was able to find one, or NULL if nothing was in that spot. * If return_bigger is given, the next bigger entry is returned if no exact * match is found. */ -static struct btrfs_delayed_ref_node *find_ref_head(struct rb_root *root, - u64 bytenr, - struct btrfs_delayed_ref_node **last, - int return_bigger) +static struct btrfs_delayed_ref_head * +find_ref_head(struct rb_root *root, u64 bytenr, + struct btrfs_delayed_ref_head **last, int return_bigger) { struct rb_node *n; - struct btrfs_delayed_ref_node *entry; + struct btrfs_delayed_ref_head *entry; int cmp = 0; again: n = root->rb_node; entry = NULL; while (n) { - entry = rb_entry(n, struct btrfs_delayed_ref_node, rb_node); - WARN_ON(!entry->in_tree); + entry = rb_entry(n, struct btrfs_delayed_ref_head, href_node); if (last) *last = entry; - if (bytenr < entry->bytenr) + if (bytenr < entry->node.bytenr) cmp = -1; - else if (bytenr > entry->bytenr) - cmp = 1; - else if (!btrfs_delayed_ref_is_head(entry)) + else if (bytenr > entry->node.bytenr) cmp = 1; else cmp = 0; @@ -203,12 +229,12 @@ again: } if (entry && return_bigger) { if (cmp > 0) { - n = rb_next(&entry->rb_node); + n = rb_next(&entry->href_node); if (!n) n = rb_first(root); - entry = rb_entry(n, struct btrfs_delayed_ref_node, - rb_node); - bytenr = entry->bytenr; + entry = rb_entry(n, struct btrfs_delayed_ref_head, + href_node); + bytenr = entry->node.bytenr; return_bigger = 0; goto again; } @@ -246,6 +272,12 @@ static void inline drop_delayed_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_node *ref) { rb_erase(&ref->rb_node, &delayed_refs->root); + if (btrfs_delayed_ref_is_head(ref)) { + struct btrfs_delayed_ref_head *head; + + head = btrfs_delayed_node_to_head(ref); + rb_erase(&head->href_node, &delayed_refs->href_root); + } ref->in_tree = 0; btrfs_put_delayed_ref(ref); delayed_refs->num_entries--; @@ -386,42 +418,35 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans, int count = 0; struct btrfs_delayed_ref_root *delayed_refs; struct rb_node *node; - struct btrfs_delayed_ref_node *ref; - struct btrfs_delayed_ref_head *head; + struct btrfs_delayed_ref_head *head = NULL; delayed_refs = &trans->transaction->delayed_refs; - if (start == 0) { - node = rb_first(&delayed_refs->root); - } else { - ref = NULL; - find_ref_head(&delayed_refs->root, start + 1, &ref, 1); - if (ref) { - node = &ref->rb_node; - } else - node = rb_first(&delayed_refs->root); + node = rb_first(&delayed_refs->href_root); + + if (start) { + find_ref_head(&delayed_refs->href_root, start + 1, &head, 1); + if (head) + node = &head->href_node; } again: while (node && count < 32) { - ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node); - if (btrfs_delayed_ref_is_head(ref)) { - head = btrfs_delayed_node_to_head(ref); - if (list_empty(&head->cluster)) { - list_add_tail(&head->cluster, cluster); - delayed_refs->run_delayed_start = - head->node.bytenr; - count++; + head = rb_entry(node, struct btrfs_delayed_ref_head, href_node); + if (list_empty(&head->cluster)) { + list_add_tail(&head->cluster, cluster); + delayed_refs->run_delayed_start = + head->node.bytenr; + count++; - WARN_ON(delayed_refs->num_heads_ready == 0); - delayed_refs->num_heads_ready--; - } else if (count) { - /* the goal of the clustering is to find extents - * that are likely to end up in the same extent - * leaf on disk. So, we don't want them spread - * all over the tree. Stop now if we've hit - * a head that was already in use - */ - break; - } + WARN_ON(delayed_refs->num_heads_ready == 0); + delayed_refs->num_heads_ready--; + } else if (count) { + /* the goal of the clustering is to find extents + * that are likely to end up in the same extent + * leaf on disk. So, we don't want them spread + * all over the tree. Stop now if we've hit + * a head that was already in use + */ + break; } node = rb_next(node); } @@ -433,7 +458,7 @@ again: * clusters. start from the beginning and try again */ start = 0; - node = rb_first(&delayed_refs->root); + node = rb_first(&delayed_refs->href_root); goto again; } return 1; @@ -629,6 +654,7 @@ static noinline void add_delayed_ref_head(struct btrfs_fs_info *fs_info, */ kmem_cache_free(btrfs_delayed_ref_head_cachep, head_ref); } else { + htree_insert(&delayed_refs->href_root, &head_ref->href_node); delayed_refs->num_heads++; delayed_refs->num_heads_ready++; delayed_refs->num_entries++; @@ -886,14 +912,10 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info, struct btrfs_delayed_ref_head * btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr) { - struct btrfs_delayed_ref_node *ref; struct btrfs_delayed_ref_root *delayed_refs; delayed_refs = &trans->transaction->delayed_refs; - ref = find_ref_head(&delayed_refs->root, bytenr, NULL, 0); - if (ref) - return btrfs_delayed_node_to_head(ref); - return NULL; + return find_ref_head(&delayed_refs->href_root, bytenr, NULL, 0); } void btrfs_delayed_ref_exit(void) diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 9377b27..6a0295b 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -83,6 +83,8 @@ struct btrfs_delayed_ref_head { struct list_head cluster; + struct rb_node href_node; + struct btrfs_delayed_extent_op *extent_op; int add_cnt; @@ -121,6 +123,9 @@ struct btrfs_delayed_data_ref { struct btrfs_delayed_ref_root { struct rb_root root; + /* head ref rbtree */ + struct rb_root href_root; + /* this spin lock protects the rbtree and the entries inside */ spinlock_t lock; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 3c2886c..b6d597a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3793,6 +3793,9 @@ int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, ref->in_tree = 0; rb_erase(&ref->rb_node, &delayed_refs->root); + if (head) + rb_erase(&head->href_node, &delayed_refs->href_root); + delayed_refs->num_entries--; spin_unlock(&delayed_refs->lock); if (head) { diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f40f0d9..10a5c72 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2418,6 +2418,11 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans, ref->in_tree = 0; rb_erase(&ref->rb_node, &delayed_refs->root); + if (btrfs_delayed_ref_is_head(ref)) { + BUG_ON(!locked_ref); + rb_erase(&locked_ref->href_node, + &delayed_refs->href_root); + } delayed_refs->num_entries--; if (!btrfs_delayed_ref_is_head(ref)) { /* @@ -2619,7 +2624,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, { struct rb_node *node; struct btrfs_delayed_ref_root *delayed_refs; - struct btrfs_delayed_ref_node *ref; + struct btrfs_delayed_ref_head *head; struct list_head cluster; int ret; u64 delayed_start; @@ -2749,18 +2754,19 @@ again: spin_lock(&delayed_refs->lock); } - node = rb_first(&delayed_refs->root); + node = rb_first(&delayed_refs->href_root); if (!node) goto out; count = (unsigned long)-1; while (node) { - ref = rb_entry(node, struct btrfs_delayed_ref_node, - rb_node); - if (btrfs_delayed_ref_is_head(ref)) { - struct btrfs_delayed_ref_head *head; + head = rb_entry(node, struct btrfs_delayed_ref_head, + href_node); + BUG_ON(!btrfs_delayed_ref_is_head(&head->node)); + { + struct btrfs_delayed_ref_node *ref; - head = btrfs_delayed_node_to_head(ref); + ref = &head->node; atomic_inc(&ref->refs); spin_unlock(&delayed_refs->lock); @@ -5898,6 +5904,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, */ head->node.in_tree = 0; rb_erase(&head->node.rb_node, &delayed_refs->root); + rb_erase(&head->href_node, &delayed_refs->href_root); delayed_refs->num_entries--; diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index d58cce7..db0dd41 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -62,7 +62,8 @@ static void put_transaction(struct btrfs_transaction *transaction) WARN_ON(atomic_read(&transaction->use_count) == 0); if (atomic_dec_and_test(&transaction->use_count)) { BUG_ON(!list_empty(&transaction->list)); - WARN_ON(transaction->delayed_refs.root.rb_node); + WARN_ON(!RB_EMPTY_ROOT(&transaction->delayed_refs.root)); + WARN_ON(!RB_EMPTY_ROOT(&transaction->delayed_refs.href_root)); while (!list_empty(&transaction->pending_chunks)) { struct extent_map *em; @@ -184,6 +185,7 @@ loop: cur_trans->start_time = get_seconds(); cur_trans->delayed_refs.root = RB_ROOT; + cur_trans->delayed_refs.href_root = RB_ROOT; cur_trans->delayed_refs.num_entries = 0; cur_trans->delayed_refs.num_heads_ready = 0; cur_trans->delayed_refs.num_heads = 0; -- 1.7.7 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 3/5] Btrfs: introduce a head ref rbtree 2013-07-31 15:37 ` [RFC PATCH v5 3/5] Btrfs: introduce a head ref rbtree Liu Bo @ 2013-07-31 21:19 ` Zach Brown 0 siblings, 0 replies; 30+ messages in thread From: Zach Brown @ 2013-07-31 21:19 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs > + BUG_ON(!locked_ref); > + BUG_ON(!btrfs_delayed_ref_is_head(&head->node)); Please don't add more BUG_ON()s. They're unreliable because they can kill the system. If you *must* have logic asserts, use WARN_ON_ONCE and return errors. If the path is so dire that the fs is inconsistent, go read only. - z ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH v5 4/5] Btrfs: disable qgroups accounting when quota is off 2013-07-31 15:37 [RFC PATCH v5 0/5] Online data deduplication Liu Bo ` (2 preceding siblings ...) 2013-07-31 15:37 ` [RFC PATCH v5 3/5] Btrfs: introduce a head ref rbtree Liu Bo @ 2013-07-31 15:37 ` Liu Bo 2013-08-05 12:34 ` Jan Schmidt 2013-07-31 15:37 ` [RFC PATCH v5 5/5] Btrfs: online data deduplication Liu Bo ` (2 subsequent siblings) 6 siblings, 1 reply; 30+ messages in thread From: Liu Bo @ 2013-07-31 15:37 UTC (permalink / raw) To: linux-btrfs So we don't need to do qgroups accounting trick without enabling quota. This reduces my tester's costing time from ~28s to ~23s. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- fs/btrfs/extent-tree.c | 6 ++++++ fs/btrfs/qgroup.c | 6 ++++++ 2 files changed, 12 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 10a5c72..c6612f5 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2524,6 +2524,12 @@ int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans, struct qgroup_update *qgroup_update; int ret = 0; + if (!trans->root->fs_info->quota_enabled) { + if (trans->delayed_ref_elem.seq) + btrfs_put_tree_mod_seq(fs_info, &trans->delayed_ref_elem); + return 0; + } + if (list_empty(&trans->qgroup_ref_list) != !trans->delayed_ref_elem.seq) { /* list without seq or seq without list */ diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 1280eff..f3e82aa 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -1200,6 +1200,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans, { struct qgroup_update *u; + if (!trans->root->fs_info->quota_enabled) + return 0; + BUG_ON(!trans->delayed_ref_elem.seq); u = kmalloc(sizeof(*u), GFP_NOFS); if (!u) @@ -1850,6 +1853,9 @@ out: void assert_qgroups_uptodate(struct btrfs_trans_handle *trans) { + if (!trans->root->fs_info->quota_enabled) + return; + if (list_empty(&trans->qgroup_ref_list) && !trans->delayed_ref_elem.seq) return; pr_err("btrfs: qgroups not uptodate in trans handle %p: list is%s empty, seq is %#x.%x\n", -- 1.7.7 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 4/5] Btrfs: disable qgroups accounting when quota is off 2013-07-31 15:37 ` [RFC PATCH v5 4/5] Btrfs: disable qgroups accounting when quota is off Liu Bo @ 2013-08-05 12:34 ` Jan Schmidt 2013-08-05 14:18 ` Liu Bo 0 siblings, 1 reply; 30+ messages in thread From: Jan Schmidt @ 2013-08-05 12:34 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs Nice try hiding this one in a dedup patch set, but I finally found it :-) On Wed, July 31, 2013 at 17:37 (+0200), Liu Bo wrote: > So we don't need to do qgroups accounting trick without enabling quota. > This reduces my tester's costing time from ~28s to ~23s. > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > --- > fs/btrfs/extent-tree.c | 6 ++++++ > fs/btrfs/qgroup.c | 6 ++++++ > 2 files changed, 12 insertions(+), 0 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 10a5c72..c6612f5 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2524,6 +2524,12 @@ int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans, > struct qgroup_update *qgroup_update; > int ret = 0; > > + if (!trans->root->fs_info->quota_enabled) { > + if (trans->delayed_ref_elem.seq) > + btrfs_put_tree_mod_seq(fs_info, &trans->delayed_ref_elem); > + return 0; > + } > + > if (list_empty(&trans->qgroup_ref_list) != > !trans->delayed_ref_elem.seq) { > /* list without seq or seq without list */ > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c > index 1280eff..f3e82aa 100644 > --- a/fs/btrfs/qgroup.c > +++ b/fs/btrfs/qgroup.c > @@ -1200,6 +1200,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans, > { > struct qgroup_update *u; > > + if (!trans->root->fs_info->quota_enabled) > + return 0; > + > BUG_ON(!trans->delayed_ref_elem.seq); > u = kmalloc(sizeof(*u), GFP_NOFS); > if (!u) > @@ -1850,6 +1853,9 @@ out: > > void assert_qgroups_uptodate(struct btrfs_trans_handle *trans) > { > + if (!trans->root->fs_info->quota_enabled) > + return; > + > if (list_empty(&trans->qgroup_ref_list) && !trans->delayed_ref_elem.seq) > return; > pr_err("btrfs: qgroups not uptodate in trans handle %p: list is%s empty, seq is %#x.%x\n", > The second hunk looks sensible at first sight. However, hunk 1 and 3 don't. They assert consistency of qgroup state in well defined places. The fact that you need to disable those checks shows that skipping addition to the list in the second hunk cannot be right, or at least not sufficient. We've got the list of qgroup operations trans->qgroup_ref_list and we've got the qgroup's delayed ref blocker, trans->delayed_ref_elem. If you stop adding to the list (hunk 2) which seems reasonable when quota is disabled, then you also must ensure you're not acquiring the delayed ref blocker element, which should give another performance boost. need_ref_seq may be the right place for this change. It just feels a bit too obvious. The critical cases obviously are quota enable and quota disable. I just don't recall why it wasn't that way from the very beginning of qgroups, I might be missing something fundamental here. Thanks, -Jan ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 4/5] Btrfs: disable qgroups accounting when quota is off 2013-08-05 12:34 ` Jan Schmidt @ 2013-08-05 14:18 ` Liu Bo 2013-08-05 15:10 ` Jan Schmidt 0 siblings, 1 reply; 30+ messages in thread From: Liu Bo @ 2013-08-05 14:18 UTC (permalink / raw) To: Jan Schmidt; +Cc: linux-btrfs On Mon, Aug 05, 2013 at 02:34:30PM +0200, Jan Schmidt wrote: > Nice try hiding this one in a dedup patch set, but I finally found it :-) Ahhhh, I didn't mean to ;-) > > On Wed, July 31, 2013 at 17:37 (+0200), Liu Bo wrote: > > So we don't need to do qgroups accounting trick without enabling quota. > > This reduces my tester's costing time from ~28s to ~23s. > > > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > > --- > > fs/btrfs/extent-tree.c | 6 ++++++ > > fs/btrfs/qgroup.c | 6 ++++++ > > 2 files changed, 12 insertions(+), 0 deletions(-) > > > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > > index 10a5c72..c6612f5 100644 > > --- a/fs/btrfs/extent-tree.c > > +++ b/fs/btrfs/extent-tree.c > > @@ -2524,6 +2524,12 @@ int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans, > > struct qgroup_update *qgroup_update; > > int ret = 0; > > > > + if (!trans->root->fs_info->quota_enabled) { > > + if (trans->delayed_ref_elem.seq) > > + btrfs_put_tree_mod_seq(fs_info, &trans->delayed_ref_elem); > > + return 0; > > + } > > + > > if (list_empty(&trans->qgroup_ref_list) != > > !trans->delayed_ref_elem.seq) { > > /* list without seq or seq without list */ > > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c > > index 1280eff..f3e82aa 100644 > > --- a/fs/btrfs/qgroup.c > > +++ b/fs/btrfs/qgroup.c > > @@ -1200,6 +1200,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans, > > { > > struct qgroup_update *u; > > > > + if (!trans->root->fs_info->quota_enabled) > > + return 0; > > + > > BUG_ON(!trans->delayed_ref_elem.seq); > > u = kmalloc(sizeof(*u), GFP_NOFS); > > if (!u) > > @@ -1850,6 +1853,9 @@ out: > > > > void assert_qgroups_uptodate(struct btrfs_trans_handle *trans) > > { > > + if (!trans->root->fs_info->quota_enabled) > > + return; > > + > > if (list_empty(&trans->qgroup_ref_list) && !trans->delayed_ref_elem.seq) > > return; > > pr_err("btrfs: qgroups not uptodate in trans handle %p: list is%s empty, seq is %#x.%x\n", > > > > The second hunk looks sensible at first sight. However, hunk 1 and 3 don't. They > assert consistency of qgroup state in well defined places. The fact that you > need to disable those checks shows that skipping addition to the list in the > second hunk cannot be right, or at least not sufficient. I agree, only hunk 2 is necessary. > > We've got the list of qgroup operations trans->qgroup_ref_list and we've got the > qgroup's delayed ref blocker, trans->delayed_ref_elem. If you stop adding to the > list (hunk 2) which seems reasonable when quota is disabled, then you also must > ensure you're not acquiring the delayed ref blocker element, which should give > another performance boost. WHY a 'must' here? > > need_ref_seq may be the right place for this change. It just feels a bit too > obvious. The critical cases obviously are quota enable and quota disable. I just > don't recall why it wasn't that way from the very beginning of qgroups, I might > be missing something fundamental here. Yeah I thought about 'need_ref_seq', but the point is that delayed ref blocker not only serves qgroups accounting, but also features based on backref walking, such as scrub, snapshot-aware defragment. And I want dedup to work with all these features well, including qgroups too. Any ideas? -liubo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 4/5] Btrfs: disable qgroups accounting when quota is off 2013-08-05 14:18 ` Liu Bo @ 2013-08-05 15:10 ` Jan Schmidt 2013-08-06 2:25 ` Liu Bo 0 siblings, 1 reply; 30+ messages in thread From: Jan Schmidt @ 2013-08-05 15:10 UTC (permalink / raw) To: bo.li.liu; +Cc: linux-btrfs On Mon, August 05, 2013 at 16:18 (+0200), Liu Bo wrote: > On Mon, Aug 05, 2013 at 02:34:30PM +0200, Jan Schmidt wrote: >> Nice try hiding this one in a dedup patch set, but I finally found it :-) > > Ahhhh, I didn't mean to ;-) > >> >> On Wed, July 31, 2013 at 17:37 (+0200), Liu Bo wrote: >>> So we don't need to do qgroups accounting trick without enabling quota. >>> This reduces my tester's costing time from ~28s to ~23s. >>> >>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> >>> --- >>> fs/btrfs/extent-tree.c | 6 ++++++ >>> fs/btrfs/qgroup.c | 6 ++++++ >>> 2 files changed, 12 insertions(+), 0 deletions(-) >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 10a5c72..c6612f5 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -2524,6 +2524,12 @@ int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans, >>> struct qgroup_update *qgroup_update; >>> int ret = 0; >>> >>> + if (!trans->root->fs_info->quota_enabled) { >>> + if (trans->delayed_ref_elem.seq) >>> + btrfs_put_tree_mod_seq(fs_info, &trans->delayed_ref_elem); >>> + return 0; >>> + } >>> + >>> if (list_empty(&trans->qgroup_ref_list) != >>> !trans->delayed_ref_elem.seq) { >>> /* list without seq or seq without list */ >>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c >>> index 1280eff..f3e82aa 100644 >>> --- a/fs/btrfs/qgroup.c >>> +++ b/fs/btrfs/qgroup.c >>> @@ -1200,6 +1200,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans, >>> { >>> struct qgroup_update *u; >>> >>> + if (!trans->root->fs_info->quota_enabled) >>> + return 0; >>> + >>> BUG_ON(!trans->delayed_ref_elem.seq); >>> u = kmalloc(sizeof(*u), GFP_NOFS); >>> if (!u) >>> @@ -1850,6 +1853,9 @@ out: >>> >>> void assert_qgroups_uptodate(struct btrfs_trans_handle *trans) >>> { >>> + if (!trans->root->fs_info->quota_enabled) >>> + return; >>> + >>> if (list_empty(&trans->qgroup_ref_list) && !trans->delayed_ref_elem.seq) >>> return; >>> pr_err("btrfs: qgroups not uptodate in trans handle %p: list is%s empty, seq is %#x.%x\n", >>> >> >> The second hunk looks sensible at first sight. However, hunk 1 and 3 don't. They >> assert consistency of qgroup state in well defined places. The fact that you >> need to disable those checks shows that skipping addition to the list in the >> second hunk cannot be right, or at least not sufficient. > > I agree, only hunk 2 is necessary. > >> >> We've got the list of qgroup operations trans->qgroup_ref_list and we've got the >> qgroup's delayed ref blocker, trans->delayed_ref_elem. If you stop adding to the >> list (hunk 2) which seems reasonable when quota is disabled, then you also must >> ensure you're not acquiring the delayed ref blocker element, which should give >> another performance boost. > > WHY a 'must' here? Because otherwise you are going to hit the BUG_ONs you avoided with hunk 1 and 3. >> >> need_ref_seq may be the right place for this change. It just feels a bit too >> obvious. The critical cases obviously are quota enable and quota disable. I just >> don't recall why it wasn't that way from the very beginning of qgroups, I might >> be missing something fundamental here. > > Yeah I thought about 'need_ref_seq', but the point is that delayed ref blocker > not only serves qgroups accounting, but also features based on backref > walking, such as scrub, snapshot-aware defragment. I think you're confusing trans->delayed_ref_elem with other callers of btrfs_get_tree_mod_seq() and btrfs_put_tree_mod_seq(). trans->delayed_ref_elem is only used in qgroup context, as far as my grep reaches. There are other callers of btrfs_get_tree_mod_seq() that can put their blocker element on the stack, such as iterate_extent_inodes(). But I still might be missing something. Thanks, -Jan ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 4/5] Btrfs: disable qgroups accounting when quota is off 2013-08-05 15:10 ` Jan Schmidt @ 2013-08-06 2:25 ` Liu Bo 0 siblings, 0 replies; 30+ messages in thread From: Liu Bo @ 2013-08-06 2:25 UTC (permalink / raw) To: Jan Schmidt; +Cc: linux-btrfs On Mon, Aug 05, 2013 at 05:10:58PM +0200, Jan Schmidt wrote: > On Mon, August 05, 2013 at 16:18 (+0200), Liu Bo wrote: > > On Mon, Aug 05, 2013 at 02:34:30PM +0200, Jan Schmidt wrote: > >> Nice try hiding this one in a dedup patch set, but I finally found it :-) > > > > Ahhhh, I didn't mean to ;-) > > > >> > >> On Wed, July 31, 2013 at 17:37 (+0200), Liu Bo wrote: > >>> So we don't need to do qgroups accounting trick without enabling quota. > >>> This reduces my tester's costing time from ~28s to ~23s. > >>> > >>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > >>> --- > >>> fs/btrfs/extent-tree.c | 6 ++++++ > >>> fs/btrfs/qgroup.c | 6 ++++++ > >>> 2 files changed, 12 insertions(+), 0 deletions(-) > >>> > >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > >>> index 10a5c72..c6612f5 100644 > >>> --- a/fs/btrfs/extent-tree.c > >>> +++ b/fs/btrfs/extent-tree.c > >>> @@ -2524,6 +2524,12 @@ int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans, > >>> struct qgroup_update *qgroup_update; > >>> int ret = 0; > >>> > >>> + if (!trans->root->fs_info->quota_enabled) { > >>> + if (trans->delayed_ref_elem.seq) > >>> + btrfs_put_tree_mod_seq(fs_info, &trans->delayed_ref_elem); > >>> + return 0; > >>> + } > >>> + > >>> if (list_empty(&trans->qgroup_ref_list) != > >>> !trans->delayed_ref_elem.seq) { > >>> /* list without seq or seq without list */ > >>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c > >>> index 1280eff..f3e82aa 100644 > >>> --- a/fs/btrfs/qgroup.c > >>> +++ b/fs/btrfs/qgroup.c > >>> @@ -1200,6 +1200,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans, > >>> { > >>> struct qgroup_update *u; > >>> > >>> + if (!trans->root->fs_info->quota_enabled) > >>> + return 0; > >>> + > >>> BUG_ON(!trans->delayed_ref_elem.seq); > >>> u = kmalloc(sizeof(*u), GFP_NOFS); > >>> if (!u) > >>> @@ -1850,6 +1853,9 @@ out: > >>> > >>> void assert_qgroups_uptodate(struct btrfs_trans_handle *trans) > >>> { > >>> + if (!trans->root->fs_info->quota_enabled) > >>> + return; > >>> + > >>> if (list_empty(&trans->qgroup_ref_list) && !trans->delayed_ref_elem.seq) > >>> return; > >>> pr_err("btrfs: qgroups not uptodate in trans handle %p: list is%s empty, seq is %#x.%x\n", > >>> > >> > >> The second hunk looks sensible at first sight. However, hunk 1 and 3 don't. They > >> assert consistency of qgroup state in well defined places. The fact that you > >> need to disable those checks shows that skipping addition to the list in the > >> second hunk cannot be right, or at least not sufficient. > > > > I agree, only hunk 2 is necessary. > > > >> > >> We've got the list of qgroup operations trans->qgroup_ref_list and we've got the > >> qgroup's delayed ref blocker, trans->delayed_ref_elem. If you stop adding to the > >> list (hunk 2) which seems reasonable when quota is disabled, then you also must > >> ensure you're not acquiring the delayed ref blocker element, which should give > >> another performance boost. > > > > WHY a 'must' here? > > Because otherwise you are going to hit the BUG_ONs you avoided with hunk 1 and 3. > > >> > >> need_ref_seq may be the right place for this change. It just feels a bit too > >> obvious. The critical cases obviously are quota enable and quota disable. I just > >> don't recall why it wasn't that way from the very beginning of qgroups, I might > >> be missing something fundamental here. > > > > Yeah I thought about 'need_ref_seq', but the point is that delayed ref blocker > > not only serves qgroups accounting, but also features based on backref > > walking, such as scrub, snapshot-aware defragment. > > I think you're confusing trans->delayed_ref_elem with other callers of > btrfs_get_tree_mod_seq() and btrfs_put_tree_mod_seq(). trans->delayed_ref_elem > is only used in qgroup context, as far as my grep reaches. There are other > callers of btrfs_get_tree_mod_seq() that can put their blocker element on the > stack, such as iterate_extent_inodes(). > > But I still might be missing something. It's right that ->delayed_ref_elem is only used in qgroup context, but as the seq now has two parts, major + minor, need_ref_seq() should be true even not in qgroup context 'cause we also need to keep tracking the order of delayed refs' order for tree mod log by increasing the seq's minor part. I don't think we can disable need_ref_seq() when quota is off. -liubo ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH v5 5/5] Btrfs: online data deduplication 2013-07-31 15:37 [RFC PATCH v5 0/5] Online data deduplication Liu Bo ` (3 preceding siblings ...) 2013-07-31 15:37 ` [RFC PATCH v5 4/5] Btrfs: disable qgroups accounting when quota is off Liu Bo @ 2013-07-31 15:37 ` Liu Bo 2013-07-31 22:50 ` Zach Brown 2013-07-31 15:37 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo 2013-07-31 21:20 ` [RFC PATCH v5 0/5] Online data deduplication Josef Bacik 6 siblings, 1 reply; 30+ messages in thread From: Liu Bo @ 2013-07-31 15:37 UTC (permalink / raw) To: linux-btrfs This introduces the online data deduplication feature for btrfs. (1) WHY do we need deduplication? To improve our storage effiency. (2) WHAT is deduplication? Two key ways for practical deduplication implementations, * When the data is deduplicated (inband vs background) * The granularity of the deduplication. (block level vs file level) For btrfs, we choose * inband(synchronous) * block level We choose them because of the same reason as how zfs does. a) To get an immediate benefit. b) To remove redundant parts within a file. So we have an inband, block level data deduplication here. (3) HOW does deduplication works? This makes full use of file extent back reference, the same way as IOCTL_CLONE, which lets us easily store multiple copies of a set of data as a single copy along with an index of references to the copy. Here we have a) a new dedicated tree(DEDUP tree) and b) a new key(BTRFS_DEDUP_ITEM_KEY), which consists of (stop 64bits of hash, type, disk offset), * stop 64bits of hash We take the stop 64bits of the sha256 as the index. Also we store the whole 256bits of sha256 in its item, which is very helpful on avoiding collision. * disk offset It helps to find where the data is stored. So the whole deduplication process works as, 1) write something, 2) calculate the hash of this "something" based on the deduplication unit, 3) try to find the match of hash value by searching DEDUP keys in a dedicated tree, DEDUP tree. 4) if found, skip real IO and link to the existing copy if not, do real IO and insert a DEDUP key to the DEDUP tree. Right now, we can a) enable deduplication with mount option "-o dedup", b) control the deduplication unit with mount options '-o dedup_bs=xxx'. The dedault deduplication unit is 8192, and the largest allowed dedup blocksize is 128k because of compression range limit. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- v5: - go back to one dedup key with a special backref for dedup tree because the disk format understands backref well. - fix a fsync hang with dedup enabled. - rebase onto the latest btrfs. v4: - add INCOMPAT flag so that old kernel won't mount with a dedup btrfs. - elaborate error handling. - address a compress bug. - address an issue of dedup flag on extent state tree. - add new dedup ioctl interface. v3: - add compress support - add a real ioctl to enable dedup feature - change the maximum allowed dedup blocksize to 128k because of compressed range limit v2: - Make changelog more clearly. - To avoid enlarging the file extent item's size, add another index key used for freeing dedup extent. - Freeing dedup extent is now like how we delete checksum. - Add support for alternative deduplicatin blocksize larger than PAGESIZE. - Add a mount option to set deduplication blocksize. - Add support for those writes that are smaller than deduplication blocksize. - Cleanups fs/btrfs/backref.c | 9 + fs/btrfs/ctree.h | 59 ++++ fs/btrfs/disk-io.c | 27 ++ fs/btrfs/extent-tree.c | 152 +++++++++-- fs/btrfs/extent_io.c | 29 ++- fs/btrfs/extent_io.h | 16 ++ fs/btrfs/file-item.c | 217 +++++++++++++++ fs/btrfs/inode.c | 637 ++++++++++++++++++++++++++++++++++++++------ fs/btrfs/ioctl.c | 93 +++++++ fs/btrfs/ordered-data.c | 36 ++- fs/btrfs/ordered-data.h | 11 +- fs/btrfs/relocation.c | 3 + fs/btrfs/super.c | 27 ++- include/uapi/linux/btrfs.h | 5 + 14 files changed, 1210 insertions(+), 111 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index eaf1333..92f3737 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -558,6 +558,9 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq, key.objectid = ref->objectid; key.type = BTRFS_EXTENT_DATA_KEY; key.offset = ref->offset; + if (ref->root == BTRFS_DEDUP_TREE_OBJECTID) + break; + ret = __add_prelim_ref(prefs, ref->root, &key, 0, 0, node->bytenr, node->ref_mod * sgn); @@ -676,6 +679,9 @@ static int __add_inline_refs(struct btrfs_fs_info *fs_info, key.type = BTRFS_EXTENT_DATA_KEY; key.offset = btrfs_extent_data_ref_offset(leaf, dref); root = btrfs_extent_data_ref_root(leaf, dref); + if (root == BTRFS_DEDUP_TREE_OBJECTID) + break; + ret = __add_prelim_ref(prefs, root, &key, 0, 0, bytenr, count); break; @@ -759,6 +765,9 @@ static int __add_keyed_refs(struct btrfs_fs_info *fs_info, key.type = BTRFS_EXTENT_DATA_KEY; key.offset = btrfs_extent_data_ref_offset(leaf, dref); root = btrfs_extent_data_ref_root(leaf, dref); + if (root == BTRFS_DEDUP_TREE_OBJECTID) + break; + ret = __add_prelim_ref(prefs, root, &key, 0, 0, bytenr, count); break; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index e795bf1..92b463f 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -32,6 +32,7 @@ #include <asm/kmap_types.h> #include <linux/pagemap.h> #include <linux/btrfs.h> +#include <crypto/hash.h> #include "extent_io.h" #include "extent_map.h" #include "async-thread.h" @@ -94,6 +95,9 @@ struct btrfs_ordered_sum; /* for storing balance parameters in the root tree */ #define BTRFS_BALANCE_OBJECTID -4ULL +/* dedup tree(experimental) */ +#define BTRFS_DEDUP_TREE_OBJECTID 9ULL + /* orhpan objectid for tracking unlinked/truncated files */ #define BTRFS_ORPHAN_OBJECTID -5ULL @@ -510,6 +514,7 @@ struct btrfs_super_block { #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) #define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7) #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8) +#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 9) #define BTRFS_FEATURE_COMPAT_SUPP 0ULL #define BTRFS_FEATURE_COMPAT_RO_SUPP 0ULL @@ -521,6 +526,7 @@ struct btrfs_super_block { BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO | \ BTRFS_FEATURE_INCOMPAT_RAID56 | \ BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \ + BTRFS_FEATURE_INCOMPAT_DEDUP | \ BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA) /* @@ -892,6 +898,24 @@ struct btrfs_csum_item { u8 csum; } __attribute__ ((__packed__)); +/* dedup */ +struct btrfs_dedup_item { + __le64 len; /* disk length of dedup range */ + + u8 compression; + u8 encryption; + __le16 other_encoding; /* spare for later use */ +} __attribute__ ((__packed__)); + +#define BTRFS_DEDUP_HASH_SIZE 32 /* 256bit = 32 * 8bit */ +#define BTRFS_DEDUP_HASH_LEN 4 + +struct btrfs_dedup_hash_item { + /* FIXME: put a hash type field here */ + + __le64 hash[BTRFS_DEDUP_HASH_LEN]; +} __attribute__ ((__packed__)); + struct btrfs_dev_stats_item { /* * grow this item struct at the end for future enhancements and keep @@ -1301,6 +1325,7 @@ struct btrfs_fs_info { struct btrfs_root *dev_root; struct btrfs_root *fs_root; struct btrfs_root *csum_root; + struct btrfs_root *dedup_root; struct btrfs_root *quota_root; /* the log root tree is a directory of all the other log roots */ @@ -1641,6 +1666,16 @@ struct btrfs_fs_info { struct btrfs_dev_replace dev_replace; atomic_t mutually_exclusive_operation_running; + + /* reference to deduplication algorithm drvier via cryptoapi */ + struct crypto_shash *dedup_driver; + + /* dedup blocksize */ + u64 dedup_bs; + + /* protect user change for dedup operations */ + struct mutex dedup_ioctl_mutex; + }; /* @@ -1933,6 +1968,8 @@ struct btrfs_ioctl_defrag_range_args { */ #define BTRFS_DEV_REPLACE_KEY 250 +#define BTRFS_DEDUP_ITEM_KEY 251 + /* * string items are for debugging. They just store a short string of * data in the FS @@ -1967,6 +2004,7 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20) #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21) #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22) +#define BTRFS_MOUNT_DEDUP (1 << 24) #define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt) @@ -2902,6 +2940,13 @@ static inline u32 btrfs_file_extent_inline_item_len(struct extent_buffer *eb, return btrfs_item_size(eb, e) - offset; } +/* btrfs_dedup_item */ +BTRFS_SETGET_FUNCS(dedup_len, struct btrfs_dedup_item, len, 64); +BTRFS_SETGET_FUNCS(dedup_compression, struct btrfs_dedup_item, compression, 8); +BTRFS_SETGET_FUNCS(dedup_encryption, struct btrfs_dedup_item, encryption, 8); +BTRFS_SETGET_FUNCS(dedup_other_encoding, struct btrfs_dedup_item, + other_encoding, 16); + /* btrfs_dev_stats_item */ static inline u64 btrfs_dev_stats_value(struct extent_buffer *eb, struct btrfs_dev_stats_item *ptr, @@ -3372,6 +3417,8 @@ static inline int btrfs_need_cleaner_sleep(struct btrfs_root *root) static inline void free_fs_info(struct btrfs_fs_info *fs_info) { + if (fs_info->dedup_driver) + crypto_free_shash(fs_info->dedup_driver); kfree(fs_info->balance_ctl); kfree(fs_info->delayed_root); kfree(fs_info->extent_root); @@ -3535,6 +3582,18 @@ int btrfs_csum_truncate(struct btrfs_trans_handle *trans, u64 isize); int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, struct list_head *list, int search_commit); + +int noinline_for_stack +btrfs_find_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, u64 *hash, u64 *bytenr_ret, + u64 *num_bytes_ret, int *compr_ret); +int noinline_for_stack +btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, u64 *hash, u64 bytenr, + u64 num_bytes, int compr); +int noinline_for_stack +btrfs_free_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, u64 hash, u64 bytenr); /* inode.c */ struct btrfs_delalloc_work { struct inode *inode; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index b6d597a..72d470c 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1580,6 +1580,9 @@ struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info, if (location->objectid == BTRFS_QUOTA_TREE_OBJECTID) return fs_info->quota_root ? fs_info->quota_root : ERR_PTR(-ENOENT); + if (location->objectid == BTRFS_DEDUP_TREE_OBJECTID) + return fs_info->dedup_root ? fs_info->dedup_root : + ERR_PTR(-ENOENT); again: root = btrfs_lookup_fs_root(fs_info, location->objectid); if (root) @@ -2037,6 +2040,12 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root) info->quota_root->node = NULL; info->quota_root->commit_root = NULL; } + if (info->dedup_root) { + free_extent_buffer(info->dedup_root->node); + free_extent_buffer(info->dedup_root->commit_root); + info->dedup_root->node = NULL; + info->dedup_root->commit_root = NULL; + } if (chunk_root) { free_extent_buffer(info->chunk_root->node); free_extent_buffer(info->chunk_root->commit_root); @@ -2097,6 +2106,7 @@ int open_ctree(struct super_block *sb, struct btrfs_root *chunk_root; struct btrfs_root *dev_root; struct btrfs_root *quota_root; + struct btrfs_root *dedup_root; struct btrfs_root *log_tree_root; int ret; int err = -EINVAL; @@ -2184,6 +2194,7 @@ int open_ctree(struct super_block *sb, atomic64_set(&fs_info->tree_mod_seq, 0); fs_info->sb = sb; fs_info->max_inline = 8192 * 1024; + fs_info->dedup_bs = 8192; fs_info->metadata_ratio = 0; fs_info->defrag_inodes = RB_ROOT; fs_info->free_chunk_space = 0; @@ -2282,6 +2293,7 @@ int open_ctree(struct super_block *sb, mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount); mutex_init(&fs_info->dev_replace.lock_management_lock); mutex_init(&fs_info->dev_replace.lock); + mutex_init(&fs_info->dedup_ioctl_mutex); spin_lock_init(&fs_info->qgroup_lock); mutex_init(&fs_info->qgroup_ioctl_lock); @@ -2457,6 +2469,14 @@ int open_ctree(struct super_block *sb, goto fail_alloc; } + fs_info->dedup_driver = crypto_alloc_shash("sha256", 0, 0); + if (IS_ERR(fs_info->dedup_driver)) { + pr_info("Cannot load sha256 driver\n"); + err = PTR_ERR(fs_info->dedup_driver); + fs_info->dedup_driver = NULL; + goto fail_alloc; + } + btrfs_init_workers(&fs_info->generic_worker, "genwork", 1, NULL); @@ -2695,6 +2715,13 @@ retry_root_backup: fs_info->quota_root = quota_root; } + location.objectid = BTRFS_DEDUP_TREE_OBJECTID; + dedup_root = btrfs_read_tree_root(tree_root, &location); + if (!IS_ERR(dedup_root)) { + dedup_root->track_dirty = 1; + fs_info->dedup_root = dedup_root; + } + fs_info->generation = generation; fs_info->last_trans_committed = generation; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c6612f5..36d658d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1102,8 +1102,16 @@ static noinline int lookup_extent_data_ref(struct btrfs_trans_handle *trans, key.offset = parent; } else { key.type = BTRFS_EXTENT_DATA_REF_KEY; - key.offset = hash_extent_data_ref(root_objectid, - owner, offset); + + /* + * we've not got the right offset and owner, so search by -1 + * here. + */ + if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) + key.offset = -1; + else + key.offset = hash_extent_data_ref(root_objectid, + owner, offset); } again: recow = 0; @@ -1130,6 +1138,10 @@ again: goto fail; } + if (ret > 0 && root_objectid == BTRFS_DEDUP_TREE_OBJECTID && + path->slots[0] > 0) + path->slots[0]--; + leaf = path->nodes[0]; nritems = btrfs_header_nritems(leaf); while (1) { @@ -1153,14 +1165,22 @@ again: ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_data_ref); - if (match_extent_data_ref(leaf, ref, root_objectid, - owner, offset)) { - if (recow) { - btrfs_release_path(path); - goto again; + if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + if (btrfs_extent_data_ref_root(leaf, ref) == + root_objectid) { + err = 0; + break; + } + } else { + if (match_extent_data_ref(leaf, ref, root_objectid, + owner, offset)) { + if (recow) { + btrfs_release_path(path); + goto again; + } + err = 0; + break; } - err = 0; - break; } path->slots[0]++; } @@ -1304,6 +1324,32 @@ static noinline int remove_extent_data_ref(struct btrfs_trans_handle *trans, return ret; } +static noinline u64 extent_data_ref_offset(struct btrfs_root *root, + struct btrfs_path *path, + struct btrfs_extent_inline_ref *iref) +{ + struct btrfs_key key; + struct extent_buffer *leaf; + struct btrfs_extent_data_ref *ref1; + u64 offset = 0; + + leaf = path->nodes[0]; + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); + if (iref) { + BUG_ON(btrfs_extent_inline_ref_type(leaf, iref) != + BTRFS_EXTENT_DATA_REF_KEY); + ref1 = (struct btrfs_extent_data_ref *)(&iref->offset); + offset = btrfs_extent_data_ref_offset(leaf, ref1); + } else if (key.type == BTRFS_EXTENT_DATA_REF_KEY) { + ref1 = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_extent_data_ref); + offset = btrfs_extent_data_ref_offset(leaf, ref1); + } else { + WARN_ON(1); + } + return offset; +} + static noinline u32 extent_data_ref_count(struct btrfs_root *root, struct btrfs_path *path, struct btrfs_extent_inline_ref *iref) @@ -1571,7 +1617,8 @@ again: err = -ENOENT; while (1) { if (ptr >= end) { - WARN_ON(ptr > end); + WARN_ON(ptr > end && + root_objectid != BTRFS_DEDUP_TREE_OBJECTID); break; } iref = (struct btrfs_extent_inline_ref *)ptr; @@ -1586,14 +1633,25 @@ again: if (type == BTRFS_EXTENT_DATA_REF_KEY) { struct btrfs_extent_data_ref *dref; dref = (struct btrfs_extent_data_ref *)(&iref->offset); - if (match_extent_data_ref(leaf, dref, root_objectid, - owner, offset)) { - err = 0; - break; + + if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + if (btrfs_extent_data_ref_root(leaf, dref) == + root_objectid) { + err = 0; + break; + } + } else { + if (match_extent_data_ref(leaf, dref, + root_objectid, owner, + offset)) { + err = 0; + break; + } + if (hash_extent_data_ref_item(leaf, dref) < + hash_extent_data_ref(root_objectid, owner, + offset)) + break; } - if (hash_extent_data_ref_item(leaf, dref) < - hash_extent_data_ref(root_objectid, owner, offset)) - break; } else { u64 ref_offset; ref_offset = btrfs_extent_inline_ref_offset(leaf, iref); @@ -4695,6 +4753,8 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info) if (fs_info->quota_root) fs_info->quota_root->block_rsv = &fs_info->global_block_rsv; fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv; + if (fs_info->dedup_root) + fs_info->dedup_root->block_rsv = &fs_info->global_block_rsv; update_global_block_rsv(fs_info); } @@ -5613,11 +5673,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_extent_inline_ref *iref; int ret; int is_data; - int extent_slot = 0; - int found_extent = 0; - int num_to_del = 1; + int extent_slot; + int found_extent; + int num_to_del; u32 item_size; u64 refs; + u64 orig_root_obj; + u64 dedup_hash; bool skinny_metadata = btrfs_fs_incompat(root->fs_info, SKINNY_METADATA); @@ -5625,6 +5687,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, if (!path) return -ENOMEM; +again: + extent_slot = 0; + found_extent = 0; + num_to_del = 1; + orig_root_obj = root_objectid; + dedup_hash = 0; + path->reada = 1; path->leave_spinning = 1; @@ -5666,6 +5735,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, #endif if (!found_extent) { BUG_ON(iref); + + if (is_data && + root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + dedup_hash = extent_data_ref_offset(root, path, + NULL); + } ret = remove_extent_backref(trans, extent_root, path, NULL, refs_to_drop, is_data); @@ -5723,6 +5798,10 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, } extent_slot = path->slots[0]; } + } else if (ret == -ENOENT && + root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + ret = 0; + goto out; } else if (ret == -ENOENT) { btrfs_print_leaf(extent_root, path->nodes[0]); WARN_ON(1); @@ -5819,7 +5898,28 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, } add_pinned_bytes(root->fs_info, -num_bytes, owner_objectid, root_objectid); + + /* + * special case for dedup + * + * We assume the last ref(ref==1) is backref pointing to dedup. + * + * root_obj == 1 means that it's a free space cache inode, + * and it always uses PREALLOC, so it never has dedup extent. + */ + if (is_data && root->fs_info->dedup_root && + orig_root_obj != BTRFS_ROOT_TREE_OBJECTID && refs == 1) { + btrfs_release_path(path); + root_objectid = BTRFS_DEDUP_TREE_OBJECTID; + parent = 0; + + goto again; + } } else { + if (!dedup_hash && is_data && + root_objectid == BTRFS_DEDUP_TREE_OBJECTID) + dedup_hash = extent_data_ref_offset(root, path, iref); + if (found_extent) { BUG_ON(is_data && refs_to_drop != extent_data_ref_count(root, path, iref)); @@ -5846,6 +5946,18 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, extent_root, ret); goto out; } + + if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) { + ret = btrfs_free_dedup_extent(trans, root, + dedup_hash, + bytenr); + if (ret) { + btrfs_abort_transaction(trans, + extent_root, + ret); + goto out; + } + } } ret = update_block_group(root, bytenr, num_bytes, 0); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index f8586a9..663aed5c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1259,6 +1259,20 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end, cached_state, mask); } +int set_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end, + struct extent_state **cached_state, gfp_t mask) +{ + return set_extent_bit(tree, start, end, EXTENT_DEDUP, 0, + cached_state, mask); +} + +int clear_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end, + struct extent_state **cached_state, gfp_t mask) +{ + return clear_extent_bit(tree, start, end, EXTENT_DEDUP, 0, 0, + cached_state, mask); +} + /* * either insert or lock state struct between start and end use mask to tell * us if waiting is desired. @@ -1704,6 +1718,8 @@ int extent_clear_unlock_delalloc(struct inode *inode, clear_bits |= EXTENT_LOCKED; if (op & EXTENT_CLEAR_DIRTY) clear_bits |= EXTENT_DIRTY; + if (op & EXTENT_CLEAR_DEDUP) + clear_bits |= EXTENT_DEDUP; if (op & EXTENT_CLEAR_DELALLOC) clear_bits |= EXTENT_DELALLOC; @@ -2413,7 +2429,7 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end) * Scheduling is not allowed, so the extent state tree is expected * to have one and only one object corresponding to this IO. */ -static void end_bio_extent_writepage(struct bio *bio, int err) +void end_bio_extent_writepage(struct bio *bio, int err) { struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; struct extent_io_tree *tree; @@ -2618,8 +2634,8 @@ struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) } -static int __must_check submit_one_bio(int rw, struct bio *bio, - int mirror_num, unsigned long bio_flags) +int __must_check submit_one_bio(int rw, struct bio *bio, int mirror_num, + unsigned long bio_flags) { int ret = 0; struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; @@ -2658,7 +2674,7 @@ static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page, } -static int submit_extent_page(int rw, struct extent_io_tree *tree, +int submit_extent_page(int rw, struct extent_io_tree *tree, struct page *page, sector_t sector, size_t size, unsigned long offset, struct block_device *bdev, @@ -3721,9 +3737,10 @@ int extent_write_locked_range(struct extent_io_tree *tree, struct inode *inode, while (start <= end) { page = find_get_page(mapping, start >> PAGE_CACHE_SHIFT); - if (clear_page_dirty_for_io(page)) + + if (clear_page_dirty_for_io(page)) { ret = __extent_writepage(page, &wbc_writepages, &epd); - else { + } else { if (tree->ops && tree->ops->writepage_end_io_hook) tree->ops->writepage_end_io_hook(page, start, start + PAGE_CACHE_SIZE - 1, diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 3b8c4e2..874d743 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -20,6 +20,7 @@ #define EXTENT_NEED_WAIT (1 << 13) #define EXTENT_DAMAGED (1 << 14) #define EXTENT_NORESERVE (1 << 15) +#define EXTENT_DEDUP (1 << 16) #define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK) #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC) @@ -52,6 +53,7 @@ #define EXTENT_END_WRITEBACK 0x20 #define EXTENT_SET_PRIVATE2 0x40 #define EXTENT_CLEAR_ACCOUNTING 0x80 +#define EXTENT_CLEAR_DEDUP 0x100 /* * page->private values. Every page that is controlled by the extent @@ -225,6 +227,10 @@ int set_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached_state, gfp_t mask); int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end, struct extent_state **cached_state, gfp_t mask); +int set_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end, + struct extent_state **cached_state, gfp_t mask); +int clear_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end, + struct extent_state **cached_state, gfp_t mask); int set_extent_new(struct extent_io_tree *tree, u64 start, u64 end, gfp_t mask); int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end, @@ -348,4 +354,14 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, int end_extent_writepage(struct page *page, int err, u64 start, u64 end); int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb, int mirror_num); +int submit_extent_page(int rw, struct extent_io_tree *tree, struct page *page, + sector_t sector, size_t size, unsigned long offset, + struct block_device *bdev, struct bio **bio_ret, + unsigned long max_pages, bio_end_io_t end_io_func, + int mirror_num, unsigned long prev_bio_flags, + unsigned long bio_flags); +void end_bio_extent_writepage(struct bio *bio, int err); +int __must_check submit_one_bio(int rw, struct bio *bio, int mirror_num, + unsigned long bio_flags); + #endif diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index a7bfc95..088858a 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -858,3 +858,220 @@ out: fail_unlock: goto out; } + +/* 1 means we find one, 0 means we dont. */ +int noinline_for_stack +btrfs_find_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, u64 *hash, u64 *bytenr_ret, + u64 *num_bytes_ret, int *compr_ret) +{ + struct btrfs_key key; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_root *dedup_root; + struct btrfs_dedup_item *item; + struct btrfs_dedup_hash_item *hash_item; + u64 *hash_in_item[4]; + u64 hash_value; + u64 length; + int compression; + int found = 0; + int ret; + + if (!root->fs_info->dedup_root) { + WARN(1, KERN_INFO "dedup not enabled\n"); + return 0; + } + + dedup_root = root->fs_info->dedup_root; + + path = btrfs_alloc_path(); + if (!path) + return found; + + hash_value = hash[BTRFS_DEDUP_HASH_LEN - 1]; + + key.objectid = hash_value; + key.offset = -1; + btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY); + + ret = btrfs_search_slot(trans, dedup_root, &key, path, 0, 0); + if (ret < 0) + goto out; + if (ret == 0) { + WARN_ON(1); + goto out; + } + +prev_slot: + /* this will do match checks. */ + ret = btrfs_previous_item(dedup_root, path, hash_value, + BTRFS_DEDUP_ITEM_KEY); + if (ret) + goto out; + + leaf = path->nodes[0]; + + item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dedup_item); + /* disk length of dedup range */ + length = btrfs_dedup_len(leaf, item); + if (length > root->fs_info->dedup_bs) { + WARN_ON(1); + goto out; + } + + compression = btrfs_dedup_compression(leaf, item); + if (compression > BTRFS_COMPRESS_TYPES) { + WARN_ON(1); + goto out; + } + + hash_item = (struct btrfs_dedup_hash_item *)(item + 1); + if (sizeof(*hash_item) != BTRFS_DEDUP_HASH_SIZE) { + WARN_ON(1); + goto out; + } + + read_extent_buffer(leaf, hash_in_item, ((unsigned long)hash_item), + BTRFS_DEDUP_HASH_SIZE); + if (memcmp((char *)hash_in_item, (char *)hash, sizeof(*hash_item))) { + pr_info("goto prev\n"); + goto prev_slot; + } + + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); + + /* bytenr_ret should exist as we need to return it to caller */ + BUG_ON(!bytenr_ret); + *bytenr_ret = key.offset; + *num_bytes_ret = length; + *compr_ret = compression; + found = 1; +out: + btrfs_free_path(path); + return found; +} + +int noinline_for_stack +btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, u64 *hash, u64 bytenr, + u64 num_bytes, int compr) +{ + struct btrfs_key key; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_root *dedup_root; + struct btrfs_dedup_item *dedup_item; + struct btrfs_dedup_hash_item *hash_item; + u64 ins_size; + int ret; + + if (!root->fs_info->dedup_root) { + WARN(1, KERN_INFO "dedup not enabled\n"); + return 0; + } + + dedup_root = root->fs_info->dedup_root; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + /* insert index by hash */ + ins_size = sizeof(*dedup_item) + sizeof(*hash_item); + + key.objectid = hash[BTRFS_DEDUP_HASH_LEN - 1]; + key.offset = bytenr; + btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY); + + path->leave_spinning = 1; + ret = btrfs_insert_empty_item(trans, dedup_root, path, &key, ins_size); + if (ret < 0) + goto out; + leaf = path->nodes[0]; + + dedup_item = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_dedup_item); + /* disk length of dedup range */ + BUG_ON(num_bytes > root->fs_info->dedup_bs); + btrfs_set_dedup_len(leaf, dedup_item, num_bytes); + btrfs_set_dedup_compression(leaf, dedup_item, compr); + btrfs_set_dedup_encryption(leaf, dedup_item, 0); + btrfs_set_dedup_other_encoding(leaf, dedup_item, 0); + + hash_item = (struct btrfs_dedup_hash_item *)(dedup_item + 1); + if (sizeof(*hash_item) != BTRFS_DEDUP_HASH_SIZE) { + WARN_ON(1); + goto out; + } + + write_extent_buffer(leaf, hash, (unsigned long)hash_item, + BTRFS_DEDUP_HASH_SIZE); + + btrfs_mark_buffer_dirty(leaf); +out: + /* TODO: EEXIST means that we need to mark it as a dup one */ + WARN_ON(ret == -EEXIST); + btrfs_free_path(path); + return ret; +} + +int noinline_for_stack +btrfs_free_dedup_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, u64 hash, u64 bytenr) +{ + struct btrfs_key key; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_root *dedup_root; + struct btrfs_dedup_item *dedup_item; + struct btrfs_dedup_hash_item *hash_item; + u64 hash_in_item[4]; + int ret = 0; + + if (!root->fs_info->dedup_root) + return 0; + + dedup_root = root->fs_info->dedup_root; + + path = btrfs_alloc_path(); + if (!path) + return ret; + + /* index by hash */ + key.objectid = hash; + key.offset = bytenr; + btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY); + + ret = btrfs_search_slot(trans, dedup_root, &key, path, -1, 1); + if (ret < 0) + goto out; + if (ret) { + WARN_ON(1); + ret = -ENOENT; + goto out; + } + + leaf = path->nodes[0]; + + ret = -ENOENT; + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); + if (btrfs_key_type(&key) != BTRFS_DEDUP_ITEM_KEY) + goto out; + if (key.objectid != hash || key.offset != bytenr) + goto out; + + dedup_item = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_dedup_item); + hash_item = (struct btrfs_dedup_hash_item *)(dedup_item + 1); + read_extent_buffer(leaf, hash_in_item, ((unsigned long)hash_item), + BTRFS_DEDUP_HASH_SIZE); + if (hash_in_item[BTRFS_DEDUP_HASH_LEN - 1] != hash) + BUG_ON(1); + + ret = btrfs_del_item(trans, dedup_root, path); + WARN_ON(ret); +out: + btrfs_free_path(path); + return ret; +} diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d0544bf..a3041ec 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -104,6 +104,16 @@ static struct extent_map *create_pinned_em(struct inode *inode, u64 start, u64 block_start, u64 block_len, u64 orig_block_len, u64 ram_bytes, int type); +static noinline int cow_file_range_dedup(struct inode *inode, + struct page *locked_page, + u64 start, u64 end, int *page_started, + unsigned long *nr_written, int unlock, + u64 *hash); +static int run_locked_range(struct extent_io_tree *tree, struct inode *inode, + struct page *locked_page, u64 start, u64 end, + get_extent_t *get_extent, int mode, u64 *hash); +static int btrfs_inode_test_compress(struct inode *inode); + static int btrfs_dirty_inode(struct inode *inode); @@ -285,6 +295,7 @@ struct async_extent { unsigned long nr_pages; int compress_type; struct list_head list; + u64 *hash; /* dedup hash of sha256 */ }; struct async_cow { @@ -302,22 +313,40 @@ static noinline int add_async_extent(struct async_cow *cow, u64 compressed_size, struct page **pages, unsigned long nr_pages, - int compress_type) + int compress_type, u64 *dedup_hash) { struct async_extent *async_extent; async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS); - BUG_ON(!async_extent); /* -ENOMEM */ + if (!async_extent) + return -ENOMEM; async_extent->start = start; async_extent->ram_size = ram_size; async_extent->compressed_size = compressed_size; async_extent->pages = pages; async_extent->nr_pages = nr_pages; async_extent->compress_type = compress_type; + async_extent->hash = NULL; + if (dedup_hash) { + async_extent->hash = kmalloc(sizeof(u64) * BTRFS_DEDUP_HASH_LEN, + GFP_NOFS); + if (!async_extent->hash) { + kfree(async_extent); + return -ENOMEM; + } + memcpy(async_extent->hash, dedup_hash, BTRFS_DEDUP_HASH_SIZE); + } + list_add_tail(&async_extent->list, &cow->extents); return 0; } +static noinline void free_async_extent(struct async_extent *p) +{ + kfree(p->hash); + kfree(p); +} + /* * we create compressed extents in two phases. The first * phase compresses a range of pages that have already been @@ -339,10 +368,10 @@ static noinline int compress_file_range(struct inode *inode, struct page *locked_page, u64 start, u64 end, struct async_cow *async_cow, - int *num_added) + int *num_added, u64 *dedup_hash) { struct btrfs_root *root = BTRFS_I(inode)->root; - struct btrfs_trans_handle *trans; + struct btrfs_trans_handle *trans = NULL; u64 num_bytes; u64 blocksize = root->sectorsize; u64 actual_end; @@ -408,9 +437,7 @@ again: * change at any time if we discover bad compression ratios. */ if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) && - (btrfs_test_opt(root, COMPRESS) || - (BTRFS_I(inode)->force_compress) || - (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) { + btrfs_inode_test_compress(inode)) { WARN_ON(pages); pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS); if (!pages) { @@ -547,9 +574,11 @@ cont: * allocation on disk for these compressed pages, * and will submit them to the elevator. */ - add_async_extent(async_cow, start, num_bytes, - total_compressed, pages, nr_pages_ret, - compress_type); + ret = add_async_extent(async_cow, start, num_bytes, + total_compressed, pages, nr_pages_ret, + compress_type, dedup_hash); + if (ret) + goto cleanup_and_out; if (start + num_bytes < end) { start += num_bytes; @@ -573,8 +602,11 @@ cleanup_and_bail_uncompressed: } if (redirty) extent_range_redirty_for_io(inode, start, end); - add_async_extent(async_cow, start, end - start + 1, - 0, NULL, 0, BTRFS_COMPRESS_NONE); + ret = add_async_extent(async_cow, start, end - start + 1, + 0, NULL, 0, BTRFS_COMPRESS_NONE, dedup_hash); + if (ret) + goto cleanup_and_out; + *num_added += 1; } @@ -638,38 +670,15 @@ again: retry: /* did the compression code fall back to uncompressed IO? */ if (!async_extent->pages) { - int page_started = 0; - unsigned long nr_written = 0; - - lock_extent(io_tree, async_extent->start, - async_extent->start + - async_extent->ram_size - 1); - - /* allocate blocks */ - ret = cow_file_range(inode, async_cow->locked_page, - async_extent->start, - async_extent->start + - async_extent->ram_size - 1, - &page_started, &nr_written, 0); - - /* JDM XXX */ + ret = run_locked_range(io_tree, inode, + async_cow->locked_page, + async_extent->start, + async_extent->start + + async_extent->ram_size - 1, + btrfs_get_extent, WB_SYNC_ALL, + async_extent->hash); - /* - * if page_started, cow_file_range inserted an - * inline extent and took care of all the unlocking - * and IO for us. Otherwise, we need to submit - * all those pages down to the drive. - */ - if (!page_started && !ret) - extent_write_locked_range(io_tree, - inode, async_extent->start, - async_extent->start + - async_extent->ram_size - 1, - btrfs_get_extent, - WB_SYNC_ALL); - else if (ret) - unlock_page(async_cow->locked_page); - kfree(async_extent); + free_async_extent(async_extent); cond_resched(); continue; } @@ -762,7 +771,8 @@ retry: async_extent->ram_size, ins.offset, BTRFS_ORDERED_COMPRESSED, - async_extent->compress_type); + async_extent->compress_type, + async_extent->hash); if (ret) goto out_free_reserve; @@ -786,7 +796,7 @@ retry: ins.offset, async_extent->pages, async_extent->nr_pages); alloc_hint = ins.objectid + ins.offset; - kfree(async_extent); + free_async_extent(async_extent); if (ret) goto out; cond_resched(); @@ -807,10 +817,351 @@ out_free: EXTENT_CLEAR_DIRTY | EXTENT_SET_WRITEBACK | EXTENT_END_WRITEBACK); - kfree(async_extent); + free_async_extent(async_extent); goto again; } +static int btrfs_dedup_hash_digest(struct btrfs_root *root, const char *data, + u64 length, u64 *hash) +{ + struct crypto_shash *tfm = root->fs_info->dedup_driver; + struct { + struct shash_desc desc; + char ctx[crypto_shash_descsize(tfm)]; + } sdesc; + + sdesc.desc.tfm = tfm; + sdesc.desc.flags = 0; + + return crypto_shash_digest(&sdesc.desc, data, length, (char *)hash); +} + +static int btrfs_calc_dedup_hash(struct btrfs_root *root, struct inode *inode, + u64 start, u64 *hash) +{ + struct page *p; + char *data; + u64 length = root->fs_info->dedup_bs; + u64 blocksize = root->sectorsize; + int err; + + if (length == blocksize) { + p = find_get_page(inode->i_mapping, + (start >> PAGE_CACHE_SHIFT)); + BUG_ON(!p); /* page should be here */ + data = kmap_atomic(p); + err = btrfs_dedup_hash_digest(root, data, length, hash); + kunmap_atomic(data); + page_cache_release(p); + } else { + char *d; + int i = 0; + + data = kmalloc(length, GFP_NOFS); + if (!data) + return -ENOMEM; + + while (blocksize * i < length) { + p = find_get_page(inode->i_mapping, + (start >> PAGE_CACHE_SHIFT) + i); + BUG_ON(!p); /* page should be here */ + d = kmap_atomic(p); + memcpy((data + blocksize * i), d, blocksize); + kunmap_atomic(d); + page_cache_release(p); + i++; + } + + err = btrfs_dedup_hash_digest(root, data, length, hash); + kfree(data); + } + return err; +} + +static noinline int +run_delalloc_dedup(struct inode *inode, struct page *locked_page, u64 start, + u64 end, struct async_cow *async_cow) +{ + struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_trans_handle *trans = NULL; + struct bio *bio = NULL; + struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; + struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; + struct extent_map *em; + struct page *page = NULL; + struct block_device *bdev; + struct btrfs_key ins; + u64 blocksize = root->sectorsize; + u64 num_bytes; + u64 cur_alloc_size; + u64 cur_end; + u64 alloc_hint = 0; + u64 iosize; + u64 dedup_bs = root->fs_info->dedup_bs; + u64 dedup_bytenr; + u64 dedup_len; + u64 hash[4]; + int compr; + int found; + int type = 0; + sector_t sector; + int ret = 0; + struct extent_state *cached_state = NULL; + + BUG_ON(btrfs_is_free_space_inode(inode)); + + num_bytes = ALIGN(end - start + 1, blocksize); + num_bytes = max(blocksize, num_bytes); + + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + goto out; + } + + btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0); + + while (num_bytes > 0) { + unsigned long op = 0; + + /* page has been locked by caller */ + page = find_get_page(inode->i_mapping, + start >> PAGE_CACHE_SHIFT); + BUG_ON(!page); /* page should be here */ + + /* already ordered */ + if (PagePrivate2(page)) + goto submit; + + /* too small data, go for normal path */ + if (num_bytes < dedup_bs) { + cur_end = start + num_bytes - 1; + + if (btrfs_inode_test_compress(inode)) { + int num_added = 0; + compress_file_range(inode, page, start, cur_end, + async_cow, &num_added, + NULL); + } else { + /* Now locked_page is not dirty. */ + if (page_offset(locked_page) >= start && + page_offset(locked_page) <= cur_end) { + __set_page_dirty_nobuffers(locked_page); + } + + ret = run_locked_range(tree, inode, page, start, + cur_end, + btrfs_get_extent, + WB_SYNC_ALL, NULL); + if (ret) + SetPageError(page); + } + + page_cache_release(page); + page = NULL; + + num_bytes -= num_bytes; + start += num_bytes; + cond_resched(); + continue; + } + + cur_alloc_size = min_t(u64, num_bytes, dedup_bs); + BUG_ON(cur_alloc_size < dedup_bs); /* shouldn't happen */ + cur_end = start + cur_alloc_size - 1; + + /* see comments in compress_file_range */ + extent_range_clear_dirty_for_io(inode, start, cur_end); + + memset(hash, 0, sizeof(u64) * 4); + ret = btrfs_calc_dedup_hash(root, inode, start, hash); + + compr = BTRFS_COMPRESS_NONE; + if (ret) + found = 0; + else + found = btrfs_find_dedup_extent(trans, root, hash, + &dedup_bytenr, &dedup_len, + &compr); + + if (found == 0) { + /* + * compress fastpath. + * so we take the original data as dedup string instead + * of compressed data since compression methods and data + * from them vary a lot. + */ + if (btrfs_inode_test_compress(inode)) { + int num_added = 0; + + extent_range_redirty_for_io(inode, start, + cur_end); + + compress_file_range(inode, page, start, cur_end, + async_cow, &num_added, + hash); + + page_cache_release(page); + page = NULL; + + num_bytes -= cur_alloc_size; + start += cur_alloc_size; + cond_resched(); + continue; + } + + /* no compress */ + ret = btrfs_reserve_extent(trans, root, cur_alloc_size, + cur_alloc_size, 0, alloc_hint, + &ins, 1); + if (ret < 0) { + btrfs_abort_transaction(trans, root, ret); + goto out_unlock; + } + } else { /* found same hash */ + ins.objectid = dedup_bytenr; + ins.offset = dedup_len; + + set_extent_dedup(tree, start, cur_end, &cached_state, + GFP_NOFS); + } + + lock_extent(tree, start, cur_end); + + em = alloc_extent_map(); + if (!em) { + ret = -ENOMEM; + goto out_reserve; + } + em->start = start; + em->orig_start = em->start; + em->len = cur_alloc_size; + em->mod_start = em->start; + em->mod_len = em->len; + + em->block_start = ins.objectid; + em->block_len = ins.offset; + em->orig_block_len = ins.offset; + em->bdev = root->fs_info->fs_devices->latest_bdev; + set_bit(EXTENT_FLAG_PINNED, &em->flags); + em->generation = -1; + if (compr > BTRFS_COMPRESS_NONE) { + em->compress_type = compr; + set_bit(EXTENT_FLAG_COMPRESSED, &em->flags); + type = BTRFS_ORDERED_COMPRESSED; + } + + while (1) { + write_lock(&em_tree->lock); + ret = add_extent_mapping(em_tree, em, 1); + write_unlock(&em_tree->lock); + if (ret != -EEXIST) { + free_extent_map(em); + break; + } + btrfs_drop_extent_cache(inode, start, cur_end, 0); + } + if (ret) + goto out_reserve; + + ret = btrfs_add_ordered_extent_dedup(inode, start, ins.objectid, + cur_alloc_size, ins.offset, + type, found, hash, compr); + if (ret) + goto out_reserve; + + /* + * Do set the Private2 bit so we know this page was properly + * setup for writepage + */ + op |= EXTENT_CLEAR_UNLOCK | EXTENT_CLEAR_DELALLOC | + EXTENT_SET_PRIVATE2 | EXTENT_CLEAR_DIRTY | + EXTENT_SET_WRITEBACK; + extent_clear_unlock_delalloc(inode, tree, start, cur_end, + NULL, op); + +submit: + iosize = blocksize; + + found = test_range_bit(tree, start, start + iosize - 1, + EXTENT_DEDUP, 0, cached_state); + if (found == 0) { + em = btrfs_get_extent(inode, page, 0, start, blocksize, + 1); + if (IS_ERR(em)) { + /* btrfs_get_extent will not return NULL */ + ret = PTR_ERR(em); + goto out_reserve; + } + + sector = (em->block_start + start - em->start) >> 9; + bdev = em->bdev; + free_extent_map(em); + em = NULL; + + /* TODO: rw can be WRTIE_SYNC */ + ret = submit_extent_page(WRITE, tree, page, sector, + iosize, 0, bdev, &bio, + 0, /* max_nr is no used */ + end_bio_extent_writepage, + 0, 0, 0); + if (ret) + break; + } else { + clear_extent_dedup(tree, start, start + iosize - 1, + &cached_state, GFP_NOFS); + + end_extent_writepage(page, 0, start, + start + iosize - 1); + /* we need to do this ourselves because we skip IO */ + end_page_writeback(page); + } + + unlock_page(page); + page_cache_release(page); + page = NULL; + + num_bytes -= blocksize; + alloc_hint = ins.objectid + blocksize; + start += blocksize; + cond_resched(); + } + +out_unlock: + if (bio) { + if (ret) + bio_put(bio); + else + ret = submit_one_bio(WRITE, bio, 0, 0); + bio = NULL; + } + + if (ret && page) + SetPageError(page); + if (page) { + unlock_page(page); + page_cache_release(page); + } + + btrfs_end_transaction(trans, root); +out: + if (ret && num_bytes > 0) + extent_clear_unlock_delalloc(inode, tree, + start, start + num_bytes - 1, + NULL, EXTENT_CLEAR_UNLOCK_PAGE | + EXTENT_CLEAR_UNLOCK | EXTENT_CLEAR_DELALLOC | + EXTENT_CLEAR_DIRTY | EXTENT_SET_WRITEBACK | + EXTENT_END_WRITEBACK | EXTENT_CLEAR_DEDUP); + + free_extent_state(cached_state); + return ret; + +out_reserve: + if (found == 0) + btrfs_free_reserved_extent(root, ins.objectid, ins.offset); + goto out_unlock; +} + static u64 get_extent_allocation_hint(struct inode *inode, u64 start, u64 num_bytes) { @@ -862,7 +1213,7 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans, struct page *locked_page, u64 start, u64 end, int *page_started, unsigned long *nr_written, - int unlock) + int unlock, u64 *hash) { u64 alloc_hint = 0; u64 num_bytes; @@ -964,8 +1315,16 @@ static noinline int __cow_file_range(struct btrfs_trans_handle *trans, goto out_reserve; cur_alloc_size = ins.offset; - ret = btrfs_add_ordered_extent(inode, start, ins.objectid, - ram_size, cur_alloc_size, 0); + if (!hash) + ret = btrfs_add_ordered_extent(inode, start, + ins.objectid, ram_size, + cur_alloc_size, 0); + else + ret = btrfs_add_ordered_extent_dedup(inode, start, + ins.objectid, ram_size, + cur_alloc_size, 0, 0, + hash, + BTRFS_COMPRESS_NONE); if (ret) goto out_reserve; @@ -1046,28 +1405,97 @@ static noinline int cow_file_range(struct inode *inode, trans->block_rsv = &root->fs_info->delalloc_block_rsv; ret = __cow_file_range(trans, inode, root, locked_page, start, end, - page_started, nr_written, unlock); + page_started, nr_written, unlock, NULL); + + btrfs_end_transaction(trans, root); + + return ret; +} + +static noinline int cow_file_range_dedup(struct inode *inode, + struct page *locked_page, + u64 start, u64 end, int *page_started, + unsigned long *nr_written, + int unlock, u64 *hash) +{ + struct btrfs_trans_handle *trans; + struct btrfs_root *root = BTRFS_I(inode)->root; + int ret; + + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) { + extent_clear_unlock_delalloc(inode, + &BTRFS_I(inode)->io_tree, + start, end, locked_page, + EXTENT_CLEAR_UNLOCK_PAGE | + EXTENT_CLEAR_UNLOCK | + EXTENT_CLEAR_DELALLOC | + EXTENT_CLEAR_DIRTY | + EXTENT_SET_WRITEBACK | + EXTENT_END_WRITEBACK); + return PTR_ERR(trans); + } + trans->block_rsv = &root->fs_info->delalloc_block_rsv; + + ret = __cow_file_range(trans, inode, root, locked_page, start, end, + page_started, nr_written, unlock, hash); btrfs_end_transaction(trans, root); return ret; } +static int run_locked_range(struct extent_io_tree *tree, struct inode *inode, + struct page *locked_page, u64 start, u64 end, + get_extent_t *get_extent, int mode, u64 *hash) +{ + int page_started = 0; + unsigned long nr_written = 0; + int ret; + + lock_extent(tree, start, end); + + /* allocate blocks */ + ret = cow_file_range_dedup(inode, locked_page, start, end, + &page_started, &nr_written, 0, hash); + + /* JDM XXX */ + + /* + * if page_started, cow_file_range inserted an + * inline extent and took care of all the unlocking + * and IO for us. Otherwise, we need to submit + * all those pages down to the drive. + */ + if (!page_started && !ret) + extent_write_locked_range(tree, inode, start, end, get_extent, + mode); + else if (ret) + unlock_page(locked_page); + + return ret; +} + /* * work queue call back to started compression on a file and pages */ static noinline void async_cow_start(struct btrfs_work *work) { struct async_cow *async_cow; - int num_added = 0; async_cow = container_of(work, struct async_cow, work); - compress_file_range(async_cow->inode, async_cow->locked_page, - async_cow->start, async_cow->end, async_cow, - &num_added); - if (num_added == 0) { - btrfs_add_delayed_iput(async_cow->inode); - async_cow->inode = NULL; + if (btrfs_test_opt(async_cow->root, DEDUP)) { + run_delalloc_dedup(async_cow->inode, async_cow->locked_page, + async_cow->start, async_cow->end, async_cow); + } else { + int num_added = 0; + compress_file_range(async_cow->inode, async_cow->locked_page, + async_cow->start, async_cow->end, async_cow, + &num_added, NULL); + if (num_added == 0) { + btrfs_add_delayed_iput(async_cow->inode); + async_cow->inode = NULL; + } } } @@ -1370,7 +1798,8 @@ out_check: if (cow_start != (u64)-1) { ret = __cow_file_range(trans, inode, root, locked_page, cow_start, found_key.offset - 1, - page_started, nr_written, 1); + page_started, nr_written, 1, + NULL); if (ret) { btrfs_abort_transaction(trans, root, ret); goto error; @@ -1445,8 +1874,8 @@ out_check: if (cow_start != (u64)-1) { ret = __cow_file_range(trans, inode, root, locked_page, - cow_start, end, - page_started, nr_written, 1); + cow_start, end, page_started, nr_written, + 1, NULL); if (ret) { btrfs_abort_transaction(trans, root, ret); goto error; @@ -1473,6 +1902,19 @@ error: return ret; } +static int btrfs_inode_test_compress(struct inode *inode) +{ + struct btrfs_inode *bi = BTRFS_I(inode); + struct btrfs_root *root = BTRFS_I(inode)->root; + + if (btrfs_test_opt(root, COMPRESS) || + bi->force_compress || + bi->flags & BTRFS_INODE_COMPRESS) + return 1; + + return 0; +} + /* * extent_io.c call back to do delayed allocation processing */ @@ -1482,21 +1924,21 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page, { int ret; struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_inode *bi = BTRFS_I(inode); - if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) { + if (bi->flags & BTRFS_INODE_NODATACOW) { ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 1, nr_written); - } else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC) { + } else if (bi->flags & BTRFS_INODE_PREALLOC) { ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 0, nr_written); - } else if (!btrfs_test_opt(root, COMPRESS) && - !(BTRFS_I(inode)->force_compress) && - !(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS)) { + } else if (!btrfs_inode_test_compress(inode) && + !btrfs_test_opt(root, DEDUP)) { ret = cow_file_range(inode, locked_page, start, end, page_started, nr_written, 1); } else { set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, - &BTRFS_I(inode)->runtime_flags); + &bi->runtime_flags); ret = cow_file_range_async(inode, locked_page, start, end, page_started, nr_written); } @@ -1917,12 +2359,13 @@ static int btrfs_writepage_start_hook(struct page *page, u64 start, u64 end) return -EBUSY; } -static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, - struct inode *inode, u64 file_pos, - u64 disk_bytenr, u64 disk_num_bytes, - u64 num_bytes, u64 ram_bytes, - u8 compression, u8 encryption, - u16 other_encoding, int extent_type) +static int __insert_reserved_file_extent(struct btrfs_trans_handle *trans, + struct inode *inode, u64 file_pos, + u64 disk_bytenr, u64 disk_num_bytes, + u64 num_bytes, u64 ram_bytes, + u8 compression, u8 encryption, + u16 other_encoding, int extent_type, + int dedup, u64 *hash) { struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_file_extent_item *fi; @@ -1979,15 +2422,53 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, ins.objectid = disk_bytenr; ins.offset = disk_num_bytes; ins.type = BTRFS_EXTENT_ITEM_KEY; - ret = btrfs_alloc_reserved_file_extent(trans, root, - root->root_key.objectid, - btrfs_ino(inode), file_pos, &ins); + + if (!dedup) { + ret = btrfs_alloc_reserved_file_extent(trans, root, + root->root_key.objectid, + btrfs_ino(inode), + file_pos, &ins); + + if (hash) { + ret = btrfs_insert_dedup_extent(trans, root, hash, + ins.objectid, + ins.offset, + compression); + + ret = btrfs_inc_extent_ref(trans, root, ins.objectid, + ins.offset, 0, + BTRFS_DEDUP_TREE_OBJECTID, + btrfs_ino(inode), + hash[BTRFS_DEDUP_HASH_LEN - 1], 0); + } + } else { + ret = btrfs_inc_extent_ref(trans, root, ins.objectid, + ins.offset, 0, + root->root_key.objectid, + btrfs_ino(inode), + file_pos, /* file_pos - 0 */ + 0); + } out: btrfs_free_path(path); return ret; } +static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, + struct inode *inode, u64 file_pos, + u64 disk_bytenr, u64 disk_num_bytes, + u64 num_bytes, u64 ram_bytes, + u8 compression, u8 encryption, + u16 other_encoding, int extent_type) +{ + return __insert_reserved_file_extent(trans, inode, file_pos, + disk_bytenr, disk_num_bytes, + num_bytes, ram_bytes, compression, + encryption, other_encoding, + extent_type, 0, NULL); +} + /* snapshot-aware defrag */ struct sa_defrag_extent_backref { struct rb_node node; @@ -2715,14 +3196,16 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) ordered_extent->len); } else { BUG_ON(root == root->fs_info->tree_root); - ret = insert_reserved_file_extent(trans, inode, + ret = __insert_reserved_file_extent(trans, inode, ordered_extent->file_offset, ordered_extent->start, ordered_extent->disk_len, ordered_extent->len, ordered_extent->len, compress_type, 0, 0, - BTRFS_FILE_EXTENT_REG); + BTRFS_FILE_EXTENT_REG, + ordered_extent->dedup, + ordered_extent->hash); } unpin_extent_cache(&BTRFS_I(inode)->extent_tree, ordered_extent->file_offset, ordered_extent->len, diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 0e17a30..1a4752e 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4097,6 +4097,97 @@ out_unlock: return ret; } +static long btrfs_register_dedup(struct btrfs_root *root) +{ + struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_trans_handle *trans = NULL; + struct btrfs_root *dedup_root; + int ret = 0; + + mutex_lock(&fs_info->dedup_ioctl_mutex); + if (fs_info->dedup_root) { + pr_info("dedup has already been enabled\n"); + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return 0; + } + + trans = btrfs_start_transaction(root, 2); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + pr_info("trans error %d\n", ret); + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return ret; + } + + dedup_root = btrfs_create_tree(trans, fs_info, + BTRFS_DEDUP_TREE_OBJECTID); + if (IS_ERR(dedup_root)) + ret = PTR_ERR(dedup_root); + + if (ret) + btrfs_end_transaction(trans, root); + else + ret = btrfs_commit_transaction(trans, root); + + if (!ret) { + pr_info("dedup enabled\n"); + fs_info->dedup_root = dedup_root; + fs_info->dedup_root->block_rsv = &fs_info->global_block_rsv; + btrfs_set_fs_incompat(fs_info, DEDUP); + } + + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return ret; +} + +static long btrfs_unregister_dedup(struct btrfs_root *root) +{ + struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_root *dedup_root; + int ret; + + mutex_lock(&fs_info->dedup_ioctl_mutex); + if (!fs_info->dedup_root) { + pr_info("dedup has been disabled\n"); + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return 0; + } + + if (btrfs_test_opt(root, DEDUP)) { + pr_info("Cannot disable dedup until clearing dedup mount options!\n"); + mutex_unlock(&fs_info->dedup_ioctl_mutex); + return -EBUSY; + } + + dedup_root = fs_info->dedup_root; + + ret = btrfs_drop_snapshot(dedup_root, NULL, 1, 0); + + if (!ret) { + fs_info->dedup_root = NULL; + pr_info("dedup disabled\n"); + } + + mutex_unlock(&fs_info->dedup_ioctl_mutex); + BUG_ON(ret < 0 && ret != -EAGAIN && ret != -EROFS); + return ret; +} + +static long btrfs_ioctl_dedup_ctl(struct btrfs_root *root, int cmd) +{ + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + switch (cmd) { + case BTRFS_DEDUP_CTL_REG: + return btrfs_register_dedup(root); + case BTRFS_DEDUP_CTL_UNREG: + return btrfs_unregister_dedup(root); + } + + return -EINVAL; +} + long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -4207,6 +4298,8 @@ long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_get_fslabel(file, argp); case BTRFS_IOC_SET_FSLABEL: return btrfs_ioctl_set_fslabel(file, argp); + case BTRFS_IOC_DEDUP_CTL: + return btrfs_ioctl_dedup_ctl(root, arg); } return -ENOTTY; diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 8136982..adc818b 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -183,7 +183,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, */ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, - int type, int dio, int compress_type) + int type, int dio, int compress_type, + int dedup, u64 *hash) { struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_ordered_inode_tree *tree; @@ -199,10 +200,23 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, entry->start = start; entry->len = len; if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) && - !(type == BTRFS_ORDERED_NOCOW)) + !(type == BTRFS_ORDERED_NOCOW) && !dedup) entry->csum_bytes_left = disk_len; entry->disk_len = disk_len; entry->bytes_left = len; + entry->dedup = dedup; + entry->hash = NULL; + + if (!dedup && hash) { + entry->hash = kmalloc(sizeof(u64) * BTRFS_DEDUP_HASH_LEN, + GFP_NOFS); + if (!entry->hash) { + kmem_cache_free(btrfs_ordered_extent_cache, entry); + return -ENOMEM; + } + memcpy(entry->hash, hash, BTRFS_DEDUP_HASH_SIZE); + } + entry->inode = igrab(inode); entry->compress_type = compress_type; if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE) @@ -250,7 +264,16 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 0, - BTRFS_COMPRESS_NONE); + BTRFS_COMPRESS_NONE, 0, NULL); +} + +int btrfs_add_ordered_extent_dedup(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, int type, + int dedup, u64 *hash, int compress_type) +{ + return __btrfs_add_ordered_extent(inode, file_offset, start, len, + disk_len, type, 0, + compress_type, dedup, hash); } int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, @@ -258,16 +281,16 @@ int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 1, - BTRFS_COMPRESS_NONE); + BTRFS_COMPRESS_NONE, 0, NULL); } int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, - int type, int compress_type) + int type, int compress_type, u64 *hash) { return __btrfs_add_ordered_extent(inode, file_offset, start, len, disk_len, type, 0, - compress_type); + compress_type, 0, hash); } /* @@ -503,6 +526,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry) list_del(&sum->list); kfree(sum); } + kfree(entry->hash); kmem_cache_free(btrfs_ordered_extent_cache, entry); } } diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 68844d5..4969bb8 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -102,6 +102,9 @@ struct btrfs_ordered_extent { /* compression algorithm */ int compress_type; + /* if this is for dedup or not */ + int dedup; + /* reference count */ atomic_t refs; @@ -128,6 +131,9 @@ struct btrfs_ordered_extent { struct completion completion; struct btrfs_work flush_work; struct list_head work_list; + + /* dedup hash of sha256 type */ + u64 *hash; }; /* @@ -161,11 +167,14 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode, int uptodate); int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, int type); +int btrfs_add_ordered_extent_dedup(struct inode *inode, u64 file_offset, + u64 start, u64 len, u64 disk_len, int type, + int dedup, u64 *hash, int compress_type); int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, int type); int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset, u64 start, u64 len, u64 disk_len, - int type, int compress_type); + int type, int compress_type, u64 *hash); void btrfs_add_ordered_sum(struct inode *inode, struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 1209649..5d44887 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -3471,6 +3471,9 @@ static int find_data_references(struct reloc_control *rc, ref_offset = btrfs_extent_data_ref_offset(leaf, ref); ref_count = btrfs_extent_data_ref_count(leaf, ref); + if (ref_root == BTRFS_DEDUP_TREE_OBJECTID) + return 0; + /* * This is an extent belonging to the free space cache, lets just delete * it and redo the search. diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 8eb6191..ef3140c 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -320,7 +320,8 @@ enum { Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_check_integrity, Opt_check_integrity_including_extent_data, - Opt_check_integrity_print_mask, Opt_fatal_errors, + Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_dedup, + Opt_dedup_bs, Opt_err, }; @@ -361,6 +362,8 @@ static match_table_t tokens = { {Opt_check_integrity_including_extent_data, "check_int_data"}, {Opt_check_integrity_print_mask, "check_int_print_mask=%d"}, {Opt_fatal_errors, "fatal_errors=%s"}, + {Opt_dedup_bs, "dedup_bs=%s"}, + {Opt_dedup, "dedup"}, {Opt_err, NULL}, }; @@ -626,6 +629,25 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) goto out; } break; + case Opt_dedup_bs: + num = match_strdup(&args[0]); + if (num) { + info->dedup_bs = memparse(num, NULL); + kfree(num); + + if (info->dedup_bs) { + info->dedup_bs = ALIGN(info->dedup_bs, + root->sectorsize); + info->dedup_bs = min_t(u64, + info->dedup_bs, + (128 * 1024ULL)); + } + } + case Opt_dedup: + pr_info("btrfs: use deduplication(dedup blocksize %llu)\n", + (unsigned long long)info->dedup_bs); + btrfs_set_opt(info->mount_opt, DEDUP); + break; case Opt_err: printk(KERN_INFO "btrfs: unrecognized mount option " "'%s'\n", p); @@ -942,6 +964,9 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry) seq_puts(seq, ",skip_balance"); if (btrfs_test_opt(root, PANIC_ON_FATAL_ERROR)) seq_puts(seq, ",fatal_errors=panic"); + if (btrfs_test_opt(root, DEDUP)) + seq_printf(seq, ",dedup_bs=%llu", + (unsigned long long)info->dedup_bs); return 0; } diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 05aed70..5408e60 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -374,6 +374,10 @@ struct btrfs_ioctl_get_dev_stats { __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ }; +/* deduplication control ioctl modes */ +#define BTRFS_DEDUP_CTL_REG 1 +#define BTRFS_DEDUP_CTL_UNREG 2 + #define BTRFS_QUOTA_CTL_ENABLE 1 #define BTRFS_QUOTA_CTL_DISABLE 2 #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED 3 @@ -579,4 +583,5 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code) struct btrfs_ioctl_get_dev_stats) #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \ struct btrfs_ioctl_dev_replace_args) +#define BTRFS_IOC_DEDUP_CTL _IOW(BTRFS_IOCTL_MAGIC, 55, int) #endif /* _UAPI_LINUX_BTRFS_H */ -- 1.7.7 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 5/5] Btrfs: online data deduplication 2013-07-31 15:37 ` [RFC PATCH v5 5/5] Btrfs: online data deduplication Liu Bo @ 2013-07-31 22:50 ` Zach Brown 2013-08-01 10:14 ` Liu Bo 0 siblings, 1 reply; 30+ messages in thread From: Zach Brown @ 2013-07-31 22:50 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs > +#define BTRFS_DEDUP_HASH_SIZE 32 /* 256bit = 32 * 8bit */ > +#define BTRFS_DEDUP_HASH_LEN 4 > + > +struct btrfs_dedup_hash_item { > + /* FIXME: put a hash type field here */ > + > + __le64 hash[BTRFS_DEDUP_HASH_LEN]; > +} __attribute__ ((__packed__)); The handling of hashes in this patch is a mess. The inconsistent use of _SIZE, _LEN, and literal 4 and the u64 *s being passed around is asking for mistakes to be made in the future. And I don't think it's endian safe. I think I'd have a struct that represents the native representation of the tree item. Something like: struct btrfs_dedup_hash { u64 key_value; u8 algo; u8 len; u8 bytes[0]; } You can then have helpers that generate that from either the cryptolib transformation of dedup regions or to and from the tree items. The bytes (and the tree item payload) wouldn't need to have the hash bytes that are stored up in the key. - z ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 5/5] Btrfs: online data deduplication 2013-07-31 22:50 ` Zach Brown @ 2013-08-01 10:14 ` Liu Bo 2013-08-01 18:35 ` Zach Brown 0 siblings, 1 reply; 30+ messages in thread From: Liu Bo @ 2013-08-01 10:14 UTC (permalink / raw) To: Zach Brown; +Cc: linux-btrfs On Wed, Jul 31, 2013 at 03:50:50PM -0700, Zach Brown wrote: > > > +#define BTRFS_DEDUP_HASH_SIZE 32 /* 256bit = 32 * 8bit */ > > +#define BTRFS_DEDUP_HASH_LEN 4 > > + > > +struct btrfs_dedup_hash_item { > > + /* FIXME: put a hash type field here */ > > + > > + __le64 hash[BTRFS_DEDUP_HASH_LEN]; > > +} __attribute__ ((__packed__)); > > The handling of hashes in this patch is a mess. > > The inconsistent use of _SIZE, _LEN, and literal 4 and the u64 *s being > passed around is asking for mistakes to be made in the future. And I > don't think it's endian safe. Yeah, you're right, I missed the endian part for hash. > > I think I'd have a struct that represents the native representation of > the tree item. Something like: > > struct btrfs_dedup_hash { > u64 key_value; > u8 algo; > u8 len; > u8 bytes[0]; > } > > You can then have helpers that generate that from either the cryptolib > transformation of dedup regions or to and from the tree items. The > bytes (and the tree item payload) wouldn't need to have the hash bytes > that are stored up in the key. I agree on merging the two structs, btrfs_dedup_item and btrfs_dedup_hash_item, into one. So do you mean that our whole hash value will be (key.objectid + bytes) because key.objectid is a part of hash value? Thanks for the comments! -liubo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 5/5] Btrfs: online data deduplication 2013-08-01 10:14 ` Liu Bo @ 2013-08-01 18:35 ` Zach Brown 0 siblings, 0 replies; 30+ messages in thread From: Zach Brown @ 2013-08-01 18:35 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs > So do you mean that our whole hash value will be (key.objectid + bytes) > because key.objectid is a part of hash value? I think so, if I understood your question. The idea is to not store the bytes of the hash that make up the objectid more than once so the tree items are smaller. For example: + read_extent_buffer(leaf, hash_in_item, ((unsigned long)hash_item), + BTRFS_DEDUP_HASH_SIZE); That'd be (_SIZE - sizeof(u64)) if the bytes of the hash that made up the object id weren't also stored in the item payload. - z ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] Btrfs-progs: add dedup subcommand 2013-07-31 15:37 [RFC PATCH v5 0/5] Online data deduplication Liu Bo ` (4 preceding siblings ...) 2013-07-31 15:37 ` [RFC PATCH v5 5/5] Btrfs: online data deduplication Liu Bo @ 2013-07-31 15:37 ` Liu Bo 2013-07-31 16:30 ` Stefan Behrens 2013-08-01 22:01 ` Mark Fasheh 2013-07-31 21:20 ` [RFC PATCH v5 0/5] Online data deduplication Josef Bacik 6 siblings, 2 replies; 30+ messages in thread From: Liu Bo @ 2013-07-31 15:37 UTC (permalink / raw) To: linux-btrfs This aims to add deduplication subcommand, 'btrfs dedup command <path>', ie. register/unregister'. It can be used to enable or disable dedup support for a filesystem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- Makefile | 2 +- btrfs.c | 1 + cmds-dedup.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ commands.h | 2 + ctree.h | 2 + ioctl.h | 5 +++ 6 files changed, 112 insertions(+), 1 deletions(-) create mode 100644 cmds-dedup.c diff --git a/Makefile b/Makefile index da7438e..5b4a07d 100644 --- a/Makefile +++ b/Makefile @@ -10,7 +10,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \ cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \ - cmds-restore.o + cmds-restore.o cmds-dedup.o libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \ crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \ diff --git a/btrfs.c b/btrfs.c index 691adef..956905c 100644 --- a/btrfs.c +++ b/btrfs.c @@ -254,6 +254,7 @@ const struct cmd_group btrfs_cmd_group = { { "quota", cmd_quota, NULL, "a_cmd_group, 0 }, { "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 }, { "replace", cmd_replace, NULL, &replace_cmd_group, 0 }, + { "dedup", cmd_dedup, NULL, &dedup_cmd_group, 0 }, { "help", cmd_help, cmd_help_usage, NULL, 0 }, { "version", cmd_version, cmd_version_usage, NULL, 0 }, { 0, 0, 0, 0, 0 } diff --git a/cmds-dedup.c b/cmds-dedup.c new file mode 100644 index 0000000..a977585 --- /dev/null +++ b/cmds-dedup.c @@ -0,0 +1,101 @@ +/* + * Copyright (C) 2013 Oracle. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ + +#include <sys/ioctl.h> +#include <unistd.h> + +#include "ctree.h" +#include "ioctl.h" + +#include "commands.h" +#include "utils.h" + +static const char * const dedup_cmd_group_usage[] = { + "btrfs dedup <command> [options] <path>", + NULL +}; + +int dedup_ctl(int cmd, int argc, char **argv) +{ + int ret = 0; + int fd; + int e; + char *path = argv[1]; + + if (check_argc_exact(argc, 2)) + return -1; + + fd = open_file_or_dir(path); + if (fd < 0) { + fprintf(stderr, "ERROR: can't access '%s'\n", path); + return -EACCES; + } + + ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, cmd); + e = errno; + close(fd); + if (ret < 0) { + fprintf(stderr, "ERROR: dedup command failed: %s\n", + strerror(e)); + if (cmd == BTRFS_DEDUP_CTL_UNREG) + fprintf(stderr, "please refer to 'dmesg | tail' for more info\n"); + return -EINVAL; + } + return 0; +} + +static const char * const cmd_dedup_reg_usage[] = { + "btrfs dedup register <path>", + "Enable data deduplication support for a filesystem.", + NULL +}; + +static int cmd_dedup_reg(int argc, char **argv) +{ + int ret = dedup_ctl(BTRFS_DEDUP_CTL_REG, argc, argv); + if (ret < 0) + usage(cmd_dedup_reg_usage); + return ret; +} + +static const char * const cmd_dedup_unreg_usage[] = { + "btrfs dedup unregister <path>", + "Disable data deduplication support for a filesystem.", + NULL +}; + +static int cmd_dedup_unreg(int argc, char **argv) +{ + int ret = dedup_ctl(BTRFS_DEDUP_CTL_UNREG, argc, argv); + if (ret < 0) + usage(cmd_dedup_unreg_usage); + return ret; +} + +const struct cmd_group dedup_cmd_group = { + dedup_cmd_group_usage, NULL, { + { "register", cmd_dedup_reg, cmd_dedup_reg_usage, NULL, 0 }, + { "unregister", cmd_dedup_unreg, cmd_dedup_unreg_usage, 0, 0 }, + { 0, 0, 0, 0, 0 } + } +}; + +int cmd_dedup(int argc, char **argv) +{ + return handle_command_group(&dedup_cmd_group, argc, argv); +} diff --git a/commands.h b/commands.h index 15c616d..d31afa4 100644 --- a/commands.h +++ b/commands.h @@ -90,6 +90,7 @@ extern const struct cmd_group receive_cmd_group; extern const struct cmd_group quota_cmd_group; extern const struct cmd_group qgroup_cmd_group; extern const struct cmd_group replace_cmd_group; +extern const struct cmd_group dedup_cmd_group; extern const char * const cmd_send_usage[]; extern const char * const cmd_receive_usage[]; @@ -112,6 +113,7 @@ int cmd_restore(int argc, char **argv); int cmd_select_super(int argc, char **argv); int cmd_dump_super(int argc, char **argv); int cmd_debug_tree(int argc, char **argv); +int cmd_dedup(int argc, char **argv); /* subvolume exported functions */ int test_issubvolume(char *path); diff --git a/ctree.h b/ctree.h index 6f086bf..e86ff96 100644 --- a/ctree.h +++ b/ctree.h @@ -467,6 +467,7 @@ struct btrfs_super_block { #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) #define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7) #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8) +#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 9) #define BTRFS_FEATURE_COMPAT_SUPP 0ULL @@ -479,6 +480,7 @@ struct btrfs_super_block { BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \ BTRFS_FEATURE_INCOMPAT_RAID56 | \ BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \ + BTRFS_FEATURE_INCOMPAT_DEDUP | \ BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA) /* diff --git a/ioctl.h b/ioctl.h index abe6dd4..fba1898 100644 --- a/ioctl.h +++ b/ioctl.h @@ -417,6 +417,10 @@ struct btrfs_ioctl_get_dev_stats { __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ }; +/* deduplication control ioctl modes */ +#define BTRFS_DEDUP_CTL_REG 1 +#define BTRFS_DEDUP_CTL_UNREG 2 + /* BTRFS_IOC_SNAP_CREATE is no longer used by the btrfs command */ #define BTRFS_QUOTA_CTL_ENABLE 1 #define BTRFS_QUOTA_CTL_DISABLE 2 @@ -537,6 +541,7 @@ struct btrfs_ioctl_clone_range_args { struct btrfs_ioctl_get_dev_stats) #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \ struct btrfs_ioctl_dev_replace_args) +#define BTRFS_IOC_DEDUP_CTL _IOW(BTRFS_IOCTL_MAGIC, 55, int) #ifdef __cplusplus } -- 1.7.7 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2013-07-31 15:37 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo @ 2013-07-31 16:30 ` Stefan Behrens 2013-08-01 10:17 ` Liu Bo 2013-08-01 22:01 ` Mark Fasheh 1 sibling, 1 reply; 30+ messages in thread From: Stefan Behrens @ 2013-07-31 16:30 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs On Wed, 31 Jul 2013 23:37:46 +0800, Liu Bo wrote: > This aims to add deduplication subcommand, 'btrfs dedup command <path>', > ie. register/unregister'. > > It can be used to enable or disable dedup support for a filesystem. > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > --- > Makefile | 2 +- > btrfs.c | 1 + > cmds-dedup.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > commands.h | 2 + > ctree.h | 2 + > ioctl.h | 5 +++ > 6 files changed, 112 insertions(+), 1 deletions(-) > create mode 100644 cmds-dedup.c No manpage? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2013-07-31 16:30 ` Stefan Behrens @ 2013-08-01 10:17 ` Liu Bo 0 siblings, 0 replies; 30+ messages in thread From: Liu Bo @ 2013-08-01 10:17 UTC (permalink / raw) To: Stefan Behrens; +Cc: linux-btrfs On Wed, Jul 31, 2013 at 06:30:37PM +0200, Stefan Behrens wrote: > On Wed, 31 Jul 2013 23:37:46 +0800, Liu Bo wrote: > > This aims to add deduplication subcommand, 'btrfs dedup command <path>', > > ie. register/unregister'. > > > > It can be used to enable or disable dedup support for a filesystem. > > > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com> > > --- > > Makefile | 2 +- > > btrfs.c | 1 + > > cmds-dedup.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > commands.h | 2 + > > ctree.h | 2 + > > ioctl.h | 5 +++ > > 6 files changed, 112 insertions(+), 1 deletions(-) > > create mode 100644 cmds-dedup.c > > No manpage? > Good reminder, thanks very much. -liubo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2013-07-31 15:37 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo 2013-07-31 16:30 ` Stefan Behrens @ 2013-08-01 22:01 ` Mark Fasheh 2013-08-02 2:29 ` Liu Bo 1 sibling, 1 reply; 30+ messages in thread From: Mark Fasheh @ 2013-08-01 22:01 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs On Wed, Jul 31, 2013 at 11:37:46PM +0800, Liu Bo wrote: > This aims to add deduplication subcommand, 'btrfs dedup command <path>', > ie. register/unregister'. > > It can be used to enable or disable dedup support for a filesystem. This seems to me like it should be a switch on btrfstune instead of a subcommand of the btrfs binary. --Mark -- Mark Fasheh ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2013-08-01 22:01 ` Mark Fasheh @ 2013-08-02 2:29 ` Liu Bo 0 siblings, 0 replies; 30+ messages in thread From: Liu Bo @ 2013-08-02 2:29 UTC (permalink / raw) To: Mark Fasheh; +Cc: linux-btrfs On Thu, Aug 01, 2013 at 03:01:37PM -0700, Mark Fasheh wrote: > On Wed, Jul 31, 2013 at 11:37:46PM +0800, Liu Bo wrote: > > This aims to add deduplication subcommand, 'btrfs dedup command <path>', > > ie. register/unregister'. > > > > It can be used to enable or disable dedup support for a filesystem. > > This seems to me like it should be a switch on btrfstune instead of a > subcommand of the btrfs binary. btrfstune is designed to play with flags if I understand it correctly. Dedup is not about flipping a flag, but creating/deleting a tree. But I'm OK with this idea :) -liubo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 0/5] Online data deduplication 2013-07-31 15:37 [RFC PATCH v5 0/5] Online data deduplication Liu Bo ` (5 preceding siblings ...) 2013-07-31 15:37 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo @ 2013-07-31 21:20 ` Josef Bacik 2013-08-01 10:16 ` Liu Bo 6 siblings, 1 reply; 30+ messages in thread From: Josef Bacik @ 2013-07-31 21:20 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs On Wed, Jul 31, 2013 at 11:37:40PM +0800, Liu Bo wrote: > Data deduplication is a specialized data compression technique for eliminating > duplicate copies of repeating data.[1] > > This patch set is also related to "Content based storage" in project ideas[2]. > > PATCH 1 is a hang fix with deduplication on, but it's also useful without > dedup in practice use. > > PATCH 2 and 3 are targetting delayed refs' scalability problems, which are > uncovered by the dedup feature. > > PATCH 4 is a speed-up improvement, which is about dedup and quota. > > PATCH 5 is full of real things, all details about implementation of dedup. > > Plus, there is also a btrfs-progs patch which helps to enable/disable dedup > feature. > > TODO: > * a bit-to-bit comparison callback. Didn't pass my BUG_ON() search test, try again. Josef ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v5 0/5] Online data deduplication 2013-07-31 21:20 ` [RFC PATCH v5 0/5] Online data deduplication Josef Bacik @ 2013-08-01 10:16 ` Liu Bo 0 siblings, 0 replies; 30+ messages in thread From: Liu Bo @ 2013-08-01 10:16 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On Wed, Jul 31, 2013 at 05:20:27PM -0400, Josef Bacik wrote: > On Wed, Jul 31, 2013 at 11:37:40PM +0800, Liu Bo wrote: > > Data deduplication is a specialized data compression technique for eliminating > > duplicate copies of repeating data.[1] > > > > This patch set is also related to "Content based storage" in project ideas[2]. > > > > PATCH 1 is a hang fix with deduplication on, but it's also useful without > > dedup in practice use. > > > > PATCH 2 and 3 are targetting delayed refs' scalability problems, which are > > uncovered by the dedup feature. > > > > PATCH 4 is a speed-up improvement, which is about dedup and quota. > > > > PATCH 5 is full of real things, all details about implementation of dedup. > > > > Plus, there is also a btrfs-progs patch which helps to enable/disable dedup > > feature. > > > > TODO: > > * a bit-to-bit comparison callback. > > Didn't pass my BUG_ON() search test, try again. I'm cleaning them up :) -liubo ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH v8 00/14] Online(inband) data deduplication @ 2013-12-30 8:12 Liu Bo 2013-12-30 8:12 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo 0 siblings, 1 reply; 30+ messages in thread From: Liu Bo @ 2013-12-30 8:12 UTC (permalink / raw) To: linux-btrfs; +Cc: Marcel Ritter, Christian Robert, alanqk@gmail.com Hello, Here is the New Year patch bomb :-) Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.[1] This patch set is also related to "Content based storage" in project ideas[2], it introduces inband data deduplication for btrfs and dedup/dedupe is for short. PATCH 1 is a hang fix with deduplication on, but it's also useful without dedup in practice use. PATCH 2 and 3 are targetting delayed refs' scalability problems, which are uncovered by the dedup feature. PATCH 4 is a speed-up improvement, which is about dedup and quota. PATCH 5-8 is the preparation work for dedup implementation. PATCH 9 shows how we implement dedup feature. PATCH 10 fixes a backref walking bug with dedup. PATCH 11 fixes a free space bug of dedup extents on error handling. PATCH 12 adds the ioctl to control dedup feature. PATCH 13 fixes the metadata ENOSPC problem with dedup which has been there WAY TOO LONG. PATCH 14 fixes a race bug on dedup writes. And there is also a btrfs-progs patch(PATCH 15) which offers all details about how to control the dedup feature. I've tested this with xfstests by adding a inline dedup 'enable & on' in xfstests' mount and scratch_mount. TODO: * a bit-to-bit comparison callback. All comments are welcome! [1]: http://en.wikipedia.org/wiki/Data_deduplication [2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage v8: - fix the race crash of dedup ref again. - fix the metadata ENOSPC problem with dedup. v7: - rebase onto the lastest btrfs - break a big patch into smaller ones to make reviewers happy. - kill mount options of dedup and use ioctl method instead. - fix two crash due to the special dedup ref For former patch sets: v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512 v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257 v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751 v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433 v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959 Liu Bo (14): Btrfs: skip merge part for delayed data refs Btrfs: improve the delayed refs process in rm case Btrfs: introduce a head ref rbtree Btrfs: disable qgroups accounting when quata_enable is 0 Btrfs: introduce dedup tree and relatives Btrfs: introduce dedup tree operations Btrfs: introduce dedup state Btrfs: make ordered extent aware of dedup Btrfs: online(inband) data dedup Btrfs: skip dedup reference during backref walking Btrfs: don't return space for dedup extent Btrfs: add ioctl of dedup control Btrfs: fix dedupe 'ENOSPC' problem Btrfs: fix a crash of dedup ref fs/btrfs/backref.c | 9 + fs/btrfs/ctree.c | 2 +- fs/btrfs/ctree.h | 86 ++++++ fs/btrfs/delayed-ref.c | 161 +++++++---- fs/btrfs/delayed-ref.h | 8 + fs/btrfs/disk-io.c | 40 +++ fs/btrfs/extent-tree.c | 208 ++++++++++++-- fs/btrfs/extent_io.c | 22 +- fs/btrfs/extent_io.h | 16 ++ fs/btrfs/file-item.c | 244 +++++++++++++++++ fs/btrfs/inode.c | 635 ++++++++++++++++++++++++++++++++++++++----- fs/btrfs/ioctl.c | 167 ++++++++++++ fs/btrfs/ordered-data.c | 38 ++- fs/btrfs/ordered-data.h | 13 +- fs/btrfs/qgroup.c | 3 + fs/btrfs/relocation.c | 3 + fs/btrfs/transaction.c | 4 +- include/trace/events/btrfs.h | 3 +- include/uapi/linux/btrfs.h | 11 + 19 files changed, 1501 insertions(+), 172 deletions(-) -- 1.8.2.1 ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] Btrfs-progs: add dedup subcommand 2013-12-30 8:12 [RFC PATCH v8 00/14] Online(inband) " Liu Bo @ 2013-12-30 8:12 ` Liu Bo 2013-12-30 11:34 ` Martin Steigerwald 2014-01-14 17:34 ` David Sterba 0 siblings, 2 replies; 30+ messages in thread From: Liu Bo @ 2013-12-30 8:12 UTC (permalink / raw) To: linux-btrfs; +Cc: Marcel Ritter, Christian Robert, alanqk@gmail.com This adds deduplication subcommands, 'btrfs dedup command <path>', including enable/disable/on/off. - btrfs dedup enable Create the dedup tree, and it's the very first step when you're going to use the dedup feature. - btrfs dedup disable Delete the dedup tree, after this we're not able to use dedup any more unless you enable it again. - btrfs dedup on [-b] Switch on the dedup feature temporarily, and it's the second step of applying dedup with writes. Option '-b' is used to set dedup blocksize. The default blocksize is 8192(no special reason, you may argue), and the current limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the upper limit of btrfs's compression. - btrfs dedup off Switch off the dedup feature temporarily, but the dedup tree remains. --------------------------------------------------------- Usage: Step 1: btrfs dedup enable /btrfs Step 2: btrfs dedup on /btrfs or btrfs dedup on -b 4K /btrfs Step 3: now we have dedup, run your test. Step 4: btrfs dedup off /btrfs Step 5: btrfs dedup disable /btrfs --------------------------------------------------------- v3: add commands 'btrfs dedup on/off' v2: add manpage Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- Makefile | 3 +- btrfs.c | 1 + cmds-dedup.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ commands.h | 2 + ctree.h | 2 + ioctl.h | 12 ++++ man/btrfs.8.in | 31 +++++++++- 7 files changed, 225 insertions(+), 4 deletions(-) create mode 100644 cmds-dedup.c diff --git a/Makefile b/Makefile index 0874a41..092f2db 100644 --- a/Makefile +++ b/Makefile @@ -13,7 +13,8 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \ cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \ - cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o + cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \ + cmds-dedup.o libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \ uuid-tree.o libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \ diff --git a/btrfs.c b/btrfs.c index d5fc738..dfae35f 100644 --- a/btrfs.c +++ b/btrfs.c @@ -255,6 +255,7 @@ static const struct cmd_group btrfs_cmd_group = { { "quota", cmd_quota, NULL, "a_cmd_group, 0 }, { "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 }, { "replace", cmd_replace, NULL, &replace_cmd_group, 0 }, + { "dedup", cmd_dedup, NULL, &dedup_cmd_group, 0 }, { "help", cmd_help, cmd_help_usage, NULL, 0 }, { "version", cmd_version, cmd_version_usage, NULL, 0 }, NULL_CMD_STRUCT diff --git a/cmds-dedup.c b/cmds-dedup.c new file mode 100644 index 0000000..b959349 --- /dev/null +++ b/cmds-dedup.c @@ -0,0 +1,178 @@ +/* + * Copyright (C) 2013 Oracle. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ + +#include <sys/ioctl.h> +#include <unistd.h> +#include <getopt.h> + +#include "ctree.h" +#include "ioctl.h" + +#include "commands.h" +#include "utils.h" + +static const char * const dedup_cmd_group_usage[] = { + "btrfs dedup <command> [options] <path>", + NULL +}; + +int dedup_ctl(char *path, struct btrfs_ioctl_dedup_args *args) +{ + int ret = 0; + int fd; + int e; + DIR *dirstream = NULL; + + fd = open_file_or_dir(path, &dirstream); + if (fd < 0) { + fprintf(stderr, "ERROR: can't access '%s'\n", path); + return -EACCES; + } + + ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, args); + e = errno; + close_file_or_dir(fd, dirstream); + if (ret < 0) { + fprintf(stderr, "ERROR: dedup command failed: %s\n", + strerror(e)); + if (args->cmd == BTRFS_DEDUP_CTL_DISABLE || + args->cmd == BTRFS_DEDUP_CTL_SET_BS) + fprintf(stderr, "please refer to 'dmesg | tail' for more info\n"); + return -EINVAL; + } + return 0; +} + +static const char * const cmd_dedup_enable_usage[] = { + "btrfs dedup enable <path>", + "Enable data deduplication support for a filesystem.", + NULL +}; + +static int cmd_dedup_enable(int argc, char **argv) +{ + struct btrfs_ioctl_dedup_args dargs; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedup_enable_usage); + + dargs.cmd = BTRFS_DEDUP_CTL_ENABLE; + + return dedup_ctl(argv[1], &dargs); +} + +static const char * const cmd_dedup_disable_usage[] = { + "btrfs dedup disable <path>", + "Disable data deduplication support for a filesystem.", + NULL +}; + +static int cmd_dedup_disable(int argc, char **argv) +{ + struct btrfs_ioctl_dedup_args dargs; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedup_disable_usage); + + dargs.cmd = BTRFS_DEDUP_CTL_DISABLE; + + return dedup_ctl(argv[1], &dargs); +} + +static int dedup_set_bs(char *path, struct btrfs_ioctl_dedup_args *dargs) +{ + return dedup_ctl(path, dargs); +} + +static const char * const cmd_dedup_on_usage[] = { + "btrfs dedup on [-b|--bs size] <path>", + "Switch on data deduplication or change the dedup blocksize.", + "", + "-b|--bs <size> set dedup blocksize", + NULL +}; + +static struct option longopts[] = { + {"bs", required_argument, NULL, 'b'}, + {0, 0, 0, 0} +}; + +static int cmd_dedup_on(int argc, char **argv) +{ + struct btrfs_ioctl_dedup_args dargs; + u64 bs = 8192; + + optind = 1; + while (1) { + int longindex; + + int c = getopt_long(argc, argv, "b:", longopts, &longindex); + if (c < 0) + break; + + switch (c) { + case 'b': + bs = parse_size(optarg); + break; + default: + usage(cmd_dedup_on_usage); + } + } + + if (check_argc_exact(argc - optind, 1)) + usage(cmd_dedup_on_usage); + + dargs.cmd = BTRFS_DEDUP_CTL_SET_BS; + dargs.bs = bs; + + return dedup_set_bs(argv[optind], &dargs); +} + +static const char * const cmd_dedup_off_usage[] = { + "btrfs dedup off <path>", + "Switch off data deduplication.", + NULL +}; + +static int cmd_dedup_off(int argc, char **argv) +{ + struct btrfs_ioctl_dedup_args dargs; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedup_off_usage); + + dargs.cmd = BTRFS_DEDUP_CTL_SET_BS; + dargs.bs = 0; + + return dedup_set_bs(argv[1], &dargs); +} + +const struct cmd_group dedup_cmd_group = { + dedup_cmd_group_usage, NULL, { + { "enable", cmd_dedup_enable, cmd_dedup_enable_usage, NULL, 0 }, + { "disable", cmd_dedup_disable, cmd_dedup_disable_usage, 0, 0 }, + { "on", cmd_dedup_on, cmd_dedup_on_usage, NULL, 0}, + { "off", cmd_dedup_off, cmd_dedup_off_usage, NULL, 0}, + { 0, 0, 0, 0, 0 } + } +}; + +int cmd_dedup(int argc, char **argv) +{ + return handle_command_group(&dedup_cmd_group, argc, argv); +} diff --git a/commands.h b/commands.h index b791d68..6fccc15 100644 --- a/commands.h +++ b/commands.h @@ -91,6 +91,7 @@ extern const struct cmd_group quota_cmd_group; extern const struct cmd_group qgroup_cmd_group; extern const struct cmd_group replace_cmd_group; extern const struct cmd_group rescue_cmd_group; +extern const struct cmd_group dedup_cmd_group; extern const char * const cmd_send_usage[]; extern const char * const cmd_receive_usage[]; @@ -119,6 +120,7 @@ int cmd_select_super(int argc, char **argv); int cmd_dump_super(int argc, char **argv); int cmd_debug_tree(int argc, char **argv); int cmd_rescue(int argc, char **argv); +int cmd_dedup(int argc, char **argv); /* subvolume exported functions */ int test_issubvolume(char *path); diff --git a/ctree.h b/ctree.h index 2117374..27c5897 100644 --- a/ctree.h +++ b/ctree.h @@ -470,6 +470,7 @@ struct btrfs_super_block { #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) #define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7) #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8) +#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 9) #define BTRFS_FEATURE_COMPAT_SUPP 0ULL @@ -482,6 +483,7 @@ struct btrfs_super_block { BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \ BTRFS_FEATURE_INCOMPAT_RAID56 | \ BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \ + BTRFS_FEATURE_INCOMPAT_DEDUP | \ BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA) /* diff --git a/ioctl.h b/ioctl.h index a589cd7..d4c6423 100644 --- a/ioctl.h +++ b/ioctl.h @@ -430,6 +430,15 @@ struct btrfs_ioctl_get_dev_stats { __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ }; +/* deduplication control ioctl modes */ +#define BTRFS_DEDUP_CTL_ENABLE 1 +#define BTRFS_DEDUP_CTL_DISABLE 2 +#define BTRFS_DEDUP_CTL_SET_BS 3 +struct btrfs_ioctl_dedup_args { + __u64 cmd; + __u64 bs; +}; + /* BTRFS_IOC_SNAP_CREATE is no longer used by the btrfs command */ #define BTRFS_QUOTA_CTL_ENABLE 1 #define BTRFS_QUOTA_CTL_DISABLE 2 @@ -593,6 +602,9 @@ struct btrfs_ioctl_clone_range_args { struct btrfs_ioctl_get_dev_stats) #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \ struct btrfs_ioctl_dev_replace_args) +#define BTRFS_IOC_DEDUP_CTL _IOWR(BTRFS_IOCTL_MAGIC, 55, \ + struct btrfs_ioctl_dedup_args) + #ifdef __cplusplus } #endif diff --git a/man/btrfs.8.in b/man/btrfs.8.in index b620348..56cdf1b 100644 --- a/man/btrfs.8.in +++ b/man/btrfs.8.in @@ -109,13 +109,22 @@ btrfs \- control a btrfs filesystem .PP \fBbtrfs\fP \fBqgroup limit\fP [\fIoptions\fP] \fI<size>\fP|\fBnone\fP [\fI<qgroupid>\fP] \fI<path>\fP .PP -.PP \fBbtrfs\fP \fBreplace start\fP [-Bfr] \fI<srcdev>\fP|\fI<devid> <targetdev> <mount_point>\fP .PP \fBbtrfs\fP \fBreplace status\fP [-1] \fI<mount_point>\fP .PP \fBbtrfs\fP \fBreplace cancel\fP \fI<mount_point>\fP .PP +\fBbtrfs\fP \fBdedup enable\fP \fI<path>\fP +.PP +\fBbtrfs\fP \fBdedup disable\fP \fI<path>\fP +.PP +\fBbtrfs\fP \fBdedup on\fP [-b|--bs \fIsize\fP] \fI<path>\fP +.PP +\fBbtrfs\fP \fBdedup off\fP \fI<path>\fP +.PP +.PP + \fBbtrfs\fP \fBhelp|\-\-help \fP .PP \fBbtrfs\fP \fB<command> \-\-help \fP @@ -739,12 +748,28 @@ Print status and progress information of a running device replace operation. .IP "\fB-1\fP" 5 print once instead of print continuously until the replace operation finishes (or is canceled) -.RE -.TP \fBreplace cancel\fR \fI<mount_point>\fR Cancel a running device replace operation. .RE +.TP + +\fBdedup enable\fP \fI<path>\fP +Enable data deduplication support for a filesystem. +.TP + +\fBdedup disable\fP \fI<path>\fP +Disable data deduplication support for a filesystem. +.TP + +\fBdedup on\fP [-b|--bs \fIsize\fP] \fI<path>\fP +Switch on data deduplication or change the dedup blocksize. +.TP + +\fBdedup off\fP \fI<path>\fP +Switch off data deduplication. +.RE +.TP .SH EXIT STATUS \fBbtrfs\fR returns a zero exist status if it succeeds. Non zero is returned in -- 1.8.2.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2013-12-30 8:12 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo @ 2013-12-30 11:34 ` Martin Steigerwald 2013-12-31 3:18 ` Liu Bo 2013-12-31 3:24 ` Kai Krakow 2014-01-14 17:34 ` David Sterba 1 sibling, 2 replies; 30+ messages in thread From: Martin Steigerwald @ 2013-12-30 11:34 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs, Marcel Ritter, Christian Robert, alanqk@gmail.com Am Montag, 30. Dezember 2013, 16:12:55 schrieben Sie: > This adds deduplication subcommands, 'btrfs dedup command <path>', > including enable/disable/on/off. Nice. Looking forward to test it. > - btrfs dedup enable > Create the dedup tree, and it's the very first step when you're going to use > the dedup feature. > > - btrfs dedup disable > Delete the dedup tree, after this we're not able to use dedup any more > unless you enable it again. So if deduplication has been switched on for a while, btrfs dedup disable will cause BTRFS to undo the deduplication (and thus require more space for the same amount of data)? Thanks and happy new year, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2013-12-30 11:34 ` Martin Steigerwald @ 2013-12-31 3:18 ` Liu Bo 2013-12-31 3:24 ` Kai Krakow 1 sibling, 0 replies; 30+ messages in thread From: Liu Bo @ 2013-12-31 3:18 UTC (permalink / raw) To: Martin Steigerwald Cc: linux-btrfs, Marcel Ritter, Christian Robert, alanqk@gmail.com On Mon, Dec 30, 2013 at 12:34:42PM +0100, Martin Steigerwald wrote: > Am Montag, 30. Dezember 2013, 16:12:55 schrieben Sie: > > This adds deduplication subcommands, 'btrfs dedup command <path>', > > including enable/disable/on/off. > > Nice. Looking forward to test it. Well, I just got a report from another user, Marcel, who still got ENOSPC errors with this around of patch set, so it seems that I don't really fix that bug, I guess I have to work harder on this :-( > > > - btrfs dedup enable > > Create the dedup tree, and it's the very first step when you're going to use > > the dedup feature. > > > > - btrfs dedup disable > > Delete the dedup tree, after this we're not able to use dedup any more > > unless you enable it again. > > So if deduplication has been switched on for a while, btrfs dedup disable will > cause BTRFS to undo the deduplication (and thus require more space for the > same amount of data)? No, it remains unchanged, and the data is independent of dedupe, so you can read them without any problems. Happy new year. Thanks, -liubo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2013-12-30 11:34 ` Martin Steigerwald 2013-12-31 3:18 ` Liu Bo @ 2013-12-31 3:24 ` Kai Krakow 1 sibling, 0 replies; 30+ messages in thread From: Kai Krakow @ 2013-12-31 3:24 UTC (permalink / raw) To: linux-btrfs Martin Steigerwald <Martin@lichtvoll.de> schrieb: >> - btrfs dedup disable >> Delete the dedup tree, after this we're not able to use dedup any more >> unless you enable it again. > > So if deduplication has been switched on for a while, btrfs dedup disable > will cause BTRFS to undo the deduplication (and thus require more space > for the same amount of data)? >From my intention I would guess it just looses track of what the content is in "content based storage" - so when re-enabling it will have to "learn" from beginning. It should not unshare data as sharing extents is a feature of btrfs disting from the function of online dedup itself. At least that would sound reasonable to me. Regards, Kai ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2013-12-30 8:12 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo 2013-12-30 11:34 ` Martin Steigerwald @ 2014-01-14 17:34 ` David Sterba 2014-01-15 1:35 ` Liu Bo 1 sibling, 1 reply; 30+ messages in thread From: David Sterba @ 2014-01-14 17:34 UTC (permalink / raw) To: Liu Bo; +Cc: linux-btrfs, Marcel Ritter, Christian Robert, alanqk@gmail.com On Mon, Dec 30, 2013 at 04:12:55PM +0800, Liu Bo wrote: > --- a/ctree.h > +++ b/ctree.h > @@ -470,6 +470,7 @@ struct btrfs_super_block { > #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) > #define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7) > #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8) > +#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 9) FYI, this incompat bit is taken by Josef's NO_HOLE feature, now heading to next merge window, so I've used 10 for dedup in integration branch. > --- a/ioctl.h > +++ b/ioctl.h > @@ -430,6 +430,15 @@ struct btrfs_ioctl_get_dev_stats { > __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ > }; > > +/* deduplication control ioctl modes */ > +#define BTRFS_DEDUP_CTL_ENABLE 1 > +#define BTRFS_DEDUP_CTL_DISABLE 2 > +#define BTRFS_DEDUP_CTL_SET_BS 3 > +struct btrfs_ioctl_dedup_args { > + __u64 cmd; > + __u64 bs; I've spotted that you did not reserve any space for future extensions of the ioctl, especially the in-band dedup can be quite heavy, I think we'll want some tunables in the future. > +}; > + > /* BTRFS_IOC_SNAP_CREATE is no longer used by the btrfs command */ > #define BTRFS_QUOTA_CTL_ENABLE 1 > #define BTRFS_QUOTA_CTL_DISABLE 2 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2014-01-14 17:34 ` David Sterba @ 2014-01-15 1:35 ` Liu Bo 2014-01-17 16:14 ` David Sterba 0 siblings, 1 reply; 30+ messages in thread From: Liu Bo @ 2014-01-15 1:35 UTC (permalink / raw) To: dsterba; +Cc: linux-btrfs On Tue, Jan 14, 2014 at 06:34:19PM +0100, David Sterba wrote: > On Mon, Dec 30, 2013 at 04:12:55PM +0800, Liu Bo wrote: > > --- a/ctree.h > > +++ b/ctree.h > > @@ -470,6 +470,7 @@ struct btrfs_super_block { > > #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) > > #define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7) > > #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8) > > +#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 9) > > FYI, this incompat bit is taken by Josef's NO_HOLE feature, now heading > to next merge window, so I've used 10 for dedup in integration branch. Thanks for the notice, David, I'll update it in the next version. > > > --- a/ioctl.h > > +++ b/ioctl.h > > @@ -430,6 +430,15 @@ struct btrfs_ioctl_get_dev_stats { > > __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ > > }; > > > > +/* deduplication control ioctl modes */ > > +#define BTRFS_DEDUP_CTL_ENABLE 1 > > +#define BTRFS_DEDUP_CTL_DISABLE 2 > > +#define BTRFS_DEDUP_CTL_SET_BS 3 > > +struct btrfs_ioctl_dedup_args { > > + __u64 cmd; > > + __u64 bs; > > I've spotted that you did not reserve any space for future extensions of > the ioctl, especially the in-band dedup can be quite heavy, I think > we'll want some tunables in the future. > Good point, I think 128 bytes can be enough for that, agree? I didn't put much notice on that because I kept stuck in fixing more urgent ENOSPC problem, it looks like I can hardly bypass this endless issue with dedup enabled, I have to fix it thoroughly... Thanks, -liubo ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] Btrfs-progs: add dedup subcommand 2014-01-15 1:35 ` Liu Bo @ 2014-01-17 16:14 ` David Sterba 0 siblings, 0 replies; 30+ messages in thread From: David Sterba @ 2014-01-17 16:14 UTC (permalink / raw) To: Liu Bo; +Cc: dsterba, linux-btrfs On Wed, Jan 15, 2014 at 09:35:17AM +0800, Liu Bo wrote: > > > @@ -430,6 +430,15 @@ struct btrfs_ioctl_get_dev_stats { > > > __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ > > > }; > > > > > > +/* deduplication control ioctl modes */ > > > +#define BTRFS_DEDUP_CTL_ENABLE 1 > > > +#define BTRFS_DEDUP_CTL_DISABLE 2 > > > +#define BTRFS_DEDUP_CTL_SET_BS 3 > > > +struct btrfs_ioctl_dedup_args { > > > + __u64 cmd; > > > + __u64 bs; > > > > I've spotted that you did not reserve any space for future extensions of > > the ioctl, especially the in-band dedup can be quite heavy, I think > > we'll want some tunables in the future. > > > > Good point, I think 128 bytes can be enough for that, agree? Yes, should be enough. ^ permalink raw reply [flat|nested] 30+ messages in thread
* [RFC PATCH V4 0/2] Online data deduplication
@ 2013-05-14 12:08 Liu Bo
  2013-05-14 12:08 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo
  0 siblings, 1 reply; 30+ messages in thread
From: Liu Bo @ 2013-05-14 12:08 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, jbacik, g2p.code
Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]
This patch set is also related to "Content based storage" in project ideas[2].
PATCH 1 is a hang fix with deduplication on, but it's also useful with no
deduplication in practice use.
For more implementation details, please refer to PATCH 2.
Plus, there is also a btrfs-progs patch which helps to enable/disable dedup
feature.
TODO:
* a bit-to-bit comparison callback.
All comments are welcome!
[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage
v4:
  * add INCOMPAT flag so that old kernel won't mount with a dedup btrfs.
  * elaborate error handling.
  * address a compress bug.
  * address an issue of dedup flag on extent state tree.
  * add new dedup ioctl interface.
v3:
  * add COMPRESS support
  * add a real ioctl to enable dedup feature
  * change the maximum allowed dedup blocksize to 128k because of compression
    range limit
v2:
  * To avoid enlarging the file extent item's size, add another index key used
    for freeing dedup extent.
  * Freeing dedup extent is now like how we delete checksum.
  * Add support for alternative deduplicatin blocksize larger than PAGESIZE.
  * Add a mount option to set deduplication blocksize.
  * Add support for those writes that are smaller than deduplication blocksize.
-----------------------------------------------------------
* HOW To turn deduplication on:
There are 2 steps you need to do before using it,
1) mount /dev/disk /mnt_of_your_btrfs -o dedup
   (or mount /dev/disk /mnt_of_your_btrfs -o dedup_bs=128K)
2) btrfs dedup register /mnt_of_your_btrfs
-----------------------------------------------------------
* HOW To turn deduplication off:
Just mount your btrfs without "-o dedup" or "-o dedup_bs=xxxK"
-----------------------------------------------------------
* HOW To disable deduplication completely:
There are 2 steps you need to do before using it,
1) mount your btrfs WITHOUT "-o dedup" or "-o dedup_bs=xxxK"
2) btrfs dedup unregister /mnt_fs_your_btrfs
(NOTE: 'unregister' won't work unless you do step 1 FIRSTLY.)
-----------------------------------------------------------
Liu Bo (2):
  Btrfs: skip merge part for delayed data refs
  Btrfs: online data deduplication
 fs/btrfs/ctree.h           |   63 +++++
 fs/btrfs/delayed-ref.c     |    7 +
 fs/btrfs/disk-io.c         |   34 +++-
 fs/btrfs/extent-tree.c     |    9 +
 fs/btrfs/extent_io.c       |   29 ++-
 fs/btrfs/extent_io.h       |   16 ++
 fs/btrfs/file-item.c       |  274 +++++++++++++++++++
 fs/btrfs/inode.c           |  630 ++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/ioctl.c           |   93 +++++++
 fs/btrfs/ordered-data.c    |   34 ++-
 fs/btrfs/ordered-data.h    |   11 +-
 fs/btrfs/super.c           |   27 ++-
 include/uapi/linux/btrfs.h |    5 +
 13 files changed, 1141 insertions(+), 91 deletions(-)
-- 
1.7.7
^ permalink raw reply	[flat|nested] 30+ messages in thread* [PATCH] Btrfs-progs: add dedup subcommand 2013-05-14 12:08 [RFC PATCH V4 0/2] Online data deduplication Liu Bo @ 2013-05-14 12:08 ` Liu Bo 0 siblings, 0 replies; 30+ messages in thread From: Liu Bo @ 2013-05-14 12:08 UTC (permalink / raw) To: linux-btrfs; +Cc: dsterba, jbacik, g2p.code This aims to add deduplication subcommand, 'btrfs dedup command <path>', ie. register/unregister'. It can be used to enable or disable dedup support for a filesystem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> --- Makefile | 2 +- btrfs.c | 1 + cmds-dedup.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ commands.h | 2 + ctree.h | 2 + ioctl.h | 5 +++ 6 files changed, 112 insertions(+), 1 deletions(-) create mode 100644 cmds-dedup.c diff --git a/Makefile b/Makefile index 9c195b3..c423184 100644 --- a/Makefile +++ b/Makefile @@ -11,7 +11,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \ cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \ - cmds-restore.o + cmds-restore.o cmds-dedup.o libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \ crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \ diff --git a/btrfs.c b/btrfs.c index 691adef..956905c 100644 --- a/btrfs.c +++ b/btrfs.c @@ -254,6 +254,7 @@ const struct cmd_group btrfs_cmd_group = { { "quota", cmd_quota, NULL, "a_cmd_group, 0 }, { "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 }, { "replace", cmd_replace, NULL, &replace_cmd_group, 0 }, + { "dedup", cmd_dedup, NULL, &dedup_cmd_group, 0 }, { "help", cmd_help, cmd_help_usage, NULL, 0 }, { "version", cmd_version, cmd_version_usage, NULL, 0 }, { 0, 0, 0, 0, 0 } diff --git a/cmds-dedup.c b/cmds-dedup.c new file mode 100644 index 0000000..a977585 --- /dev/null +++ b/cmds-dedup.c @@ -0,0 +1,101 @@ +/* + * Copyright (C) 2013 Oracle. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ + +#include <sys/ioctl.h> +#include <unistd.h> + +#include "ctree.h" +#include "ioctl.h" + +#include "commands.h" +#include "utils.h" + +static const char * const dedup_cmd_group_usage[] = { + "btrfs dedup <command> [options] <path>", + NULL +}; + +int dedup_ctl(int cmd, int argc, char **argv) +{ + int ret = 0; + int fd; + int e; + char *path = argv[1]; + + if (check_argc_exact(argc, 2)) + return -1; + + fd = open_file_or_dir(path); + if (fd < 0) { + fprintf(stderr, "ERROR: can't access '%s'\n", path); + return -EACCES; + } + + ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, cmd); + e = errno; + close(fd); + if (ret < 0) { + fprintf(stderr, "ERROR: dedup command failed: %s\n", + strerror(e)); + if (cmd == BTRFS_DEDUP_CTL_UNREG) + fprintf(stderr, "please refer to 'dmesg | tail' for more info\n"); + return -EINVAL; + } + return 0; +} + +static const char * const cmd_dedup_reg_usage[] = { + "btrfs dedup register <path>", + "Enable data deduplication support for a filesystem.", + NULL +}; + +static int cmd_dedup_reg(int argc, char **argv) +{ + int ret = dedup_ctl(BTRFS_DEDUP_CTL_REG, argc, argv); + if (ret < 0) + usage(cmd_dedup_reg_usage); + return ret; +} + +static const char * const cmd_dedup_unreg_usage[] = { + "btrfs dedup unregister <path>", + "Disable data deduplication support for a filesystem.", + NULL +}; + +static int cmd_dedup_unreg(int argc, char **argv) +{ + int ret = dedup_ctl(BTRFS_DEDUP_CTL_UNREG, argc, argv); + if (ret < 0) + usage(cmd_dedup_unreg_usage); + return ret; +} + +const struct cmd_group dedup_cmd_group = { + dedup_cmd_group_usage, NULL, { + { "register", cmd_dedup_reg, cmd_dedup_reg_usage, NULL, 0 }, + { "unregister", cmd_dedup_unreg, cmd_dedup_unreg_usage, 0, 0 }, + { 0, 0, 0, 0, 0 } + } +}; + +int cmd_dedup(int argc, char **argv) +{ + return handle_command_group(&dedup_cmd_group, argc, argv); +} diff --git a/commands.h b/commands.h index 15c616d..d31afa4 100644 --- a/commands.h +++ b/commands.h @@ -90,6 +90,7 @@ extern const struct cmd_group receive_cmd_group; extern const struct cmd_group quota_cmd_group; extern const struct cmd_group qgroup_cmd_group; extern const struct cmd_group replace_cmd_group; +extern const struct cmd_group dedup_cmd_group; extern const char * const cmd_send_usage[]; extern const char * const cmd_receive_usage[]; @@ -112,6 +113,7 @@ int cmd_restore(int argc, char **argv); int cmd_select_super(int argc, char **argv); int cmd_dump_super(int argc, char **argv); int cmd_debug_tree(int argc, char **argv); +int cmd_dedup(int argc, char **argv); /* subvolume exported functions */ int test_issubvolume(char *path); diff --git a/ctree.h b/ctree.h index 9442e03..903969d 100644 --- a/ctree.h +++ b/ctree.h @@ -455,6 +455,7 @@ struct btrfs_super_block { */ #define BTRFS_FEATURE_INCOMPAT_BIG_METADATA (1ULL << 5) #define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7) +#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 9) #define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF (1ULL << 6) @@ -467,6 +468,7 @@ struct btrfs_super_block { BTRFS_FEATURE_INCOMPAT_BIG_METADATA | \ BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \ BTRFS_FEATURE_INCOMPAT_RAID56 | \ + BTRFS_FEATURE_INCOMPAT_DEDUP | \ BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS) /* diff --git a/ioctl.h b/ioctl.h index e841913..106901f 100644 --- a/ioctl.h +++ b/ioctl.h @@ -384,6 +384,10 @@ struct btrfs_ioctl_send_args { __u64 reserved[4]; /* in */ }; +/* deduplication control ioctl modes */ +#define BTRFS_DEDUP_CTL_REG 1 +#define BTRFS_DEDUP_CTL_UNREG 2 + #define BTRFS_QUOTA_CTL_ENABLE 1 #define BTRFS_QUOTA_CTL_DISABLE 2 #define BTRFS_QUOTA_CTL_RESCAN 3 @@ -528,6 +532,7 @@ struct btrfs_ioctl_clone_range_args { struct btrfs_ioctl_get_dev_stats) #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \ struct btrfs_ioctl_dev_replace_args) +#define BTRFS_IOC_DEDUP_CTL _IOW(BTRFS_IOCTL_MAGIC, 55, int) #ifdef __cplusplus } -- 1.7.7 ^ permalink raw reply related [flat|nested] 30+ messages in thread
end of thread, other threads:[~2014-01-17 16:14 UTC | newest] Thread overview: 30+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-07-31 15:37 [RFC PATCH v5 0/5] Online data deduplication Liu Bo 2013-07-31 15:37 ` [RFC PATCH v5 1/5] Btrfs: skip merge part for delayed data refs Liu Bo 2013-07-31 15:37 ` [RFC PATCH v5 2/5] Btrfs: improve the delayed refs process in rm case Liu Bo 2013-07-31 16:45 ` Stefan Behrens 2013-07-31 15:37 ` [RFC PATCH v5 3/5] Btrfs: introduce a head ref rbtree Liu Bo 2013-07-31 21:19 ` Zach Brown 2013-07-31 15:37 ` [RFC PATCH v5 4/5] Btrfs: disable qgroups accounting when quota is off Liu Bo 2013-08-05 12:34 ` Jan Schmidt 2013-08-05 14:18 ` Liu Bo 2013-08-05 15:10 ` Jan Schmidt 2013-08-06 2:25 ` Liu Bo 2013-07-31 15:37 ` [RFC PATCH v5 5/5] Btrfs: online data deduplication Liu Bo 2013-07-31 22:50 ` Zach Brown 2013-08-01 10:14 ` Liu Bo 2013-08-01 18:35 ` Zach Brown 2013-07-31 15:37 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo 2013-07-31 16:30 ` Stefan Behrens 2013-08-01 10:17 ` Liu Bo 2013-08-01 22:01 ` Mark Fasheh 2013-08-02 2:29 ` Liu Bo 2013-07-31 21:20 ` [RFC PATCH v5 0/5] Online data deduplication Josef Bacik 2013-08-01 10:16 ` Liu Bo -- strict thread matches above, loose matches on Subject: below -- 2013-12-30 8:12 [RFC PATCH v8 00/14] Online(inband) " Liu Bo 2013-12-30 8:12 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo 2013-12-30 11:34 ` Martin Steigerwald 2013-12-31 3:18 ` Liu Bo 2013-12-31 3:24 ` Kai Krakow 2014-01-14 17:34 ` David Sterba 2014-01-15 1:35 ` Liu Bo 2014-01-17 16:14 ` David Sterba 2013-05-14 12:08 [RFC PATCH V4 0/2] Online data deduplication Liu Bo 2013-05-14 12:08 ` [PATCH] Btrfs-progs: add dedup subcommand Liu Bo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).