* [RFC PATCH v10 00/16] Online(inband) data deduplication
@ 2014-04-10 3:48 Liu Bo
2014-04-10 3:48 ` [PATCH v10 01/16] Btrfs: disable qgroups accounting when quota_enable is 0 Liu Bo
` (19 more replies)
0 siblings, 20 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
Hello,
This is the 10th attempt at in-band data dedupe, based on the Linux _3.14_ kernel.
Data deduplication is a specialized data compression technique for eliminating
duplicate copies of repeating data.[1]
This patch set is also related to "Content based storage" in the project ideas[2];
it introduces inband data deduplication for btrfs ("dedup"/"dedupe" for short).
* PATCH 1 is a speed-up improvement concerning dedup and quota.
* PATCHES 2-5 are the preparation work for the dedup implementation.
* PATCH 6 shows how we implement the dedup feature.
* PATCH 7 fixes a backref walking bug with dedup.
* PATCH 8 fixes a free space bug of dedup extents in error handling.
* PATCH 9 adds the ioctl to control the dedup feature.
* PATCH 10 targets the delayed refs' scalability problem of deleting refs, which
is uncovered by the dedup feature.
* PATCHES 11-16 fix dedupe bugs, including a race, a deadlock, an abnormal
transaction abort and a crash.
* The btrfs-progs patch (PATCH 17) offers all the details about how to control
the dedup feature on the progs side.
I've tested this with xfstests by adding an inline dedup 'enable & on' in
xfstests' mount and scratch_mount.
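For readers new to the idea, the dedup write path can be sketched in userspace C: hash each incoming block, look the hash up, and either really write the block or just take a reference to the existing copy. The FNV-1a hash and the flat table below are stand-ins for the kernel's SHA-256 cryptoapi driver and the on-disk dedup tree; all names here are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64

struct dedup_entry {
	uint64_t hash;
	uint64_t bytenr;
	int used;
};

static struct dedup_entry dedup_table[TABLE_SIZE];
static uint64_t next_bytenr = 4096;

/* FNV-1a, a stand-in for the SHA-256 digest the kernel computes */
static uint64_t block_hash(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t h = 0xcbf29ce484222325ULL;

	while (len--)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

/*
 * Decide where a block lands on disk: reuse an existing extent when the
 * hash matches (real btrfs adds a backref), otherwise "write" it fresh.
 */
static uint64_t dedup_write(const void *block, size_t len, int *was_dup)
{
	uint64_t h = block_hash(block, len);
	struct dedup_entry *e = &dedup_table[h % TABLE_SIZE];

	if (e->used && e->hash == h) {
		*was_dup = 1;
		return e->bytenr;
	}
	*was_dup = 0;
	e->used = 1;
	e->hash = h;
	e->bytenr = next_bytenr;
	next_bytenr += len;
	return e->bytenr;
}
```

Writing the same block twice yields the same bytenr the second time, with no new space consumed; the sketch ignores hash collisions, which is why the TODO below mentions a bit-to-bit comparison callback.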
***NOTE***
Known bugs:
* Mounting with the "flushoncommit" option while the dedupe feature is enabled
will end up in a _deadlock_.
TODO:
* a bit-to-bit comparison callback.
All comments are welcome!
[1]: http://en.wikipedia.org/wiki/Data_deduplication
[2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage
v10:
- fix a typo in the subject line.
- update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
'Inappropriate ioctl for device'.
v9:
- fix a deadlock and a crash reported by users.
- fix the metadata ENOSPC problem with dedup again.
v8:
- fix the race crash of dedup ref again.
- fix the metadata ENOSPC problem with dedup.
v7:
- rebase onto the latest btrfs.
- break a big patch into smaller ones to make reviewers happy.
- kill the mount options of dedup and use the ioctl method instead.
- fix two crashes due to the special dedup ref.
For former patch sets:
v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959
Liu Bo (16):
Btrfs: disable qgroups accounting when quota_enable is 0
Btrfs: introduce dedup tree and relatives
Btrfs: introduce dedup tree operations
Btrfs: introduce dedup state
Btrfs: make ordered extent aware of dedup
Btrfs: online(inband) data dedup
Btrfs: skip dedup reference during backref walking
Btrfs: don't return space for dedup extent
Btrfs: add ioctl of dedup control
Btrfs: improve the delayed refs process in rm case
Btrfs: fix a crash of dedup ref
Btrfs: fix deadlock of dedup work
Btrfs: fix transaction abortion in __btrfs_free_extent
Btrfs: fix wrong pinned bytes in __btrfs_free_extent
Btrfs: use total_bytes instead of bytes_used for global_rsv
Btrfs: fix dedup enospc problem
fs/btrfs/backref.c | 9 +
fs/btrfs/ctree.c | 2 +-
fs/btrfs/ctree.h | 86 ++++++
fs/btrfs/delayed-ref.c | 26 +-
fs/btrfs/delayed-ref.h | 3 +
fs/btrfs/disk-io.c | 37 +++
fs/btrfs/extent-tree.c | 235 +++++++++++++---
fs/btrfs/extent_io.c | 22 +-
fs/btrfs/extent_io.h | 16 ++
fs/btrfs/file-item.c | 244 +++++++++++++++++
fs/btrfs/inode.c | 635 ++++++++++++++++++++++++++++++++++++++-----
fs/btrfs/ioctl.c | 167 ++++++++++++
fs/btrfs/ordered-data.c | 44 ++-
fs/btrfs/ordered-data.h | 13 +-
fs/btrfs/qgroup.c | 3 +
fs/btrfs/relocation.c | 3 +
fs/btrfs/transaction.c | 41 +++
fs/btrfs/transaction.h | 1 +
include/trace/events/btrfs.h | 3 +-
include/uapi/linux/btrfs.h | 12 +
20 files changed, 1471 insertions(+), 131 deletions(-)
--
1.8.2.1
* [PATCH v10 01/16] Btrfs: disable qgroups accounting when quota_enable is 0
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
It's unnecessary to do qgroup accounting when quota is not enabled.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/ctree.c | 2 +-
fs/btrfs/delayed-ref.c | 18 ++++++++++++++----
fs/btrfs/qgroup.c | 3 +++
3 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 88d1b1e..54f3c67 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -406,7 +406,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
tree_mod_log_write_lock(fs_info);
spin_lock(&fs_info->tree_mod_seq_lock);
- if (!elem->seq) {
+ if (elem && !elem->seq) {
elem->seq = btrfs_inc_tree_mod_seq_major(fs_info);
list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
}
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3129964..3ab37b6 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -656,8 +656,13 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
ref->is_head = 0;
ref->in_tree = 1;
- if (need_ref_seq(for_cow, ref_root))
- seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+ if (need_ref_seq(for_cow, ref_root)) {
+ struct seq_list *elem = NULL;
+
+ if (fs_info->quota_enabled)
+ elem = &trans->delayed_ref_elem;
+ seq = btrfs_get_tree_mod_seq(fs_info, elem);
+ }
ref->seq = seq;
full_ref = btrfs_delayed_node_to_tree_ref(ref);
@@ -718,8 +723,13 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
ref->is_head = 0;
ref->in_tree = 1;
- if (need_ref_seq(for_cow, ref_root))
- seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+ if (need_ref_seq(for_cow, ref_root)) {
+ struct seq_list *elem = NULL;
+
+ if (fs_info->quota_enabled)
+ elem = &trans->delayed_ref_elem;
+ seq = btrfs_get_tree_mod_seq(fs_info, elem);
+ }
ref->seq = seq;
full_ref = btrfs_delayed_node_to_data_ref(ref);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 2cf9058..c634b3e 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1186,6 +1186,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
{
struct qgroup_update *u;
+ if (!trans->root->fs_info->quota_enabled)
+ return 0;
+
BUG_ON(!trans->delayed_ref_elem.seq);
u = kmalloc(sizeof(*u), GFP_NOFS);
if (!u)
--
1.8.1.4
* [PATCH v10 02/16] Btrfs: introduce dedup tree and relatives
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
This is a preparation step for the online/inband dedup tree.
It introduces the dedup tree and its relatives, including the hash driver and
some supporting structures.
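The hash bookkeeping this patch adds can be illustrated in plain userspace C: for SHA-256 the digest is 4 u64 words (32 bytes), and the in-memory record carries the digest in a flexible array member, so its allocation size depends on the algorithm. This mirrors the patch's `btrfs_dedup_lens`/`btrfs_dedup_sizes` tables and `btrfs_dedup_hash_size()`; the struct below is a simplified sketch, not the kernel definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEDUP_SHA256 = 0 };

/* digest length in u64 words and in bytes, indexed by algorithm */
static const int dedup_lens[]  = { 4, 0 };
static const int dedup_sizes[] = { 32, 0 };	/* 256 bit = 32 bytes */

struct dedup_hash {
	uint64_t bytenr;
	uint64_t num_bytes;
	int type;		/* hash algorithm */
	int compression;
	uint64_t hash[];	/* flexible array holding the digest */
};

/* total allocation size for a hash record of the given algorithm */
static size_t dedup_hash_size(int type)
{
	/* the digest must be a whole number of u64 words */
	assert((size_t)dedup_lens[type] * sizeof(uint64_t) ==
	       (size_t)dedup_sizes[type]);
	return sizeof(struct dedup_hash) + dedup_sizes[type];
}
```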
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/ctree.h | 73 ++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/disk-io.c | 36 ++++++++++++++++++++++
fs/btrfs/extent-tree.c | 2 ++
include/trace/events/btrfs.h | 3 +-
4 files changed, 113 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index bc96c03..da4320d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -33,6 +33,7 @@
#include <asm/kmap_types.h>
#include <linux/pagemap.h>
#include <linux/btrfs.h>
+#include <crypto/hash.h>
#include "extent_io.h"
#include "extent_map.h"
#include "async-thread.h"
@@ -101,6 +102,9 @@ struct btrfs_ordered_sum;
/* for storing items that use the BTRFS_UUID_KEY* types */
#define BTRFS_UUID_TREE_OBJECTID 9ULL
+/* dedup tree(experimental) */
+#define BTRFS_DEDUP_TREE_OBJECTID 10ULL
+
/* for storing balance parameters in the root tree */
#define BTRFS_BALANCE_OBJECTID -4ULL
@@ -523,6 +527,7 @@ struct btrfs_super_block {
#define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7)
#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
#define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 10)
#define BTRFS_FEATURE_COMPAT_SUPP 0ULL
#define BTRFS_FEATURE_COMPAT_SAFE_SET 0ULL
@@ -540,6 +545,7 @@ struct btrfs_super_block {
BTRFS_FEATURE_INCOMPAT_RAID56 | \
BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA | \
+ BTRFS_FEATURE_INCOMPAT_DEDUP | \
BTRFS_FEATURE_INCOMPAT_NO_HOLES)
#define BTRFS_FEATURE_INCOMPAT_SAFE_SET \
@@ -915,6 +921,51 @@ struct btrfs_csum_item {
u8 csum;
} __attribute__ ((__packed__));
+/* dedup */
+enum btrfs_dedup_type {
+ BTRFS_DEDUP_SHA256 = 0,
+ BTRFS_DEDUP_LAST = 1,
+};
+
+static int btrfs_dedup_lens[] = { 4, 0 };
+static int btrfs_dedup_sizes[] = { 32, 0 }; /* 256bit, 32bytes */
+
+struct btrfs_dedup_item {
+ /* disk length of dedup range */
+ __le64 len;
+
+ u8 type;
+ u8 compression;
+ u8 encryption;
+
+ /* spare for later use */
+ __le16 other_encoding;
+
+ /* hash/fingerprints go here */
+} __attribute__ ((__packed__));
+
+struct btrfs_dedup_hash {
+ u64 bytenr;
+ u64 num_bytes;
+
+ /* hash algorithm */
+ int type;
+
+ int compression;
+
+ /* last field is a variable length array of dedup hash */
+ u64 hash[];
+};
+
+static inline int btrfs_dedup_hash_size(int type)
+{
+ WARN_ON((btrfs_dedup_lens[type] * sizeof(u64)) !=
+ btrfs_dedup_sizes[type]);
+
+ return sizeof(struct btrfs_dedup_hash) + btrfs_dedup_sizes[type];
+}
+
+
struct btrfs_dev_stats_item {
/*
* grow this item struct at the end for future enhancements and keep
@@ -1320,6 +1371,7 @@ struct btrfs_fs_info {
struct btrfs_root *dev_root;
struct btrfs_root *fs_root;
struct btrfs_root *csum_root;
+ struct btrfs_root *dedup_root;
struct btrfs_root *quota_root;
struct btrfs_root *uuid_root;
@@ -1680,6 +1732,14 @@ struct btrfs_fs_info {
struct semaphore uuid_tree_rescan_sem;
unsigned int update_uuid_tree_gen:1;
+
+ /* reference to deduplication algorithm driver via cryptoapi */
+ struct crypto_shash *dedup_driver;
+
+ /* dedup blocksize */
+ u64 dedup_bs;
+
+ int dedup_type;
};
struct btrfs_subvolume_writers {
@@ -2013,6 +2073,8 @@ struct btrfs_ioctl_defrag_range_args {
*/
#define BTRFS_STRING_ITEM_KEY 253
+#define BTRFS_DEDUP_ITEM_KEY 254
+
/*
* Flags for mount options.
*
@@ -3047,6 +3109,14 @@ static inline u32 btrfs_file_extent_inline_len(struct extent_buffer *eb,
}
+/* btrfs_dedup_item */
+BTRFS_SETGET_FUNCS(dedup_len, struct btrfs_dedup_item, len, 64);
+BTRFS_SETGET_FUNCS(dedup_compression, struct btrfs_dedup_item, compression, 8);
+BTRFS_SETGET_FUNCS(dedup_encryption, struct btrfs_dedup_item, encryption, 8);
+BTRFS_SETGET_FUNCS(dedup_other_encoding, struct btrfs_dedup_item,
+ other_encoding, 16);
+BTRFS_SETGET_FUNCS(dedup_type, struct btrfs_dedup_item, type, 8);
+
/* btrfs_dev_stats_item */
static inline u64 btrfs_dev_stats_value(struct extent_buffer *eb,
struct btrfs_dev_stats_item *ptr,
@@ -3521,6 +3591,8 @@ static inline int btrfs_need_cleaner_sleep(struct btrfs_root *root)
static inline void free_fs_info(struct btrfs_fs_info *fs_info)
{
+ if (fs_info->dedup_driver)
+ crypto_free_shash(fs_info->dedup_driver);
kfree(fs_info->balance_ctl);
kfree(fs_info->delayed_root);
kfree(fs_info->extent_root);
@@ -3687,6 +3759,7 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
struct bio *bio, u64 file_start, int contig);
int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
struct list_head *list, int search_commit);
+
/* inode.c */
struct btrfs_delalloc_work {
struct inode *inode;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bd0f752..a2586ac 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -153,6 +153,7 @@ static struct btrfs_lockdep_keyset {
{ .id = BTRFS_FS_TREE_OBJECTID, .name_stem = "fs" },
{ .id = BTRFS_CSUM_TREE_OBJECTID, .name_stem = "csum" },
{ .id = BTRFS_QUOTA_TREE_OBJECTID, .name_stem = "quota" },
+ { .id = BTRFS_DEDUP_TREE_OBJECTID, .name_stem = "dedup" },
{ .id = BTRFS_TREE_LOG_OBJECTID, .name_stem = "log" },
{ .id = BTRFS_TREE_RELOC_OBJECTID, .name_stem = "treloc" },
{ .id = BTRFS_DATA_RELOC_TREE_OBJECTID, .name_stem = "dreloc" },
@@ -1619,6 +1620,9 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
if (location->objectid == BTRFS_UUID_TREE_OBJECTID)
return fs_info->uuid_root ? fs_info->uuid_root :
ERR_PTR(-ENOENT);
+ if (location->objectid == BTRFS_DEDUP_TREE_OBJECTID)
+ return fs_info->dedup_root ? fs_info->dedup_root :
+ ERR_PTR(-ENOENT);
again:
root = btrfs_lookup_fs_root(fs_info, location->objectid);
if (root) {
@@ -2069,6 +2073,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root)
free_root_extent_buffers(info->csum_root);
free_root_extent_buffers(info->quota_root);
free_root_extent_buffers(info->uuid_root);
+ free_root_extent_buffers(info->dedup_root);
if (chunk_root)
free_root_extent_buffers(info->chunk_root);
}
@@ -2110,6 +2115,19 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
}
}
+static struct crypto_shash *
+btrfs_build_dedup_driver(struct btrfs_fs_info *info)
+{
+ switch (info->dedup_type) {
+ case BTRFS_DEDUP_SHA256:
+ return crypto_alloc_shash("sha256", 0, 0);
+ default:
+ pr_err("btrfs: unrecognized dedup type\n");
+ break;
+ }
+ return ERR_PTR(-EINVAL);
+}
+
int open_ctree(struct super_block *sb,
struct btrfs_fs_devices *fs_devices,
char *options)
@@ -2132,6 +2150,7 @@ int open_ctree(struct super_block *sb,
struct btrfs_root *dev_root;
struct btrfs_root *quota_root;
struct btrfs_root *uuid_root;
+ struct btrfs_root *dedup_root;
struct btrfs_root *log_tree_root;
int ret;
int err = -EINVAL;
@@ -2232,6 +2251,8 @@ int open_ctree(struct super_block *sb,
atomic64_set(&fs_info->tree_mod_seq, 0);
fs_info->sb = sb;
fs_info->max_inline = 8192 * 1024;
+ fs_info->dedup_bs = 0;
+ fs_info->dedup_type = BTRFS_DEDUP_SHA256;
fs_info->metadata_ratio = 0;
fs_info->defrag_inodes = RB_ROOT;
fs_info->free_chunk_space = 0;
@@ -2316,6 +2337,14 @@ int open_ctree(struct super_block *sb,
fs_info->pinned_extents = &fs_info->freed_extents[0];
fs_info->do_barriers = 1;
+ fs_info->dedup_driver = btrfs_build_dedup_driver(fs_info);
+ if (IS_ERR(fs_info->dedup_driver)) {
+ pr_info("BTRFS: Cannot load sha256 driver\n");
+ err = PTR_ERR(fs_info->dedup_driver);
+ fs_info->dedup_driver = NULL;
+ goto fail_alloc;
+ }
+
mutex_init(&fs_info->ordered_operations_mutex);
mutex_init(&fs_info->ordered_extent_flush_mutex);
@@ -2723,6 +2752,13 @@ retry_root_backup:
generation != btrfs_super_uuid_tree_generation(disk_super);
}
+ location.objectid = BTRFS_DEDUP_TREE_OBJECTID;
+ dedup_root = btrfs_read_tree_root(tree_root, &location);
+ if (!IS_ERR(dedup_root)) {
+ dedup_root->track_dirty = 1;
+ fs_info->dedup_root = dedup_root;
+ }
+
fs_info->generation = generation;
fs_info->last_trans_committed = generation;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c6b6a6e..06124c1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4703,6 +4703,8 @@ static void init_global_block_rsv(struct btrfs_fs_info *fs_info)
if (fs_info->quota_root)
fs_info->quota_root->block_rsv = &fs_info->global_block_rsv;
fs_info->chunk_root->block_rsv = &fs_info->chunk_block_rsv;
+ if (fs_info->dedup_root)
+ fs_info->dedup_root->block_rsv = &fs_info->global_block_rsv;
update_global_block_rsv(fs_info);
}
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 4ee4e30..c5ae213 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -42,6 +42,7 @@ struct __btrfs_workqueue;
{ BTRFS_ROOT_TREE_DIR_OBJECTID, "ROOT_TREE_DIR" }, \
{ BTRFS_CSUM_TREE_OBJECTID, "CSUM_TREE" }, \
{ BTRFS_TREE_LOG_OBJECTID, "TREE_LOG" }, \
+ { BTRFS_DEDUP_TREE_OBJECTID, "DEDUP_TREE" }, \
{ BTRFS_QUOTA_TREE_OBJECTID, "QUOTA_TREE" }, \
{ BTRFS_TREE_RELOC_OBJECTID, "TREE_RELOC" }, \
{ BTRFS_UUID_TREE_OBJECTID, "UUID_RELOC" }, \
@@ -50,7 +51,7 @@ struct __btrfs_workqueue;
#define show_root_type(obj) \
obj, ((obj >= BTRFS_DATA_RELOC_TREE_OBJECTID) || \
(obj >= BTRFS_ROOT_TREE_OBJECTID && \
- obj <= BTRFS_QUOTA_TREE_OBJECTID)) ? __show_root_type(obj) : "-"
+ obj <= BTRFS_DEDUP_TREE_OBJECTID)) ? __show_root_type(obj) : "-"
#define BTRFS_GROUP_FLAGS \
{ BTRFS_BLOCK_GROUP_DATA, "DATA"}, \
--
1.8.1.4
* [PATCH v10 03/16] Btrfs: introduce dedup tree operations
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
The operations consist of finding matched items, adding new items and
removing items.
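The patch's key layout can be summed up with a small sketch: the last 64-bit word of the SHA-256 digest becomes `key.objectid`, the extent bytenr becomes `key.offset`, and the remaining 24 digest bytes are stored in the item body after the fixed `btrfs_dedup_item` header. The names below are illustrative stand-ins for the kernel's, under those assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define DEDUP_ITEM_KEY	254
#define SHA256_WORDS	4

struct tree_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* objectid <- last 64 bits of the digest, offset <- extent bytenr */
static struct tree_key make_dedup_key(const uint64_t digest[SHA256_WORDS],
				      uint64_t bytenr)
{
	struct tree_key k;

	k.objectid = digest[SHA256_WORDS - 1];
	k.type = DEDUP_ITEM_KEY;
	k.offset = bytenr;
	return k;
}

/* digest bytes stored in the item body, after the fixed header */
static unsigned int dedup_item_tail_bytes(void)
{
	return (SHA256_WORDS - 1) * sizeof(uint64_t);
}
```

Keying on only the digest tail is why the lookup below has to walk backwards with `btrfs_previous_item` and memcmp the remaining digest bytes before declaring a match.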
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/ctree.h | 9 +++
fs/btrfs/file-item.c | 210 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 219 insertions(+)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index da4320d..ca1b516 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3760,6 +3760,15 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
struct list_head *list, int search_commit);
+int noinline_for_stack
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash);
+int noinline_for_stack
+btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_dedup_hash *hash);
+int noinline_for_stack
+btrfs_free_dedup_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root, u64 hash, u64 bytenr);
/* inode.c */
struct btrfs_delalloc_work {
struct inode *inode;
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 127555b..6437ebe 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -885,3 +885,213 @@ out:
fail_unlock:
goto out;
}
+
+/* 1 means we find one, 0 means we dont. */
+int noinline_for_stack
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash)
+{
+ struct btrfs_key key;
+ struct btrfs_path *path;
+ struct extent_buffer *leaf;
+ struct btrfs_root *dedup_root;
+ struct btrfs_dedup_item *item;
+ u64 hash_value;
+ u64 length;
+ u64 dedup_size;
+ int compression;
+ int found = 0;
+ int index;
+ int ret;
+
+ if (!hash) {
+ WARN_ON(1);
+ return 0;
+ }
+ if (!root->fs_info->dedup_root) {
+ WARN(1, KERN_INFO "dedup not enabled\n");
+ return 0;
+ }
+ dedup_root = root->fs_info->dedup_root;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return 0;
+
+ /*
+ * For SHA256 dedup algorithm, we store the last 64bit as the
+ * key.objectid, and the rest in the tree item.
+ */
+ index = btrfs_dedup_lens[hash->type] - 1;
+ dedup_size = btrfs_dedup_sizes[hash->type] - sizeof(u64);
+
+ hash_value = hash->hash[index];
+
+ key.objectid = hash_value;
+ key.offset = (u64)-1;
+ btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY);
+
+ ret = btrfs_search_slot(NULL, dedup_root, &key, path, 0, 0);
+ if (ret < 0)
+ goto out;
+ if (ret == 0) {
+ WARN_ON(1);
+ goto out;
+ }
+
+prev_slot:
+ /* this will do match checks. */
+ ret = btrfs_previous_item(dedup_root, path, hash_value,
+ BTRFS_DEDUP_ITEM_KEY);
+ if (ret)
+ goto out;
+
+ leaf = path->nodes[0];
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+ if (key.objectid != hash_value)
+ goto out;
+
+ item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dedup_item);
+ /* disk length of dedup range */
+ length = btrfs_dedup_len(leaf, item);
+
+ compression = btrfs_dedup_compression(leaf, item);
+ if (compression > BTRFS_COMPRESS_TYPES) {
+ WARN_ON(1);
+ goto out;
+ }
+
+ if (btrfs_dedup_type(leaf, item) != hash->type)
+ goto prev_slot;
+
+ if (memcmp_extent_buffer(leaf, hash->hash, (unsigned long)(item + 1),
+ dedup_size)) {
+ pr_info("goto prev\n");
+ goto prev_slot;
+ }
+
+ hash->bytenr = key.offset;
+ hash->num_bytes = length;
+ hash->compression = compression;
+ found = 1;
+out:
+ btrfs_free_path(path);
+ return found;
+}
+
+int noinline_for_stack
+btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_dedup_hash *hash)
+{
+ struct btrfs_key key;
+ struct btrfs_path *path;
+ struct extent_buffer *leaf;
+ struct btrfs_root *dedup_root;
+ struct btrfs_dedup_item *dedup_item;
+ u64 ins_size;
+ u64 dedup_size;
+ int index;
+ int ret;
+
+ if (!hash) {
+ WARN_ON(1);
+ return 0;
+ }
+
+ WARN_ON(hash->num_bytes > root->fs_info->dedup_bs);
+
+ if (!root->fs_info->dedup_root) {
+ WARN(1, KERN_INFO "dedup not enabled\n");
+ return 0;
+ }
+ dedup_root = root->fs_info->dedup_root;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return -ENOMEM;
+
+ /*
+ * For SHA256 dedup algorithm, we store the last 64bit as the
+ * key.objectid, and the rest in the tree item.
+ */
+ index = btrfs_dedup_lens[hash->type] - 1;
+ dedup_size = btrfs_dedup_sizes[hash->type] - sizeof(u64);
+
+ ins_size = sizeof(*dedup_item) + dedup_size;
+
+ key.objectid = hash->hash[index];
+ key.offset = hash->bytenr;
+ btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY);
+
+ path->leave_spinning = 1;
+ ret = btrfs_insert_empty_item(trans, dedup_root, path, &key, ins_size);
+ if (ret < 0)
+ goto out;
+ leaf = path->nodes[0];
+
+ dedup_item = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_dedup_item);
+ /* disk length of dedup range */
+ btrfs_set_dedup_len(leaf, dedup_item, hash->num_bytes);
+ btrfs_set_dedup_compression(leaf, dedup_item, hash->compression);
+ btrfs_set_dedup_encryption(leaf, dedup_item, 0);
+ btrfs_set_dedup_other_encoding(leaf, dedup_item, 0);
+ btrfs_set_dedup_type(leaf, dedup_item, hash->type);
+
+ write_extent_buffer(leaf, hash->hash, (unsigned long)(dedup_item + 1),
+ dedup_size);
+
+ btrfs_mark_buffer_dirty(leaf);
+out:
+ WARN_ON(ret == -EEXIST);
+ btrfs_free_path(path);
+ return ret;
+}
+
+int noinline_for_stack
+btrfs_free_dedup_extent(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root, u64 hash, u64 bytenr)
+{
+ struct btrfs_key key;
+ struct btrfs_path *path;
+ struct extent_buffer *leaf;
+ struct btrfs_root *dedup_root;
+ int ret = 0;
+
+ if (!root->fs_info->dedup_root)
+ return 0;
+
+ dedup_root = root->fs_info->dedup_root;
+
+ path = btrfs_alloc_path();
+ if (!path)
+ return ret;
+
+ key.objectid = hash;
+ key.offset = bytenr;
+ btrfs_set_key_type(&key, BTRFS_DEDUP_ITEM_KEY);
+
+ ret = btrfs_search_slot(trans, dedup_root, &key, path, -1, 1);
+ if (ret < 0)
+ goto out;
+ if (ret) {
+ WARN_ON(1);
+ ret = -ENOENT;
+ goto out;
+ }
+
+ leaf = path->nodes[0];
+
+ ret = -ENOENT;
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+ if (btrfs_key_type(&key) != BTRFS_DEDUP_ITEM_KEY)
+ goto out;
+ if (key.objectid != hash || key.offset != bytenr)
+ goto out;
+
+ ret = btrfs_del_item(trans, dedup_root, path);
+ WARN_ON(ret);
+out:
+ btrfs_free_path(path);
+ return ret;
+}
--
1.8.1.4
* [PATCH v10 04/16] Btrfs: introduce dedup state
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
This introduces the dedup state and related operations to mark and unmark
dedup data ranges; they'll be used in later patches.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/extent_io.c | 14 ++++++++++++++
fs/btrfs/extent_io.h | 5 +++++
2 files changed, 19 insertions(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ae69a00..d51487b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1296,6 +1296,20 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
cached_state, mask);
}
+int set_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+ struct extent_state **cached_state, gfp_t mask)
+{
+ return set_extent_bit(tree, start, end, EXTENT_DEDUP, 0,
+ cached_state, mask);
+}
+
+int clear_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+ struct extent_state **cached_state, gfp_t mask)
+{
+ return clear_extent_bit(tree, start, end, EXTENT_DEDUP, 0, 0,
+ cached_state, mask);
+}
+
/*
* either insert or lock state struct between start and end use mask to tell
* us if waiting is desired.
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 58b27e5..897110d 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -20,6 +20,7 @@
#define EXTENT_NEED_WAIT (1 << 13)
#define EXTENT_DAMAGED (1 << 14)
#define EXTENT_NORESERVE (1 << 15)
+#define EXTENT_DEDUP (1 << 16)
#define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
#define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
@@ -226,6 +227,10 @@ int set_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
struct extent_state **cached_state, gfp_t mask);
int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
struct extent_state **cached_state, gfp_t mask);
+int set_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+ struct extent_state **cached_state, gfp_t mask);
+int clear_extent_dedup(struct extent_io_tree *tree, u64 start, u64 end,
+ struct extent_state **cached_state, gfp_t mask);
int set_extent_new(struct extent_io_tree *tree, u64 start, u64 end,
gfp_t mask);
int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
--
1.8.1.4
* [PATCH v10 05/16] Btrfs: make ordered extent aware of dedup
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
This adds a dedup flag and a dedup hash to the ordered extent so that
we can insert dedup extents into the dedup tree at endio time.
The benefit is simplicity: we don't need to clean up dedup
structures if the write is cancelled for some reason.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/ordered-data.c | 38 ++++++++++++++++++++++++++++++++------
fs/btrfs/ordered-data.h | 13 ++++++++++++-
2 files changed, 44 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a94b05f..c520e13 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -183,7 +183,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
*/
static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ int dedup, struct btrfs_dedup_hash *hash)
{
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_inode_tree *tree;
@@ -199,10 +200,23 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
entry->start = start;
entry->len = len;
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) &&
- !(type == BTRFS_ORDERED_NOCOW))
+ !(type == BTRFS_ORDERED_NOCOW) && !dedup)
entry->csum_bytes_left = disk_len;
entry->disk_len = disk_len;
entry->bytes_left = len;
+ entry->dedup = dedup;
+ entry->hash = NULL;
+
+ if (!dedup && hash) {
+ entry->hash = kzalloc(btrfs_dedup_hash_size(hash->type),
+ GFP_NOFS);
+ if (!entry->hash) {
+ kmem_cache_free(btrfs_ordered_extent_cache, entry);
+ return -ENOMEM;
+ }
+ memcpy(entry->hash, hash, btrfs_dedup_hash_size(hash->type));
+ }
+
entry->inode = igrab(inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
@@ -251,7 +265,17 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
{
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, 0, NULL);
+}
+
+int btrfs_add_ordered_extent_dedup(struct inode *inode, u64 file_offset,
+ u64 start, u64 len, u64 disk_len, int type,
+ int dedup, struct btrfs_dedup_hash *hash,
+ int compress_type)
+{
+ return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ compress_type, dedup, hash);
}
int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
@@ -259,16 +283,17 @@ int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
{
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, 0, NULL);
}
int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
u64 start, u64 len, u64 disk_len,
- int type, int compress_type)
+ int type, int compress_type,
+ struct btrfs_dedup_hash *hash)
{
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
disk_len, type, 0,
- compress_type);
+ compress_type, 0, hash);
}
/*
@@ -530,6 +555,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
list_del(&sum->list);
kfree(sum);
}
+ kfree(entry->hash);
kmem_cache_free(btrfs_ordered_extent_cache, entry);
}
}
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 2468970..efbb11f 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -109,6 +109,9 @@ struct btrfs_ordered_extent {
/* compression algorithm */
int compress_type;
+ /* whether this ordered extent is marked for dedup or not */
+ int dedup;
+
/* reference count */
atomic_t refs;
@@ -135,6 +138,9 @@ struct btrfs_ordered_extent {
struct completion completion;
struct btrfs_work flush_work;
struct list_head work_list;
+
+ /* dedup hash of sha256 type */
+ struct btrfs_dedup_hash *hash;
};
/*
@@ -168,11 +174,16 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode,
int uptodate);
int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
u64 start, u64 len, u64 disk_len, int type);
+int btrfs_add_ordered_extent_dedup(struct inode *inode, u64 file_offset,
+ u64 start, u64 len, u64 disk_len, int type,
+ int dedup, struct btrfs_dedup_hash *hash,
+ int compress_type);
int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
u64 start, u64 len, u64 disk_len, int type);
int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
u64 start, u64 len, u64 disk_len,
- int type, int compress_type);
+ int type, int compress_type,
+ struct btrfs_dedup_hash *hash);
void btrfs_add_ordered_sum(struct inode *inode,
struct btrfs_ordered_extent *entry,
struct btrfs_ordered_sum *sum);
--
1.8.1.4
* [PATCH v10 06/16] Btrfs: online(inband) data dedup
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
This is the main part of data dedup.
This introduces a FORMAT CHANGE.
Btrfs provides online (inband/synchronous), block-level dedup.
It maps naturally to btrfs's block back-references, which let us
store multiple copies of data as a single copy plus references
to that copy.
The workflow is:
(1) write some data;
(2) compute the hash of the data based on btrfs's dedup blocksize;
(3) look up extents matching the hash and decide whether the data
is a duplicate. If not, write the data to disk;
otherwise, add a reference to the matched extent.
Btrfs's built-in dedup supports normal writes and compressed writes.
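The three-step workflow above can be sketched in userspace C. This is a minimal illustration only: the hash table, the djb2 hash, and every name below are stand-ins invented for the sketch, whereas the real patch hashes with SHA-256 and stores the hashes in a dedicated dedup tree keyed by BTRFS_DEDUP_TREE_OBJECTID.

```c
/* Toy model of the dedup write path: hash a dedup-blocksize chunk,
 * look the hash up, and either "write" a new extent or add a
 * reference to the already-stored copy. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64

struct dedup_entry {
	uint64_t hash;   /* content hash of the block */
	uint64_t bytenr; /* "disk" location of the stored copy */
	int refs;        /* references to that single copy */
	int used;
};

static struct dedup_entry table[TABLE_SIZE];
static uint64_t next_bytenr = 4096;

/* djb2, standing in for SHA-256 */
static uint64_t hash_block(const char *data, size_t len)
{
	uint64_t h = 5381;
	for (size_t i = 0; i < len; i++)
		h = h * 33 + (unsigned char)data[i];
	return h;
}

/* Returns the bytenr the data lives at; allocates only on a miss. */
uint64_t dedup_write(const char *data, size_t len)
{
	uint64_t h = hash_block(data, len);
	size_t slot = h % TABLE_SIZE;

	while (table[slot].used && table[slot].hash != h)
		slot = (slot + 1) % TABLE_SIZE; /* linear probing */

	if (table[slot].used) {         /* duplicate: add a ref */
		table[slot].refs++;
		return table[slot].bytenr;
	}
	table[slot].used = 1;           /* new data: "write" it */
	table[slot].hash = h;
	table[slot].bytenr = next_bytenr;
	table[slot].refs = 1;
	next_bytenr += len;
	return table[slot].bytenr;
}
```

Writing identical data twice resolves to the same bytenr with an incremented refcount, which is exactly the property the back-reference scheme gives btrfs for free.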
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/extent-tree.c | 150 ++++++++++--
fs/btrfs/extent_io.c | 8 +-
fs/btrfs/extent_io.h | 11 +
fs/btrfs/inode.c | 640 +++++++++++++++++++++++++++++++++++++++++++------
4 files changed, 712 insertions(+), 97 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 06124c1..088846c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1123,8 +1123,16 @@ static noinline int lookup_extent_data_ref(struct btrfs_trans_handle *trans,
key.offset = parent;
} else {
key.type = BTRFS_EXTENT_DATA_REF_KEY;
- key.offset = hash_extent_data_ref(root_objectid,
- owner, offset);
+
+ /*
+ * we've not got the right offset and owner, so search by -1
+ * here.
+ */
+ if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID)
+ key.offset = (u64)-1;
+ else
+ key.offset = hash_extent_data_ref(root_objectid,
+ owner, offset);
}
again:
recow = 0;
@@ -1151,6 +1159,10 @@ again:
goto fail;
}
+ if (ret > 0 && root_objectid == BTRFS_DEDUP_TREE_OBJECTID &&
+ path->slots[0] > 0)
+ path->slots[0]--;
+
leaf = path->nodes[0];
nritems = btrfs_header_nritems(leaf);
while (1) {
@@ -1174,14 +1186,22 @@ again:
ref = btrfs_item_ptr(leaf, path->slots[0],
struct btrfs_extent_data_ref);
- if (match_extent_data_ref(leaf, ref, root_objectid,
- owner, offset)) {
- if (recow) {
- btrfs_release_path(path);
- goto again;
+ if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+ if (btrfs_extent_data_ref_root(leaf, ref) ==
+ root_objectid) {
+ err = 0;
+ break;
+ }
+ } else {
+ if (match_extent_data_ref(leaf, ref, root_objectid,
+ owner, offset)) {
+ if (recow) {
+ btrfs_release_path(path);
+ goto again;
+ }
+ err = 0;
+ break;
}
- err = 0;
- break;
}
path->slots[0]++;
}
@@ -1325,6 +1345,32 @@ static noinline int remove_extent_data_ref(struct btrfs_trans_handle *trans,
return ret;
}
+static noinline u64 extent_data_ref_offset(struct btrfs_root *root,
+ struct btrfs_path *path,
+ struct btrfs_extent_inline_ref *iref)
+{
+ struct btrfs_key key;
+ struct extent_buffer *leaf;
+ struct btrfs_extent_data_ref *ref1;
+ u64 offset = 0;
+
+ leaf = path->nodes[0];
+ btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+ if (iref) {
+ WARN_ON(btrfs_extent_inline_ref_type(leaf, iref) !=
+ BTRFS_EXTENT_DATA_REF_KEY);
+ ref1 = (struct btrfs_extent_data_ref *)(&iref->offset);
+ offset = btrfs_extent_data_ref_offset(leaf, ref1);
+ } else if (key.type == BTRFS_EXTENT_DATA_REF_KEY) {
+ ref1 = btrfs_item_ptr(leaf, path->slots[0],
+ struct btrfs_extent_data_ref);
+ offset = btrfs_extent_data_ref_offset(leaf, ref1);
+ } else {
+ WARN_ON(1);
+ }
+ return offset;
+}
+
static noinline u32 extent_data_ref_count(struct btrfs_root *root,
struct btrfs_path *path,
struct btrfs_extent_inline_ref *iref)
@@ -1591,7 +1637,8 @@ again:
err = -ENOENT;
while (1) {
if (ptr >= end) {
- WARN_ON(ptr > end);
+ WARN_ON(ptr > end &&
+ root_objectid != BTRFS_DEDUP_TREE_OBJECTID);
break;
}
iref = (struct btrfs_extent_inline_ref *)ptr;
@@ -1606,14 +1653,25 @@ again:
if (type == BTRFS_EXTENT_DATA_REF_KEY) {
struct btrfs_extent_data_ref *dref;
dref = (struct btrfs_extent_data_ref *)(&iref->offset);
- if (match_extent_data_ref(leaf, dref, root_objectid,
- owner, offset)) {
- err = 0;
- break;
+
+ if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+ if (btrfs_extent_data_ref_root(leaf, dref) ==
+ root_objectid) {
+ err = 0;
+ break;
+ }
+ } else {
+ if (match_extent_data_ref(leaf, dref,
+ root_objectid, owner,
+ offset)) {
+ err = 0;
+ break;
+ }
+ if (hash_extent_data_ref_item(leaf, dref) <
+ hash_extent_data_ref(root_objectid, owner,
+ offset))
+ break;
}
- if (hash_extent_data_ref_item(leaf, dref) <
- hash_extent_data_ref(root_objectid, owner, offset))
- break;
} else {
u64 ref_offset;
ref_offset = btrfs_extent_inline_ref_offset(leaf, iref);
@@ -5630,11 +5688,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
struct btrfs_extent_inline_ref *iref;
int ret;
int is_data;
- int extent_slot = 0;
- int found_extent = 0;
- int num_to_del = 1;
+ int extent_slot;
+ int found_extent;
+ int num_to_del;
u32 item_size;
u64 refs;
+ u64 orig_root_obj;
+ u64 dedup_hash;
bool skinny_metadata = btrfs_fs_incompat(root->fs_info,
SKINNY_METADATA);
@@ -5642,6 +5702,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
if (!path)
return -ENOMEM;
+again:
+ extent_slot = 0;
+ found_extent = 0;
+ num_to_del = 1;
+ orig_root_obj = root_objectid;
+ dedup_hash = 0;
+
path->reada = 1;
path->leave_spinning = 1;
@@ -5683,6 +5750,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
#endif
if (!found_extent) {
BUG_ON(iref);
+
+ if (is_data &&
+ root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+ dedup_hash = extent_data_ref_offset(root, path,
+ NULL);
+ }
ret = remove_extent_backref(trans, extent_root, path,
NULL, refs_to_drop,
is_data);
@@ -5740,6 +5813,10 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
}
extent_slot = path->slots[0];
}
+ } else if (ret == -ENOENT &&
+ root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+ ret = 0;
+ goto out;
} else if (WARN_ON(ret == -ENOENT)) {
btrfs_print_leaf(extent_root, path->nodes[0]);
btrfs_err(info,
@@ -5832,7 +5909,28 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
}
add_pinned_bytes(root->fs_info, -num_bytes, owner_objectid,
root_objectid);
+
+ /*
+ * special case for dedup
+ *
+ * We assume the last ref(ref==1) is backref pointing to dedup.
+ *
+ * root_obj == 1 means that it's a free space cache inode,
+ * and it always uses PREALLOC, so it never has dedup extent.
+ */
+ if (is_data && refs == 1 &&
+ orig_root_obj != BTRFS_ROOT_TREE_OBJECTID) {
+ btrfs_release_path(path);
+ root_objectid = BTRFS_DEDUP_TREE_OBJECTID;
+ parent = 0;
+
+ goto again;
+ }
} else {
+ if (!dedup_hash && is_data &&
+ root_objectid == BTRFS_DEDUP_TREE_OBJECTID)
+ dedup_hash = extent_data_ref_offset(root, path, iref);
+
if (found_extent) {
BUG_ON(is_data && refs_to_drop !=
extent_data_ref_count(root, path, iref));
@@ -5859,6 +5957,18 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
btrfs_abort_transaction(trans, extent_root, ret);
goto out;
}
+
+ if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+ ret = btrfs_free_dedup_extent(trans, root,
+ dedup_hash,
+ bytenr);
+ if (ret) {
+ btrfs_abort_transaction(trans,
+ extent_root,
+ ret);
+ goto out;
+ }
+ }
}
ret = update_block_group(root, bytenr, num_bytes, 0);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d51487b..00f20be 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2390,7 +2390,7 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
* Scheduling is not allowed, so the extent state tree is expected
* to have one and only one object corresponding to this IO.
*/
-static void end_bio_extent_writepage(struct bio *bio, int err)
+void end_bio_extent_writepage(struct bio *bio, int err)
{
struct bio_vec *bvec;
u64 start;
@@ -2645,8 +2645,8 @@ struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
}
-static int __must_check submit_one_bio(int rw, struct bio *bio,
- int mirror_num, unsigned long bio_flags)
+int __must_check submit_one_bio(int rw, struct bio *bio, int mirror_num,
+ unsigned long bio_flags)
{
int ret = 0;
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -2685,7 +2685,7 @@ static int merge_bio(int rw, struct extent_io_tree *tree, struct page *page,
}
-static int submit_extent_page(int rw, struct extent_io_tree *tree,
+int submit_extent_page(int rw, struct extent_io_tree *tree,
struct page *page, sector_t sector,
size_t size, unsigned long offset,
struct block_device *bdev,
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 897110d..ee723f5 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -349,6 +349,17 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
int mirror_num);
+int submit_extent_page(int rw, struct extent_io_tree *tree, struct page *page,
+ sector_t sector, size_t size, unsigned long offset,
+ struct block_device *bdev, struct bio **bio_ret,
+ unsigned long max_pages, bio_end_io_t end_io_func,
+ int mirror_num, unsigned long prev_bio_flags,
+ unsigned long bio_flags);
+void end_bio_extent_writepage(struct bio *bio, int err);
+int __must_check submit_one_bio(int rw, struct bio *bio, int mirror_num,
+ unsigned long bio_flags);
+
+
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
noinline u64 find_lock_delalloc_range(struct inode *inode,
struct extent_io_tree *tree,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 06e9a41..8e031bf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -43,6 +43,7 @@
#include <linux/btrfs.h>
#include <linux/blkdev.h>
#include <linux/posix_acl_xattr.h>
+#include <asm/unaligned.h>
#include "ctree.h"
#include "disk-io.h"
#include "transaction.h"
@@ -105,6 +106,17 @@ static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
u64 block_start, u64 block_len,
u64 orig_block_len, u64 ram_bytes,
int type);
+static noinline int cow_file_range_dedup(struct inode *inode,
+ struct page *locked_page,
+ u64 start, u64 end, int *page_started,
+ unsigned long *nr_written, int unlock,
+ struct btrfs_dedup_hash *hash);
+static int run_locked_range(struct extent_io_tree *tree, struct inode *inode,
+ struct page *locked_page, u64 start, u64 end,
+ get_extent_t *get_extent, int mode,
+ struct btrfs_dedup_hash *hash);
+static int btrfs_inode_test_compress(struct inode *inode);
+
static int btrfs_dirty_inode(struct inode *inode);
@@ -315,6 +327,7 @@ struct async_extent {
unsigned long nr_pages;
int compress_type;
struct list_head list;
+ struct btrfs_dedup_hash *hash; /* dedup hash of sha256 */
};
struct async_cow {
@@ -332,22 +345,41 @@ static noinline int add_async_extent(struct async_cow *cow,
u64 compressed_size,
struct page **pages,
unsigned long nr_pages,
- int compress_type)
+ int compress_type,
+ struct btrfs_dedup_hash *h)
{
struct async_extent *async_extent;
async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS);
- BUG_ON(!async_extent); /* -ENOMEM */
+ if (!async_extent)
+ return -ENOMEM;
async_extent->start = start;
async_extent->ram_size = ram_size;
async_extent->compressed_size = compressed_size;
async_extent->pages = pages;
async_extent->nr_pages = nr_pages;
async_extent->compress_type = compress_type;
+ async_extent->hash = NULL;
+ if (h) {
+ async_extent->hash = kmalloc(btrfs_dedup_hash_size(h->type),
+ GFP_NOFS);
+ if (!async_extent->hash) {
+ kfree(async_extent);
+ return -ENOMEM;
+ }
+ memcpy(async_extent->hash, h, btrfs_dedup_hash_size(h->type));
+ }
+
list_add_tail(&async_extent->list, &cow->extents);
return 0;
}
+static noinline void free_async_extent(struct async_extent *p)
+{
+ kfree(p->hash);
+ kfree(p);
+}
+
/*
* we create compressed extents in two phases. The first
* phase compresses a range of pages that have already been
@@ -369,7 +401,8 @@ static noinline int compress_file_range(struct inode *inode,
struct page *locked_page,
u64 start, u64 end,
struct async_cow *async_cow,
- int *num_added)
+ int *num_added,
+ struct btrfs_dedup_hash *dedup_hash)
{
struct btrfs_root *root = BTRFS_I(inode)->root;
u64 num_bytes;
@@ -437,9 +470,7 @@ again:
* change at any time if we discover bad compression ratios.
*/
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) &&
- (btrfs_test_opt(root, COMPRESS) ||
- (BTRFS_I(inode)->force_compress) ||
- (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) {
+ btrfs_inode_test_compress(inode)) {
WARN_ON(pages);
pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);
if (!pages) {
@@ -567,9 +598,11 @@ cont:
* allocation on disk for these compressed pages,
* and will submit them to the elevator.
*/
- add_async_extent(async_cow, start, num_bytes,
- total_compressed, pages, nr_pages_ret,
- compress_type);
+ ret = add_async_extent(async_cow, start, num_bytes,
+ total_compressed, pages, nr_pages_ret,
+ compress_type, dedup_hash);
+ if (ret)
+ goto free_pages_out;
if (start + num_bytes < end) {
start += num_bytes;
@@ -593,8 +626,11 @@ cleanup_and_bail_uncompressed:
}
if (redirty)
extent_range_redirty_for_io(inode, start, end);
- add_async_extent(async_cow, start, end - start + 1,
- 0, NULL, 0, BTRFS_COMPRESS_NONE);
+ ret = add_async_extent(async_cow, start, end - start + 1,
+ 0, NULL, 0, BTRFS_COMPRESS_NONE, dedup_hash);
+ if (ret)
+ goto free_pages_out;
+
*num_added += 1;
}
@@ -643,38 +679,15 @@ again:
retry:
/* did the compression code fall back to uncompressed IO? */
if (!async_extent->pages) {
- int page_started = 0;
- unsigned long nr_written = 0;
-
- lock_extent(io_tree, async_extent->start,
- async_extent->start +
- async_extent->ram_size - 1);
-
- /* allocate blocks */
- ret = cow_file_range(inode, async_cow->locked_page,
- async_extent->start,
- async_extent->start +
- async_extent->ram_size - 1,
- &page_started, &nr_written, 0);
-
- /* JDM XXX */
+ ret = run_locked_range(io_tree, inode,
+ async_cow->locked_page,
+ async_extent->start,
+ async_extent->start +
+ async_extent->ram_size - 1,
+ btrfs_get_extent, WB_SYNC_ALL,
+ async_extent->hash);
- /*
- * if page_started, cow_file_range inserted an
- * inline extent and took care of all the unlocking
- * and IO for us. Otherwise, we need to submit
- * all those pages down to the drive.
- */
- if (!page_started && !ret)
- extent_write_locked_range(io_tree,
- inode, async_extent->start,
- async_extent->start +
- async_extent->ram_size - 1,
- btrfs_get_extent,
- WB_SYNC_ALL);
- else if (ret)
- unlock_page(async_cow->locked_page);
- kfree(async_extent);
+ free_async_extent(async_extent);
cond_resched();
continue;
}
@@ -757,7 +770,8 @@ retry:
async_extent->ram_size,
ins.offset,
BTRFS_ORDERED_COMPRESSED,
- async_extent->compress_type);
+ async_extent->compress_type,
+ async_extent->hash);
if (ret)
goto out_free_reserve;
@@ -777,7 +791,7 @@ retry:
ins.offset, async_extent->pages,
async_extent->nr_pages);
alloc_hint = ins.objectid + ins.offset;
- kfree(async_extent);
+ free_async_extent(async_extent);
if (ret)
goto out;
cond_resched();
@@ -795,10 +809,366 @@ out_free:
EXTENT_DEFRAG | EXTENT_DO_ACCOUNTING,
PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK);
- kfree(async_extent);
+ free_async_extent(async_extent);
goto again;
}
+static void btrfs_dedup_hash_final(struct btrfs_dedup_hash *hash);
+
+static int btrfs_dedup_hash_digest(struct btrfs_root *root, const char *data,
+ u64 length, struct btrfs_dedup_hash *hash)
+{
+ struct crypto_shash *tfm = root->fs_info->dedup_driver;
+ struct {
+ struct shash_desc desc;
+ char ctx[crypto_shash_descsize(tfm)];
+ } sdesc;
+ int ret;
+
+ sdesc.desc.tfm = tfm;
+ sdesc.desc.flags = 0;
+
+ ret = crypto_shash_digest(&sdesc.desc, data, length,
+ (char *)(hash->hash));
+ if (!ret)
+ btrfs_dedup_hash_final(hash);
+ return ret;
+}
+
+static void btrfs_dedup_hash_final(struct btrfs_dedup_hash *hash)
+{
+ int num, i;
+
+ num = btrfs_dedup_lens[hash->type] - 1;
+ for (i = 0; i < num; i++)
+ put_unaligned_le64(hash->hash[i], (char *)(hash->hash + i));
+}
+
+static int btrfs_calc_dedup_hash(struct btrfs_root *root, struct inode *inode,
+ u64 start, struct btrfs_dedup_hash *hash)
+{
+ struct page *p;
+ char *data;
+ u64 length = root->fs_info->dedup_bs;
+ u64 blocksize = root->sectorsize;
+ int err;
+
+ if (length == blocksize) {
+ p = find_get_page(inode->i_mapping,
+ (start >> PAGE_CACHE_SHIFT));
+ WARN_ON(!p); /* page should be here */
+ data = kmap_atomic(p);
+ err = btrfs_dedup_hash_digest(root, data, length, hash);
+ kunmap_atomic(data);
+ page_cache_release(p);
+ } else {
+ char *d;
+ int i = 0;
+
+ data = kmalloc(length, GFP_NOFS);
+ if (!data)
+ return -ENOMEM;
+
+ while (blocksize * i < length) {
+ p = find_get_page(inode->i_mapping,
+ (start >> PAGE_CACHE_SHIFT) + i);
+ WARN_ON(!p); /* page should be here */
+ d = kmap_atomic(p);
+ memcpy((data + blocksize * i), d, blocksize);
+ kunmap_atomic(d);
+ page_cache_release(p);
+ i++;
+ }
+
+ err = btrfs_dedup_hash_digest(root, data, length, hash);
+ kfree(data);
+ }
+ return err;
+}
+
+static noinline int
+run_delalloc_dedup(struct inode *inode, struct page *locked_page, u64 start,
+ u64 end, struct async_cow *async_cow)
+{
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct bio *bio = NULL;
+ struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+ struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+ struct extent_map *em;
+ struct page *page = NULL;
+ struct block_device *bdev;
+ struct btrfs_key ins;
+ u64 blocksize = root->sectorsize;
+ u64 num_bytes;
+ u64 cur_alloc_size;
+ u64 cur_end;
+ u64 alloc_hint = 0;
+ u64 iosize;
+ u64 dedup_bs = root->fs_info->dedup_bs;
+ int compr;
+ int found;
+ int type = 0;
+ sector_t sector;
+ int ret = 0;
+ struct extent_state *cached_state = NULL;
+ struct btrfs_dedup_hash *hash;
+ int dedup_type = root->fs_info->dedup_type;
+
+ WARN_ON(btrfs_is_free_space_inode(inode));
+
+ num_bytes = ALIGN(end - start + 1, blocksize);
+ num_bytes = max(blocksize, num_bytes);
+
+ hash = kzalloc(btrfs_dedup_hash_size(dedup_type), GFP_NOFS);
+ if (!hash) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
+
+ while (num_bytes > 0) {
+ unsigned long op = 0;
+
+ /* page has been locked by caller */
+ page = find_get_page(inode->i_mapping,
+ start >> PAGE_CACHE_SHIFT);
+ WARN_ON(!page); /* page should be here */
+
+ /* already ordered? */
+ if (PagePrivate2(page))
+ goto submit;
+
+ /* too small data, go for normal path */
+ if (num_bytes < dedup_bs) {
+ cur_end = start + num_bytes - 1;
+
+ if (btrfs_inode_test_compress(inode)) {
+ int num_added = 0;
+ compress_file_range(inode, page, start, cur_end,
+ async_cow, &num_added,
+ NULL);
+ } else {
+ /* Now locked_page is not dirty. */
+ if (page_offset(locked_page) >= start &&
+ page_offset(locked_page) <= cur_end) {
+ __set_page_dirty_nobuffers(locked_page);
+ }
+
+ ret = run_locked_range(tree, inode, page, start,
+ cur_end,
+ btrfs_get_extent,
+ WB_SYNC_ALL, NULL);
+ if (ret)
+ SetPageError(page);
+ }
+
+ page_cache_release(page);
+ page = NULL;
+
+ start += num_bytes;
+ num_bytes = 0;
+ cond_resched();
+ continue;
+ }
+
+ cur_alloc_size = min_t(u64, num_bytes, dedup_bs);
+ WARN_ON(cur_alloc_size < dedup_bs); /* shouldn't happen */
+ cur_end = start + cur_alloc_size - 1;
+
+ /* see comments in compress_file_range */
+ extent_range_clear_dirty_for_io(inode, start, cur_end);
+
+ memset(hash, 0, btrfs_dedup_hash_size(dedup_type));
+ hash->type = dedup_type;
+
+ ret = btrfs_calc_dedup_hash(root, inode, start, hash);
+
+ if (ret) {
+ found = 0;
+ compr = BTRFS_COMPRESS_NONE;
+ } else {
+ found = btrfs_find_dedup_extent(root, hash);
+ compr = hash->compression;
+ }
+
+ if (found == 0) {
+ /*
+ * compress fastpath.
+ * so we take the original data as dedup string instead
+ * of compressed data since compression methods and data
+ * from them vary a lot.
+ */
+ if (btrfs_inode_test_compress(inode)) {
+ int num_added = 0;
+
+ extent_range_redirty_for_io(inode, start,
+ cur_end);
+
+ compress_file_range(inode, page, start, cur_end,
+ async_cow, &num_added,
+ hash);
+
+ page_cache_release(page);
+ page = NULL;
+
+ num_bytes -= cur_alloc_size;
+ start += cur_alloc_size;
+ cond_resched();
+ continue;
+ }
+
+ /* no compress */
+ ret = btrfs_reserve_extent(root, cur_alloc_size,
+ cur_alloc_size, 0, alloc_hint,
+ &ins, 1);
+ if (ret < 0)
+ goto out_unlock;
+ } else { /* found same hash */
+ ins.objectid = hash->bytenr;
+ ins.offset = hash->num_bytes;
+
+ set_extent_dedup(tree, start, cur_end, &cached_state,
+ GFP_NOFS);
+ }
+
+ lock_extent(tree, start, cur_end);
+
+ em = alloc_extent_map();
+ if (!em) {
+ ret = -ENOMEM;
+ goto out_reserve;
+ }
+ em->start = start;
+ em->orig_start = em->start;
+ em->len = cur_alloc_size;
+ em->mod_start = em->start;
+ em->mod_len = em->len;
+
+ em->block_start = ins.objectid;
+ em->block_len = ins.offset;
+ em->orig_block_len = ins.offset;
+ em->bdev = root->fs_info->fs_devices->latest_bdev;
+ set_bit(EXTENT_FLAG_PINNED, &em->flags);
+ em->generation = -1;
+ if (compr > BTRFS_COMPRESS_NONE) {
+ em->compress_type = compr;
+ set_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
+ type = BTRFS_ORDERED_COMPRESSED;
+ }
+
+ while (1) {
+ write_lock(&em_tree->lock);
+ ret = add_extent_mapping(em_tree, em, 1);
+ write_unlock(&em_tree->lock);
+ if (ret != -EEXIST) {
+ free_extent_map(em);
+ break;
+ }
+ btrfs_drop_extent_cache(inode, start, cur_end, 0);
+ }
+ if (ret)
+ goto out_reserve;
+
+ ret = btrfs_add_ordered_extent_dedup(inode, start, ins.objectid,
+ cur_alloc_size, ins.offset,
+ type, found, hash, compr);
+ if (ret)
+ goto out_reserve;
+
+ /*
+ * Do set the Private2 bit so we know this page was properly
+ * setup for writepage
+ */
+ op |= PAGE_SET_PRIVATE2 | PAGE_SET_WRITEBACK | PAGE_CLEAR_DIRTY;
+ extent_clear_unlock_delalloc(inode, start, cur_end,
+ NULL,
+ EXTENT_LOCKED | EXTENT_DELALLOC,
+ op);
+
+submit:
+ iosize = blocksize;
+
+ found = test_range_bit(tree, start, start + iosize - 1,
+ EXTENT_DEDUP, 0, cached_state);
+ if (found == 0) {
+ em = btrfs_get_extent(inode, page, 0, start, blocksize,
+ 1);
+ if (IS_ERR(em)) {
+ /* btrfs_get_extent will not return NULL */
+ ret = PTR_ERR(em);
+ goto out_reserve;
+ }
+
+ sector = (em->block_start + start - em->start) >> 9;
+ bdev = em->bdev;
+ free_extent_map(em);
+ em = NULL;
+
+ /* TODO: rw can be WRITE_SYNC */
+ ret = submit_extent_page(WRITE, tree, page, sector,
+ iosize, 0, bdev, &bio,
+ 0, /* max_nr is not used */
+ end_bio_extent_writepage,
+ 0, 0, 0);
+ if (ret)
+ break;
+ } else {
+ clear_extent_dedup(tree, start, start + iosize - 1,
+ &cached_state, GFP_NOFS);
+
+ end_extent_writepage(page, 0, start,
+ start + iosize - 1);
+ /* we need to do this ourselves because we skip IO */
+ end_page_writeback(page);
+ }
+
+ unlock_page(page);
+ page_cache_release(page);
+ page = NULL;
+
+ num_bytes -= blocksize;
+ alloc_hint = ins.objectid + blocksize;
+ start += blocksize;
+ cond_resched();
+ }
+
+out_unlock:
+ if (bio) {
+ if (ret)
+ bio_put(bio);
+ else
+ ret = submit_one_bio(WRITE, bio, 0, 0);
+ bio = NULL;
+ }
+
+ if (ret && page)
+ SetPageError(page);
+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
+out:
+ if (ret && num_bytes > 0)
+ extent_clear_unlock_delalloc(inode,
+ start, start + num_bytes - 1,
+ NULL,
+ EXTENT_DELALLOC | EXTENT_LOCKED |
+ EXTENT_DEDUP | EXTENT_DEFRAG,
+ PAGE_UNLOCK | PAGE_SET_WRITEBACK |
+ PAGE_END_WRITEBACK | PAGE_CLEAR_DIRTY);
+
+ kfree(hash);
+ free_extent_state(cached_state);
+ return ret;
+
+out_reserve:
+ if (found == 0)
+ btrfs_free_reserved_extent(root, ins.objectid, ins.offset);
+ goto out_unlock;
+}
+
static u64 get_extent_allocation_hint(struct inode *inode, u64 start,
u64 num_bytes)
{
@@ -844,11 +1214,11 @@ static u64 get_extent_allocation_hint(struct inode *inode, u64 start,
* required to start IO on it. It may be clean and already done with
* IO when we return.
*/
-static noinline int cow_file_range(struct inode *inode,
+static noinline int __cow_file_range(struct inode *inode,
struct page *locked_page,
u64 start, u64 end, int *page_started,
unsigned long *nr_written,
- int unlock)
+ int unlock, struct btrfs_dedup_hash *hash)
{
struct btrfs_root *root = BTRFS_I(inode)->root;
u64 alloc_hint = 0;
@@ -948,8 +1318,16 @@ static noinline int cow_file_range(struct inode *inode,
goto out_reserve;
cur_alloc_size = ins.offset;
- ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
- ram_size, cur_alloc_size, 0);
+ if (!hash)
+ ret = btrfs_add_ordered_extent(inode, start,
+ ins.objectid, ram_size,
+ cur_alloc_size, 0);
+ else
+ ret = btrfs_add_ordered_extent_dedup(inode, start,
+ ins.objectid, ram_size,
+ cur_alloc_size, 0, 0,
+ hash,
+ BTRFS_COMPRESS_NONE);
if (ret)
goto out_reserve;
@@ -997,21 +1375,76 @@ out_unlock:
goto out;
}
+static noinline int cow_file_range(struct inode *inode,
+ struct page *locked_page,
+ u64 start, u64 end, int *page_started,
+ unsigned long *nr_written,
+ int unlock)
+{
+ return __cow_file_range(inode, locked_page, start, end, page_started,
+ nr_written, unlock, NULL);
+}
+
+static noinline int cow_file_range_dedup(struct inode *inode,
+ struct page *locked_page,
+ u64 start, u64 end, int *page_started,
+ unsigned long *nr_written,
+ int unlock, struct btrfs_dedup_hash *hash)
+{
+ return __cow_file_range(inode, locked_page, start, end, page_started,
+ nr_written, unlock, hash);
+}
+
+static int run_locked_range(struct extent_io_tree *tree, struct inode *inode,
+ struct page *locked_page, u64 start, u64 end,
+ get_extent_t *get_extent, int mode,
+ struct btrfs_dedup_hash *hash)
+{
+ int page_started = 0;
+ unsigned long nr_written = 0;
+ int ret;
+
+ lock_extent(tree, start, end);
+
+ /* allocate blocks */
+ ret = cow_file_range_dedup(inode, locked_page, start, end,
+ &page_started, &nr_written, 0, hash);
+
+ /*
+ * if page_started, cow_file_range inserted an
+ * inline extent and took care of all the unlocking
+ * and IO for us. Otherwise, we need to submit
+ * all those pages down to the drive.
+ */
+ if (!page_started && !ret)
+ extent_write_locked_range(tree, inode, start, end, get_extent,
+ mode);
+ else if (ret)
+ unlock_page(locked_page);
+
+ return ret;
+}
+
/*
* work queue call back to started compression on a file and pages
*/
static noinline void async_cow_start(struct btrfs_work *work)
{
struct async_cow *async_cow;
- int num_added = 0;
async_cow = container_of(work, struct async_cow, work);
- compress_file_range(async_cow->inode, async_cow->locked_page,
- async_cow->start, async_cow->end, async_cow,
- &num_added);
- if (num_added == 0) {
- btrfs_add_delayed_iput(async_cow->inode);
- async_cow->inode = NULL;
+ if (async_cow->root->fs_info->dedup_bs != 0) {
+ run_delalloc_dedup(async_cow->inode, async_cow->locked_page,
+ async_cow->start, async_cow->end, async_cow);
+ } else {
+ int num_added = 0;
+ compress_file_range(async_cow->inode, async_cow->locked_page,
+ async_cow->start, async_cow->end, async_cow,
+ &num_added, NULL);
+ if (num_added == 0) {
+ btrfs_add_delayed_iput(async_cow->inode);
+ async_cow->inode = NULL;
+ }
}
}
@@ -1398,6 +1831,19 @@ error:
return ret;
}
+static int btrfs_inode_test_compress(struct inode *inode)
+{
+ struct btrfs_inode *bi = BTRFS_I(inode);
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+
+ if (btrfs_test_opt(root, COMPRESS) ||
+ bi->force_compress ||
+ bi->flags & BTRFS_INODE_COMPRESS)
+ return 1;
+
+ return 0;
+}
+
/*
* extent_io.c call back to do delayed allocation processing
*/
@@ -1407,21 +1853,21 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
{
int ret;
struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct btrfs_inode *bi = BTRFS_I(inode);
- if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
+ if (bi->flags & BTRFS_INODE_NODATACOW) {
ret = run_delalloc_nocow(inode, locked_page, start, end,
page_started, 1, nr_written);
- } else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC) {
+ } else if (bi->flags & BTRFS_INODE_PREALLOC) {
ret = run_delalloc_nocow(inode, locked_page, start, end,
page_started, 0, nr_written);
- } else if (!btrfs_test_opt(root, COMPRESS) &&
- !(BTRFS_I(inode)->force_compress) &&
- !(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS)) {
+ } else if (!btrfs_inode_test_compress(inode) &&
+ root->fs_info->dedup_bs == 0) {
ret = cow_file_range(inode, locked_page, start, end,
page_started, nr_written, 1);
} else {
set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
- &BTRFS_I(inode)->runtime_flags);
+ &bi->runtime_flags);
ret = cow_file_range_async(inode, locked_page, start, end,
page_started, nr_written);
}
@@ -1848,12 +2294,14 @@ static int btrfs_writepage_start_hook(struct page *page, u64 start, u64 end)
return -EBUSY;
}
-static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
- struct inode *inode, u64 file_pos,
- u64 disk_bytenr, u64 disk_num_bytes,
- u64 num_bytes, u64 ram_bytes,
- u8 compression, u8 encryption,
- u16 other_encoding, int extent_type)
+static int __insert_reserved_file_extent(struct btrfs_trans_handle *trans,
+ struct inode *inode, u64 file_pos,
+ u64 disk_bytenr, u64 disk_num_bytes,
+ u64 num_bytes, u64 ram_bytes,
+ u8 compression, u8 encryption,
+ u16 other_encoding, int extent_type,
+ int dedup,
+ struct btrfs_dedup_hash *hash)
{
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_file_extent_item *fi;
@@ -1915,15 +2363,59 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
ins.objectid = disk_bytenr;
ins.offset = disk_num_bytes;
ins.type = BTRFS_EXTENT_ITEM_KEY;
- ret = btrfs_alloc_reserved_file_extent(trans, root,
- root->root_key.objectid,
- btrfs_ino(inode), file_pos, &ins);
+
+ if (!dedup) {
+ ret = btrfs_alloc_reserved_file_extent(trans, root,
+ root->root_key.objectid,
+ btrfs_ino(inode),
+ file_pos, &ins);
+ if (ret)
+ goto out;
+
+ if (hash) {
+ int index = btrfs_dedup_lens[hash->type] - 1;
+
+ hash->bytenr = ins.objectid;
+ hash->num_bytes = ins.offset;
+ hash->compression = compression;
+ ret = btrfs_insert_dedup_extent(trans, root, hash);
+ if (ret)
+ goto out;
+
+ ret = btrfs_inc_extent_ref(trans, root, ins.objectid,
+ ins.offset, 0,
+ BTRFS_DEDUP_TREE_OBJECTID,
+ btrfs_ino(inode),
+ hash->hash[index], 0);
+ }
+ } else {
+ ret = btrfs_inc_extent_ref(trans, root, ins.objectid,
+ ins.offset, 0,
+ root->root_key.objectid,
+ btrfs_ino(inode),
+ file_pos, /* file_pos - 0 */
+ 0);
+ }
out:
btrfs_free_path(path);
return ret;
}
+static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
+ struct inode *inode, u64 file_pos,
+ u64 disk_bytenr, u64 disk_num_bytes,
+ u64 num_bytes, u64 ram_bytes,
+ u8 compression, u8 encryption,
+ u16 other_encoding, int extent_type)
+{
+ return __insert_reserved_file_extent(trans, inode, file_pos,
+ disk_bytenr, disk_num_bytes,
+ num_bytes, ram_bytes, compression,
+ encryption, other_encoding,
+ extent_type, 0, NULL);
+}
+
/* snapshot-aware defrag */
struct sa_defrag_extent_backref {
struct rb_node node;
@@ -2663,13 +3155,15 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
logical_len);
} else {
BUG_ON(root == root->fs_info->tree_root);
- ret = insert_reserved_file_extent(trans, inode,
+ ret = __insert_reserved_file_extent(trans, inode,
ordered_extent->file_offset,
ordered_extent->start,
ordered_extent->disk_len,
logical_len, logical_len,
compress_type, 0, 0,
- BTRFS_FILE_EXTENT_REG);
+ BTRFS_FILE_EXTENT_REG,
+ ordered_extent->dedup,
+ ordered_extent->hash);
}
unpin_extent_cache(&BTRFS_I(inode)->extent_tree,
ordered_extent->file_offset, ordered_extent->len,
--
1.8.1.4
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v10 07/16] Btrfs: skip dedup reference during backref walking
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (5 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 06/16] Btrfs: online(inband) data dedup Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v10 08/16] Btrfs: don't return space for dedup extent Liu Bo
` (12 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
The dedup ref is quite a special one: it is only used to store the hash value
of the extent and cannot be used to find data, so we skip it during backref
walking.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/backref.c | 9 +++++++++
fs/btrfs/relocation.c | 3 +++
2 files changed, 12 insertions(+)
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index aad7201..5e57949 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -623,6 +623,9 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
key.objectid = ref->objectid;
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = ref->offset;
+ if (ref->root == BTRFS_DEDUP_TREE_OBJECTID)
+ break;
+
ret = __add_prelim_ref(prefs, ref->root, &key, 0, 0,
node->bytenr,
node->ref_mod * sgn, GFP_ATOMIC);
@@ -743,6 +746,9 @@ static int __add_inline_refs(struct btrfs_fs_info *fs_info,
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = btrfs_extent_data_ref_offset(leaf, dref);
root = btrfs_extent_data_ref_root(leaf, dref);
+ if (root == BTRFS_DEDUP_TREE_OBJECTID)
+ break;
+
ret = __add_prelim_ref(prefs, root, &key, 0, 0,
bytenr, count, GFP_NOFS);
break;
@@ -826,6 +832,9 @@ static int __add_keyed_refs(struct btrfs_fs_info *fs_info,
key.type = BTRFS_EXTENT_DATA_KEY;
key.offset = btrfs_extent_data_ref_offset(leaf, dref);
root = btrfs_extent_data_ref_root(leaf, dref);
+ if (root == BTRFS_DEDUP_TREE_OBJECTID)
+ break;
+
ret = __add_prelim_ref(prefs, root, &key, 0, 0,
bytenr, count, GFP_NOFS);
break;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index def428a..8431294 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3508,6 +3508,9 @@ static int find_data_references(struct reloc_control *rc,
ref_offset = btrfs_extent_data_ref_offset(leaf, ref);
ref_count = btrfs_extent_data_ref_count(leaf, ref);
+ if (ref_root == BTRFS_DEDUP_TREE_OBJECTID)
+ return 0;
+
/*
* This is an extent belonging to the free space cache, lets just delete
* it and redo the search.
--
1.8.1.4
* [PATCH v10 08/16] Btrfs: don't return space for dedup extent
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (6 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 07/16] Btrfs: skip dedup reference during backref walking Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v10 09/16] Btrfs: add ioctl of dedup control Liu Bo
` (11 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
If the ordered extent hit an IOERR or something else went wrong, we need to
return its space to the allocator, but if the extent is marked as a dedup
one, we must not free the space, because it reuses an existing extent's
space instead of allocating new space.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/inode.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e031bf..0c1a43e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3213,6 +3213,7 @@ out:
* truncated case if we didn't write out the extent at all.
*/
if ((ret || !logical_len) &&
+ !ordered_extent->dedup &&
!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
!test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
btrfs_free_reserved_extent(root, ordered_extent->start,
--
1.8.1.4
* [PATCH v10 09/16] Btrfs: add ioctl of dedup control
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (7 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 08/16] Btrfs: don't return space for dedup extent Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v10 10/16] Btrfs: improve the delayed refs process in rm case Liu Bo
` (10 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
So far we have 4 commands to control dedup behaviour:
- btrfs dedup enable
Create the dedup tree; this is the very first step when you are going to use
the dedup feature.
- btrfs dedup disable
Delete the dedup tree; after this we cannot use dedup any more unless it is
enabled again.
- btrfs dedup on [-b]
Switch on the dedup feature temporarily; this is the second step of applying
dedup to writes. Option '-b' sets the dedup blocksize.
The default blocksize is 8192 (no special reason, you may argue), and the current
limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the
upper limit of btrfs's compression.
- btrfs dedup off
Switch off the dedup feature temporarily, but the dedup tree remains.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/ctree.h | 3 +
fs/btrfs/disk-io.c | 1 +
fs/btrfs/ioctl.c | 167 +++++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/btrfs.h | 12 ++++
4 files changed, 183 insertions(+)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ca1b516..feebfab 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1740,6 +1740,9 @@ struct btrfs_fs_info {
u64 dedup_bs;
int dedup_type;
+
+ /* protect user change for dedup operations */
+ struct mutex dedup_ioctl_mutex;
};
struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a2586ac..3be947f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2362,6 +2362,7 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->dev_replace.lock_finishing_cancel_unmount);
mutex_init(&fs_info->dev_replace.lock_management_lock);
mutex_init(&fs_info->dev_replace.lock);
+ mutex_init(&fs_info->dedup_ioctl_mutex);
spin_lock_init(&fs_info->qgroup_lock);
mutex_init(&fs_info->qgroup_ioctl_lock);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0401397..45c183c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4820,6 +4820,171 @@ static int btrfs_ioctl_set_features(struct file *file, void __user *arg)
return btrfs_commit_transaction(trans, root);
}
+static long btrfs_enable_dedup(struct btrfs_root *root)
+{
+ struct btrfs_fs_info *fs_info = root->fs_info;
+ struct btrfs_trans_handle *trans = NULL;
+ struct btrfs_root *dedup_root;
+ int ret = 0;
+
+ mutex_lock(&fs_info->dedup_ioctl_mutex);
+ if (fs_info->dedup_root) {
+ pr_info("btrfs: dedup has already been enabled\n");
+ mutex_unlock(&fs_info->dedup_ioctl_mutex);
+ return 0;
+ }
+
+ trans = btrfs_start_transaction(root, 2);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
+ mutex_unlock(&fs_info->dedup_ioctl_mutex);
+ return ret;
+ }
+
+ dedup_root = btrfs_create_tree(trans, fs_info,
+ BTRFS_DEDUP_TREE_OBJECTID);
+ if (IS_ERR(dedup_root))
+ ret = PTR_ERR(dedup_root);
+
+ if (ret)
+ btrfs_end_transaction(trans, root);
+ else
+ ret = btrfs_commit_transaction(trans, root);
+
+ if (!ret) {
+ pr_info("btrfs: dedup enabled\n");
+ fs_info->dedup_root = dedup_root;
+ fs_info->dedup_root->block_rsv = &fs_info->global_block_rsv;
+ btrfs_set_fs_incompat(fs_info, DEDUP);
+ }
+
+ mutex_unlock(&fs_info->dedup_ioctl_mutex);
+ return ret;
+}
+
+static long btrfs_disable_dedup(struct btrfs_root *root)
+{
+ struct btrfs_fs_info *fs_info = root->fs_info;
+ struct btrfs_root *dedup_root;
+ int ret;
+
+ mutex_lock(&fs_info->dedup_ioctl_mutex);
+ if (!fs_info->dedup_root) {
+ pr_info("btrfs: dedup has been disabled\n");
+ mutex_unlock(&fs_info->dedup_ioctl_mutex);
+ return 0;
+ }
+
+ if (fs_info->dedup_bs != 0) {
+ pr_info("btrfs: cannot disable dedup until switching off dedup!\n");
+ mutex_unlock(&fs_info->dedup_ioctl_mutex);
+ return -EBUSY;
+ }
+
+ dedup_root = fs_info->dedup_root;
+
+ ret = btrfs_drop_snapshot(dedup_root, NULL, 1, 0);
+
+ if (!ret) {
+ fs_info->dedup_root = NULL;
+ pr_info("btrfs: dedup disabled\n");
+ }
+
+ mutex_unlock(&fs_info->dedup_ioctl_mutex);
+ WARN_ON(ret < 0 && ret != -EAGAIN && ret != -EROFS);
+ return ret;
+}
+
+static long btrfs_set_dedup_bs(struct btrfs_root *root, u64 bs)
+{
+ struct btrfs_fs_info *info = root->fs_info;
+ int ret = 0;
+
+ mutex_lock(&info->dedup_ioctl_mutex);
+ if (!info->dedup_root) {
+ pr_info("btrfs: dedup is disabled, we cannot switch on/off dedup\n");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ bs = ALIGN(bs, root->sectorsize);
+ bs = min_t(u64, bs, (128 * 1024ULL));
+
+ if (bs == info->dedup_bs) {
+ if (info->dedup_bs == 0)
+ pr_info("btrfs: switch OFF dedup(it's already off)\n");
+ else
+ pr_info("btrfs: switch ON dedup(its bs is already %llu)\n",
+ bs);
+ goto out;
+ }
+
+ /*
+ * The dedup works similar to compression, both use async workqueue to
+ * reach better performance. We drain the on-going async works here
+ * so that new dedup writes will apply with the new dedup blocksize.
+ */
+ atomic_inc(&info->async_submit_draining);
+ while (atomic_read(&info->nr_async_submits) ||
+ atomic_read(&info->async_delalloc_pages)) {
+ wait_event(info->async_submit_wait,
+ (atomic_read(&info->nr_async_submits) == 0 &&
+ atomic_read(&info->async_delalloc_pages) == 0));
+ }
+
+ /*
+ * dedup_bs = 0: dedup off;
+ * dedup_bs > 0: dedup on;
+ */
+ info->dedup_bs = bs;
+ if (info->dedup_bs == 0) {
+ pr_info("btrfs: switch OFF dedup\n");
+ } else {
+ info->dedup_bs = bs;
+ pr_info("btrfs: switch ON dedup(dedup blocksize %llu)\n",
+ info->dedup_bs);
+ }
+ atomic_dec(&info->async_submit_draining);
+
+out:
+ mutex_unlock(&info->dedup_ioctl_mutex);
+ return ret;
+}
+
+static long btrfs_ioctl_dedup_ctl(struct btrfs_root *root, void __user *args)
+{
+ struct btrfs_ioctl_dedup_args *dargs;
+ int ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ dargs = memdup_user(args, sizeof(*dargs));
+ if (IS_ERR(dargs)) {
+ ret = PTR_ERR(dargs);
+ goto out;
+ }
+
+ switch (dargs->cmd) {
+ case BTRFS_DEDUP_CTL_ENABLE:
+ ret = btrfs_enable_dedup(root);
+ break;
+ case BTRFS_DEDUP_CTL_DISABLE:
+ ret = btrfs_disable_dedup(root);
+ break;
+ case BTRFS_DEDUP_CTL_SET_BS:
+ /* dedup on/off */
+ ret = btrfs_set_dedup_bs(root, dargs->bs);
+ break;
+ default:
+ ret = -EINVAL;
+ }
+
+ kfree(dargs);
+out:
+ return ret;
+}
+
long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
{
@@ -4942,6 +5107,8 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_set_fslabel(file, argp);
case BTRFS_IOC_FILE_EXTENT_SAME:
return btrfs_ioctl_file_extent_same(file, argp);
+ case BTRFS_IOC_DEDUP_CTL:
+ return btrfs_ioctl_dedup_ctl(root, argp);
case BTRFS_IOC_GET_SUPPORTED_FEATURES:
return btrfs_ioctl_get_supported_features(file, argp);
case BTRFS_IOC_GET_FEATURES:
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b4d6909..a300b27 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -405,6 +405,16 @@ struct btrfs_ioctl_get_dev_stats {
__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
};
+/* deduplication control ioctl modes */
+#define BTRFS_DEDUP_CTL_ENABLE 1
+#define BTRFS_DEDUP_CTL_DISABLE 2
+#define BTRFS_DEDUP_CTL_SET_BS 3
+struct btrfs_ioctl_dedup_args {
+ __u64 cmd;
+ __u64 bs;
+ __u64 unused[14];
+};
+
#define BTRFS_QUOTA_CTL_ENABLE 1
#define BTRFS_QUOTA_CTL_DISABLE 2
#define BTRFS_QUOTA_CTL_RESCAN__NOTUSED 3
@@ -612,6 +622,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
struct btrfs_ioctl_dev_replace_args)
#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
struct btrfs_ioctl_same_args)
+#define BTRFS_IOC_DEDUP_CTL _IOWR(BTRFS_IOCTL_MAGIC, 55, \
+ struct btrfs_ioctl_dedup_args)
#define BTRFS_IOC_GET_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \
struct btrfs_ioctl_feature_flags)
#define BTRFS_IOC_SET_FEATURES _IOW(BTRFS_IOCTL_MAGIC, 57, \
--
1.8.1.4
* [PATCH v10 10/16] Btrfs: improve the delayed refs process in rm case
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (8 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 09/16] Btrfs: add ioctl of dedup control Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v10 11/16] Btrfs: fix a crash of dedup ref Liu Bo
` (9 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
While removing a file with dedup extents, we could have a great number of
delayed refs pending to process, and these refs drop a reference on the
extent, i.e. they are of BTRFS_DROP_DELAYED_REF type.
But in order to prevent an extent's ref count from going down to zero while
there are still pending delayed refs, we first select the "adding a ref" ones,
which are of BTRFS_ADD_DELAYED_REF type.
So in the removal case, all of our delayed refs are of BTRFS_DROP_DELAYED_REF
type, but we have to walk all the refs issued against the extent looking for
BTRFS_ADD_DELAYED_REF ones, find that there are none, and only then start over
to find a BTRFS_DROP_DELAYED_REF.
This is really unnecessary; we can improve it by tracking how many
BTRFS_ADD_DELAYED_REF refs we have and searching directly for the right type.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/delayed-ref.c | 8 ++++++++
fs/btrfs/delayed-ref.h | 3 +++
fs/btrfs/extent-tree.c | 16 +++++++++++++---
3 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3ab37b6..6435d78 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -538,6 +538,9 @@ update_existing_head_ref(struct btrfs_delayed_ref_node *existing,
* currently, for refs we just added we know we're a-ok.
*/
existing->ref_mod += update->ref_mod;
+ WARN_ON(update->ref_mod > 1);
+ if (update->ref_mod == 1)
+ existing_ref->add_cnt++;
spin_unlock(&existing_ref->lock);
}
@@ -601,6 +604,11 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
head_ref->is_data = is_data;
head_ref->ref_root = RB_ROOT;
head_ref->processing = 0;
+ /* track added ref, more comments in select_delayed_ref() */
+ if (count_mod == 1)
+ head_ref->add_cnt = 1;
+ else
+ head_ref->add_cnt = 0;
spin_lock_init(&head_ref->lock);
mutex_init(&head_ref->mutex);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 4ba9b93..905f991 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -87,6 +87,9 @@ struct btrfs_delayed_ref_head {
struct rb_node href_node;
struct btrfs_delayed_extent_op *extent_op;
+
+ int add_cnt;
+
/*
* when a new extent is allocated, it is just reserved in memory
* The actual extent isn't inserted into the extent allocation tree
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 088846c..191f0a7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2347,7 +2347,11 @@ static noinline struct btrfs_delayed_ref_node *
select_delayed_ref(struct btrfs_delayed_ref_head *head)
{
struct rb_node *node;
- struct btrfs_delayed_ref_node *ref, *last = NULL;;
+ struct btrfs_delayed_ref_node *ref, *last = NULL;
+ int action = BTRFS_ADD_DELAYED_REF;
+
+ if (head->add_cnt == 0)
+ action = BTRFS_DROP_DELAYED_REF;
/*
* select delayed ref of type BTRFS_ADD_DELAYED_REF first.
@@ -2358,10 +2362,13 @@ select_delayed_ref(struct btrfs_delayed_ref_head *head)
while (node) {
ref = rb_entry(node, struct btrfs_delayed_ref_node,
rb_node);
- if (ref->action == BTRFS_ADD_DELAYED_REF)
+ if (ref->action == action) {
+ if (ref->action == BTRFS_ADD_DELAYED_REF)
+ head->add_cnt--;
return ref;
- else if (last == NULL)
+ } else if (last == NULL) {
last = ref;
+ }
node = rb_next(node);
}
return last;
@@ -2435,6 +2442,9 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
if (ref && ref->seq &&
btrfs_check_delayed_seq(fs_info, delayed_refs, ref->seq)) {
+ if (ref->action == BTRFS_ADD_DELAYED_REF)
+ locked_ref->add_cnt++;
+
spin_unlock(&locked_ref->lock);
btrfs_delayed_ref_unlock(locked_ref);
spin_lock(&delayed_refs->lock);
--
1.8.1.4
* [PATCH v10 11/16] Btrfs: fix a crash of dedup ref
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (9 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 10/16] Btrfs: improve the delayed refs process in rm case Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v10 12/16] Btrfs: fix deadlock of dedup work Liu Bo
` (8 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
The dedup reference is a special kind of delayed ref, and delayed refs are
batched to be processed later.
If we find a matching dedup extent, we queue an ADD delayed ref on it within
the endio work, but a DROP delayed ref may already be queued:
t1 t2 t3
->writepage commit transaction
->run_delalloc_dedup
find_dedup
------------------------------------------------------------------------------
process_delayed refs
(it deletes the dedup extent)
add ordered extent |
submit pages |
finish ordered io |
insert file extents |
queue delayed refs |
queue dedup ref |
"process delayed refs" continues
(insert a ref on an extent
deleted by the above)
This scenario ends up with a crash because we are going to insert a ref on a
deleted extent.
To avoid the race, we need to check whether an ADD delayed ref exists when
deleting the extent, and protect this check with a lock.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/ctree.h | 3 ++-
fs/btrfs/extent-tree.c | 35 +++++++++++++++++++----------------
fs/btrfs/file-item.c | 36 +++++++++++++++++++++++++++++++++++-
fs/btrfs/inode.c | 10 ++--------
4 files changed, 58 insertions(+), 26 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index feebfab..2e8c443 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3764,7 +3764,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
struct list_head *list, int search_commit);
int noinline_for_stack
-btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash);
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash,
+ struct inode *inode, u64 file_pos);
int noinline_for_stack
btrfs_insert_dedup_extent(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 191f0a7..a8da7aa 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5937,9 +5937,23 @@ again:
goto again;
}
} else {
- if (!dedup_hash && is_data &&
- root_objectid == BTRFS_DEDUP_TREE_OBJECTID)
- dedup_hash = extent_data_ref_offset(root, path, iref);
+ if (is_data && root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
+ if (!dedup_hash)
+ dedup_hash = extent_data_ref_offset(root,
+ path, iref);
+
+ ret = btrfs_free_dedup_extent(trans, root,
+ dedup_hash, bytenr);
+ if (ret) {
+ if (ret == -EAGAIN)
+ ret = 0;
+ else
+ btrfs_abort_transaction(trans,
+ extent_root,
+ ret);
+ goto out;
+ }
+ }
if (found_extent) {
BUG_ON(is_data && refs_to_drop !=
@@ -5964,21 +5978,10 @@ again:
if (is_data) {
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
- btrfs_abort_transaction(trans, extent_root, ret);
+ btrfs_abort_transaction(trans,
+ extent_root, ret);
goto out;
}
-
- if (root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
- ret = btrfs_free_dedup_extent(trans, root,
- dedup_hash,
- bytenr);
- if (ret) {
- btrfs_abort_transaction(trans,
- extent_root,
- ret);
- goto out;
- }
- }
}
ret = update_block_group(root, bytenr, num_bytes, 0);
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6437ebe..4ae383b 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -888,13 +888,15 @@ fail_unlock:
/* 1 means we find one, 0 means we dont. */
int noinline_for_stack
-btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash)
+btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash,
+ struct inode *inode, u64 file_pos)
{
struct btrfs_key key;
struct btrfs_path *path;
struct extent_buffer *leaf;
struct btrfs_root *dedup_root;
struct btrfs_dedup_item *item;
+ struct btrfs_trans_handle *trans;
u64 hash_value;
u64 length;
u64 dedup_size;
@@ -917,6 +919,12 @@ btrfs_find_dedup_extent(struct btrfs_root *root, struct btrfs_dedup_hash *hash)
if (!path)
return 0;
+ trans = btrfs_join_transaction(root);
+ if (IS_ERR(trans)) {
+ trans = NULL;
+ goto out;
+ }
+
/*
* For SHA256 dedup algorithm, we store the last 64bit as the
* key.objectid, and the rest in the tree item.
@@ -973,8 +981,16 @@ prev_slot:
hash->num_bytes = length;
hash->compression = compression;
found = 1;
+
+ ret = btrfs_inc_extent_ref(trans, root, key.offset, length, 0,
+ BTRFS_I(inode)->root->root_key.objectid,
+ btrfs_ino(inode),
+ file_pos, /* file_pos - 0 */
+ 0);
out:
btrfs_free_path(path);
+ if (trans)
+ btrfs_end_transaction(trans, root);
return found;
}
@@ -1056,6 +1072,8 @@ btrfs_free_dedup_extent(struct btrfs_trans_handle *trans,
struct btrfs_path *path;
struct extent_buffer *leaf;
struct btrfs_root *dedup_root;
+ struct btrfs_delayed_ref_root *delayed_refs;
+ struct btrfs_delayed_ref_head *head;
int ret = 0;
if (!root->fs_info->dedup_root)
@@ -1089,6 +1107,22 @@ btrfs_free_dedup_extent(struct btrfs_trans_handle *trans,
if (key.objectid != hash || key.offset != bytenr)
goto out;
+ ret = 0;
+
+ /* check if ADD_DELAYED delayed refs exist */
+ delayed_refs = &trans->transaction->delayed_refs;
+
+ spin_lock(&delayed_refs->lock);
+ head = btrfs_find_delayed_ref_head(trans, bytenr);
+
+ /* the mutex has been acquired by the caller */
+ if (head && head->add_cnt) {
+ spin_unlock(&delayed_refs->lock);
+ ret = -EAGAIN;
+ goto out;
+ }
+ spin_unlock(&delayed_refs->lock);
+
ret = btrfs_del_item(trans, dedup_root, path);
WARN_ON(ret);
out:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0c1a43e..ed92ef3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -988,7 +988,8 @@ run_delalloc_dedup(struct inode *inode, struct page *locked_page, u64 start,
found = 0;
compr = BTRFS_COMPRESS_NONE;
} else {
- found = btrfs_find_dedup_extent(root, hash);
+ found = btrfs_find_dedup_extent(root, hash,
+ inode, start);
compr = hash->compression;
}
@@ -2388,13 +2389,6 @@ static int __insert_reserved_file_extent(struct btrfs_trans_handle *trans,
btrfs_ino(inode),
hash->hash[index], 0);
}
- } else {
- ret = btrfs_inc_extent_ref(trans, root, ins.objectid,
- ins.offset, 0,
- root->root_key.objectid,
- btrfs_ino(inode),
- file_pos, /* file_pos - 0 */
- 0);
}
out:
btrfs_free_path(path);
--
1.8.1.4
* [PATCH v10 12/16] Btrfs: fix deadlock of dedup work
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (10 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 11/16] Btrfs: fix a crash of dedup ref Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v10 13/16] Btrfs: fix transaction abortion in __btrfs_free_extent Liu Bo
` (7 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
Checking for dedup references needs to allocate memory, so it cannot be done
under a spinlock; otherwise we may sleep while holding the lock and deadlock.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/extent-tree.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a8da7aa..4c1c342 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5720,7 +5720,6 @@ again:
dedup_hash = 0;
path->reada = 1;
- path->leave_spinning = 1;
is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
BUG_ON(!is_data && refs_to_drop != 1);
@@ -5774,7 +5773,6 @@ again:
goto out;
}
btrfs_release_path(path);
- path->leave_spinning = 1;
key.objectid = bytenr;
key.type = BTRFS_EXTENT_ITEM_KEY;
@@ -5942,6 +5940,7 @@ again:
dedup_hash = extent_data_ref_offset(root,
path, iref);
+ WARN_ON_ONCE(path->leave_spinning);
ret = btrfs_free_dedup_extent(trans, root,
dedup_hash, bytenr);
if (ret) {
--
1.8.1.4
* [PATCH v10 13/16] Btrfs: fix transaction abortion in __btrfs_free_extent
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (11 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 12/16] Btrfs: fix deadlock of dedup work Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v10 14/16] Btrfs: fix wrong pinned bytes " Liu Bo
` (6 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
We need to reset @refs_to_drop to 1 when we're going to delete the last
special dedup reference, otherwise we can trigger (@refs < @refs_to_drop)
and end up with transaction abortion.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/extent-tree.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4c1c342..1cb3ec5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5931,6 +5931,7 @@ again:
btrfs_release_path(path);
root_objectid = BTRFS_DEDUP_TREE_OBJECTID;
parent = 0;
+ refs_to_drop = 1;
goto again;
}
--
1.8.1.4
* [PATCH v10 14/16] Btrfs: fix wrong pinned bytes in __btrfs_free_extent
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (12 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 13/16] Btrfs: fix transaction abortion in __btrfs_free_extent Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v10 15/16] Btrfs: use total_bytes instead of bytes_used for global_rsv Liu Bo
` (5 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
With the special dedup reference, in the (refs == 1) case in
__btrfs_free_extent we will actually free the extent, so its pinned bytes
should not be added to the global counter.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/extent-tree.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1cb3ec5..b8fee86 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5915,9 +5915,6 @@ again:
goto out;
}
}
- add_pinned_bytes(root->fs_info, -num_bytes, owner_objectid,
- root_objectid);
-
/*
* special case for dedup
*
@@ -5934,6 +5931,9 @@ again:
refs_to_drop = 1;
goto again;
+ } else {
+ add_pinned_bytes(root->fs_info, -num_bytes,
+ owner_objectid, root_objectid);
}
} else {
if (is_data && root_objectid == BTRFS_DEDUP_TREE_OBJECTID) {
--
1.8.1.4
* [PATCH v10 15/16] Btrfs: use total_bytes instead of bytes_used for global_rsv
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (13 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 14/16] Btrfs: fix wrong pinned bytes " Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v10 16/16] Btrfs: fix dedup enospc problem Liu Bo
` (4 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
Because of dedup, the data space info's bytes_used cannot reflect how much
data has been written, so to size global_rsv more properly, use total_bytes
instead.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/extent-tree.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b8fee86..6f8b012 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4692,14 +4692,14 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_DATA);
spin_lock(&sinfo->lock);
- data_used = sinfo->bytes_used;
+ data_used = sinfo->total_bytes;
spin_unlock(&sinfo->lock);
sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
spin_lock(&sinfo->lock);
if (sinfo->flags & BTRFS_BLOCK_GROUP_DATA)
data_used = 0;
- meta_used = sinfo->bytes_used;
+ meta_used = sinfo->total_bytes;
spin_unlock(&sinfo->lock);
num_bytes = (data_used >> fs_info->sb->s_blocksize_bits) *
--
1.8.1.4
* [PATCH v10 16/16] Btrfs: fix dedup enospc problem
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (14 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 15/16] Btrfs: use total_bytes instead of bytes_used for global_rsv Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 3:48 ` [PATCH v5] Btrfs-progs: add dedup subcommand Liu Bo
` (3 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
In the case of dedup, btrfs will produce a large number of delayed refs, and
processing them can very likely eat all of the space reserved in
global_block_rsv, and we'll end up with a transaction abort due to ENOSPC.
I tried several different ways to reserve more space for global_block_rsv,
hoping it would be enough for flushing delayed refs, but I failed and the
code could become very messy.
I found that with high delayed refs pressure, the throttle work in
end_transaction had little use since it didn't block insertion of new delayed
refs, so I moved the throttling to the very start stage,
i.e. start_transaction.
We take the worst case into account in the throttle code, that is, every
delayed ref would update the btree, so when we reach the limit where they may
use up all the reserved space of global_block_rsv, we kick transaction_kthread
to commit the transaction to process these delayed refs, refresh
global_block_rsv's space, and get pinned space back as well.
That way we get rid of the annoying ENOSPC problem.
However, this leads to a new problem: it cannot be used along with the option
"flushoncommit", otherwise it can cause an ABBA deadlock between
commit_transaction and the ordered extent flush.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
fs/btrfs/extent-tree.c | 50 ++++++++++++++++++++++++++++++++++++++-----------
fs/btrfs/ordered-data.c | 6 ++++++
fs/btrfs/transaction.c | 41 ++++++++++++++++++++++++++++++++++++++++
fs/btrfs/transaction.h | 1 +
4 files changed, 87 insertions(+), 11 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6f8b012..ec6f42d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2695,24 +2695,52 @@ static inline u64 heads_to_leaves(struct btrfs_root *root, u64 heads)
int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
struct btrfs_root *root)
{
+ struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_block_rsv *global_rsv;
- u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
+ u64 num_heads;
+ u64 num_entries;
u64 num_bytes;
int ret = 0;
- num_bytes = btrfs_calc_trans_metadata_size(root, 1);
- num_heads = heads_to_leaves(root, num_heads);
- if (num_heads > 1)
- num_bytes += (num_heads - 1) * root->leafsize;
- num_bytes <<= 1;
global_rsv = &root->fs_info->global_block_rsv;
- /*
- * If we can't allocate any more chunks lets make sure we have _lots_ of
- * wiggle room since running delayed refs can create more delayed refs.
- */
- if (global_rsv->space_info->full)
+ if (trans) {
+ num_heads = trans->transaction->delayed_refs.num_heads_ready;
+ num_bytes = btrfs_calc_trans_metadata_size(root, 1);
+ num_heads = heads_to_leaves(root, num_heads);
+ if (num_heads > 1)
+ num_bytes += (num_heads - 1) * root->leafsize;
num_bytes <<= 1;
+ /*
+ * If we can't allocate any more chunks lets make sure we have
+ * _lots_ of wiggle room since running delayed refs can create
+ * more delayed refs.
+ */
+ if (global_rsv->space_info->full)
+ num_bytes <<= 1;
+ } else {
+ if (root->fs_info->dedup_bs == 0)
+ return 0;
+
+ /* dedup enabled */
+ spin_lock(&root->fs_info->trans_lock);
+ if (!root->fs_info->running_transaction) {
+ spin_unlock(&root->fs_info->trans_lock);
+ return 0;
+ }
+
+ delayed_refs =
+ &root->fs_info->running_transaction->delayed_refs;
+
+ num_entries = atomic_read(&delayed_refs->num_entries);
+ num_heads = delayed_refs->num_heads;
+
+ spin_unlock(&root->fs_info->trans_lock);
+
+ /* The worst case */
+ num_bytes = (num_entries - num_heads) *
+ btrfs_calc_trans_metadata_size(root, 1);
+ }
spin_lock(&global_rsv->lock);
if (global_rsv->reserved <= num_bytes)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index c520e13..72c0caa 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -747,6 +747,12 @@ int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
&cur_trans->ordered_operations);
spin_unlock(&root->fs_info->ordered_root_lock);
+ if (cur_trans->blocked) {
+ cur_trans->blocked = 0;
+ if (waitqueue_active(&cur_trans->commit_wait))
+ wake_up(&cur_trans->commit_wait);
+ }
+
work = btrfs_alloc_delalloc_work(inode, wait, 1);
if (!work) {
spin_lock(&root->fs_info->ordered_root_lock);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index a04707f..9937eb2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -215,6 +215,7 @@ loop:
cur_trans->transid = fs_info->generation;
fs_info->running_transaction = cur_trans;
cur_trans->aborted = 0;
+ cur_trans->blocked = 1;
spin_unlock(&fs_info->trans_lock);
return 0;
@@ -329,6 +330,27 @@ static void wait_current_trans(struct btrfs_root *root)
wait_event(root->fs_info->transaction_wait,
cur_trans->state >= TRANS_STATE_UNBLOCKED ||
cur_trans->aborted);
+
+ btrfs_put_transaction(cur_trans);
+ } else {
+ spin_unlock(&root->fs_info->trans_lock);
+ }
+}
+
+static noinline void wait_current_trans_for_commit(struct btrfs_root *root)
+{
+ struct btrfs_transaction *cur_trans;
+
+ spin_lock(&root->fs_info->trans_lock);
+ cur_trans = root->fs_info->running_transaction;
+ if (cur_trans && is_transaction_blocked(cur_trans)) {
+ atomic_inc(&cur_trans->use_count);
+ spin_unlock(&root->fs_info->trans_lock);
+
+ wait_event(cur_trans->commit_wait,
+ cur_trans->state >= TRANS_STATE_COMPLETED ||
+ cur_trans->aborted || cur_trans->blocked == 0);
+
btrfs_put_transaction(cur_trans);
} else {
spin_unlock(&root->fs_info->trans_lock);
@@ -436,6 +458,25 @@ again:
if (may_wait_transaction(root, type))
wait_current_trans(root);
+ /*
+ * In the case of dedupe, we need to throttle delayed refs at the
+ * very start stage, otherwise we'd run into ENOSPC because more
+ * delayed refs are added while processing delayed refs.
+ */
+ if (root->fs_info->dedup_bs > 0 && type == TRANS_JOIN &&
+ btrfs_check_space_for_delayed_refs(NULL, root)) {
+ struct btrfs_transaction *cur_trans;
+
+ spin_lock(&root->fs_info->trans_lock);
+ cur_trans = root->fs_info->running_transaction;
+ if (cur_trans && cur_trans->state == TRANS_STATE_RUNNING)
+ cur_trans->state = TRANS_STATE_BLOCKED;
+ spin_unlock(&root->fs_info->trans_lock);
+
+ wake_up_process(root->fs_info->transaction_kthread);
+ wait_current_trans_for_commit(root);
+ }
+
do {
ret = join_transaction(root, type);
if (ret == -EBUSY) {
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 6ac037e..ac58d43 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -59,6 +59,7 @@ struct btrfs_transaction {
struct list_head pending_chunks;
struct btrfs_delayed_ref_root delayed_refs;
int aborted;
+ int blocked;
};
#define __TRANS_FREEZABLE (1U << 0)
--
1.8.1.4
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v5] Btrfs-progs: add dedup subcommand
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (15 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v10 16/16] Btrfs: fix dedup enospc problem Liu Bo
@ 2014-04-10 3:48 ` Liu Bo
2014-04-10 9:08 ` [RFC PATCH v10 00/16] Online(inband) data deduplication Konstantinos Skarlatos
` (2 subsequent siblings)
19 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 3:48 UTC (permalink / raw)
To: linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
This adds the deduplication subcommands, 'btrfs dedup <command> <path>',
including enable/disable/on/off.
- btrfs dedup enable
Creates the dedup tree; this is the very first step when you are going to use
the dedup feature.
- btrfs dedup disable
Deletes the dedup tree; after this, dedup can no longer be used unless you
enable it again.
- btrfs dedup on [-b]
Switches on the dedup feature temporarily; this is the second step of applying
dedup to writes. Option '-b' sets the dedup blocksize.
The default blocksize is 8192 (no special reason, you may argue), and the current
limit is [4096, 128 * 1024], because 4K is the generic page size and 128K is the
upper limit of btrfs's compression.
- btrfs dedup off
Switches off the dedup feature temporarily; the dedup tree remains.
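The blocksize limits above can be captured in a small standalone check. This
is only a sketch of the stated policy; the constant names are illustrative and
do not come from the btrfs sources.

```c
#include <stdint.h>

/* Illustrative constants matching the limits described above. */
#define DEDUP_DEFAULT_BS 8192ULL         /* default dedup blocksize */
#define DEDUP_MIN_BS     4096ULL         /* generic page size */
#define DEDUP_MAX_BS     (128ULL * 1024) /* btrfs compression upper limit */

/* Return 1 if 'bs' is an acceptable dedup blocksize, 0 otherwise. */
static int dedup_bs_in_range(uint64_t bs)
{
    return bs >= DEDUP_MIN_BS && bs <= DEDUP_MAX_BS;
}
```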
---------------------------------------------------------
Usage:
Step 1: btrfs dedup enable /btrfs
Step 2: btrfs dedup on /btrfs or btrfs dedup on -b 4K /btrfs
Step 3: now we have dedup, run your test.
Step 4: btrfs dedup off /btrfs
Step 5: btrfs dedup disable /btrfs
---------------------------------------------------------
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
v5: rebase onto the latest btrfs-progs v3.14.
v4: rebase and reserve spare space in btrfs_ioctl_dedup_args struct.
v3: add commands 'btrfs dedup on/off'
v2: add manpage
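The spare space mentioned in the v4 note pads the ioctl argument struct to
exactly 128 bytes (two used 64-bit fields plus 14 reserved ones), so the
interface can grow compatibly. A standalone sketch of the layout (the struct
name here is illustrative, not the real one):

```c
#include <stdint.h>

/*
 * Sketch of the ioctl argument layout: two used 64-bit fields plus
 * 14 reserved ones, padding the struct to exactly 128 bytes.
 */
struct dedup_args_sketch {
    uint64_t cmd;        /* BTRFS_DEDUP_CTL_{ENABLE,DISABLE,SET_BS} */
    uint64_t bs;         /* dedup blocksize, used with SET_BS */
    uint64_t unused[14]; /* spare space for future extensions */
};
```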
Makefile | 2 +-
btrfs.c | 1 +
cmds-dedup.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
commands.h | 2 +
ctree.h | 2 +
ioctl.h | 12 ++++
man/btrfs.8.in | 31 +++++++++-
7 files changed, 224 insertions(+), 4 deletions(-)
create mode 100644 cmds-dedup.c
diff --git a/Makefile b/Makefile
index da05197..369df6c 100644
--- a/Makefile
+++ b/Makefile
@@ -14,7 +14,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
- cmds-property.o
+ cmds-property.o cmds-dedup.o
libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
uuid-tree.o
libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
diff --git a/btrfs.c b/btrfs.c
index 98ff6f5..16458ef 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -256,6 +256,7 @@ static const struct cmd_group btrfs_cmd_group = {
{ "quota", cmd_quota, NULL, &quota_cmd_group, 0 },
{ "qgroup", cmd_qgroup, NULL, &qgroup_cmd_group, 0 },
{ "replace", cmd_replace, NULL, &replace_cmd_group, 0 },
+ { "dedup", cmd_dedup, NULL, &dedup_cmd_group, 0 },
{ "help", cmd_help, cmd_help_usage, NULL, 0 },
{ "version", cmd_version, cmd_version_usage, NULL, 0 },
NULL_CMD_STRUCT
diff --git a/cmds-dedup.c b/cmds-dedup.c
new file mode 100644
index 0000000..b959349
--- /dev/null
+++ b/cmds-dedup.c
@@ -0,0 +1,178 @@
+/*
+ * Copyright (C) 2013 Oracle. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <getopt.h>
+
+#include "ctree.h"
+#include "ioctl.h"
+
+#include "commands.h"
+#include "utils.h"
+
+static const char * const dedup_cmd_group_usage[] = {
+ "btrfs dedup <command> [options] <path>",
+ NULL
+};
+
+int dedup_ctl(char *path, struct btrfs_ioctl_dedup_args *args)
+{
+ int ret = 0;
+ int fd;
+ int e;
+ DIR *dirstream = NULL;
+
+ fd = open_file_or_dir(path, &dirstream);
+ if (fd < 0) {
+ fprintf(stderr, "ERROR: can't access '%s'\n", path);
+ return -EACCES;
+ }
+
+ ret = ioctl(fd, BTRFS_IOC_DEDUP_CTL, args);
+ e = errno;
+ close_file_or_dir(fd, dirstream);
+ if (ret < 0) {
+ fprintf(stderr, "ERROR: dedup command failed: %s\n",
+ strerror(e));
+ if (args->cmd == BTRFS_DEDUP_CTL_DISABLE ||
+ args->cmd == BTRFS_DEDUP_CTL_SET_BS)
+ fprintf(stderr, "please refer to 'dmesg | tail' for more info\n");
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static const char * const cmd_dedup_enable_usage[] = {
+ "btrfs dedup enable <path>",
+ "Enable data deduplication support for a filesystem.",
+ NULL
+};
+
+static int cmd_dedup_enable(int argc, char **argv)
+{
+ struct btrfs_ioctl_dedup_args dargs;
+
+ if (check_argc_exact(argc, 2))
+ usage(cmd_dedup_enable_usage);
+
+ dargs.cmd = BTRFS_DEDUP_CTL_ENABLE;
+
+ return dedup_ctl(argv[1], &dargs);
+}
+
+static const char * const cmd_dedup_disable_usage[] = {
+ "btrfs dedup disable <path>",
+ "Disable data deduplication support for a filesystem.",
+ NULL
+};
+
+static int cmd_dedup_disable(int argc, char **argv)
+{
+ struct btrfs_ioctl_dedup_args dargs;
+
+ if (check_argc_exact(argc, 2))
+ usage(cmd_dedup_disable_usage);
+
+ dargs.cmd = BTRFS_DEDUP_CTL_DISABLE;
+
+ return dedup_ctl(argv[1], &dargs);
+}
+
+static int dedup_set_bs(char *path, struct btrfs_ioctl_dedup_args *dargs)
+{
+ return dedup_ctl(path, dargs);
+}
+
+static const char * const cmd_dedup_on_usage[] = {
+ "btrfs dedup on [-b|--bs size] <path>",
+ "Switch on data deduplication or change the dedup blocksize.",
+ "",
+ "-b|--bs <size> set dedup blocksize",
+ NULL
+};
+
+static struct option longopts[] = {
+ {"bs", required_argument, NULL, 'b'},
+ {0, 0, 0, 0}
+};
+
+static int cmd_dedup_on(int argc, char **argv)
+{
+ struct btrfs_ioctl_dedup_args dargs;
+ u64 bs = 8192;
+
+ optind = 1;
+ while (1) {
+ int longindex;
+
+ int c = getopt_long(argc, argv, "b:", longopts, &longindex);
+ if (c < 0)
+ break;
+
+ switch (c) {
+ case 'b':
+ bs = parse_size(optarg);
+ break;
+ default:
+ usage(cmd_dedup_on_usage);
+ }
+ }
+
+ if (check_argc_exact(argc - optind, 1))
+ usage(cmd_dedup_on_usage);
+
+ dargs.cmd = BTRFS_DEDUP_CTL_SET_BS;
+ dargs.bs = bs;
+
+ return dedup_set_bs(argv[optind], &dargs);
+}
+
+static const char * const cmd_dedup_off_usage[] = {
+ "btrfs dedup off <path>",
+ "Switch off data deduplication.",
+ NULL
+};
+
+static int cmd_dedup_off(int argc, char **argv)
+{
+ struct btrfs_ioctl_dedup_args dargs;
+
+ if (check_argc_exact(argc, 2))
+ usage(cmd_dedup_off_usage);
+
+ dargs.cmd = BTRFS_DEDUP_CTL_SET_BS;
+ dargs.bs = 0;
+
+ return dedup_set_bs(argv[1], &dargs);
+}
+
+const struct cmd_group dedup_cmd_group = {
+ dedup_cmd_group_usage, NULL, {
+ { "enable", cmd_dedup_enable, cmd_dedup_enable_usage, NULL, 0 },
+ { "disable", cmd_dedup_disable, cmd_dedup_disable_usage, NULL, 0 },
+ { "on", cmd_dedup_on, cmd_dedup_on_usage, NULL, 0},
+ { "off", cmd_dedup_off, cmd_dedup_off_usage, NULL, 0},
+ { 0, 0, 0, 0, 0 }
+ }
+};
+
+int cmd_dedup(int argc, char **argv)
+{
+ return handle_command_group(&dedup_cmd_group, argc, argv);
+}
diff --git a/commands.h b/commands.h
index db70043..ef2a341 100644
--- a/commands.h
+++ b/commands.h
@@ -92,6 +92,7 @@ extern const struct cmd_group quota_cmd_group;
extern const struct cmd_group qgroup_cmd_group;
extern const struct cmd_group replace_cmd_group;
extern const struct cmd_group rescue_cmd_group;
+extern const struct cmd_group dedup_cmd_group;
extern const char * const cmd_send_usage[];
extern const char * const cmd_receive_usage[];
@@ -121,6 +122,7 @@ int cmd_select_super(int argc, char **argv);
int cmd_dump_super(int argc, char **argv);
int cmd_debug_tree(int argc, char **argv);
int cmd_rescue(int argc, char **argv);
+int cmd_dedup(int argc, char **argv);
/* subvolume exported functions */
int test_issubvolume(char *path);
diff --git a/ctree.h b/ctree.h
index 9b461af..efedbab 100644
--- a/ctree.h
+++ b/ctree.h
@@ -471,6 +471,7 @@ struct btrfs_super_block {
#define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7)
#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
#define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_DEDUP (1ULL << 10)
#define BTRFS_FEATURE_COMPAT_SUPP 0ULL
@@ -484,6 +485,7 @@ struct btrfs_super_block {
BTRFS_FEATURE_INCOMPAT_RAID56 | \
BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \
BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA | \
+ BTRFS_FEATURE_INCOMPAT_DEDUP | \
BTRFS_FEATURE_INCOMPAT_NO_HOLES)
/*
diff --git a/ioctl.h b/ioctl.h
index 402317f..6d86395 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -430,6 +430,16 @@ struct btrfs_ioctl_get_dev_stats {
__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
};
+/* deduplication control ioctl modes */
+#define BTRFS_DEDUP_CTL_ENABLE 1
+#define BTRFS_DEDUP_CTL_DISABLE 2
+#define BTRFS_DEDUP_CTL_SET_BS 3
+struct btrfs_ioctl_dedup_args {
+ __u64 cmd;
+ __u64 bs;
+ __u64 unused[14]; /* pad to 128 bytes */
+};
+
/* BTRFS_IOC_SNAP_CREATE is no longer used by the btrfs command */
#define BTRFS_QUOTA_CTL_ENABLE 1
#define BTRFS_QUOTA_CTL_DISABLE 2
@@ -596,6 +606,8 @@ struct btrfs_ioctl_clone_range_args {
struct btrfs_ioctl_get_dev_stats)
#define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
struct btrfs_ioctl_dev_replace_args)
+#define BTRFS_IOC_DEDUP_CTL _IOWR(BTRFS_IOCTL_MAGIC, 55, \
+ struct btrfs_ioctl_dedup_args)
#define BTRFS_IOC_GET_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \
struct btrfs_ioctl_feature_flags)
#define BTRFS_IOC_SET_FEATURES _IOW(BTRFS_IOCTL_MAGIC, 57, \
diff --git a/man/btrfs.8.in b/man/btrfs.8.in
index 8fea115..dd65fce 100644
--- a/man/btrfs.8.in
+++ b/man/btrfs.8.in
@@ -109,13 +109,22 @@ btrfs \- control a btrfs filesystem
.PP
\fBbtrfs\fP \fBqgroup limit\fP [\fIoptions\fP] \fI<size>\fP|\fBnone\fP [\fI<qgroupid>\fP] \fI<path>\fP
.PP
-.PP
\fBbtrfs\fP \fBreplace start\fP [-Bfr] \fI<srcdev>\fP|\fI<devid> <targetdev> <mount_point>\fP
.PP
\fBbtrfs\fP \fBreplace status\fP [-1] \fI<mount_point>\fP
.PP
\fBbtrfs\fP \fBreplace cancel\fP \fI<mount_point>\fP
.PP
+\fBbtrfs\fP \fBdedup enable\fP \fI<path>\fP
+.PP
+\fBbtrfs\fP \fBdedup disable\fP \fI<path>\fP
+.PP
+\fBbtrfs\fP \fBdedup on\fP [-b|--bs \fIsize\fP] \fI<path>\fP
+.PP
+\fBbtrfs\fP \fBdedup off\fP \fI<path>\fP
+.PP
+.PP
+
\fBbtrfs\fP \fBhelp|\-\-help \fP
.PP
\fBbtrfs\fP \fB<command> \-\-help \fP
@@ -764,12 +773,28 @@ Print status and progress information of a running device replace operation.
.IP "\fB-1\fP" 5
print once instead of print continuously until the replace
operation finishes (or is canceled)
-.RE
-.TP
\fBreplace cancel\fR \fI<mount_point>\fR
Cancel a running device replace operation.
.RE
+.TP
+
+\fBdedup enable\fP \fI<path>\fP
+Enable data deduplication support for a filesystem.
+.TP
+
+\fBdedup disable\fP \fI<path>\fP
+Disable data deduplication support for a filesystem.
+.TP
+
+\fBdedup on\fP [-b|--bs \fIsize\fP] \fI<path>\fP
+Switch on data deduplication or change the dedup blocksize.
+.TP
+
+\fBdedup off\fP \fI<path>\fP
+Switch off data deduplication.
+.RE
+.TP
.SH EXIT STATUS
\fBbtrfs\fR returns a zero exit status if it succeeds. Non-zero is returned in
--
1.8.2.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [RFC PATCH v10 00/16] Online(inband) data deduplication
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (16 preceding siblings ...)
2014-04-10 3:48 ` [PATCH v5] Btrfs-progs: add dedup subcommand Liu Bo
@ 2014-04-10 9:08 ` Konstantinos Skarlatos
2014-04-10 15:44 ` Liu Bo
2014-04-10 15:55 ` Liu Bo
2014-04-14 8:41 ` Test results for " Konstantinos Skarlatos
19 siblings, 1 reply; 25+ messages in thread
From: Konstantinos Skarlatos @ 2014-04-10 9:08 UTC (permalink / raw)
To: Liu Bo, linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, David Sterba,
Martin Steigerwald, Josef Bacik, Chris Mason
On 10/4/2014 6:48 πμ, Liu Bo wrote:
> Hello,
>
> This the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel.
>
> Data deduplication is a specialized data compression technique for eliminating
> duplicate copies of repeating data.[1]
>
> This patch set is also related to "Content based storage" in project ideas[2],
> it introduces inband data deduplication for btrfs and dedup/dedupe is for short.
>
> * PATCH 1 is a speed-up improvement, which is about dedup and quota.
>
> * PATCH 2-5 is the preparation work for dedup implementation.
>
> * PATCH 6 shows how we implement dedup feature.
>
> * PATCH 7 fixes a backref walking bug with dedup.
>
> * PATCH 8 fixes a free space bug of dedup extents on error handling.
>
> * PATCH 9 adds the ioctl to control dedup feature.
>
> * PATCH 10 targets delayed refs' scalability problem of deleting refs, which is
> uncovered by the dedup feature.
>
> * PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
> transaction abortion and crash.
>
> * btrfs-progs patch(PATCH 17) offers all details about how to control the
> dedup feature on progs side.
>
> I've tested this with xfstests by adding a inline dedup 'enable & on' in xfstests'
> mount and scratch_mount.
>
>
> ***NOTE***
> Known bugs:
> * Mounting with options "flushoncommit" and enabling dedupe feature will end up
> with _deadlock_.
>
>
> TODO:
> * a bit-to-bit comparison callback.
>
> All comments are welcome!
Hi Liu,
Thanks for doing this work.
I tested your previous patches a few months ago, and will now test the
new ones. One question about memory requirements: are they in the same
league as ZFS dedup (i.e. needing tens of GB of RAM for multi-TB
filesystems), or are they more reasonable?
Thanks
>
>
> [1]: http://en.wikipedia.org/wiki/Data_deduplication
> [2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage
>
> v10:
> - fix a typo in the subject line.
> - update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
> 'Inappropriate ioctl for device'.
>
> v9:
> - fix a deadlock and a crash reported by users.
> - fix the metadata ENOSPC problem with dedup again.
>
> v8:
> - fix the race crash of dedup ref again.
> - fix the metadata ENOSPC problem with dedup.
>
> v7:
> - rebase onto the lastest btrfs
> - break a big patch into smaller ones to make reviewers happy.
> - kill mount options of dedup and use ioctl method instead.
> - fix two crash due to the special dedup ref
>
> For former patch sets:
> v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
> v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
> v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
> v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
> v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959
>
> Liu Bo (16):
> Btrfs: disable qgroups accounting when quota_enable is 0
> Btrfs: introduce dedup tree and relatives
> Btrfs: introduce dedup tree operations
> Btrfs: introduce dedup state
> Btrfs: make ordered extent aware of dedup
> Btrfs: online(inband) data dedup
> Btrfs: skip dedup reference during backref walking
> Btrfs: don't return space for dedup extent
> Btrfs: add ioctl of dedup control
> Btrfs: improve the delayed refs process in rm case
> Btrfs: fix a crash of dedup ref
> Btrfs: fix deadlock of dedup work
> Btrfs: fix transactin abortion in __btrfs_free_extent
> Btrfs: fix wrong pinned bytes in __btrfs_free_extent
> Btrfs: use total_bytes instead of bytes_used for global_rsv
> Btrfs: fix dedup enospc problem
>
> fs/btrfs/backref.c | 9 +
> fs/btrfs/ctree.c | 2 +-
> fs/btrfs/ctree.h | 86 ++++++
> fs/btrfs/delayed-ref.c | 26 +-
> fs/btrfs/delayed-ref.h | 3 +
> fs/btrfs/disk-io.c | 37 +++
> fs/btrfs/extent-tree.c | 235 +++++++++++++---
> fs/btrfs/extent_io.c | 22 +-
> fs/btrfs/extent_io.h | 16 ++
> fs/btrfs/file-item.c | 244 +++++++++++++++++
> fs/btrfs/inode.c | 635 ++++++++++++++++++++++++++++++++++++++-----
> fs/btrfs/ioctl.c | 167 ++++++++++++
> fs/btrfs/ordered-data.c | 44 ++-
> fs/btrfs/ordered-data.h | 13 +-
> fs/btrfs/qgroup.c | 3 +
> fs/btrfs/relocation.c | 3 +
> fs/btrfs/transaction.c | 41 +++
> fs/btrfs/transaction.h | 1 +
> include/trace/events/btrfs.h | 3 +-
> include/uapi/linux/btrfs.h | 12 +
> 20 files changed, 1471 insertions(+), 131 deletions(-)
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH v10 00/16] Online(inband) data deduplication
2014-04-10 9:08 ` [RFC PATCH v10 00/16] Online(inband) data deduplication Konstantinos Skarlatos
@ 2014-04-10 15:44 ` Liu Bo
0 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-10 15:44 UTC (permalink / raw)
To: Konstantinos Skarlatos
Cc: linux-btrfs, Marcel Ritter, Christian Robert, alanqk,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
On Thu, Apr 10, 2014 at 12:08:17PM +0300, Konstantinos Skarlatos wrote:
> On 10/4/2014 6:48 πμ, Liu Bo wrote:
> >Hello,
> >
> >This the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel.
> >
> >Data deduplication is a specialized data compression technique for eliminating
> >duplicate copies of repeating data.[1]
> >
> >This patch set is also related to "Content based storage" in project ideas[2],
> >it introduces inband data deduplication for btrfs and dedup/dedupe is for short.
> >
> >* PATCH 1 is a speed-up improvement, which is about dedup and quota.
> >
> >* PATCH 2-5 is the preparation work for dedup implementation.
> >
> >* PATCH 6 shows how we implement dedup feature.
> >
> >* PATCH 7 fixes a backref walking bug with dedup.
> >
> >* PATCH 8 fixes a free space bug of dedup extents on error handling.
> >
> >* PATCH 9 adds the ioctl to control dedup feature.
> >
> >* PATCH 10 targets delayed refs' scalability problem of deleting refs, which is
> > uncovered by the dedup feature.
> >
> >* PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
> > transaction abortion and crash.
> >
> >* btrfs-progs patch(PATCH 17) offers all details about how to control the
> > dedup feature on progs side.
> >
> >I've tested this with xfstests by adding a inline dedup 'enable & on' in xfstests'
> >mount and scratch_mount.
> >
> >
> >***NOTE***
> >Known bugs:
> >* Mounting with options "flushoncommit" and enabling dedupe feature will end up
> > with _deadlock_.
> >
> >
> >TODO:
> >* a bit-to-bit comparison callback.
> >
> >All comments are welcome!
> Hi Liu,
> Thanks for doing this work.
> I tested your previous patches a few months ago, and will now test
> the new ones. One question about memory requirements, are they in
> the same league as ZFS dedup (ie needing 10's of gb of RAM for multi
> TB filesystems) or are they more reasonable?
> Thanks
Hi Konstantinos,
It depends on Linux's native memory management, which can reclaim memory when
memory runs low; but still, it can lead to high memory pressure according to my
experiments.
Thanks for testing it!
-liubo
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH v10 00/16] Online(inband) data deduplication
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (17 preceding siblings ...)
2014-04-10 9:08 ` [RFC PATCH v10 00/16] Online(inband) data deduplication Konstantinos Skarlatos
@ 2014-04-10 15:55 ` Liu Bo
2014-04-11 9:28 ` Martin Steigerwald
2014-04-14 8:41 ` Test results for " Konstantinos Skarlatos
19 siblings, 1 reply; 25+ messages in thread
From: Liu Bo @ 2014-04-10 15:55 UTC (permalink / raw)
To: linux-btrfs, Andrea Gelmini
Cc: Marcel Ritter, Christian Robert, alanqk, Konstantinos Skarlatos,
David Sterba, Martin Steigerwald, Josef Bacik, Chris Mason
Hi,
Just FYI, these patches are also available at the following locations,
kernel:
https://github.com/liubogithub/btrfs-work.git dedup-on-3.14-linux
progs:
https://github.com/liubogithub/btrfs-progs.git dedup
thanks,
-liubo
On Thu, Apr 10, 2014 at 11:48:30AM +0800, Liu Bo wrote:
> Hello,
>
> This the 10th attempt for in-band data dedupe, based on Linux _3.14_ kernel.
>
> Data deduplication is a specialized data compression technique for eliminating
> duplicate copies of repeating data.[1]
>
> This patch set is also related to "Content based storage" in project ideas[2],
> it introduces inband data deduplication for btrfs and dedup/dedupe is for short.
>
> * PATCH 1 is a speed-up improvement, which is about dedup and quota.
>
> * PATCH 2-5 is the preparation work for dedup implementation.
>
> * PATCH 6 shows how we implement dedup feature.
>
> * PATCH 7 fixes a backref walking bug with dedup.
>
> * PATCH 8 fixes a free space bug of dedup extents on error handling.
>
> * PATCH 9 adds the ioctl to control dedup feature.
>
> * PATCH 10 targets delayed refs' scalability problem of deleting refs, which is
> uncovered by the dedup feature.
>
> * PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
> transaction abortion and crash.
>
> * btrfs-progs patch(PATCH 17) offers all details about how to control the
> dedup feature on progs side.
>
> I've tested this with xfstests by adding a inline dedup 'enable & on' in xfstests'
> mount and scratch_mount.
>
>
> ***NOTE***
> Known bugs:
> * Mounting with options "flushoncommit" and enabling dedupe feature will end up
> with _deadlock_.
>
>
> TODO:
> * a bit-to-bit comparison callback.
>
> All comments are welcome!
>
>
> [1]: http://en.wikipedia.org/wiki/Data_deduplication
> [2]: https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_storage
>
> v10:
> - fix a typo in the subject line.
> - update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
> 'Inappropriate ioctl for device'.
>
> v9:
> - fix a deadlock and a crash reported by users.
> - fix the metadata ENOSPC problem with dedup again.
>
> v8:
> - fix the race crash of dedup ref again.
> - fix the metadata ENOSPC problem with dedup.
>
> v7:
> - rebase onto the lastest btrfs
> - break a big patch into smaller ones to make reviewers happy.
> - kill mount options of dedup and use ioctl method instead.
> - fix two crash due to the special dedup ref
>
> For former patch sets:
> v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
> v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
> v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
> v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
> v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959
>
> Liu Bo (16):
> Btrfs: disable qgroups accounting when quota_enable is 0
> Btrfs: introduce dedup tree and relatives
> Btrfs: introduce dedup tree operations
> Btrfs: introduce dedup state
> Btrfs: make ordered extent aware of dedup
> Btrfs: online(inband) data dedup
> Btrfs: skip dedup reference during backref walking
> Btrfs: don't return space for dedup extent
> Btrfs: add ioctl of dedup control
> Btrfs: improve the delayed refs process in rm case
> Btrfs: fix a crash of dedup ref
> Btrfs: fix deadlock of dedup work
> Btrfs: fix transactin abortion in __btrfs_free_extent
> Btrfs: fix wrong pinned bytes in __btrfs_free_extent
> Btrfs: use total_bytes instead of bytes_used for global_rsv
> Btrfs: fix dedup enospc problem
>
> fs/btrfs/backref.c | 9 +
> fs/btrfs/ctree.c | 2 +-
> fs/btrfs/ctree.h | 86 ++++++
> fs/btrfs/delayed-ref.c | 26 +-
> fs/btrfs/delayed-ref.h | 3 +
> fs/btrfs/disk-io.c | 37 +++
> fs/btrfs/extent-tree.c | 235 +++++++++++++---
> fs/btrfs/extent_io.c | 22 +-
> fs/btrfs/extent_io.h | 16 ++
> fs/btrfs/file-item.c | 244 +++++++++++++++++
> fs/btrfs/inode.c | 635 ++++++++++++++++++++++++++++++++++++++-----
> fs/btrfs/ioctl.c | 167 ++++++++++++
> fs/btrfs/ordered-data.c | 44 ++-
> fs/btrfs/ordered-data.h | 13 +-
> fs/btrfs/qgroup.c | 3 +
> fs/btrfs/relocation.c | 3 +
> fs/btrfs/transaction.c | 41 +++
> fs/btrfs/transaction.h | 1 +
> include/trace/events/btrfs.h | 3 +-
> include/uapi/linux/btrfs.h | 12 +
> 20 files changed, 1471 insertions(+), 131 deletions(-)
>
> --
> 1.8.2.1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH v10 00/16] Online(inband) data deduplication
2014-04-10 15:55 ` Liu Bo
@ 2014-04-11 9:28 ` Martin Steigerwald
2014-04-11 9:51 ` Liu Bo
0 siblings, 1 reply; 25+ messages in thread
From: Martin Steigerwald @ 2014-04-11 9:28 UTC (permalink / raw)
To: bo.li.liu
Cc: linux-btrfs, Andrea Gelmini, Marcel Ritter, Christian Robert,
alanqk, Konstantinos Skarlatos, David Sterba, Josef Bacik,
Chris Mason
Hi Liu,
Am Donnerstag, 10. April 2014, 23:55:21 schrieb Liu Bo:
> Hi,
>
> Just FYI, these patches are also available on the following site,
>
> kernel:
> https://github.com/liubogithub/btrfs-work.git dedup-on-3.14-linux
>
> progs:
> https://github.com/liubogithub/btrfs-progs.git dedup
I bet it's best to test it only with test data so far, or would you consider it
safe enough to test on production data already?
Fortunately, since I added that additional mSATA SSD, I have some spare storage
to put a test setup into.
Thanks,
Martin
> thanks,
> -liubo
>
> On Thu, Apr 10, 2014 at 11:48:30AM +0800, Liu Bo wrote:
> > Hello,
> >
> > This the 10th attempt for in-band data dedupe, based on Linux _3.14_
> > kernel.
> >
> > Data deduplication is a specialized data compression technique for
> > eliminating duplicate copies of repeating data.[1]
> >
> > This patch set is also related to "Content based storage" in project
> > ideas[2], it introduces inband data deduplication for btrfs and
> > dedup/dedupe is for short.
> >
> > * PATCH 1 is a speed-up improvement, which is about dedup and quota.
> >
> > * PATCH 2-5 is the preparation work for dedup implementation.
> >
> > * PATCH 6 shows how we implement dedup feature.
> >
> > * PATCH 7 fixes a backref walking bug with dedup.
> >
> > * PATCH 8 fixes a free space bug of dedup extents on error handling.
> >
> > * PATCH 9 adds the ioctl to control dedup feature.
> >
> > * PATCH 10 targets delayed refs' scalability problem of deleting refs,
> > which is>
> > uncovered by the dedup feature.
> >
> > * PATCH 11-16 fixes bugs of dedupe including race bug, deadlock, abnormal
> >
> > transaction abortion and crash.
> >
> > * The btrfs-progs patch (PATCH 17) offers all details about how to control
> > the dedup feature on the progs side.
> >
> > I've tested this with xfstests by adding an inline dedup 'enable & on' to
> > xfstests' mount and scratch_mount.
> >
> >
> > ***NOTE***
> > Known bugs:
> > * Mounting with options "flushoncommit" and enabling dedupe feature will
> > end up>
> > with _deadlock_.
> >
> > TODO:
> > * a bit-by-bit comparison callback.
> >
> > All comments are welcome!
> >
> >
> > [1]: http://en.wikipedia.org/wiki/Data_deduplication
> > [2]:
> > https://btrfs.wiki.kernel.org/index.php/Project_ideas#Content_based_stora
> > ge
> >
> > v10:
> > - fix a typo in the subject line.
> > - update struct 'btrfs_ioctl_dedup_args' in the kernel side to fix
> >
> > 'Inappropriate ioctl for device'.
> >
> > v9:
> > - fix a deadlock and a crash reported by users.
> > - fix the metadata ENOSPC problem with dedup again.
> >
> > v8:
> > - fix the race crash of dedup ref again.
> > - fix the metadata ENOSPC problem with dedup.
> >
> > v7:
> > - rebase onto the latest btrfs
> > - break a big patch into smaller ones to make reviewers happy.
> > - kill mount options of dedup and use ioctl method instead.
> > - fix two crashes due to the special dedup ref
> >
> > For former patch sets:
> > v6: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27512
> > v5: http://thread.gmane.org/gmane.comp.file-systems.btrfs/27257
> > v4: http://thread.gmane.org/gmane.comp.file-systems.btrfs/25751
> > v3: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25433
> > v2: http://comments.gmane.org/gmane.comp.file-systems.btrfs/24959
> >
> > Liu Bo (16):
> > Btrfs: disable qgroups accounting when quota_enable is 0
> > Btrfs: introduce dedup tree and relatives
> > Btrfs: introduce dedup tree operations
> > Btrfs: introduce dedup state
> > Btrfs: make ordered extent aware of dedup
> > Btrfs: online(inband) data dedup
> > Btrfs: skip dedup reference during backref walking
> > Btrfs: don't return space for dedup extent
> > Btrfs: add ioctl of dedup control
> > Btrfs: improve the delayed refs process in rm case
> > Btrfs: fix a crash of dedup ref
> > Btrfs: fix deadlock of dedup work
> > Btrfs: fix transactin abortion in __btrfs_free_extent
> > Btrfs: fix wrong pinned bytes in __btrfs_free_extent
> > Btrfs: use total_bytes instead of bytes_used for global_rsv
> > Btrfs: fix dedup enospc problem
> >
> > fs/btrfs/backref.c | 9 +
> > fs/btrfs/ctree.c | 2 +-
> > fs/btrfs/ctree.h | 86 ++++++
> > fs/btrfs/delayed-ref.c | 26 +-
> > fs/btrfs/delayed-ref.h | 3 +
> > fs/btrfs/disk-io.c | 37 +++
> > fs/btrfs/extent-tree.c | 235 +++++++++++++---
> > fs/btrfs/extent_io.c | 22 +-
> > fs/btrfs/extent_io.h | 16 ++
> > fs/btrfs/file-item.c | 244 +++++++++++++++++
> > fs/btrfs/inode.c | 635 ++++++++++++++++++++++++++++++++++++++-----
> > fs/btrfs/ioctl.c | 167 ++++++++++++
> > fs/btrfs/ordered-data.c | 44 ++-
> > fs/btrfs/ordered-data.h | 13 +-
> > fs/btrfs/qgroup.c | 3 +
> > fs/btrfs/relocation.c | 3 +
> > fs/btrfs/transaction.c | 41 +++
> > fs/btrfs/transaction.h | 1 +
> > include/trace/events/btrfs.h | 3 +-
> > include/uapi/linux/btrfs.h | 12 +
> > 20 files changed, 1471 insertions(+), 131 deletions(-)
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC PATCH v10 00/16] Online(inband) data deduplication
2014-04-11 9:28 ` Martin Steigerwald
@ 2014-04-11 9:51 ` Liu Bo
0 siblings, 0 replies; 25+ messages in thread
From: Liu Bo @ 2014-04-11 9:51 UTC (permalink / raw)
To: Martin Steigerwald
Cc: linux-btrfs, Andrea Gelmini, Marcel Ritter, Christian Robert,
alanqk, Konstantinos Skarlatos, David Sterba, Josef Bacik,
Chris Mason
On Fri, Apr 11, 2014 at 11:28:48AM +0200, Martin Steigerwald wrote:
> Hi Liu,
>
> On Thursday, 10 April 2014, 23:55:21, Liu Bo wrote:
> > Hi,
> >
> > Just FYI, these patches are also available on the following site,
> >
> > kernel:
> > https://github.com/liubogithub/btrfs-work.git dedup-on-3.14-linux
> >
> > progs:
> > https://github.com/liubogithub/btrfs-progs.git dedup
>
> I bet it's good to only test it with test data so far, or would you consider it
> safe enough to test on production data already?
It's definitely at an experimental stage (it still has known bugs); that's why I
tagged it as 'RFC'.
>
> Fortunately since I added that additional mSATA SSD I have some spare storage
> to put some test setup into.
Thanks for testing it!
Thanks,
-liubo
> [remainder of quoted message snipped]
* Re: [RFC PATCH v10 00/16] Online(inband) data deduplication
@ 2014-04-11 18:00 Michael
0 siblings, 0 replies; 25+ messages in thread
From: Michael @ 2014-04-11 18:00 UTC (permalink / raw)
To: linux-btrfs
Hi Liu,
Thanks for your work.
Each test copies a 2 GB file from sdf (btrfs) to sde (btrfs with dedup, 4k
blocksize). Before every test I recreate the filesystem. On the second write
all is good.
Test 1
Nodesize = leafsize = 4k
Write overhead ~ x1.5
Test 2
Nodesize = leafsize = 16k
Write overhead ~ x19
Test 1:
fs created label (null) on /dev/sde
nodesize 4096 leafsize 4096 sectorsize 4096 size 931.32GiB
Btrfs v3.14-dirty
./btrfs dedup enable /mnt/backupkvm/ ; ./btrfs dedup on -b4k /mnt/backupkvm/
[root@hranilka2 ~]# cat /sys/block/sde/stat
157149 3096 1285992 128900 6472058 12962342 718660664
1511294062 0 1576195 1527264468
[root@hranilka2 ~]# cat /sys/block/sde/stat
157149 3096 1285992 128900 6526802 13268740 724690168
1512117500 0 1587233 1528090320
write sectors: 724690168 - 718660664 = 6029504 sector
write mbyte: 6029504 * 512 / 1024 / 1024 ~ 2944 mb
[root@hranilka2 ~]# cat /sys/block/sdf/stat
338633 165 346540680 7043771 7 4 1496 35
0 610795 7043095
[root@hranilka2 ~]# cat /sys/block/sdf/stat
342737 165 350743176 7127546 7 4 1496 35
0 618652 7126860
read sectors: 350743176 - 346540680 = 4202496 sector
read mbyte: 4202496 * 512 / 1024 / 1024 ~ 2052 mb
Test 2:
fs created label (null) on /dev/sde
nodesize 16384 leafsize 16384 sectorsize 4096 size 931.32GiB
Btrfs v3.14-dirty
./btrfs dedup enable /mnt/backupkvm/ ; ./btrfs dedup on -b4k /mnt/backupkvm/
[root@hranilka2 ~]# cat /sys/block/sde/stat
157440 3313 1290392 129277 6526912 13272046 724722864
1512117920 0 1587500 1528091116
[root@hranilka2 ~]# cat /sys/block/sde/stat
157440 3313 1290392 129277 6964494 13836479 803994376
1514935880 0 1730081 1530910351
write sectors: 803994376 - 724722864 = 79271512 sector
write mbyte: 79271512 * 512 / 1024 / 1024 ~ 38706 mb
[root@hranilka2 ~]# cat /sys/block/sdf/stat
342737 165 350743176 7127546 7 4 1496 35
0 618652 7126860
[root@hranilka2 ~]# cat /sys/block/sdf/stat
346841 165 354945672 7266231 7 4 1496 35
0 629824 7265539
read sectors: 354945672 - 350743176 = 4202496 sector
read mbyte: 4202496 * 512 / 1024 / 1024 ~ 2052 mb
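The sector-to-MiB arithmetic in both tests can be checked with a short sketch.
The stat values below are the ones quoted in this message; `/sys/block/<dev>/stat`
reports sectors written in field 7, and those sectors are always 512 bytes
regardless of the device's logical block size:

```python
# Back-of-envelope write-overhead check from /sys/block/<dev>/stat deltas.
SECTOR = 512  # /sys/block sector counts are in 512-byte units

def mib_written(sectors_before: int, sectors_after: int) -> float:
    """MiB written between two /sys/block/<dev>/stat samples."""
    return (sectors_after - sectors_before) * SECTOR / 2**20

# Destination disk (sde) values quoted in Test 1 and Test 2 above:
test1 = mib_written(718660664, 724690168)   # ~2944 MiB written
test2 = mib_written(724722864, 803994376)   # ~38706 MiB written

# Overhead relative to the ~2052 MiB read from the source disk (sdf):
print(round(test1 / 2052, 1), round(test2 / 2052, 1))  # ~1.4 vs ~18.9
```

This matches the ~x1.5 and ~x19 overhead figures above and shows the blow-up
comes from the 16k nodesize run, not from the data itself.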
--
Michael Serikov
* Test results for [RFC PATCH v10 00/16] Online(inband) data deduplication
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
` (18 preceding siblings ...)
2014-04-10 15:55 ` Liu Bo
@ 2014-04-14 8:41 ` Konstantinos Skarlatos
19 siblings, 0 replies; 25+ messages in thread
From: Konstantinos Skarlatos @ 2014-04-14 8:41 UTC (permalink / raw)
To: Liu Bo, linux-btrfs
Cc: Marcel Ritter, Christian Robert, alanqk, David Sterba,
Martin Steigerwald, Josef Bacik, Chris Mason
Hello,
Here are the test results from my testing of the latest patches of btrfs
dedup.
TL;DR:
I rsynced 10 separate copies of a 3.8GB folder with 138 RAW photographs
(23-36MiB each) onto a btrfs volume with dedup enabled.
On the first try, the copy was very slow, and a sync after it took over 10
minutes to complete. For the next copies sync was much faster, but still took
up to one minute to complete. The copy itself was quite slow until the fifth
try, when it went from 8MB/sec to 22-40MB/sec.
Each copy after the first consumed about 60-65MiB of metadata, or 120-130MiB
of raw space, since metadata is DUP.
Obvious question:
Can dedup recognize that 2 files are the same and dedup them at the file
level, saving much more space in the process?
In any case, I am very thankful for the work being done here, and I am
willing to help in any way I can.
AMD Phenom(tm) II X4 955 Processor
MemTotal: 8 GB
Hard Disk: Seagate Barracuda 7200.12 [160 GB]
kernel: 3.14.0-1-git
$ mkfs.btrfs /dev/loop0 -f && mount /storage/btrfs_dedup && \
  mount | grep dedup && \
  btrfs dedup enable /storage/btrfs_dedup && \
  btrfs dedup on /storage/btrfs_dedup && \
  for i in {01..10}; do \
      time rsync -a /storage/btrfs/costas/Photo_library/2014/ \
          /storage/btrfs_dedup/copy$i/ --stats && \
      time btrfs fi sync /storage/btrfs_dedup/ && \
      df /storage/btrfs_dedup/ && \
      btrfs fi df /storage/btrfs_dedup; \
  done && \
  time umount /storage/btrfs_dedup
/root/btrfs.img on /storage/btrfs_dedup type btrfs
(rw,noatime,nodiratime,space_cache)
sent 4,017,134,246 bytes received 2,689 bytes 8,274,226.44 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 21.85s user
45.04s system 13% cpu 8:05.48 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.36s system 0% cpu
10:43.27 total
/dev/loop1 46080 4119 40173 10% /storage/btrfs_dedup
Data, single: total=4.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=143.45MiB
sent 4,017,134,246 bytes received 2,689 bytes 8,956,827.06 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 21.29s user
42.32s system 14% cpu 7:28.74 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.01s system 0% cpu
4.173 total
/dev/loop1 46080 4250 40173 10% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=208.72MiB
sent 4,017,134,246 bytes received 2,689 bytes 9,691,524.57 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 20.95s user
31.69s system 12% cpu 6:54.90 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.00s system 0% cpu
3.254 total
/dev/loop1 46080 4371 40172 10% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=269.39MiB
sent 4,017,134,246 bytes received 2,689 bytes 9,037,428.43 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 20.54s user
36.70s system 12% cpu 7:23.93 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.01s system 0% cpu
5.578 total
/dev/loop1 46080 4497 40172 11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=331.98MiB
sent 4,017,134,246 bytes received 2,689 bytes 29,004,598.81 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 22.30s user
13.01s system 25% cpu 2:18.15 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.01s system 0% cpu
23.447 total
/dev/loop1 46080 4617 40172 11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=391.91MiB
sent 4,017,134,246 bytes received 2,689 bytes 39,971,511.79 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 21.60s user
11.85s system 33% cpu 1:39.74 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.01s system 0% cpu
32.178 total
/dev/loop1 46080 4747 40171 11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=456.48MiB
sent 4,017,134,246 bytes received 2,689 bytes 32,009,059.24 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 25.68s user
13.94s system 31% cpu 2:04.42 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.01s system 0% cpu
29.313 total
/dev/loop1 46080 4870 40171 11% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=518.09MiB
sent 4,017,134,246 bytes received 2,689 bytes 30,782,658.51 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 21.84s user
12.63s system 26% cpu 2:10.20 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.00s system 0% cpu
41.074 total
/dev/loop1 46080 4990 40171 12% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=578.16MiB
sent 4,017,134,246 bytes received 2,689 bytes 22,379,592.95 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 28.57s user
18.61s system 26% cpu 2:59.07 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.01s system 0% cpu
55.714 total
/dev/loop1 46080 5114 40171 12% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=639.95MiB
sent 4,017,134,246 bytes received 2,689 bytes 28,591,721.96 bytes/sec
rsync -a /storage/btrfs/costas/Photo_library/2014/ --stats 23.79s user
15.97s system 28% cpu 2:20.61 total
btrfs fi sync /storage/btrfs_dedup/ 0.00s user 0.01s system 0% cpu
1:01.82 total
/dev/loop1 46080 5240 40170 12% /storage/btrfs_dedup
Data, single: total=5.01GiB, used=3.74GiB
Metadata, DUP: total=1.00GiB, used=702.59MiB
umount /storage/btrfs_dedup 0.00s user 0.60s system 59% cpu 1.007 total
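The per-copy metadata growth can be extracted mechanically from the
`btrfs fi df` lines in the log above; a small sketch over the `used=` values:

```python
import re

# "Metadata, DUP" lines copied from the ten iterations logged above.
log = """\
Metadata, DUP: total=1.00GiB, used=143.45MiB
Metadata, DUP: total=1.00GiB, used=208.72MiB
Metadata, DUP: total=1.00GiB, used=269.39MiB
Metadata, DUP: total=1.00GiB, used=331.98MiB
Metadata, DUP: total=1.00GiB, used=391.91MiB
Metadata, DUP: total=1.00GiB, used=456.48MiB
Metadata, DUP: total=1.00GiB, used=518.09MiB
Metadata, DUP: total=1.00GiB, used=578.16MiB
Metadata, DUP: total=1.00GiB, used=639.95MiB
Metadata, DUP: total=1.00GiB, used=702.59MiB
"""

used = [float(m) for m in re.findall(r"used=([\d.]+)MiB", log)]
deltas = [b - a for a, b in zip(used, used[1:])]
# Every copy after the first adds roughly 60-65 MiB of metadata
# (double that in raw space, since metadata is DUP).
print([round(d, 1) for d in deltas])
```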
Thread overview: 25+ messages
2014-04-10 3:48 [RFC PATCH v10 00/16] Online(inband) data deduplication Liu Bo
2014-04-10 3:48 ` [PATCH v10 01/16] Btrfs: disable qgroups accounting when quota_enable is 0 Liu Bo
2014-04-10 3:48 ` [PATCH v10 02/16] Btrfs: introduce dedup tree and relatives Liu Bo
2014-04-10 3:48 ` [PATCH v10 03/16] Btrfs: introduce dedup tree operations Liu Bo
2014-04-10 3:48 ` [PATCH v10 04/16] Btrfs: introduce dedup state Liu Bo
2014-04-10 3:48 ` [PATCH v10 05/16] Btrfs: make ordered extent aware of dedup Liu Bo
2014-04-10 3:48 ` [PATCH v10 06/16] Btrfs: online(inband) data dedup Liu Bo
2014-04-10 3:48 ` [PATCH v10 07/16] Btrfs: skip dedup reference during backref walking Liu Bo
2014-04-10 3:48 ` [PATCH v10 08/16] Btrfs: don't return space for dedup extent Liu Bo
2014-04-10 3:48 ` [PATCH v10 09/16] Btrfs: add ioctl of dedup control Liu Bo
2014-04-10 3:48 ` [PATCH v10 10/16] Btrfs: improve the delayed refs process in rm case Liu Bo
2014-04-10 3:48 ` [PATCH v10 11/16] Btrfs: fix a crash of dedup ref Liu Bo
2014-04-10 3:48 ` [PATCH v10 12/16] Btrfs: fix deadlock of dedup work Liu Bo
2014-04-10 3:48 ` [PATCH v10 13/16] Btrfs: fix transactin abortion in __btrfs_free_extent Liu Bo
2014-04-10 3:48 ` [PATCH v10 14/16] Btrfs: fix wrong pinned bytes " Liu Bo
2014-04-10 3:48 ` [PATCH v10 15/16] Btrfs: use total_bytes instead of bytes_used for global_rsv Liu Bo
2014-04-10 3:48 ` [PATCH v10 16/16] Btrfs: fix dedup enospc problem Liu Bo
2014-04-10 3:48 ` [PATCH v5] Btrfs-progs: add dedup subcommand Liu Bo
2014-04-10 9:08 ` [RFC PATCH v10 00/16] Online(inband) data deduplication Konstantinos Skarlatos
2014-04-10 15:44 ` Liu Bo
2014-04-10 15:55 ` Liu Bo
2014-04-11 9:28 ` Martin Steigerwald
2014-04-11 9:51 ` Liu Bo
2014-04-14 8:41 ` Test results for " Konstantinos Skarlatos
2014-04-11 18:00 Michael