* [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
@ 2015-07-28  8:30 Qu Wenruo
  2015-07-28  8:30 ` [PATCH RFC 01/14] btrfs: file-item: Introduce btrfs_setup_file_extent function Qu Wenruo
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-07-28  8:30 UTC (permalink / raw)
  To: linux-btrfs
Although Liu Bo has already submitted a V10 version of his deduplication
implementation, here is another implementation.
[[CORE FEATURES]]
The main design concepts are the following:
1) Controllable memory usage
2) No guarantee to dedup every duplicate
3) No on-disk format change or new format
4) Page-size-level deduplication
[[IMPLEMENTATION]]
Implementation details include the following:
1) LRU hash map to limit the memory usage
   The hash -> extent mapping is controlled by an LRU list (or left
   unlimited), giving controllable memory usage (tunable by mount option)
   along with controllable read/write overhead for hash searching.
   (A toy sketch of the LRU idea follows this list.)
2) Reuse of the existing ordered_extent infrastructure
   For a duplicated page, an ordered_extent (only one page long) is still
   submitted, to make full use of all the existing infrastructure; only
   the bio is not submitted.
   This keeps the number of new code lines small.
3) Mount option to control dedup behavior
   Deduplication and its memory usage can be tuned by mount option;
   no dedicated ioctl interface is needed.
   Furthermore, it can easily support a BTRFS_INODE flag, like
   compression does, to allow per-file dedup fine-tuning.
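To illustrate point 1) outside btrfs, here is a toy userspace sketch of an
LRU-capped hash -> bytenr map. It is not the btrfs code: the actual patches
(03/14 and 04/14 below) use two rb-trees plus a kernel list_head protected
by a spinlock, and the names, the plain linked list and the tiny limit here
are purely illustrative. It only demonstrates the eviction policy: a bounded
number of entries, the least recently used entry dropped first, and lookups
moving hits back to the list head.

/*
 * Toy userspace sketch of an LRU-capped hash -> bytenr map.
 * Purely illustrative; the in-kernel version uses rb-trees and a
 * struct list_head protected by a spinlock.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define HASH_SIZE 32	/* SHA-256 digest length in bytes */
#define LIMIT_NR  4	/* tiny limit so eviction is easy to see */

struct dedup_entry {
	uint8_t hash[HASH_SIZE];
	uint64_t bytenr;
	struct dedup_entry *prev, *next;	/* doubly linked LRU list */
};

static struct dedup_entry *lru_head, *lru_tail;
static unsigned int current_nr;

static void lru_unlink(struct dedup_entry *e)
{
	if (e->prev)
		e->prev->next = e->next;
	else
		lru_head = e->next;
	if (e->next)
		e->next->prev = e->prev;
	else
		lru_tail = e->prev;
}

static void lru_push_front(struct dedup_entry *e)
{
	e->prev = NULL;
	e->next = lru_head;
	if (lru_head)
		lru_head->prev = e;
	else
		lru_tail = e;
	lru_head = e;
}

/* Insert a hash; drop the least recently used entry beyond the limit. */
static void dedup_add(const uint8_t *hash, uint64_t bytenr)
{
	struct dedup_entry *e = calloc(1, sizeof(*e));

	if (!e)
		return;
	memcpy(e->hash, hash, HASH_SIZE);
	e->bytenr = bytenr;
	lru_push_front(e);
	if (++current_nr > LIMIT_NR) {
		struct dedup_entry *victim = lru_tail;

		lru_unlink(victim);
		free(victim);
		current_nr--;
	}
}

/* Lookup by hash; a hit is moved back to the list head. */
static struct dedup_entry *dedup_search(const uint8_t *hash)
{
	struct dedup_entry *e;

	for (e = lru_head; e; e = e->next) {
		if (!memcmp(e->hash, hash, HASH_SIZE)) {
			lru_unlink(e);
			lru_push_front(e);
			return e;
		}
	}
	return NULL;
}

int main(void)
{
	uint8_t h[HASH_SIZE] = {0};
	uint64_t i;

	for (i = 0; i < 8; i++) {
		h[0] = (uint8_t)i;	/* fake digest, only first byte differs */
		dedup_add(h, i * 4096);
	}
	h[0] = 7;
	printf("hash 7: %s\n", dedup_search(h) ? "found" : "evicted");
	h[0] = 0;
	printf("hash 0: %s\n", dedup_search(h) ? "found" : "evicted");
	return 0;
}

Compiled and run, it inserts 8 fake hashes with a limit of 4; the 4 oldest
entries are evicted, the recent ones remain, and a hit moves to the front.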
[[TODO]]
1. Add support for compressed extents
   Shouldn't be too hard.
2. Try to merge dedup extents to reduce metadata size
   Currently, a dedup extent is always 4K in size, although its reference
   source can be quite large.
3. Add support for per-file dedup flags
   Much easier, just like the compression flags.
[[KNOWN BUG, NEED HELP!]]
On the other hand, since this is still an RFC patchset, it surely has one
or more problems:
1) Race between __btrfs_free_extent() and dedup ordered_extent.
   The hook in __btrfs_free_extent() will free the corresponding hashes
   of an extent even when there is a dedup ordered_extent referring to it.
   The problem happens in a case like the following:
======================================================================
   cow_file_range()
     Submit a dedup ordered_extent for extent A.
   commit_transaction()
     Extent A needs freeing, as its ref count has dropped to 0.
     The dedup ordered_extent can only increase the ref count at endio time.
   finish_ordered_io()
     Add a reference to extent A for the dedup ordered_extent.
     But extent A was already freed in the previous transaction,
     causing abort_transaction().
======================================================================
   I'd like to keep the current ordered_extent method, as it adds the
   fewest code lines.
   But I can't find a good way to either delay the transaction until the
   dedup ordered_extent is done or something like that.
   Trans->ordered seems like a good idea, but it appears to cause list
   corruption without extra protection in the tree-log infrastructure.
That's the only problem spotted so far.
Any early review or advice/question on the design is welcome.
Thanks.
Qu Wenruo (14):
  btrfs: file-item: Introduce btrfs_setup_file_extent function.
  btrfs: Use btrfs_fill_file_extent to reduce duplicated codes
  btrfs: dedup: Add basic init/free functions for inband dedup.
  btrfs: dedup: Add internal add/remove/search function for btrfs dedup.
  btrfs: dedup: add ordered extent hook for inband dedup
  btrfs: dedup: Apply dedup hook for write time dedup.
  btrfs: extent_map: Add new dedup flag and corresponding hook.
  btrfs: extent-map: Introduce orig_block_start member for extent-map.
  btrfs: dedup: Add inband dedup hook for read extent.
  btrfs: dedup: Introduce btrfs_dedup_free_extent_range function.
  btrfs: dedup: Add hook to free dedup hash at extent free time.
  btrfs: dedup: Add mount option support for btrfs inband deduplication.
  Btrfs: dedup: Support dedup change at remount time.
  btrfs: dedup: Add mount option output for inband dedup.
 fs/btrfs/Makefile       |   2 +-
 fs/btrfs/ctree.h        |  16 ++
 fs/btrfs/dedup.c        | 701 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedup.h        | 132 +++++++++
 fs/btrfs/disk-io.c      |   7 +
 fs/btrfs/extent-tree.c  |  10 +
 fs/btrfs/extent_io.c    |   6 +-
 fs/btrfs/extent_map.h   |   4 +
 fs/btrfs/file-item.c    |  61 +++--
 fs/btrfs/inode.c        | 228 ++++++++++++----
 fs/btrfs/ordered-data.c |  32 ++-
 fs/btrfs/ordered-data.h |   8 +
 fs/btrfs/super.c        |  39 ++-
 13 files changed, 1163 insertions(+), 83 deletions(-)
 create mode 100644 fs/btrfs/dedup.c
 create mode 100644 fs/btrfs/dedup.h
-- 
2.4.6
* [PATCH RFC 01/14] btrfs: file-item: Introduce btrfs_setup_file_extent function.
  2015-07-28  8:30 [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement Qu Wenruo
@ 2015-07-28  8:30 ` Qu Wenruo
  2015-07-28  8:30 ` [PATCH RFC 02/14] btrfs: Use btrfs_fill_file_extent to reduce duplicated codes Qu Wenruo
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-07-28  8:30 UTC (permalink / raw)
  To: linux-btrfs
This new function just fills a file extent item with the given numbers.
It is mainly used for the later cleanup of duplicated file extent setup
code in inode.c, but it also centralizes the safety checks for later
expansion.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/ctree.h     |  5 +++++
 fs/btrfs/file-item.c | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aac314e..68ffd26 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3825,6 +3825,11 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 			  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
 			      struct bio *bio, u64 logical_offset);
+void btrfs_fill_file_extent(struct btrfs_trans_handle *trans,
+			    struct btrfs_path *path, u64 disk_bytenr,
+			    u64 disk_num_bytes, u64 offset, u64 num_bytes,
+			    u64 ram_bytes, u8 type, u8 compression,
+			    u8 encryption, u16 other_encoding);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 58ece65..e1d7c03 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -37,6 +37,46 @@
 				   sizeof(struct btrfs_ordered_sum)) / \
 				   sizeof(u32) * (r)->sectorsize)
 
+/*
+ * Fill the contents of a file extent item.
+ *
+ * The file extent item is indicated by @path.
+ * It's mainly used to reduce duplicated code, but also adds some extra
+ * safety checks.
+ */
+void btrfs_fill_file_extent(struct btrfs_trans_handle *trans,
+			    struct btrfs_path *path, u64 disk_bytenr,
+			    u64 disk_num_bytes, u64 offset, u64 num_bytes,
+			    u64 ram_bytes, u8 type, u8 compression,
+			    u8 encryption, u16 other_encoding)
+{
+	struct btrfs_key key;
+	struct extent_buffer *node = path->nodes[0];
+	struct btrfs_file_extent_item *item;
+	int slot = path->slots[0];
+
+	BUG_ON(encryption || other_encoding);
+	BUG_ON(type == BTRFS_FILE_EXTENT_INLINE);
+
+	btrfs_item_key_to_cpu(node, &key, slot);
+	WARN_ON(key.type != BTRFS_EXTENT_DATA_KEY);
+
+	item = btrfs_item_ptr(node, slot, struct btrfs_file_extent_item);
+
+	btrfs_set_file_extent_generation(node, item, trans->transid);
+	btrfs_set_file_extent_disk_bytenr(node, item, disk_bytenr);
+	btrfs_set_file_extent_disk_num_bytes(node, item, disk_num_bytes);
+	btrfs_set_file_extent_offset(node, item, offset);
+	btrfs_set_file_extent_num_bytes(node, item, num_bytes);
+	btrfs_set_file_extent_ram_bytes(node, item, ram_bytes);
+	btrfs_set_file_extent_type(node, item, type);
+	btrfs_set_file_extent_compression(node, item, compression);
+	btrfs_set_file_extent_other_encoding(node, item, 0);
+	btrfs_set_file_extent_encryption(node, item, 0);
+
+	btrfs_mark_buffer_dirty(node);
+}
+
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
-- 
2.4.6
* [PATCH RFC 02/14] btrfs: Use btrfs_fill_file_extent to reduce duplicated codes
  2015-07-28  8:30 [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement Qu Wenruo
  2015-07-28  8:30 ` [PATCH RFC 01/14] btrfs: file-item: Introduce btrfs_setup_file_extent function Qu Wenruo
@ 2015-07-28  8:30 ` Qu Wenruo
  2015-07-28  8:30 ` [PATCH RFC 03/14] btrfs: dedup: Add basic init/free functions for inband dedup Qu Wenruo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-07-28  8:30 UTC (permalink / raw)
  To: linux-btrfs
Use the btrfs_fill_file_extent() function to replace the hand-coded setup,
as it has better checks and takes less effort to maintain.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/file-item.c | 20 ++++----------------
 fs/btrfs/inode.c     | 38 ++++++--------------------------------
 2 files changed, 10 insertions(+), 48 deletions(-)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index e1d7c03..87a8d85 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -88,7 +88,6 @@ int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_file_extent_item *item;
 	struct btrfs_key file_key;
 	struct btrfs_path *path;
-	struct extent_buffer *leaf;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -103,21 +102,10 @@ int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 	if (ret < 0)
 		goto out;
 	BUG_ON(ret); /* Can't happen */
-	leaf = path->nodes[0];
-	item = btrfs_item_ptr(leaf, path->slots[0],
-			      struct btrfs_file_extent_item);
-	btrfs_set_file_extent_disk_bytenr(leaf, item, disk_offset);
-	btrfs_set_file_extent_disk_num_bytes(leaf, item, disk_num_bytes);
-	btrfs_set_file_extent_offset(leaf, item, offset);
-	btrfs_set_file_extent_num_bytes(leaf, item, num_bytes);
-	btrfs_set_file_extent_ram_bytes(leaf, item, ram_bytes);
-	btrfs_set_file_extent_generation(leaf, item, trans->transid);
-	btrfs_set_file_extent_type(leaf, item, BTRFS_FILE_EXTENT_REG);
-	btrfs_set_file_extent_compression(leaf, item, compression);
-	btrfs_set_file_extent_encryption(leaf, item, encryption);
-	btrfs_set_file_extent_other_encoding(leaf, item, other_encoding);
-
-	btrfs_mark_buffer_dirty(leaf);
+	btrfs_fill_file_extent(trans, path, disk_offset, disk_num_bytes,
+			       offset, num_bytes, ram_bytes,
+			       BTRFS_FILE_EXTENT_REG, compression, encryption,
+			       other_encoding);
 out:
 	btrfs_free_path(path);
 	return ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e33dff3..3221010 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2052,7 +2052,6 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_file_extent_item *fi;
 	struct btrfs_path *path;
-	struct extent_buffer *leaf;
 	struct btrfs_key ins;
 	int extent_inserted = 0;
 	int ret;
@@ -2087,21 +2086,9 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 		if (ret)
 			goto out;
 	}
-	leaf = path->nodes[0];
-	fi = btrfs_item_ptr(leaf, path->slots[0],
-			    struct btrfs_file_extent_item);
-	btrfs_set_file_extent_generation(leaf, fi, trans->transid);
-	btrfs_set_file_extent_type(leaf, fi, extent_type);
-	btrfs_set_file_extent_disk_bytenr(leaf, fi, disk_bytenr);
-	btrfs_set_file_extent_disk_num_bytes(leaf, fi, disk_num_bytes);
-	btrfs_set_file_extent_offset(leaf, fi, 0);
-	btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes);
-	btrfs_set_file_extent_ram_bytes(leaf, fi, ram_bytes);
-	btrfs_set_file_extent_compression(leaf, fi, compression);
-	btrfs_set_file_extent_encryption(leaf, fi, encryption);
-	btrfs_set_file_extent_other_encoding(leaf, fi, other_encoding);
-
-	btrfs_mark_buffer_dirty(leaf);
+	btrfs_fill_file_extent(trans, path, disk_bytenr, disk_num_bytes, 0,
+			       num_bytes, ram_bytes, extent_type, compression,
+			       encryption, other_encoding);
 	btrfs_release_path(path);
 
 	inode_add_bytes(inode, num_bytes);
@@ -2391,7 +2378,6 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
 				 struct sa_defrag_extent_backref *backref)
 {
 	struct btrfs_file_extent_item *extent;
-	struct btrfs_file_extent_item *item;
 	struct btrfs_ordered_extent *ordered;
 	struct btrfs_trans_handle *trans;
 	struct btrfs_fs_info *fs_info;
@@ -2548,21 +2534,9 @@ again:
 		goto out_free_path;
 	}
 
-	leaf = path->nodes[0];
-	item = btrfs_item_ptr(leaf, path->slots[0],
-				struct btrfs_file_extent_item);
-	btrfs_set_file_extent_disk_bytenr(leaf, item, new->bytenr);
-	btrfs_set_file_extent_disk_num_bytes(leaf, item, new->disk_len);
-	btrfs_set_file_extent_offset(leaf, item, start - new->file_pos);
-	btrfs_set_file_extent_num_bytes(leaf, item, len);
-	btrfs_set_file_extent_ram_bytes(leaf, item, new->len);
-	btrfs_set_file_extent_generation(leaf, item, trans->transid);
-	btrfs_set_file_extent_type(leaf, item, BTRFS_FILE_EXTENT_REG);
-	btrfs_set_file_extent_compression(leaf, item, new->compress_type);
-	btrfs_set_file_extent_encryption(leaf, item, 0);
-	btrfs_set_file_extent_other_encoding(leaf, item, 0);
-
-	btrfs_mark_buffer_dirty(leaf);
+	btrfs_fill_file_extent(trans, path, new->bytenr, new->disk_len,
+			       start - new->file_pos, len, new->len,
+			       BTRFS_FILE_EXTENT_REG, new->compress_type, 0, 0);
 	inode_add_bytes(inode, len);
 	btrfs_release_path(path);
 
-- 
2.4.6
* [PATCH RFC 03/14] btrfs: dedup: Add basic init/free functions for inband dedup.
  2015-07-28  8:30 [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement Qu Wenruo
  2015-07-28  8:30 ` [PATCH RFC 01/14] btrfs: file-item: Introduce btrfs_setup_file_extent function Qu Wenruo
  2015-07-28  8:30 ` [PATCH RFC 02/14] btrfs: Use btrfs_fill_file_extent to reduce duplicated codes Qu Wenruo
@ 2015-07-28  8:30 ` Qu Wenruo
  2015-07-28  8:30 ` [PATCH RFC 04/14] btrfs: dedup: Add internal add/remove/search function for btrfs dedup Qu Wenruo
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-07-28  8:30 UTC (permalink / raw)
  To: linux-btrfs
Add basic data structures and their init/free functions for the later
inband dedup implementation.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/Makefile  |  2 +-
 fs/btrfs/ctree.h   |  5 ++++
 fs/btrfs/dedup.c   | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedup.h   | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/disk-io.c |  7 +++++
 5 files changed, 185 insertions(+), 1 deletion(-)
 create mode 100644 fs/btrfs/dedup.c
 create mode 100644 fs/btrfs/dedup.h
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 6d1d0b9..a8bd917 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-	   uuid-tree.o props.o hash.o
+	   uuid-tree.o props.o hash.o dedup.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 68ffd26..7f37637 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -38,6 +38,7 @@
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
+#include "dedup.h"
 
 struct btrfs_trans_handle;
 struct btrfs_transaction;
@@ -1788,6 +1789,9 @@ struct btrfs_fs_info {
 	 * and will be latter freed. Protected by fs_info->chunk_mutex.
 	 */
 	struct list_head pinned_chunks;
+
+	/* Inband deduplication stuff */
+	struct btrfs_dedup_root *dedup_root;
 };
 
 struct btrfs_subvolume_writers {
@@ -3685,6 +3689,7 @@ static inline void free_fs_info(struct btrfs_fs_info *fs_info)
 	kfree(fs_info->super_copy);
 	kfree(fs_info->super_for_commit);
 	security_free_mnt_opts(&fs_info->security_opts);
+	btrfs_free_dedup(fs_info);
 	kfree(fs_info);
 }
 
diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
new file mode 100644
index 0000000..41f70f5
--- /dev/null
+++ b/fs/btrfs/dedup.c
@@ -0,0 +1,88 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+#include "dedup.h"
+
+/* SHA256, 256bits = 32 bytes  */
+static int btrfs_dedup_sizes[] = { 0, 32 };
+
+struct kmem_cache *btrfs_dedup_hash_cachep;
+
+int btrfs_init_dedup(struct btrfs_fs_info *fs_info, u8 dedup_type)
+{
+	struct btrfs_dedup_root *dedup_root;
+	int ret = 0;
+
+	if (dedup_type >= ARRAY_SIZE(btrfs_dedup_sizes)) {
+		pr_err("BTRFS: unsupported dedup type\n");
+		return -EINVAL;
+	}
+
+	if (!dedup_type) {
+		fs_info->dedup_root = NULL;
+		return 0;
+	}
+
+	btrfs_dedup_hash_cachep = kmem_cache_create("btrfs_dedup_hash",
+				sizeof(struct btrfs_dedup_hash), 0,
+				SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
+	if (!btrfs_dedup_hash_cachep) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+	fs_info->dedup_root = kzalloc(sizeof(struct btrfs_dedup_root),
+				      GFP_NOFS);
+	if (!fs_info->dedup_root) {
+		ret = -ENOMEM;
+		goto fail_hash_cachep;
+	}
+	dedup_root = fs_info->dedup_root;
+	dedup_root->dedup_type = dedup_type;
+
+	switch (dedup_type) {
+	case BTRFS_DEDUP_HASH_SHA256:
+		dedup_root->dedup_driver = crypto_alloc_shash("sha256", 0, 0);
+		if (IS_ERR(dedup_root->dedup_driver)) {
+			pr_err("BTRFS: Cannot load sha256 driver\n");
+			goto fail_dedup_root;
+		}
+	}
+
+	dedup_root->hash_root = RB_ROOT;
+	dedup_root->bytenr_root = RB_ROOT;
+	INIT_LIST_HEAD(&dedup_root->lru_list);
+	spin_lock_init(&dedup_root->lock);
+	return 0;
+
+fail_dedup_root:
+	kfree(dedup_root);
+fail_hash_cachep:
+	kmem_cache_destroy(btrfs_dedup_hash_cachep);
+fail:
+	return ret;
+}
+
+void btrfs_free_dedup(struct btrfs_fs_info *fs_info)
+{
+	if (!fs_info->dedup_root)
+		return;
+
+	kfree(fs_info->dedup_root);
+	return;
+}
diff --git a/fs/btrfs/dedup.h b/fs/btrfs/dedup.h
new file mode 100644
index 0000000..341ca33
--- /dev/null
+++ b/fs/btrfs/dedup.h
@@ -0,0 +1,84 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __DEDUP__
+#define __DEDUP__
+
+#include <linux/rbtree.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <crypto/hash.h>
+
+#define BTRFS_DEDUP_DISABLED	0
+#define BTRFS_DEDUP_HASH_SHA256	1
+
+/*
+ * As we need to use a kmem_cache to speed up the very frequent memory
+ * allocations, we can't use a variable-length array in btrfs_dedup_hash.
+ * So set BTRFS_DEDUP_HASH_SIZE to a fixed size.
+ *
+ * The 32 bytes is for SHA256 so far.
+ */
+#define BTRFS_DEDUP_HASH_SIZE	32
+
+extern struct kmem_cache *btrfs_dedup_hash_cachep;
+
+struct btrfs_dedup_hash {
+	/* Hash -> extent(bytenr + offset) search, for dedup search */
+	struct rb_node hash_node;
+
+	/* Extent(bytenr + offset) -> hash search, for free extent */
+	struct rb_node bytenr_node;
+
+	/* LRU list to maintain low memory usage */
+	struct list_head lru_list;
+
+	/* The bytenr of the data extent */
+	u64 bytenr;
+
+	/* The length of the data extent */
+	u64 length;
+
+	/*
+	 * The offset inside the extent
+	 * TODO: Use offset to support compression
+	 */
+	u32 offset;
+
+	/* Fixed-length hash buffer; it's OK not to use all of it */
+	char hash[BTRFS_DEDUP_HASH_SIZE];
+};
+
+struct btrfs_dedup_root {
+	struct rb_root hash_root;
+	struct rb_root bytenr_root;
+	struct list_head lru_list;
+
+	spinlock_t lock;
+
+	struct crypto_shash *dedup_driver;
+	u8 dedup_type;
+
+	/* To limit the amount of btrfs_dedup_hash */
+	u32 limit_nr;
+	u32 current_nr;
+};
+
+int btrfs_init_dedup(struct btrfs_fs_info *fs_info, u8 dedup_type);
+void btrfs_free_dedup(struct btrfs_fs_info *fs_info);
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a9aadb2..3997bd9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -49,6 +49,7 @@
 #include "raid56.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedup.h"
 
 #ifdef CONFIG_X86
 #include <asm/cpufeature.h>
@@ -2633,6 +2634,12 @@ int open_ctree(struct super_block *sb,
 
 	INIT_LIST_HEAD(&fs_info->pinned_chunks);
 
+	/* TODO: use real dedup type other than 0 */
+	ret = btrfs_init_dedup(fs_info, 0);
+	if (ret) {
+		err = -ret;
+		goto fail_alloc;
+	}
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
-- 
2.4.6
* [PATCH RFC 04/14] btrfs: dedup: Add internal add/remove/search function for btrfs dedup.
  2015-07-28  8:30 [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement Qu Wenruo
                   ` (2 preceding siblings ...)
  2015-07-28  8:30 ` [PATCH RFC 03/14] btrfs: dedup: Add basic init/free functions for inband dedup Qu Wenruo
@ 2015-07-28  8:30 ` Qu Wenruo
  2015-07-28  8:56 ` [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement Qu Wenruo
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-07-28  8:30 UTC (permalink / raw)
  To: linux-btrfs
Add basic internal add/remove/search functions for btrfs dedup.
They are all for internal use only, as callers shouldn't call these
low-level functions directly.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedup.c | 169 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 167 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/dedup.c b/fs/btrfs/dedup.c
index 41f70f5..2bce303 100644
--- a/fs/btrfs/dedup.c
+++ b/fs/btrfs/dedup.c
@@ -78,11 +78,176 @@ fail:
 	return ret;
 }
 
+static void __dedup_del_hash(struct btrfs_dedup_root *root,
+			     struct btrfs_dedup_hash *hash)
+{
+	list_del(&hash->lru_list);
+	rb_erase(&hash->hash_node, &root->hash_root);
+	rb_erase(&hash->bytenr_node, &root->bytenr_root);
+
+	WARN_ON(root->current_nr == 0);
+	root->current_nr--;
+}
+
 void btrfs_free_dedup(struct btrfs_fs_info *fs_info)
 {
-	if (!fs_info->dedup_root)
+	struct btrfs_dedup_hash *entry, *tmp;
+	struct btrfs_dedup_root *dedup_root = fs_info->dedup_root;
+
+	if (!dedup_root)
 		return;
 
-	kfree(fs_info->dedup_root);
+	spin_lock(&dedup_root->lock);
+	list_for_each_entry_safe(entry, tmp, &dedup_root->lru_list, lru_list) {
+		__dedup_del_hash(dedup_root, entry);
+		kmem_cache_free(btrfs_dedup_hash_cachep, entry);
+	}
+	spin_unlock(&dedup_root->lock);
+	kfree(dedup_root);
 	return;
 }
+
+static int __hash_page(struct btrfs_dedup_root *root, struct page *page,
+		       char *out)
+{
+	struct crypto_shash *tfm = root->dedup_driver;
+	struct {
+		struct shash_desc desc;
+		char ctx[crypto_shash_descsize(tfm)];
+	} sdesc;
+	char *kaddr;
+	int ret = 0;
+
+	sdesc.desc.tfm = tfm;
+	sdesc.desc.flags = 0;
+
+	kaddr = kmap_atomic(page);
+	ret = crypto_shash_digest(&sdesc.desc, kaddr, PAGE_SIZE,
+				  out);
+	kunmap_atomic(kaddr);
+
+	return ret;
+}
+
+/*
+ * Return 1 if the extent already exists.
+ * Return 0 if it was inserted into the bytenr tree.
+ */
+static int insert_bytenr(struct rb_root *root, struct btrfs_dedup_hash *hash)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct btrfs_dedup_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct btrfs_dedup_hash,
+				 bytenr_node);
+		if (hash->bytenr + hash->offset < entry->bytenr + entry->offset)
+			p = &(*p)->rb_left;
+		else if (hash->bytenr + hash->offset > entry->bytenr +
+			 entry->offset)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->bytenr_node, parent, p);
+	rb_insert_color(&hash->bytenr_node, root);
+	return 0;
+}
+
+/*
+ * The caller must call insert_bytenr() first to ensure this dedup_hash
+ * is not already in the tree.
+ */
+static int insert_hash(struct rb_root *root, struct btrfs_dedup_hash *hash,
+			int hash_len)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct btrfs_dedup_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct btrfs_dedup_hash,
+				 hash_node);
+		if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+			p = &(*p)->rb_left;
+		else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+			p = &(*p)->rb_right;
+		/* Now hash matches, still allow it to be inserted */
+		else if (hash->bytenr + hash->offset < entry->bytenr +
+			 entry->offset)
+			p = &(*p)->rb_left;
+		else if (hash->bytenr + hash->offset > entry->bytenr +
+			 entry->offset)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->hash_node, parent, p);
+	rb_insert_color(&hash->hash_node, root);
+	return 0;
+}
+
+static int dedup_add_hash(struct btrfs_dedup_root *root,
+			  struct btrfs_dedup_hash *hash)
+{
+	int ret = 0;
+
+	WARN_ON(hash->bytenr == 0);
+
+	spin_lock(&root->lock);
+
+	ret = insert_bytenr(&root->bytenr_root, hash);
+	/* Already in tree */
+	if (ret > 0)
+		goto out;
+	insert_hash(&root->hash_root, hash,
+		    btrfs_dedup_sizes[root->dedup_type]);
+	list_add(&hash->lru_list, &root->lru_list);
+
+	root->current_nr++;
+
+	/* Remove the last dedup hash if we exceed limit */
+	while (root->limit_nr && root->current_nr > root->limit_nr) {
+		struct btrfs_dedup_hash *last;
+
+		last = list_entry(root->lru_list.prev, struct btrfs_dedup_hash,
+				  lru_list);
+		__dedup_del_hash(root, last);
+		kmem_cache_free(btrfs_dedup_hash_cachep, last);
+	}
+out:
+	spin_unlock(&root->lock);
+	return ret;
+}
+
+/*
+ * The caller must hold dedup_root->lock while using the returned hash,
+ * as the dedup_hash can be freed at any time.
+ */
+static struct btrfs_dedup_hash *
+dedup_search_by_hash(struct btrfs_dedup_root *root, u8 *hash)
+{
+	struct rb_node **p = &root->hash_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct btrfs_dedup_hash *entry = NULL;
+	int hash_len = btrfs_dedup_sizes[root->dedup_type];
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct btrfs_dedup_hash, hash_node);
+		if (memcmp(hash, entry->hash, hash_len) < 0)
+			p = &(*p)->rb_left;
+		else if (memcmp(hash, entry->hash, hash_len) > 0)
+			p = &(*p)->rb_right;
+		else {
+			/* Found, need to re-add it to LRU list head */
+			list_del(&entry->lru_list);
+			list_add(&entry->lru_list, &root->lru_list);
+			return entry;
+		}
+	}
+	return NULL;
+}
-- 
2.4.6
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-07-28  8:30 [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement Qu Wenruo
                   ` (3 preceding siblings ...)
  2015-07-28  8:30 ` [PATCH RFC 04/14] btrfs: dedup: Add internal add/remove/search function for btrfs dedup Qu Wenruo
@ 2015-07-28  8:56 ` Qu Wenruo
  2015-07-28  9:52 ` Liu Bo
  2015-07-28 14:50 ` David Sterba
  6 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-07-28  8:56 UTC (permalink / raw)
  To: linux-btrfs
Oh, there seems to be something wrong with the internal mail server.
The code and patches can also be fetched from GitHub, as only the first 4
patches were sent out successfully:
https://github.com/adam900710/linux/tree/dedup
Thanks,
Qu
Qu Wenruo wrote on 2015/07/28 16:30 +0800:
> Although Liu Bo has already submitted a V10 version of his deduplication
> implement, here is another implement for it.
>
> [[CORE FEATURES]]
> The main design concept is the following:
> 1) Controllable memory usage
> 2) No guarantee to dedup every duplication.
> 3) No on-disk format change or new format
> 4) Page size level deduplication
>
> [[IMPLEMENT]]
> Implement details includes the following:
> 1) LRU hash maps to limit the memory usage
>     The hash -> extent mapping is control by LRU (or unlimited), to
>     get a controllable memory usage (can be tuned by mount option)
>     alone with controllable read/write overhead used for hash searching.
>
> 2) Reuse existing ordered_extent infrastructure
>     For duplicated page, it will still submit a ordered_extent(only one
>     page long), to make the full use of all existing infrastructure.
>     But only not submit a bio.
>     This can reduce the number of code lines.
>
> 3) Mount option to control dedup behavior
>     Deduplication and its memory usage can be tuned by mount option.
>     No need to indicated ioctl interface.
>     And further more, it can easily support BTRFS_INODE flag like
>     compression, to allow further per file dedup fine tunning.
>
> [[TODO]]
> 1. Add support for compressed extent
>     Shouldn't be quite hard.
> 2. Try to merge dedup extent to reduce metadata size
>     Currently, dedup extent is always in 4K size, although its reference
>     source can be quite large.
> 3. Add support for per file dedup flags
>     Much easier, just like compression flags.
>
> [[KNOWN BUG, NEED HELP!]]
> On the other hand, since it's still a RFC patch, it must has one or more
> problem:
> 1) Race between __btrfs_free_extent() and dedup ordered_extent.
>     The hook in __btrfs_free_extent() will free the corresponding hashes
>     of a extent, even there is a dedup ordered_extent referring it.
>
>     The problem will happen like the following case:
> ======================================================================
>     cow_file_range()
>       Submit dedup ordered_extent for extent A
>
>     commit_transaction()
>       Extent A needs freeing. As the its ref is decreased to 0.
>       And dedup ordered_extent can increase only when it hit endio time.
>
>     finish_ordered_io()
>       Add reference to Extent A for dedup ordered_extent.
>       But it is already freed in previous transaction.
>       Causing abort_transaction().
> ======================================================================
>     I'd like to keep the current ordered_extent method, as it adds the
>     least number of code lines.
>     But I can't find a good idea to either delay transaction until dedup
>     ordered_extent is done or things like that.
>
>     Trans->ordered seems to be a good idea, but it seems to cause list
>     corruption without extra protection in tree log infrastructure.
>
> That's the only problem spotted yet.
> Any early review or advice/question on the design is welcomed.
>
> Thanks.
>
> Qu Wenruo (14):
>    btrfs: file-item: Introduce btrfs_setup_file_extent function.
>    btrfs: Use btrfs_fill_file_extent to reduce duplicated codes
>    btrfs: dedup: Add basic init/free functions for inband dedup.
>    btrfs: dedup: Add internal add/remove/search function for btrfs dedup.
>    btrfs: dedup: add ordered extent hook for inband dedup
>    btrfs: dedup: Apply dedup hook for write time dedup.
>    btrfs: extent_map: Add new dedup flag and corresponding hook.
>    btrfs: extent-map: Introduce orig_block_start member for extent-map.
>    btrfs: dedup: Add inband dedup hook for read extent.
>    btrfs: dedup: Introduce btrfs_dedup_free_extent_range function.
>    btrfs: dedup: Add hook to free dedup hash at extent free time.
>    btrfs: dedup: Add mount option support for btrfs inband deduplication.
>    Btrfs: dedup: Support dedup change at remount time.
>    btrfs: dedup: Add mount option output for inband dedup.
>
>   fs/btrfs/Makefile       |   2 +-
>   fs/btrfs/ctree.h        |  16 ++
>   fs/btrfs/dedup.c        | 701 ++++++++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/dedup.h        | 132 +++++++++
>   fs/btrfs/disk-io.c      |   7 +
>   fs/btrfs/extent-tree.c  |  10 +
>   fs/btrfs/extent_io.c    |   6 +-
>   fs/btrfs/extent_map.h   |   4 +
>   fs/btrfs/file-item.c    |  61 +++--
>   fs/btrfs/inode.c        | 228 ++++++++++++----
>   fs/btrfs/ordered-data.c |  32 ++-
>   fs/btrfs/ordered-data.h |   8 +
>   fs/btrfs/super.c        |  39 ++-
>   13 files changed, 1163 insertions(+), 83 deletions(-)
>   create mode 100644 fs/btrfs/dedup.c
>   create mode 100644 fs/btrfs/dedup.h
>
* [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
@ 2015-07-28  9:14 Qu Wenruo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-07-28  9:14 UTC (permalink / raw)
  To: linux-btrfs
Although Liu Bo has already submitted a V10 version of his deduplication
implementation, here is another implementation.
[[CORE FEATURES]]
The main design concepts are the following:
1) Controllable memory usage
2) No guarantee to dedup every duplicate
3) No on-disk format change or new format
4) Page-size-level deduplication
[[IMPLEMENTATION]]
Implementation details include the following:
1) LRU hash map to limit the memory usage
   The hash -> extent mapping is controlled by an LRU list (or left
   unlimited), giving controllable memory usage (tunable by mount option)
   along with controllable read/write overhead for hash searching.
2) Reuse of the existing ordered_extent infrastructure
   For a duplicated page, an ordered_extent (only one page long) is still
   submitted, to make full use of all the existing infrastructure; only
   the bio is not submitted.
   This keeps the number of new code lines small.
3) Mount option to control dedup behavior
   Deduplication and its memory usage can be tuned by mount option;
   no dedicated ioctl interface is needed.
   Furthermore, it can easily support a BTRFS_INODE flag, like
   compression does, to allow per-file dedup fine-tuning.
[[TODO]]
1. Add support for compressed extents
   Shouldn't be too hard.
2. Try to merge dedup extents to reduce metadata size
   Currently, a dedup extent is always 4K in size, although its reference
   source can be quite large.
3. Add support for per-file dedup flags
   Much easier, just like the compression flags.
[[KNOWN BUG, NEED HELP!]]
On the other hand, since this is still an RFC patchset, it surely has one
or more problems:
1) Race between __btrfs_free_extent() and dedup ordered_extent.
   The hook in __btrfs_free_extent() will free the corresponding hashes
   of an extent even when there is a dedup ordered_extent referring to it.
   The problem happens in a case like the following:
======================================================================
   cow_file_range()
     Submit a dedup ordered_extent for extent A.
   commit_transaction()
     Extent A needs freeing, as its ref count has dropped to 0.
     The dedup ordered_extent can only increase the ref count at endio time.
   finish_ordered_io()
     Add a reference to extent A for the dedup ordered_extent.
     But extent A was already freed in the previous transaction,
     causing abort_transaction().
======================================================================
   I'd like to keep the current ordered_extent method, as it adds the
   fewest code lines.
   But I can't find a good way to either delay the transaction until the
   dedup ordered_extent is done or something like that.
   Trans->ordered seems like a good idea, but it appears to cause list
   corruption without extra protection in the tree-log infrastructure.
That's the only problem spotted so far.
Any early review or advice/question on the design is welcome.
Thanks.
Qu Wenruo (14):
  btrfs: file-item: Introduce btrfs_setup_file_extent function.
  btrfs: Use btrfs_fill_file_extent to reduce duplicated codes
  btrfs: dedup: Add basic init/free functions for inband dedup.
  btrfs: dedup: Add internal add/remove/search function for btrfs dedup.
  btrfs: dedup: add ordered extent hook for inband dedup
  btrfs: dedup: Apply dedup hook for write time dedup.
  btrfs: extent_map: Add new dedup flag and corresponding hook.
  btrfs: extent-map: Introduce orig_block_start member for extent-map.
  btrfs: dedup: Add inband dedup hook for read extent.
  btrfs: dedup: Introduce btrfs_dedup_free_extent_range function.
  btrfs: dedup: Add hook to free dedup hash at extent free time.
  btrfs: dedup: Add mount option support for btrfs inband deduplication.
  Btrfs: dedup: Support dedup change at remount time.
  btrfs: dedup: Add mount option output for inband dedup.
 fs/btrfs/Makefile       |   2 +-
 fs/btrfs/ctree.h        |  16 ++
 fs/btrfs/dedup.c        | 701 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedup.h        | 132 +++++++++
 fs/btrfs/disk-io.c      |   7 +
 fs/btrfs/extent-tree.c  |  10 +
 fs/btrfs/extent_io.c    |   6 +-
 fs/btrfs/extent_map.h   |   4 +
 fs/btrfs/file-item.c    |  61 +++--
 fs/btrfs/inode.c        | 228 ++++++++++++----
 fs/btrfs/ordered-data.c |  32 ++-
 fs/btrfs/ordered-data.h |   8 +
 fs/btrfs/super.c        |  39 ++-
 13 files changed, 1163 insertions(+), 83 deletions(-)
 create mode 100644 fs/btrfs/dedup.c
 create mode 100644 fs/btrfs/dedup.h
-- 
2.4.6
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-07-28  8:30 [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement Qu Wenruo
                   ` (4 preceding siblings ...)
  2015-07-28  8:56 ` [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement Qu Wenruo
@ 2015-07-28  9:52 ` Liu Bo
  2015-07-29  2:09   ` Qu Wenruo
  2015-07-28 14:50 ` David Sterba
  6 siblings, 1 reply; 19+ messages in thread
From: Liu Bo @ 2015-07-28  9:52 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Jul 28, 2015 at 04:30:36PM +0800, Qu Wenruo wrote:
> Although Liu Bo has already submitted a V10 version of his deduplication
> implement, here is another implement for it.
> 
> [[CORE FEATURES]]
> The main design concept is the following:
> 1) Controllable memory usage
> 2) No guarantee to dedup every duplication.
> 3) No on-disk format change or new format
> 4) Page size level deduplication
> 
> [[IMPLEMENT]]
> Implement details includes the following:
> 1) LRU hash maps to limit the memory usage
>    The hash -> extent mapping is control by LRU (or unlimited), to
>    get a controllable memory usage (can be tuned by mount option)
>    alone with controllable read/write overhead used for hash searching.
> 
> 2) Reuse existing ordered_extent infrastructure
>    For duplicated page, it will still submit a ordered_extent(only one
>    page long), to make the full use of all existing infrastructure.
>    But only not submit a bio.
>    This can reduce the number of code lines.
> 
> 3) Mount option to control dedup behavior
>    Deduplication and its memory usage can be tuned by mount option.
>    No need to indicated ioctl interface.
>    And further more, it can easily support BTRFS_INODE flag like
>    compression, to allow further per file dedup fine tunning.
> 
> [[TODO]]
> 1. Add support for compressed extent
>    Shouldn't be quite hard.
> 2. Try to merge dedup extent to reduce metadata size
>    Currently, dedup extent is always in 4K size, although its reference
>    source can be quite large.
> 3. Add support for per file dedup flags
>    Much easier, just like compression flags.
> 
> [[KNOWN BUG, NEED HELP!]]
> On the other hand, since it's still a RFC patch, it must has one or more
> problem:
You may have a look at my patchset; one of the patches is aimed at
addressing a similar problem.
Thanks,
-liubo
> 1) Race between __btrfs_free_extent() and dedup ordered_extent.
>    The hook in __btrfs_free_extent() will free the corresponding hashes
>    of a extent, even there is a dedup ordered_extent referring it.
> 
>    The problem will happen like the following case:
> ======================================================================
>    cow_file_range()
>      Submit dedup ordered_extent for extent A
> 
>    commit_transaction()
>      Extent A needs freeing. As the its ref is decreased to 0.
>      And dedup ordered_extent can increase only when it hit endio time.
> 
>    finish_ordered_io()
>      Add reference to Extent A for dedup ordered_extent.
>      But it is already freed in previous transaction.
>      Causing abort_transaction().
> ======================================================================
>    I'd like to keep the current ordered_extent method, as it adds the
>    least number of code lines.
>    But I can't find a good idea to either delay transaction until dedup
>    ordered_extent is done or things like that.
> 
>    Trans->ordered seems to be a good idea, but it seems to cause list
>    corruption without extra protection in tree log infrastructure.
> 
> That's the only problem spotted yet.
> Any early review or advice/question on the design is welcomed.
> 
> Thanks.
> 
> Qu Wenruo (14):
>   btrfs: file-item: Introduce btrfs_setup_file_extent function.
>   btrfs: Use btrfs_fill_file_extent to reduce duplicated codes
>   btrfs: dedup: Add basic init/free functions for inband dedup.
>   btrfs: dedup: Add internal add/remove/search function for btrfs dedup.
>   btrfs: dedup: add ordered extent hook for inband dedup
>   btrfs: dedup: Apply dedup hook for write time dedup.
>   btrfs: extent_map: Add new dedup flag and corresponding hook.
>   btrfs: extent-map: Introduce orig_block_start member for extent-map.
>   btrfs: dedup: Add inband dedup hook for read extent.
>   btrfs: dedup: Introduce btrfs_dedup_free_extent_range function.
>   btrfs: dedup: Add hook to free dedup hash at extent free time.
>   btrfs: dedup: Add mount option support for btrfs inband deduplication.
>   Btrfs: dedup: Support dedup change at remount time.
>   btrfs: dedup: Add mount option output for inband dedup.
> 
>  fs/btrfs/Makefile       |   2 +-
>  fs/btrfs/ctree.h        |  16 ++
>  fs/btrfs/dedup.c        | 701 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/dedup.h        | 132 +++++++++
>  fs/btrfs/disk-io.c      |   7 +
>  fs/btrfs/extent-tree.c  |  10 +
>  fs/btrfs/extent_io.c    |   6 +-
>  fs/btrfs/extent_map.h   |   4 +
>  fs/btrfs/file-item.c    |  61 +++--
>  fs/btrfs/inode.c        | 228 ++++++++++++----
>  fs/btrfs/ordered-data.c |  32 ++-
>  fs/btrfs/ordered-data.h |   8 +
>  fs/btrfs/super.c        |  39 ++-
>  13 files changed, 1163 insertions(+), 83 deletions(-)
>  create mode 100644 fs/btrfs/dedup.c
>  create mode 100644 fs/btrfs/dedup.h
> 
> -- 
> 2.4.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-07-28  8:30 [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement Qu Wenruo
                   ` (5 preceding siblings ...)
  2015-07-28  9:52 ` Liu Bo
@ 2015-07-28 14:50 ` David Sterba
  2015-07-29  1:07   ` Chris Mason
                     ` (2 more replies)
  6 siblings, 3 replies; 19+ messages in thread
From: David Sterba @ 2015-07-28 14:50 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Jul 28, 2015 at 04:30:36PM +0800, Qu Wenruo wrote:
> Although Liu Bo has already submitted a V10 version of his deduplication
> implement, here is another implement for it.
What's the reason to start another implementation?
> [[CORE FEATURES]]
> The main design concept is the following:
> 1) Controllable memory usage
> 2) No guarantee to dedup every duplication.
> 3) No on-disk format change or new format
> 4) Page size level deduplication
1 and 2) are good goals, allow usability tradeoffs
3) so the dedup hash is stored only for the lifetime of the mount. Though it
avoids the on-disk format changes, it also reduces the effectiveness. It
is possible to "seed" the in-memory tree by reading all files that
contain potentially duplicate blocks, but one would have to do that after
each mount.
4) a page-sized dedup chunk is IMHO way too small. Although it can achieve
a high dedup rate, the metadata can potentially explode and cause more
fragmentation.
> Implement details includes the following:
> 1) LRU hash maps to limit the memory usage
>    The hash -> extent mapping is control by LRU (or unlimited), to
>    get a controllable memory usage (can be tuned by mount option)
>    alone with controllable read/write overhead used for hash searching.
In Liu Bo's series, I rejected the mount options as an interface and
will do that here as well. His patches added a dedup ioctl to (at least)
enable/disable the dedup.
> 2) Reuse existing ordered_extent infrastructure
>    For duplicated page, it will still submit a ordered_extent(only one
>    page long), to make the full use of all existing infrastructure.
>    But only not submit a bio.
>    This can reduce the number of code lines.
> 3) Mount option to control dedup behavior
>    Deduplication and its memory usage can be tuned by mount option.
>    No need to indicated ioctl interface.
I'd say the other way around.
>    And further more, it can easily support BTRFS_INODE flag like
>    compression, to allow further per file dedup fine tunning.
> 
> [[TODO]]
> 3. Add support for per file dedup flags
>    Much easier, just like compression flags.
How is that supposed to work? You mean add per-file flags/attributes to
mark a file so it fills the dedup hash tree and is actively going to be
deduped against other files?
> Any early review or advice/question on the design is welcomed.
The implementation looks simpler than Liu Bo's, but (IMHO) at the
cost of reduced functionality.
Ideally, we merge one patchset with all the desired functionality. Some kind
of control interface is needed not only to enable/disable the whole
feature but to affect the trade-offs (memory consumption vs dedup
efficiency vs speed), and that in a way that's flexible according to
immediate needs.
The persistent dedup hash storage is not mandatory in theory, so we
could implement an "in-memory tree only" mode, i.e. what you're
proposing, on top of Liu Bo's patchset.
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-07-28 14:50 ` David Sterba
@ 2015-07-29  1:07   ` Chris Mason
  2015-07-29  1:47   ` Qu Wenruo
  2015-08-03  7:18   ` Qu Wenruo
  2 siblings, 0 replies; 19+ messages in thread
From: Chris Mason @ 2015-07-29  1:07 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs
On Tue, Jul 28, 2015 at 04:50:21PM +0200, David Sterba wrote:
> On Tue, Jul 28, 2015 at 04:30:36PM +0800, Qu Wenruo wrote:
> > Although Liu Bo has already submitted a V10 version of his deduplication
> > implement, here is another implement for it.
> 
> What's the reason to start another implementation?
I'm really glad to see more experiments around dedup; it's one of those
features we haven't fully explored.
[ ... ]
> 
> > Any early review or advice/question on the design is welcomed.
> 
> The implementation is looks simpler than the Liu Bo's, but (IMHO) at the
> cost of reduced funcionality.
> 
> Ideally, we merge one patchset with all desired functionality. Some kind
> of control interface is needed not only to enable/dsiable the whole
> feature but to affect the trade-offs (memory consumptin vs dedup
> efficiency vs speed), and that in a way that's flexible according to
> immediate needs.
> 
> The persistent dedup hash storage is not mandatory in theory, so we
> could implement an "in-memory tree only" mode, ie. what you're
> proposing, on top of Liu Bo's patchset.
Agree here, I'd love to see the two patch sets build on each other.
If dedup is really valuable, it's worth storing the hashes etc on disk.
But the work to confine the tradeoffs will make it much more usable over
the long term.
-chris
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-07-28 14:50 ` David Sterba
  2015-07-29  1:07   ` Chris Mason
@ 2015-07-29  1:47   ` Qu Wenruo
  2015-07-29  2:40     ` Liu Bo
  2015-08-03  7:18   ` Qu Wenruo
  2 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2015-07-29  1:47 UTC (permalink / raw)
  To: dsterba, linux-btrfs
First, thanks David for the review.
David Sterba wrote on 2015/07/28 16:50 +0200:
> On Tue, Jul 28, 2015 at 04:30:36PM +0800, Qu Wenruo wrote:
>> Although Liu Bo has already submitted a V10 version of his deduplication
>> implement, here is another implement for it.
>
> What's the reason to start another implementation?
Mainly for the memory usage advantage and the smaller amount of code
needed to implement it.
I also want to test my understanding of dedup:
dedup should be implemented as simply as possible, as the benefit is not
that huge, but the potential bugs may be.
>
>> [[CORE FEATURES]]
>> The main design concept is the following:
>> 1) Controllable memory usage
>> 2) No guarantee to dedup every duplication.
>> 3) No on-disk format change or new format
>> 4) Page size level deduplication
>
> 1 and 2) are good goals, allow usability tradeoffs
>
> 3) so the dedup hash is stored only for the mount life time. Though it
> avoids the on-disk format changes, it also reduces the effectivity. It
> is possible to "seed" the in-memory tree by reading all files that
> contain potentially duplicate blocks but one would have to do that after
> each mount.
For 3), that's almost the same thing as 2).
After the next mount, either read the needed file contents as you mentioned,
or just let writes happen for a while as if there were no dedup, to build up
the hash tree.
>
> 4) page-sized dedup chunk is IMHO way too small. Although it can achieve
> high dedup rate, the metadata can potentially explode and cause more
> fragmentation.
Yes, that's one of my concerns too.
But compared to Liu's implementation, at least non-dedup extents are less
affected: they can be up to 512K bytes rather than the dedup chunk size.
And merging adjacent dedup extents to increase the dedup extent size and
reduce metadata size/fragmentation is on my TODO list.
>
>> Implement details includes the following:
>> 1) LRU hash maps to limit the memory usage
>>     The hash -> extent mapping is control by LRU (or unlimited), to
>>     get a controllable memory usage (can be tuned by mount option)
>>     alone with controllable read/write overhead used for hash searching.
>
> In Liu Bo's series, I rejected the mount options as an interface and
> will do that here as well. His patches added a dedup ioctl to (at least)
> enable/disable the dedup.
With the ioctl method, what I am afraid of is that we may need to implement
a rescan function just like qgroup's, as we need to keep the hashes
up to date.
And IMHO, qgroup is not a good example for a new feature to follow,
given how many bugs we tried to fix and how many new bugs we introduced
during the fixes.
Even with the 4.2-rc1 qgroup fixes, I reintroduced an old bug, fixed by
Filipe recently.
And we still don't have a good idea how to fix the snapshot deletion bug.
(My patchset can only handle snapshots with up to 2 levels. With higher
levels, the qgroup numbers will still be wrong until the related
nodes/leaves are all COWed.)
So the fear of becoming the next qgroup also drives me to avoid a
persistent hash store and an ioctl interface.
>
>> 2) Reuse existing ordered_extent infrastructure
>>     For duplicated page, it will still submit a ordered_extent(only one
>>     page long), to make the full use of all existing infrastructure.
>>     But only not submit a bio.
>>     This can reduce the number of code lines.
>
>> 3) Mount option to control dedup behavior
>>     Deduplication and its memory usage can be tuned by mount option.
>>     No need to indicated ioctl interface.
>
> I'd say the other way around.
>
>>     And further more, it can easily support BTRFS_INODE flag like
>>     compression, to allow further per file dedup fine tunning.
>>
>> [[TODO]]
>> 3. Add support for per file dedup flags
>>     Much easier, just like compression flags.
>
> How is that supposed to work? You mean add per-file flags/attributes to
> mark a file so it fills the dedup hash tree and is actively going to be
> deduped agains other files?
Yes, much like that.
Just like the NODATACOW flag: for files with NODEDUP set, reads won't
add hashes into the hash tree and writes won't ever bother searching hashes.
>
>> Any early review or advice/question on the design is welcomed.
>
> The implementation is looks simpler than the Liu Bo's, but (IMHO) at the
> cost of reduced funcionality.
>
> Ideally, we merge one patchset with all desired functionality. Some kind
> of control interface is needed not only to enable/dsiable the whole
> feature but to affect the trade-offs (memory consumptin vs dedup
> efficiency vs speed), and that in a way that's flexible according to
> immediate needs.
>
> The persistent dedup hash storage is not mandatory in theory, so we
> could implement an "in-memory tree only" mode, ie. what you're
> proposing, on top of Liu Bo's patchset.
>
So the ideal implementation should have the following features?
1) Tunable dedup size
    For the trade-off; Liu Bo's patchset already provides it.
    But I still want it not to affect the non-dedup extent size too much,
    just like my current patchset does.
2) Different dedup backends
    Again for the trade-off: the persistent one from Liu Bo and the
    in-memory-only one?
And maybe others?
For me, all the ideas seem great, but I'm more concerned about whether
it's worth it just for the dedup function.
Maybe we need about 3~4K lines for the ideal dedup function plus new
incompat flags.
But benefit-wise, we may never do as well as a user-space dedup
implementation.
So why not focus on simplicity and speed in the kernel implementation?
Thanks,
Qu
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-07-28  9:52 ` Liu Bo
@ 2015-07-29  2:09   ` Qu Wenruo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-07-29  2:09 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs
Liu Bo wrote on 2015/07/28 17:52 +0800:
> On Tue, Jul 28, 2015 at 04:30:36PM +0800, Qu Wenruo wrote:
>> Although Liu Bo has already submitted a V10 version of his deduplication
>> implement, here is another implement for it.
>>
>> [[CORE FEATURES]]
>> The main design concept is the following:
>> 1) Controllable memory usage
>> 2) No guarantee to dedup every duplication.
>> 3) No on-disk format change or new format
>> 4) Page size level deduplication
>>
>> [[IMPLEMENT]]
>> Implement details includes the following:
>> 1) LRU hash maps to limit the memory usage
>>     The hash -> extent mapping is control by LRU (or unlimited), to
>>     get a controllable memory usage (can be tuned by mount option)
>>     alone with controllable read/write overhead used for hash searching.
>>
>> 2) Reuse existing ordered_extent infrastructure
>>     For duplicated page, it will still submit a ordered_extent(only one
>>     page long), to make the full use of all existing infrastructure.
>>     But only not submit a bio.
>>     This can reduce the number of code lines.
>>
>> 3) Mount option to control dedup behavior
>>     Deduplication and its memory usage can be tuned by mount option.
>>     No need to indicated ioctl interface.
>>     And further more, it can easily support BTRFS_INODE flag like
>>     compression, to allow further per file dedup fine tunning.
>>
>> [[TODO]]
>> 1. Add support for compressed extent
>>     Shouldn't be quite hard.
>> 2. Try to merge dedup extent to reduce metadata size
>>     Currently, dedup extent is always in 4K size, although its reference
>>     source can be quite large.
>> 3. Add support for per file dedup flags
>>     Much easier, just like compression flags.
>>
>> [[KNOWN BUG, NEED HELP!]]
>> On the other hand, since it's still a RFC patch, it must has one or more
>> problem:
>
> You may have a look at my patchset, one of them is aimed to address the
> similar problem.
>
> Thanks,
>
> -liubo
In fact, I took a look at your patchset.
But the following two points are not ideal, so I chose to implement my
own version:
1) Extent size.
    In your patchset an extent won't be larger than dedup_bs, which
    causes a lot of fragmentation even when the write is not duplicated.
    So in my implementation, a non-duplicated extent can be up to 512K
    rather than dedup_bs, and the 512K limit can easily be raised
    further.
    I chose 512K because the hashes of a 512K extent fit exactly into
    one page (512K contains 128 pages, and each page needs 32 bytes of
    hash); see the arithmetic sketch after this list.
2) Submit bio directly
    I'm not a fan of this, as it crosses layers.
    Normally a bio is submitted in extent_io.c, but here high-level
    metadata modification and low-level bio submission happen in one
    function.
    At least to me that's not a good idea.
    So in my implementation I just add some small hooks into
    cow_file_range(), and even for a duplicated page I reuse the
    ordered_extent infrastructure to handle the inode size update and
    the page/extent locking.
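The sizing arithmetic for the 512K choice, spelled out in a minimal
standalone example (the constant names are illustrative only, not the
patchset's identifiers):
======================================================================
#include <stdio.h>

/* Illustrative constants, not the patchset's identifiers */
#define PAGE_SIZE_BYTES   4096U           /* 4K page             */
#define MAX_EXTENT_BYTES  (512U * 1024U)  /* 512K extent limit   */
#define HASH_BYTES        32U             /* e.g. SHA-256 digest */

int main(void)
{
        unsigned int pages = MAX_EXTENT_BYTES / PAGE_SIZE_BYTES; /* 128  */
        unsigned int total = pages * HASH_BYTES;                 /* 4096 */

        /* all hashes of one 512K extent fit exactly into one 4K page */
        printf("%u pages, %u hash bytes, fits one page: %s\n",
               pages, total, total == PAGE_SIZE_BYTES ? "yes" : "no");
        return 0;
}
======================================================================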
To solve the problem I have another idea: just submit the duplicated
metadata modification directly and don't go through an ordered_extent.
But that's not perfect either.
We would need extra functions to handle the inode size update, and
btrfs_ordered_update_i_size() is useless since we don't have an
ordered_extent.
And if the duplicated page is an outstanding one, then to insert the
file extent we also need to fill the holes between [inode_size, start).
Most of the existing hole-punching functions won't help there, as the
outstanding range is still locked by us.
Thanks,
Qu
>
>> 1) Race between __btrfs_free_extent() and dedup ordered_extent.
>>     The hook in __btrfs_free_extent() will free the corresponding hashes
>>     of a extent, even there is a dedup ordered_extent referring it.
>>
>>     The problem will happen like the following case:
>> ======================================================================
>>     cow_file_range()
>>       Submit dedup ordered_extent for extent A
>>
>>     commit_transaction()
>>       Extent A needs freeing. As the its ref is decreased to 0.
>>       And dedup ordered_extent can increase only when it hit endio time.
>>
>>     finish_ordered_io()
>>       Add reference to Extent A for dedup ordered_extent.
>>       But it is already freed in previous transaction.
>>       Causing abort_transaction().
>> ======================================================================
>>     I'd like to keep the current ordered_extent method, as it adds the
>>     least number of code lines.
>>     But I can't find a good idea to either delay transaction until dedup
>>     ordered_extent is done or things like that.
>>
>>     Trans->ordered seems to be a good idea, but it seems to cause list
>>     corruption without extra protection in tree log infrastructure.
>>
>> That's the only problem spotted yet.
>> Any early review or advice/question on the design is welcomed.
>>
>> Thanks.
>>
>> Qu Wenruo (14):
>>    btrfs: file-item: Introduce btrfs_setup_file_extent function.
>>    btrfs: Use btrfs_fill_file_extent to reduce duplicated codes
>>    btrfs: dedup: Add basic init/free functions for inband dedup.
>>    btrfs: dedup: Add internal add/remove/search function for btrfs dedup.
>>    btrfs: dedup: add ordered extent hook for inband dedup
>>    btrfs: dedup: Apply dedup hook for write time dedup.
>>    btrfs: extent_map: Add new dedup flag and corresponding hook.
>>    btrfs: extent-map: Introduce orig_block_start member for extent-map.
>>    btrfs: dedup: Add inband dedup hook for read extent.
>>    btrfs: dedup: Introduce btrfs_dedup_free_extent_range function.
>>    btrfs: dedup: Add hook to free dedup hash at extent free time.
>>    btrfs: dedup: Add mount option support for btrfs inband deduplication.
>>    Btrfs: dedup: Support dedup change at remount time.
>>    btrfs: dedup: Add mount option output for inband dedup.
>>
>>   fs/btrfs/Makefile       |   2 +-
>>   fs/btrfs/ctree.h        |  16 ++
>>   fs/btrfs/dedup.c        | 701 ++++++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/dedup.h        | 132 +++++++++
>>   fs/btrfs/disk-io.c      |   7 +
>>   fs/btrfs/extent-tree.c  |  10 +
>>   fs/btrfs/extent_io.c    |   6 +-
>>   fs/btrfs/extent_map.h   |   4 +
>>   fs/btrfs/file-item.c    |  61 +++--
>>   fs/btrfs/inode.c        | 228 ++++++++++++----
>>   fs/btrfs/ordered-data.c |  32 ++-
>>   fs/btrfs/ordered-data.h |   8 +
>>   fs/btrfs/super.c        |  39 ++-
>>   13 files changed, 1163 insertions(+), 83 deletions(-)
>>   create mode 100644 fs/btrfs/dedup.c
>>   create mode 100644 fs/btrfs/dedup.h
>>
>> --
>> 2.4.6
>>
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-07-29  1:47   ` Qu Wenruo
@ 2015-07-29  2:40     ` Liu Bo
  0 siblings, 0 replies; 19+ messages in thread
From: Liu Bo @ 2015-07-29  2:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs
On Wed, Jul 29, 2015 at 09:47:12AM +0800, Qu Wenruo wrote:
> First, thanks David for the review.
> 
> David Sterba wrote on 2015/07/28 16:50 +0200:
> >On Tue, Jul 28, 2015 at 04:30:36PM +0800, Qu Wenruo wrote:
> >>Although Liu Bo has already submitted a V10 version of his deduplication
> >>implement, here is another implement for it.
> >
> >What's the reason to start another implementation?
> Mainly for the memory usage advantage and less codes to implement it.
> 
> Also want to test my understanding of dedup:
> Dedup should be implemented as simple as possible, as the benefit is not so
> huge but potential bug may be huge.
> 
> >
> >>[[CORE FEATURES]]
> >>The main design concept is the following:
> >>1) Controllable memory usage
> >>2) No guarantee to dedup every duplication.
> >>3) No on-disk format change or new format
> >>4) Page size level deduplication
> >
> >1 and 2) are good goals, allow usability tradeoffs
> >
> >3) so the dedup hash is stored only for the mount life time. Though it
> >avoids the on-disk format changes, it also reduces the effectivity. It
> >is possible to "seed" the in-memory tree by reading all files that
> >contain potentially duplicate blocks but one would have to do that after
> >each mount.
> For 3), that's almost the same thing with 2).
> For the next mount, either read needed file contents as you mentioned, or
> just let the write happen for a while just like no dedup, to build up the
> hash tree.
> 
> >
> >4) page-sized dedup chunk is IMHO way too small. Although it can achieve
> >high dedup rate, the metadata can potentially explode and cause more
> >fragmentation.
> Yes, that's one of my concern too.
> But compared to Liu's implement, at least the non-dedup extent is less
> affected, which is up to 512Kbytes other than dedup size length.
> 
> And that's in my TODO list to merge possible adjusted dedup extents to
> increase dedup extent size and reduce metadata size/fragmentation.
> >
> >>Implement details includes the following:
> >>1) LRU hash maps to limit the memory usage
> >>    The hash -> extent mapping is control by LRU (or unlimited), to
> >>    get a controllable memory usage (can be tuned by mount option)
> >>    alone with controllable read/write overhead used for hash searching.
> >
> >In Liu Bo's series, I rejected the mount options as an interface and
> >will do that here as well. His patches added a dedup ioctl to (at least)
> >enable/disable the dedup.
> For ioctl method, what I am afraid of is, we may need to implement a rescan
> function just like qgroup, as we need to keep hash up-to-date.
> 
> And IMHO, qgroup is not a good example for new feature to follow.
> As so many bugs we tried to fix and so many new bugs we introduced during
> the fix.
> Even with the 4.2-rc1 qgroup fix, I reintroduced an old bug, fixed by
> Fillipe recently.
> And we still don't have a good idea to fix the snapshot deletion bug.
> (My patchset can only handle snapshot with up to 2 levels. With higher
> level, the qgroup number will still be wrong until related node/leaves are
> all COWed)
> 
> So the fear of being next qgroup also drives me to avoid persistent hash and
> ioctl hash.
> 
> >
> >>2) Reuse existing ordered_extent infrastructure
> >>    For duplicated page, it will still submit a ordered_extent(only one
> >>    page long), to make the full use of all existing infrastructure.
> >>    But only not submit a bio.
> >>    This can reduce the number of code lines.
> >
> >>3) Mount option to control dedup behavior
> >>    Deduplication and its memory usage can be tuned by mount option.
> >>    No need to indicated ioctl interface.
> >
> >I'd say the other way around.
> >
> >>    And further more, it can easily support BTRFS_INODE flag like
> >>    compression, to allow further per file dedup fine tunning.
> >>
> >>[[TODO]]
> >>3. Add support for per file dedup flags
> >>    Much easier, just like compression flags.
> >
> >How is that supposed to work? You mean add per-file flags/attributes to
> >mark a file so it fills the dedup hash tree and is actively going to be
> >deduped agains other files?
> 
> Yes, much like that.
> Just like NODATACOW flag, for files set NODEDUP, any read from it won't add
> hash into hash tree and write won't ever bother searching hashes.
> 
> >
> >>Any early review or advice/question on the design is welcomed.
> >
> >The implementation is looks simpler than the Liu Bo's, but (IMHO) at the
> >cost of reduced funcionality.
> >
> >Ideally, we merge one patchset with all desired functionality. Some kind
> >of control interface is needed not only to enable/dsiable the whole
> >feature but to affect the trade-offs (memory consumptin vs dedup
> >efficiency vs speed), and that in a way that's flexible according to
> >immediate needs.
> >
> >The persistent dedup hash storage is not mandatory in theory, so we
> >could implement an "in-memory tree only" mode, ie. what you're
> >proposing, on top of Liu Bo's patchset.
> >
> So the ideal implement should be with the following features?
> 1) Tunable dedup size
>    For trade off and Liu Bo's patchset has provided it.
>    But I still want it not to affect non-dedup extent size too much like
>    the current patchset.
> 
> 2) Different dedup backend
>    Again for trade off, persist one from Liu Bo and the in-memory only
>    one?
> And maybe others?
> 
> For me, all the ideas seems great, but I'm more concerned about whether it's
> worthy just for dedup function.
> Maybe we need about 3~4K lines for the ideal dedup function and new incompat
> flags.
> 
> But for the benefit, we may never do as well as user-space dedup implement.
> So why not focus on the simplicity and speed in kernel implement?
One thing that makes user-space dedupe so great is that a user-space
application normally has something that sorts the data to be deduped,
in order to achieve the best dedupe ratio.
However, that's something we will never do in the kernel.
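Just to illustrate that advantage, a minimal user-space sketch of the
sort-then-match step (made-up data; a real tool would hash file blocks
first and then ask the kernel to share the matching extents):
======================================================================
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One record per data block: its hash and where it lives */
struct block_rec {
        unsigned char hash[32];   /* e.g. SHA-256 of the block     */
        long long     offset;     /* block offset within the file  */
};

static int cmp_hash(const void *a, const void *b)
{
        return memcmp(((const struct block_rec *)a)->hash,
                      ((const struct block_rec *)b)->hash, 32);
}

int main(void)
{
        /* Made-up records; a real tool fills these by hashing blocks */
        struct block_rec recs[4] = {
                { "aaaa", 0 },    { "cccc", 4096 },
                { "aaaa", 8192 }, { "bbbb", 12288 },
        };
        size_t i, n = 4;

        /* Sorting groups identical hashes next to each other, so
         * duplicate candidates are found in one linear scan. */
        qsort(recs, n, sizeof(recs[0]), cmp_hash);

        for (i = 1; i < n; i++)
                if (!memcmp(recs[i].hash, recs[i - 1].hash, 32))
                        printf("dedup candidate: offsets %lld and %lld\n",
                               recs[i - 1].offset, recs[i].offset);
        return 0;
}
======================================================================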
Thanks,
-liubo
> 
> Thanks,
> Qu
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-07-28 14:50 ` David Sterba
  2015-07-29  1:07   ` Chris Mason
  2015-07-29  1:47   ` Qu Wenruo
@ 2015-08-03  7:18   ` Qu Wenruo
  2015-08-27  0:52     ` Qu Wenruo
  2015-08-27  9:14     ` David Sterba
  2 siblings, 2 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-08-03  7:18 UTC (permalink / raw)
  To: dsterba, linux-btrfs
David Sterba wrote on 2015/07/28 16:50 +0200:
> On Tue, Jul 28, 2015 at 04:30:36PM +0800, Qu Wenruo wrote:
>> Although Liu Bo has already submitted a V10 version of his deduplication
>> implement, here is another implement for it.
>
> What's the reason to start another implementation?
>
>> [[CORE FEATURES]]
>> The main design concept is the following:
>> 1) Controllable memory usage
>> 2) No guarantee to dedup every duplication.
>> 3) No on-disk format change or new format
>> 4) Page size level deduplication
>
> 1 and 2) are good goals, allow usability tradeoffs
>
> 3) so the dedup hash is stored only for the mount life time. Though it
> avoids the on-disk format changes, it also reduces the effectivity. It
> is possible to "seed" the in-memory tree by reading all files that
> contain potentially duplicate blocks but one would have to do that after
> each mount.
>
> 4) page-sized dedup chunk is IMHO way too small. Although it can achieve
> high dedup rate, the metadata can potentially explode and cause more
> fragmentation.
>
>> Implement details includes the following:
>> 1) LRU hash maps to limit the memory usage
>>     The hash -> extent mapping is control by LRU (or unlimited), to
>>     get a controllable memory usage (can be tuned by mount option)
>>     alone with controllable read/write overhead used for hash searching.
>
> In Liu Bo's series, I rejected the mount options as an interface and
> will do that here as well. His patches added a dedup ioctl to (at least)
> enable/disable the dedup.
BTW, would you please give some reasons why it's not a good idea to use
a mount option to trigger/change dedup options?
Thanks,
Qu
>
>> 2) Reuse existing ordered_extent infrastructure
>>     For duplicated page, it will still submit a ordered_extent(only one
>>     page long), to make the full use of all existing infrastructure.
>>     But only not submit a bio.
>>     This can reduce the number of code lines.
>
>> 3) Mount option to control dedup behavior
>>     Deduplication and its memory usage can be tuned by mount option.
>>     No need to indicated ioctl interface.
>
> I'd say the other way around.
>
>>     And further more, it can easily support BTRFS_INODE flag like
>>     compression, to allow further per file dedup fine tunning.
>>
>> [[TODO]]
>> 3. Add support for per file dedup flags
>>     Much easier, just like compression flags.
>
> How is that supposed to work? You mean add per-file flags/attributes to
> mark a file so it fills the dedup hash tree and is actively going to be
> deduped agains other files?
>
>> Any early review or advice/question on the design is welcomed.
>
> The implementation is looks simpler than the Liu Bo's, but (IMHO) at the
> cost of reduced funcionality.
>
> Ideally, we merge one patchset with all desired functionality. Some kind
> of control interface is needed not only to enable/dsiable the whole
> feature but to affect the trade-offs (memory consumptin vs dedup
> efficiency vs speed), and that in a way that's flexible according to
> immediate needs.
>
> The persistent dedup hash storage is not mandatory in theory, so we
> could implement an "in-memory tree only" mode, ie. what you're
> proposing, on top of Liu Bo's patchset.
>
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-08-03  7:18   ` Qu Wenruo
@ 2015-08-27  0:52     ` Qu Wenruo
  2015-08-27  9:14     ` David Sterba
  1 sibling, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-08-27  0:52 UTC (permalink / raw)
  To: dsterba, linux-btrfs
Qu Wenruo wrote on 2015/08/03 15:18 +0800:
>
>
> David Sterba wrote on 2015/07/28 16:50 +0200:
>> On Tue, Jul 28, 2015 at 04:30:36PM +0800, Qu Wenruo wrote:
>>> Although Liu Bo has already submitted a V10 version of his deduplication
>>> implement, here is another implement for it.
>>
>> What's the reason to start another implementation?
>>
>>> [[CORE FEATURES]]
>>> The main design concept is the following:
>>> 1) Controllable memory usage
>>> 2) No guarantee to dedup every duplication.
>>> 3) No on-disk format change or new format
>>> 4) Page size level deduplication
>>
>> 1 and 2) are good goals, allow usability tradeoffs
>>
>> 3) so the dedup hash is stored only for the mount life time. Though it
>> avoids the on-disk format changes, it also reduces the effectivity. It
>> is possible to "seed" the in-memory tree by reading all files that
>> contain potentially duplicate blocks but one would have to do that after
>> each mount.
>>
>> 4) page-sized dedup chunk is IMHO way too small. Although it can achieve
>> high dedup rate, the metadata can potentially explode and cause more
>> fragmentation.
>>
>>> Implement details includes the following:
>>> 1) LRU hash maps to limit the memory usage
>>>     The hash -> extent mapping is control by LRU (or unlimited), to
>>>     get a controllable memory usage (can be tuned by mount option)
>>>     alone with controllable read/write overhead used for hash searching.
>>
>> In Liu Bo's series, I rejected the mount options as an interface and
>> will do that here as well. His patches added a dedup ioctl to (at least)
>> enable/disable the dedup.
> BTW, would you please give me some reason why that's not a good idea to
> use mount option to trigger/change dedup options?
>
> Thanks,
> Qu
Ping?
No other comments?
Thanks,
Qu
>>
>>> 2) Reuse existing ordered_extent infrastructure
>>>     For duplicated page, it will still submit a ordered_extent(only one
>>>     page long), to make the full use of all existing infrastructure.
>>>     But only not submit a bio.
>>>     This can reduce the number of code lines.
>>
>>> 3) Mount option to control dedup behavior
>>>     Deduplication and its memory usage can be tuned by mount option.
>>>     No need to indicated ioctl interface.
>>
>> I'd say the other way around.
>>
>>>     And further more, it can easily support BTRFS_INODE flag like
>>>     compression, to allow further per file dedup fine tunning.
>>>
>>> [[TODO]]
>>> 3. Add support for per file dedup flags
>>>     Much easier, just like compression flags.
>>
>> How is that supposed to work? You mean add per-file flags/attributes to
>> mark a file so it fills the dedup hash tree and is actively going to be
>> deduped agains other files?
>>
>>> Any early review or advice/question on the design is welcomed.
>>
>> The implementation is looks simpler than the Liu Bo's, but (IMHO) at the
>> cost of reduced funcionality.
>>
>> Ideally, we merge one patchset with all desired functionality. Some kind
>> of control interface is needed not only to enable/dsiable the whole
>> feature but to affect the trade-offs (memory consumptin vs dedup
>> efficiency vs speed), and that in a way that's flexible according to
>> immediate needs.
>>
>> The persistent dedup hash storage is not mandatory in theory, so we
>> could implement an "in-memory tree only" mode, ie. what you're
>> proposing, on top of Liu Bo's patchset.
>>
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-08-03  7:18   ` Qu Wenruo
  2015-08-27  0:52     ` Qu Wenruo
@ 2015-08-27  9:14     ` David Sterba
  2015-08-31  1:13       ` Qu Wenruo
  1 sibling, 1 reply; 19+ messages in thread
From: David Sterba @ 2015-08-27  9:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs
On Mon, Aug 03, 2015 at 03:18:55PM +0800, Qu Wenruo wrote:
> >> Implement details includes the following:
> >> 1) LRU hash maps to limit the memory usage
> >>     The hash -> extent mapping is control by LRU (or unlimited), to
> >>     get a controllable memory usage (can be tuned by mount option)
> >>     alone with controllable read/write overhead used for hash searching.
> >
> > In Liu Bo's series, I rejected the mount options as an interface and
> > will do that here as well. His patches added a dedup ioctl to (at least)
> > enable/disable the dedup.
> BTW, would you please give me some reason why that's not a good idea to 
> use mount option to trigger/change dedup options?
That's the wrong interface to use for such actions.
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-08-27  9:14     ` David Sterba
@ 2015-08-31  1:13       ` Qu Wenruo
  2015-09-22 15:07         ` David Sterba
  0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2015-08-31  1:13 UTC (permalink / raw)
  To: dsterba, linux-btrfs
David Sterba wrote on 2015/08/27 11:14 +0200:
> On Mon, Aug 03, 2015 at 03:18:55PM +0800, Qu Wenruo wrote:
>>>> Implement details includes the following:
>>>> 1) LRU hash maps to limit the memory usage
>>>>      The hash -> extent mapping is control by LRU (or unlimited), to
>>>>      get a controllable memory usage (can be tuned by mount option)
>>>>      alone with controllable read/write overhead used for hash searching.
>>>
>>> In Liu Bo's series, I rejected the mount options as an interface and
>>> will do that here as well. His patches added a dedup ioctl to (at least)
>>> enable/disable the dedup.
>> BTW, would you please give me some reason why that's not a good idea to
>> use mount option to trigger/change dedup options?
>
> That's the wrong interface to use for such actions.
>
But IMHO, deduplication is much like compression: we only need to
execute an extra hook to handle the data in run_delalloc_range().
And even better than compression, in-band dedup won't cause any format
change.
So I'd prefer a mount option rather than introducing a new ioctl
interface.
Would you please explain more about why a mount option is not a good
idea in this use case?
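To show what I mean, here is a standalone sketch of the delalloc
dispatch with dedup as just one more branch; all types and helpers
below are stand-ins for illustration, not the actual fs/btrfs/inode.c
code:
======================================================================
#include <stdio.h>

/* Standalone sketch: dedup as one more branch in the delalloc write
 * path, next to the existing NOCOW and compression branches.  All
 * types and helpers here are stand-ins, not btrfs kernel code. */
struct inode_sketch {
        int nodatacow;       /* stand-in for BTRFS_INODE_NODATACOW */
        int dedup_enabled;   /* stand-in for a dedup on/off switch */
        int compress;        /* stand-in for inode_need_compress() */
};

static int run_delalloc_nocow(struct inode_sketch *i)   { puts("nocow path");    return 0; }
static int run_delalloc_dedup(struct inode_sketch *i)   { puts("dedup path");    return 0; }
static int cow_file_range_async(struct inode_sketch *i) { puts("compress path"); return 0; }
static int cow_file_range(struct inode_sketch *i)       { puts("plain cow");     return 0; }

static int run_delalloc_range(struct inode_sketch *inode)
{
        if (inode->nodatacow)
                return run_delalloc_nocow(inode);
        if (inode->dedup_enabled)          /* the only new branch */
                return run_delalloc_dedup(inode);
        if (inode->compress)
                return cow_file_range_async(inode);
        return cow_file_range(inode);      /* default: plain COW */
}

int main(void)
{
        struct inode_sketch inode = { 0, 1, 0 };
        return run_delalloc_range(&inode);
}
======================================================================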
Thanks,
Qu
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-08-31  1:13       ` Qu Wenruo
@ 2015-09-22 15:07         ` David Sterba
  2015-09-23  7:16           ` Qu Wenruo
  0 siblings, 1 reply; 19+ messages in thread
From: David Sterba @ 2015-09-22 15:07 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs
On Mon, Aug 31, 2015 at 09:13:29AM +0800, Qu Wenruo wrote:
> > That's the wrong interface to use for such actions.
> >
> But IMHO, deduplication is much like compression, we only need to 
> execution extra hook to handle data at run_delalloc_range().
It is, but filesystem-wide compression already shows its limitations.
E.g. it's not possible to set compression on a per-subvolume (or at
least per-subvolume-mount [*]) basis, and it's not possible to select
the compression method for a given file or directory.
Next, changing the status of dedup via remount is quite a heavy
operation. Remount has to sync the whole filesystem, while this is not
strictly necessary for changes to the dedup parameters.
The ioctl, if designed well, can be ready for future enhancements and we
can fine-tune the dedup engine in ways that we do not foresee right now.
I really don't want to see the mount options grow that way.
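To make the extensibility point concrete, a purely hypothetical sketch
of what such an ioctl argument block could carry (none of these names
exist in btrfs; the reserved space is the kind of room a mount option
string cannot give us):
======================================================================
#include <stdint.h>
#include <stdio.h>

/* Hypothetical dedup control arguments, illustration only */
enum dedup_ctl_cmd {
        DEDUP_CTL_ENABLE  = 1,
        DEDUP_CTL_DISABLE = 2,
        DEDUP_CTL_STATUS  = 3,
};

struct dedup_ctl_args {
        uint32_t cmd;           /* enable / disable / query status   */
        uint32_t hash_algo;     /* e.g. 0 = SHA-256                  */
        uint64_t blocksize;     /* dedup block size in bytes         */
        uint64_t limit_mem;     /* cap for the in-memory hash pool   */
        uint64_t flags;         /* room for future tuning knobs      */
        uint64_t reserved[4];   /* padding for later extensions      */
};

int main(void)
{
        struct dedup_ctl_args args = {
                .cmd       = DEDUP_CTL_ENABLE,
                .blocksize = 128 * 1024,
                .limit_mem = 32 * 1024 * 1024,
        };
        printf("args size: %zu bytes\n", sizeof(args));
        return (int)(args.cmd != DEDUP_CTL_ENABLE);
}
======================================================================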
> And even better than compression, inline dedup won't even cause any 
> format change.
> So I'd prefer to use mount option other than introduce a new ioctl 
> interface.
Yeah, no change in the format makes it just a matter of selecting the
right interface to manage dedup.
(I don't remember if there were other points that I wrote back then when
Liu Bo sent his in-band dedup patches and wanted to use the mount
options as well.)
[*] I do have a POC for that, mostly working, and will submit it at
    least as an RFC
* Re: [PATCH RFC 00/14] Yet Another In-band(online) deduplication implement
  2015-09-22 15:07         ` David Sterba
@ 2015-09-23  7:16           ` Qu Wenruo
  0 siblings, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2015-09-23  7:16 UTC (permalink / raw)
  To: dsterba, linux-btrfs
David Sterba wrote on 2015/09/22 17:07 +0200:
> On Mon, Aug 31, 2015 at 09:13:29AM +0800, Qu Wenruo wrote:
>>> That's the wrong interface to use for such actions.
>>>
>> But IMHO, deduplication is much like compression, we only need to
>> execution extra hook to handle data at run_delalloc_range().
>
> It is, but the filesystem-wide compression already shows its
> limitations. Eg. it's not possible to set the compression on a
> per-subvolume (or at least per-subvolume-mount [*]) basis, it's not possible
> to select the compression method for a given file or directory.
Makes sense; on this point it's better to use an ioctl rather than add
complicated mount options.
>
> Next, changing the status of dedup via remount is quite a heavy
> operation. Remount has to sync the whole filesystem, while this is not
> strictly necessary for changes in the dedup parameters.
>
> The ioctl, if designed well, can be ready for future enhancements and we
> can fine-tune the dedup engine in ways that we do not forsee right now.
> I really don't want to see the mount options grow that way.
Very convincing.
I'll change it to use an ioctl instead of a mount option.
But this will definitely introduce a new incompat feature, as even when
using the in-memory-only dedup engine, we will still need some
persistent data to record the dedup status, or we would have to call
"btrfs dedup enable" every time after mount.
Thanks for the review,
Qu
>
>> And even better than compression, inline dedup won't even cause any
>> format change.
>> So I'd prefer to use mount option other than introduce a new ioctl
>> interface.
>
> Yeah, no change in the format makes it just a matter of selecting the
> right interface to manage dedup.
>
> (I don't remember if there were other points that I wrote back then when
> Liu Bo sent his in-band dedup patches and wanted to use the mount
> options as well.)
>
>
> [*] I do have a POC for that, mostly working and will submit at least
>      for RFC
>