linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
@ 2020-04-27  8:47 Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 1/8] fs/dax: Introduce dax-rmap btree for reflink Shiyang Ruan
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Shiyang Ruan @ 2020-04-27  8:47 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david, hch,
	rgoldwyn, qi.fuli, y-goto

This patchset is a try to resolve the shared 'page cache' problem for
fsdax.

In order to track multiple mappings and indexes on one page, I
introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
will be associated more than once if is shared.  At the second time we
associate this entry, we create this rb-tree and store its root in
page->private(not used in fsdax).  Insert (->mapping, ->index) when
dax_associate_entry() and delete it when dax_disassociate_entry().

We can iterate the dax-rmap rb-tree before any other operations on
mappings of files.  Such as memory-failure and rmap.

Same as before, I borrowed and made some changes on Goldwyn's patchsets.
These patches makes up for the lack of CoW mechanism in fsdax.

The rests are dax & reflink support for xfs.

(Rebased to 5.7-rc2)


Shiyang Ruan (8):
  fs/dax: Introduce dax-rmap btree for reflink
  mm: add dax-rmap for memory-failure and rmap
  fs/dax: Introduce dax_copy_edges() for COW
  fs/dax: copy data before write
  fs/dax: replace mmap entry in case of CoW
  fs/dax: dedup file range to use a compare function
  fs/xfs: handle CoW for fsdax write() path
  fs/xfs: support dedupe for fsdax

 fs/dax.c               | 343 +++++++++++++++++++++++++++++++++++++----
 fs/ocfs2/file.c        |   2 +-
 fs/read_write.c        |  11 +-
 fs/xfs/xfs_bmap_util.c |   6 +-
 fs/xfs/xfs_file.c      |  10 +-
 fs/xfs/xfs_iomap.c     |   3 +-
 fs/xfs/xfs_iops.c      |  11 +-
 fs/xfs/xfs_reflink.c   |  79 ++++++----
 include/linux/dax.h    |  11 ++
 include/linux/fs.h     |   9 +-
 mm/memory-failure.c    |  63 ++++++--
 mm/rmap.c              |  54 +++++--
 12 files changed, 498 insertions(+), 104 deletions(-)

-- 
2.26.2




^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH 1/8] fs/dax: Introduce dax-rmap btree for reflink
  2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
@ 2020-04-27  8:47 ` Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 2/8] mm: add dax-rmap for memory-failure and rmap Shiyang Ruan
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Shiyang Ruan @ 2020-04-27  8:47 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david, hch,
	rgoldwyn, qi.fuli, y-goto

Normally, when accessing a mmapped file, entering the page fault, the
file's (->mapping, ->index) will be associated with dax entry(represents
for one page or a couple of pages) to facilitate the reverse mapping
search.  But in the case of reflink, a dax entry may be shared by multiple
files or offsets.  In order to establish a reverse mapping relationship in
this case, I introduce a rb-tree to track multiple files and offsets.

The root of the rb-tree is stored in page->private, since I haven't found
it be used in fsdax.  We create the rb-tree and insert the
(->mapping, ->index) tuple in the second time a dax entry is associated,
which means this dax entry is shared.  And delete this tuple from the
rb-tree when disassociating.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c            | 153 ++++++++++++++++++++++++++++++++++++++++----
 include/linux/dax.h |   6 ++
 2 files changed, 147 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 11b16729b86f..2f996c566103 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -25,6 +25,7 @@
 #include <linux/sizes.h>
 #include <linux/mmu_notifier.h>
 #include <linux/iomap.h>
+#include <linux/rbtree.h>
 #include <asm/pgalloc.h>
 
 #define CREATE_TRACE_POINTS
@@ -310,6 +311,120 @@ static unsigned long dax_entry_size(void *entry)
 		return PAGE_SIZE;
 }
 
+static struct kmem_cache *dax_rmap_node_cachep;
+static struct kmem_cache *dax_rmap_root_cachep;
+
+static int __init init_dax_rmap_cache(void)
+{
+	dax_rmap_root_cachep = KMEM_CACHE(rb_root_cached, SLAB_PANIC|SLAB_ACCOUNT);
+	dax_rmap_node_cachep = KMEM_CACHE(shared_file, SLAB_PANIC|SLAB_ACCOUNT);
+	return 0;
+}
+fs_initcall(init_dax_rmap_cache);
+
+struct rb_root_cached *dax_create_rbroot(void)
+{
+	struct rb_root_cached *root = kmem_cache_alloc(dax_rmap_root_cachep,
+						       GFP_KERNEL);
+
+	memset(root, 0, sizeof(struct rb_root_cached));
+	return root;
+}
+
+static bool dax_rmap_insert(struct page *page, struct address_space *mapping,
+			    pgoff_t index)
+{
+	struct rb_root_cached *root = (struct rb_root_cached *)page_private(page);
+	struct rb_node **new, *parent = NULL;
+	struct shared_file *p;
+	bool leftmost = true;
+
+	if (!root) {
+		root = dax_create_rbroot();
+		set_page_private(page, (unsigned long)root);
+		dax_rmap_insert(page, page->mapping, page->index);
+	}
+	new = &root->rb_root.rb_node;
+	/* Figure out where to insert new node */
+	while (*new) {
+		struct shared_file *this = container_of(*new, struct shared_file, node);
+		long result = (long)mapping - (long)this->mapping;
+
+		if (result == 0)
+			result = (long)index - (long)this->index;
+		parent = *new;
+		if (result < 0)
+			new = &((*new)->rb_left);
+		else if (result > 0) {
+			new = &((*new)->rb_right);
+			leftmost = false;
+		} else
+			return false;
+	}
+	p = kmem_cache_alloc(dax_rmap_node_cachep, GFP_KERNEL);
+	p->mapping = mapping;
+	p->index = index;
+
+	/* Add new node and rebalance tree. */
+	rb_link_node(&p->node, parent, new);
+	rb_insert_color_cached(&p->node, root, leftmost);
+
+	return true;
+}
+
+static struct shared_file *dax_rmap_search(struct page *page,
+					   struct address_space *mapping,
+					   pgoff_t index)
+{
+	struct rb_root_cached *root = (struct rb_root_cached *)page_private(page);
+	struct rb_node *node = root->rb_root.rb_node;
+
+	while (node) {
+		struct shared_file *this = container_of(node, struct shared_file, node);
+		long result = (long)mapping - (long)this->mapping;
+
+		if (result == 0)
+			result = (long)index - (long)this->index;
+		if (result < 0)
+			node = node->rb_left;
+		else if (result > 0)
+			node = node->rb_right;
+		else
+			return this;
+	}
+	return NULL;
+}
+
+static void dax_rmap_delete(struct page *page, struct address_space *mapping,
+			    pgoff_t index)
+{
+	struct rb_root_cached *root = (struct rb_root_cached *)page_private(page);
+	struct shared_file *this;
+
+	if (!root) {
+		page->mapping = NULL;
+		page->index = 0;
+		return;
+	}
+
+	this = dax_rmap_search(page, mapping, index);
+	rb_erase_cached(&this->node, root);
+	kmem_cache_free(dax_rmap_node_cachep, this);
+
+	if (!RB_EMPTY_ROOT(&root->rb_root)) {
+		if (page->mapping == mapping && page->index == index) {
+			this = container_of(rb_first_cached(root), struct shared_file, node);
+			page->mapping = this->mapping;
+			page->index = this->index;
+		}
+	} else {
+		kmem_cache_free(dax_rmap_root_cachep, root);
+		set_page_private(page, 0);
+		page->mapping = NULL;
+		page->index = 0;
+	}
+}
+
 static unsigned long dax_end_pfn(void *entry)
 {
 	return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
@@ -341,16 +456,20 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(page->mapping);
-		page->mapping = mapping;
-		page->index = index + i++;
+		if (!page->mapping) {
+			page->mapping = mapping;
+			page->index = index + i++;
+		} else {
+			dax_rmap_insert(page, mapping, index + i++);
+		}
 	}
 }
 
 static void dax_disassociate_entry(void *entry, struct address_space *mapping,
-		bool trunc)
+				   pgoff_t index, bool trunc)
 {
 	unsigned long pfn;
+	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
@@ -359,9 +478,19 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
-		page->mapping = NULL;
-		page->index = 0;
+		WARN_ON_ONCE(!page->mapping);
+		dax_rmap_delete(page, mapping, index + i++);
+	}
+}
+
+static void __dax_decrease_nrexceptional(void *entry,
+					 struct address_space *mapping)
+{
+	if (dax_is_empty_entry(entry) || dax_is_zero_entry(entry) ||
+	    dax_is_pmd_entry(entry)) {
+		mapping->nrexceptional--;
+	} else {
+		mapping->nrexceptional -= PHYS_PFN(dax_entry_size(entry));
 	}
 }
 
@@ -522,10 +651,10 @@ static void *grab_mapping_entry(struct xa_state *xas,
 			xas_lock_irq(xas);
 		}
 
-		dax_disassociate_entry(entry, mapping, false);
+		dax_disassociate_entry(entry, mapping, index, false);
 		xas_store(xas, NULL);	/* undo the PMD join */
 		dax_wake_entry(xas, entry, true);
-		mapping->nrexceptional--;
+		__dax_decrease_nrexceptional(entry, mapping);
 		entry = NULL;
 		xas_set(xas, index);
 	}
@@ -642,9 +771,9 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	    (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
 	     xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
 		goto out;
-	dax_disassociate_entry(entry, mapping, trunc);
+	dax_disassociate_entry(entry, mapping, index, trunc);
 	xas_store(&xas, NULL);
-	mapping->nrexceptional--;
+	__dax_decrease_nrexceptional(entry, mapping);
 	ret = 1;
 out:
 	put_unlocked_entry(&xas, entry);
@@ -737,7 +866,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		void *old;
 
-		dax_disassociate_entry(entry, mapping, false);
+		dax_disassociate_entry(entry, mapping, xas->xa_index, false);
 		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 		/*
 		 * Only swap our new entry into the page cache if the current
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d7af5d243f24..1e2e81c701b6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -39,6 +39,12 @@ struct dax_operations {
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
 };
 
+struct shared_file {
+	struct address_space *mapping;
+	pgoff_t index;
+	struct rb_node node;
+};
+
 extern struct attribute_group dax_attribute_group;
 
 #if IS_ENABLED(CONFIG_DAX)
-- 
2.26.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 2/8] mm: add dax-rmap for memory-failure and rmap
  2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 1/8] fs/dax: Introduce dax-rmap btree for reflink Shiyang Ruan
@ 2020-04-27  8:47 ` Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 3/8] fs/dax: Introduce dax_copy_edges() for COW Shiyang Ruan
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Shiyang Ruan @ 2020-04-27  8:47 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david, hch,
	rgoldwyn, qi.fuli, y-goto

Memory-failure collects and kill processes who is accessing a posioned,
file mmapped page.  Add dax-rmap iteration to support reflink case.
Also add it for rmap iteration.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 mm/memory-failure.c | 63 +++++++++++++++++++++++++++++++++++----------
 mm/rmap.c           | 54 +++++++++++++++++++++++++++-----------
 2 files changed, 88 insertions(+), 29 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a96364be8ab4..6d7da1fd55fd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -463,36 +463,71 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	page_unlock_anon_vma_read(av);
 }
 
+static void collect_each_procs_file(struct page *page,
+				    struct task_struct *task,
+				    struct list_head *to_kill)
+{
+	struct vm_area_struct *vma;
+	struct address_space *mapping = page->mapping;
+	struct rb_root_cached *root = (struct rb_root_cached *)page_private(page);
+	struct rb_node *node;
+	struct shared_file *shared;
+	pgoff_t pgoff;
+
+	if (dax_mapping(mapping) && root) {
+		struct shared_file save = {
+			.mapping = mapping,
+			.index = page->index,
+		};
+		for (node = rb_first_cached(root); node; node = rb_next(node)) {
+			shared = container_of(node, struct shared_file, node);
+			mapping = page->mapping = shared->mapping;
+			page->index = shared->index;
+			pgoff = page_to_pgoff(page);
+			vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
+						  pgoff) {
+				if (vma->vm_mm == task->mm) {
+					// each vma is unique, so is the vaddr.
+					add_to_kill(task, page, vma, to_kill);
+				}
+			}
+		}
+		// restore the mapping and index.
+		page->mapping = save.mapping;
+		page->index = save.index;
+	} else {
+		pgoff = page_to_pgoff(page);
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+			/*
+			 * Send early kill signal to tasks where a vma covers
+			 * the page but the corrupted page is not necessarily
+			 * mapped it in its pte.
+			 * Assume applications who requested early kill want
+			 * to be informed of all such data corruptions.
+			 */
+			if (vma->vm_mm == task->mm)
+				add_to_kill(task, page, vma, to_kill);
+		}
+	}
+}
+
 /*
  * Collect processes when the error hit a file mapped page.
  */
 static void collect_procs_file(struct page *page, struct list_head *to_kill,
 				int force_early)
 {
-	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct address_space *mapping = page->mapping;
 
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page_to_pgoff(page);
 		struct task_struct *t = task_early_kill(tsk, force_early);
 
 		if (!t)
 			continue;
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
-				      pgoff) {
-			/*
-			 * Send early kill signal to tasks where a vma covers
-			 * the page but the corrupted page is not necessarily
-			 * mapped it in its pte.
-			 * Assume applications who requested early kill want
-			 * to be informed of all such data corruptions.
-			 */
-			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
-		}
+		collect_each_procs_file(page, t, to_kill);
 	}
 	read_unlock(&tasklist_lock);
 	i_mmap_unlock_read(mapping);
diff --git a/mm/rmap.c b/mm/rmap.c
index f79a206b271a..69ea66f9e971 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1870,21 +1870,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
 		anon_vma_unlock_read(anon_vma);
 }
 
-/*
- * rmap_walk_file - do something to file page using the object-based rmap method
- * @page: the page to be handled
- * @rwc: control variable according to each walk type
- *
- * Find all the mappings of a page using the mapping pointer and the vma chains
- * contained in the address_space struct it points to.
- *
- * When called from try_to_munlock(), the mmap_sem of the mm containing the vma
- * where the page was found will be held for write.  So, we won't recheck
- * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
- * LOCKED.
- */
-static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
-		bool locked)
+static void rmap_walk_file_one(struct page *page, struct rmap_walk_control *rwc, bool locked)
 {
 	struct address_space *mapping = page_mapping(page);
 	pgoff_t pgoff_start, pgoff_end;
@@ -1925,6 +1911,44 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
 		i_mmap_unlock_read(mapping);
 }
 
+/*
+ * rmap_walk_file - do something to file page using the object-based rmap method
+ * @page: the page to be handled
+ * @rwc: control variable according to each walk type
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the address_space struct it points to.
+ *
+ * When called from try_to_munlock(), the mmap_sem of the mm containing the vma
+ * where the page was found will be held for write.  So, we won't recheck
+ * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
+ * LOCKED.
+ */
+static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
+		bool locked)
+{
+	struct rb_root_cached *root = (struct rb_root_cached *)page_private(page);
+	struct rb_node *node;
+	struct shared_file *shared;
+
+	if (dax_mapping(page->mapping) && root) {
+		struct shared_file save = {
+			.mapping = page->mapping,
+			.index = page->index,
+		};
+		for (node = rb_first_cached(root); node; node = rb_next(node)) {
+			shared = container_of(node, struct shared_file, node);
+			page->mapping = shared->mapping;
+			page->index = shared->index;
+			rmap_walk_file_one(page, rwc, locked);
+		}
+		// restore the mapping and index.
+		page->mapping = save.mapping;
+		page->index = save.index;
+	} else
+		rmap_walk_file_one(page, rwc, locked);
+}
+
 void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
 {
 	if (unlikely(PageKsm(page)))
-- 
2.26.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 3/8] fs/dax: Introduce dax_copy_edges() for COW
  2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 1/8] fs/dax: Introduce dax-rmap btree for reflink Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 2/8] mm: add dax-rmap for memory-failure and rmap Shiyang Ruan
@ 2020-04-27  8:47 ` Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 4/8] fs/dax: copy data before write Shiyang Ruan
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Shiyang Ruan @ 2020-04-27  8:47 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david, hch,
	rgoldwyn, qi.fuli, y-goto, Goldwyn Rodrigues

Add address output in dax_iomap_pfn() in order to perform a memcpy() in CoW
case.  Since this function both output address and pfn, rename it to
dax_iomap_direct_access().

dax_copy_edges() is a helper functions performs a copy from one part of
the device to another for data not page aligned.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 58 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2f996c566103..8107ed10c851 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1111,8 +1111,8 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 	return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
 }
 
-static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
-			 pfn_t *pfnp)
+static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
+			 pfn_t *pfnp, void **addr)
 {
 	const sector_t sector = dax_iomap_sector(iomap, pos);
 	pgoff_t pgoff;
@@ -1123,12 +1123,14 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
 	if (rc)
 		return rc;
 	id = dax_read_lock();
-	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-				   NULL, pfnp);
+	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size), addr,
+				   pfnp);
 	if (length < 0) {
 		rc = length;
 		goto out;
 	}
+	if (!pfnp)
+		goto out_check_addr;
 	rc = -EINVAL;
 	if (PFN_PHYS(length) < size)
 		goto out;
@@ -1138,11 +1140,59 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
 	if (length > 1 && !pfn_t_devmap(*pfnp))
 		goto out;
 	rc = 0;
+
+out_check_addr:
+	if (!addr)
+		goto out;
+	if (!*addr)
+		rc = -EFAULT;
 out:
 	dax_read_unlock(id);
 	return rc;
 }
 
+/*
+ * dax_copy_edges - Copies the part of the pages not included in
+ *		    the write, but required for CoW because
+ *		    offset/offset+length are not page aligned.
+ */
+static int dax_copy_edges(loff_t pos, loff_t length, struct iomap *srcmap,
+			  void *daddr, bool pmd)
+{
+	size_t page_size = pmd ? PMD_SIZE : PAGE_SIZE;
+	loff_t offset = pos & (page_size - 1);
+	size_t size = ALIGN(offset + length, page_size);
+	loff_t end = pos + length;
+	loff_t pg_end = round_up(end, page_size);
+	void *saddr = 0;
+	int ret = 0;
+
+	ret = dax_iomap_direct_access(srcmap, pos, size, NULL, &saddr);
+	if (ret)
+		return ret;
+	/*
+	 * Copy the first part of the page
+	 * Note: we pass offset as length
+	 */
+	if (offset) {
+		if (saddr)
+			ret = memcpy_mcsafe(daddr, saddr, offset);
+		else
+			memset(daddr, 0, offset);
+	}
+
+	/* Copy the last part of the range */
+	if (end < pg_end) {
+		if (saddr)
+			ret = memcpy_mcsafe(daddr + offset + length,
+			       saddr + offset + length,	pg_end - end);
+		else
+			memset(daddr + offset + length, 0,
+					pg_end - end);
+	}
+	return ret;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1462,7 +1512,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
 			major = VM_FAULT_MAJOR;
 		}
-		error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
+		error = dax_iomap_direct_access(&iomap, pos, PAGE_SIZE, &pfn,
+						NULL);
 		if (error < 0)
 			goto error_finish_iomap;
 
@@ -1680,7 +1731,8 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
-		error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
+		error = dax_iomap_direct_access(&iomap, pos, PMD_SIZE, &pfn,
+						NULL);
 		if (error < 0)
 			goto finish_iomap;
 
-- 
2.26.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 4/8] fs/dax: copy data before write
  2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
                   ` (2 preceding siblings ...)
  2020-04-27  8:47 ` [RFC PATCH 3/8] fs/dax: Introduce dax_copy_edges() for COW Shiyang Ruan
@ 2020-04-27  8:47 ` Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 5/8] fs/dax: replace mmap entry in case of CoW Shiyang Ruan
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Shiyang Ruan @ 2020-04-27  8:47 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david, hch,
	rgoldwyn, qi.fuli, y-goto

Add dax_copy_edges() into each dax actor functions to perform CoW.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 8107ed10c851..13a6a1d3c3b3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1276,9 +1276,6 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			return iov_iter_zero(min(length, end - pos), iter);
 	}
 
-	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
-		return -EIO;
-
 	/*
 	 * Write can allocate block for an area which has a hole page mapped
 	 * into page tables. We have to tear down these mappings so that data
@@ -1315,6 +1312,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 			break;
 		}
 
+		if (iomap != srcmap) {
+			ret = dax_copy_edges(pos, length, srcmap, kaddr, false);
+			if (ret)
+				break;
+		}
+
 		map_len = PFN_PHYS(map_len);
 		kaddr += offset;
 		map_len -= offset;
@@ -1426,6 +1429,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	vm_fault_t ret = 0;
 	void *entry;
 	pfn_t pfn;
+	void *kaddr;
 
 	trace_dax_pte_fault(inode, vmf, ret);
 	/*
@@ -1507,19 +1511,27 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
+cow:
 		if (iomap.flags & IOMAP_F_NEW) {
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
 			major = VM_FAULT_MAJOR;
 		}
 		error = dax_iomap_direct_access(&iomap, pos, PAGE_SIZE, &pfn,
-						NULL);
+						&kaddr);
 		if (error < 0)
 			goto error_finish_iomap;
 
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
 						 0, write && !sync);
 
+		if (srcmap.type != IOMAP_HOLE) {
+			error = dax_copy_edges(pos, PAGE_SIZE, &srcmap, kaddr,
+					       false);
+			if (error)
+				goto error_finish_iomap;
+		}
+
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
 		 * we can insert PTE into page tables only after that happens.
@@ -1543,6 +1555,9 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 		goto finish_iomap;
 	case IOMAP_UNWRITTEN:
+		if (srcmap.type != IOMAP_HOLE)
+			goto cow;
+		/*FALLTHRU*/
 	case IOMAP_HOLE:
 		if (!write) {
 			ret = dax_load_hole(&xas, mapping, &entry, vmf);
@@ -1650,6 +1665,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	loff_t pos;
 	int error;
 	pfn_t pfn;
+	void *kaddr;
 
 	/*
 	 * Check whether offset isn't beyond end of file now. Caller is
@@ -1731,14 +1747,22 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
+cow:
 		error = dax_iomap_direct_access(&iomap, pos, PMD_SIZE, &pfn,
-						NULL);
+						&kaddr);
 		if (error < 0)
 			goto finish_iomap;
 
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
 						DAX_PMD, write && !sync);
 
+		if (srcmap.type != IOMAP_HOLE) {
+			error = dax_copy_edges(pos, PMD_SIZE, &srcmap, kaddr,
+					       true);
+			if (error)
+				goto unlock_entry;
+		}
+
 		/*
 		 * If we are doing synchronous page fault and inode needs fsync,
 		 * we can insert PMD into page tables only after that happens.
@@ -1757,6 +1781,9 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		result = vmf_insert_pfn_pmd(vmf, pfn, write);
 		break;
 	case IOMAP_UNWRITTEN:
+		if (srcmap.type != IOMAP_HOLE)
+			goto cow;
+		/*FALLTHRU*/
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(write))
 			break;
-- 
2.26.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 5/8] fs/dax: replace mmap entry in case of CoW
  2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
                   ` (3 preceding siblings ...)
  2020-04-27  8:47 ` [RFC PATCH 4/8] fs/dax: copy data before write Shiyang Ruan
@ 2020-04-27  8:47 ` Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 6/8] fs/dax: dedup file range to use a compare function Shiyang Ruan
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Shiyang Ruan @ 2020-04-27  8:47 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david, hch,
	rgoldwyn, qi.fuli, y-goto, Goldwyn Rodrigues

We replace the existing entry to the newly allocated one
in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
so writeback marks this entry as writeprotected. This
helps us snapshots so new write pagefaults after snapshots
trigger a CoW.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 13a6a1d3c3b3..12096edb2569 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -835,6 +835,9 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
 	return 0;
 }
 
+#define DAX_IF_DIRTY		(1ULL << 0)
+#define DAX_IF_COW		(1ULL << 1)
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -844,14 +847,17 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
  */
 static void *dax_insert_entry(struct xa_state *xas,
 		struct address_space *mapping, struct vm_fault *vmf,
-		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+		void *entry, pfn_t pfn, unsigned long flags,
+		unsigned int insert_flags)
 {
 	void *new_entry = dax_make_entry(pfn, flags);
+	bool dirty = insert_flags & DAX_IF_DIRTY;
+	bool cow = insert_flags & DAX_IF_COW;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
 		if (dax_is_pmd_entry(entry))
@@ -863,7 +869,7 @@ static void *dax_insert_entry(struct xa_state *xas,
 
 	xas_reset(xas);
 	xas_lock_irq(xas);
-	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, xas->xa_index, false);
@@ -887,6 +893,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }
@@ -1430,6 +1439,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	void *entry;
 	pfn_t pfn;
 	void *kaddr;
+	unsigned long insert_flags = 0;
 
 	trace_dax_pte_fault(inode, vmf, ret);
 	/*
@@ -1522,8 +1532,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto error_finish_iomap;
 
-		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						 0, write && !sync);
+		if (write && !sync)
+			insert_flags |= DAX_IF_DIRTY;
+		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0,
+					 insert_flags);
 
 		if (srcmap.type != IOMAP_HOLE) {
 			error = dax_copy_edges(pos, PAGE_SIZE, &srcmap, kaddr,
@@ -1555,8 +1567,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 
 		goto finish_iomap;
 	case IOMAP_UNWRITTEN:
-		if (srcmap.type != IOMAP_HOLE)
+		if (srcmap.type != IOMAP_HOLE) {
+			insert_flags |= DAX_IF_COW;
 			goto cow;
+		}
 		/*FALLTHRU*/
 	case IOMAP_HOLE:
 		if (!write) {
@@ -1614,7 +1628,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 
 	pfn = page_to_pfn_t(zero_page);
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_PMD | DAX_ZERO_PAGE, false);
+			DAX_PMD | DAX_ZERO_PAGE, 0);
 
 	if (arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
@@ -1666,6 +1680,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	int error;
 	pfn_t pfn;
 	void *kaddr;
+	unsigned long insert_flags = 0;
 
 	/*
 	 * Check whether offset isn't beyond end of file now. Caller is
@@ -1753,8 +1768,10 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		if (error < 0)
 			goto finish_iomap;
 
+		if (write && !sync)
+			insert_flags |= DAX_IF_DIRTY;
 		entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
-						DAX_PMD, write && !sync);
+						DAX_PMD, insert_flags);
 
 		if (srcmap.type != IOMAP_HOLE) {
 			error = dax_copy_edges(pos, PMD_SIZE, &srcmap, kaddr,
@@ -1781,8 +1798,10 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		result = vmf_insert_pfn_pmd(vmf, pfn, write);
 		break;
 	case IOMAP_UNWRITTEN:
-		if (srcmap.type != IOMAP_HOLE)
+		if (srcmap.type != IOMAP_HOLE) {
+			insert_flags |= DAX_IF_COW;
 			goto cow;
+		}
 		/*FALLTHRU*/
 	case IOMAP_HOLE:
 		if (WARN_ON_ONCE(write))
-- 
2.26.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 6/8] fs/dax: dedup file range to use a compare function
  2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
                   ` (4 preceding siblings ...)
  2020-04-27  8:47 ` [RFC PATCH 5/8] fs/dax: replace mmap entry in case of CoW Shiyang Ruan
@ 2020-04-27  8:47 ` Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 7/8] fs/xfs: handle CoW for fsdax write() path Shiyang Ruan
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Shiyang Ruan @ 2020-04-27  8:47 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david, hch,
	rgoldwyn, qi.fuli, y-goto, Goldwyn Rodrigues

With dax we cannot deal with readpage() etc. So, we create a
funciton callback to perform the file data comparison and pass
it to generic_remap_file_range_prep() so it can use iomap-based
functions.

This may not be the best way to solve this. Suggestions welcome.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c             | 60 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/file.c      |  2 +-
 fs/read_write.c      | 11 ++++----
 fs/xfs/xfs_reflink.c |  2 +-
 include/linux/dax.h  |  5 ++++
 include/linux/fs.h   |  9 ++++++-
 6 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 12096edb2569..ab6be7749105 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,6 +31,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/fs_dax.h>
 
+#define MIN(a, b) (((a) < (b)) ? (a) : (b))
+
 static inline unsigned int pe_order(enum page_entry_size pe_size)
 {
 	if (pe_size == PE_SIZE_PTE)
@@ -1942,3 +1944,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest,
+		loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops)
+{
+	void *saddr, *daddr;
+	struct iomap s_iomap = {0};
+	struct iomap d_iomap = {0};
+	bool same = true;
+	loff_t cmp_len;
+	int id, ret = 0;
+
+	id = dax_read_lock();
+	while (len) {
+		ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap, NULL);
+		if (ret < 0)
+			goto out_src;
+		cmp_len = MIN(len, s_iomap.offset + s_iomap.length - srcoff);
+
+		ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap, NULL);
+		if (ret < 0)
+			goto out_dest;
+		cmp_len = MIN(cmp_len, d_iomap.offset + d_iomap.length - destoff);
+
+		ret = dax_iomap_direct_access(&s_iomap, srcoff,
+					      ALIGN(srcoff + cmp_len, PAGE_SIZE),
+					      NULL, &saddr);
+		if (ret < 0)
+			goto out_dest;
+
+		ret = dax_iomap_direct_access(&d_iomap, destoff,
+					      ALIGN(destoff + cmp_len, PAGE_SIZE),
+					      NULL, &daddr);
+		if (ret < 0)
+			goto out_dest;
+
+		same = !memcmp(saddr, daddr, cmp_len);
+		if (!same)
+			break;
+		len -= cmp_len;
+		srcoff += cmp_len;
+		destoff += cmp_len;
+out_dest:
+		if (ops->iomap_end)
+			ops->iomap_end(dest, destoff, len, 0, 0, &d_iomap);
+out_src:
+		if (ops->iomap_end)
+			ops->iomap_end(src, srcoff, len, 0, 0, &s_iomap);
+
+		if (ret < 0)
+			goto out;
+
+	}
+	*is_same = same;
+out:
+	dax_read_unlock(id);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dax_file_range_compare);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 6cd5e4924e4d..6fa6e1ac6357 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2591,7 +2591,7 @@ static loff_t ocfs2_remap_file_range(struct file *file_in, loff_t pos_in,
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			&len, remap_flags);
+			&len, remap_flags, vfs_dedupe_file_range_compare);
 	if (ret < 0 || len == 0)
 		goto out_unlock;
 
diff --git a/fs/read_write.c b/fs/read_write.c
index bbfa9b12b15e..e08875f35f10 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1908,7 +1908,7 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
  * Compare extents of two files to see if they are the same.
  * Caller must have locked both inodes to prevent write races.
  */
-static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 					 struct inode *dest, loff_t destoff,
 					 loff_t len, bool *is_same)
 {
@@ -1989,6 +1989,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 out_error:
 	return error;
 }
+EXPORT_SYMBOL_GPL(vfs_dedupe_file_range_compare);
 
 /*
  * Check that the two inodes are eligible for cloning, the ranges make
@@ -2000,7 +2001,8 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
  */
 int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
-				  loff_t *len, unsigned int remap_flags)
+				  loff_t *len, unsigned int remap_flags,
+				  compare_range_t compare)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -2059,9 +2061,8 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	 */
 	if (remap_flags & REMAP_FILE_DEDUP) {
 		bool		is_same = false;
-
-		ret = vfs_dedupe_file_range_compare(inode_in, pos_in,
-				inode_out, pos_out, *len, &is_same);
+		ret = (*compare)(inode_in, pos_in,
+			inode_out, pos_out, *len, &is_same);
 		if (ret)
 			return ret;
 		if (!is_same)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 107bf2a2f344..792217cd1e64 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1338,7 +1338,7 @@ xfs_reflink_remap_prep(
 		goto out_unlock;
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags);
+			len, remap_flags, vfs_dedupe_file_range_compare);
 	if (ret < 0 || *len == 0)
 		goto out_unlock;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 1e2e81c701b6..f8143fe24bf5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -223,6 +223,11 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
 			struct iomap *iomap);
+int dax_file_range_compare(struct inode *src, loff_t srcoff,
+			   struct inode *dest, loff_t destoff,
+			   loff_t len, bool *is_same,
+			   const struct iomap_ops *ops);
+
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4f6f59b4f22a..959173cdeb05 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1924,13 +1924,20 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+					 struct inode *dest, loff_t destoff,
+					 loff_t len, bool *is_same);
+typedef int (*compare_range_t)(struct inode *src, loff_t srcpos,
+			       struct inode *dest, loff_t destpos,
+			       loff_t len, bool *is_same);
 extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 				       struct file *file_out, loff_t pos_out,
 				       size_t len, unsigned int flags);
 extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
 					 loff_t *count,
-					 unsigned int remap_flags);
+					 unsigned int remap_flags,
+					 compare_range_t cmp);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
-- 
2.26.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 7/8] fs/xfs: handle CoW for fsdax write() path
  2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
                   ` (5 preceding siblings ...)
  2020-04-27  8:47 ` [RFC PATCH 6/8] fs/dax: dedup file range to use a compare function Shiyang Ruan
@ 2020-04-27  8:47 ` Shiyang Ruan
  2020-04-27  8:47 ` [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax Shiyang Ruan
  2020-04-27 12:28 ` [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Matthew Wilcox
  8 siblings, 0 replies; 22+ messages in thread
From: Shiyang Ruan @ 2020-04-27  8:47 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david, hch,
	rgoldwyn, qi.fuli, y-goto

In fsdax mode, WRITE and ZERO on a shared extent need CoW mechanism
performed.  After CoW, new extents needs to be remapped to the file.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>

fs/dax.c: fix nrexceptional count error.
---
 fs/xfs/xfs_bmap_util.c |  6 +++++-
 fs/xfs/xfs_file.c      | 10 +++++++---
 fs/xfs/xfs_iomap.c     |  3 ++-
 fs/xfs/xfs_iops.c      | 11 ++++++++---
 fs/xfs/xfs_reflink.c   |  2 ++
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 4f800f7fe888..8980120415ba 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -969,10 +969,14 @@ xfs_free_file_space(
 	if (offset + len > XFS_ISIZE(ip))
 		len = XFS_ISIZE(ip) - offset;
 	error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
-			&xfs_buffered_write_iomap_ops);
+		  IS_DAX(VFS_I(ip)) ?
+		  &xfs_direct_write_iomap_ops : &xfs_buffered_write_iomap_ops);
 	if (error)
 		return error;
 
+	if (xfs_is_reflink_inode(ip))
+		xfs_reflink_end_cow(ip, offset, len);
+
 	/*
 	 * If we zeroed right up to EOF and EOF straddles a page boundary we
 	 * must make sure that the post-EOF area is also zeroed because the
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4b8bdecc3863..2f8a83b2605d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -588,9 +588,13 @@ xfs_file_dax_write(
 
 	trace_xfs_file_dax_write(ip, count, pos);
 	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
-	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
-		i_size_write(inode, iocb->ki_pos);
-		error = xfs_setfilesize(ip, pos, ret);
+	if (ret > 0) {
+		if (iocb->ki_pos > i_size_read(inode)) {
+			i_size_write(inode, iocb->ki_pos);
+			error = xfs_setfilesize(ip, pos, ret);
+		}
+		if (xfs_is_cow_inode(ip))
+			xfs_reflink_end_cow(ip, pos, ret);
 	}
 out:
 	xfs_iunlock(ip, iolock);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index bb590a267a7f..ca92f76498c9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -754,13 +754,14 @@ xfs_direct_write_iomap_begin(
 		goto out_unlock;
 
 	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
+		bool need_convert = flags & IOMAP_DIRECT || IS_DAX(inode);
 		error = -EAGAIN;
 		if (flags & IOMAP_NOWAIT)
 			goto out_unlock;
 
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-				&lockmode, flags & IOMAP_DIRECT);
+				&lockmode, need_convert);
 		if (error)
 			goto out_unlock;
 		if (shared)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index f7a99b3bbcf7..92b58ae06c0e 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -864,6 +864,7 @@ xfs_setattr_size(
 	int			error;
 	uint			lock_flags = 0;
 	bool			did_zeroing = false;
+	const struct iomap_ops	*ops;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
 	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
@@ -910,13 +911,17 @@ xfs_setattr_size(
 	 * extension, or zeroing out the rest of the block on a downward
 	 * truncate.
 	 */
+	if (IS_DAX(inode))
+		ops = &xfs_direct_write_iomap_ops;
+	else
+		ops = &xfs_buffered_write_iomap_ops;
+
 	if (newsize > oldsize) {
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
 		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
-				&did_zeroing, &xfs_buffered_write_iomap_ops);
+				&did_zeroing, ops);
 	} else {
-		error = iomap_truncate_page(inode, newsize, &did_zeroing,
-				&xfs_buffered_write_iomap_ops);
+		error = iomap_truncate_page(inode, newsize, &did_zeroing, ops);
 	}
 
 	if (error)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 792217cd1e64..f87ab78dd421 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1269,6 +1269,8 @@ xfs_reflink_zero_posteof(
 
 	trace_xfs_zero_eof(ip, isize, pos - isize);
 	return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
+			IS_DAX(VFS_I(ip)) ?
+			&xfs_direct_write_iomap_ops :
 			&xfs_buffered_write_iomap_ops);
 }
 
-- 
2.26.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax
  2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
                   ` (6 preceding siblings ...)
  2020-04-27  8:47 ` [RFC PATCH 7/8] fs/xfs: handle CoW for fsdax write() path Shiyang Ruan
@ 2020-04-27  8:47 ` Shiyang Ruan
  2020-04-27 12:28 ` [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Matthew Wilcox
  8 siblings, 0 replies; 22+ messages in thread
From: Shiyang Ruan @ 2020-04-27  8:47 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm
  Cc: linux-mm, linux-fsdevel, darrick.wong, dan.j.williams, david, hch,
	rgoldwyn, qi.fuli, y-goto

Use xfs_break_layouts() to break files' layouts when locking them.  And
call dax_file_range_compare() function to compare range for files both
DAX.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/xfs/xfs_reflink.c | 77 ++++++++++++++++++++++++++------------------
 1 file changed, 45 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f87ab78dd421..efbe3464ae85 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1185,47 +1185,41 @@ xfs_reflink_remap_blocks(
  * back out both locks.
  */
 static int
-xfs_iolock_two_inodes_and_break_layout(
-	struct inode		*src,
-	struct inode		*dest)
+xfs_reflink_remap_lock_and_break_layouts(
+	struct file		*file_in,
+	struct file		*file_out)
 {
 	int			error;
+	struct inode		*inode_in = file_inode(file_in);
+	struct xfs_inode	*src = XFS_I(inode_in);
+	struct inode		*inode_out = file_inode(file_out);
+	struct xfs_inode	*dest = XFS_I(inode_out);
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 
-	if (src > dest)
+	if (inode_in > inode_out) {
+		swap(inode_in, inode_out);
 		swap(src, dest);
-
-retry:
-	/* Wait to break both inodes' layouts before we start locking. */
-	error = break_layout(src, true);
-	if (error)
-		return error;
-	if (src != dest) {
-		error = break_layout(dest, true);
-		if (error)
-			return error;
 	}
 
-	/* Lock one inode and make sure nobody got in and leased it. */
-	inode_lock(src);
-	error = break_layout(src, false);
+	inode_lock(inode_in);
+	xfs_ilock(src, XFS_MMAPLOCK_EXCL);
+	error = xfs_break_layouts(inode_in, &iolock, BREAK_UNMAP);
+	xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
 	if (error) {
-		inode_unlock(src);
-		if (error == -EWOULDBLOCK)
-			goto retry;
+		inode_unlock(inode_in);
 		return error;
 	}
 
-	if (src == dest)
+	if (inode_in == inode_out)
 		return 0;
 
-	/* Lock the other inode and make sure nobody got in and leased it. */
-	inode_lock_nested(dest, I_MUTEX_NONDIR2);
-	error = break_layout(dest, false);
+	inode_lock_nested(inode_out, I_MUTEX_NONDIR2);
+	xfs_ilock(dest, XFS_MMAPLOCK_EXCL);
+	error = xfs_break_layouts(inode_out, &iolock, BREAK_UNMAP);
+	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
 	if (error) {
-		inode_unlock(src);
-		inode_unlock(dest);
-		if (error == -EWOULDBLOCK)
-			goto retry;
+		inode_unlock(inode_in);
+		inode_unlock(inode_out);
 		return error;
 	}
 
@@ -1244,6 +1238,11 @@ xfs_reflink_remap_unlock(
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
 
+	if (inode_in > inode_out) {
+		swap(inode_in, inode_out);
+		swap(src, dest);
+	}
+
 	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
 	if (!same_inode)
 		xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
@@ -1274,6 +1273,14 @@ xfs_reflink_zero_posteof(
 			&xfs_buffered_write_iomap_ops);
 }
 
+int xfs_reflink_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+					  struct inode *dest, loff_t destoff,
+					  loff_t len, bool *is_same)
+{
+	return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same,
+				      &xfs_read_iomap_ops);
+}
+
 /*
  * Prepare two files for range cloning.  Upon a successful return both inodes
  * will have the iolock and mmaplock held, the page cache of the out file will
@@ -1318,9 +1325,10 @@ xfs_reflink_remap_prep(
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
 	ssize_t			ret;
+	compare_range_t		cmp;
 
 	/* Lock both files against IO */
-	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
+	ret = xfs_reflink_remap_lock_and_break_layouts(file_in, file_out);
 	if (ret)
 		return ret;
 	if (same_inode)
@@ -1335,12 +1343,17 @@ xfs_reflink_remap_prep(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		goto out_unlock;
 
-	/* Don't share DAX file data for now. */
-	if (IS_DAX(inode_in) || IS_DAX(inode_out))
+	/* Don't share DAX file data with non-DAX file. */
+	if (IS_DAX(inode_in) != IS_DAX(inode_out))
 		goto out_unlock;
 
+	if (IS_DAX(inode_in))
+		cmp = xfs_reflink_dedupe_file_range_compare;
+	else
+		cmp = vfs_dedupe_file_range_compare;
+
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags, vfs_dedupe_file_range_compare);
+			len, remap_flags, cmp);
 	if (ret < 0 || *len == 0)
 		goto out_unlock;
 
-- 
2.26.2




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
                   ` (7 preceding siblings ...)
  2020-04-27  8:47 ` [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax Shiyang Ruan
@ 2020-04-27 12:28 ` Matthew Wilcox
  2020-04-28  6:09   ` 回复: " Ruan, Shiyang
  8 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2020-04-27 12:28 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-mm, linux-fsdevel,
	darrick.wong, dan.j.williams, david, hch, rgoldwyn, qi.fuli,
	y-goto

On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> This patchset is a try to resolve the shared 'page cache' problem for
> fsdax.
> 
> In order to track multiple mappings and indexes on one page, I
> introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> will be associated more than once if is shared.  At the second time we
> associate this entry, we create this rb-tree and store its root in
> page->private(not used in fsdax).  Insert (->mapping, ->index) when
> dax_associate_entry() and delete it when dax_disassociate_entry().

Do we really want to track all of this on a per-page basis?  I would
have thought a per-extent basis was more useful.  Essentially, create
a new address_space for each shared extent.  Per page just seems like
a huge overhead.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-04-27 12:28 ` [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Matthew Wilcox
@ 2020-04-28  6:09   ` Ruan, Shiyang
  2020-04-28  6:43     ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Ruan, Shiyang @ 2020-04-28  6:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-nvdimm@lists.01.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, darrick.wong@oracle.com,
	dan.j.williams@intel.com, david@fromorbit.com, hch@lst.de,
	rgoldwyn@suse.de, Qi, Fuli, Gotou, Yasunori


在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:

>On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>  This patchset is a try to resolve the shared 'page cache' problem for
>>  fsdax.
>>
>>  In order to track multiple mappings and indexes on one page, I
>>  introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
>>  will be associated more than once if is shared.  At the second time we
>>  associate this entry, we create this rb-tree and store its root in
>>  page->private(not used in fsdax).  Insert (->mapping, ->index) when
>>  dax_associate_entry() and delete it when dax_disassociate_entry().
>
>Do we really want to track all of this on a per-page basis?  I would
>have thought a per-extent basis was more useful.  Essentially, create
>a new address_space for each shared extent.  Per page just seems like
>a huge overhead.
>
Per-extent tracking is a nice idea for me.  I haven't thought of it 
yet...

But the extent info is maintained by filesystem.  I think we need a way 
to obtain this info from FS when associating a page.  May be a bit 
complicated.  Let me think about it...


--
Thanks,
Ruan Shiyang.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-04-28  6:09   ` 回复: " Ruan, Shiyang
@ 2020-04-28  6:43     ` Dave Chinner
  2020-04-28  9:32       ` Ruan Shiyang
  2020-06-04  7:37       ` Ruan Shiyang
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2020-04-28  6:43 UTC (permalink / raw)
  To: Ruan, Shiyang
  Cc: Matthew Wilcox, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	darrick.wong@oracle.com, dan.j.williams@intel.com, hch@lst.de,
	rgoldwyn@suse.de, Qi, Fuli, Gotou, Yasunori

On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> 
> 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
> 
> >On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> >>  This patchset is a try to resolve the shared 'page cache' problem for
> >>  fsdax.
> >>
> >>  In order to track multiple mappings and indexes on one page, I
> >>  introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> >>  will be associated more than once if is shared.  At the second time we
> >>  associate this entry, we create this rb-tree and store its root in
> >>  page->private(not used in fsdax).  Insert (->mapping, ->index) when
> >>  dax_associate_entry() and delete it when dax_disassociate_entry().
> >
> >Do we really want to track all of this on a per-page basis?  I would
> >have thought a per-extent basis was more useful.  Essentially, create
> >a new address_space for each shared extent.  Per page just seems like
> >a huge overhead.
> >
> Per-extent tracking is a nice idea for me.  I haven't thought of it 
> yet...
> 
> But the extent info is maintained by filesystem.  I think we need a way 
> to obtain this info from FS when associating a page.  May be a bit 
> complicated.  Let me think about it...

That's why I want the -user of this association- to do a filesystem
callout instead of keeping it's own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from it's own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.

At the moment, this "dax association" is used to "report" a storage
media error directly to userspace. I say "report" because what it
does is kill userspace processes dead. The storage media error
actually needs to be reported to the owner of the storage media,
which in the case of FS-DAX is the filesytem.

That way the filesystem can then look up all the owners of that bad
media range (i.e. the filesystem block it corresponds to) and take
appropriate action. e.g.

- if it falls in filesytem metadata, shutdown the filesystem
- if it falls in user data, call the "kill userspace dead" routines
  for each mapping/index tuple the filesystem finds for the given
  LBA address that the media error occurred.

Right now if the media error is in filesystem metadata, the
filesystem isn't even told about it. The filesystem can't even shut
down - the error is just dropped on the floor and it won't be until
the filesystem next tries to reference that metadata that we notice
there is an issue.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-04-28  6:43     ` Dave Chinner
@ 2020-04-28  9:32       ` Ruan Shiyang
  2020-04-28 11:16         ` Matthew Wilcox
  2020-06-04  7:37       ` Ruan Shiyang
  1 sibling, 1 reply; 22+ messages in thread
From: Ruan Shiyang @ 2020-04-28  9:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	darrick.wong@oracle.com, dan.j.williams@intel.com, hch@lst.de,
	rgoldwyn@suse.de, Qi, Fuli, Gotou, Yasunori



On 2020/4/28 下午2:43, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>
>> 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
>>
>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>>   This patchset is a try to resolve the shared 'page cache' problem for
>>>>   fsdax.
>>>>
>>>>   In order to track multiple mappings and indexes on one page, I
>>>>   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
>>>>   will be associated more than once if is shared.  At the second time we
>>>>   associate this entry, we create this rb-tree and store its root in
>>>>   page->private(not used in fsdax).  Insert (->mapping, ->index) when
>>>>   dax_associate_entry() and delete it when dax_disassociate_entry().
>>>
>>> Do we really want to track all of this on a per-page basis?  I would
>>> have thought a per-extent basis was more useful.  Essentially, create
>>> a new address_space for each shared extent.  Per page just seems like
>>> a huge overhead.
>>>
>> Per-extent tracking is a nice idea for me.  I haven't thought of it
>> yet...
>>
>> But the extent info is maintained by filesystem.  I think we need a way
>> to obtain this info from FS when associating a page.  May be a bit
>> complicated.  Let me think about it...
> 
> That's why I want the -user of this association- to do a filesystem
> callout instead of keeping it's own naive tracking infrastructure.
> The filesystem can do an efficient, on-demand reverse mapping lookup
> from it's own extent tracking infrastructure, and there's zero
> runtime overhead when there are no errors present.
> 
> At the moment, this "dax association" is used to "report" a storage
> media error directly to userspace. I say "report" because what it
> does is kill userspace processes dead. The storage media error
> actually needs to be reported to the owner of the storage media,
> which in the case of FS-DAX is the filesytem.

Understood.

BTW, this is the usage in memory-failure, so what about rmap?  I have 
not found how to use this tracking in rmap.  Do you have any ideas?

> 
> That way the filesystem can then look up all the owners of that bad
> media range (i.e. the filesystem block it corresponds to) and take
> appropriate action. e.g.

I tried writing a function to look up all the owners' info of one block 
in xfs for memory-failure use.  It was dropped in this patchset because 
I found out that this lookup function needs 'rmapbt' to be enabled when 
mkfs.  But by default, rmapbt is disabled.  I am not sure if it matters...

> 
> - if it falls in filesytem metadata, shutdown the filesystem
> - if it falls in user data, call the "kill userspace dead" routines
>    for each mapping/index tuple the filesystem finds for the given
>    LBA address that the media error occurred >
> Right now if the media error is in filesystem metadata, the
> filesystem isn't even told about it. The filesystem can't even shut
> down - the error is just dropped on the floor and it won't be until
> the filesystem next tries to reference that metadata that we notice
> there is an issue.

Understood.  Thanks.

> 
> Cheers,
> 
> Dave.
> 


--
Thanks,
Ruan Shiyang.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-04-28  9:32       ` Ruan Shiyang
@ 2020-04-28 11:16         ` Matthew Wilcox
  2020-04-28 11:24           ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2020-04-28 11:16 UTC (permalink / raw)
  To: Ruan Shiyang
  Cc: Dave Chinner, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	darrick.wong@oracle.com, dan.j.williams@intel.com, hch@lst.de,
	rgoldwyn@suse.de, Qi, Fuli, Gotou, Yasunori

On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> On 2020/4/28 下午2:43, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
> > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > >   This patchset is a try to resolve the shared 'page cache' problem for
> > > > >   fsdax.
> > > > > 
> > > > >   In order to track multiple mappings and indexes on one page, I
> > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> > > > >   will be associated more than once if is shared.  At the second time we
> > > > >   associate this entry, we create this rb-tree and store its root in
> > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) when
> > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > 
> > > > Do we really want to track all of this on a per-page basis?  I would
> > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > a new address_space for each shared extent.  Per page just seems like
> > > > a huge overhead.
> > > > 
> > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > yet...
> > > 
> > > But the extent info is maintained by filesystem.  I think we need a way
> > > to obtain this info from FS when associating a page.  May be a bit
> > > complicated.  Let me think about it...
> > 
> > That's why I want the -user of this association- to do a filesystem
> > callout instead of keeping it's own naive tracking infrastructure.
> > The filesystem can do an efficient, on-demand reverse mapping lookup
> > from it's own extent tracking infrastructure, and there's zero
> > runtime overhead when there are no errors present.
> > 
> > At the moment, this "dax association" is used to "report" a storage
> > media error directly to userspace. I say "report" because what it
> > does is kill userspace processes dead. The storage media error
> > actually needs to be reported to the owner of the storage media,
> > which in the case of FS-DAX is the filesytem.
> 
> Understood.
> 
> BTW, this is the usage in memory-failure, so what about rmap?  I have not
> found how to use this tracking in rmap.  Do you have any ideas?
> 
> > 
> > That way the filesystem can then look up all the owners of that bad
> > media range (i.e. the filesystem block it corresponds to) and take
> > appropriate action. e.g.
> 
> I tried writing a function to look up all the owners' info of one block in
> xfs for memory-failure use.  It was dropped in this patchset because I found
> out that this lookup function needs 'rmapbt' to be enabled when mkfs.  But
> by default, rmapbt is disabled.  I am not sure if it matters...

I'm pretty sure you can't have shared extents on an XFS filesystem if you
_don't_ have the rmapbt feature enabled.  I mean, that's why it exists.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-04-28 11:16         ` Matthew Wilcox
@ 2020-04-28 11:24           ` Dave Chinner
  2020-04-28 15:37             ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2020-04-28 11:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ruan Shiyang, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	darrick.wong@oracle.com, dan.j.williams@intel.com, hch@lst.de,
	rgoldwyn@suse.de, Qi, Fuli, Gotou, Yasunori

On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > On 2020/4/28 下午2:43, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
> > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > >   This patchset is a try to resolve the shared 'page cache' problem for
> > > > > >   fsdax.
> > > > > > 
> > > > > >   In order to track multiple mappings and indexes on one page, I
> > > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> > > > > >   will be associated more than once if is shared.  At the second time we
> > > > > >   associate this entry, we create this rb-tree and store its root in
> > > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) when
> > > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > 
> > > > > Do we really want to track all of this on a per-page basis?  I would
> > > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > > a new address_space for each shared extent.  Per page just seems like
> > > > > a huge overhead.
> > > > > 
> > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > yet...
> > > > 
> > > > But the extent info is maintained by filesystem.  I think we need a way
> > > > to obtain this info from FS when associating a page.  May be a bit
> > > > complicated.  Let me think about it...
> > > 
> > > That's why I want the -user of this association- to do a filesystem
> > > callout instead of keeping it's own naive tracking infrastructure.
> > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > from it's own extent tracking infrastructure, and there's zero
> > > runtime overhead when there are no errors present.
> > > 
> > > At the moment, this "dax association" is used to "report" a storage
> > > media error directly to userspace. I say "report" because what it
> > > does is kill userspace processes dead. The storage media error
> > > actually needs to be reported to the owner of the storage media,
> > > which in the case of FS-DAX is the filesytem.
> > 
> > Understood.
> > 
> > BTW, this is the usage in memory-failure, so what about rmap?  I have not
> > found how to use this tracking in rmap.  Do you have any ideas?
> > 
> > > 
> > > That way the filesystem can then look up all the owners of that bad
> > > media range (i.e. the filesystem block it corresponds to) and take
> > > appropriate action. e.g.
> > 
> > I tried writing a function to look up all the owners' info of one block in
> > xfs for memory-failure use.  It was dropped in this patchset because I found
> > out that this lookup function needs 'rmapbt' to be enabled when mkfs.  But
> > by default, rmapbt is disabled.  I am not sure if it matters...
> 
> I'm pretty sure you can't have shared extents on an XFS filesystem if you
> _don't_ have the rmapbt feature enabled.  I mean, that's why it exists.

You're confusing reflink with rmap. :)

rmapbt does all the reverse mapping tracking, reflink just does the
shared data extent tracking.

But given that anyone who wants to use DAX with reflink is going to
have to mkfs their filesystem anyway (to turn on reflink) requiring
that rmapbt is also turned on is not a big deal. Especially as we
can check it at mount time in the kernel...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-04-28 11:24           ` Dave Chinner
@ 2020-04-28 15:37             ` Darrick J. Wong
  2020-04-28 22:02               ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-28 15:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Ruan Shiyang, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	dan.j.williams@intel.com, hch@lst.de, rgoldwyn@suse.de, Qi, Fuli,
	Gotou, Yasunori

On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > > On 2020/4/28 下午2:43, Dave Chinner wrote:
> > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
> > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > > >   This patchset is a try to resolve the shared 'page cache' problem for
> > > > > > >   fsdax.
> > > > > > > 
> > > > > > >   In order to track multiple mappings and indexes on one page, I
> > > > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> > > > > > >   will be associated more than once if is shared.  At the second time we
> > > > > > >   associate this entry, we create this rb-tree and store its root in
> > > > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) when
> > > > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > > 
> > > > > > Do we really want to track all of this on a per-page basis?  I would
> > > > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > > > a new address_space for each shared extent.  Per page just seems like
> > > > > > a huge overhead.
> > > > > > 
> > > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > > yet...
> > > > > 
> > > > > But the extent info is maintained by filesystem.  I think we need a way
> > > > > to obtain this info from FS when associating a page.  May be a bit
> > > > > complicated.  Let me think about it...
> > > > 
> > > > That's why I want the -user of this association- to do a filesystem
> > > > callout instead of keeping it's own naive tracking infrastructure.
> > > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > > from it's own extent tracking infrastructure, and there's zero
> > > > runtime overhead when there are no errors present.
> > > > 
> > > > At the moment, this "dax association" is used to "report" a storage
> > > > media error directly to userspace. I say "report" because what it
> > > > does is kill userspace processes dead. The storage media error
> > > > actually needs to be reported to the owner of the storage media,
> > > > which in the case of FS-DAX is the filesytem.
> > > 
> > > Understood.
> > > 
> > > BTW, this is the usage in memory-failure, so what about rmap?  I have not
> > > found how to use this tracking in rmap.  Do you have any ideas?
> > > 
> > > > 
> > > > That way the filesystem can then look up all the owners of that bad
> > > > media range (i.e. the filesystem block it corresponds to) and take
> > > > appropriate action. e.g.
> > > 
> > > I tried writing a function to look up all the owners' info of one block in
> > > xfs for memory-failure use.  It was dropped in this patchset because I found
> > > out that this lookup function needs 'rmapbt' to be enabled when mkfs.  But
> > > by default, rmapbt is disabled.  I am not sure if it matters...
> > 
> > I'm pretty sure you can't have shared extents on an XFS filesystem if you
> > _don't_ have the rmapbt feature enabled.  I mean, that's why it exists.
> 
> You're confusing reflink with rmap. :)
> 
> rmapbt does all the reverse mapping tracking, reflink just does the
> shared data extent tracking.
> 
> But given that anyone who wants to use DAX with reflink is going to
> have to mkfs their filesystem anyway (to turn on reflink) requiring
> that rmapbt is also turned on is not a big deal. Especially as we
> can check it at mount time in the kernel...

Are we going to turn on rmap by default?  The last I checked, it did
have a 10-20% performance cost on extreme metadata-heavy workloads.
Or do we only enable it by default if mkfs detects a pmem device?

(Admittedly, most people do not run fsx as a productivity app; the
normal hit is usually 3-5% which might not be such a big deal since you
also get (half of) online fsck. :P)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-04-28 15:37             ` Darrick J. Wong
@ 2020-04-28 22:02               ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2020-04-28 22:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Ruan Shiyang, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	dan.j.williams@intel.com, hch@lst.de, rgoldwyn@suse.de, Qi, Fuli,
	Gotou, Yasunori

On Tue, Apr 28, 2020 at 08:37:32AM -0700, Darrick J. Wong wrote:
> On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > > > On 2020/4/28 下午2:43, Dave Chinner wrote:
> > > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
> > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > > > >   This patchset is a try to resolve the shared 'page cache' problem for
> > > > > > > >   fsdax.
> > > > > > > > 
> > > > > > > >   In order to track multiple mappings and indexes on one page, I
> > > > > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> > > > > > > >   will be associated more than once if is shared.  At the second time we
> > > > > > > >   associate this entry, we create this rb-tree and store its root in
> > > > > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) when
> > > > > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > > > 
> > > > > > > Do we really want to track all of this on a per-page basis?  I would
> > > > > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > > > > a new address_space for each shared extent.  Per page just seems like
> > > > > > > a huge overhead.
> > > > > > > 
> > > > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > > > yet...
> > > > > > 
> > > > > > But the extent info is maintained by filesystem.  I think we need a way
> > > > > > to obtain this info from FS when associating a page.  May be a bit
> > > > > > complicated.  Let me think about it...
> > > > > 
> > > > > That's why I want the -user of this association- to do a filesystem
> > > > > callout instead of keeping it's own naive tracking infrastructure.
> > > > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > > > from it's own extent tracking infrastructure, and there's zero
> > > > > runtime overhead when there are no errors present.
> > > > > 
> > > > > At the moment, this "dax association" is used to "report" a storage
> > > > > media error directly to userspace. I say "report" because what it
> > > > > does is kill userspace processes dead. The storage media error
> > > > > actually needs to be reported to the owner of the storage media,
> > > > > which in the case of FS-DAX is the filesytem.
> > > > 
> > > > Understood.
> > > > 
> > > > BTW, this is the usage in memory-failure, so what about rmap?  I have not
> > > > found how to use this tracking in rmap.  Do you have any ideas?
> > > > 
> > > > > 
> > > > > That way the filesystem can then look up all the owners of that bad
> > > > > media range (i.e. the filesystem block it corresponds to) and take
> > > > > appropriate action. e.g.
> > > > 
> > > > I tried writing a function to look up all the owners' info of one block in
> > > > xfs for memory-failure use.  It was dropped in this patchset because I found
> > > > out that this lookup function needs 'rmapbt' to be enabled when mkfs.  But
> > > > by default, rmapbt is disabled.  I am not sure if it matters...
> > > 
> > > I'm pretty sure you can't have shared extents on an XFS filesystem if you
> > > _don't_ have the rmapbt feature enabled.  I mean, that's why it exists.
> > 
> > You're confusing reflink with rmap. :)
> > 
> > rmapbt does all the reverse mapping tracking, reflink just does the
> > shared data extent tracking.
> > 
> > But given that anyone who wants to use DAX with reflink is going to
> > have to mkfs their filesystem anyway (to turn on reflink) requiring
> > that rmapbt is also turned on is not a big deal. Especially as we
> > can check it at mount time in the kernel...
> 
> Are we going to turn on rmap by default?  The last I checked, it did
> have a 10-20% performance cost on extreme metadata-heavy workloads.
> Or do we only enable it by default if mkfs detects a pmem device?

Just have the kernel refuse to mount a reflink enabled filesystem on
a DAX capable device unless -o dax=never or rmapbt is enabled.

That'll get the message across pretty quickly....

> (Admittedly, most people do not run fsx as a productivity app; the
> normal hit is usually 3-5% which might not be such a big deal since you
> also get (half of) online fsck. :P)

I have not noticed the overhead at all on any of my production
machines since I enabled it way on all of them way back when....

And, really, pmem is a _very poor choice_ for metadata intensive
applications on XFS as pmem is completely synchronous.  XFS has an
async IO model for it's metadata that *must* be buffered (so no
DAX!) and the synchronous nature of pmem completely defeats the
architectural IO pipelining XFS uses to allow thousands of
concurrent metadata IOs in flight. OTOH, pmem IO depth is limited to
the number of CPUs that are concurrently issuing IO, so it really,
really sucks compared to a handful of high end nvme SSDs on PCIe
4.0....

So with that in mind, I see little reason to care about the small
additional overhead of rmapbt on FS-DAX installations that require
reflink...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-04-28  6:43     ` Dave Chinner
  2020-04-28  9:32       ` Ruan Shiyang
@ 2020-06-04  7:37       ` Ruan Shiyang
  2020-06-04 14:51         ` Darrick J. Wong
  1 sibling, 1 reply; 22+ messages in thread
From: Ruan Shiyang @ 2020-06-04  7:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	darrick.wong@oracle.com, dan.j.williams@intel.com, hch@lst.de,
	rgoldwyn@suse.de, Qi, Fuli, Gotou, Yasunori



On 2020/4/28 下午2:43, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>
>> 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
>>
>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>>   This patchset is a try to resolve the shared 'page cache' problem for
>>>>   fsdax.
>>>>
>>>>   In order to track multiple mappings and indexes on one page, I
>>>>   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
>>>>   will be associated more than once if is shared.  At the second time we
>>>>   associate this entry, we create this rb-tree and store its root in
>>>>   page->private(not used in fsdax).  Insert (->mapping, ->index) when
>>>>   dax_associate_entry() and delete it when dax_disassociate_entry().
>>>
>>> Do we really want to track all of this on a per-page basis?  I would
>>> have thought a per-extent basis was more useful.  Essentially, create
>>> a new address_space for each shared extent.  Per page just seems like
>>> a huge overhead.
>>>
>> Per-extent tracking is a nice idea for me.  I haven't thought of it
>> yet...
>>
>> But the extent info is maintained by filesystem.  I think we need a way
>> to obtain this info from FS when associating a page.  May be a bit
>> complicated.  Let me think about it...
> 
> That's why I want the -user of this association- to do a filesystem
> callout instead of keeping it's own naive tracking infrastructure.
> The filesystem can do an efficient, on-demand reverse mapping lookup
> from it's own extent tracking infrastructure, and there's zero
> runtime overhead when there are no errors present.

Hi Dave,

I ran into some difficulties when trying to implement the per-extent 
rmap tracking.  So, I re-read your comments and found that I was 
misunderstanding what you described here.

I think what you mean is: we don't need the in-memory dax-rmap tracking 
now.  Just ask the FS for the owner's information that associate with 
one page when memory-failure.  So, the per-page (even per-extent) 
dax-rmap is needless in this case.  Is this right?

Based on this, we only need to store the extent information of a fsdax 
page in its ->mapping (by searching from FS).  Then obtain the owners of 
this page (also by searching from FS) when memory-failure or other rmap 
case occurs.

So, a fsdax page is no longer associated with a specific file, but with 
a FS(or the pmem device).  I think it's easier to understand and implement.


--
Thanks,
Ruan Shiyang.
> 
> At the moment, this "dax association" is used to "report" a storage
> media error directly to userspace. I say "report" because what it
> does is kill userspace processes dead. The storage media error
> actually needs to be reported to the owner of the storage media,
> which in the case of FS-DAX is the filesytem.
> 
> That way the filesystem can then look up all the owners of that bad
> media range (i.e. the filesystem block it corresponds to) and take
> appropriate action. e.g.
> 
> - if it falls in filesytem metadata, shutdown the filesystem
> - if it falls in user data, call the "kill userspace dead" routines
>    for each mapping/index tuple the filesystem finds for the given
>    LBA address that the media error occurred.
> 
> Right now if the media error is in filesystem metadata, the
> filesystem isn't even told about it. The filesystem can't even shut
> down - the error is just dropped on the floor and it won't be until
> the filesystem next tries to reference that metadata that we notice
> there is an issue.
> 
> Cheers,
> 
> Dave.
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-06-04  7:37       ` Ruan Shiyang
@ 2020-06-04 14:51         ` Darrick J. Wong
  2020-06-05  1:30           ` Dave Chinner
  2020-06-05  2:11           ` Ruan Shiyang
  0 siblings, 2 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-06-04 14:51 UTC (permalink / raw)
  To: Ruan Shiyang
  Cc: Dave Chinner, Matthew Wilcox, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	dan.j.williams@intel.com, hch@lst.de, rgoldwyn@suse.de, Qi, Fuli,
	Gotou, Yasunori

On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
> 
> 
> On 2020/4/28 下午2:43, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > 
> > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
> > > 
> > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > >   This patchset is a try to resolve the shared 'page cache' problem for
> > > > >   fsdax.
> > > > > 
> > > > >   In order to track multiple mappings and indexes on one page, I
> > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> > > > >   will be associated more than once if is shared.  At the second time we
> > > > >   associate this entry, we create this rb-tree and store its root in
> > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) when
> > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > 
> > > > Do we really want to track all of this on a per-page basis?  I would
> > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > a new address_space for each shared extent.  Per page just seems like
> > > > a huge overhead.
> > > > 
> > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > yet...
> > > 
> > > But the extent info is maintained by filesystem.  I think we need a way
> > > to obtain this info from FS when associating a page.  May be a bit
> > > complicated.  Let me think about it...
> > 
> > That's why I want the -user of this association- to do a filesystem
> > callout instead of keeping it's own naive tracking infrastructure.
> > The filesystem can do an efficient, on-demand reverse mapping lookup
> > from it's own extent tracking infrastructure, and there's zero
> > runtime overhead when there are no errors present.
> 
> Hi Dave,
> 
> I ran into some difficulties when trying to implement the per-extent rmap
> tracking.  So, I re-read your comments and found that I was misunderstanding
> what you described here.
> 
> I think what you mean is: we don't need the in-memory dax-rmap tracking now.
> Just ask the FS for the owner's information that associate with one page
> when memory-failure.  So, the per-page (even per-extent) dax-rmap is
> needless in this case.  Is this right?

Right.  XFS already has its own rmap tree.

> Based on this, we only need to store the extent information of a fsdax page
> in its ->mapping (by searching from FS).  Then obtain the owners of this
> page (also by searching from FS) when memory-failure or other rmap case
> occurs.

I don't even think you need that much.  All you need is the "physical"
offset of that page within the pmem device (e.g. 'this is the 307th 4k
page == offset 1257472 since the start of /dev/pmem0') and xfs can look
up the owner of that range of physical storage and deal with it as
needed.

> So, a fsdax page is no longer associated with a specific file, but with a
> FS(or the pmem device).  I think it's easier to understand and implement.

Yes.  I also suspect this will be necessary to support reflink...

--D

> 
> --
> Thanks,
> Ruan Shiyang.
> > 
> > At the moment, this "dax association" is used to "report" a storage
> > media error directly to userspace. I say "report" because what it
> > does is kill userspace processes dead. The storage media error
> > actually needs to be reported to the owner of the storage media,
> > which in the case of FS-DAX is the filesytem.
> > 
> > That way the filesystem can then look up all the owners of that bad
> > media range (i.e. the filesystem block it corresponds to) and take
> > appropriate action. e.g.
> > 
> > - if it falls in filesytem metadata, shutdown the filesystem
> > - if it falls in user data, call the "kill userspace dead" routines
> >    for each mapping/index tuple the filesystem finds for the given
> >    LBA address that the media error occurred.
> > 
> > Right now if the media error is in filesystem metadata, the
> > filesystem isn't even told about it. The filesystem can't even shut
> > down - the error is just dropped on the floor and it won't be until
> > the filesystem next tries to reference that metadata that we notice
> > there is an issue.
> > 
> > Cheers,
> > 
> > Dave.
> > 
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-06-04 14:51         ` Darrick J. Wong
@ 2020-06-05  1:30           ` Dave Chinner
  2020-06-05  2:30             ` Ruan Shiyang
  2020-06-05  2:11           ` Ruan Shiyang
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2020-06-05  1:30 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ruan Shiyang, Matthew Wilcox, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	dan.j.williams@intel.com, hch@lst.de, rgoldwyn@suse.de, Qi, Fuli,
	Gotou, Yasunori

On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote:
> On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
> > 
> > 
> > On 2020/4/28 下午2:43, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > > 
> > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
> > > > 
> > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > >   This patchset is a try to resolve the shared 'page cache' problem for
> > > > > >   fsdax.
> > > > > > 
> > > > > >   In order to track multiple mappings and indexes on one page, I
> > > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> > > > > >   will be associated more than once if is shared.  At the second time we
> > > > > >   associate this entry, we create this rb-tree and store its root in
> > > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) when
> > > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > 
> > > > > Do we really want to track all of this on a per-page basis?  I would
> > > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > > a new address_space for each shared extent.  Per page just seems like
> > > > > a huge overhead.
> > > > > 
> > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > yet...
> > > > 
> > > > But the extent info is maintained by filesystem.  I think we need a way
> > > > to obtain this info from FS when associating a page.  May be a bit
> > > > complicated.  Let me think about it...
> > > 
> > > That's why I want the -user of this association- to do a filesystem
> > > callout instead of keeping it's own naive tracking infrastructure.
> > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > from it's own extent tracking infrastructure, and there's zero
> > > runtime overhead when there are no errors present.
> > 
> > Hi Dave,
> > 
> > I ran into some difficulties when trying to implement the per-extent rmap
> > tracking.  So, I re-read your comments and found that I was misunderstanding
> > what you described here.
> > 
> > I think what you mean is: we don't need the in-memory dax-rmap tracking now.
> > Just ask the FS for the owner's information that associate with one page
> > when memory-failure.  So, the per-page (even per-extent) dax-rmap is
> > needless in this case.  Is this right?
> 
> Right.  XFS already has its own rmap tree.

*nod*

> > Based on this, we only need to store the extent information of a fsdax page
> > in its ->mapping (by searching from FS).  Then obtain the owners of this
> > page (also by searching from FS) when memory-failure or other rmap case
> > occurs.
> 
> I don't even think you need that much.  All you need is the "physical"
> offset of that page within the pmem device (e.g. 'this is the 307th 4k
> page == offset 1257472 since the start of /dev/pmem0') and xfs can look
> up the owner of that range of physical storage and deal with it as
> needed.

Right. If we have the dax device associated with the page that had
the failure, then we can determine the offset of the page into the
block device address space and that's all we need to find the owner
of the page in the filesystem.

Note that there may actually be no owner - the page that had the
fault might land in free space, in which case we can simply zero
the page and clear the error.

> > So, a fsdax page is no longer associated with a specific file, but with a
> > FS(or the pmem device).  I think it's easier to understand and implement.

Effectively, yes. But we shouldn't need to actually associate the
page with anything at the filesystem level because it is already
associated with a DAX device at a lower level via a dev_pagemap.
The hardware page fault already runs thought this code
memory_failure_dev_pagemap() before it gets to the DAX code, so
really all we need to is have that function pass us the page, offset
into the device and, say, the struct dax_device associated with that
page so we can get to the filesystem superblock we can then use for
rmap lookups on...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-06-04 14:51         ` Darrick J. Wong
  2020-06-05  1:30           ` Dave Chinner
@ 2020-06-05  2:11           ` Ruan Shiyang
  1 sibling, 0 replies; 22+ messages in thread
From: Ruan Shiyang @ 2020-06-05  2:11 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Matthew Wilcox, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	dan.j.williams@intel.com, hch@lst.de, rgoldwyn@suse.de, Qi, Fuli,
	Gotou, Yasunori



On 2020/6/4 下午10:51, Darrick J. Wong wrote:
> On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
>>
>>
>> On 2020/4/28 下午2:43, Dave Chinner wrote:
>>> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>>>
>>>> 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
>>>>
>>>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>>>>    This patchset is a try to resolve the shared 'page cache' problem for
>>>>>>    fsdax.
>>>>>>
>>>>>>    In order to track multiple mappings and indexes on one page, I
>>>>>>    introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
>>>>>>    will be associated more than once if is shared.  At the second time we
>>>>>>    associate this entry, we create this rb-tree and store its root in
>>>>>>    page->private(not used in fsdax).  Insert (->mapping, ->index) when
>>>>>>    dax_associate_entry() and delete it when dax_disassociate_entry().
>>>>>
>>>>> Do we really want to track all of this on a per-page basis?  I would
>>>>> have thought a per-extent basis was more useful.  Essentially, create
>>>>> a new address_space for each shared extent.  Per page just seems like
>>>>> a huge overhead.
>>>>>
>>>> Per-extent tracking is a nice idea for me.  I haven't thought of it
>>>> yet...
>>>>
>>>> But the extent info is maintained by filesystem.  I think we need a way
>>>> to obtain this info from FS when associating a page.  May be a bit
>>>> complicated.  Let me think about it...
>>>
>>> That's why I want the -user of this association- to do a filesystem
>>> callout instead of keeping it's own naive tracking infrastructure.
>>> The filesystem can do an efficient, on-demand reverse mapping lookup
>>> from it's own extent tracking infrastructure, and there's zero
>>> runtime overhead when there are no errors present.
>>
>> Hi Dave,
>>
>> I ran into some difficulties when trying to implement the per-extent rmap
>> tracking.  So, I re-read your comments and found that I was misunderstanding
>> what you described here.
>>
>> I think what you mean is: we don't need the in-memory dax-rmap tracking now.
>> Just ask the FS for the owner's information that associate with one page
>> when memory-failure.  So, the per-page (even per-extent) dax-rmap is
>> needless in this case.  Is this right?
> 
> Right.  XFS already has its own rmap tree.
> 
>> Based on this, we only need to store the extent information of a fsdax page
>> in its ->mapping (by searching from FS).  Then obtain the owners of this
>> page (also by searching from FS) when memory-failure or other rmap case
>> occurs.
> 
> I don't even think you need that much.  All you need is the "physical"
> offset of that page within the pmem device (e.g. 'this is the 307th 4k
> page == offset 1257472 since the start of /dev/pmem0') and xfs can look
> up the owner of that range of physical storage and deal with it as
> needed.

Yes, I think so.

> 
>> So, a fsdax page is no longer associated with a specific file, but with a
>> FS(or the pmem device).  I think it's easier to understand and implement.
> 
> Yes.  I also suspect this will be necessary to support reflink...
> 
> --D

OK, Thank you very much.


--
Thanks,
Ruan Shiyang.

> 
>>
>> --
>> Thanks,
>> Ruan Shiyang.
>>>
>>> At the moment, this "dax association" is used to "report" a storage
>>> media error directly to userspace. I say "report" because what it
>>> does is kill userspace processes dead. The storage media error
>>> actually needs to be reported to the owner of the storage media,
>>> which in the case of FS-DAX is the filesytem.
>>>
>>> That way the filesystem can then look up all the owners of that bad
>>> media range (i.e. the filesystem block it corresponds to) and take
>>> appropriate action. e.g.
>>>
>>> - if it falls in filesytem metadata, shutdown the filesystem
>>> - if it falls in user data, call the "kill userspace dead" routines
>>>     for each mapping/index tuple the filesystem finds for the given
>>>     LBA address that the media error occurred.
>>>
>>> Right now if the media error is in filesystem metadata, the
>>> filesystem isn't even told about it. The filesystem can't even shut
>>> down - the error is just dropped on the floor and it won't be until
>>> the filesystem next tries to reference that metadata that we notice
>>> there is an issue.
>>>
>>> Cheers,
>>>
>>> Dave.
>>>
>>
>>
> 
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
  2020-06-05  1:30           ` Dave Chinner
@ 2020-06-05  2:30             ` Ruan Shiyang
  0 siblings, 0 replies; 22+ messages in thread
From: Ruan Shiyang @ 2020-06-05  2:30 UTC (permalink / raw)
  To: Dave Chinner, Darrick J. Wong
  Cc: Matthew Wilcox, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	dan.j.williams@intel.com, hch@lst.de, rgoldwyn@suse.de, Qi, Fuli,
	Gotou, Yasunori



On 2020/6/5 上午9:30, Dave Chinner wrote:
> On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote:
>> On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
>>>
>>>
>>> On 2020/4/28 下午2:43, Dave Chinner wrote:
>>>> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>>>>
>>>>> 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道:
>>>>>
>>>>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>>>>>    This patchset is a try to resolve the shared 'page cache' problem for
>>>>>>>    fsdax.
>>>>>>>
>>>>>>>    In order to track multiple mappings and indexes on one page, I
>>>>>>>    introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
>>>>>>>    will be associated more than once if is shared.  At the second time we
>>>>>>>    associate this entry, we create this rb-tree and store its root in
>>>>>>>    page->private(not used in fsdax).  Insert (->mapping, ->index) when
>>>>>>>    dax_associate_entry() and delete it when dax_disassociate_entry().
>>>>>>
>>>>>> Do we really want to track all of this on a per-page basis?  I would
>>>>>> have thought a per-extent basis was more useful.  Essentially, create
>>>>>> a new address_space for each shared extent.  Per page just seems like
>>>>>> a huge overhead.
>>>>>>
>>>>> Per-extent tracking is a nice idea for me.  I haven't thought of it
>>>>> yet...
>>>>>
>>>>> But the extent info is maintained by filesystem.  I think we need a way
>>>>> to obtain this info from FS when associating a page.  May be a bit
>>>>> complicated.  Let me think about it...
>>>>
>>>> That's why I want the -user of this association- to do a filesystem
>>>> callout instead of keeping it's own naive tracking infrastructure.
>>>> The filesystem can do an efficient, on-demand reverse mapping lookup
>>>> from it's own extent tracking infrastructure, and there's zero
>>>> runtime overhead when there are no errors present.
>>>
>>> Hi Dave,
>>>
>>> I ran into some difficulties when trying to implement the per-extent rmap
>>> tracking.  So, I re-read your comments and found that I was misunderstanding
>>> what you described here.
>>>
>>> I think what you mean is: we don't need the in-memory dax-rmap tracking now.
>>> Just ask the FS for the owner's information that associate with one page
>>> when memory-failure.  So, the per-page (even per-extent) dax-rmap is
>>> needless in this case.  Is this right?
>>
>> Right.  XFS already has its own rmap tree.
> 
> *nod*
> 
>>> Based on this, we only need to store the extent information of a fsdax page
>>> in its ->mapping (by searching from FS).  Then obtain the owners of this
>>> page (also by searching from FS) when memory-failure or other rmap case
>>> occurs.
>>
>> I don't even think you need that much.  All you need is the "physical"
>> offset of that page within the pmem device (e.g. 'this is the 307th 4k
>> page == offset 1257472 since the start of /dev/pmem0') and xfs can look
>> up the owner of that range of physical storage and deal with it as
>> needed.
> 
> Right. If we have the dax device associated with the page that had
> the failure, then we can determine the offset of the page into the
> block device address space and that's all we need to find the owner
> of the page in the filesystem.
> 
> Note that there may actually be no owner - the page that had the
> fault might land in free space, in which case we can simply zero
> the page and clear the error.

OK.  Thanks for pointing out.

> 
>>> So, a fsdax page is no longer associated with a specific file, but with a
>>> FS(or the pmem device).  I think it's easier to understand and implement.
> 
> Effectively, yes. But we shouldn't need to actually associate the
> page with anything at the filesystem level because it is already
> associated with a DAX device at a lower level via a dev_pagemap.
> The hardware page fault already runs thought this code
> memory_failure_dev_pagemap() before it gets to the DAX code, so
> really all we need to is have that function pass us the page, offset
> into the device and, say, the struct dax_device associated with that
> page so we can get to the filesystem superblock we can then use for
> rmap lookups on...
> 

OK.  I was just thinking about how can I execute the FS rmap search from 
the memory-failure.  Thanks again for pointing out. :)


--
Thanks,
Ruan Shiyang.

> Cheers,
> 
> Dave.
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2020-06-05  2:31 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-04-27  8:47 [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Shiyang Ruan
2020-04-27  8:47 ` [RFC PATCH 1/8] fs/dax: Introduce dax-rmap btree for reflink Shiyang Ruan
2020-04-27  8:47 ` [RFC PATCH 2/8] mm: add dax-rmap for memory-failure and rmap Shiyang Ruan
2020-04-27  8:47 ` [RFC PATCH 3/8] fs/dax: Introduce dax_copy_edges() for COW Shiyang Ruan
2020-04-27  8:47 ` [RFC PATCH 4/8] fs/dax: copy data before write Shiyang Ruan
2020-04-27  8:47 ` [RFC PATCH 5/8] fs/dax: replace mmap entry in case of CoW Shiyang Ruan
2020-04-27  8:47 ` [RFC PATCH 6/8] fs/dax: dedup file range to use a compare function Shiyang Ruan
2020-04-27  8:47 ` [RFC PATCH 7/8] fs/xfs: handle CoW for fsdax write() path Shiyang Ruan
2020-04-27  8:47 ` [RFC PATCH 8/8] fs/xfs: support dedupe for fsdax Shiyang Ruan
2020-04-27 12:28 ` [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink Matthew Wilcox
2020-04-28  6:09   ` 回复: " Ruan, Shiyang
2020-04-28  6:43     ` Dave Chinner
2020-04-28  9:32       ` Ruan Shiyang
2020-04-28 11:16         ` Matthew Wilcox
2020-04-28 11:24           ` Dave Chinner
2020-04-28 15:37             ` Darrick J. Wong
2020-04-28 22:02               ` Dave Chinner
2020-06-04  7:37       ` Ruan Shiyang
2020-06-04 14:51         ` Darrick J. Wong
2020-06-05  1:30           ` Dave Chinner
2020-06-05  2:30             ` Ruan Shiyang
2020-06-05  2:11           ` Ruan Shiyang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).