linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
@ 2023-06-29  8:16 Shiyang Ruan
  2023-06-29  8:16 ` [PATCH v12 1/2] xfs: fix the calculation for "end" and "length" Shiyang Ruan
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-06-29  8:16 UTC (permalink / raw)
  To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof

This patchset adds graceful unbind support for pmem.
Patch1 corrects the calculation of the length and end of a given range.
Patch2 introduces a new flag called MF_MEM_PRE_REMOVE, to let the dax
holder know that it is a remove event.  With the help of the
notify_failure mechanism, we are able to shut down the filesystem on the
pmem gracefully.

Changes since v11:
 Patch1:
  1. correct the count calculation in xfs_failure_pgcnt().
      (the fix in v11 was wrong)
 Patch2:
  1. use the new exclusive freeze_super/thaw_super API, to make sure the
      unbind process won't be disturbed by any other freezer.

Shiyang Ruan (2):
  xfs: fix the calculation for "end" and "length"
  mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

 drivers/dax/super.c         |  3 +-
 fs/xfs/xfs_notify_failure.c | 95 +++++++++++++++++++++++++++++++++----
 include/linux/mm.h          |  1 +
 mm/memory-failure.c         | 17 +++++--
 4 files changed, 101 insertions(+), 15 deletions(-)

-- 
2.40.1



* [PATCH v12 1/2] xfs: fix the calculation for "end" and "length"
  2023-06-29  8:16 [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2023-06-29  8:16 ` Shiyang Ruan
  2023-06-29  8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
  2024-01-11 22:24 ` [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE " Bill O'Donnell
  2 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-06-29  8:16 UTC (permalink / raw)
  To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof

The value of "end" should be "start + length - 1".

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_notify_failure.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index c4078d0ec108..4a9bbd3fe120 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -114,7 +114,8 @@ xfs_dax_notify_ddev_failure(
 	int			error = 0;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
 	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
-	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
+	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
+							     daddr + bblen - 1);
 	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
 	error = xfs_trans_alloc_empty(mp, &tp);
@@ -210,7 +211,7 @@ xfs_dax_notify_failure(
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
 	/* Ignore the range out of filesystem area */
-	if (offset + len < ddev_start)
+	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
 	if (offset > ddev_end)
 		return -ENXIO;
@@ -222,8 +223,8 @@ xfs_dax_notify_failure(
 		len -= ddev_start - offset;
 		offset = 0;
 	}
-	if (offset + len > ddev_end)
-		len -= ddev_end - offset;
+	if (offset + len - 1 > ddev_end)
+		len = ddev_end - offset + 1;
 
 	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
 			mf_flags);
-- 
2.40.1
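
The off-by-one fix above comes down to treating a range as inclusive at
both ends: the last byte of [offset, offset + len) is offset + len - 1.
Below is a minimal userspace sketch of that arithmetic. It is not the
kernel code: clamp_to_device() is a hypothetical helper, it clamps in
absolute coordinates before converting to a device-relative offset, and
it assumes offset + len - 1 does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): clamp the byte range
 * [*offset, *offset + *len - 1] to the device range
 * [dev_start, dev_end], with both ends inclusive.
 *
 * On success, returns 0 and rewrites *offset as an offset relative to
 * dev_start and *len as the clamped length.  Returns -1 if the range
 * is empty or does not overlap the device at all.
 */
static int clamp_to_device(uint64_t dev_start, uint64_t dev_end,
			   uint64_t *offset, uint64_t *len)
{
	uint64_t off = *offset;
	uint64_t n = *len;

	if (n == 0)
		return -1;

	/* The last byte of the range is off + n - 1, not off + n. */
	if (off + n - 1 < dev_start || off > dev_end)
		return -1;

	/* Trim the part that lies before the device. */
	if (off < dev_start) {
		n -= dev_start - off;
		off = dev_start;
	}

	/* Trim the part that lies after the device. */
	if (off + n - 1 > dev_end)
		n = dev_end - off + 1;

	*offset = off - dev_start;
	*len = n;
	return 0;
}
```

With a device covering bytes 4096..8191, a range of 4096 bytes starting
at 0 ends at byte 4095 and is rejected, while a range of 4097 bytes
overlaps the device by exactly one byte.  Using "offset + len" as the
end would wrongly accept the first case.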



* [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-06-29  8:16 [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
  2023-06-29  8:16 ` [PATCH v12 1/2] xfs: fix the calculation for "end" and "length" Shiyang Ruan
@ 2023-06-29  8:16 ` Shiyang Ruan
  2023-06-29 12:02   ` kernel test robot
                     ` (5 more replies)
  2024-01-11 22:24 ` [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE " Bill O'Donnell
  2 siblings, 6 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-06-29  8:16 UTC (permalink / raw)
  To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanisms, the pmem driver is able to ask the
filesystem on it to unmap all files in use and to notify the processes
that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()
      `-> freeze_super()             // freeze (kernel call)
      `-> do xfs rmap
      ` -> mf_dax_kill_procs()
      `  -> collect_procs_fsdax()    // all associated processes
      `  -> unmap_and_kill()
      ` -> invalidate_inode_pages2_range() // drop file's cache
      `-> thaw_super()               // thaw (both kernel & user call)

Introduce MF_MEM_PRE_REMOVE to let the filesystem know that this is a
remove event.  Use the exclusive freeze/thaw[2] to lock the filesystem and
prevent new dax mappings from being created.  Do not shut down the
filesystem directly if the configuration is not supported, or if the
failure range includes a metadata area.  Make sure all files and
processes (not only the current process) are handled correctly.  Also drop
the cache of associated files before the pmem is removed.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c         |  3 +-
 fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
 include/linux/mm.h          |  1 +
 mm/memory-failure.c         | 17 ++++++--
 4 files changed, 96 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..2e1a35e82fce 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 4a9bbd3fe120..f6ec56b76db6 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
 	struct xfs_mount		*mp = cur->bc_mp;
 	struct xfs_inode		*ip;
 	struct xfs_failure_info		*notify = data;
+	struct address_space		*mapping;
+	pgoff_t				pgoff;
+	unsigned long			pgcnt;
 	int				error = 0;
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* Continue the query because this isn't a failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -92,14 +99,55 @@ xfs_dax_failure_fn(
 		return 0;
 	}
 
-	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
-				  xfs_failure_pgoff(mp, rec, notify),
-				  xfs_failure_pgcnt(mp, rec, notify),
-				  notify->mf_flags);
+	mapping = VFS_I(ip)->i_mapping;
+	pgoff = xfs_failure_pgoff(mp, rec, notify);
+	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+	/* Continue the rmap query if the inode isn't a dax file. */
+	if (dax_mapping(mapping))
+		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+					  notify->mf_flags);
+
+	/* Invalidate the cache in dax pages. */
+	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+		invalidate_inode_pages2_range(mapping, pgoff,
+					      pgoff + pgcnt - 1);
+
 	xfs_irele(ip);
 	return error;
 }
 
+static void
+xfs_dax_notify_failure_freeze(
+	struct xfs_mount	*mp)
+{
+	struct super_block 	*sb = mp->m_super;
+
+	/* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
+	while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
+		// Shall we just wait, or print warning then return -EBUSY?
+		delay(HZ / 10);
+	}
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+	struct xfs_mount	*mp)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+	if (error)
+		xfs_emerg(mp, "still frozen after notify failure, err=%d",
+			  error);
+	/*
+	 * Also thaw userspace call anyway because the device is about to be
+	 * removed immediately.
+	 */
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
 static int
 xfs_dax_notify_ddev_failure(
 	struct xfs_mount	*mp,
@@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
 
 	error = xfs_trans_alloc_empty(mp, &tp);
 	if (error)
-		return error;
+		goto out;
 
 	for (; agno <= end_agno; agno++) {
 		struct xfs_rmap_irec	ri_low = { };
@@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
 	}
 
 	xfs_trans_cancel(tp);
+
+	/*
+	 * Determine how to shutdown the filesystem according to the
+	 * error code and flags.
+	 */
 	if (error || notify.want_shutdown) {
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
-	}
+	} else if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
+out:
+	/* Thaw the fs if it was frozen before. */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_dax_notify_failure_thaw(mp);
+
 	return error;
 }
 
@@ -197,6 +257,8 @@ xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -210,6 +272,12 @@ xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device. */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
@@ -226,6 +294,12 @@ xfs_dax_notify_failure(
 	if (offset + len - 1 > ddev_end)
 		len = ddev_end - offset + 1;
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "device is about to be removed!");
+		/* Freeze fs to prevent new mappings from being created. */
+		xfs_dax_notify_failure_freeze(mp);
+	}
+
 	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
 			mf_flags);
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..a80c255b88d2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3576,6 +3576,7 @@ enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5b663eca1f29..483b75f2fcfb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
  */
 static void collect_procs_fsdax(struct page *page,
 		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+		struct list_head *to_kill, bool pre_remove)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		struct task_struct *t = task_early_kill(tsk, true);
+		struct task_struct *t = tsk;
 
+		/*
+		 * Search for all tasks while MF_MEM_PRE_REMOVE, because the
+		 * current may not be the one accessing the fsdax page.
+		 * Otherwise, search for the current task.
+		 */
+		if (!pre_remove)
+			t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 	dax_entry_t cookie;
 	struct page *page;
 	size_t end = index + count;
+	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
 
 	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
 
@@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		if (!page)
 			goto unlock;
 
-		SetPageHWPoison(page);
+		if (!pre_remove)
+			SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock:
-- 
2.40.1
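
The way MF_MEM_PRE_REMOVE changes the shutdown decision in the patch
above can be modelled in userspace as follows.  This is only a sketch:
pick_shutdown() and metadata_wants_shutdown() are hypothetical names,
and the enum shutdown_kind values are stand-ins for
SHUTDOWN_CORRUPT_ONDISK and SHUTDOWN_FORCE_UMOUNT.  Only the flag value
1 << 7 is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* The patch defines MF_MEM_PRE_REMOVE as bit 7 of enum mf_flags. */
enum { MF_MEM_PRE_REMOVE = 1 << 7 };

/* Stand-ins for the XFS shutdown reasons used in the patch. */
enum shutdown_kind {
	SHUTDOWN_NONE,		/* leave the filesystem running */
	SHUTDOWN_CORRUPT,	/* models SHUTDOWN_CORRUPT_ONDISK */
	SHUTDOWN_FORCE,		/* models SHUTDOWN_FORCE_UMOUNT */
};

/*
 * Models the rmap callback's handling of a metadata record: a real
 * memory failure in metadata requests a shutdown, while a remove
 * event just continues the query.
 */
static bool metadata_wants_shutdown(int mf_flags)
{
	return !(mf_flags & MF_MEM_PRE_REMOVE);
}

/*
 * Models the decision after the rmap walk: errors and corrupted
 * metadata take priority, then a remove event forces a shutdown.
 */
static enum shutdown_kind
pick_shutdown(int error, bool want_shutdown, int mf_flags)
{
	if (error || want_shutdown)
		return SHUTDOWN_CORRUPT;
	if (mf_flags & MF_MEM_PRE_REMOVE)
		return SHUTDOWN_FORCE;
	return SHUTDOWN_NONE;
}
```

In this model a metadata hit without the flag ends in the "corrupt"
shutdown, the same hit during a remove event ends in the forced
shutdown, and an error during the walk is reported as corruption in
both cases.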



* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-06-29  8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2023-06-29 12:02   ` kernel test robot
  2023-07-14  9:07   ` Shiyang Ruan
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: kernel test robot @ 2023-06-29 12:02 UTC (permalink / raw)
  To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: oe-kbuild-all, dan.j.williams, willy, jack, akpm, djwong, mcgrof

Hi Shiyang,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Shiyang-Ruan/xfs-fix-the-calculation-for-end-and-length/20230629-161913
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230629081651.253626-3-ruansy.fnst%40fujitsu.com
patch subject: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20230629/202306291954.zqVvCUZ5-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce: (https://download.01.org/0day-ci/archive/20230629/202306291954.zqVvCUZ5-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306291954.zqVvCUZ5-lkp@intel.com/

All errors (new ones prefixed by >>):

   fs/xfs/xfs_notify_failure.c: In function 'xfs_dax_notify_failure_freeze':
>> fs/xfs/xfs_notify_failure.c:127:33: error: 'FREEZE_HOLDER_KERNEL' undeclared (first use in this function)
     127 |         while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
         |                                 ^~~~~~~~~~~~~~~~~~~~
   fs/xfs/xfs_notify_failure.c:127:33: note: each undeclared identifier is reported only once for each function it appears in
>> fs/xfs/xfs_notify_failure.c:127:16: error: too many arguments to function 'freeze_super'
     127 |         while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
         |                ^~~~~~~~~~~~
   In file included from include/linux/huge_mm.h:8,
                    from include/linux/mm.h:988,
                    from fs/xfs/kmem.h:11,
                    from fs/xfs/xfs_linux.h:24,
                    from fs/xfs/xfs.h:22,
                    from fs/xfs/xfs_notify_failure.c:6:
   include/linux/fs.h:2289:12: note: declared here
    2289 | extern int freeze_super(struct super_block *super);
         |            ^~~~~~~~~~~~
   fs/xfs/xfs_notify_failure.c: In function 'xfs_dax_notify_failure_thaw':
   fs/xfs/xfs_notify_failure.c:140:32: error: 'FREEZE_HOLDER_KERNEL' undeclared (first use in this function)
     140 |         error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
         |                                ^~~~~~~~~~~~~~~~~~~~
>> fs/xfs/xfs_notify_failure.c:140:17: error: too many arguments to function 'thaw_super'
     140 |         error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
         |                 ^~~~~~~~~~
   include/linux/fs.h:2290:12: note: declared here
    2290 | extern int thaw_super(struct super_block *super);
         |            ^~~~~~~~~~
>> fs/xfs/xfs_notify_failure.c:148:24: error: 'FREEZE_HOLDER_USERSPACE' undeclared (first use in this function)
     148 |         thaw_super(sb, FREEZE_HOLDER_USERSPACE);
         |                        ^~~~~~~~~~~~~~~~~~~~~~~
   fs/xfs/xfs_notify_failure.c:148:9: error: too many arguments to function 'thaw_super'
     148 |         thaw_super(sb, FREEZE_HOLDER_USERSPACE);
         |         ^~~~~~~~~~
   include/linux/fs.h:2290:12: note: declared here
    2290 | extern int thaw_super(struct super_block *super);
         |            ^~~~~~~~~~


vim +/FREEZE_HOLDER_KERNEL +127 fs/xfs/xfs_notify_failure.c

   119	
   120	static void
   121	xfs_dax_notify_failure_freeze(
   122		struct xfs_mount	*mp)
   123	{
   124		struct super_block 	*sb = mp->m_super;
   125	
   126		/* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
 > 127		while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
   128			// Shall we just wait, or print warning then return -EBUSY?
   129			delay(HZ / 10);
   130		}
   131	}
   132	
   133	static void
   134	xfs_dax_notify_failure_thaw(
   135		struct xfs_mount	*mp)
   136	{
   137		struct super_block	*sb = mp->m_super;
   138		int			error;
   139	
 > 140		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
   141		if (error)
   142			xfs_emerg(mp, "still frozen after notify failure, err=%d",
   143				  error);
   144		/*
   145		 * Also thaw userspace call anyway because the device is about to be
   146		 * removed immediately.
   147		 */
 > 148		thaw_super(sb, FREEZE_HOLDER_USERSPACE);
   149	}
   150	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-06-29  8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
  2023-06-29 12:02   ` kernel test robot
@ 2023-07-14  9:07   ` Shiyang Ruan
  2023-07-14 14:18     ` Darrick J. Wong
  2023-07-29 15:15   ` Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-07-14  9:07 UTC (permalink / raw)
  To: djwong
  Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
	jack, akpm, mcgrof

Hi Darrick,

Thanks for applying the 1st patch.

Now, since this patch is based on the new freeze_super()/thaw_super()
API[1], I'd like to ask what the plan is for this API.  It seems to have
missed v6.5-rc1.

[1] https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/


--
Thanks,
Ruan.


On 2023/6/29 16:16, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of the dax_holder and
> ->notify_failure() mechanisms, the pmem driver is able to ask the
> filesystem on it to unmap all files in use and to notify the processes
> that are using those files.
> 
> Call trace:
> trigger unbind
>   -> unbind_store()
>    -> ... (skip)
>     -> devres_release_all()
>      -> kill_dax()
>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>        -> xfs_dax_notify_failure()
>        `-> freeze_super()             // freeze (kernel call)
>        `-> do xfs rmap
>        ` -> mf_dax_kill_procs()
>        `  -> collect_procs_fsdax()    // all associated processes
>        `  -> unmap_and_kill()
>        ` -> invalidate_inode_pages2_range() // drop file's cache
>        `-> thaw_super()               // thaw (both kernel & user call)
> 
> Introduce MF_MEM_PRE_REMOVE to let the filesystem know that this is a
> remove event.  Use the exclusive freeze/thaw[2] to lock the filesystem and
> prevent new dax mappings from being created.  Do not shut down the
> filesystem directly if the configuration is not supported, or if the
> failure range includes a metadata area.  Make sure all files and
> processes (not only the current process) are handled correctly.  Also drop
> the cache of associated files before the pmem is removed.
> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>   drivers/dax/super.c         |  3 +-
>   fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>   include/linux/mm.h          |  1 +
>   mm/memory-failure.c         | 17 ++++++--
>   4 files changed, 96 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>   		return;
>   
>   	if (dax_dev->holder_data != NULL)
> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> +				MF_MEM_PRE_REMOVE);
>   
>   	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>   	synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..f6ec56b76db6 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>   
>   #include <linux/mm.h>
>   #include <linux/dax.h>
> +#include <linux/fs.h>
>   
>   struct xfs_failure_info {
>   	xfs_agblock_t		startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>   	struct xfs_mount		*mp = cur->bc_mp;
>   	struct xfs_inode		*ip;
>   	struct xfs_failure_info		*notify = data;
> +	struct address_space		*mapping;
> +	pgoff_t				pgoff;
> +	unsigned long			pgcnt;
>   	int				error = 0;
>   
>   	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> +		/* Continue the query because this isn't a failure. */
> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>   		notify->want_shutdown = true;
>   		return 0;
>   	}
> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>   		return 0;
>   	}
>   
> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> -				  xfs_failure_pgoff(mp, rec, notify),
> -				  xfs_failure_pgcnt(mp, rec, notify),
> -				  notify->mf_flags);
> +	mapping = VFS_I(ip)->i_mapping;
> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> +	/* Continue the rmap query if the inode isn't a dax file. */
> +	if (dax_mapping(mapping))
> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> +					  notify->mf_flags);
> +
> +	/* Invalidate the cache in dax pages. */
> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +		invalidate_inode_pages2_range(mapping, pgoff,
> +					      pgoff + pgcnt - 1);
> +
>   	xfs_irele(ip);
>   	return error;
>   }
>   
> +static void
> +xfs_dax_notify_failure_freeze(
> +	struct xfs_mount	*mp)
> +{
> +	struct super_block 	*sb = mp->m_super;
> +
> +	/* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> +	while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> +		// Shall we just wait, or print warning then return -EBUSY?
> +		delay(HZ / 10);
> +	}
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> +	struct xfs_mount	*mp)
> +{
> +	struct super_block	*sb = mp->m_super;
> +	int			error;
> +
> +	error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> +	if (error)
> +		xfs_emerg(mp, "still frozen after notify failure, err=%d",
> +			  error);
> +	/*
> +	 * Also thaw userspace call anyway because the device is about to be
> +	 * removed immediately.
> +	 */
> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +}
> +
>   static int
>   xfs_dax_notify_ddev_failure(
>   	struct xfs_mount	*mp,
> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>   
>   	error = xfs_trans_alloc_empty(mp, &tp);
>   	if (error)
> -		return error;
> +		goto out;
>   
>   	for (; agno <= end_agno; agno++) {
>   		struct xfs_rmap_irec	ri_low = { };
> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>   	}
>   
>   	xfs_trans_cancel(tp);
> +
> +	/*
> +	 * Determine how to shutdown the filesystem according to the
> +	 * error code and flags.
> +	 */
>   	if (error || notify.want_shutdown) {
>   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   		if (!error)
>   			error = -EFSCORRUPTED;
> -	}
> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> +	/* Thaw the fs if it was frozen before. */
> +	if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_dax_notify_failure_thaw(mp);
> +
>   	return error;
>   }
>   
> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>   
>   	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>   	    mp->m_logdev_targp != mp->m_ddev_targp) {
> +		if (mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>   		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   		return -EFSCORRUPTED;
> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>   	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>   	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>   
> +	/* Notify failure on the whole device. */
> +	if (offset == 0 && len == U64_MAX) {
> +		offset = ddev_start;
> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> +	}
> +
>   	/* Ignore the range out of filesystem area */
>   	if (offset + len - 1 < ddev_start)
>   		return -ENXIO;
> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>   	if (offset + len - 1 > ddev_end)
>   		len = ddev_end - offset + 1;
>   
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "device is about to be removed!");
> +		/* Freeze fs to prevent new mappings from being created. */
> +		xfs_dax_notify_failure_freeze(mp);
> +	}
> +
>   	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>   			mf_flags);
>   }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..a80c255b88d2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3576,6 +3576,7 @@ enum mf_flags {
>   	MF_UNPOISON = 1 << 4,
>   	MF_SW_SIMULATED = 1 << 5,
>   	MF_NO_RETRY = 1 << 6,
> +	MF_MEM_PRE_REMOVE = 1 << 7,
>   };
>   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>   		      unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5b663eca1f29..483b75f2fcfb 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>    */
>   static void collect_procs_fsdax(struct page *page,
>   		struct address_space *mapping, pgoff_t pgoff,
> -		struct list_head *to_kill)
> +		struct list_head *to_kill, bool pre_remove)
>   {
>   	struct vm_area_struct *vma;
>   	struct task_struct *tsk;
> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>   	i_mmap_lock_read(mapping);
>   	read_lock(&tasklist_lock);
>   	for_each_process(tsk) {
> -		struct task_struct *t = task_early_kill(tsk, true);
> +		struct task_struct *t = tsk;
>   
> +		/*
> +		 * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> +		 * current may not be the one accessing the fsdax page.
> +		 * Otherwise, search for the current task.
> +		 */
> +		if (!pre_remove)
> +			t = task_early_kill(tsk, true);
>   		if (!t)
>   			continue;
>   		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>   	dax_entry_t cookie;
>   	struct page *page;
>   	size_t end = index + count;
> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>   
>   	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>   
> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>   		if (!page)
>   			goto unlock;
>   
> -		SetPageHWPoison(page);
> +		if (!pre_remove)
> +			SetPageHWPoison(page);
>   
> -		collect_procs_fsdax(page, mapping, index, &to_kill);
> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>   		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>   				index, mf_flags);
>   unlock:


* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-07-14  9:07   ` Shiyang Ruan
@ 2023-07-14 14:18     ` Darrick J. Wong
  2023-07-20  1:50       ` Shiyang Ruan
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-07-14 14:18 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
	jack, akpm, mcgrof

On Fri, Jul 14, 2023 at 05:07:58PM +0800, Shiyang Ruan wrote:
> Hi Darrick,
> 
> Thanks for applying the 1st patch.
> 
> Now, since this patch is based on the new freeze_super()/thaw_super()
> api[1], I'd like to ask what's the plan for this api?  It seems to have
> missed the v6.5-rc1.
> 
> [1] https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/

6.6.  I intend to push the XFS UBSAN fixes to the list today for review.
Early next week I'll resend the 6.5 rebase of the kernelfreeze series
and push it to vfs-for-next.  Some time after that will come large folio
writes.

--D

> 
> --
> Thanks,
> Ruan.
> 
> 
> On 2023/6/29 16:16, Shiyang Ruan wrote:
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1].  With the help of the dax_holder and
> > ->notify_failure() mechanisms, the pmem driver is able to ask the
> > filesystem on it to unmap all files in use and to notify the processes
> > that are using those files.
> > 
> > Call trace:
> > trigger unbind
> >   -> unbind_store()
> >    -> ... (skip)
> >     -> devres_release_all()
> >      -> kill_dax()
> >       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> >        -> xfs_dax_notify_failure()
> >        `-> freeze_super()             // freeze (kernel call)
> >        `-> do xfs rmap
> >        ` -> mf_dax_kill_procs()
> >        `  -> collect_procs_fsdax()    // all associated processes
> >        `  -> unmap_and_kill()
> >        ` -> invalidate_inode_pages2_range() // drop file's cache
> >        `-> thaw_super()               // thaw (both kernel & user call)
> > 
> > Introduce MF_MEM_PRE_REMOVE to let the filesystem know that this is a
> > remove event.  Use the exclusive freeze/thaw[2] to lock the filesystem and
> > prevent new dax mappings from being created.  Do not shut down the
> > filesystem directly if the configuration is not supported, or if the
> > failure range includes a metadata area.  Make sure all files and
> > processes (not only the current process) are handled correctly.  Also drop
> > the cache of associated files before the pmem is removed.
> > 
> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> > 
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> >   drivers/dax/super.c         |  3 +-
> >   fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
> >   include/linux/mm.h          |  1 +
> >   mm/memory-failure.c         | 17 ++++++--
> >   4 files changed, 96 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index c4c4728a36e4..2e1a35e82fce 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> >   		return;
> >   	if (dax_dev->holder_data != NULL)
> > -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > +				MF_MEM_PRE_REMOVE);
> >   	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> >   	synchronize_srcu(&dax_srcu);
> > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > index 4a9bbd3fe120..f6ec56b76db6 100644
> > --- a/fs/xfs/xfs_notify_failure.c
> > +++ b/fs/xfs/xfs_notify_failure.c
> > @@ -22,6 +22,7 @@
> >   #include <linux/mm.h>
> >   #include <linux/dax.h>
> > +#include <linux/fs.h>
> >   struct xfs_failure_info {
> >   	xfs_agblock_t		startblock;
> > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> >   	struct xfs_mount		*mp = cur->bc_mp;
> >   	struct xfs_inode		*ip;
> >   	struct xfs_failure_info		*notify = data;
> > +	struct address_space		*mapping;
> > +	pgoff_t				pgoff;
> > +	unsigned long			pgcnt;
> >   	int				error = 0;
> >   	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> >   	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > +		/* Continue the query because this isn't a failure. */
> > +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > +			return 0;
> >   		notify->want_shutdown = true;
> >   		return 0;
> >   	}
> > @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
> >   		return 0;
> >   	}
> > -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > -				  xfs_failure_pgoff(mp, rec, notify),
> > -				  xfs_failure_pgcnt(mp, rec, notify),
> > -				  notify->mf_flags);
> > +	mapping = VFS_I(ip)->i_mapping;
> > +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> > +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > +
> > +	/* Continue the rmap query if the inode isn't a dax file. */
> > +	if (dax_mapping(mapping))
> > +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > +					  notify->mf_flags);
> > +
> > +	/* Invalidate the cache in dax pages. */
> > +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > +		invalidate_inode_pages2_range(mapping, pgoff,
> > +					      pgoff + pgcnt - 1);
> > +
> >   	xfs_irele(ip);
> >   	return error;
> >   }
> > +static void
> > +xfs_dax_notify_failure_freeze(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct super_block 	*sb = mp->m_super;
> > +
> > +	/* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> > +	while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> > +		// Shall we just wait, or print warning then return -EBUSY?
> > +		delay(HZ / 10);
> > +	}
> > +}
> > +
> > +static void
> > +xfs_dax_notify_failure_thaw(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct super_block	*sb = mp->m_super;
> > +	int			error;
> > +
> > +	error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > +	if (error)
> > +		xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > +			  error);
> > +	/*
> > +	 * Also thaw userspace call anyway because the device is about to be
> > +	 * removed immediately.
> > +	 */
> > +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > +}
> > +
> >   static int
> >   xfs_dax_notify_ddev_failure(
> >   	struct xfs_mount	*mp,
> > @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
> >   	error = xfs_trans_alloc_empty(mp, &tp);
> >   	if (error)
> > -		return error;
> > +		goto out;
> >   	for (; agno <= end_agno; agno++) {
> >   		struct xfs_rmap_irec	ri_low = { };
> > @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
> >   	}
> >   	xfs_trans_cancel(tp);
> > +
> > +	/*
> > +	 * Determine how to shutdown the filesystem according to the
> > +	 * error code and flags.
> > +	 */
> >   	if (error || notify.want_shutdown) {
> >   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> >   		if (!error)
> >   			error = -EFSCORRUPTED;
> > -	}
> > +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
> > +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > +
> > +out:
> > +	/* Thaw the fs if it is freezed before. */
> > +	if (mf_flags & MF_MEM_PRE_REMOVE)
> > +		xfs_dax_notify_failure_thaw(mp);
> > +
> >   	return error;
> >   }
> > @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
> >   	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> >   	    mp->m_logdev_targp != mp->m_ddev_targp) {
> > +		if (mf_flags & MF_MEM_PRE_REMOVE)
> > +			return 0;
> >   		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> >   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> >   		return -EFSCORRUPTED;
> > @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
> >   	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> >   	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > +	/* Notify failure on the whole device. */
> > +	if (offset == 0 && len == U64_MAX) {
> > +		offset = ddev_start;
> > +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > +	}
> > +
> >   	/* Ignore the range out of filesystem area */
> >   	if (offset + len - 1 < ddev_start)
> >   		return -ENXIO;
> > @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
> >   	if (offset + len - 1 > ddev_end)
> >   		len = ddev_end - offset + 1;
> > +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> > +		xfs_info(mp, "device is about to be removed!");
> > +		/* Freeze fs to prevent new mappings from being created. */
> > +		xfs_dax_notify_failure_freeze(mp);
> > +	}
> > +
> >   	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> >   			mf_flags);
> >   }
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 27ce77080c79..a80c255b88d2 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3576,6 +3576,7 @@ enum mf_flags {
> >   	MF_UNPOISON = 1 << 4,
> >   	MF_SW_SIMULATED = 1 << 5,
> >   	MF_NO_RETRY = 1 << 6,
> > +	MF_MEM_PRE_REMOVE = 1 << 7,
> >   };
> >   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> >   		      unsigned long count, int mf_flags);
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 5b663eca1f29..483b75f2fcfb 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> >    */
> >   static void collect_procs_fsdax(struct page *page,
> >   		struct address_space *mapping, pgoff_t pgoff,
> > -		struct list_head *to_kill)
> > +		struct list_head *to_kill, bool pre_remove)
> >   {
> >   	struct vm_area_struct *vma;
> >   	struct task_struct *tsk;
> > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> >   	i_mmap_lock_read(mapping);
> >   	read_lock(&tasklist_lock);
> >   	for_each_process(tsk) {
> > -		struct task_struct *t = task_early_kill(tsk, true);
> > +		struct task_struct *t = tsk;
> > +		/*
> > +		 * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> > +		 * current may not be the one accessing the fsdax page.
> > +		 * Otherwise, search for the current task.
> > +		 */
> > +		if (!pre_remove)
> > +			t = task_early_kill(tsk, true);
> >   		if (!t)
> >   			continue;
> >   		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> >   	dax_entry_t cookie;
> >   	struct page *page;
> >   	size_t end = index + count;
> > +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> >   	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> >   		if (!page)
> >   			goto unlock;
> > -		SetPageHWPoison(page);
> > +		if (!pre_remove)
> > +			SetPageHWPoison(page);
> > -		collect_procs_fsdax(page, mapping, index, &to_kill);
> > +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> >   		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> >   				index, mf_flags);
> >   unlock:

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-07-14 14:18     ` Darrick J. Wong
@ 2023-07-20  1:50       ` Shiyang Ruan
  2023-07-29 10:01         ` Shiyang Ruan
  0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-07-20  1:50 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
	jack, akpm, mcgrof



On 2023/7/14 22:18, Darrick J. Wong wrote:
> On Fri, Jul 14, 2023 at 05:07:58PM +0800, Shiyang Ruan wrote:
>> Hi Darrick,
>>
>> Thanks for applying the 1st patch.
>>
>> Now, since this patch is based on the new freeze_super()/thaw_super()
>> api[1], I'd like to ask what's the plan for this api?  It seems to have
>> missed the v6.5-rc1.
>>
>> [1] https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> 
> 6.6.  I intend to push the XFS UBSAN fixes to the list today for review.
> Early next week I'll resend the 6.5 rebase of the kernelfreeze series
> and push it to vfs-for-next.  Some time after that will come large folio
> writes.

Got it.  Thanks for your information!


--
Ruan.

> 
> --D
> 
>>
>> --
>> Thanks,
>> Ruan.
>>
>>
>> On 2023/6/29 16:16, Shiyang Ruan wrote:
>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>> on it to unmap all files in use, and notify processes who are using
>>> those files.
>>>
>>> Call trace:
>>> trigger unbind
>>>    -> unbind_store()
>>>     -> ... (skip)
>>>      -> devres_release_all()
>>>       -> kill_dax()
>>>        -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>         -> xfs_dax_notify_failure()
>>>         `-> freeze_super()             // freeze (kernel call)
>>>         `-> do xfs rmap
>>>         ` -> mf_dax_kill_procs()
>>>         `  -> collect_procs_fsdax()    // all associated processes
>>>         `  -> unmap_and_kill()
>>>         ` -> invalidate_inode_pages2_range() // drop file's cache
>>>         `-> thaw_super()               // thaw (both kernel & user call)
>>>
>>> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
>>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
>>> new dax mappings from being created.  Do not shut down the filesystem
>>> directly if the configuration is not supported, or if the failure range
>>> includes a metadata area.  Make sure all files and processes (not only
>>> the current process) are handled correctly.  Also drop the cache of
>>> associated files before pmem is removed.
>>>
>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>>
>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>> ---
>>>    drivers/dax/super.c         |  3 +-
>>>    fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>>>    include/linux/mm.h          |  1 +
>>>    mm/memory-failure.c         | 17 ++++++--
>>>    4 files changed, 96 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>> index c4c4728a36e4..2e1a35e82fce 100644
>>> --- a/drivers/dax/super.c
>>> +++ b/drivers/dax/super.c
>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>>    		return;
>>>    	if (dax_dev->holder_data != NULL)
>>> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>> +				MF_MEM_PRE_REMOVE);
>>>    	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>>    	synchronize_srcu(&dax_srcu);
>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>> index 4a9bbd3fe120..f6ec56b76db6 100644
>>> --- a/fs/xfs/xfs_notify_failure.c
>>> +++ b/fs/xfs/xfs_notify_failure.c
>>> @@ -22,6 +22,7 @@
>>>    #include <linux/mm.h>
>>>    #include <linux/dax.h>
>>> +#include <linux/fs.h>
>>>    struct xfs_failure_info {
>>>    	xfs_agblock_t		startblock;
>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>>    	struct xfs_mount		*mp = cur->bc_mp;
>>>    	struct xfs_inode		*ip;
>>>    	struct xfs_failure_info		*notify = data;
>>> +	struct address_space		*mapping;
>>> +	pgoff_t				pgoff;
>>> +	unsigned long			pgcnt;
>>>    	int				error = 0;
>>>    	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>>    	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>>> +		/* Continue the query because this isn't a failure. */
>>> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>> +			return 0;
>>>    		notify->want_shutdown = true;
>>>    		return 0;
>>>    	}
>>> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>>>    		return 0;
>>>    	}
>>> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>> -				  xfs_failure_pgoff(mp, rec, notify),
>>> -				  xfs_failure_pgcnt(mp, rec, notify),
>>> -				  notify->mf_flags);
>>> +	mapping = VFS_I(ip)->i_mapping;
>>> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
>>> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>> +
>>> +	/* Continue the rmap query if the inode isn't a dax file. */
>>> +	if (dax_mapping(mapping))
>>> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>> +					  notify->mf_flags);
>>> +
>>> +	/* Invalidate the cache in dax pages. */
>>> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>> +		invalidate_inode_pages2_range(mapping, pgoff,
>>> +					      pgoff + pgcnt - 1);
>>> +
>>>    	xfs_irele(ip);
>>>    	return error;
>>>    }
>>> +static void
>>> +xfs_dax_notify_failure_freeze(
>>> +	struct xfs_mount	*mp)
>>> +{
>>> +	struct super_block 	*sb = mp->m_super;
>>> +
>>> +	/* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
>>> +	while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
>>> +		// Shall we just wait, or print warning then return -EBUSY?
>>> +		delay(HZ / 10);
>>> +	}
>>> +}
>>> +
>>> +static void
>>> +xfs_dax_notify_failure_thaw(
>>> +	struct xfs_mount	*mp)
>>> +{
>>> +	struct super_block	*sb = mp->m_super;
>>> +	int			error;
>>> +
>>> +	error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>>> +	if (error)
>>> +		xfs_emerg(mp, "still frozen after notify failure, err=%d",
>>> +			  error);
>>> +	/*
>>> +	 * Also thaw userspace call anyway because the device is about to be
>>> +	 * removed immediately.
>>> +	 */
>>> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>>> +}
>>> +
>>>    static int
>>>    xfs_dax_notify_ddev_failure(
>>>    	struct xfs_mount	*mp,
>>> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>>>    	error = xfs_trans_alloc_empty(mp, &tp);
>>>    	if (error)
>>> -		return error;
>>> +		goto out;
>>>    	for (; agno <= end_agno; agno++) {
>>>    		struct xfs_rmap_irec	ri_low = { };
>>> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>>>    	}
>>>    	xfs_trans_cancel(tp);
>>> +
>>> +	/*
>>> +	 * Determine how to shutdown the filesystem according to the
>>> +	 * error code and flags.
>>> +	 */
>>>    	if (error || notify.want_shutdown) {
>>>    		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>    		if (!error)
>>>    			error = -EFSCORRUPTED;
>>> -	}
>>> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
>>> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>> +
>>> +out:
>>> +	/* Thaw the fs if it is freezed before. */
>>> +	if (mf_flags & MF_MEM_PRE_REMOVE)
>>> +		xfs_dax_notify_failure_thaw(mp);
>>> +
>>>    	return error;
>>>    }
>>> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>>>    	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>>    	    mp->m_logdev_targp != mp->m_ddev_targp) {
>>> +		if (mf_flags & MF_MEM_PRE_REMOVE)
>>> +			return 0;
>>>    		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>>    		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>    		return -EFSCORRUPTED;
>>> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>>>    	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>>    	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>> +	/* Notify failure on the whole device. */
>>> +	if (offset == 0 && len == U64_MAX) {
>>> +		offset = ddev_start;
>>> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>> +	}
>>> +
>>>    	/* Ignore the range out of filesystem area */
>>>    	if (offset + len - 1 < ddev_start)
>>>    		return -ENXIO;
>>> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>>>    	if (offset + len - 1 > ddev_end)
>>>    		len = ddev_end - offset + 1;
>>> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
>>> +		xfs_info(mp, "device is about to be removed!");
>>> +		/* Freeze fs to prevent new mappings from being created. */
>>> +		xfs_dax_notify_failure_freeze(mp);
>>> +	}
>>> +
>>>    	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>>>    			mf_flags);
>>>    }
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 27ce77080c79..a80c255b88d2 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3576,6 +3576,7 @@ enum mf_flags {
>>>    	MF_UNPOISON = 1 << 4,
>>>    	MF_SW_SIMULATED = 1 << 5,
>>>    	MF_NO_RETRY = 1 << 6,
>>> +	MF_MEM_PRE_REMOVE = 1 << 7,
>>>    };
>>>    int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>    		      unsigned long count, int mf_flags);
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 5b663eca1f29..483b75f2fcfb 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>>     */
>>>    static void collect_procs_fsdax(struct page *page,
>>>    		struct address_space *mapping, pgoff_t pgoff,
>>> -		struct list_head *to_kill)
>>> +		struct list_head *to_kill, bool pre_remove)
>>>    {
>>>    	struct vm_area_struct *vma;
>>>    	struct task_struct *tsk;
>>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>>    	i_mmap_lock_read(mapping);
>>>    	read_lock(&tasklist_lock);
>>>    	for_each_process(tsk) {
>>> -		struct task_struct *t = task_early_kill(tsk, true);
>>> +		struct task_struct *t = tsk;
>>> +		/*
>>> +		 * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>>> +		 * current may not be the one accessing the fsdax page.
>>> +		 * Otherwise, search for the current task.
>>> +		 */
>>> +		if (!pre_remove)
>>> +			t = task_early_kill(tsk, true);
>>>    		if (!t)
>>>    			continue;
>>>    		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>    	dax_entry_t cookie;
>>>    	struct page *page;
>>>    	size_t end = index + count;
>>> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>>    	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>    		if (!page)
>>>    			goto unlock;
>>> -		SetPageHWPoison(page);
>>> +		if (!pre_remove)
>>> +			SetPageHWPoison(page);
>>> -		collect_procs_fsdax(page, mapping, index, &to_kill);
>>> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>>    		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>>    				index, mf_flags);
>>>    unlock:

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-07-20  1:50       ` Shiyang Ruan
@ 2023-07-29 10:01         ` Shiyang Ruan
  2023-07-29 15:15           ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-07-29 10:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
	jack, akpm, mcgrof



On 2023/7/20 9:50, Shiyang Ruan wrote:
> 
> 
> On 2023/7/14 22:18, Darrick J. Wong wrote:
>> On Fri, Jul 14, 2023 at 05:07:58PM +0800, Shiyang Ruan wrote:
>>> Hi Darrick,
>>>
>>> Thanks for applying the 1st patch.
>>>
>>> Now, since this patch is based on the new freeze_super()/thaw_super()
>>> api[1], I'd like to ask what's the plan for this api?  It seems to have
>>> missed the v6.5-rc1.
>>>
>>> [1] https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>
>> 6.6.  I intend to push the XFS UBSAN fixes to the list today for review.
>> Early next week I'll resend the 6.5 rebase of the kernelfreeze series
>> and push it to vfs-for-next.  Some time after that will come large folio
>> writes.
> 
> Got it.  Thanks for your information!

A small request:  if you have time to give some comments, I would
appreciate it, because I hope we can make the most of this period
(before the freeze API is merged in 6.6).


--
Thanks,
Ruan.

> 
> 
> -- 
> Ruan.
> 
>>
>> --D
>>
>>>
>>> -- 
>>> Thanks,
>>> Ruan.
>>>
>>>
>>> On 2023/6/29 16:16, Shiyang Ruan wrote:
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> on it to unmap all files in use, and notify processes who are using
>>>> those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>>    -> unbind_store()
>>>>     -> ... (skip)
>>>>      -> devres_release_all()
>>>>       -> kill_dax()
>>>>        -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>>         -> xfs_dax_notify_failure()
>>>>         `-> freeze_super()             // freeze (kernel call)
>>>>         `-> do xfs rmap
>>>>         ` -> mf_dax_kill_procs()
>>>>         `  -> collect_procs_fsdax()    // all associated processes
>>>>         `  -> unmap_and_kill()
>>>>         ` -> invalidate_inode_pages2_range() // drop file's cache
>>>>         `-> thaw_super()               // thaw (both kernel & user call)
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
>>>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
>>>> new dax mappings from being created.  Do not shut down the filesystem
>>>> directly if the configuration is not supported, or if the failure range
>>>> includes a metadata area.  Make sure all files and processes (not only
>>>> the current process) are handled correctly.  Also drop the cache of
>>>> associated files before pmem is removed.
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> ---
>>>>    drivers/dax/super.c         |  3 +-
>>>>    fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>>>>    include/linux/mm.h          |  1 +
>>>>    mm/memory-failure.c         | 17 ++++++--
>>>>    4 files changed, 96 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>>> index c4c4728a36e4..2e1a35e82fce 100644
>>>> --- a/drivers/dax/super.c
>>>> +++ b/drivers/dax/super.c
>>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>>>            return;
>>>>        if (dax_dev->holder_data != NULL)
>>>> -        dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>>> +        dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>>> +                MF_MEM_PRE_REMOVE);
>>>>        clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>>>        synchronize_srcu(&dax_srcu);
>>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>>> index 4a9bbd3fe120..f6ec56b76db6 100644
>>>> --- a/fs/xfs/xfs_notify_failure.c
>>>> +++ b/fs/xfs/xfs_notify_failure.c
>>>> @@ -22,6 +22,7 @@
>>>>    #include <linux/mm.h>
>>>>    #include <linux/dax.h>
>>>> +#include <linux/fs.h>
>>>>    struct xfs_failure_info {
>>>>        xfs_agblock_t        startblock;
>>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>>>        struct xfs_mount        *mp = cur->bc_mp;
>>>>        struct xfs_inode        *ip;
>>>>        struct xfs_failure_info        *notify = data;
>>>> +    struct address_space        *mapping;
>>>> +    pgoff_t                pgoff;
>>>> +    unsigned long            pgcnt;
>>>>        int                error = 0;
>>>>        if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>>>            (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>>>> +        /* Continue the query because this isn't a failure. */
>>>> +        if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> +            return 0;
>>>>            notify->want_shutdown = true;
>>>>            return 0;
>>>>        }
>>>> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>>>>            return 0;
>>>>        }
>>>> -    error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>>> -                  xfs_failure_pgoff(mp, rec, notify),
>>>> -                  xfs_failure_pgcnt(mp, rec, notify),
>>>> -                  notify->mf_flags);
>>>> +    mapping = VFS_I(ip)->i_mapping;
>>>> +    pgoff = xfs_failure_pgoff(mp, rec, notify);
>>>> +    pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>>> +
>>>> +    /* Continue the rmap query if the inode isn't a dax file. */
>>>> +    if (dax_mapping(mapping))
>>>> +        error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>>> +                      notify->mf_flags);
>>>> +
>>>> +    /* Invalidate the cache in dax pages. */
>>>> +    if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> +        invalidate_inode_pages2_range(mapping, pgoff,
>>>> +                          pgoff + pgcnt - 1);
>>>> +
>>>>        xfs_irele(ip);
>>>>        return error;
>>>>    }
>>>> +static void
>>>> +xfs_dax_notify_failure_freeze(
>>>> +    struct xfs_mount    *mp)
>>>> +{
>>>> +    struct super_block     *sb = mp->m_super;
>>>> +
>>>> +    /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
>>>> +    while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
>>>> +        // Shall we just wait, or print warning then return -EBUSY?
>>>> +        delay(HZ / 10);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void
>>>> +xfs_dax_notify_failure_thaw(
>>>> +    struct xfs_mount    *mp)
>>>> +{
>>>> +    struct super_block    *sb = mp->m_super;
>>>> +    int            error;
>>>> +
>>>> +    error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>>>> +    if (error)
>>>> +        xfs_emerg(mp, "still frozen after notify failure, err=%d",
>>>> +              error);
>>>> +    /*
>>>> +     * Also thaw userspace call anyway because the device is about to be
>>>> +     * removed immediately.
>>>> +     */
>>>> +    thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>>>> +}
>>>> +
>>>>    static int
>>>>    xfs_dax_notify_ddev_failure(
>>>>        struct xfs_mount    *mp,
>>>> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>>>>        error = xfs_trans_alloc_empty(mp, &tp);
>>>>        if (error)
>>>> -        return error;
>>>> +        goto out;
>>>>        for (; agno <= end_agno; agno++) {
>>>>            struct xfs_rmap_irec    ri_low = { };
>>>> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>>>>        }
>>>>        xfs_trans_cancel(tp);
>>>> +
>>>> +    /*
>>>> +     * Determine how to shutdown the filesystem according to the
>>>> +     * error code and flags.
>>>> +     */
>>>>        if (error || notify.want_shutdown) {
>>>>            xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>>            if (!error)
>>>>                error = -EFSCORRUPTED;
>>>> -    }
>>>> +    } else if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> +        xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>>> +
>>>> +out:
>>>> +    /* Thaw the fs if it is freezed before. */
>>>> +    if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> +        xfs_dax_notify_failure_thaw(mp);
>>>> +
>>>>        return error;
>>>>    }
>>>> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>>>>        if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>>>            mp->m_logdev_targp != mp->m_ddev_targp) {
>>>> +        if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> +            return 0;
>>>>            xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>>>            xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>>            return -EFSCORRUPTED;
>>>> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>>>>        ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>>>        ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>>> +    /* Notify failure on the whole device. */
>>>> +    if (offset == 0 && len == U64_MAX) {
>>>> +        offset = ddev_start;
>>>> +        len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>>> +    }
>>>> +
>>>>        /* Ignore the range out of filesystem area */
>>>>        if (offset + len - 1 < ddev_start)
>>>>            return -ENXIO;
>>>> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>>>>        if (offset + len - 1 > ddev_end)
>>>>            len = ddev_end - offset + 1;
>>>> +    if (mf_flags & MF_MEM_PRE_REMOVE) {
>>>> +        xfs_info(mp, "device is about to be removed!");
>>>> +        /* Freeze fs to prevent new mappings from being created. */
>>>> +        xfs_dax_notify_failure_freeze(mp);
>>>> +    }
>>>> +
>>>>        return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>>>>                mf_flags);
>>>>    }
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 27ce77080c79..a80c255b88d2 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3576,6 +3576,7 @@ enum mf_flags {
>>>>        MF_UNPOISON = 1 << 4,
>>>>        MF_SW_SIMULATED = 1 << 5,
>>>>        MF_NO_RETRY = 1 << 6,
>>>> +    MF_MEM_PRE_REMOVE = 1 << 7,
>>>>    };
>>>>    int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>>                  unsigned long count, int mf_flags);
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 5b663eca1f29..483b75f2fcfb 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>>>     */
>>>>    static void collect_procs_fsdax(struct page *page,
>>>>            struct address_space *mapping, pgoff_t pgoff,
>>>> -        struct list_head *to_kill)
>>>> +        struct list_head *to_kill, bool pre_remove)
>>>>    {
>>>>        struct vm_area_struct *vma;
>>>>        struct task_struct *tsk;
>>>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>>>        i_mmap_lock_read(mapping);
>>>>        read_lock(&tasklist_lock);
>>>>        for_each_process(tsk) {
>>>> -        struct task_struct *t = task_early_kill(tsk, true);
>>>> +        struct task_struct *t = tsk;
>>>> +        /*
>>>> +         * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>>>> +         * current may not be the one accessing the fsdax page.
>>>> +         * Otherwise, search for the current task.
>>>> +         */
>>>> +        if (!pre_remove)
>>>> +            t = task_early_kill(tsk, true);
>>>>            if (!t)
>>>>                continue;
>>>>            vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>>        dax_entry_t cookie;
>>>>        struct page *page;
>>>>        size_t end = index + count;
>>>> +    bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>>>        mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>>            if (!page)
>>>>                goto unlock;
>>>> -        SetPageHWPoison(page);
>>>> +        if (!pre_remove)
>>>> +            SetPageHWPoison(page);
>>>> -        collect_procs_fsdax(page, mapping, index, &to_kill);
>>>> +        collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>>>            unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>>>                    index, mf_flags);
>>>>    unlock:

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-06-29  8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
  2023-06-29 12:02   ` kernel test robot
  2023-07-14  9:07   ` Shiyang Ruan
@ 2023-07-29 15:15   ` Darrick J. Wong
  2023-07-31  9:36     ` Shiyang Ruan
  2023-08-08  0:31   ` Dan Williams
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-07-29 15:15 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof

On Thu, Jun 29, 2023 at 04:16:51PM +0800, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created.  Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area.  Make sure all files and processes (not only the current process)
> are handled correctly.  Also drop the cache of associated files before
> pmem is removed.
> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/dax/super.c         |  3 +-
>  fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>  include/linux/mm.h          |  1 +
>  mm/memory-failure.c         | 17 ++++++--
>  4 files changed, 96 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>  		return;
>  
>  	if (dax_dev->holder_data != NULL)
> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> +				MF_MEM_PRE_REMOVE);
>  
>  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>  	synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..f6ec56b76db6 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>  
>  #include <linux/mm.h>
>  #include <linux/dax.h>
> +#include <linux/fs.h>
>  
>  struct xfs_failure_info {
>  	xfs_agblock_t		startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>  	struct xfs_mount		*mp = cur->bc_mp;
>  	struct xfs_inode		*ip;
>  	struct xfs_failure_info		*notify = data;
> +	struct address_space		*mapping;
> +	pgoff_t				pgoff;
> +	unsigned long			pgcnt;
>  	int				error = 0;
>  
>  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>  	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> +		/* Continue the query because this isn't a failure. */
> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		notify->want_shutdown = true;
>  		return 0;
>  	}
> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>  		return 0;
>  	}
>  
> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> -				  xfs_failure_pgoff(mp, rec, notify),
> -				  xfs_failure_pgcnt(mp, rec, notify),
> -				  notify->mf_flags);
> +	mapping = VFS_I(ip)->i_mapping;
> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> +	/* Continue the rmap query if the inode isn't a dax file. */
> +	if (dax_mapping(mapping))
> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> +					  notify->mf_flags);
> +
> +	/* Invalidate the cache in dax pages. */
> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +		invalidate_inode_pages2_range(mapping, pgoff,
> +					      pgoff + pgcnt - 1);
> +
>  	xfs_irele(ip);
>  	return error;
>  }
>  
> +static void
> +xfs_dax_notify_failure_freeze(
> +	struct xfs_mount	*mp)
> +{
> +	struct super_block 	*sb = mp->m_super;

Nit: extra space right    ^ here.

> +
> +	/* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> +	while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> +		// Shall we just wait, or print warning then return -EBUSY?

Hm.  PRE_REMOVE gets called before the pmem gets unplugged, right?  So
we'll send a second notification after it goes away, right?

If so, then I'd say return the error here instead of looping, and live
with a kernel-frozen fs discarding the PRE_REMOVE message.
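
A minimal userspace sketch of that suggestion, not the patch itself.  It
assumes a stub, mock_freeze_super(), standing in for
freeze_super(sb, FREEZE_HOLDER_KERNEL); struct mock_sb and
notify_failure_freeze() are hypothetical names used only for illustration.
The helper makes a single attempt and hands any error back to the caller
instead of retrying:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct super_block; not the kernel type. */
struct mock_sb {
	bool kernel_frozen;	/* true while a kernel holder owns the freeze */
};

/* Stub modelled on an exclusive kernel freeze: fails if already held. */
static int mock_freeze_super(struct mock_sb *sb)
{
	assert(sb);
	if (sb->kernel_frozen)
		return -EBUSY;
	sb->kernel_frozen = true;
	return 0;
}

/* One attempt only; the caller decides what to do with -EBUSY. */
static int notify_failure_freeze(struct mock_sb *sb)
{
	return mock_freeze_super(sb);
}
```

With this shape the PRE_REMOVE handler can bail out early on a nonzero
return, which is the "discard the message" behaviour described above.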

> +		delay(HZ / 10);
> +	}
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> +	struct xfs_mount	*mp)
> +{
> +	struct super_block	*sb = mp->m_super;
> +	int			error;
> +
> +	error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> +	if (error)
> +		xfs_emerg(mp, "still frozen after notify failure, err=%d",
> +			  error);
> +	/*
> +	 * Also thaw userspace call anyway because the device is about to be
> +	 * removed immediately.
> +	 */
> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +}
> +
>  static int
>  xfs_dax_notify_ddev_failure(
>  	struct xfs_mount	*mp,
> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>  
>  	error = xfs_trans_alloc_empty(mp, &tp);
>  	if (error)
> -		return error;
> +		goto out;
>  
>  	for (; agno <= end_agno; agno++) {
>  		struct xfs_rmap_irec	ri_low = { };
> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>  	}
>  
>  	xfs_trans_cancel(tp);
> +
> +	/*
> +	 * Determine how to shutdown the filesystem according to the
> +	 * error code and flags.
> +	 */
>  	if (error || notify.want_shutdown) {
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>  		if (!error)
>  			error = -EFSCORRUPTED;
> -	}
> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> +	/* Thaw the fs if it is freezed before. */
> +	if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_dax_notify_failure_thaw(mp);

_thaw should be called from the same function that called _freeze.
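
Roughly the shape I mean, as a userspace sketch under stated assumptions:
mock_freeze()/mock_thaw() are stubs for the kernel freeze/thaw calls, and
ddev_failure()/notify_failure() are hypothetical stand-ins for
xfs_dax_notify_ddev_failure()/xfs_dax_notify_failure().  The worker knows
nothing about freezing; the caller owns both halves of the pair:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct super_block; not the kernel type. */
struct mock_sb {
	bool kernel_frozen;
};

/* Stub freeze: exclusive, so a second kernel freeze reports -EBUSY. */
static int mock_freeze(struct mock_sb *sb)
{
	if (sb->kernel_frozen)
		return -EBUSY;
	sb->kernel_frozen = true;
	return 0;
}

/* Stub thaw: only valid while the freeze is held. */
static void mock_thaw(struct mock_sb *sb)
{
	assert(sb->kernel_frozen);
	sb->kernel_frozen = false;
}

/* Worker: does the scan and returns its result; no freeze state here. */
static int ddev_failure(struct mock_sb *sb, int simulated_error)
{
	(void)sb;
	return simulated_error;
}

/* Caller: freeze and thaw live in the same function. */
static int notify_failure(struct mock_sb *sb, bool pre_remove,
			  int simulated_error)
{
	int error;

	if (pre_remove) {
		error = mock_freeze(sb);
		if (error)
			return error;	/* never froze, so nothing to thaw */
	}

	error = ddev_failure(sb, simulated_error);

	if (pre_remove)
		mock_thaw(sb);
	return error;
}
```

That way no error path inside the worker needs a thaw, and a failed freeze
cannot lead to thawing somebody else's freeze.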

The rest of the patch seems ok to me.

--D

> +
>  	return error;
>  }
>  
> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>  
>  	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>  	    mp->m_logdev_targp != mp->m_ddev_targp) {
> +		if (mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>  		return -EFSCORRUPTED;
> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>  	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>  	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>  
> +	/* Notify failure on the whole device. */
> +	if (offset == 0 && len == U64_MAX) {
> +		offset = ddev_start;
> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> +	}
> +
>  	/* Ignore the range out of filesystem area */
>  	if (offset + len - 1 < ddev_start)
>  		return -ENXIO;
> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>  	if (offset + len - 1 > ddev_end)
>  		len = ddev_end - offset + 1;
>  
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "device is about to be removed!");
> +		/* Freeze fs to prevent new mappings from being created. */
> +		xfs_dax_notify_failure_freeze(mp);
> +	}
> +
>  	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>  			mf_flags);
>  }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..a80c255b88d2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3576,6 +3576,7 @@ enum mf_flags {
>  	MF_UNPOISON = 1 << 4,
>  	MF_SW_SIMULATED = 1 << 5,
>  	MF_NO_RETRY = 1 << 6,
> +	MF_MEM_PRE_REMOVE = 1 << 7,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  		      unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5b663eca1f29..483b75f2fcfb 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>   */
>  static void collect_procs_fsdax(struct page *page,
>  		struct address_space *mapping, pgoff_t pgoff,
> -		struct list_head *to_kill)
> +		struct list_head *to_kill, bool pre_remove)
>  {
>  	struct vm_area_struct *vma;
>  	struct task_struct *tsk;
> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>  	i_mmap_lock_read(mapping);
>  	read_lock(&tasklist_lock);
>  	for_each_process(tsk) {
> -		struct task_struct *t = task_early_kill(tsk, true);
> +		struct task_struct *t = tsk;
>  
> +		/*
> +		 * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> +		 * current may not be the one accessing the fsdax page.
> +		 * Otherwise, search for the current task.
> +		 */
> +		if (!pre_remove)
> +			t = task_early_kill(tsk, true);
>  		if (!t)
>  			continue;
>  		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  	dax_entry_t cookie;
>  	struct page *page;
>  	size_t end = index + count;
> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>  
>  	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>  
> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  		if (!page)
>  			goto unlock;
>  
> -		SetPageHWPoison(page);
> +		if (!pre_remove)
> +			SetPageHWPoison(page);
>  
> -		collect_procs_fsdax(page, mapping, index, &to_kill);
> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>  		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>  				index, mf_flags);
>  unlock:
> -- 
> 2.40.1
> 


* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-07-29 10:01         ` Shiyang Ruan
@ 2023-07-29 15:15           ` Darrick J. Wong
  0 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2023-07-29 15:15 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
	jack, akpm, mcgrof

On Sat, Jul 29, 2023 at 06:01:00PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2023/7/20 9:50, Shiyang Ruan wrote:
> > 
> > 
> > On 2023/7/14 22:18, Darrick J. Wong wrote:
> > > On Fri, Jul 14, 2023 at 05:07:58PM +0800, Shiyang Ruan wrote:
> > > > Hi Darrick,
> > > > 
> > > > Thanks for applying the 1st patch.
> > > > 
> > > > Now, since this patch is based on the new freeze_super()/thaw_super()
> > > > api[1], I'd like to ask what's the plan for this api?  It seems to have
> > > > missed the v6.5-rc1.
> > > > 
> > > > [1] https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> > > 
> > > 6.6.  I intend to push the XFS UBSAN fixes to the list today for review.
> > > Early next week I'll resend the 6.5 rebase of the kernelfreeze series
> > > and push it to vfs-for-next.  Some time after that will come large folio
> > > writes.
> > 
> > Got it.  Thanks for your information!
> 
> A small request:  if you have time to give some comments, I would appreciate
> it, because I hope we can make the most of this period (before the freeze API
> is merged in 6.6).

Done.

--D


> 
> --
> Thanks,
> Ruan.
> 
> > 
> > 
> > -- 
> > Ruan.
> > 
> > > 
> > > --D
> > > 
> > > > 
> > > > -- 
> > > > Thanks,
> > > > Ruan.
> > > > 
> > > > 
> > > > On 2023/6/29 16:16, Shiyang Ruan wrote:
> > > > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > > > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > > > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > > > > on it to unmap all files in use, and notify processes who are using
> > > > > those files.
> > > > > 
> > > > > Call trace:
> > > > > trigger unbind
> > > > >    -> unbind_store()
> > > > >     -> ... (skip)
> > > > >      -> devres_release_all()
> > > > >       -> kill_dax()
> > > > >        -> dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > > > MF_MEM_PRE_REMOVE)
> > > > >         -> xfs_dax_notify_failure()
> > > > >         `-> freeze_super()             // freeze (kernel call)
> > > > >         `-> do xfs rmap
> > > > >         ` -> mf_dax_kill_procs()
> > > > >         `  -> collect_procs_fsdax()    // all associated processes
> > > > >         `  -> unmap_and_kill()
> > > > >         ` -> invalidate_inode_pages2_range() // drop file's cache
> > > > >         `-> thaw_super()               // thaw (both kernel
> > > > > & user call)
> > > > > 
> > > > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > > > > event.  Use the exclusive freeze/thaw[2] to lock the
> > > > > filesystem to prevent
> > > > > new dax mapping from being created.  Do not shutdown
> > > > > filesystem directly
> > > > > if configuration is not supported, or if failure range
> > > > > includes metadata
> > > > > > area.  Make sure all files and processes (not only the current process)
> > > > > are handled correctly.  Also drop the cache of associated files before
> > > > > pmem is removed.
> > > > > 
> > > > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > > > [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> > > > > 
> > > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > > ---
> > > > >    drivers/dax/super.c         |  3 +-
> > > > >    fs/xfs/xfs_notify_failure.c | 86
> > > > > ++++++++++++++++++++++++++++++++++---
> > > > >    include/linux/mm.h          |  1 +
> > > > >    mm/memory-failure.c         | 17 ++++++--
> > > > >    4 files changed, 96 insertions(+), 11 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > > > index c4c4728a36e4..2e1a35e82fce 100644
> > > > > --- a/drivers/dax/super.c
> > > > > +++ b/drivers/dax/super.c
> > > > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > > > >            return;
> > > > >        if (dax_dev->holder_data != NULL)
> > > > > -        dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > > > +        dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > > > +                MF_MEM_PRE_REMOVE);
> > > > >        clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > > >        synchronize_srcu(&dax_srcu);
> > > > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > > > index 4a9bbd3fe120..f6ec56b76db6 100644
> > > > > --- a/fs/xfs/xfs_notify_failure.c
> > > > > +++ b/fs/xfs/xfs_notify_failure.c
> > > > > @@ -22,6 +22,7 @@
> > > > >    #include <linux/mm.h>
> > > > >    #include <linux/dax.h>
> > > > > +#include <linux/fs.h>
> > > > >    struct xfs_failure_info {
> > > > >        xfs_agblock_t        startblock;
> > > > > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > > > >        struct xfs_mount        *mp = cur->bc_mp;
> > > > >        struct xfs_inode        *ip;
> > > > >        struct xfs_failure_info        *notify = data;
> > > > > +    struct address_space        *mapping;
> > > > > +    pgoff_t                pgoff;
> > > > > +    unsigned long            pgcnt;
> > > > >        int                error = 0;
> > > > >        if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > > > >            (rec->rm_flags & (XFS_RMAP_ATTR_FORK |
> > > > > XFS_RMAP_BMBT_BLOCK))) {
> > > > > +        /* Continue the query because this isn't a failure. */
> > > > > +        if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +            return 0;
> > > > >            notify->want_shutdown = true;
> > > > >            return 0;
> > > > >        }
> > > > > @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
> > > > >            return 0;
> > > > >        }
> > > > > -    error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > > > > -                  xfs_failure_pgoff(mp, rec, notify),
> > > > > -                  xfs_failure_pgcnt(mp, rec, notify),
> > > > > -                  notify->mf_flags);
> > > > > +    mapping = VFS_I(ip)->i_mapping;
> > > > > +    pgoff = xfs_failure_pgoff(mp, rec, notify);
> > > > > +    pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > > > > +
> > > > > +    /* Continue the rmap query if the inode isn't a dax file. */
> > > > > +    if (dax_mapping(mapping))
> > > > > +        error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > > > > +                      notify->mf_flags);
> > > > > +
> > > > > +    /* Invalidate the cache in dax pages. */
> > > > > +    if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +        invalidate_inode_pages2_range(mapping, pgoff,
> > > > > +                          pgoff + pgcnt - 1);
> > > > > +
> > > > >        xfs_irele(ip);
> > > > >        return error;
> > > > >    }
> > > > > +static void
> > > > > +xfs_dax_notify_failure_freeze(
> > > > > +    struct xfs_mount    *mp)
> > > > > +{
> > > > > +    struct super_block     *sb = mp->m_super;
> > > > > +
> > > > > +    /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> > > > > +    while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> > > > > +        // Shall we just wait, or print warning then return -EBUSY?
> > > > > +        delay(HZ / 10);
> > > > > +    }
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +xfs_dax_notify_failure_thaw(
> > > > > +    struct xfs_mount    *mp)
> > > > > +{
> > > > > +    struct super_block    *sb = mp->m_super;
> > > > > +    int            error;
> > > > > +
> > > > > +    error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > > > > +    if (error)
> > > > > +        xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > > > > +              error);
> > > > > +    /*
> > > > > +     * Also thaw userspace call anyway because the device
> > > > > is about to be
> > > > > +     * removed immediately.
> > > > > +     */
> > > > > +    thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > > > > +}
> > > > > +
> > > > >    static int
> > > > >    xfs_dax_notify_ddev_failure(
> > > > >        struct xfs_mount    *mp,
> > > > > @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
> > > > >        error = xfs_trans_alloc_empty(mp, &tp);
> > > > >        if (error)
> > > > > -        return error;
> > > > > +        goto out;
> > > > >        for (; agno <= end_agno; agno++) {
> > > > >            struct xfs_rmap_irec    ri_low = { };
> > > > > @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
> > > > >        }
> > > > >        xfs_trans_cancel(tp);
> > > > > +
> > > > > +    /*
> > > > > +     * Determine how to shutdown the filesystem according to the
> > > > > +     * error code and flags.
> > > > > +     */
> > > > >        if (error || notify.want_shutdown) {
> > > > >            xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > > >            if (!error)
> > > > >                error = -EFSCORRUPTED;
> > > > > -    }
> > > > > +    } else if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +        xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > > > > +
> > > > > +out:
> > > > > +    /* Thaw the fs if it is freezed before. */
> > > > > +    if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +        xfs_dax_notify_failure_thaw(mp);
> > > > > +
> > > > >        return error;
> > > > >    }
> > > > > @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
> > > > >        if (mp->m_logdev_targp &&
> > > > > mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > > > >            mp->m_logdev_targp != mp->m_ddev_targp) {
> > > > > +        if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +            return 0;
> > > > >            xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > > > >            xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > > >            return -EFSCORRUPTED;
> > > > > @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
> > > > >        ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > > > >        ddev_end = ddev_start +
> > > > > bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > > > > +    /* Notify failure on the whole device. */
> > > > > +    if (offset == 0 && len == U64_MAX) {
> > > > > +        offset = ddev_start;
> > > > > +        len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > > > > +    }
> > > > > +
> > > > >        /* Ignore the range out of filesystem area */
> > > > >        if (offset + len - 1 < ddev_start)
> > > > >            return -ENXIO;
> > > > > @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
> > > > >        if (offset + len - 1 > ddev_end)
> > > > >            len = ddev_end - offset + 1;
> > > > > +    if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > > > +        xfs_info(mp, "device is about to be removed!");
> > > > > +        /* Freeze fs to prevent new mappings from being created. */
> > > > > +        xfs_dax_notify_failure_freeze(mp);
> > > > > +    }
> > > > > +
> > > > >        return xfs_dax_notify_ddev_failure(mp, BTOBB(offset),
> > > > > BTOBB(len),
> > > > >                mf_flags);
> > > > >    }
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 27ce77080c79..a80c255b88d2 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -3576,6 +3576,7 @@ enum mf_flags {
> > > > >        MF_UNPOISON = 1 << 4,
> > > > >        MF_SW_SIMULATED = 1 << 5,
> > > > >        MF_NO_RETRY = 1 << 6,
> > > > > +    MF_MEM_PRE_REMOVE = 1 << 7,
> > > > >    };
> > > > >    int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > > >                  unsigned long count, int mf_flags);
> > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > > > index 5b663eca1f29..483b75f2fcfb 100644
> > > > > --- a/mm/memory-failure.c
> > > > > +++ b/mm/memory-failure.c
> > > > > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct
> > > > > task_struct *tsk, struct page *p,
> > > > >     */
> > > > >    static void collect_procs_fsdax(struct page *page,
> > > > >            struct address_space *mapping, pgoff_t pgoff,
> > > > > -        struct list_head *to_kill)
> > > > > +        struct list_head *to_kill, bool pre_remove)
> > > > >    {
> > > > >        struct vm_area_struct *vma;
> > > > >        struct task_struct *tsk;
> > > > > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> > > > >        i_mmap_lock_read(mapping);
> > > > >        read_lock(&tasklist_lock);
> > > > >        for_each_process(tsk) {
> > > > > -        struct task_struct *t = task_early_kill(tsk, true);
> > > > > +        struct task_struct *t = tsk;
> > > > > +        /*
> > > > > +         * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> > > > > +         * current may not be the one accessing the fsdax page.
> > > > > +         * Otherwise, search for the current task.
> > > > > +         */
> > > > > +        if (!pre_remove)
> > > > > +            t = task_early_kill(tsk, true);
> > > > >            if (!t)
> > > > >                continue;
> > > > >            vma_interval_tree_foreach(vma, &mapping->i_mmap,
> > > > > pgoff, pgoff) {
> > > > > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct
> > > > > address_space *mapping, pgoff_t index,
> > > > >        dax_entry_t cookie;
> > > > >        struct page *page;
> > > > >        size_t end = index + count;
> > > > > +    bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > > > >        mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > > > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct
> > > > > address_space *mapping, pgoff_t index,
> > > > >            if (!page)
> > > > >                goto unlock;
> > > > > -        SetPageHWPoison(page);
> > > > > +        if (!pre_remove)
> > > > > +            SetPageHWPoison(page);
> > > > > -        collect_procs_fsdax(page, mapping, index, &to_kill);
> > > > > +        collect_procs_fsdax(page, mapping, index, &to_kill,
> > > > > pre_remove);
> > > > >            unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > > > >                    index, mf_flags);
> > > > >    unlock:


* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-07-29 15:15   ` Darrick J. Wong
@ 2023-07-31  9:36     ` Shiyang Ruan
  2023-08-01  3:25       ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-07-31  9:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof



On 2023/7/29 23:15, Darrick J. Wong wrote:
> On Thu, Jun 29, 2023 at 04:16:51PM +0800, Shiyang Ruan wrote:
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> on it to unmap all files in use, and notify processes who are using
>> those files.
>>
>> Call trace:
>> trigger unbind
>>   -> unbind_store()
>>    -> ... (skip)
>>     -> devres_release_all()
>>      -> kill_dax()
>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>        -> xfs_dax_notify_failure()
>>        `-> freeze_super()             // freeze (kernel call)
>>        `-> do xfs rmap
>>        ` -> mf_dax_kill_procs()
>>        `  -> collect_procs_fsdax()    // all associated processes
>>        `  -> unmap_and_kill()
>>        ` -> invalidate_inode_pages2_range() // drop file's cache
>>        `-> thaw_super()               // thaw (both kernel & user call)
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> new dax mapping from being created.  Do not shutdown filesystem directly
>> if configuration is not supported, or if failure range includes metadata
>> area.  Make sure all files and processes (not only the current process)
>> are handled correctly.  Also drop the cache of associated files before
>> pmem is removed.
>>
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>>   drivers/dax/super.c         |  3 +-
>>   fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>>   include/linux/mm.h          |  1 +
>>   mm/memory-failure.c         | 17 ++++++--
>>   4 files changed, 96 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index c4c4728a36e4..2e1a35e82fce 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>   		return;
>>   
>>   	if (dax_dev->holder_data != NULL)
>> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> +				MF_MEM_PRE_REMOVE);
>>   
>>   	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>   	synchronize_srcu(&dax_srcu);
>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>> index 4a9bbd3fe120..f6ec56b76db6 100644
>> --- a/fs/xfs/xfs_notify_failure.c
>> +++ b/fs/xfs/xfs_notify_failure.c
>> @@ -22,6 +22,7 @@
>>   
>>   #include <linux/mm.h>
>>   #include <linux/dax.h>
>> +#include <linux/fs.h>
>>   
>>   struct xfs_failure_info {
>>   	xfs_agblock_t		startblock;
>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>   	struct xfs_mount		*mp = cur->bc_mp;
>>   	struct xfs_inode		*ip;
>>   	struct xfs_failure_info		*notify = data;
>> +	struct address_space		*mapping;
>> +	pgoff_t				pgoff;
>> +	unsigned long			pgcnt;
>>   	int				error = 0;
>>   
>>   	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>   	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>> +		/* Continue the query because this isn't a failure. */
>> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>   		notify->want_shutdown = true;
>>   		return 0;
>>   	}
>> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>>   		return 0;
>>   	}
>>   
>> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>> -				  xfs_failure_pgoff(mp, rec, notify),
>> -				  xfs_failure_pgcnt(mp, rec, notify),
>> -				  notify->mf_flags);
>> +	mapping = VFS_I(ip)->i_mapping;
>> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
>> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>> +
>> +	/* Continue the rmap query if the inode isn't a dax file. */
>> +	if (dax_mapping(mapping))
>> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>> +					  notify->mf_flags);
>> +
>> +	/* Invalidate the cache in dax pages. */
>> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> +		invalidate_inode_pages2_range(mapping, pgoff,
>> +					      pgoff + pgcnt - 1);
>> +
>>   	xfs_irele(ip);
>>   	return error;
>>   }
>>   
>> +static void
>> +xfs_dax_notify_failure_freeze(
>> +	struct xfs_mount	*mp)
>> +{
>> +	struct super_block 	*sb = mp->m_super;
> 
> Nit: extra space right    ^ here.
> 
>> +
>> +	/* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
>> +	while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
>> +		// Shall we just wait, or print warning then return -EBUSY?
> 
> Hm.  PRE_REMOVE gets called before the pmem gets unplugged, right?  So
> we'll send a second notification after it goes away, right?

For the first question, yes.

But I'm not sure about the second one.  Do you mean: we'll send this
notification again if the unbind did not succeed because freeze_super()
returned -EBUSY?  In other words, if the previous unbind operation did
not work, we could unbind the device again.

> 
> If so, then I'd say return the error here instead of looping, and live
> with a kernel-frozen fs discarding the PRE_REMOVE message.
> 
>> +		delay(HZ / 10);
>> +	}
>> +}
>> +
>> +static void
>> +xfs_dax_notify_failure_thaw(
>> +	struct xfs_mount	*mp)
>> +{
>> +	struct super_block	*sb = mp->m_super;
>> +	int			error;
>> +
>> +	error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>> +	if (error)
>> +		xfs_emerg(mp, "still frozen after notify failure, err=%d",
>> +			  error);
>> +	/*
>> +	 * Also thaw userspace call anyway because the device is about to be
>> +	 * removed immediately.
>> +	 */
>> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>> +}
>> +
>>   static int
>>   xfs_dax_notify_ddev_failure(
>>   	struct xfs_mount	*mp,
>> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>>   
>>   	error = xfs_trans_alloc_empty(mp, &tp);
>>   	if (error)
>> -		return error;
>> +		goto out;
>>   
>>   	for (; agno <= end_agno; agno++) {
>>   		struct xfs_rmap_irec	ri_low = { };
>> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>>   	}
>>   
>>   	xfs_trans_cancel(tp);
>> +
>> +	/*
>> +	 * Determine how to shutdown the filesystem according to the
>> +	 * error code and flags.
>> +	 */
>>   	if (error || notify.want_shutdown) {
>>   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>   		if (!error)
>>   			error = -EFSCORRUPTED;
>> -	}
>> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
>> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>> +
>> +out:
>> +	/* Thaw the fs if it is freezed before. */
>> +	if (mf_flags & MF_MEM_PRE_REMOVE)
>> +		xfs_dax_notify_failure_thaw(mp);
> 
> _thaw should be called from the same function that called _freeze.

Will fix this.

> 
> The rest of the patch seems ok to me.

Thank you!


--
Ruan.

> 
> --D
> 
>> +
>>   	return error;
>>   }
>>   
>> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>>   
>>   	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>   	    mp->m_logdev_targp != mp->m_ddev_targp) {
>> +		if (mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>   		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>   		return -EFSCORRUPTED;
>> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>>   	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>   	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>   
>> +	/* Notify failure on the whole device. */
>> +	if (offset == 0 && len == U64_MAX) {
>> +		offset = ddev_start;
>> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>> +	}
>> +
>>   	/* Ignore the range out of filesystem area */
>>   	if (offset + len - 1 < ddev_start)
>>   		return -ENXIO;
>> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>>   	if (offset + len - 1 > ddev_end)
>>   		len = ddev_end - offset + 1;
>>   
>> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
>> +		xfs_info(mp, "device is about to be removed!");
>> +		/* Freeze fs to prevent new mappings from being created. */
>> +		xfs_dax_notify_failure_freeze(mp);
>> +	}
>> +
>>   	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>>   			mf_flags);
>>   }
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 27ce77080c79..a80c255b88d2 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3576,6 +3576,7 @@ enum mf_flags {
>>   	MF_UNPOISON = 1 << 4,
>>   	MF_SW_SIMULATED = 1 << 5,
>>   	MF_NO_RETRY = 1 << 6,
>> +	MF_MEM_PRE_REMOVE = 1 << 7,
>>   };
>>   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   		      unsigned long count, int mf_flags);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 5b663eca1f29..483b75f2fcfb 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>    */
>>   static void collect_procs_fsdax(struct page *page,
>>   		struct address_space *mapping, pgoff_t pgoff,
>> -		struct list_head *to_kill)
>> +		struct list_head *to_kill, bool pre_remove)
>>   {
>>   	struct vm_area_struct *vma;
>>   	struct task_struct *tsk;
>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>   	i_mmap_lock_read(mapping);
>>   	read_lock(&tasklist_lock);
>>   	for_each_process(tsk) {
>> -		struct task_struct *t = task_early_kill(tsk, true);
>> +		struct task_struct *t = tsk;
>>   
>> +		/*
>> +		 * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>> +		 * current may not be the one accessing the fsdax page.
>> +		 * Otherwise, search for the current task.
>> +		 */
>> +		if (!pre_remove)
>> +			t = task_early_kill(tsk, true);
>>   		if (!t)
>>   			continue;
>>   		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   	dax_entry_t cookie;
>>   	struct page *page;
>>   	size_t end = index + count;
>> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>   
>>   	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>   
>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   		if (!page)
>>   			goto unlock;
>>   
>> -		SetPageHWPoison(page);
>> +		if (!pre_remove)
>> +			SetPageHWPoison(page);
>>   
>> -		collect_procs_fsdax(page, mapping, index, &to_kill);
>> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>   		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>   				index, mf_flags);
>>   unlock:
>> -- 
>> 2.40.1
>>


* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-07-31  9:36     ` Shiyang Ruan
@ 2023-08-01  3:25       ` Darrick J. Wong
  2023-08-03 10:44         ` Shiyang Ruan
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-01  3:25 UTC
  To: Shiyang Ruan
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof

On Mon, Jul 31, 2023 at 05:36:36PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2023/7/29 23:15, Darrick J. Wong wrote:
> > On Thu, Jun 29, 2023 at 04:16:51PM +0800, Shiyang Ruan wrote:
> > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > > on it to unmap all files in use, and notify processes who are using
> > > those files.
> > > 
> > > Call trace:
> > > trigger unbind
> > >   -> unbind_store()
> > >    -> ... (skip)
> > >     -> devres_release_all()
> > >      -> kill_dax()
> > >       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > >        -> xfs_dax_notify_failure()
> > >        `-> freeze_super()             // freeze (kernel call)
> > >        `-> do xfs rmap
> > >        ` -> mf_dax_kill_procs()
> > >        `  -> collect_procs_fsdax()    // all associated processes
> > >        `  -> unmap_and_kill()
> > >        ` -> invalidate_inode_pages2_range() // drop file's cache
> > >        `-> thaw_super()               // thaw (both kernel & user call)
> > > 
> > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > > event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> > > new dax mapping from being created.  Do not shutdown filesystem directly
> > > if configuration is not supported, or if failure range includes metadata
> > > area.  Make sure all files and processes (not only the current process)
> > > are handled correctly.  Also drop the cache of associated files before
> > > pmem is removed.
> > > 
> > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> > > 
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > ---
> > >   drivers/dax/super.c         |  3 +-
> > >   fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
> > >   include/linux/mm.h          |  1 +
> > >   mm/memory-failure.c         | 17 ++++++--
> > >   4 files changed, 96 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > index c4c4728a36e4..2e1a35e82fce 100644
> > > --- a/drivers/dax/super.c
> > > +++ b/drivers/dax/super.c
> > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > >   		return;
> > >   	if (dax_dev->holder_data != NULL)
> > > -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > +				MF_MEM_PRE_REMOVE);
> > >   	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > >   	synchronize_srcu(&dax_srcu);
> > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > index 4a9bbd3fe120..f6ec56b76db6 100644
> > > --- a/fs/xfs/xfs_notify_failure.c
> > > +++ b/fs/xfs/xfs_notify_failure.c
> > > @@ -22,6 +22,7 @@
> > >   #include <linux/mm.h>
> > >   #include <linux/dax.h>
> > > +#include <linux/fs.h>
> > >   struct xfs_failure_info {
> > >   	xfs_agblock_t		startblock;
> > > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > >   	struct xfs_mount		*mp = cur->bc_mp;
> > >   	struct xfs_inode		*ip;
> > >   	struct xfs_failure_info		*notify = data;
> > > +	struct address_space		*mapping;
> > > +	pgoff_t				pgoff;
> > > +	unsigned long			pgcnt;
> > >   	int				error = 0;
> > >   	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > >   	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > > +		/* Continue the query because this isn't a failure. */
> > > +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > +			return 0;
> > >   		notify->want_shutdown = true;
> > >   		return 0;
> > >   	}
> > > @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
> > >   		return 0;
> > >   	}
> > > -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > > -				  xfs_failure_pgoff(mp, rec, notify),
> > > -				  xfs_failure_pgcnt(mp, rec, notify),
> > > -				  notify->mf_flags);
> > > +	mapping = VFS_I(ip)->i_mapping;
> > > +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> > > +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > > +
> > > +	/* Continue the rmap query if the inode isn't a dax file. */
> > > +	if (dax_mapping(mapping))
> > > +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > > +					  notify->mf_flags);
> > > +
> > > +	/* Invalidate the cache in dax pages. */
> > > +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > +		invalidate_inode_pages2_range(mapping, pgoff,
> > > +					      pgoff + pgcnt - 1);
> > > +
> > >   	xfs_irele(ip);
> > >   	return error;
> > >   }
> > > +static void
> > > +xfs_dax_notify_failure_freeze(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	struct super_block 	*sb = mp->m_super;
> > 
> > Nit: extra space right    ^ here.
> > 
> > > +
> > > +	/* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> > > +	while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> > > +		// Shall we just wait, or print warning then return -EBUSY?
> > 
> > Hm.  PRE_REMOVE gets called before the pmem gets unplugged, right?  So
> > we'll send a second notification after it goes away, right?
> 
> For the first question, yes.
> 
> But I'm not sure about the second one.  Do you mean: we'll send this
> notification again if unbind didn't succeed because freeze_super() returns
> -EBUSY?  In other words, if the previous unbind operation did not work, we
> could unbind the device again.

Yeah.  If the MF_MEM_PRE_REMOVE fails with EBUSY, then call it again
without PRE_REMOVE and let it kill processes.
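
To illustrate the fallback, here is a userspace toy model (not kernel
code).  Every toy_* name is made up for illustration, and the flag value
just mirrors the bit proposed in this patch.  It assumes the PRE_REMOVE
pass returns -EBUSY when the kernel freeze is already held elsewhere:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the bit proposed in the patch; only used by this toy model. */
#define MF_MEM_PRE_REMOVE	(1 << 7)

/* Toy stand-in for a mounted filesystem; not a kernel structure. */
struct toy_fs {
	bool	kernel_frozen;		/* someone else holds the kernel freeze */
	int	graceful_passes;	/* notifications handled with PRE_REMOVE */
	int	forced_passes;		/* notifications handled without it */
};

/* Models the holder's notify callback: PRE_REMOVE needs the freeze. */
static int toy_notify_failure(struct toy_fs *fs, int mf_flags)
{
	assert(fs != NULL);

	if (mf_flags & MF_MEM_PRE_REMOVE) {
		if (fs->kernel_frozen)
			return -EBUSY;	/* report the error instead of looping */
		fs->graceful_passes++;
		return 0;
	}

	fs->forced_passes++;		/* plain failure path: kill processes */
	return 0;
}

/* Caller side: try the graceful pass first, fall back on -EBUSY. */
static int toy_kill_dax(struct toy_fs *fs)
{
	int error = toy_notify_failure(fs, MF_MEM_PRE_REMOVE);

	if (error == -EBUSY)
		error = toy_notify_failure(fs, 0);
	return error;
}
```

The point of the sketch is only that the second call carries no
PRE_REMOVE flag, so it never needs the freeze.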

--D

> > 
> > If so, then I'd say return the error here instead of looping, and live
> > with a kernel-frozen fs discarding the PRE_REMOVE message.
> > 
> > > +		delay(HZ / 10);
> > > +	}
> > > +}
> > > +
> > > +static void
> > > +xfs_dax_notify_failure_thaw(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	struct super_block	*sb = mp->m_super;
> > > +	int			error;
> > > +
> > > +	error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > > +	if (error)
> > > +		xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > > +			  error);
> > > +	/*
> > > +	 * Also thaw userspace call anyway because the device is about to be
> > > +	 * removed immediately.
> > > +	 */
> > > +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > > +}
> > > +
> > >   static int
> > >   xfs_dax_notify_ddev_failure(
> > >   	struct xfs_mount	*mp,
> > > @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
> > >   	error = xfs_trans_alloc_empty(mp, &tp);
> > >   	if (error)
> > > -		return error;
> > > +		goto out;
> > >   	for (; agno <= end_agno; agno++) {
> > >   		struct xfs_rmap_irec	ri_low = { };
> > > @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
> > >   	}
> > >   	xfs_trans_cancel(tp);
> > > +
> > > +	/*
> > > +	 * Determine how to shutdown the filesystem according to the
> > > +	 * error code and flags.
> > > +	 */
> > >   	if (error || notify.want_shutdown) {
> > >   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > >   		if (!error)
> > >   			error = -EFSCORRUPTED;
> > > -	}
> > > +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
> > > +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > > +
> > > +out:
> > > +	/* Thaw the fs if it is freezed before. */
> > > +	if (mf_flags & MF_MEM_PRE_REMOVE)
> > > +		xfs_dax_notify_failure_thaw(mp);
> > 
> > _thaw should be called from the same function that called _freeze.
> 
> Will fix this.
> 
> > 
> > The rest of the patch seems ok to me.
> 
> Thank you!
> 
> 
> --
> Ruan.
> 
> > 
> > --D
> > 
> > > +
> > >   	return error;
> > >   }
> > > @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
> > >   	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > >   	    mp->m_logdev_targp != mp->m_ddev_targp) {
> > > +		if (mf_flags & MF_MEM_PRE_REMOVE)
> > > +			return 0;
> > >   		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > >   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > >   		return -EFSCORRUPTED;
> > > @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
> > >   	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > >   	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > > +	/* Notify failure on the whole device. */
> > > +	if (offset == 0 && len == U64_MAX) {
> > > +		offset = ddev_start;
> > > +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > > +	}
> > > +
> > >   	/* Ignore the range out of filesystem area */
> > >   	if (offset + len - 1 < ddev_start)
> > >   		return -ENXIO;
> > > @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
> > >   	if (offset + len - 1 > ddev_end)
> > >   		len = ddev_end - offset + 1;
> > > +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > +		xfs_info(mp, "device is about to be removed!");
> > > +		/* Freeze fs to prevent new mappings from being created. */
> > > +		xfs_dax_notify_failure_freeze(mp);
> > > +	}
> > > +
> > >   	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> > >   			mf_flags);
> > >   }
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 27ce77080c79..a80c255b88d2 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -3576,6 +3576,7 @@ enum mf_flags {
> > >   	MF_UNPOISON = 1 << 4,
> > >   	MF_SW_SIMULATED = 1 << 5,
> > >   	MF_NO_RETRY = 1 << 6,
> > > +	MF_MEM_PRE_REMOVE = 1 << 7,
> > >   };
> > >   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > >   		      unsigned long count, int mf_flags);
> > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > index 5b663eca1f29..483b75f2fcfb 100644
> > > --- a/mm/memory-failure.c
> > > +++ b/mm/memory-failure.c
> > > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> > >    */
> > >   static void collect_procs_fsdax(struct page *page,
> > >   		struct address_space *mapping, pgoff_t pgoff,
> > > -		struct list_head *to_kill)
> > > +		struct list_head *to_kill, bool pre_remove)
> > >   {
> > >   	struct vm_area_struct *vma;
> > >   	struct task_struct *tsk;
> > > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> > >   	i_mmap_lock_read(mapping);
> > >   	read_lock(&tasklist_lock);
> > >   	for_each_process(tsk) {
> > > -		struct task_struct *t = task_early_kill(tsk, true);
> > > +		struct task_struct *t = tsk;
> > > +		/*
> > > +		 * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> > > +		 * current may not be the one accessing the fsdax page.
> > > +		 * Otherwise, search for the current task.
> > > +		 */
> > > +		if (!pre_remove)
> > > +			t = task_early_kill(tsk, true);
> > >   		if (!t)
> > >   			continue;
> > >   		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > >   	dax_entry_t cookie;
> > >   	struct page *page;
> > >   	size_t end = index + count;
> > > +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > >   	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > >   		if (!page)
> > >   			goto unlock;
> > > -		SetPageHWPoison(page);
> > > +		if (!pre_remove)
> > > +			SetPageHWPoison(page);
> > > -		collect_procs_fsdax(page, mapping, index, &to_kill);
> > > +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > >   		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > >   				index, mf_flags);
> > >   unlock:
> > > -- 
> > > 2.40.1
> > > 


* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-08-01  3:25       ` Darrick J. Wong
@ 2023-08-03 10:44         ` Shiyang Ruan
  0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-03 10:44 UTC
  To: Darrick J. Wong
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof



On 2023/8/1 11:25, Darrick J. Wong wrote:
> On Mon, Jul 31, 2023 at 05:36:36PM +0800, Shiyang Ruan wrote:
>>
>>
>> On 2023/7/29 23:15, Darrick J. Wong wrote:
>>> On Thu, Jun 29, 2023 at 04:16:51PM +0800, Shiyang Ruan wrote:
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> on it to unmap all files in use, and notify processes who are using
>>>> those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>>    -> unbind_store()
>>>>     -> ... (skip)
>>>>      -> devres_release_all()
>>>>       -> kill_dax()
>>>>        -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>>         -> xfs_dax_notify_failure()
>>>>         `-> freeze_super()             // freeze (kernel call)
>>>>         `-> do xfs rmap
>>>>         ` -> mf_dax_kill_procs()
>>>>         `  -> collect_procs_fsdax()    // all associated processes
>>>>         `  -> unmap_and_kill()
>>>>         ` -> invalidate_inode_pages2_range() // drop file's cache
>>>>         `-> thaw_super()               // thaw (both kernel & user call)
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>> new dax mapping from being created.  Do not shutdown filesystem directly
>>>> if configuration is not supported, or if failure range includes metadata
>>>> area.  Make sure all files and processes (not only the current process)
>>>> are handled correctly.  Also drop the cache of associated files before
>>>> pmem is removed.
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> ---
>>>>    drivers/dax/super.c         |  3 +-
>>>>    fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>>>>    include/linux/mm.h          |  1 +
>>>>    mm/memory-failure.c         | 17 ++++++--
>>>>    4 files changed, 96 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>>> index c4c4728a36e4..2e1a35e82fce 100644
>>>> --- a/drivers/dax/super.c
>>>> +++ b/drivers/dax/super.c
>>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>>>    		return;
>>>>    	if (dax_dev->holder_data != NULL)
>>>> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>>> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>>> +				MF_MEM_PRE_REMOVE);
>>>>    	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>>>    	synchronize_srcu(&dax_srcu);
>>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>>> index 4a9bbd3fe120..f6ec56b76db6 100644
>>>> --- a/fs/xfs/xfs_notify_failure.c
>>>> +++ b/fs/xfs/xfs_notify_failure.c
>>>> @@ -22,6 +22,7 @@
>>>>    #include <linux/mm.h>
>>>>    #include <linux/dax.h>
>>>> +#include <linux/fs.h>
>>>>    struct xfs_failure_info {
>>>>    	xfs_agblock_t		startblock;
>>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>>>    	struct xfs_mount		*mp = cur->bc_mp;
>>>>    	struct xfs_inode		*ip;
>>>>    	struct xfs_failure_info		*notify = data;
>>>> +	struct address_space		*mapping;
>>>> +	pgoff_t				pgoff;
>>>> +	unsigned long			pgcnt;
>>>>    	int				error = 0;
>>>>    	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>>>    	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>>>> +		/* Continue the query because this isn't a failure. */
>>>> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> +			return 0;
>>>>    		notify->want_shutdown = true;
>>>>    		return 0;
>>>>    	}
>>>> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>>>>    		return 0;
>>>>    	}
>>>> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>>> -				  xfs_failure_pgoff(mp, rec, notify),
>>>> -				  xfs_failure_pgcnt(mp, rec, notify),
>>>> -				  notify->mf_flags);
>>>> +	mapping = VFS_I(ip)->i_mapping;
>>>> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
>>>> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>>> +
>>>> +	/* Continue the rmap query if the inode isn't a dax file. */
>>>> +	if (dax_mapping(mapping))
>>>> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>>> +					  notify->mf_flags);
>>>> +
>>>> +	/* Invalidate the cache in dax pages. */
>>>> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> +		invalidate_inode_pages2_range(mapping, pgoff,
>>>> +					      pgoff + pgcnt - 1);
>>>> +
>>>>    	xfs_irele(ip);
>>>>    	return error;
>>>>    }
>>>> +static void
>>>> +xfs_dax_notify_failure_freeze(
>>>> +	struct xfs_mount	*mp)
>>>> +{
>>>> +	struct super_block 	*sb = mp->m_super;
>>>
>>> Nit: extra space right    ^ here.
>>>
>>>> +
>>>> +	/* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
>>>> +	while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
>>>> +		// Shall we just wait, or print warning then return -EBUSY?
>>>
>>> Hm.  PRE_REMOVE gets called before the pmem gets unplugged, right?  So
>>> we'll send a second notification after it goes away, right?
>>
>> For the first question, yes.
>>
>> But I'm not sure about the second one.  Do you mean: we'll send this
>> notification again if unbind didn't succeed because freeze_super() returns
>> -EBUSY?  In other words, if the previous unbind operation did not work, we
>> could unbind the device again.
> 
> Yeah.  If the MF_MEM_PRE_REMOVE fails with EBUSY, then call it again
> without PRE_REMOVE and let it kill processes.

Ok.  But I have to pass the flag (MF_MEM_PRE_REMOVE) to
mf_dax_kill_procs() so that it can search for all processes that are
holding dax pages, rather than only the current process.

So my thought is: if the filesystem is already frozen by the kernel
during unbind, just accept the -EBUSY and carry on with the rmap query
and killing processes.  After the rmap query is done, skip the kernel
thaw as well.  This way, there is no need to send a second notification.

```
     bool frozen_by_kernel = false;

     // skip... other definitions

     if (mf_flags & MF_MEM_PRE_REMOVE) {
         xfs_info(mp, "Device is about to be removed!");
         /* Freeze fs to prevent new mappings from being created. */
         error = xfs_dax_notify_failure_freeze(mp);
         if (error) {
             /* Keep on if filesystem is frozen by kernel */
             if (error == -EBUSY)
                 frozen_by_kernel = true;
             else
                 return error;
         }
     }

     // skip... RMAP

out:
     /* Thaw the filesystem. */
     if (mf_flags & MF_MEM_PRE_REMOVE)
         /* don't thaw kernel frozen if already frozen by kernel */
         xfs_dax_notify_failure_thaw(mp, frozen_by_kernel);

     return error;
```
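
To make the holder bookkeeping concrete, here is a userspace toy model
(not kernel code).  All toy_* names and bit values are made up for
illustration.  It assumes a freeze by a holder that already holds one
returns -EBUSY, and that thawing a holder that is not held returns
-EINVAL:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy holder bits; they echo FREEZE_HOLDER_* but the values are made up. */
#define TOY_HOLDER_KERNEL	(1u << 0)
#define TOY_HOLDER_USERSPACE	(1u << 1)

/* Toy stand-in for the superblock freeze state. */
struct toy_sb {
	unsigned int	holders;	/* who currently holds a freeze */
};

/* Each holder may freeze once; a second attempt reports -EBUSY. */
static int toy_freeze(struct toy_sb *sb, unsigned int who)
{
	if (sb->holders & who)
		return -EBUSY;
	sb->holders |= who;
	return 0;
}

/* Drop one holder's freeze; -EINVAL if that holder never froze. */
static int toy_thaw(struct toy_sb *sb, unsigned int who)
{
	if (!(sb->holders & who))
		return -EINVAL;
	sb->holders &= ~who;
	return 0;
}

/* Skip the kernel thaw when the kernel freeze was not taken by us. */
static void toy_notify_thaw(struct toy_sb *sb, bool frozen_by_kernel)
{
	if (!frozen_by_kernel)
		(void)toy_thaw(sb, TOY_HOLDER_KERNEL);
	/* The device is going away, so drop any userspace freeze too. */
	(void)toy_thaw(sb, TOY_HOLDER_USERSPACE);
}

/* The PRE_REMOVE flow from the snippet above, with the rmap walk elided. */
static int toy_pre_remove(struct toy_sb *sb)
{
	bool	frozen_by_kernel = false;
	int	error;

	assert(sb != NULL);

	error = toy_freeze(sb, TOY_HOLDER_KERNEL);
	if (error == -EBUSY)
		frozen_by_kernel = true;	/* keep going, remember why */
	else if (error)
		return error;

	/* rmap query and process killing would happen here */

	toy_notify_thaw(sb, frozen_by_kernel);
	return 0;
}
```

In this model a pre-existing kernel freeze is left in place for its
original owner, while a userspace freeze is always dropped.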


--
Thanks,
Ruan.

> 
> --D
> 
>>>
>>> If so, then I'd say return the error here instead of looping, and live
>>> with a kernel-frozen fs discarding the PRE_REMOVE message.
>>>
>>>> +		delay(HZ / 10);
>>>> +	}
>>>> +}
>>>> +
>>>> +static void
>>>> +xfs_dax_notify_failure_thaw(
>>>> +	struct xfs_mount	*mp)
>>>> +{
>>>> +	struct super_block	*sb = mp->m_super;
>>>> +	int			error;
>>>> +
>>>> +	error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>>>> +	if (error)
>>>> +		xfs_emerg(mp, "still frozen after notify failure, err=%d",
>>>> +			  error);
>>>> +	/*
>>>> +	 * Also thaw userspace call anyway because the device is about to be
>>>> +	 * removed immediately.
>>>> +	 */
>>>> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>>>> +}
>>>> +
>>>>    static int
>>>>    xfs_dax_notify_ddev_failure(
>>>>    	struct xfs_mount	*mp,
>>>> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>>>>    	error = xfs_trans_alloc_empty(mp, &tp);
>>>>    	if (error)
>>>> -		return error;
>>>> +		goto out;
>>>>    	for (; agno <= end_agno; agno++) {
>>>>    		struct xfs_rmap_irec	ri_low = { };
>>>> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>>>>    	}
>>>>    	xfs_trans_cancel(tp);
>>>> +
>>>> +	/*
>>>> +	 * Determine how to shutdown the filesystem according to the
>>>> +	 * error code and flags.
>>>> +	 */
>>>>    	if (error || notify.want_shutdown) {
>>>>    		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>>    		if (!error)
>>>>    			error = -EFSCORRUPTED;
>>>> -	}
>>>> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>>> +
>>>> +out:
>>>> +	/* Thaw the fs if it is freezed before. */
>>>> +	if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> +		xfs_dax_notify_failure_thaw(mp);
>>>
>>> _thaw should be called from the same function that called _freeze.
>>
>> Will fix this.
>>
>>>
>>> The rest of the patch seems ok to me.
>>
>> Thank you!
>>
>>
>> --
>> Ruan.
>>
>>>
>>> --D
>>>
>>>> +
>>>>    	return error;
>>>>    }
>>>> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>>>>    	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>>>    	    mp->m_logdev_targp != mp->m_ddev_targp) {
>>>> +		if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> +			return 0;
>>>>    		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>>>    		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>>    		return -EFSCORRUPTED;
>>>> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>>>>    	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>>>    	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>>> +	/* Notify failure on the whole device. */
>>>> +	if (offset == 0 && len == U64_MAX) {
>>>> +		offset = ddev_start;
>>>> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>>> +	}
>>>> +
>>>>    	/* Ignore the range out of filesystem area */
>>>>    	if (offset + len - 1 < ddev_start)
>>>>    		return -ENXIO;
>>>> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>>>>    	if (offset + len - 1 > ddev_end)
>>>>    		len = ddev_end - offset + 1;
>>>> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
>>>> +		xfs_info(mp, "device is about to be removed!");
>>>> +		/* Freeze fs to prevent new mappings from being created. */
>>>> +		xfs_dax_notify_failure_freeze(mp);
>>>> +	}
>>>> +
>>>>    	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>>>>    			mf_flags);
>>>>    }
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 27ce77080c79..a80c255b88d2 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3576,6 +3576,7 @@ enum mf_flags {
>>>>    	MF_UNPOISON = 1 << 4,
>>>>    	MF_SW_SIMULATED = 1 << 5,
>>>>    	MF_NO_RETRY = 1 << 6,
>>>> +	MF_MEM_PRE_REMOVE = 1 << 7,
>>>>    };
>>>>    int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>>    		      unsigned long count, int mf_flags);
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 5b663eca1f29..483b75f2fcfb 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>>>     */
>>>>    static void collect_procs_fsdax(struct page *page,
>>>>    		struct address_space *mapping, pgoff_t pgoff,
>>>> -		struct list_head *to_kill)
>>>> +		struct list_head *to_kill, bool pre_remove)
>>>>    {
>>>>    	struct vm_area_struct *vma;
>>>>    	struct task_struct *tsk;
>>>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>>>    	i_mmap_lock_read(mapping);
>>>>    	read_lock(&tasklist_lock);
>>>>    	for_each_process(tsk) {
>>>> -		struct task_struct *t = task_early_kill(tsk, true);
>>>> +		struct task_struct *t = tsk;
>>>> +		/*
>>>> +		 * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>>>> +		 * current may not be the one accessing the fsdax page.
>>>> +		 * Otherwise, search for the current task.
>>>> +		 */
>>>> +		if (!pre_remove)
>>>> +			t = task_early_kill(tsk, true);
>>>>    		if (!t)
>>>>    			continue;
>>>>    		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>>    	dax_entry_t cookie;
>>>>    	struct page *page;
>>>>    	size_t end = index + count;
>>>> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>>>    	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>>    		if (!page)
>>>>    			goto unlock;
>>>> -		SetPageHWPoison(page);
>>>> +		if (!pre_remove)
>>>> +			SetPageHWPoison(page);
>>>> -		collect_procs_fsdax(page, mapping, index, &to_kill);
>>>> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>>>    		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>>>    				index, mf_flags);
>>>>    unlock:
>>>> -- 
>>>> 2.40.1
>>>>


* RE: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-06-29  8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
                     ` (2 preceding siblings ...)
  2023-07-29 15:15   ` Darrick J. Wong
@ 2023-08-08  0:31   ` Dan Williams
  2023-08-23  8:36     ` Shiyang Ruan
  2023-08-23  8:17   ` [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE " Shiyang Ruan
  2023-08-28  6:57   ` [PATCH v14] " Shiyang Ruan
  5 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2023-08-08  0:31 UTC
  To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof

Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created.  Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area.  Make sure all files and processes (not only the current process)
> are handled correctly.  Also drop the cache of associated files before
> pmem is removed.

I would say more about why this is important for DAX users. Yes, the
devm_memremap_pages() vs get_user_pages() infrastructure can be improved
if it has a mechanism to revoke all pages that it has handed out for a
given device, but that's not an end user visible effect.

The end user impact needs to be clear. Is this for existing deployed
pmem where a user accidentally removes a device and wants failures and
process killing instead of hangs?

The reason Linux has got along without this for so long is because pmem
is difficult to remove (and with the sunset of Optane, difficult to
acquire). One motivation to pursue this is CXL where hotplug is better
defined and use cases like dynamic capacity devices where making forward
progress to kill processes is better than hanging.

It would help to have an example of what happens without this patch.

> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/dax/super.c         |  3 +-
>  fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>  include/linux/mm.h          |  1 +
>  mm/memory-failure.c         | 17 ++++++--
>  4 files changed, 96 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>  		return;
>  
>  	if (dax_dev->holder_data != NULL)
> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> +				MF_MEM_PRE_REMOVE);

The motivation in the original proposal was to convey the death of
large extents to memory_failure(). However, that proposal predated your
mf_dax_kill_procs() approach. With mf_dax_kill_procs() the need for a
new bulk memory_failure() API is gone.

This is where the end user impact needs to be clear. It seems that
without this patch the filesystem may assume failure while the device is
still present, but that seems ok. The goal is forward progress after a
mistake, not necessarily minimizing damage after a mistake. The fact that
the current code is not as gentle could be considered a feature because
graceful shutdown should always unmount before unplug, and if one
unplugs before unmount it is already understood that they get to keep
the pieces.

Because the driver ->remove() callback cannot enforce that the device
is still present, it seems unnecessary to optimize for the case where
the device is being removed from an actively mounted filesystem, but
the device is still present.

The dax_holder_notify_failure(dax_dev, 0, U64_MAX) is sufficient to say
"userspace failed to umount before hardware eject, stop trying to access
this range", rather than "try to finish up in this range, but it might
already be too late".

* [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-06-29  8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
                     ` (3 preceding siblings ...)
  2023-08-08  0:31   ` Dan Williams
@ 2023-08-23  8:17   ` Shiyang Ruan
  2023-08-23 23:36     ` Darrick J. Wong
  2023-08-28  6:57   ` [PATCH v14] " Shiyang Ruan
  5 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-23  8:17 UTC (permalink / raw)
  To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof

====
Changes since v12:
 1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
 2. complete the behavior when fs has already been frozen by a kernel call
      NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
              I tried this proposal[0].
 3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
 4. rebase on: xfs/xfs-linux.git vfs-for-next
====

Currently, if we suddenly remove a PMEM device (by calling unbind) that
contains an FSDAX filesystem while programs are still accessing data on
it, e.g.:
```
 $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
 # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
 echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
the system can end up in an unacceptable state:
  1. the device is gone but the mount point still exists, and umount
       fails with "target is busy"
  2. programs hang and cannot be killed
  3. the kernel may crash with a NULL pointer dereference

To fix this, introduce an MF_MEM_PRE_REMOVE flag to let the filesystem
know that the whole device is about to be removed, and make sure all
related processes are notified so that they can exit gracefully.

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem on it to unmap all files in use, and to notify the processes
that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()
      `-> freeze_super()             // freeze (kernel call)
      `-> do xfs rmap
      ` -> mf_dax_kill_procs()
      `  -> collect_procs_fsdax()    // all associated processes
      `  -> unmap_and_kill()
      ` -> invalidate_inode_pages2_range() // drop file's cache
      `-> thaw_super()               // thaw (both kernel & user call)

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event.  Use the exclusive freeze/thaw[2] to lock the filesystem and
prevent new dax mappings from being created.  Do not shut down the
filesystem directly if the configuration is not supported, or if the
failure range includes a metadata area.  Make sure all files and
processes (not only the current process) are handled correctly.  Also
drop the cache of associated files before the pmem is removed.

[0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c         |  3 +-
 fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
 include/linux/mm.h          |  1 +
 mm/memory-failure.c         | 17 +++++--
 4 files changed, 109 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..2e1a35e82fce 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 4a9bbd3fe120..6496c32a9172 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
 	struct xfs_mount		*mp = cur->bc_mp;
 	struct xfs_inode		*ip;
 	struct xfs_failure_info		*notify = data;
+	struct address_space		*mapping;
+	pgoff_t				pgoff;
+	unsigned long			pgcnt;
 	int				error = 0;
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* Continue the query because this isn't a failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -92,14 +99,60 @@ xfs_dax_failure_fn(
 		return 0;
 	}
 
-	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
-				  xfs_failure_pgoff(mp, rec, notify),
-				  xfs_failure_pgcnt(mp, rec, notify),
-				  notify->mf_flags);
+	mapping = VFS_I(ip)->i_mapping;
+	pgoff = xfs_failure_pgoff(mp, rec, notify);
+	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+	/* Continue the rmap query if the inode isn't a dax file. */
+	if (dax_mapping(mapping))
+		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+					  notify->mf_flags);
+
+	/* Invalidate the cache in dax pages. */
+	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+		invalidate_inode_pages2_range(mapping, pgoff,
+					      pgoff + pgcnt - 1);
+
 	xfs_irele(ip);
 	return error;
 }
 
+static int
+xfs_dax_notify_failure_freeze(
+	struct xfs_mount	*mp)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+	if (error)
+		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+	return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+	struct xfs_mount	*mp,
+	bool			kernel_frozen)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	if (!kernel_frozen) {
+		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+		if (error)
+			xfs_emerg(mp, "still frozen after notify failure, err=%d",
+				error);
+	}
+
+	/*
+	 * Also thaw userspace call anyway because the device is about to be
+	 * removed immediately.
+	 */
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
 static int
 xfs_dax_notify_ddev_failure(
 	struct xfs_mount	*mp,
@@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
 	struct xfs_btree_cur	*cur = NULL;
 	struct xfs_buf		*agf_bp = NULL;
 	int			error = 0;
+	bool			kernel_frozen = false;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
 	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
 	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
 							     daddr + bblen - 1);
 	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "Device is about to be removed!");
+		/* Freeze fs to prevent new mappings from being created. */
+		error = xfs_dax_notify_failure_freeze(mp);
+		if (error) {
+			/* Keep going on if filesystem is frozen by kernel. */
+			if (error == -EBUSY)
+				kernel_frozen = true;
+			else
+				return error;
+		}
+	}
+
 	error = xfs_trans_alloc_empty(mp, &tp);
 	if (error)
-		return error;
+		goto out;
 
 	for (; agno <= end_agno; agno++) {
 		struct xfs_rmap_irec	ri_low = { };
@@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
 	}
 
 	xfs_trans_cancel(tp);
+
+	/*
+	 * Determine how to shutdown the filesystem according to the
+	 * error code and flags.
+	 */
 	if (error || notify.want_shutdown) {
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
-	}
+	} else if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
+out:
+	/* Thaw the fs if it is frozen before. */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
 	return error;
 }
 
@@ -197,6 +276,8 @@ xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -210,6 +291,12 @@ xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device. */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 799836e84840..944a1165a321 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3577,6 +3577,7 @@ enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index dc5ff7dd4e50..92f18c9e0aaf 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
  */
 static void collect_procs_fsdax(struct page *page,
 		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+		struct list_head *to_kill, bool pre_remove)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		struct task_struct *t = task_early_kill(tsk, true);
+		struct task_struct *t = tsk;
 
+		/*
+		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+		 * the current may not be the one accessing the fsdax page.
+		 * Otherwise, search for the current task.
+		 */
+		if (!pre_remove)
+			t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 	dax_entry_t cookie;
 	struct page *page;
 	size_t end = index + count;
+	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
 
 	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
 
@@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		if (!page)
 			goto unlock;
 
-		SetPageHWPoison(page);
+		if (!pre_remove)
+			SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock:
-- 
2.41.0
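
For readers following the whole-device case in the xfs_dax_notify_failure()
hunk above, here is a minimal userspace sketch of that one step.  It is an
illustration, not kernel code: struct toy_range and
toy_normalize_whole_device() are hypothetical names, and ddev_start and
ddev_size are assumed to stand in for bt_dax_part_off and bdev_nr_bytes().

```c
#include <stdint.h>

/* Hypothetical stand-in for the (offset, len) pair given to ->notify_failure(). */
struct toy_range {
	uint64_t offset;
	uint64_t len;
};

/*
 * kill_dax() reports "the whole device" as offset 0, len U64_MAX.
 * Rewrite that sentinel into the real data-device range so that later
 * range arithmetic works on actual byte counts.  Any other range is
 * returned unchanged.
 */
static struct toy_range
toy_normalize_whole_device(struct toy_range r, uint64_t ddev_start,
			   uint64_t ddev_size)
{
	if (r.offset == 0 && r.len == UINT64_MAX) {
		r.offset = ddev_start;
		r.len = ddev_size;
	}
	return r;
}
```

With ddev_start = 1 MiB and ddev_size = 1 GiB, the sentinel range
(0, U64_MAX) becomes (1 MiB, 1 GiB), while a partial range such as
(4096, 8192) passes through untouched.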


* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-08-08  0:31   ` Dan Williams
@ 2023-08-23  8:36     ` Shiyang Ruan
  0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-23  8:36 UTC (permalink / raw)
  To: Dan Williams, linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: willy, jack, akpm, djwong, mcgrof



On 2023/8/8 8:31, Dan Williams wrote:
> Shiyang Ruan wrote:
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1].  With the help of the dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask the
>> filesystem on it to unmap all files in use, and to notify the processes
>> that are using those files.
>>
>> Call trace:
>> trigger unbind
>>   -> unbind_store()
>>    -> ... (skip)
>>     -> devres_release_all()
>>      -> kill_dax()
>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>        -> xfs_dax_notify_failure()
>>        `-> freeze_super()             // freeze (kernel call)
>>        `-> do xfs rmap
>>        ` -> mf_dax_kill_procs()
>>        `  -> collect_procs_fsdax()    // all associated processes
>>        `  -> unmap_and_kill()
>>        ` -> invalidate_inode_pages2_range() // drop file's cache
>>        `-> thaw_super()               // thaw (both kernel & user call)
>>
>> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem and
>> prevent new dax mappings from being created.  Do not shut down the
>> filesystem directly if the configuration is not supported, or if the
>> failure range includes a metadata area.  Make sure all files and
>> processes (not only the current process) are handled correctly.  Also
>> drop the cache of associated files before the pmem is removed.
> 
> I would say more about why this is important for DAX users. Yes, the
> devm_memremap_pages() vs get_user_pages() infrastructure can be improved
> if it has a mechanism to revoke all pages that it has handed out for a
> given device, but that's not an end user visible effect.
> 
> The end user impact needs to be clear. Is this for existing deployed
> pmem where a user accidentally removes a device and wants failures and
> process killing instead of hangs?
> 
> The reason Linux has got along without this for so long is because pmem
> is difficult to remove (and with the sunset of Optane, difficult to
> acquire). One motivation to pursue this is CXL where hotplug is better
> defined and use cases like dynamic capacity devices where making forward
> progress to kill processes is better than hanging.
> 
> It would help to have an example of what happens without this patch.
> 
>>
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>>   drivers/dax/super.c         |  3 +-
>>   fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>>   include/linux/mm.h          |  1 +
>>   mm/memory-failure.c         | 17 ++++++--
>>   4 files changed, 96 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index c4c4728a36e4..2e1a35e82fce 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>   		return;
>>   
>>   	if (dax_dev->holder_data != NULL)
>> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> +				MF_MEM_PRE_REMOVE);
> 
> The motivation in the original proposal was to convey the death of
> large extents to memory_failure(). However, that proposal predated your
> mf_dax_kill_procs() approach. With mf_dax_kill_procs() the need for a
> new bulk memory_failure() API is gone.
> 
> This is where the end user impact needs to be clear. It seems that
> without this patch the filesystem may assume failure while the device is
> still present, but that seems ok. The goal is forward progress after a
> mistake, not necessarily minimizing damage after a mistake. The fact that
> the current code is not as gentle could be considered a feature because
> graceful shutdown should always unmount before unplug, and if one
> unplugs before unmount it is already understood that they get to keep
> the pieces.
> 
> Because the driver ->remove() callback cannot enforce that the device
> is still present, it seems unnecessary to optimize for the case where
> the device is being removed from an actively mounted filesystem, but
> the device is still present.
> 
> The dax_holder_notify_failure(dax_dev, 0, U64_MAX) is sufficient to say
> "userspace failed to umount before hardware eject, stop trying to access
> this range", rather than "try to finish up in this range, but it might
> already be too late".

Hi Dan,

I added a simple example of "accidentally removing a pmem device", and
the consequences of not having this patch, to the latest version.  Please
review.


--
Thanks,
Ruan.

* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-08-23  8:17   ` [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE " Shiyang Ruan
@ 2023-08-23 23:36     ` Darrick J. Wong
  2023-08-24  9:41       ` Shiyang Ruan
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-23 23:36 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof

On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
> ====
> Changes since v12:
>  1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
>  2. complete the behavior when fs has already been frozen by a kernel call
>       NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
>               I tried this proposal[0].
>  3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
>  4. rebase on: xfs/xfs-linux.git vfs-for-next
> ====
> 
> Currently, if we suddenly remove a PMEM device (by calling unbind) that
> contains an FSDAX filesystem while programs are still accessing data on
> it, e.g.:
> ```
>  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> the system can end up in an unacceptable state:
>   1. the device is gone but the mount point still exists, and umount
>        fails with "target is busy"
>   2. programs hang and cannot be killed
>   3. the kernel may crash with a NULL pointer dereference
> 
> To fix this, introduce an MF_MEM_PRE_REMOVE flag to let the filesystem
> know that the whole device is about to be removed, and make sure all
> related processes are notified so that they can exit gracefully.
> 
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of the dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask the
> filesystem on it to unmap all files in use, and to notify the processes
> that are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
> 
> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem and
> prevent new dax mappings from being created.  Do not shut down the
> filesystem directly if the configuration is not supported, or if the
> failure range includes a metadata area.  Make sure all files and
> processes (not only the current process) are handled correctly.  Also
> drop the cache of associated files before the pmem is removed.
> 
> [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/dax/super.c         |  3 +-
>  fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>  include/linux/mm.h          |  1 +
>  mm/memory-failure.c         | 17 +++++--
>  4 files changed, 109 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>  		return;
>  
>  	if (dax_dev->holder_data != NULL)
> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> +				MF_MEM_PRE_REMOVE);
>  
>  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>  	synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..6496c32a9172 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>  
>  #include <linux/mm.h>
>  #include <linux/dax.h>
> +#include <linux/fs.h>
>  
>  struct xfs_failure_info {
>  	xfs_agblock_t		startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>  	struct xfs_mount		*mp = cur->bc_mp;
>  	struct xfs_inode		*ip;
>  	struct xfs_failure_info		*notify = data;
> +	struct address_space		*mapping;
> +	pgoff_t				pgoff;
> +	unsigned long			pgcnt;
>  	int				error = 0;
>  
>  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>  	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> +		/* Continue the query because this isn't a failure. */
> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		notify->want_shutdown = true;
>  		return 0;
>  	}
> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>  		return 0;
>  	}
>  
> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> -				  xfs_failure_pgoff(mp, rec, notify),
> -				  xfs_failure_pgcnt(mp, rec, notify),
> -				  notify->mf_flags);
> +	mapping = VFS_I(ip)->i_mapping;
> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> +	/* Continue the rmap query if the inode isn't a dax file. */
> +	if (dax_mapping(mapping))
> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> +					  notify->mf_flags);
> +
> +	/* Invalidate the cache in dax pages. */
> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +		invalidate_inode_pages2_range(mapping, pgoff,
> +					      pgoff + pgcnt - 1);
> +
>  	xfs_irele(ip);
>  	return error;
>  }
>  
> +static int
> +xfs_dax_notify_failure_freeze(
> +	struct xfs_mount	*mp)
> +{
> +	struct super_block	*sb = mp->m_super;
> +	int			error;
> +
> +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> +	if (error)
> +		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> +
> +	return error;
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> +	struct xfs_mount	*mp,
> +	bool			kernel_frozen)
> +{
> +	struct super_block	*sb = mp->m_super;
> +	int			error;
> +
> +	if (!kernel_frozen) {
> +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> +		if (error)
> +			xfs_emerg(mp, "still frozen after notify failure, err=%d",
> +				error);
> +	}
> +
> +	/*
> +	 * Also thaw userspace call anyway because the device is about to be
> +	 * removed immediately.

Does a userspace freeze inhibit or otherwise break device removal?

> +	 */
> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +}
> +
>  static int
>  xfs_dax_notify_ddev_failure(
>  	struct xfs_mount	*mp,
> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>  	struct xfs_btree_cur	*cur = NULL;
>  	struct xfs_buf		*agf_bp = NULL;
>  	int			error = 0;
> +	bool			kernel_frozen = false;
>  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
>  	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
>  							     daddr + bblen - 1);
>  	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>  
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "Device is about to be removed!");
> +		/* Freeze fs to prevent new mappings from being created. */
> +		error = xfs_dax_notify_failure_freeze(mp);
> +		if (error) {
> +			/* Keep going on if filesystem is frozen by kernel. */
> +			if (error == -EBUSY)
> +				kernel_frozen = true;

EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
kernel-freezing the fs.  Someone else did, and they're expecting that
thaw_super will undo that.

	switch (error) {
	case -EBUSY:
		/* someone else froze the fs, keep going */
		break;
	case 0:
		/* we froze the fs */
		kernel_frozen = true;
		break;
	default:
		/* something else broke, should we continue anyway? */
		return error;
	}

TBH I wonder why all that isn't just:

	kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;

Since we'd want to keep going even if (say) the pmem was already
starting to fail and the freeze actually failed due to EIO, right?
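
A minimal userspace sketch of the simplification suggested above, assuming
(as described in this thread) that the freeze helper returns 0 when this
path froze the fs, -EBUSY when another kernel holder had already frozen it,
and some other errno when the freeze itself failed.  struct toy_sb and the
toy_*() functions are hypothetical stand-ins, not kernel APIs.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of a superblock's kernel-freeze state (not the real struct). */
struct toy_sb {
	bool kernel_frozen_by_other;	/* another kernel holder froze it */
	bool kernel_frozen_by_us;	/* this code path froze it */
	bool freeze_fails_eio;		/* simulate a failing device */
	int thaw_calls;			/* how often we dropped a kernel freeze */
	int teardown_runs;		/* how often the teardown work ran */
};

/* Stand-in for freeze_super(sb, FREEZE_HOLDER_KERNEL). */
static int toy_freeze(struct toy_sb *sb)
{
	if (sb->freeze_fails_eio)
		return -EIO;
	if (sb->kernel_frozen_by_other)
		return -EBUSY;
	sb->kernel_frozen_by_us = true;
	return 0;
}

/* Stand-in for thaw_super(sb, FREEZE_HOLDER_KERNEL). */
static void toy_thaw(struct toy_sb *sb)
{
	sb->kernel_frozen_by_us = false;
	sb->thaw_calls++;
}

/*
 * Pre-remove path: try to freeze, remember whether we own the freeze,
 * carry on with the teardown regardless of the freeze result, and thaw
 * only a freeze that we took ourselves.
 */
static void toy_pre_remove(struct toy_sb *sb)
{
	bool kernel_frozen = toy_freeze(sb) == 0;

	/* ... rmap walk, kill processes, invalidate mappings ... */
	sb->teardown_runs++;

	if (kernel_frozen)
		toy_thaw(sb);
}
```

In all three cases (we froze, someone else froze, freeze failed with EIO)
the teardown runs exactly once, and the kernel freeze is dropped only in
the first case, so another holder's freeze is never undone by this path.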

--D

> +			else
> +				return error;
> +		}
> +	}
> +
>  	error = xfs_trans_alloc_empty(mp, &tp);
>  	if (error)
> -		return error;
> +		goto out;
>  
>  	for (; agno <= end_agno; agno++) {
>  		struct xfs_rmap_irec	ri_low = { };
> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>  	}
>  
>  	xfs_trans_cancel(tp);
> +
> +	/*
> +	 * Determine how to shutdown the filesystem according to the
> +	 * error code and flags.
> +	 */
>  	if (error || notify.want_shutdown) {
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>  		if (!error)
>  			error = -EFSCORRUPTED;
> -	}
> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> +	/* Thaw the fs if it is frozen before. */
> +	if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> +
>  	return error;
>  }
>  
> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>  
>  	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>  	    mp->m_logdev_targp != mp->m_ddev_targp) {
> +		if (mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>  		return -EFSCORRUPTED;
> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>  	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>  	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>  
> +	/* Notify failure on the whole device. */
> +	if (offset == 0 && len == U64_MAX) {
> +		offset = ddev_start;
> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> +	}
> +
>  	/* Ignore the range out of filesystem area */
>  	if (offset + len - 1 < ddev_start)
>  		return -ENXIO;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 799836e84840..944a1165a321 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3577,6 +3577,7 @@ enum mf_flags {
>  	MF_UNPOISON = 1 << 4,
>  	MF_SW_SIMULATED = 1 << 5,
>  	MF_NO_RETRY = 1 << 6,
> +	MF_MEM_PRE_REMOVE = 1 << 7,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  		      unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index dc5ff7dd4e50..92f18c9e0aaf 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>   */
>  static void collect_procs_fsdax(struct page *page,
>  		struct address_space *mapping, pgoff_t pgoff,
> -		struct list_head *to_kill)
> +		struct list_head *to_kill, bool pre_remove)
>  {
>  	struct vm_area_struct *vma;
>  	struct task_struct *tsk;
> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>  	i_mmap_lock_read(mapping);
>  	read_lock(&tasklist_lock);
>  	for_each_process(tsk) {
> -		struct task_struct *t = task_early_kill(tsk, true);
> +		struct task_struct *t = tsk;
>  
> +		/*
> +		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> +		 * the current may not be the one accessing the fsdax page.
> +		 * Otherwise, search for the current task.
> +		 */
> +		if (!pre_remove)
> +			t = task_early_kill(tsk, true);
>  		if (!t)
>  			continue;
>  		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  	dax_entry_t cookie;
>  	struct page *page;
>  	size_t end = index + count;
> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>  
>  	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>  
> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  		if (!page)
>  			goto unlock;
>  
> -		SetPageHWPoison(page);
> +		if (!pre_remove)
> +			SetPageHWPoison(page);
>  
> -		collect_procs_fsdax(page, mapping, index, &to_kill);
> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>  		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>  				index, mf_flags);
>  unlock:
> -- 
> 2.41.0
> 
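
For reference, the task-selection change in the quoted collect_procs_fsdax()
hunk can be modelled roughly as below.  This is a simplified sketch under
stated assumptions: struct toy_task and toy_collect() are hypothetical, the
early_kill field collapses whatever task_early_kill() decides into a single
boolean, and maps_page stands in for the VMA interval-tree lookup.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of a task for illustration only. */
struct toy_task {
	int pid;
	bool early_kill;	/* would the early-kill filter select it? */
	bool maps_page;		/* does one of its VMAs map the page? */
};

/*
 * Fill to_kill[] with the pids that would be collected and return how
 * many were stored.  Without pre_remove only tasks accepted by the
 * early-kill filter are considered; with pre_remove every task that
 * maps the page is collected.  to_kill[] must have room for n entries.
 */
static size_t toy_collect(const struct toy_task *tasks, size_t n,
			  bool pre_remove, int *to_kill)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pre_remove && !tasks[i].early_kill)
			continue;
		if (tasks[i].maps_page)
			to_kill[count++] = tasks[i].pid;
	}
	return count;
}
```

The point of the pre_remove branch is that on device removal the caller
is the unbind path, not a task touching the page, so filtering by the
caller's identity would miss the processes that actually hold mappings.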

* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-08-23 23:36     ` Darrick J. Wong
@ 2023-08-24  9:41       ` Shiyang Ruan
  2023-08-24 23:57         ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-24  9:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof



On 2023/8/24 7:36, Darrick J. Wong wrote:
> On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
>> ====
>> Changes since v12:
>>   1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
>>   2. complete the behavior when fs has already been frozen by a kernel call
>>        NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
>>                I tried this proposal[0].
>>   3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
>>   4. rebase on: xfs/xfs-linux.git vfs-for-next
>> ====
>>
>> Currently, if we suddenly remove a PMEM device (by calling unbind) that
>> contains an FSDAX filesystem while programs are still accessing data on
>> it, e.g.:
>> ```
>>   $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>   # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>   echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>> ```
>> the system can end up in an unacceptable state:
>>    1. the device is gone but the mount point still exists, and umount
>>         fails with "target is busy"
>>    2. programs hang and cannot be killed
>>    3. the kernel may crash with a NULL pointer dereference
>>
>> To fix this, introduce an MF_MEM_PRE_REMOVE flag to let the filesystem
>> know that the whole device is about to be removed, and make sure all
>> related processes are notified so that they can exit gracefully.
>>
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> on it to unmap all files in use, and notify processes who are using
>> those files.
>>
>> Call trace:
>> trigger unbind
>>   -> unbind_store()
>>    -> ... (skip)
>>     -> devres_release_all()
>>      -> kill_dax()
>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>        -> xfs_dax_notify_failure()
>>        `-> freeze_super()             // freeze (kernel call)
>>        `-> do xfs rmap
>>        ` -> mf_dax_kill_procs()
>>        `  -> collect_procs_fsdax()    // all associated processes
>>        `  -> unmap_and_kill()
>>        ` -> invalidate_inode_pages2_range() // drop file's cache
>>        `-> thaw_super()               // thaw (both kernel & user call)
>>
>> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
>> new dax mappings from being created.  Do not shut down the filesystem
>> directly if the configuration is not supported, or if the failure range
>> includes a metadata area.  Make sure all files and processes (not only the
>> current process) are handled correctly.  Also drop the cache of associated
>> files before the pmem is removed.
>>
>> [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>>   drivers/dax/super.c         |  3 +-
>>   fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>>   include/linux/mm.h          |  1 +
>>   mm/memory-failure.c         | 17 +++++--
>>   4 files changed, 109 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index c4c4728a36e4..2e1a35e82fce 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>   		return;
>>   
>>   	if (dax_dev->holder_data != NULL)
>> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> +				MF_MEM_PRE_REMOVE);
>>   
>>   	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>   	synchronize_srcu(&dax_srcu);
>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>> index 4a9bbd3fe120..6496c32a9172 100644
>> --- a/fs/xfs/xfs_notify_failure.c
>> +++ b/fs/xfs/xfs_notify_failure.c
>> @@ -22,6 +22,7 @@
>>   
>>   #include <linux/mm.h>
>>   #include <linux/dax.h>
>> +#include <linux/fs.h>
>>   
>>   struct xfs_failure_info {
>>   	xfs_agblock_t		startblock;
>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>   	struct xfs_mount		*mp = cur->bc_mp;
>>   	struct xfs_inode		*ip;
>>   	struct xfs_failure_info		*notify = data;
>> +	struct address_space		*mapping;
>> +	pgoff_t				pgoff;
>> +	unsigned long			pgcnt;
>>   	int				error = 0;
>>   
>>   	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>   	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>> +		/* Continue the query because this isn't a failure. */
>> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>   		notify->want_shutdown = true;
>>   		return 0;
>>   	}
>> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>>   		return 0;
>>   	}
>>   
>> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>> -				  xfs_failure_pgoff(mp, rec, notify),
>> -				  xfs_failure_pgcnt(mp, rec, notify),
>> -				  notify->mf_flags);
>> +	mapping = VFS_I(ip)->i_mapping;
>> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
>> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>> +
>> +	/* Continue the rmap query if the inode isn't a dax file. */
>> +	if (dax_mapping(mapping))
>> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>> +					  notify->mf_flags);
>> +
>> +	/* Invalidate the cache in dax pages. */
>> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> +		invalidate_inode_pages2_range(mapping, pgoff,
>> +					      pgoff + pgcnt - 1);
>> +
>>   	xfs_irele(ip);
>>   	return error;
>>   }
>>   
>> +static int
>> +xfs_dax_notify_failure_freeze(
>> +	struct xfs_mount	*mp)
>> +{
>> +	struct super_block	*sb = mp->m_super;
>> +	int			error;
>> +
>> +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>> +	if (error)
>> +		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>> +
>> +	return error;
>> +}
>> +
>> +static void
>> +xfs_dax_notify_failure_thaw(
>> +	struct xfs_mount	*mp,
>> +	bool			kernel_frozen)
>> +{
>> +	struct super_block	*sb = mp->m_super;
>> +	int			error;
>> +
>> +	if (!kernel_frozen) {
>> +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>> +		if (error)
>> +			xfs_emerg(mp, "still frozen after notify failure, err=%d",
>> +				error);
>> +	}
>> +
>> +	/*
>> +	 * Also thaw userspace call anyway because the device is about to be
>> +	 * removed immediately.
> 
> Does a userspace freeze inhibit or otherwise break device removal?

It doesn't.  The device can still be removed.  But after that, the mount
point still exists, and `umount /mnt/scratch` fails with "target is busy".
`xfs_freeze -u /mnt/scratch` does not work either.

So I think the unconditional thaw_super() here is needed.


> 
>> +	 */
>> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>> +}
>> +
>>   static int
>>   xfs_dax_notify_ddev_failure(
>>   	struct xfs_mount	*mp,
>> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>>   	struct xfs_btree_cur	*cur = NULL;
>>   	struct xfs_buf		*agf_bp = NULL;
>>   	int			error = 0;
>> +	bool			kernel_frozen = false;
>>   	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>>   	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
>>   	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
>>   							     daddr + bblen - 1);
>>   	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>>   
>> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
>> +		xfs_info(mp, "Device is about to be removed!");
>> +		/* Freeze fs to prevent new mappings from being created. */
>> +		error = xfs_dax_notify_failure_freeze(mp);
>> +		if (error) {
>> +			/* Keep going on if filesystem is frozen by kernel. */
>> +			if (error == -EBUSY)
>> +				kernel_frozen = true;
> 
> EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
> kernel-freezing the fs.  Someone else did, and they're expecting that
> thaw_super will undo that.
> 
> 	switch (error) {
> 	case -EBUSY:
> 		/* someone else froze the fs, keep going */
> 		break;
> 	case 0:
> 		/* we froze the fs */
> 		kernel_frozen = true;
> 		break;
> 	default:
> 		/* something else broke, should we continue anyway? */
> 		return error;
> 	}
> 
> TBH I wonder why all that isn't just:
> 
> 	kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
> 
> Since we'd want to keep going even if (say) the pmem was already
> starting to fail and the freeze actually failed due to EIO, right?

Yes.  So _freeze() here is only a *try*.  Whatever its result is, we
continue.

Then I think `kernel_frozen` becomes useless as well, because we should
try to call both _thaw(KERNEL_CALL) and _thaw(USER_CALL) to make sure
umount can work after the device is gone.

So I think it's better to rename them from:
   `static int xfs_dax_notify_failure_freeze()`,
   `static void xfs_dax_notify_failure_thaw()`
to
   `static void xfs_dax_notify_failure_try_freeze()`,
   `static void xfs_dax_notify_failure_try_thaw()`.


--
Thanks,
Ruan.

> 
> --D
> 
>> +			else
>> +				return error;
>> +		}
>> +	}
>> +
>>   	error = xfs_trans_alloc_empty(mp, &tp);
>>   	if (error)
>> -		return error;
>> +		goto out;
>>   
>>   	for (; agno <= end_agno; agno++) {
>>   		struct xfs_rmap_irec	ri_low = { };
>> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>>   	}
>>   
>>   	xfs_trans_cancel(tp);
>> +
>> +	/*
>> +	 * Determine how to shutdown the filesystem according to the
>> +	 * error code and flags.
>> +	 */
>>   	if (error || notify.want_shutdown) {
>>   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>   		if (!error)
>>   			error = -EFSCORRUPTED;
>> -	}
>> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
>> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>> +
>> +out:
>> +	/* Thaw the fs if it is frozen before. */
>> +	if (mf_flags & MF_MEM_PRE_REMOVE)
>> +		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
>> +
>>   	return error;
>>   }
>>   
>> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>>   
>>   	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>   	    mp->m_logdev_targp != mp->m_ddev_targp) {
>> +		if (mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>   		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>   		return -EFSCORRUPTED;
>> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>>   	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>   	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>   
>> +	/* Notify failure on the whole device. */
>> +	if (offset == 0 && len == U64_MAX) {
>> +		offset = ddev_start;
>> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>> +	}
>> +
>>   	/* Ignore the range out of filesystem area */
>>   	if (offset + len - 1 < ddev_start)
>>   		return -ENXIO;
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 799836e84840..944a1165a321 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3577,6 +3577,7 @@ enum mf_flags {
>>   	MF_UNPOISON = 1 << 4,
>>   	MF_SW_SIMULATED = 1 << 5,
>>   	MF_NO_RETRY = 1 << 6,
>> +	MF_MEM_PRE_REMOVE = 1 << 7,
>>   };
>>   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   		      unsigned long count, int mf_flags);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index dc5ff7dd4e50..92f18c9e0aaf 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>    */
>>   static void collect_procs_fsdax(struct page *page,
>>   		struct address_space *mapping, pgoff_t pgoff,
>> -		struct list_head *to_kill)
>> +		struct list_head *to_kill, bool pre_remove)
>>   {
>>   	struct vm_area_struct *vma;
>>   	struct task_struct *tsk;
>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>   	i_mmap_lock_read(mapping);
>>   	read_lock(&tasklist_lock);
>>   	for_each_process(tsk) {
>> -		struct task_struct *t = task_early_kill(tsk, true);
>> +		struct task_struct *t = tsk;
>>   
>> +		/*
>> +		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
>> +		 * the current may not be the one accessing the fsdax page.
>> +		 * Otherwise, search for the current task.
>> +		 */
>> +		if (!pre_remove)
>> +			t = task_early_kill(tsk, true);
>>   		if (!t)
>>   			continue;
>>   		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   	dax_entry_t cookie;
>>   	struct page *page;
>>   	size_t end = index + count;
>> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>   
>>   	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>   
>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   		if (!page)
>>   			goto unlock;
>>   
>> -		SetPageHWPoison(page);
>> +		if (!pre_remove)
>> +			SetPageHWPoison(page);
>>   
>> -		collect_procs_fsdax(page, mapping, index, &to_kill);
>> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>   		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>   				index, mf_flags);
>>   unlock:
>> -- 
>> 2.41.0
>>

* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-08-24  9:41       ` Shiyang Ruan
@ 2023-08-24 23:57         ` Darrick J. Wong
  2023-08-25  3:52           ` Shiyang Ruan
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-24 23:57 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof

On Thu, Aug 24, 2023 at 05:41:50PM +0800, Shiyang Ruan wrote:
> 
> 
> On 2023/8/24 7:36, Darrick J. Wong wrote:
> > On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
> > > ====
> > > Changes since v12:
> > >   1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
> > >   2. complete the behavior when fs has already frozen by kernel call
> > >        NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
> > >                I tried this proposal[0].
> > >   3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
> > >   4. rebase on: xfs/xfs-linux.git vfs-for-next
> > > ====
> > > 
> > > Currently, if we suddenly remove a PMEM device (by calling unbind) that
> > > contains an FSDAX filesystem while programs are still accessing data on
> > > it, e.g.:
> > > ```
> > >   $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> > >   # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> > >   echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> > > ```
> > > the system can end up in an unacceptable state:
> > >    1. the device is gone but the mount point still exists, and umount
> > >         fails with "target is busy"
> > >    2. programs hang and cannot be killed
> > >    3. the kernel may crash with a NULL pointer dereference
> > >
> > > To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let the filesystem
> > > know that we are going to remove the whole device, and to make sure all
> > > related processes are notified so that they can exit gracefully.
> > > 
> > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > > on it to unmap all files in use, and notify processes who are using
> > > those files.
> > > 
> > > Call trace:
> > > trigger unbind
> > >   -> unbind_store()
> > >    -> ... (skip)
> > >     -> devres_release_all()
> > >      -> kill_dax()
> > >       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > >        -> xfs_dax_notify_failure()
> > >        `-> freeze_super()             // freeze (kernel call)
> > >        `-> do xfs rmap
> > >        ` -> mf_dax_kill_procs()
> > >        `  -> collect_procs_fsdax()    // all associated processes
> > >        `  -> unmap_and_kill()
> > >        ` -> invalidate_inode_pages2_range() // drop file's cache
> > >        `-> thaw_super()               // thaw (both kernel & user call)
> > > 
> > > Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
> > > event.  Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
> > > new dax mappings from being created.  Do not shut down the filesystem
> > > directly if the configuration is not supported, or if the failure range
> > > includes a metadata area.  Make sure all files and processes (not only the
> > > current process) are handled correctly.  Also drop the cache of associated
> > > files before the pmem is removed.
> > > 
> > > [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
> > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> > > 
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > ---
> > >   drivers/dax/super.c         |  3 +-
> > >   fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
> > >   include/linux/mm.h          |  1 +
> > >   mm/memory-failure.c         | 17 +++++--
> > >   4 files changed, 109 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > index c4c4728a36e4..2e1a35e82fce 100644
> > > --- a/drivers/dax/super.c
> > > +++ b/drivers/dax/super.c
> > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > >   		return;
> > >   	if (dax_dev->holder_data != NULL)
> > > -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > +				MF_MEM_PRE_REMOVE);
> > >   	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > >   	synchronize_srcu(&dax_srcu);
> > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > index 4a9bbd3fe120..6496c32a9172 100644
> > > --- a/fs/xfs/xfs_notify_failure.c
> > > +++ b/fs/xfs/xfs_notify_failure.c
> > > @@ -22,6 +22,7 @@
> > >   #include <linux/mm.h>
> > >   #include <linux/dax.h>
> > > +#include <linux/fs.h>
> > >   struct xfs_failure_info {
> > >   	xfs_agblock_t		startblock;
> > > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > >   	struct xfs_mount		*mp = cur->bc_mp;
> > >   	struct xfs_inode		*ip;
> > >   	struct xfs_failure_info		*notify = data;
> > > +	struct address_space		*mapping;
> > > +	pgoff_t				pgoff;
> > > +	unsigned long			pgcnt;
> > >   	int				error = 0;
> > >   	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > >   	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > > +		/* Continue the query because this isn't a failure. */
> > > +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > +			return 0;
> > >   		notify->want_shutdown = true;
> > >   		return 0;
> > >   	}
> > > @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
> > >   		return 0;
> > >   	}
> > > -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > > -				  xfs_failure_pgoff(mp, rec, notify),
> > > -				  xfs_failure_pgcnt(mp, rec, notify),
> > > -				  notify->mf_flags);
> > > +	mapping = VFS_I(ip)->i_mapping;
> > > +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> > > +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > > +
> > > +	/* Continue the rmap query if the inode isn't a dax file. */
> > > +	if (dax_mapping(mapping))
> > > +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > > +					  notify->mf_flags);
> > > +
> > > +	/* Invalidate the cache in dax pages. */
> > > +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > +		invalidate_inode_pages2_range(mapping, pgoff,
> > > +					      pgoff + pgcnt - 1);
> > > +
> > >   	xfs_irele(ip);
> > >   	return error;
> > >   }
> > > +static int
> > > +xfs_dax_notify_failure_freeze(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	struct super_block	*sb = mp->m_super;
> > > +	int			error;
> > > +
> > > +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> > > +	if (error)
> > > +		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> > > +
> > > +	return error;
> > > +}
> > > +
> > > +static void
> > > +xfs_dax_notify_failure_thaw(
> > > +	struct xfs_mount	*mp,
> > > +	bool			kernel_frozen)
> > > +{
> > > +	struct super_block	*sb = mp->m_super;
> > > +	int			error;
> > > +
> > > +	if (!kernel_frozen) {
> > > +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > > +		if (error)
> > > +			xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > > +				error);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Also thaw userspace call anyway because the device is about to be
> > > +	 * removed immediately.
> > 
> > Does a userspace freeze inhibit or otherwise break device removal?
> 
> It doesn't.  The device can still be removed.  But after that, the mount
> point still exists, and `umount /mnt/scratch` fails with "target is busy".
> `xfs_freeze -u /mnt/scratch` does not work either.

Yes, that's true, but that's long been the case for removing block
devices.  Should block device removal (since we now have hooks for
that!) also be breaking freezes?

> So I think the unconditional thaw_super() here is needed.
> 
> 
> > 
> > > +	 */
> > > +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > > +}
> > > +
> > >   static int
> > >   xfs_dax_notify_ddev_failure(
> > >   	struct xfs_mount	*mp,
> > > @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
> > >   	struct xfs_btree_cur	*cur = NULL;
> > >   	struct xfs_buf		*agf_bp = NULL;
> > >   	int			error = 0;
> > > +	bool			kernel_frozen = false;
> > >   	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
> > >   	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
> > >   	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
> > >   							     daddr + bblen - 1);
> > >   	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
> > > +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > +		xfs_info(mp, "Device is about to be removed!");
> > > +		/* Freeze fs to prevent new mappings from being created. */
> > > +		error = xfs_dax_notify_failure_freeze(mp);
> > > +		if (error) {
> > > +			/* Keep going on if filesystem is frozen by kernel. */
> > > +			if (error == -EBUSY)
> > > +				kernel_frozen = true;
> > 
> > EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
> > kernel-freezing the fs.  Someone else did, and they're expecting that
> > thaw_super will undo that.
> > 
> > 	switch (error) {
> > 	case -EBUSY:
> > 		/* someone else froze the fs, keep going */
> > 		break;
> > 	case 0:
> > 		/* we froze the fs */
> > 		kernel_frozen = true;
> > 		break;
> > 	default:
> > 		/* something else broke, should we continue anyway? */
> > 		return error;
> > 	}
> > 
> > TBH I wonder why all that isn't just:
> > 
> > 	kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
> > 
> > Since we'd want to keep going even if (say) the pmem was already
> > starting to fail and the freeze actually failed due to EIO, right?
> 
> Yes.  So _freeze() here is only a *try*.  Whatever its result is, we
> continue.
> 
> Then I think `kernel_frozen` becomes useless as well, because we should
> try to call both _thaw(KERNEL_CALL) and _thaw(USER_CALL) to make sure
> umount can work after the device is gone.

I disagree -- unlike the mess that is userspace freezing, kernel code
that obtained a kernel freeze will get very confused and potentially do
Seriously Bad Things if the kernel freeze is yanked out from under it.
Kernel code is not supposed to release things that it did not itself
obtain.

That might not ultimately matter for the narrow case of the device going
away, but the two other use cases (online fsck and suspend) will
malfunction if you drop a kernel freeze that they obtained.

I don't mind if PREREMOVE can't get a freeze and keeps going with the
invalidations anyway.  We did our best, and when the pmem goes away we
can just kill -9 down the processes.
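
To illustrate the ownership rule above, here is a minimal userspace model,
not kernel code.  All `model_*` names are hypothetical.  The -EBUSY result
for a second freeze by the same holder type follows the discussion in this
thread; the -EINVAL result for a thaw by a non-holder is an assumption of
the model.  The pre-remove path tries to freeze, invalidates regardless of
the result, and thaws only a freeze that it took itself:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is not the kernel's freeze_super() API. */
enum model_holder {
	MODEL_HOLDER_KERNEL    = 1 << 0,
	MODEL_HOLDER_USERSPACE = 1 << 1,
};

struct model_sb {
	unsigned int	freeze_holders;	/* bitmask of current freeze holders */
	bool		invalidated;	/* set once mappings have been dropped */
};

/* Take a freeze for @who; -EBUSY if that holder type already has one. */
int model_freeze(struct model_sb *sb, enum model_holder who)
{
	if (sb->freeze_holders & who)
		return -EBUSY;
	sb->freeze_holders |= who;
	return 0;
}

/* Drop the freeze held by @who; -EINVAL if @who holds nothing (assumed). */
int model_thaw(struct model_sb *sb, enum model_holder who)
{
	if (!(sb->freeze_holders & who))
		return -EINVAL;
	sb->freeze_holders &= ~(unsigned int)who;
	return 0;
}

/* Best-effort freeze, always invalidate, thaw only what we obtained. */
void model_pre_remove(struct model_sb *sb)
{
	bool	kernel_frozen = model_freeze(sb, MODEL_HOLDER_KERNEL) == 0;

	/* Stands in for the rmap walk, unmap and cache invalidation. */
	sb->invalidated = true;

	if (kernel_frozen)
		model_thaw(sb, MODEL_HOLDER_KERNEL);
}
```

If another kernel user already holds the freeze, model_pre_remove() still
invalidates but leaves that freeze in place for its owner to release.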

--D

> So I think it's better to rename them from:
>   `static int xfs_dax_notify_failure_freeze()`,
>   `static void xfs_dax_notify_failure_thaw()`
> to
>   `static void xfs_dax_notify_failure_try_freeze()`,
>   `static void xfs_dax_notify_failure_try_thaw()`.
> 
> 
> --
> Thanks,
> Ruan.
> 
> > 
> > --D
> > 
> > > +			else
> > > +				return error;
> > > +		}
> > > +	}
> > > +
> > >   	error = xfs_trans_alloc_empty(mp, &tp);
> > >   	if (error)
> > > -		return error;
> > > +		goto out;
> > >   	for (; agno <= end_agno; agno++) {
> > >   		struct xfs_rmap_irec	ri_low = { };
> > > @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
> > >   	}
> > >   	xfs_trans_cancel(tp);
> > > +
> > > +	/*
> > > +	 * Determine how to shutdown the filesystem according to the
> > > +	 * error code and flags.
> > > +	 */
> > >   	if (error || notify.want_shutdown) {
> > >   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > >   		if (!error)
> > >   			error = -EFSCORRUPTED;
> > > -	}
> > > +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
> > > +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > > +
> > > +out:
> > > +	/* Thaw the fs if it is frozen before. */
> > > +	if (mf_flags & MF_MEM_PRE_REMOVE)
> > > +		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> > > +
> > >   	return error;
> > >   }
> > > @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
> > >   	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > >   	    mp->m_logdev_targp != mp->m_ddev_targp) {
> > > +		if (mf_flags & MF_MEM_PRE_REMOVE)
> > > +			return 0;
> > >   		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > >   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > >   		return -EFSCORRUPTED;
> > > @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
> > >   	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > >   	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > > +	/* Notify failure on the whole device. */
> > > +	if (offset == 0 && len == U64_MAX) {
> > > +		offset = ddev_start;
> > > +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > > +	}
> > > +
> > >   	/* Ignore the range out of filesystem area */
> > >   	if (offset + len - 1 < ddev_start)
> > >   		return -ENXIO;
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 799836e84840..944a1165a321 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -3577,6 +3577,7 @@ enum mf_flags {
> > >   	MF_UNPOISON = 1 << 4,
> > >   	MF_SW_SIMULATED = 1 << 5,
> > >   	MF_NO_RETRY = 1 << 6,
> > > +	MF_MEM_PRE_REMOVE = 1 << 7,
> > >   };
> > >   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > >   		      unsigned long count, int mf_flags);
> > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > index dc5ff7dd4e50..92f18c9e0aaf 100644
> > > --- a/mm/memory-failure.c
> > > +++ b/mm/memory-failure.c
> > > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> > >    */
> > >   static void collect_procs_fsdax(struct page *page,
> > >   		struct address_space *mapping, pgoff_t pgoff,
> > > -		struct list_head *to_kill)
> > > +		struct list_head *to_kill, bool pre_remove)
> > >   {
> > >   	struct vm_area_struct *vma;
> > >   	struct task_struct *tsk;
> > > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> > >   	i_mmap_lock_read(mapping);
> > >   	read_lock(&tasklist_lock);
> > >   	for_each_process(tsk) {
> > > -		struct task_struct *t = task_early_kill(tsk, true);
> > > +		struct task_struct *t = tsk;
> > > +		/*
> > > +		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> > > +		 * the current may not be the one accessing the fsdax page.
> > > +		 * Otherwise, search for the current task.
> > > +		 */
> > > +		if (!pre_remove)
> > > +			t = task_early_kill(tsk, true);
> > >   		if (!t)
> > >   			continue;
> > >   		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > >   	dax_entry_t cookie;
> > >   	struct page *page;
> > >   	size_t end = index + count;
> > > +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > >   	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > >   		if (!page)
> > >   			goto unlock;
> > > -		SetPageHWPoison(page);
> > > +		if (!pre_remove)
> > > +			SetPageHWPoison(page);
> > > -		collect_procs_fsdax(page, mapping, index, &to_kill);
> > > +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > >   		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > >   				index, mf_flags);
> > >   unlock:
> > > -- 
> > > 2.41.0
> > > 

* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-08-24 23:57         ` Darrick J. Wong
@ 2023-08-25  3:52           ` Shiyang Ruan
  2023-08-26  0:17             ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-25  3:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof



On 2023/8/25 7:57, Darrick J. Wong wrote:
> On Thu, Aug 24, 2023 at 05:41:50PM +0800, Shiyang Ruan wrote:
>>
>>
>> On 2023/8/24 7:36, Darrick J. Wong wrote:
>>> On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
>>>> ====
>>>> Changes since v12:
>>>>    1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
>>>>    2. complete the behavior when fs has already frozen by kernel call
>>>>         NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
>>>>                 I tried this proposal[0].
>>>>    3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
>>>>    4. rebase on: xfs/xfs-linux.git vfs-for-next
>>>> ====
>>>>
>>>> Currently, if we suddenly remove a PMEM device (by calling unbind) that
>>>> contains an FSDAX filesystem while programs are still accessing data on
>>>> it, e.g.:
>>>> ```
>>>>    $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>>>    # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>>>    echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>>>> ```
>>>> the system can end up in an unacceptable state:
>>>>     1. the device is gone but the mount point still exists, and umount
>>>>          fails with "target is busy"
>>>>     2. programs hang and cannot be killed
>>>>     3. the kernel may crash with a NULL pointer dereference
>>>>
>>>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let the filesystem
>>>> know that we are going to remove the whole device, and to make sure all
>>>> related processes are notified so that they can exit gracefully.
>>>>
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> on it to unmap all files in use, and notify processes who are using
>>>> those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>>    -> unbind_store()
>>>>     -> ... (skip)
>>>>      -> devres_release_all()
>>>>       -> kill_dax()
>>>>        -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>>         -> xfs_dax_notify_failure()
>>>>         `-> freeze_super()             // freeze (kernel call)
>>>>         `-> do xfs rmap
>>>>         ` -> mf_dax_kill_procs()
>>>>         `  -> collect_procs_fsdax()    // all associated processes
>>>>         `  -> unmap_and_kill()
>>>>         ` -> invalidate_inode_pages2_range() // drop file's cache
>>>>         `-> thaw_super()               // thaw (both kernel & user call)
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
>>>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
>>>> new dax mappings from being created.  Do not shut down the filesystem
>>>> directly if the configuration is not supported, or if the failure range
>>>> includes a metadata area.  Make sure all files and processes (not only the
>>>> current process) are handled correctly.  Also drop the cache of associated
>>>> files before the pmem is removed.
>>>>
>>>> [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> ---
>>>>    drivers/dax/super.c         |  3 +-
>>>>    fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>>>>    include/linux/mm.h          |  1 +
>>>>    mm/memory-failure.c         | 17 +++++--
>>>>    4 files changed, 109 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>>> index c4c4728a36e4..2e1a35e82fce 100644
>>>> --- a/drivers/dax/super.c
>>>> +++ b/drivers/dax/super.c
>>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>>>    		return;
>>>>    	if (dax_dev->holder_data != NULL)
>>>> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>>> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>>> +				MF_MEM_PRE_REMOVE);
>>>>    	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>>>    	synchronize_srcu(&dax_srcu);
>>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>>> index 4a9bbd3fe120..6496c32a9172 100644
>>>> --- a/fs/xfs/xfs_notify_failure.c
>>>> +++ b/fs/xfs/xfs_notify_failure.c
>>>> @@ -22,6 +22,7 @@
>>>>    #include <linux/mm.h>
>>>>    #include <linux/dax.h>
>>>> +#include <linux/fs.h>
>>>>    struct xfs_failure_info {
>>>>    	xfs_agblock_t		startblock;
>>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>>>    	struct xfs_mount		*mp = cur->bc_mp;
>>>>    	struct xfs_inode		*ip;
>>>>    	struct xfs_failure_info		*notify = data;
>>>> +	struct address_space		*mapping;
>>>> +	pgoff_t				pgoff;
>>>> +	unsigned long			pgcnt;
>>>>    	int				error = 0;
>>>>    	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>>>    	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>>>> +		/* Continue the query because this isn't a failure. */
>>>> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> +			return 0;
>>>>    		notify->want_shutdown = true;
>>>>    		return 0;
>>>>    	}
>>>> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>>>>    		return 0;
>>>>    	}
>>>> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>>> -				  xfs_failure_pgoff(mp, rec, notify),
>>>> -				  xfs_failure_pgcnt(mp, rec, notify),
>>>> -				  notify->mf_flags);
>>>> +	mapping = VFS_I(ip)->i_mapping;
>>>> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
>>>> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>>> +
>>>> +	/* Continue the rmap query if the inode isn't a dax file. */
>>>> +	if (dax_mapping(mapping))
>>>> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>>> +					  notify->mf_flags);
>>>> +
>>>> +	/* Invalidate the cache in dax pages. */
>>>> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> +		invalidate_inode_pages2_range(mapping, pgoff,
>>>> +					      pgoff + pgcnt - 1);
>>>> +
>>>>    	xfs_irele(ip);
>>>>    	return error;
>>>>    }
>>>> +static int
>>>> +xfs_dax_notify_failure_freeze(
>>>> +	struct xfs_mount	*mp)
>>>> +{
>>>> +	struct super_block	*sb = mp->m_super;
>>>> +	int			error;
>>>> +
>>>> +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>>>> +	if (error)
>>>> +		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>>>> +
>>>> +	return error;
>>>> +}
>>>> +
>>>> +static void
>>>> +xfs_dax_notify_failure_thaw(
>>>> +	struct xfs_mount	*mp,
>>>> +	bool			kernel_frozen)
>>>> +{
>>>> +	struct super_block	*sb = mp->m_super;
>>>> +	int			error;
>>>> +
>>>> +	if (!kernel_frozen) {
>>>> +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>>>> +		if (error)
>>>> +			xfs_emerg(mp, "still frozen after notify failure, err=%d",
>>>> +				error);
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Also thaw userspace call anyway because the device is about to be
>>>> +	 * removed immediately.
>>>
>>> Does a userspace freeze inhibit or otherwise break device removal?
>>
>> It doesn't.  Device can be removed.  But after that, the mount point still
>> exists, and `umount /mnt/scratch` fails with "target is busy." `xfs_freeze
>> -u /mnt/scratch` cannot work too.
> 
> Yes, that's true, but that's long been the case for removing block
> devices.  Should block device removal (since we now have hooks for
> that!) also be breaking freezes?

I think so, but that will take more time to implement.  Shall we leave it
for a later optimization?

> 
>> So, I think thaw_super() anyway here is needed.
>>
>>
>>>
>>>> +	 */
>>>> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>>>> +}
>>>> +
>>>>    static int
>>>>    xfs_dax_notify_ddev_failure(
>>>>    	struct xfs_mount	*mp,
>>>> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>>>>    	struct xfs_btree_cur	*cur = NULL;
>>>>    	struct xfs_buf		*agf_bp = NULL;
>>>>    	int			error = 0;
>>>> +	bool			kernel_frozen = false;
>>>>    	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>>>>    	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
>>>>    	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
>>>>    							     daddr + bblen - 1);
>>>>    	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>>>> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
>>>> +		xfs_info(mp, "Device is about to be removed!");
>>>> +		/* Freeze fs to prevent new mappings from being created. */
>>>> +		error = xfs_dax_notify_failure_freeze(mp);
>>>> +		if (error) {
>>>> +			/* Keep going on if filesystem is frozen by kernel. */
>>>> +			if (error == -EBUSY)
>>>> +				kernel_frozen = true;
>>>
>>> EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
>>> kernel-freezing the fs.  Someone else did, and they're expecting that
>>> thaw_super will undo that.
>>>
>>> 	switch (error) {
>>> 	case -EBUSY:
>>> 		/* someone else froze the fs, keep going */
>>> 		break;
>>> 	case 0:
>>> 		/* we froze the fs */
>>> 		kernel_frozen = true;
>>> 		break;
>>> 	default:
>>> 		/* something else broke, should we continue anyway? */
>>> 		return error;
>>> 	}
>>>
>>> TBH I wonder why all that isn't just:
>>>
>>> 	kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
>>>
>>> Since we'd want to keep going even if (say) the pmem was already
>>> starting to fail and the freeze actually failed due to EIO, right?
>>
>> Yes.  So we can say it is a *try* to _freeze() here.  No matter what its
>> result is, we continue.
>>
>> Then I think the `kernel_frozen` becomes useless as well.  Because we should
>> try to call both _thaw(KERNEL_CALL) and _thaw(USER_CALL) to make sure umount
>> can work after device is gone.
> 
> I disagree -- unlike the mess that is userspace freezing, kernel code
> that obtained a kernel freeze will get very confused and potentially do
> Seriously Bad Things if the kernel freeze is yanked out from under them.
> Kernel code is not supposed to release things that they did not
> themselves obtain.
> 
> That might not ultimately matter for the narrow case of the device going
> away, but the two other usecases (online fsck and suspend) will
> malfunction if you drop a kernel freeze that they obtained.

Could online fsck and suspend keep working even after
`xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);` has been called?

> 
> I don't mind if PREREMOVE can't get a freeze and keeps going with the
> invalidations anyway.  We did our best, and when the pmem goes away we
> can just kill -9 down the processes.

Ok, I agree.

Then, one last thing I would like to confirm:
On my host, if the freeze is not thawed after the device is gone, the
processes keep waiting and cannot be killed manually with `kill -9`.
Is there another way to kill those processes?


--
Thanks,
Ruan.

> 
> --D
> 
>> Then, I think it's better to change them:
>>    `static int xfs_dax_notify_failure_freeze()`,
>>    `static void xfs_dax_notify_failure_thaw()`
>> to
>>    `static void xfs_dax_notify_failure_try_freeze()`,
>>    `static void xfs_dax_notify_failure_try_thaw()`.
>>
>>
>> --
>> Thanks,
>> Ruan.
>>
>>>
>>> --D
>>>
>>>> +			else
>>>> +				return error;
>>>> +		}
>>>> +	}
>>>> +
>>>>    	error = xfs_trans_alloc_empty(mp, &tp);
>>>>    	if (error)
>>>> -		return error;
>>>> +		goto out;
>>>>    	for (; agno <= end_agno; agno++) {
>>>>    		struct xfs_rmap_irec	ri_low = { };
>>>> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>>>>    	}
>>>>    	xfs_trans_cancel(tp);
>>>> +
>>>> +	/*
>>>> +	 * Determine how to shutdown the filesystem according to the
>>>> +	 * error code and flags.
>>>> +	 */
>>>>    	if (error || notify.want_shutdown) {
>>>>    		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>>    		if (!error)
>>>>    			error = -EFSCORRUPTED;
>>>> -	}
>>>> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>>> +
>>>> +out:
>>>> +	/* Thaw the fs if it is frozen before. */
>>>> +	if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> +		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
>>>> +
>>>>    	return error;
>>>>    }
>>>> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>>>>    	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>>>    	    mp->m_logdev_targp != mp->m_ddev_targp) {
>>>> +		if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> +			return 0;
>>>>    		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>>>    		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>>    		return -EFSCORRUPTED;
>>>> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>>>>    	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>>>    	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>>> +	/* Notify failure on the whole device. */
>>>> +	if (offset == 0 && len == U64_MAX) {
>>>> +		offset = ddev_start;
>>>> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>>> +	}
>>>> +
>>>>    	/* Ignore the range out of filesystem area */
>>>>    	if (offset + len - 1 < ddev_start)
>>>>    		return -ENXIO;
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 799836e84840..944a1165a321 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3577,6 +3577,7 @@ enum mf_flags {
>>>>    	MF_UNPOISON = 1 << 4,
>>>>    	MF_SW_SIMULATED = 1 << 5,
>>>>    	MF_NO_RETRY = 1 << 6,
>>>> +	MF_MEM_PRE_REMOVE = 1 << 7,
>>>>    };
>>>>    int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>>    		      unsigned long count, int mf_flags);
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index dc5ff7dd4e50..92f18c9e0aaf 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>>>     */
>>>>    static void collect_procs_fsdax(struct page *page,
>>>>    		struct address_space *mapping, pgoff_t pgoff,
>>>> -		struct list_head *to_kill)
>>>> +		struct list_head *to_kill, bool pre_remove)
>>>>    {
>>>>    	struct vm_area_struct *vma;
>>>>    	struct task_struct *tsk;
>>>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>>>    	i_mmap_lock_read(mapping);
>>>>    	read_lock(&tasklist_lock);
>>>>    	for_each_process(tsk) {
>>>> -		struct task_struct *t = task_early_kill(tsk, true);
>>>> +		struct task_struct *t = tsk;
>>>> +		/*
>>>> +		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
>>>> +		 * the current may not be the one accessing the fsdax page.
>>>> +		 * Otherwise, search for the current task.
>>>> +		 */
>>>> +		if (!pre_remove)
>>>> +			t = task_early_kill(tsk, true);
>>>>    		if (!t)
>>>>    			continue;
>>>>    		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>>    	dax_entry_t cookie;
>>>>    	struct page *page;
>>>>    	size_t end = index + count;
>>>> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>>>    	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>>    		if (!page)
>>>>    			goto unlock;
>>>> -		SetPageHWPoison(page);
>>>> +		if (!pre_remove)
>>>> +			SetPageHWPoison(page);
>>>> -		collect_procs_fsdax(page, mapping, index, &to_kill);
>>>> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>>>    		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>>>    				index, mf_flags);
>>>>    unlock:
>>>> -- 
>>>> 2.41.0
>>>>


* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-08-25  3:52           ` Shiyang Ruan
@ 2023-08-26  0:17             ` Darrick J. Wong
  0 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-26  0:17 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof

On Fri, Aug 25, 2023 at 11:52:35AM +0800, Shiyang Ruan wrote:
> 
> 
> On 2023/8/25 7:57, Darrick J. Wong wrote:
> > On Thu, Aug 24, 2023 at 05:41:50PM +0800, Shiyang Ruan wrote:
> > > 
> > > 
> > > On 2023/8/24 7:36, Darrick J. Wong wrote:
> > > > On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
> > > > > ====
> > > > > Changes since v12:
> > > > >    1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
> > > > >    2. complete the behavior when fs has already frozen by kernel call
> > > > >         NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
> > > > >                 I tried this proposal[0].
> > > > >    3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
> > > > >    4. rebase on: xfs/xfs-linux.git vfs-for-next
> > > > > ====
> > > > > 
> > > > > Now, if we suddenly remove a PMEM device(by calling unbind) which
> > > > > contains FSDAX while programs are still accessing data in this device,
> > > > > e.g.:
> > > > > ```
> > > > >    $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> > > > >    # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> > > > >    echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> > > > > ```
> > > > > it could come into an unacceptable state:
> > > > >     1. device has gone but mount point still exists, and umount will fail
> > > > >          with "target is busy"
> > > > >     2. programs will hang and cannot be killed
> > > > >     3. may crash with NULL pointer dereference
> > > > > 
> > > > > To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> > > > > are going to remove the whole device, and make sure all related processes
> > > > > could be notified so that they could end up gracefully.
> > > > > 
> > > > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > > > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > > > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > > > > on it to unmap all files in use, and notify processes who are using
> > > > > those files.
> > > > > 
> > > > > Call trace:
> > > > > trigger unbind
> > > > >    -> unbind_store()
> > > > >     -> ... (skip)
> > > > >      -> devres_release_all()
> > > > >       -> kill_dax()
> > > > >        -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > > > >         -> xfs_dax_notify_failure()
> > > > >         `-> freeze_super()             // freeze (kernel call)
> > > > >         `-> do xfs rmap
> > > > >         ` -> mf_dax_kill_procs()
> > > > >         `  -> collect_procs_fsdax()    // all associated processes
> > > > >         `  -> unmap_and_kill()
> > > > >         ` -> invalidate_inode_pages2_range() // drop file's cache
> > > > >         `-> thaw_super()               // thaw (both kernel & user call)
> > > > > 
> > > > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > > > > event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> > > > > new dax mapping from being created.  Do not shutdown filesystem directly
> > > > > if configuration is not supported, or if failure range includes metadata
> > > > > area.  Make sure all files and processes(not only the current progress)
> > > > > are handled correctly.  Also drop the cache of associated files before
> > > > > pmem is removed.
> > > > > 
> > > > > [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
> > > > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > > > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> > > > > 
> > > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > > ---
> > > > >    drivers/dax/super.c         |  3 +-
> > > > >    fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
> > > > >    include/linux/mm.h          |  1 +
> > > > >    mm/memory-failure.c         | 17 +++++--
> > > > >    4 files changed, 109 insertions(+), 11 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > > > index c4c4728a36e4..2e1a35e82fce 100644
> > > > > --- a/drivers/dax/super.c
> > > > > +++ b/drivers/dax/super.c
> > > > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > > > >    		return;
> > > > >    	if (dax_dev->holder_data != NULL)
> > > > > -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > > > +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > > > +				MF_MEM_PRE_REMOVE);
> > > > >    	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > > >    	synchronize_srcu(&dax_srcu);
> > > > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > > > index 4a9bbd3fe120..6496c32a9172 100644
> > > > > --- a/fs/xfs/xfs_notify_failure.c
> > > > > +++ b/fs/xfs/xfs_notify_failure.c
> > > > > @@ -22,6 +22,7 @@
> > > > >    #include <linux/mm.h>
> > > > >    #include <linux/dax.h>
> > > > > +#include <linux/fs.h>
> > > > >    struct xfs_failure_info {
> > > > >    	xfs_agblock_t		startblock;
> > > > > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > > > >    	struct xfs_mount		*mp = cur->bc_mp;
> > > > >    	struct xfs_inode		*ip;
> > > > >    	struct xfs_failure_info		*notify = data;
> > > > > +	struct address_space		*mapping;
> > > > > +	pgoff_t				pgoff;
> > > > > +	unsigned long			pgcnt;
> > > > >    	int				error = 0;
> > > > >    	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > > > >    	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > > > > +		/* Continue the query because this isn't a failure. */
> > > > > +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +			return 0;
> > > > >    		notify->want_shutdown = true;
> > > > >    		return 0;
> > > > >    	}
> > > > > @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
> > > > >    		return 0;
> > > > >    	}
> > > > > -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > > > > -				  xfs_failure_pgoff(mp, rec, notify),
> > > > > -				  xfs_failure_pgcnt(mp, rec, notify),
> > > > > -				  notify->mf_flags);
> > > > > +	mapping = VFS_I(ip)->i_mapping;
> > > > > +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> > > > > +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > > > > +
> > > > > +	/* Continue the rmap query if the inode isn't a dax file. */
> > > > > +	if (dax_mapping(mapping))
> > > > > +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > > > > +					  notify->mf_flags);
> > > > > +
> > > > > +	/* Invalidate the cache in dax pages. */
> > > > > +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +		invalidate_inode_pages2_range(mapping, pgoff,
> > > > > +					      pgoff + pgcnt - 1);
> > > > > +
> > > > >    	xfs_irele(ip);
> > > > >    	return error;
> > > > >    }
> > > > > +static int
> > > > > +xfs_dax_notify_failure_freeze(
> > > > > +	struct xfs_mount	*mp)
> > > > > +{
> > > > > +	struct super_block	*sb = mp->m_super;
> > > > > +	int			error;
> > > > > +
> > > > > +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> > > > > +	if (error)
> > > > > +		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> > > > > +
> > > > > +	return error;
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +xfs_dax_notify_failure_thaw(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	bool			kernel_frozen)
> > > > > +{
> > > > > +	struct super_block	*sb = mp->m_super;
> > > > > +	int			error;
> > > > > +
> > > > > +	if (!kernel_frozen) {
> > > > > +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > > > > +		if (error)
> > > > > +			xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > > > > +				error);
> > > > > +	}
> > > > > +
> > > > > +	/*
> > > > > +	 * Also thaw userspace call anyway because the device is about to be
> > > > > +	 * removed immediately.
> > > > 
> > > > Does a userspace freeze inhibit or otherwise break device removal?
> > > 
> > > It doesn't.  Device can be removed.  But after that, the mount point still
> > > exists, and `umount /mnt/scratch` fails with "target is busy." `xfs_freeze
> > > -u /mnt/scratch` cannot work too.
> > 
> > Yes, that's true, but that's long been the case for removing block
> > devices.  Should block device removal (since we now have hooks for
> > that!) also be breaking freezes?
> 
> I think so.  But it may need more time to accomplish.  Shall we leave it for
> later optimization?

Yeah, I think patching the block layer is a separate patch.

> > 
> > > So, I think thaw_super() anyway here is needed.
> > > 
> > > 
> > > > 
> > > > > +	 */
> > > > > +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > > > > +}
> > > > > +
> > > > >    static int
> > > > >    xfs_dax_notify_ddev_failure(
> > > > >    	struct xfs_mount	*mp,
> > > > > @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
> > > > >    	struct xfs_btree_cur	*cur = NULL;
> > > > >    	struct xfs_buf		*agf_bp = NULL;
> > > > >    	int			error = 0;
> > > > > +	bool			kernel_frozen = false;
> > > > >    	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
> > > > >    	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
> > > > >    	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
> > > > >    							     daddr + bblen - 1);
> > > > >    	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
> > > > > +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > > > +		xfs_info(mp, "Device is about to be removed!");
> > > > > +		/* Freeze fs to prevent new mappings from being created. */
> > > > > +		error = xfs_dax_notify_failure_freeze(mp);
> > > > > +		if (error) {
> > > > > +			/* Keep going on if filesystem is frozen by kernel. */
> > > > > +			if (error == -EBUSY)
> > > > > +				kernel_frozen = true;
> > > > 
> > > > EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
> > > > kernel-freezing the fs.  Someone else did, and they're expecting that
> > > > thaw_super will undo that.
> > > > 
> > > > 	switch (error) {
> > > > 	case -EBUSY:
> > > > 		/* someone else froze the fs, keep going */
> > > > 		break;
> > > > 	case 0:
> > > > 		/* we froze the fs */
> > > > 		kernel_frozen = true;
> > > > 		break;
> > > > 	default:
> > > > 		/* something else broke, should we continue anyway? */
> > > > 		return error;
> > > > 	}
> > > > 
> > > > TBH I wonder why all that isn't just:
> > > > 
> > > > 	kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
> > > > 
> > > > Since we'd want to keep going even if (say) the pmem was already
> > > > starting to fail and the freeze actually failed due to EIO, right?
> > > 
> > > Yes.  So we can say it is a *try* to _freeze() here.  No matter what its
> > > result is, we continue.
> > > 
> > > Then I think the `kernel_frozen` becomes useless as well.  Because we should
> > > try to call both _thaw(KERNEL_CALL) and _thaw(USER_CALL) to make sure umount
> > > can work after device is gone.
> > 
> > I disagree -- unlike the mess that is userspace freezing, kernel code
> > that obtained a kernel freeze will get very confused and potentially do
> > Seriously Bad Things if the kernel freeze is yanked out from under them.
> > Kernel code is not supposed to release things that they did not
> > themselves obtain.
> > 
> > That might not ultimately matter for the narrow case of the device going
> > away, but the two other usecases (online fsck and suspend) will
> > malfunction if you drop a kernel freeze that they obtained.
> 
> Could online fsck and suspend keep working even after
> `xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);` being called?

It's likely to go down with the filesystem, but the point of the kernel
freeze is that the freeze should be brief and undone by the same
function on its way out.  Hence PREREMOVE shouldn't be releasing
something that was obtained by another (running) thread, just like any
other resource.

> > 
> > I don't mind if PREREMOVE can't get a freeze and keeps going with the
> > invalidations anyway.  We did our best, and when the pmem goes away we
> > can just kill -9 down the processes.
> 
> Ok, I agree.
> 
> Then, the last thing I want to be confirmed:
> On my host, if the freeze state wasn't _thaw() after device gone, the
> processes will keep on waiting and cannot be killed by `kill -9` manually.
> Is there another way to make the processes killed?

No, I don't think there is.  FWIW I'm ok with you moving on to the
invalidation part if something else has frozen the fs; and I'm also ok
with the unconditional thaw_super(sb, FREEZE_HOLDER_USERSPACE).
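
To make the behaviour discussed above concrete, here is a minimal
userspace sketch in C.  It is not the kernel code: it assumes a simplified
model in which only one kernel freeze holder can exist at a time, and
`struct fake_sb`, `fake_freeze_kernel()`, `fake_thaw_kernel()`,
`fake_thaw_user()` and `pre_remove()` are hypothetical names invented for
illustration.  It shows PREREMOVE trying the kernel freeze, carrying on
whatever the result, thawing the kernel freeze only if this call obtained
it, and always dropping the userspace freeze.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-superblock freeze state. */
struct fake_sb {
	bool kernel_frozen;	/* some kernel holder owns a freeze */
	bool user_frozen;	/* userspace (e.g. xfs_freeze -f) froze the fs */
};

/* Simplified model of freeze_super(sb, FREEZE_HOLDER_KERNEL). */
static int fake_freeze_kernel(struct fake_sb *sb)
{
	if (sb->kernel_frozen)
		return -EBUSY;	/* another kernel holder got there first */
	sb->kernel_frozen = true;
	return 0;
}

static void fake_thaw_kernel(struct fake_sb *sb) { sb->kernel_frozen = false; }
static void fake_thaw_user(struct fake_sb *sb) { sb->user_frozen = false; }

/*
 * Pre-remove handling: try to freeze, keep going regardless of the result,
 * and on the way out release only the kernel freeze obtained here.  The
 * userspace freeze is dropped unconditionally.
 */
static void pre_remove(struct fake_sb *sb)
{
	bool we_froze;

	assert(sb);
	we_froze = fake_freeze_kernel(sb) == 0;

	/* ... invalidate mappings and shut down the filesystem here ... */

	if (we_froze)
		fake_thaw_kernel(sb);
	fake_thaw_user(sb);
}
```

In this model, calling pre_remove() on a superblock that another kernel
holder has already frozen leaves that holder's freeze in place, while a
userspace freeze is cleared in both cases.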

--D

> 
> 
> --
> Thanks,
> Ruan.
> 
> > 
> > --D
> > 
> > > Then, I think it's better to change them:
> > >    `static int xfs_dax_notify_failure_freeze()`,
> > >    `static void xfs_dax_notify_failure_thaw()`
> > > to
> > >    `static void xfs_dax_notify_failure_try_freeze()`,
> > >    `static void xfs_dax_notify_failure_try_thaw()`.
> > > 
> > > 
> > > --
> > > Thanks,
> > > Ruan.
> > > 
> > > > 
> > > > --D
> > > > 
> > > > > +			else
> > > > > +				return error;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > >    	error = xfs_trans_alloc_empty(mp, &tp);
> > > > >    	if (error)
> > > > > -		return error;
> > > > > +		goto out;
> > > > >    	for (; agno <= end_agno; agno++) {
> > > > >    		struct xfs_rmap_irec	ri_low = { };
> > > > > @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
> > > > >    	}
> > > > >    	xfs_trans_cancel(tp);
> > > > > +
> > > > > +	/*
> > > > > +	 * Determine how to shutdown the filesystem according to the
> > > > > +	 * error code and flags.
> > > > > +	 */
> > > > >    	if (error || notify.want_shutdown) {
> > > > >    		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > > >    		if (!error)
> > > > >    			error = -EFSCORRUPTED;
> > > > > -	}
> > > > > +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > > > > +
> > > > > +out:
> > > > > +	/* Thaw the fs if it is frozen before. */
> > > > > +	if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> > > > > +
> > > > >    	return error;
> > > > >    }
> > > > > @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
> > > > >    	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > > > >    	    mp->m_logdev_targp != mp->m_ddev_targp) {
> > > > > +		if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > +			return 0;
> > > > >    		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > > > >    		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > > >    		return -EFSCORRUPTED;
> > > > > @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
> > > > >    	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > > > >    	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > > > > +	/* Notify failure on the whole device. */
> > > > > +	if (offset == 0 && len == U64_MAX) {
> > > > > +		offset = ddev_start;
> > > > > +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > > > > +	}
> > > > > +
> > > > >    	/* Ignore the range out of filesystem area */
> > > > >    	if (offset + len - 1 < ddev_start)
> > > > >    		return -ENXIO;
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 799836e84840..944a1165a321 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -3577,6 +3577,7 @@ enum mf_flags {
> > > > >    	MF_UNPOISON = 1 << 4,
> > > > >    	MF_SW_SIMULATED = 1 << 5,
> > > > >    	MF_NO_RETRY = 1 << 6,
> > > > > +	MF_MEM_PRE_REMOVE = 1 << 7,
> > > > >    };
> > > > >    int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > > >    		      unsigned long count, int mf_flags);
> > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > > > index dc5ff7dd4e50..92f18c9e0aaf 100644
> > > > > --- a/mm/memory-failure.c
> > > > > +++ b/mm/memory-failure.c
> > > > > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> > > > >     */
> > > > >    static void collect_procs_fsdax(struct page *page,
> > > > >    		struct address_space *mapping, pgoff_t pgoff,
> > > > > -		struct list_head *to_kill)
> > > > > +		struct list_head *to_kill, bool pre_remove)
> > > > >    {
> > > > >    	struct vm_area_struct *vma;
> > > > >    	struct task_struct *tsk;
> > > > > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> > > > >    	i_mmap_lock_read(mapping);
> > > > >    	read_lock(&tasklist_lock);
> > > > >    	for_each_process(tsk) {
> > > > > -		struct task_struct *t = task_early_kill(tsk, true);
> > > > > +		struct task_struct *t = tsk;
> > > > > +		/*
> > > > > +		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> > > > > +		 * the current may not be the one accessing the fsdax page.
> > > > > +		 * Otherwise, search for the current task.
> > > > > +		 */
> > > > > +		if (!pre_remove)
> > > > > +			t = task_early_kill(tsk, true);
> > > > >    		if (!t)
> > > > >    			continue;
> > > > >    		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > > > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > > >    	dax_entry_t cookie;
> > > > >    	struct page *page;
> > > > >    	size_t end = index + count;
> > > > > +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > > > >    	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > > > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > > >    		if (!page)
> > > > >    			goto unlock;
> > > > > -		SetPageHWPoison(page);
> > > > > +		if (!pre_remove)
> > > > > +			SetPageHWPoison(page);
> > > > > -		collect_procs_fsdax(page, mapping, index, &to_kill);
> > > > > +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > > > >    		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > > > >    				index, mf_flags);
> > > > >    unlock:
> > > > > -- 
> > > > > 2.41.0
> > > > > 


* [PATCH v14] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-06-29  8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
                     ` (4 preceding siblings ...)
  2023-08-23  8:17   ` [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE " Shiyang Ruan
@ 2023-08-28  6:57   ` Shiyang Ruan
  2023-08-30 15:34     ` Darrick J. Wong
                       ` (2 more replies)
  5 siblings, 3 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-28  6:57 UTC (permalink / raw)
  To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof

====
Changes since v13:
 1. don't return an error if _freeze(FREEZE_HOLDER_KERNEL) fails with an
    error other than -EBUSY
====

Now, if we suddenly remove a PMEM device (by calling unbind) which
contains FSDAX while programs are still accessing data on this device,
e.g.:
```
 $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
 # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
 echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
the system can end up in an unacceptable state:
  1. the device is gone but the mount point still exists, and umount fails
       with "target is busy"
  2. programs hang and cannot be killed
  3. the kernel may crash with a NULL pointer dereference

To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let the filesystem
know that we are going to remove the whole device, and make sure all
related processes are notified so that they can exit gracefully.

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the filesystem
on it to unmap all files in use, and to notify the processes that are using
those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()
      `-> freeze_super()             // freeze (kernel call)
      `-> do xfs rmap
      ` -> mf_dax_kill_procs()
      `  -> collect_procs_fsdax()    // all associated processes
      `  -> unmap_and_kill()
      ` -> invalidate_inode_pages2_range() // drop file's cache
      `-> thaw_super()               // thaw (both kernel & user call)
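
The call trace above passes offset 0 and length U64_MAX to mean "the whole
device".  In the v13 hunk quoted earlier in this thread, the XFS handler
maps that sentinel onto the real data device range before the range checks.
A minimal userspace sketch of that step in C follows; `struct byte_range`
and `normalize_range()` are hypothetical names, and `ddev_start` and
`ddev_bytes` stand in for bt_dax_part_off and bdev_nr_bytes().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte range on the dax device. */
struct byte_range {
	uint64_t offset;
	uint64_t len;
};

/*
 * If the caller passed the whole-device sentinel (offset 0, len UINT64_MAX),
 * replace it with the data device's real start and size.  Any other range
 * is returned unchanged.
 */
static struct byte_range normalize_range(struct byte_range r,
					 uint64_t ddev_start,
					 uint64_t ddev_bytes)
{
	assert(ddev_bytes > 0);

	if (r.offset == 0 && r.len == UINT64_MAX) {
		r.offset = ddev_start;
		r.len = ddev_bytes;
	}
	return r;
}
```

Only the exact (0, UINT64_MAX) pair is treated as the sentinel here, so a
normal failure range that happens to start at offset 0 is left alone.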

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event.  Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
new dax mappings from being created.  Do not shut down the filesystem
directly if the configuration is not supported, or if the failure range
includes a metadata area.  Make sure all files and processes (not only the
current process) are handled correctly.  Also drop the cache of associated
files before the pmem is removed.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c         |  3 +-
 fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
 include/linux/mm.h          |  1 +
 mm/memory-failure.c         | 17 +++++--
 4 files changed, 109 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..f4b635526345 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 4a9bbd3fe120..79586abc75bf 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
 	struct xfs_mount		*mp = cur->bc_mp;
 	struct xfs_inode		*ip;
 	struct xfs_failure_info		*notify = data;
+	struct address_space		*mapping;
+	pgoff_t				pgoff;
+	unsigned long			pgcnt;
 	int				error = 0;
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* Continue the query because this isn't a failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -92,14 +99,60 @@ xfs_dax_failure_fn(
 		return 0;
 	}
 
-	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
-				  xfs_failure_pgoff(mp, rec, notify),
-				  xfs_failure_pgcnt(mp, rec, notify),
-				  notify->mf_flags);
+	mapping = VFS_I(ip)->i_mapping;
+	pgoff = xfs_failure_pgoff(mp, rec, notify);
+	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+	/* Continue the rmap query if the inode isn't a dax file. */
+	if (dax_mapping(mapping))
+		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+					  notify->mf_flags);
+
+	/* Invalidate the cache in dax pages. */
+	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+		invalidate_inode_pages2_range(mapping, pgoff,
+					      pgoff + pgcnt - 1);
+
 	xfs_irele(ip);
 	return error;
 }
 
+static int
+xfs_dax_notify_failure_freeze(
+	struct xfs_mount	*mp)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+	if (error)
+		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+	return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+	struct xfs_mount	*mp,
+	bool			kernel_frozen)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	if (kernel_frozen) {
+		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+		if (error)
+			xfs_emerg(mp, "still frozen after notify failure, err=%d",
+				error);
+	}
+
+	/*
+	 * Also thaw userspace call anyway because the device is about to be
+	 * removed immediately.
+	 */
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
 static int
 xfs_dax_notify_ddev_failure(
 	struct xfs_mount	*mp,
@@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
 	struct xfs_btree_cur	*cur = NULL;
 	struct xfs_buf		*agf_bp = NULL;
 	int			error = 0;
+	bool			kernel_frozen = false;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
 	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
 	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
 							     daddr + bblen - 1);
 	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "Device is about to be removed!");
+		/*
+		 * Freeze fs to prevent new mappings from being created.
+		 * - Keep going on if others already hold the kernel freeze.
+		 * - Keep going on if other errors too because this device is
+		 *   starting to fail.
+		 * - If kernel frozen state is held successfully here, thaw it
+		 *   here as well at the end.
+		 */
+		kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
+	}
+
 	error = xfs_trans_alloc_empty(mp, &tp);
 	if (error)
-		return error;
+		goto out;
 
 	for (; agno <= end_agno; agno++) {
 		struct xfs_rmap_irec	ri_low = { };
@@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
 	}
 
 	xfs_trans_cancel(tp);
+
+	/*
+	 * Determine how to shutdown the filesystem according to the
+	 * error code and flags.
+	 */
 	if (error || notify.want_shutdown) {
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
-	}
+	} else if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
+out:
+	/* Thaw the fs if it is frozen before. */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
 	return error;
 }
 
@@ -197,6 +276,8 @@ xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -210,6 +291,12 @@ xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device. */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..a10c75bebd6d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3665,6 +3665,7 @@ enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e245191e6b04..e71616ccc643 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
  */
 static void collect_procs_fsdax(struct page *page,
 		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+		struct list_head *to_kill, bool pre_remove)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		struct task_struct *t = task_early_kill(tsk, true);
+		struct task_struct *t = tsk;
 
+		/*
+		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+		 * the current may not be the one accessing the fsdax page.
+		 * Otherwise, search for the current task.
+		 */
+		if (!pre_remove)
+			t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 	dax_entry_t cookie;
 	struct page *page;
 	size_t end = index + count;
+	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
 
 	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
 
@@ -1799,9 +1807,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		if (!page)
 			goto unlock;
 
-		SetPageHWPoison(page);
+		if (!pre_remove)
+			SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock:
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread
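
Editorial sketch (not part of the patch): the whole-device handling that
the hunk in xfs_dax_notify_failure() adds can be modeled in user space
roughly as follows.  kill_dax() passes offset 0 and length U64_MAX as a
"whole device" sentinel, and the filesystem rewrites that into the real
extent of its data device before the existing range check runs.  The
struct ddev_range type and the normalize_failure_range() helper are
hypothetical names invented for this illustration; only the sentinel test
and the first bounds check mirror the diff, and the remaining range
handling of the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the data device geometry; not a kernel type. */
struct ddev_range {
	uint64_t start;		/* models bt_dax_part_off */
	uint64_t nbytes;	/* models bdev_nr_bytes() of the data device */
};

/*
 * Returns false when the range lies entirely before the filesystem area
 * (the kernel code returns -ENXIO there), true otherwise.
 */
static bool normalize_failure_range(const struct ddev_range *dev,
				    uint64_t *offset, uint64_t *len)
{
	assert(dev->nbytes > 0 && *len > 0);

	/* "Whole device" sentinel used by kill_dax(). */
	if (*offset == 0 && *len == UINT64_MAX) {
		*offset = dev->start;
		*len = dev->nbytes;
	}

	/* Ignore a range that ends before the filesystem area. */
	if (*offset + *len - 1 < dev->start)
		return false;

	return true;
}
```

Under these assumptions, a sentinel request on a device that starts at
byte 4096 and is 1 MiB long becomes (offset 4096, length 1 MiB), while a
4096-byte range at offset 0 is rejected because it ends before the
filesystem area.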

* Re: [PATCH v14] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-08-28  6:57   ` [PATCH v14] " Shiyang Ruan
@ 2023-08-30 15:34     ` Darrick J. Wong
  2023-09-27  8:17     ` Dan Williams
  2023-09-28 10:32     ` [PATCH v15] " Shiyang Ruan
  2 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-30 15:34 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, mcgrof

On Mon, Aug 28, 2023 at 02:57:44PM +0800, Shiyang Ruan wrote:
> ====
> Changes since v13:
>  1. don't return error if _freeze(FREEZE_HOLDER_KERNEL) got other error
> ====
> 
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
>  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
>   1. device has gone but mount point still exists, and umount will fail
>        with "target is busy"
>   2. programs will hang and cannot be killed
>   3. may crash with NULL pointer dereference
> 
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
> 
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created.  Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area.  Make sure all files and processes(not only the current progress)
> are handled correctly.  Also drop the cache of associated files before
> pmem is removed.
> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>

Looks good, now who wants to take this patch?

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  drivers/dax/super.c         |  3 +-
>  fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>  include/linux/mm.h          |  1 +
>  mm/memory-failure.c         | 17 +++++--
>  4 files changed, 109 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0da9232ea175..f4b635526345 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
>  		return;
>  
>  	if (dax_dev->holder_data != NULL)
> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> +				MF_MEM_PRE_REMOVE);
>  
>  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>  	synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..79586abc75bf 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>  
>  #include <linux/mm.h>
>  #include <linux/dax.h>
> +#include <linux/fs.h>
>  
>  struct xfs_failure_info {
>  	xfs_agblock_t		startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>  	struct xfs_mount		*mp = cur->bc_mp;
>  	struct xfs_inode		*ip;
>  	struct xfs_failure_info		*notify = data;
> +	struct address_space		*mapping;
> +	pgoff_t				pgoff;
> +	unsigned long			pgcnt;
>  	int				error = 0;
>  
>  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>  	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> +		/* Continue the query because this isn't a failure. */
> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		notify->want_shutdown = true;
>  		return 0;
>  	}
> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>  		return 0;
>  	}
>  
> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> -				  xfs_failure_pgoff(mp, rec, notify),
> -				  xfs_failure_pgcnt(mp, rec, notify),
> -				  notify->mf_flags);
> +	mapping = VFS_I(ip)->i_mapping;
> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> +	/* Continue the rmap query if the inode isn't a dax file. */
> +	if (dax_mapping(mapping))
> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> +					  notify->mf_flags);
> +
> +	/* Invalidate the cache in dax pages. */
> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +		invalidate_inode_pages2_range(mapping, pgoff,
> +					      pgoff + pgcnt - 1);
> +
>  	xfs_irele(ip);
>  	return error;
>  }
>  
> +static int
> +xfs_dax_notify_failure_freeze(
> +	struct xfs_mount	*mp)
> +{
> +	struct super_block	*sb = mp->m_super;
> +	int			error;
> +
> +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> +	if (error)
> +		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> +
> +	return error;
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> +	struct xfs_mount	*mp,
> +	bool			kernel_frozen)
> +{
> +	struct super_block	*sb = mp->m_super;
> +	int			error;
> +
> +	if (kernel_frozen) {
> +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> +		if (error)
> +			xfs_emerg(mp, "still frozen after notify failure, err=%d",
> +				error);
> +	}
> +
> +	/*
> +	 * Also thaw userspace call anyway because the device is about to be
> +	 * removed immediately.
> +	 */
> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +}
> +
>  static int
>  xfs_dax_notify_ddev_failure(
>  	struct xfs_mount	*mp,
> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>  	struct xfs_btree_cur	*cur = NULL;
>  	struct xfs_buf		*agf_bp = NULL;
>  	int			error = 0;
> +	bool			kernel_frozen = false;
>  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
>  	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
>  							     daddr + bblen - 1);
>  	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>  
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "Device is about to be removed!");
> +		/*
> +		 * Freeze fs to prevent new mappings from being created.
> +		 * - Keep going on if others already hold the kernel forzen.
> +		 * - Keep going on if other errors too because this device is
> +		 *   starting to fail.
> +		 * - If kernel frozen state is hold successfully here, thaw it
> +		 *   here as well at the end.
> +		 */
> +		kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
> +	}
> +
>  	error = xfs_trans_alloc_empty(mp, &tp);
>  	if (error)
> -		return error;
> +		goto out;
>  
>  	for (; agno <= end_agno; agno++) {
>  		struct xfs_rmap_irec	ri_low = { };
> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>  	}
>  
>  	xfs_trans_cancel(tp);
> +
> +	/*
> +	 * Determine how to shutdown the filesystem according to the
> +	 * error code and flags.
> +	 */
>  	if (error || notify.want_shutdown) {
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>  		if (!error)
>  			error = -EFSCORRUPTED;
> -	}
> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> +	/* Thaw the fs if it is frozen before. */
> +	if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> +
>  	return error;
>  }
>  
> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>  
>  	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>  	    mp->m_logdev_targp != mp->m_ddev_targp) {
> +		if (mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>  		return -EFSCORRUPTED;
> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>  	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>  	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>  
> +	/* Notify failure on the whole device. */
> +	if (offset == 0 && len == U64_MAX) {
> +		offset = ddev_start;
> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> +	}
> +
>  	/* Ignore the range out of filesystem area */
>  	if (offset + len - 1 < ddev_start)
>  		return -ENXIO;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..a10c75bebd6d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3665,6 +3665,7 @@ enum mf_flags {
>  	MF_UNPOISON = 1 << 4,
>  	MF_SW_SIMULATED = 1 << 5,
>  	MF_NO_RETRY = 1 << 6,
> +	MF_MEM_PRE_REMOVE = 1 << 7,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  		      unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index e245191e6b04..e71616ccc643 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>   */
>  static void collect_procs_fsdax(struct page *page,
>  		struct address_space *mapping, pgoff_t pgoff,
> -		struct list_head *to_kill)
> +		struct list_head *to_kill, bool pre_remove)
>  {
>  	struct vm_area_struct *vma;
>  	struct task_struct *tsk;
> @@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
>  	i_mmap_lock_read(mapping);
>  	read_lock(&tasklist_lock);
>  	for_each_process(tsk) {
> -		struct task_struct *t = task_early_kill(tsk, true);
> +		struct task_struct *t = tsk;
>  
> +		/*
> +		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> +		 * the current may not be the one accessing the fsdax page.
> +		 * Otherwise, search for the current task.
> +		 */
> +		if (!pre_remove)
> +			t = task_early_kill(tsk, true);
>  		if (!t)
>  			continue;
>  		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  	dax_entry_t cookie;
>  	struct page *page;
>  	size_t end = index + count;
> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>  
>  	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>  
> @@ -1799,9 +1807,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  		if (!page)
>  			goto unlock;
>  
> -		SetPageHWPoison(page);
> +		if (!pre_remove)
> +			SetPageHWPoison(page);
>  
> -		collect_procs_fsdax(page, mapping, index, &to_kill);
> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>  		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>  				index, mf_flags);
>  unlock:
> -- 
> 2.41.0
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread
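
Editorial sketch (not part of the thread): the kernel_frozen bookkeeping
in the v14 patch can be illustrated with a toy user-space model.  The
patch records whether its own freeze_super(sb, FREEZE_HOLDER_KERNEL) call
succeeded, thaws the kernel hold only in that case, and always issues the
userspace thaw.  struct fs_model, model_freeze_kernel() and model_thaw()
are hypothetical names, and the rule that a second kernel freeze fails is
an assumption drawn from the patch's "already frozen by kernel" message,
not a statement about the real freeze_super() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the two freeze holders; not the kernel's super_block. */
struct fs_model {
	bool kernel_hold;	/* models a FREEZE_HOLDER_KERNEL freeze */
	bool user_hold;		/* models a FREEZE_HOLDER_USERSPACE freeze */
};

/* Assumption: a second kernel-holder freeze fails with an error. */
static int model_freeze_kernel(struct fs_model *fs)
{
	assert(fs != NULL);

	if (fs->kernel_hold)
		return -1;
	fs->kernel_hold = true;
	return 0;
}

/* Drop the kernel hold only if this path took it; always drop the user hold. */
static void model_thaw(struct fs_model *fs, bool kernel_frozen)
{
	assert(fs != NULL);

	if (kernel_frozen)
		fs->kernel_hold = false;
	fs->user_hold = false;
}
```

In this model, a filesystem that was frozen only from userspace ends up
fully thawed, whereas a kernel hold taken by someone else survives the
notification because this path never acquired it.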

* RE: [PATCH v14] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-08-28  6:57   ` [PATCH v14] " Shiyang Ruan
  2023-08-30 15:34     ` Darrick J. Wong
@ 2023-09-27  8:17     ` Dan Williams
  2023-09-27  9:18       ` Shiyang Ruan
  2023-09-28 10:32     ` [PATCH v15] " Shiyang Ruan
  2 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2023-09-27  8:17 UTC (permalink / raw)
  To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof

Shiyang Ruan wrote:
> ====
> Changes since v13:
>  1. don't return error if _freeze(FREEZE_HOLDER_KERNEL) got other error
> ====
> 
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
>  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
>   1. device has gone but mount point still exists, and umount will fail
>        with "target is busy"
>   2. programs will hang and cannot be killed
>   3. may crash with NULL pointer dereference

Thanks, this addresses my main concern. This new capability is needed
because otherwise DAX regresses the survivability of the kernel when
removing a device from underneath the mounted filesystem, compared to
removing a non-DAX capable block device.

> 
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
> 
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created.  Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area.  Make sure all files and processes(not only the current progress)
> are handled correctly.  Also drop the cache of associated files before
> pmem is removed.
> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

I only have some questions and comment suggestions below, but otherwise
consider this:

Acked-by: Dan Williams <dan.j.williams@intel.com>

> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/dax/super.c         |  3 +-
>  fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>  include/linux/mm.h          |  1 +
>  mm/memory-failure.c         | 17 +++++--
>  4 files changed, 109 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0da9232ea175..f4b635526345 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
>  		return;
>  
>  	if (dax_dev->holder_data != NULL)
> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> +				MF_MEM_PRE_REMOVE);
>  
>  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>  	synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..79586abc75bf 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>  
>  #include <linux/mm.h>
>  #include <linux/dax.h>
> +#include <linux/fs.h>
>  
>  struct xfs_failure_info {
>  	xfs_agblock_t		startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>  	struct xfs_mount		*mp = cur->bc_mp;
>  	struct xfs_inode		*ip;
>  	struct xfs_failure_info		*notify = data;
> +	struct address_space		*mapping;
> +	pgoff_t				pgoff;
> +	unsigned long			pgcnt;
>  	int				error = 0;
>  
>  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>  	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> +		/* Continue the query because this isn't a failure. */
> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		notify->want_shutdown = true;
>  		return 0;
>  	}
> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>  		return 0;
>  	}
>  
> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> -				  xfs_failure_pgoff(mp, rec, notify),
> -				  xfs_failure_pgcnt(mp, rec, notify),
> -				  notify->mf_flags);
> +	mapping = VFS_I(ip)->i_mapping;
> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> +	/* Continue the rmap query if the inode isn't a dax file. */
> +	if (dax_mapping(mapping))
> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> +					  notify->mf_flags);
> +
> +	/* Invalidate the cache in dax pages. */
> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +		invalidate_inode_pages2_range(mapping, pgoff,
> +					      pgoff + pgcnt - 1);
> +
>  	xfs_irele(ip);
>  	return error;
>  }
>  
> +static int
> +xfs_dax_notify_failure_freeze(
> +	struct xfs_mount	*mp)
> +{
> +	struct super_block	*sb = mp->m_super;
> +	int			error;
> +
> +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> +	if (error)
> +		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> +
> +	return error;
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> +	struct xfs_mount	*mp,
> +	bool			kernel_frozen)
> +{
> +	struct super_block	*sb = mp->m_super;
> +	int			error;
> +
> +	if (kernel_frozen) {
> +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> +		if (error)
> +			xfs_emerg(mp, "still frozen after notify failure, err=%d",
> +				error);
> +	}
> +
> +	/*
> +	 * Also thaw userspace call anyway because the device is about to be
> +	 * removed immediately.
> +	 */
> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);

I don't understand why this is not paired with a freeze in
xfs_dax_notify_failure_freeze()?

> +}
> +
>  static int
>  xfs_dax_notify_ddev_failure(
>  	struct xfs_mount	*mp,
> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>  	struct xfs_btree_cur	*cur = NULL;
>  	struct xfs_buf		*agf_bp = NULL;
>  	int			error = 0;
> +	bool			kernel_frozen = false;
>  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>  	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
>  	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
>  							     daddr + bblen - 1);
>  	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>  
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "Device is about to be removed!");
> +		/*
> +		 * Freeze fs to prevent new mappings from being created.
> +		 * - Keep going on if others already hold the kernel forzen.
> +		 * - Keep going on if other errors too because this device is
> +		 *   starting to fail.
> +		 * - If kernel frozen state is hold successfully here, thaw it
> +		 *   here as well at the end.
> +		 */
> +		kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
> +	}
> +
>  	error = xfs_trans_alloc_empty(mp, &tp);
>  	if (error)
> -		return error;
> +		goto out;
>  
>  	for (; agno <= end_agno; agno++) {
>  		struct xfs_rmap_irec	ri_low = { };
> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>  	}
>  
>  	xfs_trans_cancel(tp);
> +
> +	/*
> +	 * Determine how to shutdown the filesystem according to the
> +	 * error code and flags.
> +	 */

This comment is not adding any value. It would be better if it clarified
why want_shutdown will be false in the pre-remove case.

>  	if (error || notify.want_shutdown) {
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>  		if (!error)
>  			error = -EFSCORRUPTED;
> -	}
> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> +	/* Thaw the fs if it is frozen before. */
> +	if (mf_flags & MF_MEM_PRE_REMOVE)
> +		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> +
>  	return error;
>  }
>  
> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>  
>  	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>  	    mp->m_logdev_targp != mp->m_ddev_targp) {

Maybe a comment:

/* 
 * In the pre-remove case the failure notification is attempting to
 * trigger a force unmount, the expectation is that the device is still
 * present, but its removal is in progress and can not be cancelled,
 * proceed with accessing the log device.
 */

> +		if (mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>  		return -EFSCORRUPTED;
> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>  	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>  	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>  
> +	/* Notify failure on the whole device. */
> +	if (offset == 0 && len == U64_MAX) {
> +		offset = ddev_start;
> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> +	}
> +
>  	/* Ignore the range out of filesystem area */
>  	if (offset + len - 1 < ddev_start)
>  		return -ENXIO;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..a10c75bebd6d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3665,6 +3665,7 @@ enum mf_flags {
>  	MF_UNPOISON = 1 << 4,
>  	MF_SW_SIMULATED = 1 << 5,
>  	MF_NO_RETRY = 1 << 6,
> +	MF_MEM_PRE_REMOVE = 1 << 7,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  		      unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index e245191e6b04..e71616ccc643 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>   */
>  static void collect_procs_fsdax(struct page *page,
>  		struct address_space *mapping, pgoff_t pgoff,
> -		struct list_head *to_kill)
> +		struct list_head *to_kill, bool pre_remove)
>  {
>  	struct vm_area_struct *vma;
>  	struct task_struct *tsk;
> @@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
>  	i_mmap_lock_read(mapping);
>  	read_lock(&tasklist_lock);
>  	for_each_process(tsk) {
> -		struct task_struct *t = task_early_kill(tsk, true);
> +		struct task_struct *t = tsk;
>  
> +		/*
> +		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> +		 * the current may not be the one accessing the fsdax page.
> +		 * Otherwise, search for the current task.
> +		 */
> +		if (!pre_remove)
> +			t = task_early_kill(tsk, true);
>  		if (!t)
>  			continue;
>  		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  	dax_entry_t cookie;
>  	struct page *page;
>  	size_t end = index + count;
> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>  
>  	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>  
> @@ -1799,9 +1807,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  		if (!page)
>  			goto unlock;
>  
> -		SetPageHWPoison(page);
> +		if (!pre_remove)
> +			SetPageHWPoison(page);

This probably wants a comment like:

/*
 * The pre_remove case is revoking access, the memory is still good and
 * could theoretically be put back into service
 */

>  
> -		collect_procs_fsdax(page, mapping, index, &to_kill);
> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>  		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>  				index, mf_flags);
>  unlock:
> -- 
> 2.41.0
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v14] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-09-27  8:17     ` Dan Williams
@ 2023-09-27  9:18       ` Shiyang Ruan
  0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-09-27  9:18 UTC (permalink / raw)
  To: Dan Williams, linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: Chandan Babu R, djwong, Andrew Morton, willy, jack, akpm, mcgrof



On 2023/9/27 16:17, Dan Williams wrote:
> Shiyang Ruan wrote:
>> ====
>> Changes since v13:
>>   1. don't return error if _freeze(FREEZE_HOLDER_KERNEL) got other error
>> ====
>>
>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>> contains FSDAX while programs are still accessing data in this device,
>> e.g.:
>> ```
>>   $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>   # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>   echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>> ```
>> it could come into an unacceptable state:
>>    1. device has gone but mount point still exists, and umount will fail
>>         with "target is busy"
>>    2. programs will hang and cannot be killed
>>    3. may crash with NULL pointer dereference
> 
> Thanks, this addresses my main concern that this new capability is needed
> otherwise DAX regresses the survivability of the kernel when removing a
> device from underneath the mounted filesystem compared to removing a
> non-DAX capable block device.
> 
>>
>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>> are going to remove the whole device, and make sure all related processes
>> could be notified so that they could end up gracefully.
>>
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> on it to unmap all files in use, and notify processes who are using
>> those files.
>>
>> Call trace:
>> trigger unbind
>>   -> unbind_store()
>>    -> ... (skip)
>>     -> devres_release_all()
>>      -> kill_dax()
>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>        -> xfs_dax_notify_failure()
>>        `-> freeze_super()             // freeze (kernel call)
>>        `-> do xfs rmap
>>        ` -> mf_dax_kill_procs()
>>        `  -> collect_procs_fsdax()    // all associated processes
>>        `  -> unmap_and_kill()
>>        ` -> invalidate_inode_pages2_range() // drop file's cache
>>        `-> thaw_super()               // thaw (both kernel & user call)
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> new dax mapping from being created.  Do not shutdown filesystem directly
>> if configuration is not supported, or if failure range includes metadata
>> area.  Make sure all files and processes(not only the current progress)
>> are handled correctly.  Also drop the cache of associated files before
>> pmem is removed.
>>
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> 
> I only have some questions and comment suggestions below, but otherwise
> consider this:
> 
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> 
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>>   drivers/dax/super.c         |  3 +-
>>   fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>>   include/linux/mm.h          |  1 +
>>   mm/memory-failure.c         | 17 +++++--
>>   4 files changed, 109 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 0da9232ea175..f4b635526345 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
>>   		return;
>>   
>>   	if (dax_dev->holder_data != NULL)
>> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> +				MF_MEM_PRE_REMOVE);
>>   
>>   	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>   	synchronize_srcu(&dax_srcu);
>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>> index 4a9bbd3fe120..79586abc75bf 100644
>> --- a/fs/xfs/xfs_notify_failure.c
>> +++ b/fs/xfs/xfs_notify_failure.c
>> @@ -22,6 +22,7 @@
>>   
>>   #include <linux/mm.h>
>>   #include <linux/dax.h>
>> +#include <linux/fs.h>
>>   
>>   struct xfs_failure_info {
>>   	xfs_agblock_t		startblock;
>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>   	struct xfs_mount		*mp = cur->bc_mp;
>>   	struct xfs_inode		*ip;
>>   	struct xfs_failure_info		*notify = data;
>> +	struct address_space		*mapping;
>> +	pgoff_t				pgoff;
>> +	unsigned long			pgcnt;
>>   	int				error = 0;
>>   
>>   	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>   	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>> +		/* Continue the query because this isn't a failure. */
>> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>   		notify->want_shutdown = true;
>>   		return 0;
>>   	}
>> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>>   		return 0;
>>   	}
>>   
>> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>> -				  xfs_failure_pgoff(mp, rec, notify),
>> -				  xfs_failure_pgcnt(mp, rec, notify),
>> -				  notify->mf_flags);
>> +	mapping = VFS_I(ip)->i_mapping;
>> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
>> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>> +
>> +	/* Continue the rmap query if the inode isn't a dax file. */
>> +	if (dax_mapping(mapping))
>> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>> +					  notify->mf_flags);
>> +
>> +	/* Invalidate the cache in dax pages. */
>> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> +		invalidate_inode_pages2_range(mapping, pgoff,
>> +					      pgoff + pgcnt - 1);
>> +
>>   	xfs_irele(ip);
>>   	return error;
>>   }
>>   
>> +static int
>> +xfs_dax_notify_failure_freeze(
>> +	struct xfs_mount	*mp)
>> +{
>> +	struct super_block	*sb = mp->m_super;
>> +	int			error;
>> +
>> +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>> +	if (error)
>> +		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>> +
>> +	return error;
>> +}
>> +
>> +static void
>> +xfs_dax_notify_failure_thaw(
>> +	struct xfs_mount	*mp,
>> +	bool			kernel_frozen)
>> +{
>> +	struct super_block	*sb = mp->m_super;
>> +	int			error;
>> +
>> +	if (kernel_frozen) {
>> +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>> +		if (error)
>> +			xfs_emerg(mp, "still frozen after notify failure, err=%d",
>> +				error);
>> +	}
>> +
>> +	/*
>> +	 * Also thaw userspace call anyway because the device is about to be
>> +	 * removed immediately.
>> +	 */
>> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> 
> I don't understand why this is not paired with a freeze in
> xfs_dax_notify_failure_freeze()?

What we want to do is freeze the filesystem, so actually
freeze_super(sb, FREEZE_HOLDER_KERNEL) is enough.  But
thaw_super(sb, FREEZE_HOLDER_USERSPACE) is added here to make sure the mount
point can be unmounted after unbind even if some other userspace program is
holding the freeze state of this filesystem.  Otherwise, after unbind,
the mount point still exists and `umount /mnt/scratch` fails with
"target is busy."  `xfs_freeze -u /mnt/scratch` doesn't work either.
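
To make the reasoning concrete, here is a minimal userspace model, not the
VFS implementation.  `struct toy_sb`, `toy_freeze()`, `toy_thaw()` and
`toy_unbind_thaw()` are hypothetical names for this sketch.  It assumes only
that kernel and userspace freezes are tracked as separate holders and that
the filesystem stays frozen while any holder remains; the error codes are
an assumption as well.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for FREEZE_HOLDER_KERNEL / FREEZE_HOLDER_USERSPACE. */
enum toy_holder {
	TOY_HOLDER_KERNEL	= 1 << 0,
	TOY_HOLDER_USERSPACE	= 1 << 1,
};

/* Toy superblock: a bitmask of who currently holds a freeze. */
struct toy_sb {
	unsigned int holders;
};

static bool toy_frozen(const struct toy_sb *sb)
{
	return sb->holders != 0;
}

/* Each holder may freeze once; a second freeze by the same holder fails. */
static int toy_freeze(struct toy_sb *sb, enum toy_holder who)
{
	assert(sb != NULL);
	if (sb->holders & who)
		return -EBUSY;
	sb->holders |= who;
	return 0;
}

/* Thawing drops only the caller's hold; other holders keep the fs frozen. */
static int toy_thaw(struct toy_sb *sb, enum toy_holder who)
{
	assert(sb != NULL);
	if (!(sb->holders & who))
		return -EINVAL;
	sb->holders &= ~who;
	return 0;
}

/*
 * Same shape as xfs_dax_notify_failure_thaw(): drop our own kernel hold if
 * we took one, then drop any userspace hold unconditionally.  The userspace
 * thaw result is ignored because there may be no such hold at all.
 */
static void toy_unbind_thaw(struct toy_sb *sb, bool kernel_frozen)
{
	if (kernel_frozen)
		(void)toy_thaw(sb, TOY_HOLDER_KERNEL);
	(void)toy_thaw(sb, TOY_HOLDER_USERSPACE);
}
```

In this model, if a userspace freeze is already in place when unbind
happens, dropping only the kernel hold leaves toy_frozen() true, while
toy_unbind_thaw() leaves no holder behind.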

> 
>> +}
>> +
>>   static int
>>   xfs_dax_notify_ddev_failure(
>>   	struct xfs_mount	*mp,
>> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>>   	struct xfs_btree_cur	*cur = NULL;
>>   	struct xfs_buf		*agf_bp = NULL;
>>   	int			error = 0;
>> +	bool			kernel_frozen = false;
>>   	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>>   	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
>>   	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
>>   							     daddr + bblen - 1);
>>   	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>>   
>> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
>> +		xfs_info(mp, "Device is about to be removed!");
>> +		/*
>> +		 * Freeze fs to prevent new mappings from being created.
>> +		 * - Keep going on if others already hold the kernel forzen.
>> +		 * - Keep going on if other errors too because this device is
>> +		 *   starting to fail.
>> +		 * - If kernel frozen state is hold successfully here, thaw it
>> +		 *   here as well at the end.
>> +		 */
>> +		kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
>> +	}
>> +
>>   	error = xfs_trans_alloc_empty(mp, &tp);
>>   	if (error)
>> -		return error;
>> +		goto out;
>>   
>>   	for (; agno <= end_agno; agno++) {
>>   		struct xfs_rmap_irec	ri_low = { };
>> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>>   	}
>>   
>>   	xfs_trans_cancel(tp);
>> +
>> +	/*
>> +	 * Determine how to shutdown the filesystem according to the
>> +	 * error code and flags.
>> +	 */
> 
> This comment is not adding any value. It would be better if it clarified
> why why want_shutdown will be false in the pre-remove case?
> 
>>   	if (error || notify.want_shutdown) {
>>   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>   		if (!error)
>>   			error = -EFSCORRUPTED;
>> -	}
>> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
>> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>> +
>> +out:
>> +	/* Thaw the fs if it is frozen before. */
>> +	if (mf_flags & MF_MEM_PRE_REMOVE)
>> +		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
>> +
>>   	return error;
>>   }
>>   
>> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>>   
>>   	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>   	    mp->m_logdev_targp != mp->m_ddev_targp) {
> 
> Maybe a comment:
> 
> /*
>   * In the pre-remove case the failure notification is attempting to
>   * trigger a force unmount, the expectation is that the device is still
>   * present, but its removal is in progress and can not be cancelled,
>   * proceed with accessing the log device.
>   */
> 
>> +		if (mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>   		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>   		return -EFSCORRUPTED;
>> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>>   	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>   	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>   
>> +	/* Notify failure on the whole device. */
>> +	if (offset == 0 && len == U64_MAX) {
>> +		offset = ddev_start;
>> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>> +	}
>> +
>>   	/* Ignore the range out of filesystem area */
>>   	if (offset + len - 1 < ddev_start)
>>   		return -ENXIO;
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 2dd73e4f3d8e..a10c75bebd6d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3665,6 +3665,7 @@ enum mf_flags {
>>   	MF_UNPOISON = 1 << 4,
>>   	MF_SW_SIMULATED = 1 << 5,
>>   	MF_NO_RETRY = 1 << 6,
>> +	MF_MEM_PRE_REMOVE = 1 << 7,
>>   };
>>   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   		      unsigned long count, int mf_flags);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index e245191e6b04..e71616ccc643 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>    */
>>   static void collect_procs_fsdax(struct page *page,
>>   		struct address_space *mapping, pgoff_t pgoff,
>> -		struct list_head *to_kill)
>> +		struct list_head *to_kill, bool pre_remove)
>>   {
>>   	struct vm_area_struct *vma;
>>   	struct task_struct *tsk;
>> @@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
>>   	i_mmap_lock_read(mapping);
>>   	read_lock(&tasklist_lock);
>>   	for_each_process(tsk) {
>> -		struct task_struct *t = task_early_kill(tsk, true);
>> +		struct task_struct *t = tsk;
>>   
>> +		/*
>> +		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
>> +		 * the current may not be the one accessing the fsdax page.
>> +		 * Otherwise, search for the current task.
>> +		 */
>> +		if (!pre_remove)
>> +			t = task_early_kill(tsk, true);
>>   		if (!t)
>>   			continue;
>>   		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> @@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   	dax_entry_t cookie;
>>   	struct page *page;
>>   	size_t end = index + count;
>> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>   
>>   	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>   
>> @@ -1799,9 +1807,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   		if (!page)
>>   			goto unlock;
>>   
>> -		SetPageHWPoison(page);
>> +		if (!pre_remove)
>> +			SetPageHWPoison(page);
> 
> This problably wants a comment like:
> 
> /*
>   * The pre_remove case is revoking access, the memory is still good and
>   * could theoretically be put back into service
>   */
> 
>>   
>> -		collect_procs_fsdax(page, mapping, index, &to_kill);
>> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>   		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>   				index, mf_flags);
>>   unlock:

I'll add/modify these comments as you suggested.  Thanks!


--
Ruan.

>> -- 
>> 2.41.0
>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-08-28  6:57   ` [PATCH v14] " Shiyang Ruan
  2023-08-30 15:34     ` Darrick J. Wong
  2023-09-27  8:17     ` Dan Williams
@ 2023-09-28 10:32     ` Shiyang Ruan
  2023-09-29 18:31       ` Dan Williams
                         ` (3 more replies)
  2 siblings, 4 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-09-28 10:32 UTC (permalink / raw)
  To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof, chandanbabu

====
Changes since v14:
 1. added/fixed code comments per Dan's comments
====

Currently, if we suddenly remove a PMEM device (by calling unbind) that
contains an FSDAX filesystem while programs are still accessing data on it,
e.g.:
```
 $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
 # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
 echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
the system can end up in an unacceptable state:
  1. the device is gone but the mount point still exists, and umount fails
       with "target is busy"
  2. programs hang and cannot be killed
  3. the kernel may crash with a NULL pointer dereference

To fix this, introduce a MF_MEM_PRE_REMOVE flag to let the filesystem know
that the whole device is about to be removed, and make sure all related
processes are notified so that they can exit gracefully.

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
on it to unmap all files in use, and notify processes who are using
those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()
      `-> freeze_super()             // freeze (kernel call)
      `-> do xfs rmap
      ` -> mf_dax_kill_procs()
      `  -> collect_procs_fsdax()    // all associated processes
      `  -> unmap_and_kill()
      ` -> invalidate_inode_pages2_range() // drop file's cache
      `-> thaw_super()               // thaw (both kernel & user call)

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event.  Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
new dax mappings from being created.  Do not shut down the filesystem
directly if the configuration is not supported, or if the failure range
includes a metadata area.  Make sure all files and processes (not only the
current process) are handled correctly.  Also drop the cache of associated
files before the pmem is removed.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c         |   3 +-
 fs/xfs/xfs_notify_failure.c | 108 ++++++++++++++++++++++++++++++++++--
 include/linux/mm.h          |   1 +
 mm/memory-failure.c         |  21 +++++--
 4 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..f4b635526345 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 4a9bbd3fe120..30e9f4e09f76 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
 	struct xfs_mount		*mp = cur->bc_mp;
 	struct xfs_inode		*ip;
 	struct xfs_failure_info		*notify = data;
+	struct address_space		*mapping;
+	pgoff_t				pgoff;
+	unsigned long			pgcnt;
 	int				error = 0;
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* Continue the query because this isn't a failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -92,14 +99,60 @@ xfs_dax_failure_fn(
 		return 0;
 	}
 
-	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
-				  xfs_failure_pgoff(mp, rec, notify),
-				  xfs_failure_pgcnt(mp, rec, notify),
-				  notify->mf_flags);
+	mapping = VFS_I(ip)->i_mapping;
+	pgoff = xfs_failure_pgoff(mp, rec, notify);
+	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+	/* Continue the rmap query if the inode isn't a dax file. */
+	if (dax_mapping(mapping))
+		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+					  notify->mf_flags);
+
+	/* Invalidate the cache in dax pages. */
+	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+		invalidate_inode_pages2_range(mapping, pgoff,
+					      pgoff + pgcnt - 1);
+
 	xfs_irele(ip);
 	return error;
 }
 
+static int
+xfs_dax_notify_failure_freeze(
+	struct xfs_mount	*mp)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+	if (error)
+		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+	return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+	struct xfs_mount	*mp,
+	bool			kernel_frozen)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	if (kernel_frozen) {
+		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+		if (error)
+			xfs_emerg(mp, "still frozen after notify failure, err=%d",
+				error);
+	}
+
+	/*
+	 * Also thaw userspace call anyway because the device is about to be
+	 * removed immediately.
+	 */
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
 static int
 xfs_dax_notify_ddev_failure(
 	struct xfs_mount	*mp,
@@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
 	struct xfs_btree_cur	*cur = NULL;
 	struct xfs_buf		*agf_bp = NULL;
 	int			error = 0;
+	bool			kernel_frozen = false;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
 	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
 	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
 							     daddr + bblen - 1);
 	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "Device is about to be removed!");
+		/*
+		 * Freeze fs to prevent new mappings from being created.
+		 * - Keep going if others already hold the kernel freeze.
+		 * - Keep going on other errors too because this device is
+		 *   starting to fail.
+		 * - If the kernel frozen state is held successfully here, thaw
+		 *   it here as well at the end.
+		 */
+		kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
+	}
+
 	error = xfs_trans_alloc_empty(mp, &tp);
 	if (error)
-		return error;
+		goto out;
 
 	for (; agno <= end_agno; agno++) {
 		struct xfs_rmap_irec	ri_low = { };
@@ -165,11 +232,26 @@ xfs_dax_notify_ddev_failure(
 	}
 
 	xfs_trans_cancel(tp);
-	if (error || notify.want_shutdown) {
+
+	/*
+	 * Shutdown fs from a force umount in pre-remove case which won't fail,
+	 * so errors can be ignored.  Otherwise, shutdown the filesystem with
+	 * CORRUPT flag if error occurred or notify.want_shutdown was set during
+	 * RMAP querying.
+	 */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+	else if (error || notify.want_shutdown) {
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
 	}
+
+out:
+	/* Thaw the fs if it has been frozen before. */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
 	return error;
 }
 
@@ -197,6 +279,14 @@ xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		/*
+		 * In the pre-remove case the failure notification is attempting
+		 * to trigger a force unmount.  The expectation is that the
+		 * device is still present, but its removal is in progress and
+		 * can not be cancelled, proceed with accessing the log device.
+		 */
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -210,6 +300,12 @@ xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device. */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..a10c75bebd6d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3665,6 +3665,7 @@ enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e245191e6b04..955edea9837f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
  */
 static void collect_procs_fsdax(struct page *page,
 		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+		struct list_head *to_kill, bool pre_remove)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		struct task_struct *t = task_early_kill(tsk, true);
+		struct task_struct *t = tsk;
 
+		/*
+		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+		 * the current may not be the one accessing the fsdax page.
+		 * Otherwise, search for the current task.
+		 */
+		if (!pre_remove)
+			t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 	dax_entry_t cookie;
 	struct page *page;
 	size_t end = index + count;
+	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
 
 	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
 
@@ -1799,9 +1807,14 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		if (!page)
 			goto unlock;
 
-		SetPageHWPoison(page);
+		if (!pre_remove)
+			SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		/*
+		 * The pre_remove case is revoking access, the memory is still
+		 * good and could theoretically be put back into service.
+		 */
+		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock:
-- 
2.42.0
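
For readers following the shutdown logic at the end of
xfs_dax_notify_ddev_failure() in this version, here is a small userspace
sketch of the decision, not the kernel code.  `toy_pick_shutdown()`,
`enum toy_shutdown`, `TOY_MF_MEM_PRE_REMOVE` and `TOY_EFSCORRUPTED` are
made-up names, and the numeric value of `TOY_EFSCORRUPTED` is only a
placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MF_MEM_PRE_REMOVE	(1 << 7)	/* same bit as in the patch */
#define TOY_EFSCORRUPTED	117		/* placeholder value for the sketch */

/* Hypothetical names for the three possible outcomes. */
enum toy_shutdown {
	TOY_SHUTDOWN_NONE,
	TOY_SHUTDOWN_FORCE_UMOUNT,
	TOY_SHUTDOWN_CORRUPT_ONDISK,
};

/*
 * Mirrors the ordering of the v15 hunk: the pre-remove check comes first,
 * so an rmap query error or want_shutdown never turns a device removal
 * into a "corrupt on disk" shutdown.  *error is left untouched in that
 * case and is only promoted to -TOY_EFSCORRUPTED on the failure path.
 */
static enum toy_shutdown
toy_pick_shutdown(int mf_flags, bool want_shutdown, int *error)
{
	assert(error != NULL);

	if (mf_flags & TOY_MF_MEM_PRE_REMOVE)
		return TOY_SHUTDOWN_FORCE_UMOUNT;

	if (*error || want_shutdown) {
		if (!*error)
			*error = -TOY_EFSCORRUPTED;
		return TOY_SHUTDOWN_CORRUPT_ONDISK;
	}

	return TOY_SHUTDOWN_NONE;
}
```

In v14 the two branches were in the opposite order, so an error during the
rmap query took the CORRUPT path even for a pre-remove event.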


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* RE: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-09-28 10:32     ` [PATCH v15] " Shiyang Ruan
@ 2023-09-29 18:31       ` Dan Williams
  2023-10-01  1:43       ` kernel test robot
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2023-09-29 18:31 UTC (permalink / raw)
  To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof, chandanbabu

Shiyang Ruan wrote:
> ====
> Changes since v14:
>  1. added/fixed code comments per Dan's comments
> ====
> 
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
>  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
>   1. device has gone but mount point still exists, and umount will fail
>        with "target is busy"
>   2. programs will hang and cannot be killed
>   3. may crash with NULL pointer dereference
> 
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
> 
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created.  Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area.  Make sure all files and processes(not only the current progress)
> are handled correctly.  Also drop the cache of associated files before
> pmem is removed.
> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>

This version addresses my feedback; you can upgrade that Acked-by: to

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-09-28 10:32     ` [PATCH v15] " Shiyang Ruan
  2023-09-29 18:31       ` Dan Williams
@ 2023-10-01  1:43       ` kernel test robot
  2023-10-02 11:57         ` Shiyang Ruan
  2023-10-20  9:56       ` Chandan Babu R
  2023-10-23  7:20       ` [PATCH v15.1] " Shiyang Ruan
  3 siblings, 1 reply; 37+ messages in thread
From: kernel test robot @ 2023-10-01  1:43 UTC (permalink / raw)
  To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
  Cc: llvm, oe-kbuild-all, dan.j.williams, willy, jack, akpm, djwong,
	mcgrof, chandanbabu

Hi Shiyang,

kernel test robot noticed the following build errors:



url:    https://github.com/intel-lab-lkp/linux/commits/UPDATE-20230928-183310/Shiyang-Ruan/xfs-fix-the-calculation-for-end-and-length/20230629-161913
base:   the 2th patch of https://lore.kernel.org/r/20230629081651.253626-3-ruansy.fnst%40fujitsu.com
patch link:    https://lore.kernel.org/r/20230928103227.250550-1-ruansy.fnst%40fujitsu.com
patch subject: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310010955.feI4HCwZ-lkp@intel.com/

All errors (new ones prefixed by >>):

>> fs/xfs/xfs_notify_failure.c:127:27: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
           error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
                                    ^
   fs/xfs/xfs_notify_failure.c:143:26: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
                   error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
                                          ^
>> fs/xfs/xfs_notify_failure.c:153:17: error: use of undeclared identifier 'FREEZE_HOLDER_USERSPACE'
           thaw_super(sb, FREEZE_HOLDER_USERSPACE);
                          ^
   3 errors generated.


vim +/FREEZE_HOLDER_KERNEL +127 fs/xfs/xfs_notify_failure.c

   119	
   120	static int
   121	xfs_dax_notify_failure_freeze(
   122		struct xfs_mount	*mp)
   123	{
   124		struct super_block	*sb = mp->m_super;
   125		int			error;
   126	
 > 127		error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
   128		if (error)
   129			xfs_emerg(mp, "already frozen by kernel, err=%d", error);
   130	
   131		return error;
   132	}
   133	
   134	static void
   135	xfs_dax_notify_failure_thaw(
   136		struct xfs_mount	*mp,
   137		bool			kernel_frozen)
   138	{
   139		struct super_block	*sb = mp->m_super;
   140		int			error;
   141	
   142		if (kernel_frozen) {
   143			error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
   144			if (error)
   145				xfs_emerg(mp, "still frozen after notify failure, err=%d",
   146					error);
   147		}
   148	
   149		/*
   150		 * Also thaw userspace call anyway because the device is about to be
   151		 * removed immediately.
   152		 */
 > 153		thaw_super(sb, FREEZE_HOLDER_USERSPACE);
   154	}
   155	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-10-01  1:43       ` kernel test robot
@ 2023-10-02 11:57         ` Shiyang Ruan
  0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-10-02 11:57 UTC (permalink / raw)
  To: kernel test robot
  Cc: llvm, oe-kbuild-all, dan.j.williams, willy, jack, akpm, djwong,
	mcgrof, chandanbabu, linux-fsdevel, nvdimm, linux-xfs, linux-mm



On 2023/10/1 9:43, kernel test robot wrote:
> Hi Shiyang,
> 
> kernel test robot noticed the following build errors:
> 
> 
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/UPDATE-20230928-183310/Shiyang-Ruan/xfs-fix-the-calculation-for-end-and-length/20230629-161913
> base:   the 2th patch of https://lore.kernel.org/r/20230629081651.253626-3-ruansy.fnst%40fujitsu.com
> patch link:    https://lore.kernel.org/r/20230928103227.250550-1-ruansy.fnst%40fujitsu.com
> patch subject: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
> config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/config)
> compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202310010955.feI4HCwZ-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
>>> fs/xfs/xfs_notify_failure.c:127:27: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
>             error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>                                      ^
>     fs/xfs/xfs_notify_failure.c:143:26: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
>                     error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>                                            ^
>>> fs/xfs/xfs_notify_failure.c:153:17: error: use of undeclared identifier 'FREEZE_HOLDER_USERSPACE'
>             thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>                            ^
>     3 errors generated.
> 

The two enum values were introduced by 880b9577855e ("fs: distinguish
between user initiated freeze and kernel initiated freeze"), which is in
v6.6-rc1.  I also compiled my patch on top of v6.6-rc1 with your config
file, and it built with no errors.

So, which kernel version were you testing?
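
For reference, a minimal userspace sketch of how the two holders are
expected to interact after that commit.  It is only a model, not kernel
code: struct sb_model, model_freeze(), model_thaw() and model_frozen()
are hypothetical names, and only the two enum values are meant to match
include/linux/fs.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Same two values as enum freeze_holder in v6.6-rc1 and later. */
enum freeze_holder {
	FREEZE_HOLDER_KERNEL	= (1U << 0),
	FREEZE_HOLDER_USERSPACE	= (1U << 1),
};

/* Hypothetical stand-in for the freeze state kept in the superblock. */
struct sb_model {
	unsigned int holders;	/* bitmask of current freeze holders */
};

static bool model_frozen(const struct sb_model *sb)
{
	return sb->holders != 0;
}

/* Each holder may freeze once; a second attempt by the same holder fails. */
static int model_freeze(struct sb_model *sb, enum freeze_holder who)
{
	if (sb->holders & who)
		return -EBUSY;
	sb->holders |= who;
	return 0;
}

/* A holder can only drop its own freeze; the fs thaws when none remain. */
static int model_thaw(struct sb_model *sb, enum freeze_holder who)
{
	if (!(sb->holders & who))
		return -EINVAL;
	sb->holders &= ~(unsigned int)who;
	return 0;
}

static void demo_freeze_holders(void)
{
	struct sb_model sb = { 0 };

	assert(model_freeze(&sb, FREEZE_HOLDER_USERSPACE) == 0);
	assert(model_freeze(&sb, FREEZE_HOLDER_KERNEL) == 0);
	assert(model_freeze(&sb, FREEZE_HOLDER_KERNEL) == -EBUSY);
	assert(model_thaw(&sb, FREEZE_HOLDER_KERNEL) == 0);
	assert(model_frozen(&sb));	/* userspace still holds its freeze */
	assert(model_thaw(&sb, FREEZE_HOLDER_USERSPACE) == 0);
	assert(!model_frozen(&sb));
}
```

In this model a freeze taken by userspace keeps the filesystem frozen
after the kernel holder is dropped, which is why the thaw helper in the
patch releases both holders before the device goes away.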


--
Thanks,
Ruan.

> 
> vim +/FREEZE_HOLDER_KERNEL +127 fs/xfs/xfs_notify_failure.c
> 
>     119	
>     120	static int
>     121	xfs_dax_notify_failure_freeze(
>     122		struct xfs_mount	*mp)
>     123	{
>     124		struct super_block	*sb = mp->m_super;
>     125		int			error;
>     126	
>   > 127		error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>     128		if (error)
>     129			xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>     130	
>     131		return error;
>     132	}
>     133	
>     134	static void
>     135	xfs_dax_notify_failure_thaw(
>     136		struct xfs_mount	*mp,
>     137		bool			kernel_frozen)
>     138	{
>     139		struct super_block	*sb = mp->m_super;
>     140		int			error;
>     141	
>     142		if (kernel_frozen) {
>     143			error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>     144			if (error)
>     145				xfs_emerg(mp, "still frozen after notify failure, err=%d",
>     146					error);
>     147		}
>     148	
>     149		/*
>     150		 * Also thaw userspace call anyway because the device is about to be
>     151		 * removed immediately.
>     152		 */
>   > 153		thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>     154	}
>     155	
> 

* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-09-28 10:32     ` [PATCH v15] " Shiyang Ruan
  2023-09-29 18:31       ` Dan Williams
  2023-10-01  1:43       ` kernel test robot
@ 2023-10-20  9:56       ` Chandan Babu R
  2023-10-20 15:40         ` Darrick J. Wong
  2023-10-23  7:20       ` [PATCH v15.1] " Shiyang Ruan
  3 siblings, 1 reply; 37+ messages in thread
From: Chandan Babu R @ 2023-10-20  9:56 UTC (permalink / raw)
  To: akpm
  Cc: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
	dan.j.williams, willy, jack, djwong, mcgrof

On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
> ====
> Changes since v14:
>  1. added/fixed code comments per Dan's comments
> ====
>
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
>  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
>   1. device has gone but mount point still exists, and umount will fail
>        with "target is busy"
>   2. programs will hang and cannot be killed
>   3. may crash with NULL pointer dereference
>
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
>
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
>
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created.  Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area.  Make sure all files and processes(not only the current progress)
> are handled correctly.  Also drop the cache of associated files before
> pmem is removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>

Hi Andrew,

Shiyang had indicated that this patch has been added to
akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
that branch.

I am about to start collecting XFS patches for v6.7 cycle. Please let me know
if you have any objections with me taking this patch via the XFS tree.

-- 
Chandan

* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-10-20  9:56       ` Chandan Babu R
@ 2023-10-20 15:40         ` Darrick J. Wong
  2023-10-23  6:40           ` Chandan Babu R
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-10-20 15:40 UTC (permalink / raw)
  To: Chandan Babu R
  Cc: akpm, Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
	dan.j.williams, willy, jack, mcgrof

On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
> > ====
> > Changes since v14:
> >  1. added/fixed code comments per Dan's comments
> > ====
> >
> > Now, if we suddenly remove a PMEM device(by calling unbind) which
> > contains FSDAX while programs are still accessing data in this device,
> > e.g.:
> > ```
> >  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> >  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> >  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> > ```
> > it could come into an unacceptable state:
> >   1. device has gone but mount point still exists, and umount will fail
> >        with "target is busy"
> >   2. programs will hang and cannot be killed
> >   3. may crash with NULL pointer dereference
> >
> > To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> > are going to remove the whole device, and make sure all related processes
> > could be notified so that they could end up gracefully.
> >
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > on it to unmap all files in use, and notify processes who are using
> > those files.
> >
> > Call trace:
> > trigger unbind
> >  -> unbind_store()
> >   -> ... (skip)
> >    -> devres_release_all()
> >     -> kill_dax()
> >      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> >       -> xfs_dax_notify_failure()
> >       `-> freeze_super()             // freeze (kernel call)
> >       `-> do xfs rmap
> >       ` -> mf_dax_kill_procs()
> >       `  -> collect_procs_fsdax()    // all associated processes
> >       `  -> unmap_and_kill()
> >       ` -> invalidate_inode_pages2_range() // drop file's cache
> >       `-> thaw_super()               // thaw (both kernel & user call)
> >
> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> > new dax mapping from being created.  Do not shutdown filesystem directly
> > if configuration is not supported, or if failure range includes metadata
> > area.  Make sure all files and processes(not only the current progress)
> > are handled correctly.  Also drop the cache of associated files before
> > pmem is removed.
> >
> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > Acked-by: Dan Williams <dan.j.williams@intel.com>
> 
> Hi Andrew,
> 
> Shiyang had indicated that this patch has been added to
> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
> that branch.
> 
> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
> if you have any objections with me taking this patch via the XFS tree.

V15 was dropped from his tree on 28 Sept., you might as well pull it
into your own tree for 6.7.  It's been testing fine on my trees for the
past 3 weeks.

https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/

--D

> 
> -- 
> Chandan

* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-10-20 15:40         ` Darrick J. Wong
@ 2023-10-23  6:40           ` Chandan Babu R
  2023-10-23  7:26             ` Shiyang Ruan
  0 siblings, 1 reply; 37+ messages in thread
From: Chandan Babu R @ 2023-10-23  6:40 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: akpm, Darrick J. Wong, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
	dan.j.williams, willy, jack, mcgrof


On Fri, Oct 20, 2023 at 08:40:09 AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
>> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
>> > ====
>> > Changes since v14:
>> >  1. added/fixed code comments per Dan's comments
>> > ====
>> >
>> > Now, if we suddenly remove a PMEM device(by calling unbind) which
>> > contains FSDAX while programs are still accessing data in this device,
>> > e.g.:
>> > ```
>> >  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>> >  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>> >  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>> > ```
>> > it could come into an unacceptable state:
>> >   1. device has gone but mount point still exists, and umount will fail
>> >        with "target is busy"
>> >   2. programs will hang and cannot be killed
>> >   3. may crash with NULL pointer dereference
>> >
>> > To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>> > are going to remove the whole device, and make sure all related processes
>> > could be notified so that they could end up gracefully.
>> >
>> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> > dev_pagemap_failure()"[1].  With the help of dax_holder and
>> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> > on it to unmap all files in use, and notify processes who are using
>> > those files.
>> >
>> > Call trace:
>> > trigger unbind
>> >  -> unbind_store()
>> >   -> ... (skip)
>> >    -> devres_release_all()
>> >     -> kill_dax()
>> >      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>> >       -> xfs_dax_notify_failure()
>> >       `-> freeze_super()             // freeze (kernel call)
>> >       `-> do xfs rmap
>> >       ` -> mf_dax_kill_procs()
>> >       `  -> collect_procs_fsdax()    // all associated processes
>> >       `  -> unmap_and_kill()
>> >       ` -> invalidate_inode_pages2_range() // drop file's cache
>> >       `-> thaw_super()               // thaw (both kernel & user call)
>> >
>> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> > event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> > new dax mapping from being created.  Do not shutdown filesystem directly
>> > if configuration is not supported, or if failure range includes metadata
>> > area.  Make sure all files and processes(not only the current progress)
>> > are handled correctly.  Also drop the cache of associated files before
>> > pmem is removed.
>> >
>> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>> >
>> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>> > Acked-by: Dan Williams <dan.j.williams@intel.com>
>> 
>> Hi Andrew,
>> 
>> Shiyang had indicated that this patch has been added to
>> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
>> that branch.
>> 
>> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
>> if you have any objections with me taking this patch via the XFS tree.
>
> V15 was dropped from his tree on 28 Sept., you might as well pull it
> into your own tree for 6.7.  It's been testing fine on my trees for the
> past 3 weeks.
>
> https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/

Shiyang, this patch does not apply cleanly on v6.6-rc7. Can you please rebase
the patch on v6.6-rc7 and send it to the mailing list?

-- 
Chandan

* [PATCH v15.1] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-09-28 10:32     ` [PATCH v15] " Shiyang Ruan
                         ` (2 preceding siblings ...)
  2023-10-20  9:56       ` Chandan Babu R
@ 2023-10-23  7:20       ` Shiyang Ruan
  3 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-10-23  7:20 UTC (permalink / raw)
  To: linux-fsdevel, nvdimm, linux-xfs, linux-mm, chandanbabu
  Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof

Changes since v15:
 1. Rebased on v6.6-rc7

Currently, if a PMEM device containing an FSDAX filesystem is suddenly
removed (by calling unbind) while programs are still accessing data on
it, e.g.:
```
 $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
 # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
 echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
the system can end up in an unacceptable state:
  1. the device is gone but the mount point still exists, and umount
       fails with "target is busy"
  2. programs hang and cannot be killed
  3. the kernel may crash with a NULL pointer dereference

To fix this, introduce a MF_MEM_PRE_REMOVE flag to let the filesystem know
that the whole device is about to be removed, and make sure all related
processes are notified so that they can exit gracefully.

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of the dax_holder and
->notify_failure() mechanisms, the pmem driver is able to ask the
filesystem on it to unmap all files in use and to notify the processes
that are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()
      `-> freeze_super()             // freeze (kernel call)
      `-> do xfs rmap
      ` -> mf_dax_kill_procs()
      `  -> collect_procs_fsdax()    // all associated processes
      `  -> unmap_and_kill()
      ` -> invalidate_inode_pages2_range() // drop file's cache
      `-> thaw_super()               // thaw (both kernel & user call)
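
The (0, U64_MAX) range passed by kill_dax() in the trace above stands for
"the whole device".  A minimal userspace sketch of how the filesystem
side can translate it into a real range is shown here; normalize_range()
and its parameters are hypothetical names, with ddev_start and ddev_bytes
standing in for bt_dax_part_off and bdev_nr_bytes():

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn the (offset, len) pair received by the
 * filesystem into a range on its data device.
 */
static int normalize_range(uint64_t *offset, uint64_t *len,
			   uint64_t ddev_start, uint64_t ddev_bytes)
{
	/* (0, U64_MAX) is the "whole device" notification. */
	if (*offset == 0 && *len == UINT64_MAX) {
		*offset = ddev_start;
		*len = ddev_bytes;
	}

	/* Ignore a range that ends before the filesystem area begins. */
	if (*offset + *len - 1 < ddev_start)
		return -ENXIO;

	return 0;
}

static void demo_normalize_range(void)
{
	const uint64_t bytes = UINT64_C(1) << 20;
	uint64_t offset = 0, len = UINT64_MAX;

	/* Whole-device notification is replaced by the device's own range. */
	assert(normalize_range(&offset, &len, 4096, bytes) == 0);
	assert(offset == 4096 && len == bytes);

	/* A range entirely below the filesystem area is rejected. */
	offset = 0;
	len = 1024;
	assert(normalize_range(&offset, &len, 4096, bytes) == -ENXIO);
}
```

The sketch only covers the whole-device substitution and the first range
check; clamping a partial range to the end of the device is left out.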

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event.  Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
new dax mappings from being created.  Do not shut down the filesystem
directly if the configuration is not supported or if the failure range
includes a metadata area.  Make sure all files and processes (not only the
current process) are handled correctly.  Also drop the cache of the
associated files before the pmem is removed.
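
The shutdown policy described above can be sketched in userspace as
follows.  This is a simplified model rather than the XFS code: enum
shutdown_kind, metadata_wants_shutdown() and pick_shutdown() are
hypothetical names, and only the value of MF_MEM_PRE_REMOVE is taken
from this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Value taken from the enum mf_flags hunk in this patch. */
#define MF_MEM_PRE_REMOVE	(1 << 7)

/* Hypothetical labels for the possible outcomes. */
enum shutdown_kind {
	NO_SHUTDOWN,
	FORCE_UMOUNT,	/* models SHUTDOWN_FORCE_UMOUNT */
	CORRUPT_ONDISK,	/* models SHUTDOWN_CORRUPT_ONDISK */
};

/* A metadata rmap record requests a shutdown only for a real failure. */
static bool metadata_wants_shutdown(int mf_flags)
{
	return !(mf_flags & MF_MEM_PRE_REMOVE);
}

/* Pre-remove always ends in a forced unmount; errors are ignored there. */
static enum shutdown_kind pick_shutdown(int mf_flags, int error,
					bool want_shutdown)
{
	if (mf_flags & MF_MEM_PRE_REMOVE)
		return FORCE_UMOUNT;
	if (error || want_shutdown)
		return CORRUPT_ONDISK;
	return NO_SHUTDOWN;
}

static void demo_pick_shutdown(void)
{
	/* Device removal: metadata in range is not treated as corruption. */
	assert(!metadata_wants_shutdown(MF_MEM_PRE_REMOVE));
	assert(pick_shutdown(MF_MEM_PRE_REMOVE, -5, false) == FORCE_UMOUNT);

	/* Real memory failure hitting metadata: corrupt shutdown. */
	assert(metadata_wants_shutdown(0));
	assert(pick_shutdown(0, 0, true) == CORRUPT_ONDISK);

	/* Real memory failure handled without errors: keep running. */
	assert(pick_shutdown(0, 0, false) == NO_SHUTDOWN);
}
```

Keeping the pre-remove branch first mirrors the idea that a removal is
not a corruption event, so the on-disk state is not marked corrupt.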

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/super.c         |   3 +-
 fs/xfs/xfs_notify_failure.c | 108 ++++++++++++++++++++++++++++++++++--
 include/linux/mm.h          |   1 +
 mm/memory-failure.c         |  21 +++++--
 4 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..f4b635526345 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index a7daa522e00f..fa50e5308292 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
 	struct xfs_mount		*mp = cur->bc_mp;
 	struct xfs_inode		*ip;
 	struct xfs_failure_info		*notify = data;
+	struct address_space		*mapping;
+	pgoff_t				pgoff;
+	unsigned long			pgcnt;
 	int				error = 0;
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* Continue the query because this isn't a failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -92,14 +99,60 @@ xfs_dax_failure_fn(
 		return 0;
 	}
 
-	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
-				  xfs_failure_pgoff(mp, rec, notify),
-				  xfs_failure_pgcnt(mp, rec, notify),
-				  notify->mf_flags);
+	mapping = VFS_I(ip)->i_mapping;
+	pgoff = xfs_failure_pgoff(mp, rec, notify);
+	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+	/* Continue the rmap query if the inode isn't a dax file. */
+	if (dax_mapping(mapping))
+		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+					  notify->mf_flags);
+
+	/* Invalidate the cache in dax pages. */
+	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+		invalidate_inode_pages2_range(mapping, pgoff,
+					      pgoff + pgcnt - 1);
+
 	xfs_irele(ip);
 	return error;
 }
 
+static int
+xfs_dax_notify_failure_freeze(
+	struct xfs_mount	*mp)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+	if (error)
+		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+	return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+	struct xfs_mount	*mp,
+	bool			kernel_frozen)
+{
+	struct super_block	*sb = mp->m_super;
+	int			error;
+
+	if (kernel_frozen) {
+		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+		if (error)
+			xfs_emerg(mp, "still frozen after notify failure, err=%d",
+				error);
+	}
+
+	/*
+	 * Also thaw userspace call anyway because the device is about to be
+	 * removed immediately.
+	 */
+	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
 static int
 xfs_dax_notify_ddev_failure(
 	struct xfs_mount	*mp,
@@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
 	struct xfs_btree_cur	*cur = NULL;
 	struct xfs_buf		*agf_bp = NULL;
 	int			error = 0;
+	bool			kernel_frozen = false;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
 	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
 	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp,
 							     daddr + bblen - 1);
 	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "Device is about to be removed!");
+		/*
+		 * Freeze fs to prevent new mappings from being created.
+		 * - Keep going if others already hold the kernel freeze.
+		 * - Keep going on other errors too, because this device is
+		 *   starting to fail.
+		 * - If the kernel freeze is taken successfully here, thaw it
+		 *   here as well at the end.
+		 */
+		kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
+	}
+
 	error = xfs_trans_alloc_empty(mp, &tp);
 	if (error)
-		return error;
+		goto out;
 
 	for (; agno <= end_agno; agno++) {
 		struct xfs_rmap_irec	ri_low = { };
@@ -165,11 +232,26 @@ xfs_dax_notify_ddev_failure(
 	}
 
 	xfs_trans_cancel(tp);
-	if (error || notify.want_shutdown) {
+
+	/*
+	 * Shutdown fs from a force umount in pre-remove case which won't fail,
+	 * so errors can be ignored.  Otherwise, shutdown the filesystem with
+	 * CORRUPT flag if error occurred or notify.want_shutdown was set during
+	 * RMAP querying.
+	 */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+	else if (error || notify.want_shutdown) {
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
 	}
+
+out:
+	/* Thaw the fs if it has been frozen before. */
+	if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
 	return error;
 }
 
@@ -197,6 +279,14 @@ xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		/*
+		 * In the pre-remove case the failure notification is attempting
+		 * to trigger a force unmount.  The expectation is that the
+		 * device is still present, but its removal is in progress and
+		 * can not be cancelled, proceed with accessing the log device.
+		 */
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -210,6 +300,12 @@ xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device. */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..385eee0d05a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3831,6 +3831,7 @@ enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c88489..6e43ae369fef 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -679,7 +679,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
  */
 static void collect_procs_fsdax(struct page *page,
 		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+		struct list_head *to_kill, bool pre_remove)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -687,8 +687,15 @@ static void collect_procs_fsdax(struct page *page,
 	i_mmap_lock_read(mapping);
 	rcu_read_lock();
 	for_each_process(tsk) {
-		struct task_struct *t = task_early_kill(tsk, true);
+		struct task_struct *t = tsk;
 
+		/*
+		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+		 * the current may not be the one accessing the fsdax page.
+		 * Otherwise, search for the current task.
+		 */
+		if (!pre_remove)
+			t = task_early_kill(tsk, true);
 		if (!t)
 			continue;
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1792,6 +1799,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 	dax_entry_t cookie;
 	struct page *page;
 	size_t end = index + count;
+	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
 
 	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
 
@@ -1803,9 +1811,14 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		if (!page)
 			goto unlock;
 
-		SetPageHWPoison(page);
+		if (!pre_remove)
+			SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		/*
+		 * The pre_remove case is revoking access, the memory is still
+		 * good and could theoretically be put back into service.
+		 */
+		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock:
-- 
2.42.0


* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-10-23  6:40           ` Chandan Babu R
@ 2023-10-23  7:26             ` Shiyang Ruan
  2023-10-23 12:21               ` Chandan Babu R
  0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-10-23  7:26 UTC (permalink / raw)
  To: Chandan Babu R
  Cc: akpm, Darrick J. Wong, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
	dan.j.williams, willy, jack, mcgrof



On 2023/10/23 14:40, Chandan Babu R wrote:
> 
> On Fri, Oct 20, 2023 at 08:40:09 AM -0700, Darrick J. Wong wrote:
>> On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
>>> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
>>>> ====
>>>> Changes since v14:
>>>>   1. added/fixed code comments per Dan's comments
>>>> ====
>>>>
>>>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>>>> contains FSDAX while programs are still accessing data in this device,
>>>> e.g.:
>>>> ```
>>>>   $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>>>   # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>>>   echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>>>> ```
>>>> it could come into an unacceptable state:
>>>>    1. device has gone but mount point still exists, and umount will fail
>>>>         with "target is busy"
>>>>    2. programs will hang and cannot be killed
>>>>    3. may crash with NULL pointer dereference
>>>>
>>>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>>>> are going to remove the whole device, and make sure all related processes
>>>> could be notified so that they could end up gracefully.
>>>>
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> on it to unmap all files in use, and notify processes who are using
>>>> those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>>   -> unbind_store()
>>>>    -> ... (skip)
>>>>     -> devres_release_all()
>>>>      -> kill_dax()
>>>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>>        -> xfs_dax_notify_failure()
>>>>        `-> freeze_super()             // freeze (kernel call)
>>>>        `-> do xfs rmap
>>>>        ` -> mf_dax_kill_procs()
>>>>        `  -> collect_procs_fsdax()    // all associated processes
>>>>        `  -> unmap_and_kill()
>>>>        ` -> invalidate_inode_pages2_range() // drop file's cache
>>>>        `-> thaw_super()               // thaw (both kernel & user call)
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>> new dax mapping from being created.  Do not shutdown filesystem directly
>>>> if configuration is not supported, or if failure range includes metadata
>>>> area.  Make sure all files and processes(not only the current progress)
>>>> are handled correctly.  Also drop the cache of associated files before
>>>> pmem is removed.
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>>> Acked-by: Dan Williams <dan.j.williams@intel.com>
>>>
>>> Hi Andrew,
>>>
>>> Shiyang had indicated that this patch has been added to
>>> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
>>> that branch.
>>>
>>> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
>>> if you have any objections with me taking this patch via the XFS tree.
>>
>> V15 was dropped from his tree on 28 Sept., you might as well pull it
>> into your own tree for 6.7.  It's been testing fine on my trees for the
>> past 3 weeks.
>>
>> https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/
> 
> Shiyang, this patch does not apply cleanly on v6.6-rc7. Can you please rebase
> the patch on v6.6-rc7 and send it to the mailing list?

Sure.  I have rebased it and sent a v15.1.  Please check it:

https://lore.kernel.org/linux-xfs/20231023072046.1626474-1-ruansy.fnst@fujitsu.com/


--
Thanks,
Ruan.

> 

* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
  2023-10-23  7:26             ` Shiyang Ruan
@ 2023-10-23 12:21               ` Chandan Babu R
  0 siblings, 0 replies; 37+ messages in thread
From: Chandan Babu R @ 2023-10-23 12:21 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: akpm, Darrick J. Wong, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
	dan.j.williams, willy, jack, mcgrof

On Mon, Oct 23, 2023 at 03:26:52 PM +0800, Shiyang Ruan wrote:
> On 2023/10/23 14:40, Chandan Babu R wrote:
>> On Fri, Oct 20, 2023 at 08:40:09 AM -0700, Darrick J. Wong wrote:
>>> On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
>>>> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
>>>>> ====
>>>>> Changes since v14:
>>>>>   1. added/fixed code comments per Dan's comments
>>>>> ====
>>>>>
>>>>> Now, if we suddenly remove a PMEM device (by calling unbind) which
>>>>> contains FSDAX while programs are still accessing data on this device,
>>>>> e.g.:
>>>>> ```
>>>>>   $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>>>>   # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>>>>   echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>>>>> ```
>>>>> the system can end up in an unacceptable state:
>>>>>    1. the device is gone but the mount point still exists, and umount
>>>>>         fails with "target is busy"
>>>>>    2. programs hang and cannot be killed
>>>>>    3. the kernel may crash with a NULL pointer dereference
>>>>>
>>>>> To fix this, we introduce an MF_MEM_PRE_REMOVE flag to tell the filesystem
>>>>> that the whole device is about to be removed, and make sure all related
>>>>> processes are notified so that they can terminate gracefully.
>>>>>
>>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>>> dev_pagemap_failure()"[1].  With the help of the dax_holder and
>>>>> ->notify_failure() mechanism, the pmem driver is able to ask the
>>>>> filesystem on it to unmap all files in use, and notify the processes that
>>>>> are using those files.
>>>>>
>>>>> Call trace:
>>>>> trigger unbind
>>>>>   -> unbind_store()
>>>>>    -> ... (skip)
>>>>>     -> devres_release_all()
>>>>>      -> kill_dax()
>>>>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>>>        -> xfs_dax_notify_failure()
>>>>>        `-> freeze_super()             // freeze (kernel call)
>>>>>        `-> do xfs rmap
>>>>>        ` -> mf_dax_kill_procs()
>>>>>        `  -> collect_procs_fsdax()    // all associated processes
>>>>>        `  -> unmap_and_kill()
>>>>>        ` -> invalidate_inode_pages2_range() // drop file's cache
>>>>>        `-> thaw_super()               // thaw (both kernel & user call)
>>>>>
>>>>> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
>>>>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>>> new dax mappings from being created.  Do not shut down the filesystem
>>>>> directly if the configuration is not supported, or if the failure range
>>>>> includes the metadata area.  Make sure all files and processes (not only
>>>>> the current process) are handled correctly.  Also drop the cache of
>>>>> associated files before pmem is removed.
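
The ordering in the call trace above can be modelled in userspace as a
rough sketch.  This is not the kernel code: sketch_notify_failure(),
step(), trace_buf and the SKETCH_MF_MEM_PRE_REMOVE bit value are
hypothetical names made up for illustration, and the choice to skip
freeze, cache invalidation and thaw when the flag is absent is an
assumption of the sketch rather than something the patch description
states.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder bit only; the real flag is defined in include/linux/mm.h. */
#define SKETCH_MF_MEM_PRE_REMOVE (1u << 0)

/* Records the order in which the modelled steps run, e.g. "a>b>c". */
static char trace_buf[128];

static void step(const char *name)
{
	/* Room for an optional '>' separator, the name, and the NUL. */
	assert(strlen(trace_buf) + strlen(name) + 2 <= sizeof(trace_buf));
	if (trace_buf[0] != '\0')
		strcat(trace_buf, ">");
	strcat(trace_buf, name);
}

/* Hypothetical stand-in for the dax holder's ->notify_failure() callback. */
static int sketch_notify_failure(uint64_t offset, uint64_t len,
				 unsigned int mf_flags)
{
	int pre_remove = (mf_flags & SKETCH_MF_MEM_PRE_REMOVE) != 0;

	(void)offset;	/* the model ignores the failure range */
	(void)len;
	trace_buf[0] = '\0';

	if (pre_remove)
		step("freeze");		/* block new dax mappings */
	step("rmap");			/* find owners of the range */
	step("kill_procs");		/* unmap and notify processes */
	if (pre_remove) {
		step("invalidate");	/* drop the files' page cache */
		step("thaw");		/* release the exclusive freeze */
	}
	return 0;
}
```

Calling sketch_notify_failure(0, UINT64_MAX, SKETCH_MF_MEM_PRE_REMOVE)
leaves "freeze>rmap>kill_procs>invalidate>thaw" in trace_buf, which is the
same order as the trace: the freeze brackets the whole rmap walk so that no
new mapping can appear between the walk and the cache drop.
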
>>>>>
>>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>>>>
>>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>>>> Acked-by: Dan Williams <dan.j.williams@intel.com>
>>>>
>>>> Hi Andrew,
>>>>
>>>> Shiyang had indicated that this patch has been added to
>>>> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
>>>> that branch.
>>>>
>>>> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
>>>> if you have any objections with me taking this patch via the XFS tree.
>>>
>>> V15 was dropped from his tree on 28 Sept., you might as well pull it
>>> into your own tree for 6.7.  It's been testing fine on my trees for the
>>> past 3 weeks.
>>>
>>> https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/
>>
>> Shiyang, this patch does not apply cleanly on v6.6-rc7. Can you please rebase
>> the patch on v6.6-rc7 and send it to the mailing list?
>
> Sure.  I have rebased it and sent a v15.1.  Please check it:
>
> https://lore.kernel.org/linux-xfs/20231023072046.1626474-1-ruansy.fnst@fujitsu.com/

Thank you. I have applied the patch to my local Git tree.

-- 
Chandan


* Re: [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2023-06-29  8:16 [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
  2023-06-29  8:16 ` [PATCH v12 1/2] xfs: fix the calculation for "end" and "length" Shiyang Ruan
  2023-06-29  8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2024-01-11 22:24 ` Bill O'Donnell
  2024-01-12  1:56   ` Shiyang Ruan
  2 siblings, 1 reply; 37+ messages in thread
From: Bill O'Donnell @ 2024-01-11 22:24 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, djwong, mcgrof

On Thu, Jun 29, 2023 at 04:16:49PM +0800, Shiyang Ruan wrote:
> This patchset adds graceful unbind support for pmem.
> Patch1 corrects the calculation of length and end of a given range.
> Patch2 introduces a new flag called MF_MEM_REMOVE, to let the dax holder
> know it is a remove event.  With the help of the notify_failure mechanism,
> we are able to shut down the filesystem on the pmem gracefully.

What is the status of this patch?
Thanks-
Bill


> 
> Changes since v11:
>  Patch1:
>   1. correct the count calculation in xfs_failure_pgcnt().
>       (was a wrong fix in v11)
>  Patch2:
>   1. use the new exclusive freeze_super/thaw_super API, to make sure the
>       unbind process won't be disturbed by any other freezer.
> 
> Shiyang Ruan (2):
>   xfs: fix the calculation for "end" and "length"
>   mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
> 
>  drivers/dax/super.c         |  3 +-
>  fs/xfs/xfs_notify_failure.c | 95 +++++++++++++++++++++++++++++++++----
>  include/linux/mm.h          |  1 +
>  mm/memory-failure.c         | 17 +++++--
>  4 files changed, 101 insertions(+), 15 deletions(-)
> 
> -- 
> 2.40.1
> 



* Re: [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2024-01-11 22:24 ` [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE " Bill O'Donnell
@ 2024-01-12  1:56   ` Shiyang Ruan
  0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2024-01-12  1:56 UTC (permalink / raw)
  To: Bill O'Donnell
  Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
	jack, akpm, djwong, mcgrof



On 2024/1/12 6:24, Bill O'Donnell wrote:
> On Thu, Jun 29, 2023 at 04:16:49PM +0800, Shiyang Ruan wrote:
>> This patchset adds graceful unbind support for pmem.
>> Patch1 corrects the calculation of length and end of a given range.
>> Patch2 introduces a new flag called MF_MEM_REMOVE, to let the dax holder
>> know it is a remove event.  With the help of the notify_failure mechanism,
>> we are able to shut down the filesystem on the pmem gracefully.
> 
> What is the status of this patch?

Hi Bill,

This patch has just been merged.  You can find it here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa422b353d212373fb2b2857a5ea5a6fa4876f9c


--
Thanks,
Ruan.

> Thanks-
> Bill
> 
> 
>>
>> Changes since v11:
>>   Patch1:
>>    1. correct the count calculation in xfs_failure_pgcnt().
>>        (was a wrong fix in v11)
>>   Patch2:
>>    1. use the new exclusive freeze_super/thaw_super API, to make sure the
>>        unbind process won't be disturbed by any other freezer.
>>
>> Shiyang Ruan (2):
>>    xfs: fix the calculation for "end" and "length"
>>    mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
>>
>>   drivers/dax/super.c         |  3 +-
>>   fs/xfs/xfs_notify_failure.c | 95 +++++++++++++++++++++++++++++++++----
>>   include/linux/mm.h          |  1 +
>>   mm/memory-failure.c         | 17 +++++--
>>   4 files changed, 101 insertions(+), 15 deletions(-)
>>
>> -- 
>> 2.40.1
>>
> 


end of thread, other threads:[~2024-01-12  1:56 UTC | newest]

Thread overview: 37+ messages
2023-06-29  8:16 [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-06-29  8:16 ` [PATCH v12 1/2] xfs: fix the calculation for "end" and "length" Shiyang Ruan
2023-06-29  8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-06-29 12:02   ` kernel test robot
2023-07-14  9:07   ` Shiyang Ruan
2023-07-14 14:18     ` Darrick J. Wong
2023-07-20  1:50       ` Shiyang Ruan
2023-07-29 10:01         ` Shiyang Ruan
2023-07-29 15:15           ` Darrick J. Wong
2023-07-29 15:15   ` Darrick J. Wong
2023-07-31  9:36     ` Shiyang Ruan
2023-08-01  3:25       ` Darrick J. Wong
2023-08-03 10:44         ` Shiyang Ruan
2023-08-08  0:31   ` Dan Williams
2023-08-23  8:36     ` Shiyang Ruan
2023-08-23  8:17   ` [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE " Shiyang Ruan
2023-08-23 23:36     ` Darrick J. Wong
2023-08-24  9:41       ` Shiyang Ruan
2023-08-24 23:57         ` Darrick J. Wong
2023-08-25  3:52           ` Shiyang Ruan
2023-08-26  0:17             ` Darrick J. Wong
2023-08-28  6:57   ` [PATCH v14] " Shiyang Ruan
2023-08-30 15:34     ` Darrick J. Wong
2023-09-27  8:17     ` Dan Williams
2023-09-27  9:18       ` Shiyang Ruan
2023-09-28 10:32     ` [PATCH v15] " Shiyang Ruan
2023-09-29 18:31       ` Dan Williams
2023-10-01  1:43       ` kernel test robot
2023-10-02 11:57         ` Shiyang Ruan
2023-10-20  9:56       ` Chandan Babu R
2023-10-20 15:40         ` Darrick J. Wong
2023-10-23  6:40           ` Chandan Babu R
2023-10-23  7:26             ` Shiyang Ruan
2023-10-23 12:21               ` Chandan Babu R
2023-10-23  7:20       ` [PATCH v15.1] " Shiyang Ruan
2024-01-11 22:24 ` [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE " Bill O'Donnell
2024-01-12  1:56   ` Shiyang Ruan
