* [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
@ 2023-06-29 8:16 Shiyang Ruan
2023-06-29 8:16 ` [PATCH v12 1/2] xfs: fix the calculation for "end" and "length" Shiyang Ruan
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-06-29 8:16 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof
This patchset adds graceful unbind support for pmem.
Patch1 corrects the calculation of the length and end of a given range.
Patch2 introduces a new flag called MF_MEM_PRE_REMOVE to let the dax holder
know that this is a remove event. With the help of the notify_failure
mechanism, we are able to shut down the filesystem on the pmem gracefully.
Changes since v11:
Patch1:
1. correct the count calculation in xfs_failure_pgcnt().
(was a wrong fix in v11)
Patch2:
1. use the new exclusive freeze_super/thaw_super API to make sure the unbind
process won't be disturbed by any other freezer.
Shiyang Ruan (2):
xfs: fix the calculation for "end" and "length"
mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
drivers/dax/super.c | 3 +-
fs/xfs/xfs_notify_failure.c | 95 +++++++++++++++++++++++++++++++++----
include/linux/mm.h | 1 +
mm/memory-failure.c | 17 +++++--
4 files changed, 101 insertions(+), 15 deletions(-)
--
2.40.1
* [PATCH v12 1/2] xfs: fix the calculation for "end" and "length"
2023-06-29 8:16 [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2023-06-29 8:16 ` Shiyang Ruan
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2024-01-11 22:24 ` [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE " Bill O'Donnell
2 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-06-29 8:16 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof
The value of "end" should be "start + length - 1".
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_notify_failure.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index c4078d0ec108..4a9bbd3fe120 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -114,7 +114,8 @@ xfs_dax_notify_ddev_failure(
int error = 0;
xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
- xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
+ xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
+ daddr + bblen - 1);
xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
error = xfs_trans_alloc_empty(mp, &tp);
@@ -210,7 +211,7 @@ xfs_dax_notify_failure(
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
/* Ignore the range out of filesystem area */
- if (offset + len < ddev_start)
+ if (offset + len - 1 < ddev_start)
return -ENXIO;
if (offset > ddev_end)
return -ENXIO;
@@ -222,8 +223,8 @@ xfs_dax_notify_failure(
len -= ddev_start - offset;
offset = 0;
}
- if (offset + len > ddev_end)
- len -= ddev_end - offset;
+ if (offset + len - 1 > ddev_end)
+ len = ddev_end - offset + 1;
return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
mf_flags);
--
2.40.1
* [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-06-29 8:16 [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-06-29 8:16 ` [PATCH v12 1/2] xfs: fix the calculation for "end" and "length" Shiyang Ruan
@ 2023-06-29 8:16 ` Shiyang Ruan
2023-06-29 12:02 ` kernel test robot
` (5 more replies)
2024-01-11 22:24 ` [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE " Bill O'Donnell
2 siblings, 6 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-06-29 8:16 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of the dax_holder and
->notify_failure() mechanisms, the pmem driver is able to ask the filesystem
on it to unmap all files in use and to notify the processes that are using
those files.
Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()
      `-> freeze_super()             // freeze (kernel call)
      `-> do xfs rmap
      ` -> mf_dax_kill_procs()
      `  -> collect_procs_fsdax()    // all associated processes
      `  -> unmap_and_kill()
      ` -> invalidate_inode_pages2_range() // drop file's cache
      `-> thaw_super()               // thaw (both kernel & user call)
Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event. Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
new dax mappings from being created. Do not shut down the filesystem directly
if the configuration is not supported, or if the failure range includes a
metadata area. Make sure all files and processes (not only the current
process) are handled correctly. Also drop the cache of associated files
before the pmem is removed.
[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
drivers/dax/super.c | 3 +-
fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
include/linux/mm.h | 1 +
mm/memory-failure.c | 17 ++++++--
4 files changed, 96 insertions(+), 11 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..2e1a35e82fce 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
if (dax_dev->holder_data != NULL)
- dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+ dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+ MF_MEM_PRE_REMOVE);
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 4a9bbd3fe120..f6ec56b76db6 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/dax.h>
+#include <linux/fs.h>
struct xfs_failure_info {
xfs_agblock_t startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
struct xfs_mount *mp = cur->bc_mp;
struct xfs_inode *ip;
struct xfs_failure_info *notify = data;
+ struct address_space *mapping;
+ pgoff_t pgoff;
+ unsigned long pgcnt;
int error = 0;
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+ /* Continue the query because this isn't a failure. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
notify->want_shutdown = true;
return 0;
}
@@ -92,14 +99,55 @@ xfs_dax_failure_fn(
return 0;
}
- error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
- xfs_failure_pgoff(mp, rec, notify),
- xfs_failure_pgcnt(mp, rec, notify),
- notify->mf_flags);
+ mapping = VFS_I(ip)->i_mapping;
+ pgoff = xfs_failure_pgoff(mp, rec, notify);
+ pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+ /* Continue the rmap query if the inode isn't a dax file. */
+ if (dax_mapping(mapping))
+ error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+ notify->mf_flags);
+
+ /* Invalidate the cache in dax pages. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ invalidate_inode_pages2_range(mapping, pgoff,
+ pgoff + pgcnt - 1);
+
xfs_irele(ip);
return error;
}
+static void
+xfs_dax_notify_failure_freeze(
+ struct xfs_mount *mp)
+{
+ struct super_block *sb = mp->m_super;
+
+ /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
+ while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
+ // Shall we just wait, or print warning then return -EBUSY?
+ delay(HZ / 10);
+ }
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+ struct xfs_mount *mp)
+{
+ struct super_block *sb = mp->m_super;
+ int error;
+
+ error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+ if (error)
+ xfs_emerg(mp, "still frozen after notify failure, err=%d",
+ error);
+ /*
+ * Also thaw userspace call anyway because the device is about to be
+ * removed immediately.
+ */
+ thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
static int
xfs_dax_notify_ddev_failure(
struct xfs_mount *mp,
@@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
error = xfs_trans_alloc_empty(mp, &tp);
if (error)
- return error;
+ goto out;
for (; agno <= end_agno; agno++) {
struct xfs_rmap_irec ri_low = { };
@@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
}
xfs_trans_cancel(tp);
+
+ /*
+ * Determine how to shutdown the filesystem according to the
+ * error code and flags.
+ */
if (error || notify.want_shutdown) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
if (!error)
error = -EFSCORRUPTED;
- }
+ } else if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
+out:
+ /* Thaw the fs if it was frozen before. */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_dax_notify_failure_thaw(mp);
+
return error;
}
@@ -197,6 +257,8 @@ xfs_dax_notify_failure(
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -210,6 +272,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
+ /* Notify failure on the whole device. */
+ if (offset == 0 && len == U64_MAX) {
+ offset = ddev_start;
+ len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+ }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
@@ -226,6 +294,12 @@ xfs_dax_notify_failure(
if (offset + len - 1 > ddev_end)
len = ddev_end - offset + 1;
+ if (mf_flags & MF_MEM_PRE_REMOVE) {
+ xfs_info(mp, "device is about to be removed!");
+ /* Freeze fs to prevent new mappings from being created. */
+ xfs_dax_notify_failure_freeze(mp);
+ }
+
return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
mf_flags);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..a80c255b88d2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3576,6 +3576,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+ MF_MEM_PRE_REMOVE = 1 << 7,
};
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5b663eca1f29..483b75f2fcfb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
*/
static void collect_procs_fsdax(struct page *page,
struct address_space *mapping, pgoff_t pgoff,
- struct list_head *to_kill)
+ struct list_head *to_kill, bool pre_remove)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
i_mmap_lock_read(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
- struct task_struct *t = task_early_kill(tsk, true);
+ struct task_struct *t = tsk;
+ /*
+ * Search for all tasks while MF_MEM_PRE_REMOVE, because the
+ * current may not be the one accessing the fsdax page.
+ * Otherwise, search for the current task.
+ */
+ if (!pre_remove)
+ t = task_early_kill(tsk, true);
if (!t)
continue;
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
dax_entry_t cookie;
struct page *page;
size_t end = index + count;
+ bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
@@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
if (!page)
goto unlock;
- SetPageHWPoison(page);
+ if (!pre_remove)
+ SetPageHWPoison(page);
- collect_procs_fsdax(page, mapping, index, &to_kill);
+ collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
index, mf_flags);
unlock:
--
2.40.1
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2023-06-29 12:02 ` kernel test robot
2023-07-14 9:07 ` Shiyang Ruan
` (4 subsequent siblings)
5 siblings, 0 replies; 37+ messages in thread
From: kernel test robot @ 2023-06-29 12:02 UTC (permalink / raw)
To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: oe-kbuild-all, dan.j.williams, willy, jack, akpm, djwong, mcgrof
Hi Shiyang,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Shiyang-Ruan/xfs-fix-the-calculation-for-end-and-length/20230629-161913
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230629081651.253626-3-ruansy.fnst%40fujitsu.com
patch subject: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20230629/202306291954.zqVvCUZ5-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce: (https://download.01.org/0day-ci/archive/20230629/202306291954.zqVvCUZ5-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306291954.zqVvCUZ5-lkp@intel.com/
All errors (new ones prefixed by >>):
fs/xfs/xfs_notify_failure.c: In function 'xfs_dax_notify_failure_freeze':
>> fs/xfs/xfs_notify_failure.c:127:33: error: 'FREEZE_HOLDER_KERNEL' undeclared (first use in this function)
127 | while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
| ^~~~~~~~~~~~~~~~~~~~
fs/xfs/xfs_notify_failure.c:127:33: note: each undeclared identifier is reported only once for each function it appears in
>> fs/xfs/xfs_notify_failure.c:127:16: error: too many arguments to function 'freeze_super'
127 | while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
| ^~~~~~~~~~~~
In file included from include/linux/huge_mm.h:8,
from include/linux/mm.h:988,
from fs/xfs/kmem.h:11,
from fs/xfs/xfs_linux.h:24,
from fs/xfs/xfs.h:22,
from fs/xfs/xfs_notify_failure.c:6:
include/linux/fs.h:2289:12: note: declared here
2289 | extern int freeze_super(struct super_block *super);
| ^~~~~~~~~~~~
fs/xfs/xfs_notify_failure.c: In function 'xfs_dax_notify_failure_thaw':
fs/xfs/xfs_notify_failure.c:140:32: error: 'FREEZE_HOLDER_KERNEL' undeclared (first use in this function)
140 | error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
| ^~~~~~~~~~~~~~~~~~~~
>> fs/xfs/xfs_notify_failure.c:140:17: error: too many arguments to function 'thaw_super'
140 | error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
| ^~~~~~~~~~
include/linux/fs.h:2290:12: note: declared here
2290 | extern int thaw_super(struct super_block *super);
| ^~~~~~~~~~
>> fs/xfs/xfs_notify_failure.c:148:24: error: 'FREEZE_HOLDER_USERSPACE' undeclared (first use in this function)
148 | thaw_super(sb, FREEZE_HOLDER_USERSPACE);
| ^~~~~~~~~~~~~~~~~~~~~~~
fs/xfs/xfs_notify_failure.c:148:9: error: too many arguments to function 'thaw_super'
148 | thaw_super(sb, FREEZE_HOLDER_USERSPACE);
| ^~~~~~~~~~
include/linux/fs.h:2290:12: note: declared here
2290 | extern int thaw_super(struct super_block *super);
| ^~~~~~~~~~
vim +/FREEZE_HOLDER_KERNEL +127 fs/xfs/xfs_notify_failure.c
119
120 static void
121 xfs_dax_notify_failure_freeze(
122 struct xfs_mount *mp)
123 {
124 struct super_block *sb = mp->m_super;
125
126 /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> 127 while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
128 // Shall we just wait, or print warning then return -EBUSY?
129 delay(HZ / 10);
130 }
131 }
132
133 static void
134 xfs_dax_notify_failure_thaw(
135 struct xfs_mount *mp)
136 {
137 struct super_block *sb = mp->m_super;
138 int error;
139
> 140 error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
141 if (error)
142 xfs_emerg(mp, "still frozen after notify failure, err=%d",
143 error);
144 /*
145 * Also thaw userspace call anyway because the device is about to be
146 * removed immediately.
147 */
> 148 thaw_super(sb, FREEZE_HOLDER_USERSPACE);
149 }
150
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-06-29 12:02 ` kernel test robot
@ 2023-07-14 9:07 ` Shiyang Ruan
2023-07-14 14:18 ` Darrick J. Wong
2023-07-29 15:15 ` Darrick J. Wong
` (3 subsequent siblings)
5 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-07-14 9:07 UTC (permalink / raw)
To: djwong
Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
jack, akpm, mcgrof
Hi Darrick,
Thanks for applying the 1st patch.
Now, since this patch is based on the new freeze_super()/thaw_super()
API[1], I'd like to ask what the plan for this API is. It seems to have
missed v6.5-rc1.
[1]
https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
--
Thanks,
Ruan.
On 2023/6/29 16:16, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of the dax_holder and
> ->notify_failure() mechanisms, the pmem driver is able to ask the filesystem
> on it to unmap all files in use and to notify the processes that are using
> those files.
>
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
>       `-> freeze_super()             // freeze (kernel call)
>       `-> do xfs rmap
>       ` -> mf_dax_kill_procs()
>       `  -> collect_procs_fsdax()    // all associated processes
>       `  -> unmap_and_kill()
>       ` -> invalidate_inode_pages2_range() // drop file's cache
>       `-> thaw_super()               // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
> event. Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
> new dax mappings from being created. Do not shut down the filesystem directly
> if the configuration is not supported, or if the failure range includes a
> metadata area. Make sure all files and processes (not only the current
> process) are handled correctly. Also drop the cache of associated files
> before the pmem is removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 ++++++--
> 4 files changed, 96 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> return;
>
> if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>
> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..f6ec56b76db6 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>
> #include <linux/mm.h>
> #include <linux/dax.h>
> +#include <linux/fs.h>
>
> struct xfs_failure_info {
> xfs_agblock_t startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> struct xfs_mount *mp = cur->bc_mp;
> struct xfs_inode *ip;
> struct xfs_failure_info *notify = data;
> + struct address_space *mapping;
> + pgoff_t pgoff;
> + unsigned long pgcnt;
> int error = 0;
>
> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Continue the query because this isn't a failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> notify->want_shutdown = true;
> return 0;
> }
> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
> return 0;
> }
>
> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> - xfs_failure_pgoff(mp, rec, notify),
> - xfs_failure_pgcnt(mp, rec, notify),
> - notify->mf_flags);
> + mapping = VFS_I(ip)->i_mapping;
> + pgoff = xfs_failure_pgoff(mp, rec, notify);
> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> + /* Continue the rmap query if the inode isn't a dax file. */
> + if (dax_mapping(mapping))
> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> + notify->mf_flags);
> +
> + /* Invalidate the cache in dax pages. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + invalidate_inode_pages2_range(mapping, pgoff,
> + pgoff + pgcnt - 1);
> +
> xfs_irele(ip);
> return error;
> }
>
> +static void
> +xfs_dax_notify_failure_freeze(
> + struct xfs_mount *mp)
> +{
> + struct super_block *sb = mp->m_super;
> +
> + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> + // Shall we just wait, or print warning then return -EBUSY?
> + delay(HZ / 10);
> + }
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> + struct xfs_mount *mp)
> +{
> + struct super_block *sb = mp->m_super;
> + int error;
> +
> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> + if (error)
> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> + error);
> + /*
> + * Also thaw userspace call anyway because the device is about to be
> + * removed immediately.
> + */
> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +}
> +
> static int
> xfs_dax_notify_ddev_failure(
> struct xfs_mount *mp,
> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>
> error = xfs_trans_alloc_empty(mp, &tp);
> if (error)
> - return error;
> + goto out;
>
> for (; agno <= end_agno; agno++) {
> struct xfs_rmap_irec ri_low = { };
> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
> }
>
> xfs_trans_cancel(tp);
> +
> + /*
> + * Determine how to shutdown the filesystem according to the
> + * error code and flags.
> + */
> if (error || notify.want_shutdown) {
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> if (!error)
> error = -EFSCORRUPTED;
> - }
> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> + /* Thaw the fs if it is freezed before. */
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_dax_notify_failure_thaw(mp);
> +
> return error;
> }
>
> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>
> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> return -EFSCORRUPTED;
> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>
> + /* Notify failure on the whole device. */
> + if (offset == 0 && len == U64_MAX) {
> + offset = ddev_start;
> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> + }
> +
> /* Ignore the range out of filesystem area */
> if (offset + len - 1 < ddev_start)
> return -ENXIO;
> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
> if (offset + len - 1 > ddev_end)
> len = ddev_end - offset + 1;
>
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + /* Freeze fs to prevent new mappings from being created. */
> + xfs_dax_notify_failure_freeze(mp);
> + }
> +
> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> mf_flags);
> }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..a80c255b88d2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3576,6 +3576,7 @@ enum mf_flags {
> MF_UNPOISON = 1 << 4,
> MF_SW_SIMULATED = 1 << 5,
> MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
> };
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5b663eca1f29..483b75f2fcfb 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> */
> static void collect_procs_fsdax(struct page *page,
> struct address_space *mapping, pgoff_t pgoff,
> - struct list_head *to_kill)
> + struct list_head *to_kill, bool pre_remove)
> {
> struct vm_area_struct *vma;
> struct task_struct *tsk;
> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> i_mmap_lock_read(mapping);
> read_lock(&tasklist_lock);
> for_each_process(tsk) {
> - struct task_struct *t = task_early_kill(tsk, true);
> + struct task_struct *t = tsk;
>
> + /*
> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> + * current may not be the one accessing the fsdax page.
> + * Otherwise, search for the current task.
> + */
> + if (!pre_remove)
> + t = task_early_kill(tsk, true);
> if (!t)
> continue;
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> dax_entry_t cookie;
> struct page *page;
> size_t end = index + count;
> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>
> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>
> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> if (!page)
> goto unlock;
>
> - SetPageHWPoison(page);
> + if (!pre_remove)
> + SetPageHWPoison(page);
>
> - collect_procs_fsdax(page, mapping, index, &to_kill);
> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> index, mf_flags);
> unlock:
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-07-14 9:07 ` Shiyang Ruan
@ 2023-07-14 14:18 ` Darrick J. Wong
2023-07-20 1:50 ` Shiyang Ruan
0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-07-14 14:18 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
jack, akpm, mcgrof
On Fri, Jul 14, 2023 at 05:07:58PM +0800, Shiyang Ruan wrote:
> Hi Darrick,
>
> Thanks for applying the 1st patch.
>
> Now, since this patch is based on the new freeze_super()/thaw_super()
> API[1], I'd like to ask what the plan for this API is. It seems to have
> missed v6.5-rc1.
>
> [1] https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
6.6. I intend to push the XFS UBSAN fixes to the list today for review.
Early next week I'll resend the 6.5 rebase of the kernelfreeze series
and push it to vfs-for-next. Some time after that will come large folio
writes.
--D
>
> --
> Thanks,
> Ruan.
>
>
> On 2023/6/29 16:16, Shiyang Ruan wrote:
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1]. With the help of the dax_holder and
> > ->notify_failure() mechanisms, the pmem driver is able to ask the filesystem
> > on it to unmap all files in use and to notify the processes that are using
> > those files.
> >
> > Call trace:
> > trigger unbind
> >  -> unbind_store()
> >   -> ... (skip)
> >    -> devres_release_all()
> >     -> kill_dax()
> >      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> >       -> xfs_dax_notify_failure()
> >       `-> freeze_super()             // freeze (kernel call)
> >       `-> do xfs rmap
> >       ` -> mf_dax_kill_procs()
> >       `  -> collect_procs_fsdax()    // all associated processes
> >       `  -> unmap_and_kill()
> >       ` -> invalidate_inode_pages2_range() // drop file's cache
> >       `-> thaw_super()               // thaw (both kernel & user call)
> >
> > Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
> > event. Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
> > new dax mappings from being created. Do not shut down the filesystem directly
> > if the configuration is not supported, or if the failure range includes a
> > metadata area. Make sure all files and processes (not only the current
> > process) are handled correctly. Also drop the cache of associated files
> > before the pmem is removed.
> >
> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> > drivers/dax/super.c | 3 +-
> > fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
> > include/linux/mm.h | 1 +
> > mm/memory-failure.c | 17 ++++++--
> > 4 files changed, 96 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index c4c4728a36e4..2e1a35e82fce 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > return;
> > if (dax_dev->holder_data != NULL)
> > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > + MF_MEM_PRE_REMOVE);
> > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > synchronize_srcu(&dax_srcu);
> > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > index 4a9bbd3fe120..f6ec56b76db6 100644
> > --- a/fs/xfs/xfs_notify_failure.c
> > +++ b/fs/xfs/xfs_notify_failure.c
> > @@ -22,6 +22,7 @@
> > #include <linux/mm.h>
> > #include <linux/dax.h>
> > +#include <linux/fs.h>
> > struct xfs_failure_info {
> > xfs_agblock_t startblock;
> > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > struct xfs_mount *mp = cur->bc_mp;
> > struct xfs_inode *ip;
> > struct xfs_failure_info *notify = data;
> > + struct address_space *mapping;
> > + pgoff_t pgoff;
> > + unsigned long pgcnt;
> > int error = 0;
> > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > + /* Continue the query because this isn't a failure. */
> > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > + return 0;
> > notify->want_shutdown = true;
> > return 0;
> > }
> > @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
> > return 0;
> > }
> > - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > - xfs_failure_pgoff(mp, rec, notify),
> > - xfs_failure_pgcnt(mp, rec, notify),
> > - notify->mf_flags);
> > + mapping = VFS_I(ip)->i_mapping;
> > + pgoff = xfs_failure_pgoff(mp, rec, notify);
> > + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > +
> > + /* Continue the rmap query if the inode isn't a dax file. */
> > + if (dax_mapping(mapping))
> > + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > + notify->mf_flags);
> > +
> > + /* Invalidate the cache in dax pages. */
> > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > + invalidate_inode_pages2_range(mapping, pgoff,
> > + pgoff + pgcnt - 1);
> > +
> > xfs_irele(ip);
> > return error;
> > }
> > +static void
> > +xfs_dax_notify_failure_freeze(
> > + struct xfs_mount *mp)
> > +{
> > + struct super_block *sb = mp->m_super;
> > +
> > + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> > + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> > + // Shall we just wait, or print warning then return -EBUSY?
> > + delay(HZ / 10);
> > + }
> > +}
> > +
> > +static void
> > +xfs_dax_notify_failure_thaw(
> > + struct xfs_mount *mp)
> > +{
> > + struct super_block *sb = mp->m_super;
> > + int error;
> > +
> > + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > + if (error)
> > + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > + error);
> > + /*
> > + * Also thaw userspace call anyway because the device is about to be
> > + * removed immediately.
> > + */
> > + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > +}
> > +
> > static int
> > xfs_dax_notify_ddev_failure(
> > struct xfs_mount *mp,
> > @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
> > error = xfs_trans_alloc_empty(mp, &tp);
> > if (error)
> > - return error;
> > + goto out;
> > for (; agno <= end_agno; agno++) {
> > struct xfs_rmap_irec ri_low = { };
> > @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
> > }
> > xfs_trans_cancel(tp);
> > +
> > + /*
> > + * Determine how to shutdown the filesystem according to the
> > + * error code and flags.
> > + */
> > if (error || notify.want_shutdown) {
> > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > if (!error)
> > error = -EFSCORRUPTED;
> > - }
> > + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> > + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > +
> > +out:
> > + /* Thaw the fs if it is freezed before. */
> > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > + xfs_dax_notify_failure_thaw(mp);
> > +
> > return error;
> > }
> > @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
> > if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > mp->m_logdev_targp != mp->m_ddev_targp) {
> > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > + return 0;
> > xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > return -EFSCORRUPTED;
> > @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
> > ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > + /* Notify failure on the whole device. */
> > + if (offset == 0 && len == U64_MAX) {
> > + offset = ddev_start;
> > + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > + }
> > +
> > /* Ignore the range out of filesystem area */
> > if (offset + len - 1 < ddev_start)
> > return -ENXIO;
> > @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
> > if (offset + len - 1 > ddev_end)
> > len = ddev_end - offset + 1;
> > + if (mf_flags & MF_MEM_PRE_REMOVE) {
> > + xfs_info(mp, "device is about to be removed!");
> > + /* Freeze fs to prevent new mappings from being created. */
> > + xfs_dax_notify_failure_freeze(mp);
> > + }
> > +
> > return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> > mf_flags);
> > }
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 27ce77080c79..a80c255b88d2 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3576,6 +3576,7 @@ enum mf_flags {
> > MF_UNPOISON = 1 << 4,
> > MF_SW_SIMULATED = 1 << 5,
> > MF_NO_RETRY = 1 << 6,
> > + MF_MEM_PRE_REMOVE = 1 << 7,
> > };
> > int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > unsigned long count, int mf_flags);
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 5b663eca1f29..483b75f2fcfb 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> > */
> > static void collect_procs_fsdax(struct page *page,
> > struct address_space *mapping, pgoff_t pgoff,
> > - struct list_head *to_kill)
> > + struct list_head *to_kill, bool pre_remove)
> > {
> > struct vm_area_struct *vma;
> > struct task_struct *tsk;
> > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> > i_mmap_lock_read(mapping);
> > read_lock(&tasklist_lock);
> > for_each_process(tsk) {
> > - struct task_struct *t = task_early_kill(tsk, true);
> > + struct task_struct *t = tsk;
> > + /*
> > + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> > + * current may not be the one accessing the fsdax page.
> > + * Otherwise, search for the current task.
> > + */
> > + if (!pre_remove)
> > + t = task_early_kill(tsk, true);
> > if (!t)
> > continue;
> > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > dax_entry_t cookie;
> > struct page *page;
> > size_t end = index + count;
> > + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > if (!page)
> > goto unlock;
> > - SetPageHWPoison(page);
> > + if (!pre_remove)
> > + SetPageHWPoison(page);
> > - collect_procs_fsdax(page, mapping, index, &to_kill);
> > + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > index, mf_flags);
> > unlock:
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-07-14 14:18 ` Darrick J. Wong
@ 2023-07-20 1:50 ` Shiyang Ruan
2023-07-29 10:01 ` Shiyang Ruan
0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-07-20 1:50 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
jack, akpm, mcgrof
On 2023/7/14 22:18, Darrick J. Wong wrote:
> On Fri, Jul 14, 2023 at 05:07:58PM +0800, Shiyang Ruan wrote:
>> Hi Darrick,
>>
>> Thanks for applying the 1st patch.
>>
>> Now, since this patch is based on the new freeze_super()/thaw_super()
>> api[1], I'd like to ask what's the plan for this api? It seems to have
>> missed the v6.5-rc1.
>>
>> [1] https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>
> 6.6. I intend to push the XFS UBSAN fixes to the list today for review.
> Early next week I'll resend the 6.5 rebase of the kernelfreeze series
> and push it to vfs-for-next. Some time after that will come large folio
> writes.
Got it. Thanks for the information!
--
Ruan.
>
> --D
>
>>
>> --
>> Thanks,
>> Ruan.
>>
>>
>> On 2023/6/29 16:16, Shiyang Ruan wrote:
>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>> on it to unmap all files in use, and notify processes who are using
>>> those files.
>>>
>>> Call trace:
>>> trigger unbind
>>> -> unbind_store()
>>> -> ... (skip)
>>> -> devres_release_all()
>>> -> kill_dax()
>>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>> -> xfs_dax_notify_failure()
>>> `-> freeze_super() // freeze (kernel call)
>>> `-> do xfs rmap
>>> ` -> mf_dax_kill_procs()
>>> ` -> collect_procs_fsdax() // all associated processes
>>> ` -> unmap_and_kill()
>>> ` -> invalidate_inode_pages2_range() // drop file's cache
>>> `-> thaw_super() // thaw (both kernel & user call)
>>>
>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>> new dax mapping from being created. Do not shutdown filesystem directly
>>> if configuration is not supported, or if failure range includes metadata
>>> area. Make sure all files and processes(not only the current progress)
>>> are handled correctly. Also drop the cache of associated files before
>>> pmem is removed.
>>>
>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>>
>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>> ---
>>> drivers/dax/super.c | 3 +-
>>> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>>> include/linux/mm.h | 1 +
>>> mm/memory-failure.c | 17 ++++++--
>>> 4 files changed, 96 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>> index c4c4728a36e4..2e1a35e82fce 100644
>>> --- a/drivers/dax/super.c
>>> +++ b/drivers/dax/super.c
>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>> return;
>>> if (dax_dev->holder_data != NULL)
>>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>> + MF_MEM_PRE_REMOVE);
>>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>> synchronize_srcu(&dax_srcu);
>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>> index 4a9bbd3fe120..f6ec56b76db6 100644
>>> --- a/fs/xfs/xfs_notify_failure.c
>>> +++ b/fs/xfs/xfs_notify_failure.c
>>> @@ -22,6 +22,7 @@
>>> #include <linux/mm.h>
>>> #include <linux/dax.h>
>>> +#include <linux/fs.h>
>>> struct xfs_failure_info {
>>> xfs_agblock_t startblock;
>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>> struct xfs_mount *mp = cur->bc_mp;
>>> struct xfs_inode *ip;
>>> struct xfs_failure_info *notify = data;
>>> + struct address_space *mapping;
>>> + pgoff_t pgoff;
>>> + unsigned long pgcnt;
>>> int error = 0;
>>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>>> + /* Continue the query because this isn't a failure. */
>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>> + return 0;
>>> notify->want_shutdown = true;
>>> return 0;
>>> }
>>> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>>> return 0;
>>> }
>>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>> - xfs_failure_pgoff(mp, rec, notify),
>>> - xfs_failure_pgcnt(mp, rec, notify),
>>> - notify->mf_flags);
>>> + mapping = VFS_I(ip)->i_mapping;
>>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>> +
>>> + /* Continue the rmap query if the inode isn't a dax file. */
>>> + if (dax_mapping(mapping))
>>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>> + notify->mf_flags);
>>> +
>>> + /* Invalidate the cache in dax pages. */
>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>> + invalidate_inode_pages2_range(mapping, pgoff,
>>> + pgoff + pgcnt - 1);
>>> +
>>> xfs_irele(ip);
>>> return error;
>>> }
>>> +static void
>>> +xfs_dax_notify_failure_freeze(
>>> + struct xfs_mount *mp)
>>> +{
>>> + struct super_block *sb = mp->m_super;
>>> +
>>> + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
>>> + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
>>> + // Shall we just wait, or print warning then return -EBUSY?
>>> + delay(HZ / 10);
>>> + }
>>> +}
>>> +
>>> +static void
>>> +xfs_dax_notify_failure_thaw(
>>> + struct xfs_mount *mp)
>>> +{
>>> + struct super_block *sb = mp->m_super;
>>> + int error;
>>> +
>>> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>>> + if (error)
>>> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
>>> + error);
>>> + /*
>>> + * Also thaw userspace call anyway because the device is about to be
>>> + * removed immediately.
>>> + */
>>> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>>> +}
>>> +
>>> static int
>>> xfs_dax_notify_ddev_failure(
>>> struct xfs_mount *mp,
>>> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>>> error = xfs_trans_alloc_empty(mp, &tp);
>>> if (error)
>>> - return error;
>>> + goto out;
>>> for (; agno <= end_agno; agno++) {
>>> struct xfs_rmap_irec ri_low = { };
>>> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>>> }
>>> xfs_trans_cancel(tp);
>>> +
>>> + /*
>>> + * Determine how to shutdown the filesystem according to the
>>> + * error code and flags.
>>> + */
>>> if (error || notify.want_shutdown) {
>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>> if (!error)
>>> error = -EFSCORRUPTED;
>>> - }
>>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>> +
>>> +out:
>>> + /* Thaw the fs if it is freezed before. */
>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>> + xfs_dax_notify_failure_thaw(mp);
>>> +
>>> return error;
>>> }
>>> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>> mp->m_logdev_targp != mp->m_ddev_targp) {
>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>> + return 0;
>>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>> return -EFSCORRUPTED;
>>> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>> + /* Notify failure on the whole device. */
>>> + if (offset == 0 && len == U64_MAX) {
>>> + offset = ddev_start;
>>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>> + }
>>> +
>>> /* Ignore the range out of filesystem area */
>>> if (offset + len - 1 < ddev_start)
>>> return -ENXIO;
>>> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>>> if (offset + len - 1 > ddev_end)
>>> len = ddev_end - offset + 1;
>>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>>> + xfs_info(mp, "device is about to be removed!");
>>> + /* Freeze fs to prevent new mappings from being created. */
>>> + xfs_dax_notify_failure_freeze(mp);
>>> + }
>>> +
>>> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>>> mf_flags);
>>> }
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 27ce77080c79..a80c255b88d2 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3576,6 +3576,7 @@ enum mf_flags {
>>> MF_UNPOISON = 1 << 4,
>>> MF_SW_SIMULATED = 1 << 5,
>>> MF_NO_RETRY = 1 << 6,
>>> + MF_MEM_PRE_REMOVE = 1 << 7,
>>> };
>>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>> unsigned long count, int mf_flags);
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 5b663eca1f29..483b75f2fcfb 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>> */
>>> static void collect_procs_fsdax(struct page *page,
>>> struct address_space *mapping, pgoff_t pgoff,
>>> - struct list_head *to_kill)
>>> + struct list_head *to_kill, bool pre_remove)
>>> {
>>> struct vm_area_struct *vma;
>>> struct task_struct *tsk;
>>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>> i_mmap_lock_read(mapping);
>>> read_lock(&tasklist_lock);
>>> for_each_process(tsk) {
>>> - struct task_struct *t = task_early_kill(tsk, true);
>>> + struct task_struct *t = tsk;
>>> + /*
>>> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>>> + * current may not be the one accessing the fsdax page.
>>> + * Otherwise, search for the current task.
>>> + */
>>> + if (!pre_remove)
>>> + t = task_early_kill(tsk, true);
>>> if (!t)
>>> continue;
>>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>> dax_entry_t cookie;
>>> struct page *page;
>>> size_t end = index + count;
>>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>> if (!page)
>>> goto unlock;
>>> - SetPageHWPoison(page);
>>> + if (!pre_remove)
>>> + SetPageHWPoison(page);
>>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>>> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>> index, mf_flags);
>>> unlock:
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-07-20 1:50 ` Shiyang Ruan
@ 2023-07-29 10:01 ` Shiyang Ruan
2023-07-29 15:15 ` Darrick J. Wong
0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-07-29 10:01 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
jack, akpm, mcgrof
On 2023/7/20 9:50, Shiyang Ruan wrote:
>
>
> On 2023/7/14 22:18, Darrick J. Wong wrote:
>> On Fri, Jul 14, 2023 at 05:07:58PM +0800, Shiyang Ruan wrote:
>>> Hi Darrick,
>>>
>>> Thanks for applying the 1st patch.
>>>
>>> Now, since this patch is based on the new freeze_super()/thaw_super()
>>> api[1], I'd like to ask what's the plan for this api? It seems to have
>>> missed the v6.5-rc1.
>>>
>>> [1]
>>> https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>
>> 6.6. I intend to push the XFS UBSAN fixes to the list today for review.
>> Early next week I'll resend the 6.5 rebase of the kernelfreeze series
>> and push it to vfs-for-next. Some time after that will come large folio
>> writes.
>
> Got it. Thanks for your information!
A small request: if you have time to comment on this patch, I would
appreciate it, because I hope we can make the most of this period
(before the freeze API is merged in 6.6).
--
Thanks,
Ruan.
>
>
> --
> Ruan.
>
>>
>> --D
>>
>>>
>>> --
>>> Thanks,
>>> Ruan.
>>>
>>>
>>> On 2023/6/29 16:16, Shiyang Ruan wrote:
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> on it to unmap all files in use, and notify processes who are using
>>>> those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>> -> unbind_store()
>>>> -> ... (skip)
>>>> -> devres_release_all()
>>>> -> kill_dax()
>>>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>>> MF_MEM_PRE_REMOVE)
>>>> -> xfs_dax_notify_failure()
>>>> `-> freeze_super() // freeze (kernel call)
>>>> `-> do xfs rmap
>>>> ` -> mf_dax_kill_procs()
>>>> ` -> collect_procs_fsdax() // all associated processes
>>>> ` -> unmap_and_kill()
>>>> ` -> invalidate_inode_pages2_range() // drop file's cache
>>>> `-> thaw_super() // thaw (both kernel & user
>>>> call)
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>> new dax mapping from being created. Do not shutdown filesystem directly
>>>> if configuration is not supported, or if failure range includes metadata
>>>> area. Make sure all files and processes(not only the current progress)
>>>> are handled correctly. Also drop the cache of associated files before
>>>> pmem is removed.
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>> [2]:
>>>> https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> ---
>>>> drivers/dax/super.c | 3 +-
>>>> fs/xfs/xfs_notify_failure.c | 86
>>>> ++++++++++++++++++++++++++++++++++---
>>>> include/linux/mm.h | 1 +
>>>> mm/memory-failure.c | 17 ++++++--
>>>> 4 files changed, 96 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>>> index c4c4728a36e4..2e1a35e82fce 100644
>>>> --- a/drivers/dax/super.c
>>>> +++ b/drivers/dax/super.c
>>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>>> return;
>>>> if (dax_dev->holder_data != NULL)
>>>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>>> + MF_MEM_PRE_REMOVE);
>>>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>>> synchronize_srcu(&dax_srcu);
>>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>>> index 4a9bbd3fe120..f6ec56b76db6 100644
>>>> --- a/fs/xfs/xfs_notify_failure.c
>>>> +++ b/fs/xfs/xfs_notify_failure.c
>>>> @@ -22,6 +22,7 @@
>>>> #include <linux/mm.h>
>>>> #include <linux/dax.h>
>>>> +#include <linux/fs.h>
>>>> struct xfs_failure_info {
>>>> xfs_agblock_t startblock;
>>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>>> struct xfs_mount *mp = cur->bc_mp;
>>>> struct xfs_inode *ip;
>>>> struct xfs_failure_info *notify = data;
>>>> + struct address_space *mapping;
>>>> + pgoff_t pgoff;
>>>> + unsigned long pgcnt;
>>>> int error = 0;
>>>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK |
>>>> XFS_RMAP_BMBT_BLOCK))) {
>>>> + /* Continue the query because this isn't a failure. */
>>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> + return 0;
>>>> notify->want_shutdown = true;
>>>> return 0;
>>>> }
>>>> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>>>> return 0;
>>>> }
>>>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>>> - xfs_failure_pgoff(mp, rec, notify),
>>>> - xfs_failure_pgcnt(mp, rec, notify),
>>>> - notify->mf_flags);
>>>> + mapping = VFS_I(ip)->i_mapping;
>>>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>>>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>>> +
>>>> + /* Continue the rmap query if the inode isn't a dax file. */
>>>> + if (dax_mapping(mapping))
>>>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>>> + notify->mf_flags);
>>>> +
>>>> + /* Invalidate the cache in dax pages. */
>>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> + invalidate_inode_pages2_range(mapping, pgoff,
>>>> + pgoff + pgcnt - 1);
>>>> +
>>>> xfs_irele(ip);
>>>> return error;
>>>> }
>>>> +static void
>>>> +xfs_dax_notify_failure_freeze(
>>>> + struct xfs_mount *mp)
>>>> +{
>>>> + struct super_block *sb = mp->m_super;
>>>> +
>>>> + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
>>>> + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
>>>> + // Shall we just wait, or print warning then return -EBUSY?
>>>> + delay(HZ / 10);
>>>> + }
>>>> +}
>>>> +
>>>> +static void
>>>> +xfs_dax_notify_failure_thaw(
>>>> + struct xfs_mount *mp)
>>>> +{
>>>> + struct super_block *sb = mp->m_super;
>>>> + int error;
>>>> +
>>>> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>>>> + if (error)
>>>> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
>>>> + error);
>>>> + /*
>>>> + * Also thaw userspace call anyway because the device is about
>>>> to be
>>>> + * removed immediately.
>>>> + */
>>>> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>>>> +}
>>>> +
>>>> static int
>>>> xfs_dax_notify_ddev_failure(
>>>> struct xfs_mount *mp,
>>>> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>>>> error = xfs_trans_alloc_empty(mp, &tp);
>>>> if (error)
>>>> - return error;
>>>> + goto out;
>>>> for (; agno <= end_agno; agno++) {
>>>> struct xfs_rmap_irec ri_low = { };
>>>> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>>>> }
>>>> xfs_trans_cancel(tp);
>>>> +
>>>> + /*
>>>> + * Determine how to shutdown the filesystem according to the
>>>> + * error code and flags.
>>>> + */
>>>> if (error || notify.want_shutdown) {
>>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>> if (!error)
>>>> error = -EFSCORRUPTED;
>>>> - }
>>>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>>> +
>>>> +out:
>>>> + /* Thaw the fs if it is freezed before. */
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + xfs_dax_notify_failure_thaw(mp);
>>>> +
>>>> return error;
>>>> }
>>>> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>>>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev ==
>>>> dax_dev &&
>>>> mp->m_logdev_targp != mp->m_ddev_targp) {
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + return 0;
>>>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>> return -EFSCORRUPTED;
>>>> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>>>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>>> ddev_end = ddev_start +
>>>> bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>>> + /* Notify failure on the whole device. */
>>>> + if (offset == 0 && len == U64_MAX) {
>>>> + offset = ddev_start;
>>>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>>> + }
>>>> +
>>>> /* Ignore the range out of filesystem area */
>>>> if (offset + len - 1 < ddev_start)
>>>> return -ENXIO;
>>>> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>>>> if (offset + len - 1 > ddev_end)
>>>> len = ddev_end - offset + 1;
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>>>> + xfs_info(mp, "device is about to be removed!");
>>>> + /* Freeze fs to prevent new mappings from being created. */
>>>> + xfs_dax_notify_failure_freeze(mp);
>>>> + }
>>>> +
>>>> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset),
>>>> BTOBB(len),
>>>> mf_flags);
>>>> }
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 27ce77080c79..a80c255b88d2 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3576,6 +3576,7 @@ enum mf_flags {
>>>> MF_UNPOISON = 1 << 4,
>>>> MF_SW_SIMULATED = 1 << 5,
>>>> MF_NO_RETRY = 1 << 6,
>>>> + MF_MEM_PRE_REMOVE = 1 << 7,
>>>> };
>>>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> unsigned long count, int mf_flags);
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 5b663eca1f29..483b75f2fcfb 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct
>>>> *tsk, struct page *p,
>>>> */
>>>> static void collect_procs_fsdax(struct page *page,
>>>> struct address_space *mapping, pgoff_t pgoff,
>>>> - struct list_head *to_kill)
>>>> + struct list_head *to_kill, bool pre_remove)
>>>> {
>>>> struct vm_area_struct *vma;
>>>> struct task_struct *tsk;
>>>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>>> i_mmap_lock_read(mapping);
>>>> read_lock(&tasklist_lock);
>>>> for_each_process(tsk) {
>>>> - struct task_struct *t = task_early_kill(tsk, true);
>>>> + struct task_struct *t = tsk;
>>>> + /*
>>>> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>>>> + * current may not be the one accessing the fsdax page.
>>>> + * Otherwise, search for the current task.
>>>> + */
>>>> + if (!pre_remove)
>>>> + t = task_early_kill(tsk, true);
>>>> if (!t)
>>>> continue;
>>>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
>>>> pgoff) {
>>>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space
>>>> *mapping, pgoff_t index,
>>>> dax_entry_t cookie;
>>>> struct page *page;
>>>> size_t end = index + count;
>>>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space
>>>> *mapping, pgoff_t index,
>>>> if (!page)
>>>> goto unlock;
>>>> - SetPageHWPoison(page);
>>>> + if (!pre_remove)
>>>> + SetPageHWPoison(page);
>>>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>>>> + collect_procs_fsdax(page, mapping, index, &to_kill,
>>>> pre_remove);
>>>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>>> index, mf_flags);
>>>> unlock:
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-06-29 12:02 ` kernel test robot
2023-07-14 9:07 ` Shiyang Ruan
@ 2023-07-29 15:15 ` Darrick J. Wong
2023-07-31 9:36 ` Shiyang Ruan
2023-08-08 0:31 ` Dan Williams
` (2 subsequent siblings)
5 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-07-29 15:15 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On Thu, Jun 29, 2023 at 04:16:51PM +0800, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super() // freeze (kernel call)
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated processes
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2_range() // drop file's cache
> `-> thaw_super() // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created. Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area. Make sure all files and processes(not only the current progress)
> are handled correctly. Also drop the cache of associated files before
> pmem is removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 ++++++--
> 4 files changed, 96 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> return;
>
> if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>
> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..f6ec56b76db6 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>
> #include <linux/mm.h>
> #include <linux/dax.h>
> +#include <linux/fs.h>
>
> struct xfs_failure_info {
> xfs_agblock_t startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> struct xfs_mount *mp = cur->bc_mp;
> struct xfs_inode *ip;
> struct xfs_failure_info *notify = data;
> + struct address_space *mapping;
> + pgoff_t pgoff;
> + unsigned long pgcnt;
> int error = 0;
>
> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Continue the query because this isn't a failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> notify->want_shutdown = true;
> return 0;
> }
> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
> return 0;
> }
>
> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> - xfs_failure_pgoff(mp, rec, notify),
> - xfs_failure_pgcnt(mp, rec, notify),
> - notify->mf_flags);
> + mapping = VFS_I(ip)->i_mapping;
> + pgoff = xfs_failure_pgoff(mp, rec, notify);
> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> + /* Continue the rmap query if the inode isn't a dax file. */
> + if (dax_mapping(mapping))
> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> + notify->mf_flags);
> +
> + /* Invalidate the cache in dax pages. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + invalidate_inode_pages2_range(mapping, pgoff,
> + pgoff + pgcnt - 1);
> +
> xfs_irele(ip);
> return error;
> }
>
> +static void
> +xfs_dax_notify_failure_freeze(
> + struct xfs_mount *mp)
> +{
> + struct super_block *sb = mp->m_super;
Nit: extra space right ^ here.
> +
> + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> + // Shall we just wait, or print warning then return -EBUSY?
Hm. PRE_REMOVE gets called before the pmem gets unplugged, right? So
we'll send a second notification after it goes away, right?
If so, then I'd say return the error here instead of looping, and live
with a kernel-frozen fs discarding the PRE_REMOVE message.
> + delay(HZ / 10);
> + }
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> + struct xfs_mount *mp)
> +{
> + struct super_block *sb = mp->m_super;
> + int error;
> +
> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> + if (error)
> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> + error);
> + /*
> + * Also thaw userspace call anyway because the device is about to be
> + * removed immediately.
> + */
> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +}
> +
> static int
> xfs_dax_notify_ddev_failure(
> struct xfs_mount *mp,
> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>
> error = xfs_trans_alloc_empty(mp, &tp);
> if (error)
> - return error;
> + goto out;
>
> for (; agno <= end_agno; agno++) {
> struct xfs_rmap_irec ri_low = { };
> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
> }
>
> xfs_trans_cancel(tp);
> +
> + /*
> + * Determine how to shutdown the filesystem according to the
> + * error code and flags.
> + */
> if (error || notify.want_shutdown) {
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> if (!error)
> error = -EFSCORRUPTED;
> - }
> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> + /* Thaw the fs if it is freezed before. */
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_dax_notify_failure_thaw(mp);
_thaw should be called from the same function that called _freeze.
The rest of the patch seems ok to me.
--D
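
For illustration only, the two review comments above (return the error
instead of looping, and thaw from the same function that froze) can be
modelled in userspace roughly as follows. This is a sketch, not the patch:
fake_sb, fake_freeze(), fake_thaw(), fake_scan() and
fake_notify_pre_remove() are hypothetical stand-ins for struct super_block,
freeze_super(), thaw_super() and the XFS notify-failure helpers, and the
assumption is that an exclusive kernel freeze fails with -EBUSY when
another kernel freezer already holds it.

```c
/*
 * Userspace model only. Every fake_* name is hypothetical and merely
 * stands in for the kernel object or helper named in the comments.
 */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct fake_sb {
	bool kernel_frozen;	/* a kernel freezer currently holds the fs */
	int  scans;		/* number of rmap walks that actually ran */
};

/* Assumed behaviour of an exclusive freeze: -EBUSY if already held. */
static int fake_freeze(struct fake_sb *sb)
{
	if (sb->kernel_frozen)
		return -EBUSY;
	sb->kernel_frozen = true;
	return 0;
}

static void fake_thaw(struct fake_sb *sb)
{
	assert(sb->kernel_frozen);	/* only thaw what we froze */
	sb->kernel_frozen = false;
}

/* Stand-in for the rmap walk; it contains no freeze or thaw logic. */
static int fake_scan(struct fake_sb *sb)
{
	sb->scans++;
	return 0;
}

/* Freeze and thaw are paired in one function, and there is no retry loop. */
static int fake_notify_pre_remove(struct fake_sb *sb)
{
	int error;

	error = fake_freeze(sb);
	if (error)
		return error;	/* caller discards the PRE_REMOVE message */

	error = fake_scan(sb);
	fake_thaw(sb);
	return error;
}
```

With this shape the walk helper needs no "goto out" for thawing, and a
filesystem that is already kernel-frozen is left untouched.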
> +
> return error;
> }
>
> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>
> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> return -EFSCORRUPTED;
> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>
> + /* Notify failure on the whole device. */
> + if (offset == 0 && len == U64_MAX) {
> + offset = ddev_start;
> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> + }
> +
> /* Ignore the range out of filesystem area */
> if (offset + len - 1 < ddev_start)
> return -ENXIO;
> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
> if (offset + len - 1 > ddev_end)
> len = ddev_end - offset + 1;
>
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + /* Freeze fs to prevent new mappings from being created. */
> + xfs_dax_notify_failure_freeze(mp);
> + }
> +
> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> mf_flags);
> }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..a80c255b88d2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3576,6 +3576,7 @@ enum mf_flags {
> MF_UNPOISON = 1 << 4,
> MF_SW_SIMULATED = 1 << 5,
> MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
> };
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5b663eca1f29..483b75f2fcfb 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> */
> static void collect_procs_fsdax(struct page *page,
> struct address_space *mapping, pgoff_t pgoff,
> - struct list_head *to_kill)
> + struct list_head *to_kill, bool pre_remove)
> {
> struct vm_area_struct *vma;
> struct task_struct *tsk;
> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> i_mmap_lock_read(mapping);
> read_lock(&tasklist_lock);
> for_each_process(tsk) {
> - struct task_struct *t = task_early_kill(tsk, true);
> + struct task_struct *t = tsk;
>
> + /*
> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> + * current may not be the one accessing the fsdax page.
> + * Otherwise, search for the current task.
> + */
> + if (!pre_remove)
> + t = task_early_kill(tsk, true);
> if (!t)
> continue;
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> dax_entry_t cookie;
> struct page *page;
> size_t end = index + count;
> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>
> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>
> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> if (!page)
> goto unlock;
>
> - SetPageHWPoison(page);
> + if (!pre_remove)
> + SetPageHWPoison(page);
>
> - collect_procs_fsdax(page, mapping, index, &to_kill);
> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> index, mf_flags);
> unlock:
> --
> 2.40.1
>
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-07-29 10:01 ` Shiyang Ruan
@ 2023-07-29 15:15 ` Darrick J. Wong
0 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2023-07-29 15:15 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-mm, linux-xfs, nvdimm, linux-fsdevel, dan.j.williams, willy,
jack, akpm, mcgrof
On Sat, Jul 29, 2023 at 06:01:00PM +0800, Shiyang Ruan wrote:
>
>
> On 2023/7/20 9:50, Shiyang Ruan wrote:
> >
> >
> > On 2023/7/14 22:18, Darrick J. Wong wrote:
> > > On Fri, Jul 14, 2023 at 05:07:58PM +0800, Shiyang Ruan wrote:
> > > > Hi Darrick,
> > > >
> > > > Thanks for applying the 1st patch.
> > > >
> > > > Now, since this patch is based on the new freeze_super()/thaw_super()
> > > > api[1], I'd like to ask what's the plan for this api? It seems to have
> > > > missed the v6.5-rc1.
> > > >
> > > > [1] https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> > >
> > > 6.6. I intend to push the XFS UBSAN fixes to the list today for review.
> > > Early next week I'll resend the 6.5 rebase of the kernelfreeze series
> > > and push it to vfs-for-next. Some time after that will come large folio
> > > writes.
> >
> > Got it. Thanks for your information!
>
> A small request: if you have time to give some comments, I would appreciate
> it, because I hope we can make the most of this period (before the freeze
> API is merged in 6.6).
Done.
--D
>
> --
> Thanks,
> Ruan.
>
> >
> >
> > --
> > Ruan.
> >
> > >
> > > --D
> > >
> > > >
> > > > --
> > > > Thanks,
> > > > Ruan.
> > > >
> > > >
> > > > > On 2023/6/29 16:16, Shiyang Ruan wrote:
> > > > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > > > dev_pagemap_failure()"[1]. With the help of dax_holder and
> > > > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > > > > on it to unmap all files in use, and notify processes who are using
> > > > > those files.
> > > > >
> > > > > Call trace:
> > > > > trigger unbind
> > > > > -> unbind_store()
> > > > > -> ... (skip)
> > > > > -> devres_release_all()
> > > > > -> kill_dax()
> > > > > > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > > > > -> xfs_dax_notify_failure()
> > > > > `-> freeze_super() // freeze (kernel call)
> > > > > `-> do xfs rmap
> > > > > ` -> mf_dax_kill_procs()
> > > > > ` -> collect_procs_fsdax() // all associated processes
> > > > > ` -> unmap_and_kill()
> > > > > ` -> invalidate_inode_pages2_range() // drop file's cache
> > > > > > `-> thaw_super() // thaw (both kernel & user call)
> > > > >
> > > > > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > > > > > event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> > > > > > new dax mapping from being created. Do not shutdown filesystem directly
> > > > > > if configuration is not supported, or if failure range includes metadata
> > > > > > area. Make sure all files and processes (not only the current process)
> > > > > > are handled correctly. Also drop the cache of associated files before
> > > > > > pmem is removed.
> > > > >
> > > > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > > > [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> > > > >
> > > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > > ---
> > > > > drivers/dax/super.c | 3 +-
> > > > > > fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
> > > > > include/linux/mm.h | 1 +
> > > > > mm/memory-failure.c | 17 ++++++--
> > > > > 4 files changed, 96 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > > > index c4c4728a36e4..2e1a35e82fce 100644
> > > > > --- a/drivers/dax/super.c
> > > > > +++ b/drivers/dax/super.c
> > > > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > > > > return;
> > > > > if (dax_dev->holder_data != NULL)
> > > > > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > > > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > > > + MF_MEM_PRE_REMOVE);
> > > > > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > > > synchronize_srcu(&dax_srcu);
> > > > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > > > index 4a9bbd3fe120..f6ec56b76db6 100644
> > > > > --- a/fs/xfs/xfs_notify_failure.c
> > > > > +++ b/fs/xfs/xfs_notify_failure.c
> > > > > @@ -22,6 +22,7 @@
> > > > > #include <linux/mm.h>
> > > > > #include <linux/dax.h>
> > > > > +#include <linux/fs.h>
> > > > > struct xfs_failure_info {
> > > > > xfs_agblock_t startblock;
> > > > > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > > > > struct xfs_mount *mp = cur->bc_mp;
> > > > > struct xfs_inode *ip;
> > > > > struct xfs_failure_info *notify = data;
> > > > > + struct address_space *mapping;
> > > > > + pgoff_t pgoff;
> > > > > + unsigned long pgcnt;
> > > > > int error = 0;
> > > > > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > > > > > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > > > > + /* Continue the query because this isn't a failure. */
> > > > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + return 0;
> > > > > notify->want_shutdown = true;
> > > > > return 0;
> > > > > }
> > > > > @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
> > > > > return 0;
> > > > > }
> > > > > - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > > > > - xfs_failure_pgoff(mp, rec, notify),
> > > > > - xfs_failure_pgcnt(mp, rec, notify),
> > > > > - notify->mf_flags);
> > > > > + mapping = VFS_I(ip)->i_mapping;
> > > > > + pgoff = xfs_failure_pgoff(mp, rec, notify);
> > > > > + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > > > > +
> > > > > + /* Continue the rmap query if the inode isn't a dax file. */
> > > > > + if (dax_mapping(mapping))
> > > > > + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > > > > + notify->mf_flags);
> > > > > +
> > > > > + /* Invalidate the cache in dax pages. */
> > > > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + invalidate_inode_pages2_range(mapping, pgoff,
> > > > > + pgoff + pgcnt - 1);
> > > > > +
> > > > > xfs_irele(ip);
> > > > > return error;
> > > > > }
> > > > > +static void
> > > > > +xfs_dax_notify_failure_freeze(
> > > > > + struct xfs_mount *mp)
> > > > > +{
> > > > > + struct super_block *sb = mp->m_super;
> > > > > +
> > > > > + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> > > > > + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> > > > > + // Shall we just wait, or print warning then return -EBUSY?
> > > > > + delay(HZ / 10);
> > > > > + }
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +xfs_dax_notify_failure_thaw(
> > > > > + struct xfs_mount *mp)
> > > > > +{
> > > > > + struct super_block *sb = mp->m_super;
> > > > > + int error;
> > > > > +
> > > > > + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > > > > + if (error)
> > > > > + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > > > > + error);
> > > > > + /*
> > > > > > + * Also thaw userspace call anyway because the device is about to be
> > > > > + * removed immediately.
> > > > > + */
> > > > > + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > > > > +}
> > > > > +
> > > > > static int
> > > > > xfs_dax_notify_ddev_failure(
> > > > > struct xfs_mount *mp,
> > > > > @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
> > > > > error = xfs_trans_alloc_empty(mp, &tp);
> > > > > if (error)
> > > > > - return error;
> > > > > + goto out;
> > > > > for (; agno <= end_agno; agno++) {
> > > > > struct xfs_rmap_irec ri_low = { };
> > > > > @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
> > > > > }
> > > > > xfs_trans_cancel(tp);
> > > > > +
> > > > > + /*
> > > > > + * Determine how to shutdown the filesystem according to the
> > > > > + * error code and flags.
> > > > > + */
> > > > > if (error || notify.want_shutdown) {
> > > > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > > > if (!error)
> > > > > error = -EFSCORRUPTED;
> > > > > - }
> > > > > + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > > > > +
> > > > > +out:
> > > > > + /* Thaw the fs if it is freezed before. */
> > > > > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + xfs_dax_notify_failure_thaw(mp);
> > > > > +
> > > > > return error;
> > > > > }
> > > > > @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
> > > > > > if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > > > > mp->m_logdev_targp != mp->m_ddev_targp) {
> > > > > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + return 0;
> > > > > xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > > > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > > > return -EFSCORRUPTED;
> > > > > @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
> > > > > ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > > > > > ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > > > > + /* Notify failure on the whole device. */
> > > > > + if (offset == 0 && len == U64_MAX) {
> > > > > + offset = ddev_start;
> > > > > + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > > > > + }
> > > > > +
> > > > > /* Ignore the range out of filesystem area */
> > > > > if (offset + len - 1 < ddev_start)
> > > > > return -ENXIO;
> > > > > @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
> > > > > if (offset + len - 1 > ddev_end)
> > > > > len = ddev_end - offset + 1;
> > > > > + if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > > > + xfs_info(mp, "device is about to be removed!");
> > > > > + /* Freeze fs to prevent new mappings from being created. */
> > > > > + xfs_dax_notify_failure_freeze(mp);
> > > > > + }
> > > > > +
> > > > > > return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> > > > > > mf_flags);
> > > > > }
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 27ce77080c79..a80c255b88d2 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -3576,6 +3576,7 @@ enum mf_flags {
> > > > > MF_UNPOISON = 1 << 4,
> > > > > MF_SW_SIMULATED = 1 << 5,
> > > > > MF_NO_RETRY = 1 << 6,
> > > > > + MF_MEM_PRE_REMOVE = 1 << 7,
> > > > > };
> > > > > int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > > > unsigned long count, int mf_flags);
> > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > > > index 5b663eca1f29..483b75f2fcfb 100644
> > > > > --- a/mm/memory-failure.c
> > > > > +++ b/mm/memory-failure.c
> > > > > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct
> > > > > task_struct *tsk, struct page *p,
> > > > > */
> > > > > static void collect_procs_fsdax(struct page *page,
> > > > > struct address_space *mapping, pgoff_t pgoff,
> > > > > - struct list_head *to_kill)
> > > > > + struct list_head *to_kill, bool pre_remove)
> > > > > {
> > > > > struct vm_area_struct *vma;
> > > > > struct task_struct *tsk;
> > > > > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> > > > > i_mmap_lock_read(mapping);
> > > > > read_lock(&tasklist_lock);
> > > > > for_each_process(tsk) {
> > > > > - struct task_struct *t = task_early_kill(tsk, true);
> > > > > + struct task_struct *t = tsk;
> > > > > + /*
> > > > > + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> > > > > + * current may not be the one accessing the fsdax page.
> > > > > + * Otherwise, search for the current task.
> > > > > + */
> > > > > + if (!pre_remove)
> > > > > + t = task_early_kill(tsk, true);
> > > > > if (!t)
> > > > > continue;
> > > > > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > > > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct
> > > > > address_space *mapping, pgoff_t index,
> > > > > dax_entry_t cookie;
> > > > > struct page *page;
> > > > > size_t end = index + count;
> > > > > + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > > > > mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > > > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct
> > > > > address_space *mapping, pgoff_t index,
> > > > > if (!page)
> > > > > goto unlock;
> > > > > - SetPageHWPoison(page);
> > > > > + if (!pre_remove)
> > > > > + SetPageHWPoison(page);
> > > > > - collect_procs_fsdax(page, mapping, index, &to_kill);
> > > > > > + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > > > > unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > > > > index, mf_flags);
> > > > > unlock:
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-07-29 15:15 ` Darrick J. Wong
@ 2023-07-31 9:36 ` Shiyang Ruan
2023-08-01 3:25 ` Darrick J. Wong
0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-07-31 9:36 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On 2023/7/29 23:15, Darrick J. Wong wrote:
> On Thu, Jun 29, 2023 at 04:16:51PM +0800, Shiyang Ruan wrote:
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> on it to unmap all files in use, and notify processes who are using
>> those files.
>>
>> Call trace:
>> trigger unbind
>> -> unbind_store()
>> -> ... (skip)
>> -> devres_release_all()
>> -> kill_dax()
>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>> -> xfs_dax_notify_failure()
>> `-> freeze_super() // freeze (kernel call)
>> `-> do xfs rmap
>> ` -> mf_dax_kill_procs()
>> ` -> collect_procs_fsdax() // all associated processes
>> ` -> unmap_and_kill()
>> ` -> invalidate_inode_pages2_range() // drop file's cache
>> `-> thaw_super() // thaw (both kernel & user call)
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> new dax mapping from being created. Do not shutdown filesystem directly
>> if configuration is not supported, or if failure range includes metadata
>> area. Make sure all files and processes (not only the current process)
>> are handled correctly. Also drop the cache of associated files before
>> pmem is removed.
>>
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>> drivers/dax/super.c | 3 +-
>> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>> include/linux/mm.h | 1 +
>> mm/memory-failure.c | 17 ++++++--
>> 4 files changed, 96 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index c4c4728a36e4..2e1a35e82fce 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>> return;
>>
>> if (dax_dev->holder_data != NULL)
>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> + MF_MEM_PRE_REMOVE);
>>
>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>> synchronize_srcu(&dax_srcu);
>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>> index 4a9bbd3fe120..f6ec56b76db6 100644
>> --- a/fs/xfs/xfs_notify_failure.c
>> +++ b/fs/xfs/xfs_notify_failure.c
>> @@ -22,6 +22,7 @@
>>
>> #include <linux/mm.h>
>> #include <linux/dax.h>
>> +#include <linux/fs.h>
>>
>> struct xfs_failure_info {
>> xfs_agblock_t startblock;
>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>> struct xfs_mount *mp = cur->bc_mp;
>> struct xfs_inode *ip;
>> struct xfs_failure_info *notify = data;
>> + struct address_space *mapping;
>> + pgoff_t pgoff;
>> + unsigned long pgcnt;
>> int error = 0;
>>
>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>> + /* Continue the query because this isn't a failure. */
>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> + return 0;
>> notify->want_shutdown = true;
>> return 0;
>> }
>> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>> return 0;
>> }
>>
>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>> - xfs_failure_pgoff(mp, rec, notify),
>> - xfs_failure_pgcnt(mp, rec, notify),
>> - notify->mf_flags);
>> + mapping = VFS_I(ip)->i_mapping;
>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>> +
>> + /* Continue the rmap query if the inode isn't a dax file. */
>> + if (dax_mapping(mapping))
>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>> + notify->mf_flags);
>> +
>> + /* Invalidate the cache in dax pages. */
>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> + invalidate_inode_pages2_range(mapping, pgoff,
>> + pgoff + pgcnt - 1);
>> +
>> xfs_irele(ip);
>> return error;
>> }
>>
>> +static void
>> +xfs_dax_notify_failure_freeze(
>> + struct xfs_mount *mp)
>> +{
>> + struct super_block *sb = mp->m_super;
>
> Nit: extra space right ^ here.
>
>> +
>> + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
>> + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
>> + // Shall we just wait, or print warning then return -EBUSY?
>
> Hm. PRE_REMOVE gets called before the pmem gets unplugged, right? So
> we'll send a second notification after it goes away, right?
For the first question, yes.
But I'm not sure about the second one. Do you mean that we'll send this
notification again if the unbind didn't succeed because freeze_super()
returned -EBUSY? In other words, if the previous unbind operation did
not work, we could unbind the device again.
>
> If so, then I'd say return the error here instead of looping, and live
> with a kernel-frozen fs discarding the PRE_REMOVE message.
>
>> + delay(HZ / 10);
>> + }
>> +}
>> +
>> +static void
>> +xfs_dax_notify_failure_thaw(
>> + struct xfs_mount *mp)
>> +{
>> + struct super_block *sb = mp->m_super;
>> + int error;
>> +
>> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>> + if (error)
>> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
>> + error);
>> + /*
>> + * Also thaw userspace call anyway because the device is about to be
>> + * removed immediately.
>> + */
>> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>> +}
>> +
>> static int
>> xfs_dax_notify_ddev_failure(
>> struct xfs_mount *mp,
>> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>>
>> error = xfs_trans_alloc_empty(mp, &tp);
>> if (error)
>> - return error;
>> + goto out;
>>
>> for (; agno <= end_agno; agno++) {
>> struct xfs_rmap_irec ri_low = { };
>> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>> }
>>
>> xfs_trans_cancel(tp);
>> +
>> + /*
>> + * Determine how to shutdown the filesystem according to the
>> + * error code and flags.
>> + */
>> if (error || notify.want_shutdown) {
>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>> if (!error)
>> error = -EFSCORRUPTED;
>> - }
>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>> +
>> +out:
>> + /* Thaw the fs if it is freezed before. */
>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>> + xfs_dax_notify_failure_thaw(mp);
>
> _thaw should be called from the same function that called _freeze.
Will fix this.
>
> The rest of the patch seems ok to me.
Thank you!
--
Ruan.
>
> --D
>
>> +
>> return error;
>> }
>>
>> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>>
>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>> mp->m_logdev_targp != mp->m_ddev_targp) {
>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>> + return 0;
>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>> return -EFSCORRUPTED;
>> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>
>> + /* Notify failure on the whole device. */
>> + if (offset == 0 && len == U64_MAX) {
>> + offset = ddev_start;
>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>> + }
>> +
>> /* Ignore the range out of filesystem area */
>> if (offset + len - 1 < ddev_start)
>> return -ENXIO;
>> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>> if (offset + len - 1 > ddev_end)
>> len = ddev_end - offset + 1;
>>
>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>> + xfs_info(mp, "device is about to be removed!");
>> + /* Freeze fs to prevent new mappings from being created. */
>> + xfs_dax_notify_failure_freeze(mp);
>> + }
>> +
>> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>> mf_flags);
>> }
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 27ce77080c79..a80c255b88d2 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3576,6 +3576,7 @@ enum mf_flags {
>> MF_UNPOISON = 1 << 4,
>> MF_SW_SIMULATED = 1 << 5,
>> MF_NO_RETRY = 1 << 6,
>> + MF_MEM_PRE_REMOVE = 1 << 7,
>> };
>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> unsigned long count, int mf_flags);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 5b663eca1f29..483b75f2fcfb 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>> */
>> static void collect_procs_fsdax(struct page *page,
>> struct address_space *mapping, pgoff_t pgoff,
>> - struct list_head *to_kill)
>> + struct list_head *to_kill, bool pre_remove)
>> {
>> struct vm_area_struct *vma;
>> struct task_struct *tsk;
>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>> i_mmap_lock_read(mapping);
>> read_lock(&tasklist_lock);
>> for_each_process(tsk) {
>> - struct task_struct *t = task_early_kill(tsk, true);
>> + struct task_struct *t = tsk;
>>
>> + /*
>> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>> + * current may not be the one accessing the fsdax page.
>> + * Otherwise, search for the current task.
>> + */
>> + if (!pre_remove)
>> + t = task_early_kill(tsk, true);
>> if (!t)
>> continue;
>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> dax_entry_t cookie;
>> struct page *page;
>> size_t end = index + count;
>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>
>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>
>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> if (!page)
>> goto unlock;
>>
>> - SetPageHWPoison(page);
>> + if (!pre_remove)
>> + SetPageHWPoison(page);
>>
>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>> index, mf_flags);
>> unlock:
>> --
>> 2.40.1
>>
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-07-31 9:36 ` Shiyang Ruan
@ 2023-08-01 3:25 ` Darrick J. Wong
2023-08-03 10:44 ` Shiyang Ruan
0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-01 3:25 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On Mon, Jul 31, 2023 at 05:36:36PM +0800, Shiyang Ruan wrote:
>
>
> On 2023/7/29 23:15, Darrick J. Wong wrote:
> > On Thu, Jun 29, 2023 at 04:16:51PM +0800, Shiyang Ruan wrote:
> > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > dev_pagemap_failure()"[1]. With the help of dax_holder and
> > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > > on it to unmap all files in use, and notify processes who are using
> > > those files.
> > >
> > > Call trace:
> > > trigger unbind
> > > -> unbind_store()
> > > -> ... (skip)
> > > -> devres_release_all()
> > > -> kill_dax()
> > > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > > -> xfs_dax_notify_failure()
> > > `-> freeze_super() // freeze (kernel call)
> > > `-> do xfs rmap
> > > ` -> mf_dax_kill_procs()
> > > ` -> collect_procs_fsdax() // all associated processes
> > > ` -> unmap_and_kill()
> > > ` -> invalidate_inode_pages2_range() // drop file's cache
> > > `-> thaw_super() // thaw (both kernel & user call)
> > >
> > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > > event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> > > new dax mapping from being created. Do not shutdown filesystem directly
> > > if configuration is not supported, or if failure range includes metadata
> > > area. Make sure all files and processes (not only the current process)
> > > are handled correctly. Also drop the cache of associated files before
> > > pmem is removed.
> > >
> > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
> > >
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > ---
> > > drivers/dax/super.c | 3 +-
> > > fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
> > > include/linux/mm.h | 1 +
> > > mm/memory-failure.c | 17 ++++++--
> > > 4 files changed, 96 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > index c4c4728a36e4..2e1a35e82fce 100644
> > > --- a/drivers/dax/super.c
> > > +++ b/drivers/dax/super.c
> > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > > return;
> > > if (dax_dev->holder_data != NULL)
> > > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > + MF_MEM_PRE_REMOVE);
> > > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > synchronize_srcu(&dax_srcu);
> > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > index 4a9bbd3fe120..f6ec56b76db6 100644
> > > --- a/fs/xfs/xfs_notify_failure.c
> > > +++ b/fs/xfs/xfs_notify_failure.c
> > > @@ -22,6 +22,7 @@
> > > #include <linux/mm.h>
> > > #include <linux/dax.h>
> > > +#include <linux/fs.h>
> > > struct xfs_failure_info {
> > > xfs_agblock_t startblock;
> > > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > > struct xfs_mount *mp = cur->bc_mp;
> > > struct xfs_inode *ip;
> > > struct xfs_failure_info *notify = data;
> > > + struct address_space *mapping;
> > > + pgoff_t pgoff;
> > > + unsigned long pgcnt;
> > > int error = 0;
> > > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > > + /* Continue the query because this isn't a failure. */
> > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > + return 0;
> > > notify->want_shutdown = true;
> > > return 0;
> > > }
> > > @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
> > > return 0;
> > > }
> > > - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > > - xfs_failure_pgoff(mp, rec, notify),
> > > - xfs_failure_pgcnt(mp, rec, notify),
> > > - notify->mf_flags);
> > > + mapping = VFS_I(ip)->i_mapping;
> > > + pgoff = xfs_failure_pgoff(mp, rec, notify);
> > > + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > > +
> > > + /* Continue the rmap query if the inode isn't a dax file. */
> > > + if (dax_mapping(mapping))
> > > + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > > + notify->mf_flags);
> > > +
> > > + /* Invalidate the cache in dax pages. */
> > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > + invalidate_inode_pages2_range(mapping, pgoff,
> > > + pgoff + pgcnt - 1);
> > > +
> > > xfs_irele(ip);
> > > return error;
> > > }
> > > +static void
> > > +xfs_dax_notify_failure_freeze(
> > > + struct xfs_mount *mp)
> > > +{
> > > + struct super_block *sb = mp->m_super;
> >
> > Nit: extra space right ^ here.
> >
> > > +
> > > + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
> > > + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
> > > + // Shall we just wait, or print warning then return -EBUSY?
> >
> > Hm. PRE_REMOVE gets called before the pmem gets unplugged, right? So
> > we'll send a second notification after it goes away, right?
>
> For the first question, yes.
>
> But I'm not sure about the second one. Do you mean: we'll send this
> notification again if unbind didn't success because freeze_super() returns
> -EBUSY? In other words, if the previous unbind operation did not work, we
> could unbind the device again.
Yeah. If the MF_MEM_PRE_REMOVE fails with EBUSY, then call it again
without PRE_REMOVE and let it kill processes.
--D
> >
> > If so, then I'd say return the error here instead of looping, and live
> > with a kernel-frozen fs discarding the PRE_REMOVE message.
> >
> > > + delay(HZ / 10);
> > > + }
> > > +}
> > > +
> > > +static void
> > > +xfs_dax_notify_failure_thaw(
> > > + struct xfs_mount *mp)
> > > +{
> > > + struct super_block *sb = mp->m_super;
> > > + int error;
> > > +
> > > + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > > + if (error)
> > > + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > > + error);
> > > + /*
> > > + * Also thaw userspace call anyway because the device is about to be
> > > + * removed immediately.
> > > + */
> > > + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > > +}
> > > +
> > > static int
> > > xfs_dax_notify_ddev_failure(
> > > struct xfs_mount *mp,
> > > @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
> > > error = xfs_trans_alloc_empty(mp, &tp);
> > > if (error)
> > > - return error;
> > > + goto out;
> > > for (; agno <= end_agno; agno++) {
> > > struct xfs_rmap_irec ri_low = { };
> > > @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
> > > }
> > > xfs_trans_cancel(tp);
> > > +
> > > + /*
> > > + * Determine how to shutdown the filesystem according to the
> > > + * error code and flags.
> > > + */
> > > if (error || notify.want_shutdown) {
> > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > if (!error)
> > > error = -EFSCORRUPTED;
> > > - }
> > > + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> > > + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > > +
> > > +out:
> > > + /* Thaw the fs if it is freezed before. */
> > > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > > + xfs_dax_notify_failure_thaw(mp);
> >
> > _thaw should be called from the same function that called _freeze.
>
> Will fix this.
>
> >
> > The rest of the patch seems ok to me.
>
> Thank you!
>
>
> --
> Ruan.
>
> >
> > --D
> >
> > > +
> > > return error;
> > > }
> > > @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
> > > if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > > mp->m_logdev_targp != mp->m_ddev_targp) {
> > > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > > + return 0;
> > > xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > return -EFSCORRUPTED;
> > > @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
> > > ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > > ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > > + /* Notify failure on the whole device. */
> > > + if (offset == 0 && len == U64_MAX) {
> > > + offset = ddev_start;
> > > + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > > + }
> > > +
> > > /* Ignore the range out of filesystem area */
> > > if (offset + len - 1 < ddev_start)
> > > return -ENXIO;
> > > @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
> > > if (offset + len - 1 > ddev_end)
> > > len = ddev_end - offset + 1;
> > > + if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > + xfs_info(mp, "device is about to be removed!");
> > > + /* Freeze fs to prevent new mappings from being created. */
> > > + xfs_dax_notify_failure_freeze(mp);
> > > + }
> > > +
> > > return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> > > mf_flags);
> > > }
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 27ce77080c79..a80c255b88d2 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -3576,6 +3576,7 @@ enum mf_flags {
> > > MF_UNPOISON = 1 << 4,
> > > MF_SW_SIMULATED = 1 << 5,
> > > MF_NO_RETRY = 1 << 6,
> > > + MF_MEM_PRE_REMOVE = 1 << 7,
> > > };
> > > int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > unsigned long count, int mf_flags);
> > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > index 5b663eca1f29..483b75f2fcfb 100644
> > > --- a/mm/memory-failure.c
> > > +++ b/mm/memory-failure.c
> > > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> > > */
> > > static void collect_procs_fsdax(struct page *page,
> > > struct address_space *mapping, pgoff_t pgoff,
> > > - struct list_head *to_kill)
> > > + struct list_head *to_kill, bool pre_remove)
> > > {
> > > struct vm_area_struct *vma;
> > > struct task_struct *tsk;
> > > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> > > i_mmap_lock_read(mapping);
> > > read_lock(&tasklist_lock);
> > > for_each_process(tsk) {
> > > - struct task_struct *t = task_early_kill(tsk, true);
> > > + struct task_struct *t = tsk;
> > > + /*
> > > + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> > > + * current may not be the one accessing the fsdax page.
> > > + * Otherwise, search for the current task.
> > > + */
> > > + if (!pre_remove)
> > > + t = task_early_kill(tsk, true);
> > > if (!t)
> > > continue;
> > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > dax_entry_t cookie;
> > > struct page *page;
> > > size_t end = index + count;
> > > + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > > mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > if (!page)
> > > goto unlock;
> > > - SetPageHWPoison(page);
> > > + if (!pre_remove)
> > > + SetPageHWPoison(page);
> > > - collect_procs_fsdax(page, mapping, index, &to_kill);
> > > + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > > unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > > index, mf_flags);
> > > unlock:
> > > --
> > > 2.40.1
> > >
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-08-01 3:25 ` Darrick J. Wong
@ 2023-08-03 10:44 ` Shiyang Ruan
0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-03 10:44 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On 2023/8/1 11:25, Darrick J. Wong wrote:
> On Mon, Jul 31, 2023 at 05:36:36PM +0800, Shiyang Ruan wrote:
>>
>>
>> On 2023/7/29 23:15, Darrick J. Wong wrote:
>>> On Thu, Jun 29, 2023 at 04:16:51PM +0800, Shiyang Ruan wrote:
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> on it to unmap all files in use, and notify processes who are using
>>>> those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>> -> unbind_store()
>>>> -> ... (skip)
>>>> -> devres_release_all()
>>>> -> kill_dax()
>>>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>> -> xfs_dax_notify_failure()
>>>> `-> freeze_super() // freeze (kernel call)
>>>> `-> do xfs rmap
>>>> ` -> mf_dax_kill_procs()
>>>> ` -> collect_procs_fsdax() // all associated processes
>>>> ` -> unmap_and_kill()
>>>> ` -> invalidate_inode_pages2_range() // drop file's cache
>>>> `-> thaw_super() // thaw (both kernel & user call)
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>> new dax mapping from being created. Do not shutdown filesystem directly
>>>> if configuration is not supported, or if failure range includes metadata
>>>> area. Make sure all files and processes(not only the current progress)
>>>> are handled correctly. Also drop the cache of associated files before
>>>> pmem is removed.
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> ---
>>>> drivers/dax/super.c | 3 +-
>>>> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>>>> include/linux/mm.h | 1 +
>>>> mm/memory-failure.c | 17 ++++++--
>>>> 4 files changed, 96 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>>> index c4c4728a36e4..2e1a35e82fce 100644
>>>> --- a/drivers/dax/super.c
>>>> +++ b/drivers/dax/super.c
>>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>>> return;
>>>> if (dax_dev->holder_data != NULL)
>>>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>>> + MF_MEM_PRE_REMOVE);
>>>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>>> synchronize_srcu(&dax_srcu);
>>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>>> index 4a9bbd3fe120..f6ec56b76db6 100644
>>>> --- a/fs/xfs/xfs_notify_failure.c
>>>> +++ b/fs/xfs/xfs_notify_failure.c
>>>> @@ -22,6 +22,7 @@
>>>> #include <linux/mm.h>
>>>> #include <linux/dax.h>
>>>> +#include <linux/fs.h>
>>>> struct xfs_failure_info {
>>>> xfs_agblock_t startblock;
>>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>>> struct xfs_mount *mp = cur->bc_mp;
>>>> struct xfs_inode *ip;
>>>> struct xfs_failure_info *notify = data;
>>>> + struct address_space *mapping;
>>>> + pgoff_t pgoff;
>>>> + unsigned long pgcnt;
>>>> int error = 0;
>>>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>>>> + /* Continue the query because this isn't a failure. */
>>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> + return 0;
>>>> notify->want_shutdown = true;
>>>> return 0;
>>>> }
>>>> @@ -92,14 +99,55 @@ xfs_dax_failure_fn(
>>>> return 0;
>>>> }
>>>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>>> - xfs_failure_pgoff(mp, rec, notify),
>>>> - xfs_failure_pgcnt(mp, rec, notify),
>>>> - notify->mf_flags);
>>>> + mapping = VFS_I(ip)->i_mapping;
>>>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>>>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>>> +
>>>> + /* Continue the rmap query if the inode isn't a dax file. */
>>>> + if (dax_mapping(mapping))
>>>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>>> + notify->mf_flags);
>>>> +
>>>> + /* Invalidate the cache in dax pages. */
>>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> + invalidate_inode_pages2_range(mapping, pgoff,
>>>> + pgoff + pgcnt - 1);
>>>> +
>>>> xfs_irele(ip);
>>>> return error;
>>>> }
>>>> +static void
>>>> +xfs_dax_notify_failure_freeze(
>>>> + struct xfs_mount *mp)
>>>> +{
>>>> + struct super_block *sb = mp->m_super;
>>>
>>> Nit: extra space right ^ here.
>>>
>>>> +
>>>> + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */
>>>> + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) {
>>>> + // Shall we just wait, or print warning then return -EBUSY?
>>>
>>> Hm. PRE_REMOVE gets called before the pmem gets unplugged, right? So
>>> we'll send a second notification after it goes away, right?
>>
>> For the first question, yes.
>>
>> But I'm not sure about the second one. Do you mean: we'll send this
>> notification again if unbind didn't success because freeze_super() returns
>> -EBUSY? In other words, if the previous unbind operation did not work, we
>> could unbind the device again.
>
> Yeah. If the MF_MEM_PRE_REMOVE fails with EBUSY, then call it again
> without PRE_REMOVE and let it kill processes.
Ok. But I have to pass the flag (MF_MEM_PRE_REMOVE) to
mf_dax_kill_procs() so that it can search for all processes that are
holding dax pages rather than only the current process.
Then, my thought is: if the filesystem is already frozen by the kernel
during unbind, just tolerate the -EBUSY and carry on with the RMAP walk
and killing processes. After the RMAP walk is done, skip the kernel thaw
as well. In this way, there is no need to send a second notification.
```
	bool frozen_by_kernel = false;
	// skip... other definitions

	if (mf_flags & MF_MEM_PRE_REMOVE) {
		xfs_info(mp, "Device is about to be removed!");
		/* Freeze fs to prevent new mappings from being created. */
		error = xfs_dax_notify_failure_freeze(mp);
		if (error) {
			/* Keep on if filesystem is frozen by kernel */
			if (error == -EBUSY)
				frozen_by_kernel = true;
			else
				return error;
		}
	}

	// skip... RMAP

out:
	/* Thaw the filesystem. */
	if (mf_flags & MF_MEM_PRE_REMOVE)
		/* don't thaw kernel frozen if already frozen by kernel */
		xfs_dax_notify_failure_thaw(mp, frozen_by_kernel);

	return error;
```
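For readers who want to poke at the control flow outside the kernel, here is a minimal user-space sketch of the same idea. It is only a model under stated assumptions: `struct fake_sb`, `fake_freeze_kernel()`, `fake_thaw_kernel()` and `pre_remove_notify()` are hypothetical names, not kernel APIs, and the rmap walk is reduced to a counter. It shows that a freeze failing with -EBUSY does not abort the walk, and that only a freeze taken by this call is thawed afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a superblock's kernel-freeze state. */
struct fake_sb {
	bool kernel_frozen;	/* true while a kernel holder owns the freeze */
	int thaw_calls;		/* kernel thaws issued, kept for inspection */
};

/* Models freeze_super(sb, FREEZE_HOLDER_KERNEL): -EBUSY if already held. */
int fake_freeze_kernel(struct fake_sb *sb)
{
	if (sb->kernel_frozen)
		return -EBUSY;
	sb->kernel_frozen = true;
	return 0;
}

/* Models thaw_super(sb, FREEZE_HOLDER_KERNEL). */
void fake_thaw_kernel(struct fake_sb *sb)
{
	sb->kernel_frozen = false;
	sb->thaw_calls++;
}

/*
 * Pre-remove handling as proposed above: tolerate -EBUSY from the freeze,
 * run the rmap walk regardless, and thaw only a freeze this call took.
 * Returns 0 on success or a negative errno from the freeze.
 */
int pre_remove_notify(struct fake_sb *sb, int *rmap_walks)
{
	bool frozen_by_other = false;
	int error;

	assert(sb && rmap_walks);

	error = fake_freeze_kernel(sb);
	if (error == -EBUSY)
		frozen_by_other = true;	/* keep going, another holder has it */
	else if (error)
		return error;

	(*rmap_walks)++;	/* placeholder for the rmap query and kill */

	if (!frozen_by_other)
		fake_thaw_kernel(sb);
	return 0;
}
```

Calling `pre_remove_notify()` on a superblock that some other kernel holder has already frozen still performs the walk but leaves that holder's freeze in place. The model leaves out the extra userspace thaw that the patch performs.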
--
Thanks,
Ruan.
>
> --D
>
>>>
>>> If so, then I'd say return the error here instead of looping, and live
>>> with a kernel-frozen fs discarding the PRE_REMOVE message.
>>>
>>>> + delay(HZ / 10);
>>>> + }
>>>> +}
>>>> +
>>>> +static void
>>>> +xfs_dax_notify_failure_thaw(
>>>> + struct xfs_mount *mp)
>>>> +{
>>>> + struct super_block *sb = mp->m_super;
>>>> + int error;
>>>> +
>>>> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>>>> + if (error)
>>>> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
>>>> + error);
>>>> + /*
>>>> + * Also thaw userspace call anyway because the device is about to be
>>>> + * removed immediately.
>>>> + */
>>>> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>>>> +}
>>>> +
>>>> static int
>>>> xfs_dax_notify_ddev_failure(
>>>> struct xfs_mount *mp,
>>>> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure(
>>>> error = xfs_trans_alloc_empty(mp, &tp);
>>>> if (error)
>>>> - return error;
>>>> + goto out;
>>>> for (; agno <= end_agno; agno++) {
>>>> struct xfs_rmap_irec ri_low = { };
>>>> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure(
>>>> }
>>>> xfs_trans_cancel(tp);
>>>> +
>>>> + /*
>>>> + * Determine how to shutdown the filesystem according to the
>>>> + * error code and flags.
>>>> + */
>>>> if (error || notify.want_shutdown) {
>>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>> if (!error)
>>>> error = -EFSCORRUPTED;
>>>> - }
>>>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>>> +
>>>> +out:
>>>> + /* Thaw the fs if it is freezed before. */
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + xfs_dax_notify_failure_thaw(mp);
>>>
>>> _thaw should be called from the same function that called _freeze.
>>
>> Will fix this.
>>
>>>
>>> The rest of the patch seems ok to me.
>>
>> Thank you!
>>
>>
>> --
>> Ruan.
>>
>>>
>>> --D
>>>
>>>> +
>>>> return error;
>>>> }
>>>> @@ -197,6 +257,8 @@ xfs_dax_notify_failure(
>>>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>>> mp->m_logdev_targp != mp->m_ddev_targp) {
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + return 0;
>>>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>> return -EFSCORRUPTED;
>>>> @@ -210,6 +272,12 @@ xfs_dax_notify_failure(
>>>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>>> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>>> + /* Notify failure on the whole device. */
>>>> + if (offset == 0 && len == U64_MAX) {
>>>> + offset = ddev_start;
>>>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>>> + }
>>>> +
>>>> /* Ignore the range out of filesystem area */
>>>> if (offset + len - 1 < ddev_start)
>>>> return -ENXIO;
>>>> @@ -226,6 +294,12 @@ xfs_dax_notify_failure(
>>>> if (offset + len - 1 > ddev_end)
>>>> len = ddev_end - offset + 1;
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>>>> + xfs_info(mp, "device is about to be removed!");
>>>> + /* Freeze fs to prevent new mappings from being created. */
>>>> + xfs_dax_notify_failure_freeze(mp);
>>>> + }
>>>> +
>>>> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>>>> mf_flags);
>>>> }
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 27ce77080c79..a80c255b88d2 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3576,6 +3576,7 @@ enum mf_flags {
>>>> MF_UNPOISON = 1 << 4,
>>>> MF_SW_SIMULATED = 1 << 5,
>>>> MF_NO_RETRY = 1 << 6,
>>>> + MF_MEM_PRE_REMOVE = 1 << 7,
>>>> };
>>>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> unsigned long count, int mf_flags);
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 5b663eca1f29..483b75f2fcfb 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>>> */
>>>> static void collect_procs_fsdax(struct page *page,
>>>> struct address_space *mapping, pgoff_t pgoff,
>>>> - struct list_head *to_kill)
>>>> + struct list_head *to_kill, bool pre_remove)
>>>> {
>>>> struct vm_area_struct *vma;
>>>> struct task_struct *tsk;
>>>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>>> i_mmap_lock_read(mapping);
>>>> read_lock(&tasklist_lock);
>>>> for_each_process(tsk) {
>>>> - struct task_struct *t = task_early_kill(tsk, true);
>>>> + struct task_struct *t = tsk;
>>>> + /*
>>>> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>>>> + * current may not be the one accessing the fsdax page.
>>>> + * Otherwise, search for the current task.
>>>> + */
>>>> + if (!pre_remove)
>>>> + t = task_early_kill(tsk, true);
>>>> if (!t)
>>>> continue;
>>>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> dax_entry_t cookie;
>>>> struct page *page;
>>>> size_t end = index + count;
>>>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> if (!page)
>>>> goto unlock;
>>>> - SetPageHWPoison(page);
>>>> + if (!pre_remove)
>>>> + SetPageHWPoison(page);
>>>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>>>> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>>> index, mf_flags);
>>>> unlock:
>>>> --
>>>> 2.40.1
>>>>
* RE: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
` (2 preceding siblings ...)
2023-07-29 15:15 ` Darrick J. Wong
@ 2023-08-08 0:31 ` Dan Williams
2023-08-23 8:36 ` Shiyang Ruan
2023-08-23 8:17 ` [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE " Shiyang Ruan
2023-08-28 6:57 ` [PATCH v14] " Shiyang Ruan
5 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2023-08-08 0:31 UTC (permalink / raw)
To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof
Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super() // freeze (kernel call)
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated processes
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2_range() // drop file's cache
> `-> thaw_super() // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created. Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area. Make sure all files and processes(not only the current progress)
> are handled correctly. Also drop the cache of associated files before
> pmem is removed.
I would say more about why this is important for DAX users. Yes, the
devm_memremap_pages() vs get_user_pages() infrastructure can be improved
if it has a mechanism to revoke all pages that it has handed out for a
given device, but that's not an end user visible effect.
The end user impact needs to be clear. Is this for existing deployed
pmem where a user accidentally removes a device and wants failures and
process killing instead of hangs?
The reason Linux has got along without this for so long is because pmem
is difficult to remove (and with the sunset of Optane, difficult to
acquire). One motivation to pursue this is CXL where hotplug is better
defined and use cases like dynamic capacity devices where making forward
progress to kill processes is better than hanging.
It would help to have an example of what happens without this patch.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 ++++++--
> 4 files changed, 96 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> return;
>
> if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
The motivation in the original proposal was to convey the death of
large extents to memory_failure(). However, that proposal predated your
mf_dax_kill_procs() approach. With mf_dax_kill_procs() the need for a
new bulk memory_failure() API is gone.
This is where the end user impact needs to be clear. It seems that
without this patch the filesystem may assume failure while the device is
still present, but that seems ok. The goal is forward progress after a
mistake, not necessarily minimizing damage after a mistake. The fact that
the current code is not as gentle could be considered a feature, because
graceful shutdown should always unmount before unplug, and if one
unplugs before unmount it is already understood that they get to keep
the pieces.
Because the driver ->remove() callback cannot enforce that the device
is still present, it seems unnecessary to optimize for the case where the
device is being removed from an actively mounted filesystem but is
still present.
The dax_holder_notify_failure(dax_dev, 0, U64_MAX) is sufficient to say
"userspace failed to umount before hardware eject, stop trying to access
this range", rather than "try to finish up in this range, but it might
already be too late".
* [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
` (3 preceding siblings ...)
2023-08-08 0:31 ` Dan Williams
@ 2023-08-23 8:17 ` Shiyang Ruan
2023-08-23 23:36 ` Darrick J. Wong
2023-08-28 6:57 ` [PATCH v14] " Shiyang Ruan
5 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-23 8:17 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof
====
Changes since v12:
1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
2. complete the behavior when the fs has already been frozen by a kernel call
NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
I tried this proposal[0].
3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
4. rebase on: xfs/xfs-linux.git vfs-for-next
====
Currently, if we suddenly remove a PMEM device (by calling unbind) that
hosts an FSDAX filesystem while programs are still accessing data on it,
e.g.:
```
$FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
# $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
the system can end up in an unacceptable state:
1. the device is gone but the mount point still exists, and umount fails
with "target is busy"
2. programs hang and cannot be killed
3. the kernel may crash with a NULL pointer dereference
To fix this, introduce an MF_MEM_PRE_REMOVE flag to let the filesystem know
that the whole device is about to be removed, and make sure all related
processes are notified so that they can exit gracefully.
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
on it to unmap all files in use, and notify processes who are using
those files.
Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()
      `-> freeze_super()             // freeze (kernel call)
      `-> do xfs rmap
      ` -> mf_dax_kill_procs()
      `  -> collect_procs_fsdax()    // all associated processes
      `  -> unmap_and_kill()
      ` -> invalidate_inode_pages2_range() // drop file's cache
      `-> thaw_super()               // thaw (both kernel & user call)
Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event. Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
new dax mappings from being created. Do not shut down the filesystem
directly if the configuration is not supported, or if the failure range
includes the metadata area. Make sure all files and processes (not only
the current process) are handled correctly. Also drop the cache of
associated files before the pmem is removed.
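As an illustration of the `(0, U64_MAX)` arguments in the call trace above, here is a user-space sketch of how a holder might resolve that whole-device sentinel into a concrete byte range. Only the sentinel check and the two boundary checks (`offset + len - 1 < ddev_start` and `offset + len - 1 > ddev_end`) come from the diff below. `struct byte_range`, `resolve_range()`, the `offset > ddev_end` rejection and the head trimming are assumptions added to make the example complete. The sketch also assumes `offset + len` does not overflow for non-sentinel input.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical container for a byte range on the dax device. */
struct byte_range {
	uint64_t offset;
	uint64_t len;
};

/*
 * Resolve a notification range against a data device spanning
 * [ddev_start, ddev_start + ddev_size - 1]. (0, UINT64_MAX) means "the
 * whole device". Returns 0 and writes the clamped range to *out, or
 * -ENXIO if the range lies entirely outside the device.
 */
int resolve_range(uint64_t offset, uint64_t len,
		  uint64_t ddev_start, uint64_t ddev_size,
		  struct byte_range *out)
{
	uint64_t ddev_end;

	assert(out && ddev_size > 0);
	ddev_end = ddev_start + ddev_size - 1;

	/* Whole-device sentinel: substitute the device's own extent. */
	if (offset == 0 && len == UINT64_MAX) {
		offset = ddev_start;
		len = ddev_size;
	}

	/* Ignore ranges entirely before or after the filesystem area. */
	if (offset + len - 1 < ddev_start)
		return -ENXIO;
	if (offset > ddev_end)		/* assumption, not in the diff */
		return -ENXIO;

	/* Trim the head (assumption) and the tail to the device bounds. */
	if (offset < ddev_start) {
		len -= ddev_start - offset;
		offset = ddev_start;
	}
	if (offset + len - 1 > ddev_end)
		len = ddev_end - offset + 1;

	out->offset = offset;
	out->len = len;
	return 0;
}
```

With a device that starts at byte 4096 and is 1 MiB long, `resolve_range(0, UINT64_MAX, 4096, 1048576, &r)` fills `r` with the device's own start and size, while a range that ends before byte 4096 is rejected with -ENXIO.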
[0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
drivers/dax/super.c | 3 +-
fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
include/linux/mm.h | 1 +
mm/memory-failure.c | 17 +++++--
4 files changed, 109 insertions(+), 11 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..2e1a35e82fce 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
if (dax_dev->holder_data != NULL)
- dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+ dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+ MF_MEM_PRE_REMOVE);
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 4a9bbd3fe120..6496c32a9172 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/dax.h>
+#include <linux/fs.h>
struct xfs_failure_info {
xfs_agblock_t startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
struct xfs_mount *mp = cur->bc_mp;
struct xfs_inode *ip;
struct xfs_failure_info *notify = data;
+ struct address_space *mapping;
+ pgoff_t pgoff;
+ unsigned long pgcnt;
int error = 0;
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+ /* Continue the query because this isn't a failure. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
notify->want_shutdown = true;
return 0;
}
@@ -92,14 +99,60 @@ xfs_dax_failure_fn(
return 0;
}
- error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
- xfs_failure_pgoff(mp, rec, notify),
- xfs_failure_pgcnt(mp, rec, notify),
- notify->mf_flags);
+ mapping = VFS_I(ip)->i_mapping;
+ pgoff = xfs_failure_pgoff(mp, rec, notify);
+ pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+ /* Continue the rmap query if the inode isn't a dax file. */
+ if (dax_mapping(mapping))
+ error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+ notify->mf_flags);
+
+ /* Invalidate the cache in dax pages. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ invalidate_inode_pages2_range(mapping, pgoff,
+ pgoff + pgcnt - 1);
+
xfs_irele(ip);
return error;
}
+static int
+xfs_dax_notify_failure_freeze(
+ struct xfs_mount *mp)
+{
+ struct super_block *sb = mp->m_super;
+ int error;
+
+ error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+ if (error)
+ xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+ return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+ struct xfs_mount *mp,
+ bool kernel_frozen)
+{
+ struct super_block *sb = mp->m_super;
+ int error;
+
+ if (!kernel_frozen) {
+ error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+ if (error)
+ xfs_emerg(mp, "still frozen after notify failure, err=%d",
+ error);
+ }
+
+ /*
+ * Also thaw userspace call anyway because the device is about to be
+ * removed immediately.
+ */
+ thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
static int
xfs_dax_notify_ddev_failure(
struct xfs_mount *mp,
@@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
struct xfs_btree_cur *cur = NULL;
struct xfs_buf *agf_bp = NULL;
int error = 0;
+ bool kernel_frozen = false;
xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
daddr + bblen - 1);
xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
+ if (mf_flags & MF_MEM_PRE_REMOVE) {
+ xfs_info(mp, "Device is about to be removed!");
+ /* Freeze fs to prevent new mappings from being created. */
+ error = xfs_dax_notify_failure_freeze(mp);
+ if (error) {
+ /* Keep going on if filesystem is frozen by kernel. */
+ if (error == -EBUSY)
+ kernel_frozen = true;
+ else
+ return error;
+ }
+ }
+
error = xfs_trans_alloc_empty(mp, &tp);
if (error)
- return error;
+ goto out;
for (; agno <= end_agno; agno++) {
struct xfs_rmap_irec ri_low = { };
@@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
}
xfs_trans_cancel(tp);
+
+ /*
+ * Determine how to shutdown the filesystem according to the
+ * error code and flags.
+ */
if (error || notify.want_shutdown) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
if (!error)
error = -EFSCORRUPTED;
- }
+ } else if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
+out:
+ /* Thaw the fs if it is frozen before. */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
return error;
}
@@ -197,6 +276,8 @@ xfs_dax_notify_failure(
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -210,6 +291,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
+ /* Notify failure on the whole device. */
+ if (offset == 0 && len == U64_MAX) {
+ offset = ddev_start;
+ len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+ }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 799836e84840..944a1165a321 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3577,6 +3577,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+ MF_MEM_PRE_REMOVE = 1 << 7,
};
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index dc5ff7dd4e50..92f18c9e0aaf 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
*/
static void collect_procs_fsdax(struct page *page,
struct address_space *mapping, pgoff_t pgoff,
- struct list_head *to_kill)
+ struct list_head *to_kill, bool pre_remove)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
i_mmap_lock_read(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
- struct task_struct *t = task_early_kill(tsk, true);
+ struct task_struct *t = tsk;
+ /*
+ * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+ * the current may not be the one accessing the fsdax page.
+ * Otherwise, search for the current task.
+ */
+ if (!pre_remove)
+ t = task_early_kill(tsk, true);
if (!t)
continue;
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
dax_entry_t cookie;
struct page *page;
size_t end = index + count;
+ bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
@@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
if (!page)
goto unlock;
- SetPageHWPoison(page);
+ if (!pre_remove)
+ SetPageHWPoison(page);
- collect_procs_fsdax(page, mapping, index, &to_kill);
+ collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
index, mf_flags);
unlock:
--
2.41.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-08-08 0:31 ` Dan Williams
@ 2023-08-23 8:36 ` Shiyang Ruan
0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-23 8:36 UTC (permalink / raw)
To: Dan Williams, linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: willy, jack, akpm, djwong, mcgrof
On 2023/8/8 8:31, Dan Williams wrote:
> Shiyang Ruan wrote:
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> on it to unmap all files in use, and notify processes who are using
>> those files.
>>
>> Call trace:
>> trigger unbind
>> -> unbind_store()
>> -> ... (skip)
>> -> devres_release_all()
>> -> kill_dax()
>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>> -> xfs_dax_notify_failure()
>> `-> freeze_super() // freeze (kernel call)
>> `-> do xfs rmap
>> ` -> mf_dax_kill_procs()
>> ` -> collect_procs_fsdax() // all associated processes
>> ` -> unmap_and_kill()
>> ` -> invalidate_inode_pages2_range() // drop file's cache
>> `-> thaw_super() // thaw (both kernel & user call)
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> new dax mapping from being created. Do not shutdown filesystem directly
>> if configuration is not supported, or if failure range includes metadata
>> area. Make sure all files and processes(not only the current progress)
>> are handled correctly. Also drop the cache of associated files before
>> pmem is removed.
>
> I would say more about why this is important for DAX users. Yes, the
> devm_memremap_pages() vs get_user_pages() infrastructure can be improved
> if it has a mechanism to revoke all pages that it has handed out for a
> given device, but that's not an end user visible effect.
>
> The end user impact needs to be clear. Is this for existing deployed
> pmem where a user accidentally removes a device and wants failures and
> process killing instead of hangs?
>
> The reason Linux has got along without this for so long is because pmem
> is difficult to remove (and with the sunset of Optane, difficult to
> acquire). One motivation to pursue this is CXL where hotplug is better
> defined and use cases like dynamic capacity devices where making forward
> progress to kill processes is better than hanging.
>
> It would help to have an example of what happens without this patch.
>
>>
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>> drivers/dax/super.c | 3 +-
>> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
>> include/linux/mm.h | 1 +
>> mm/memory-failure.c | 17 ++++++--
>> 4 files changed, 96 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index c4c4728a36e4..2e1a35e82fce 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>> return;
>>
>> if (dax_dev->holder_data != NULL)
>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> + MF_MEM_PRE_REMOVE);
>
> The motivation in the original proposal was to convey the death of
> large extents to memory_failure(). However, that proposal predated your
> mf_dax_kill_procs() approach. With mf_dax_kill_procs() the need for a
> new bulk memory_failure() API is gone.
>
> This is where the end user impact needs to be clear. It seems that
> without this patch the filesystem may assume failure while the device is
> already present, but that seems ok. The goal is forward progress after a
> mistake not necessarily minimizing damage after a mistake. The fact that
> the current code is not as gentle could be considered a feature because
> graceful shutdown should always unmount before unplug, and if one
> unplugs before unmount it is already understood that they get to keep
> the pieces.
>
> Because the driver ->remove() callback can not enforce that the device
> is still present it seems unnecessary to optimize for the case where the
> device is being removed from an actively mounted filesystem, but the
> device is still present.
>
> The dax_holder_notify_failure(dax_dev, 0, U64_MAX) is sufficient to say
> "userspace failed to umount before hardware eject, stop trying to access
> this range", rather than "try to finish up in this range, but it might
> already be too late".
Hi Dan,
I added a simple example of accidentally removing a pmem device, and of
the consequences without this patch, to the latest version. Please review.
--
Thanks,
Ruan.
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-08-23 8:17 ` [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE " Shiyang Ruan
@ 2023-08-23 23:36 ` Darrick J. Wong
2023-08-24 9:41 ` Shiyang Ruan
0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-23 23:36 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
> ====
> Changes since v12:
> 1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
> 2. complete the behavior when fs has already frozen by kernel call
> NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
> I tried this proposal[0].
> 3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
> 4. rebase on: xfs/xfs-linux.git vfs-for-next
> ====
>
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
> 1. device has gone but mount point still exists, and umount will fail
> with "target is busy"
> 2. programs will hang and cannot be killed
> 3. may crash with NULL pointer dereference
>
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
>
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super() // freeze (kernel call)
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated processes
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2_range() // drop file's cache
> `-> thaw_super() // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created. Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area. Make sure all files and processes(not only the current progress)
> are handled correctly. Also drop the cache of associated files before
> pmem is removed.
>
> [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 +++++--
> 4 files changed, 109 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> return;
>
> if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>
> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..6496c32a9172 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>
> #include <linux/mm.h>
> #include <linux/dax.h>
> +#include <linux/fs.h>
>
> struct xfs_failure_info {
> xfs_agblock_t startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> struct xfs_mount *mp = cur->bc_mp;
> struct xfs_inode *ip;
> struct xfs_failure_info *notify = data;
> + struct address_space *mapping;
> + pgoff_t pgoff;
> + unsigned long pgcnt;
> int error = 0;
>
> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Continue the query because this isn't a failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> notify->want_shutdown = true;
> return 0;
> }
> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
> return 0;
> }
>
> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> - xfs_failure_pgoff(mp, rec, notify),
> - xfs_failure_pgcnt(mp, rec, notify),
> - notify->mf_flags);
> + mapping = VFS_I(ip)->i_mapping;
> + pgoff = xfs_failure_pgoff(mp, rec, notify);
> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> + /* Continue the rmap query if the inode isn't a dax file. */
> + if (dax_mapping(mapping))
> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> + notify->mf_flags);
> +
> + /* Invalidate the cache in dax pages. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + invalidate_inode_pages2_range(mapping, pgoff,
> + pgoff + pgcnt - 1);
> +
> xfs_irele(ip);
> return error;
> }
>
> +static int
> +xfs_dax_notify_failure_freeze(
> + struct xfs_mount *mp)
> +{
> + struct super_block *sb = mp->m_super;
> + int error;
> +
> + error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> + if (error)
> + xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> +
> + return error;
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> + struct xfs_mount *mp,
> + bool kernel_frozen)
> +{
> + struct super_block *sb = mp->m_super;
> + int error;
> +
> + if (!kernel_frozen) {
> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> + if (error)
> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> + error);
> + }
> +
> + /*
> + * Also thaw userspace call anyway because the device is about to be
> + * removed immediately.
Does a userspace freeze inhibit or otherwise break device removal?
> + */
> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +}
> +
> static int
> xfs_dax_notify_ddev_failure(
> struct xfs_mount *mp,
> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
> struct xfs_btree_cur *cur = NULL;
> struct xfs_buf *agf_bp = NULL;
> int error = 0;
> + bool kernel_frozen = false;
> xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
> xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
> xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
> daddr + bblen - 1);
> xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "Device is about to be removed!");
> + /* Freeze fs to prevent new mappings from being created. */
> + error = xfs_dax_notify_failure_freeze(mp);
> + if (error) {
> + /* Keep going on if filesystem is frozen by kernel. */
> + if (error == -EBUSY)
> + kernel_frozen = true;
EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
kernel-freezing the fs. Someone else did, and they're expecting that
thaw_super will undo that.

	switch (error) {
	case -EBUSY:
		/* someone else froze the fs, keep going */
		break;
	case 0:
		/* we froze the fs */
		kernel_frozen = true;
		break;
	default:
		/* something else broke, should we continue anyway? */
		return error;
	}

TBH I wonder why all that isn't just:

	kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;

Since we'd want to keep going even if (say) the pmem was already
starting to fail and the freeze actually failed due to EIO, right?
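
The ownership point above can be illustrated with a small userspace sketch.
This is not kernel code: toy_sb, toy_freeze(), toy_thaw(), pre_remove_freeze()
and pre_remove_thaw() are hypothetical names, and the return codes only loosely
model the exclusive freeze/thaw behaviour discussed here (assumed: -EBUSY when
the same holder type already froze the fs, -EINVAL when thawing a freeze that
holder type does not own).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for FREEZE_HOLDER_KERNEL / FREEZE_HOLDER_USERSPACE. */
enum toy_holder {
	TOY_HOLDER_KERNEL    = 1 << 0,
	TOY_HOLDER_USERSPACE = 1 << 1,
};

/* Toy superblock: only tracks which holder types currently own a freeze. */
struct toy_sb {
	unsigned int holders;
};

/* Assumed behaviour: -EBUSY if this holder type already froze the fs. */
static int toy_freeze(struct toy_sb *sb, enum toy_holder who)
{
	if (sb->holders & who)
		return -EBUSY;
	sb->holders |= who;
	return 0;
}

/* Assumed behaviour: -EINVAL if this holder type does not own a freeze. */
static int toy_thaw(struct toy_sb *sb, enum toy_holder who)
{
	if (!(sb->holders & who))
		return -EINVAL;
	sb->holders &= ~who;
	return 0;
}

/*
 * Record ownership only when our own freeze succeeded. On any error,
 * including -EBUSY, we did not take the kernel freeze and must not
 * thaw it later.
 */
static bool pre_remove_freeze(struct toy_sb *sb)
{
	return toy_freeze(sb, TOY_HOLDER_KERNEL) == 0;
}

/* Undo only the kernel freeze that we took ourselves. */
static void pre_remove_thaw(struct toy_sb *sb, bool kernel_frozen)
{
	if (kernel_frozen)
		toy_thaw(sb, TOY_HOLDER_KERNEL);
}
```

With this shape, a freeze taken earlier by another kernel user survives the
notify-failure path, which is the opposite of what setting kernel_frozen on
-EBUSY would do.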
--D
> + else
> + return error;
> + }
> + }
> +
> error = xfs_trans_alloc_empty(mp, &tp);
> if (error)
> - return error;
> + goto out;
>
> for (; agno <= end_agno; agno++) {
> struct xfs_rmap_irec ri_low = { };
> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
> }
>
> xfs_trans_cancel(tp);
> +
> + /*
> + * Determine how to shutdown the filesystem according to the
> + * error code and flags.
> + */
> if (error || notify.want_shutdown) {
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> if (!error)
> error = -EFSCORRUPTED;
> - }
> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> + /* Thaw the fs if it is frozen before. */
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> +
> return error;
> }
>
> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>
> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> return -EFSCORRUPTED;
> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>
> + /* Notify failure on the whole device. */
> + if (offset == 0 && len == U64_MAX) {
> + offset = ddev_start;
> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> + }
> +
> /* Ignore the range out of filesystem area */
> if (offset + len - 1 < ddev_start)
> return -ENXIO;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 799836e84840..944a1165a321 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3577,6 +3577,7 @@ enum mf_flags {
> MF_UNPOISON = 1 << 4,
> MF_SW_SIMULATED = 1 << 5,
> MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
> };
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index dc5ff7dd4e50..92f18c9e0aaf 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> */
> static void collect_procs_fsdax(struct page *page,
> struct address_space *mapping, pgoff_t pgoff,
> - struct list_head *to_kill)
> + struct list_head *to_kill, bool pre_remove)
> {
> struct vm_area_struct *vma;
> struct task_struct *tsk;
> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> i_mmap_lock_read(mapping);
> read_lock(&tasklist_lock);
> for_each_process(tsk) {
> - struct task_struct *t = task_early_kill(tsk, true);
> + struct task_struct *t = tsk;
>
> + /*
> + * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> + * the current may not be the one accessing the fsdax page.
> + * Otherwise, search for the current task.
> + */
> + if (!pre_remove)
> + t = task_early_kill(tsk, true);
> if (!t)
> continue;
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> dax_entry_t cookie;
> struct page *page;
> size_t end = index + count;
> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>
> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>
> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> if (!page)
> goto unlock;
>
> - SetPageHWPoison(page);
> + if (!pre_remove)
> + SetPageHWPoison(page);
>
> - collect_procs_fsdax(page, mapping, index, &to_kill);
> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> index, mf_flags);
> unlock:
> --
> 2.41.0
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-08-23 23:36 ` Darrick J. Wong
@ 2023-08-24 9:41 ` Shiyang Ruan
2023-08-24 23:57 ` Darrick J. Wong
0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-24 9:41 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On 2023/8/24 7:36, Darrick J. Wong wrote:
> On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
>> ====
>> Changes since v12:
>> 1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
>> 2. complete the behavior when fs has already frozen by kernel call
>> NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
>> I tried this proposal[0].
>> 3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
>> 4. rebase on: xfs/xfs-linux.git vfs-for-next
>> ====
>>
>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>> contains FSDAX while programs are still accessing data in this device,
>> e.g.:
>> ```
>> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>> ```
>> it could come into an unacceptable state:
>> 1. device has gone but mount point still exists, and umount will fail
>> with "target is busy"
>> 2. programs will hang and cannot be killed
>> 3. may crash with NULL pointer dereference
>>
>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>> are going to remove the whole device, and make sure all related processes
>> could be notified so that they could end up gracefully.
>>
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> on it to unmap all files in use, and notify processes who are using
>> those files.
>>
>> Call trace:
>> trigger unbind
>> -> unbind_store()
>> -> ... (skip)
>> -> devres_release_all()
>> -> kill_dax()
>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>> -> xfs_dax_notify_failure()
>> `-> freeze_super() // freeze (kernel call)
>> `-> do xfs rmap
>> ` -> mf_dax_kill_procs()
>> ` -> collect_procs_fsdax() // all associated processes
>> ` -> unmap_and_kill()
>> ` -> invalidate_inode_pages2_range() // drop file's cache
>> `-> thaw_super() // thaw (both kernel & user call)
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> new dax mapping from being created. Do not shutdown filesystem directly
>> if configuration is not supported, or if failure range includes metadata
>> area. Make sure all files and processes(not only the current progress)
>> are handled correctly. Also drop the cache of associated files before
>> pmem is removed.
>>
>> [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>> drivers/dax/super.c | 3 +-
>> fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>> include/linux/mm.h | 1 +
>> mm/memory-failure.c | 17 +++++--
>> 4 files changed, 109 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index c4c4728a36e4..2e1a35e82fce 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>> return;
>>
>> if (dax_dev->holder_data != NULL)
>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> + MF_MEM_PRE_REMOVE);
>>
>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>> synchronize_srcu(&dax_srcu);
>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>> index 4a9bbd3fe120..6496c32a9172 100644
>> --- a/fs/xfs/xfs_notify_failure.c
>> +++ b/fs/xfs/xfs_notify_failure.c
>> @@ -22,6 +22,7 @@
>>
>> #include <linux/mm.h>
>> #include <linux/dax.h>
>> +#include <linux/fs.h>
>>
>> struct xfs_failure_info {
>> xfs_agblock_t startblock;
>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>> struct xfs_mount *mp = cur->bc_mp;
>> struct xfs_inode *ip;
>> struct xfs_failure_info *notify = data;
>> + struct address_space *mapping;
>> + pgoff_t pgoff;
>> + unsigned long pgcnt;
>> int error = 0;
>>
>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>> + /* Continue the query because this isn't a failure. */
>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> + return 0;
>> notify->want_shutdown = true;
>> return 0;
>> }
>> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>> return 0;
>> }
>>
>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>> - xfs_failure_pgoff(mp, rec, notify),
>> - xfs_failure_pgcnt(mp, rec, notify),
>> - notify->mf_flags);
>> + mapping = VFS_I(ip)->i_mapping;
>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>> +
>> + /* Continue the rmap query if the inode isn't a dax file. */
>> + if (dax_mapping(mapping))
>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>> + notify->mf_flags);
>> +
>> + /* Invalidate the cache in dax pages. */
>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> + invalidate_inode_pages2_range(mapping, pgoff,
>> + pgoff + pgcnt - 1);
>> +
>> xfs_irele(ip);
>> return error;
>> }
>>
>> +static int
>> +xfs_dax_notify_failure_freeze(
>> + struct xfs_mount *mp)
>> +{
>> + struct super_block *sb = mp->m_super;
>> + int error;
>> +
>> + error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>> + if (error)
>> + xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>> +
>> + return error;
>> +}
>> +
>> +static void
>> +xfs_dax_notify_failure_thaw(
>> + struct xfs_mount *mp,
>> + bool kernel_frozen)
>> +{
>> + struct super_block *sb = mp->m_super;
>> + int error;
>> +
>> + if (!kernel_frozen) {
>> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>> + if (error)
>> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
>> + error);
>> + }
>> +
>> + /*
>> + * Also thaw userspace call anyway because the device is about to be
>> + * removed immediately.
>
> Does a userspace freeze inhibit or otherwise break device removal?
It doesn't; the device can still be removed. But afterwards the mount point
still exists, and `umount /mnt/scratch` fails with "target is busy".
`xfs_freeze -u /mnt/scratch` doesn't work either.
So I think the unconditional thaw_super() here is needed.
>
>> + */
>> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>> +}
>> +
>> static int
>> xfs_dax_notify_ddev_failure(
>> struct xfs_mount *mp,
>> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>> struct xfs_btree_cur *cur = NULL;
>> struct xfs_buf *agf_bp = NULL;
>> int error = 0;
>> + bool kernel_frozen = false;
>> xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>> xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
>> xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
>> daddr + bblen - 1);
>> xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>>
>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>> + xfs_info(mp, "Device is about to be removed!");
>> + /* Freeze fs to prevent new mappings from being created. */
>> + error = xfs_dax_notify_failure_freeze(mp);
>> + if (error) {
>> + /* Keep going on if filesystem is frozen by kernel. */
>> + if (error == -EBUSY)
>> + kernel_frozen = true;
>
> EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
> kernel-freezing the fs. Someone else did, and they're expecting that
> thaw_super will undo that.
>
> 	switch (error) {
> 	case -EBUSY:
> 		/* someone else froze the fs, keep going */
> 		break;
> 	case 0:
> 		/* we froze the fs */
> 		kernel_frozen = true;
> 		break;
> 	default:
> 		/* something else broke, should we continue anyway? */
> 		return error;
> 	}
>
> TBH I wonder why all that isn't just:
>
> kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
>
> Since we'd want to keep going even if (say) the pmem was already
> starting to fail and the freeze actually failed due to EIO, right?
Yes. So we can say this is only an *attempt* to _freeze() here. Whatever its
result is, we continue.
Then I think `kernel_frozen` becomes useless as well, because we
should try to call both _thaw(KERNEL_CALL) and _thaw(USER_CALL) to make
sure umount can work after the device is gone.
So I think it's better to rename them from:
`static int xfs_dax_notify_failure_freeze()`,
`static void xfs_dax_notify_failure_thaw()`
to
`static void xfs_dax_notify_failure_try_freeze()`,
`static void xfs_dax_notify_failure_try_thaw()`.
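
A rough userspace sketch of that best-effort pair, for illustration only.
The names toy_sb, toy_freeze(), toy_thaw(), try_freeze() and try_thaw() are
hypothetical, and the error codes are an assumed simplification of the
exclusive freeze/thaw API rather than its real implementation.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel and userspace freeze holders. */
enum toy_holder {
	TOY_HOLDER_KERNEL    = 1 << 0,
	TOY_HOLDER_USERSPACE = 1 << 1,
};

/* Toy superblock: only tracks which holder types currently own a freeze. */
struct toy_sb {
	unsigned int holders;
};

/* Assumed behaviour: -EBUSY if this holder type already froze the fs. */
static int toy_freeze(struct toy_sb *sb, enum toy_holder who)
{
	if (sb->holders & who)
		return -EBUSY;
	sb->holders |= who;
	return 0;
}

/* Assumed behaviour: -EINVAL if this holder type does not own a freeze. */
static int toy_thaw(struct toy_sb *sb, enum toy_holder who)
{
	if (!(sb->holders & who))
		return -EINVAL;
	sb->holders &= ~who;
	return 0;
}

/* Best effort: the result is ignored, the caller continues either way. */
static void try_freeze(struct toy_sb *sb)
{
	(void)toy_freeze(sb, TOY_HOLDER_KERNEL);
}

/*
 * Best effort: drop both kinds of freeze, whoever took them, so that
 * nothing is left frozen once the device is gone.
 */
static void try_thaw(struct toy_sb *sb)
{
	(void)toy_thaw(sb, TOY_HOLDER_KERNEL);
	(void)toy_thaw(sb, TOY_HOLDER_USERSPACE);
}
```

Note the trade-off in this variant: try_thaw() also releases a kernel freeze
that someone else took, which is only reasonable because the device is about
to disappear.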
--
Thanks,
Ruan.
>
> --D
>
>> + else
>> + return error;
>> + }
>> + }
>> +
>> error = xfs_trans_alloc_empty(mp, &tp);
>> if (error)
>> - return error;
>> + goto out;
>>
>> for (; agno <= end_agno; agno++) {
>> struct xfs_rmap_irec ri_low = { };
>> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>> }
>>
>> xfs_trans_cancel(tp);
>> +
>> + /*
>> + * Determine how to shutdown the filesystem according to the
>> + * error code and flags.
>> + */
>> if (error || notify.want_shutdown) {
>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>> if (!error)
>> error = -EFSCORRUPTED;
>> - }
>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>> +
>> +out:
>> + /* Thaw the fs if it is frozen before. */
>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>> + xfs_dax_notify_failure_thaw(mp, kernel_frozen);
>> +
>> return error;
>> }
>>
>> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>>
>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>> mp->m_logdev_targp != mp->m_ddev_targp) {
>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>> + return 0;
>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>> return -EFSCORRUPTED;
>> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>
>> + /* Notify failure on the whole device. */
>> + if (offset == 0 && len == U64_MAX) {
>> + offset = ddev_start;
>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>> + }
>> +
>> /* Ignore the range out of filesystem area */
>> if (offset + len - 1 < ddev_start)
>> return -ENXIO;
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 799836e84840..944a1165a321 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3577,6 +3577,7 @@ enum mf_flags {
>> MF_UNPOISON = 1 << 4,
>> MF_SW_SIMULATED = 1 << 5,
>> MF_NO_RETRY = 1 << 6,
>> + MF_MEM_PRE_REMOVE = 1 << 7,
>> };
>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> unsigned long count, int mf_flags);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index dc5ff7dd4e50..92f18c9e0aaf 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>> */
>> static void collect_procs_fsdax(struct page *page,
>> struct address_space *mapping, pgoff_t pgoff,
>> - struct list_head *to_kill)
>> + struct list_head *to_kill, bool pre_remove)
>> {
>> struct vm_area_struct *vma;
>> struct task_struct *tsk;
>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>> i_mmap_lock_read(mapping);
>> read_lock(&tasklist_lock);
>> for_each_process(tsk) {
>> - struct task_struct *t = task_early_kill(tsk, true);
>> + struct task_struct *t = tsk;
>>
>> + /*
>> + * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
>> + * the current may not be the one accessing the fsdax page.
>> + * Otherwise, search for the current task.
>> + */
>> + if (!pre_remove)
>> + t = task_early_kill(tsk, true);
>> if (!t)
>> continue;
>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> dax_entry_t cookie;
>> struct page *page;
>> size_t end = index + count;
>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>
>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>
>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> if (!page)
>> goto unlock;
>>
>> - SetPageHWPoison(page);
>> + if (!pre_remove)
>> + SetPageHWPoison(page);
>>
>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>> index, mf_flags);
>> unlock:
>> --
>> 2.41.0
>>
* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-08-24 9:41 ` Shiyang Ruan
@ 2023-08-24 23:57 ` Darrick J. Wong
2023-08-25 3:52 ` Shiyang Ruan
0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-24 23:57 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On Thu, Aug 24, 2023 at 05:41:50PM +0800, Shiyang Ruan wrote:
>
>
> On 2023/8/24 7:36, Darrick J. Wong wrote:
> > On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
> > > ====
> > > Changes since v12:
> > > 1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
> > > 2. complete the behavior when fs has already frozen by kernel call
> > > NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
> > > I tried this proposal[0].
> > > 3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
> > > 4. rebase on: xfs/xfs-linux.git vfs-for-next
> > > ====
> > >
> > > Now, if we suddenly remove a PMEM device(by calling unbind) which
> > > contains FSDAX while programs are still accessing data in this device,
> > > e.g.:
> > > ```
> > > $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> > > # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> > > echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> > > ```
> > > it could come into an unacceptable state:
> > > 1. device has gone but mount point still exists, and umount will fail
> > > with "target is busy"
> > > 2. programs will hang and cannot be killed
> > > 3. may crash with NULL pointer dereference
> > >
> > > To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> > > are going to remove the whole device, and make sure all related processes
> > > could be notified so that they could end up gracefully.
> > >
> > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > dev_pagemap_failure()"[1]. With the help of dax_holder and
> > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > > on it to unmap all files in use, and notify processes who are using
> > > those files.
> > >
> > > Call trace:
> > > trigger unbind
> > > -> unbind_store()
> > > -> ... (skip)
> > > -> devres_release_all()
> > > -> kill_dax()
> > > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > > -> xfs_dax_notify_failure()
> > > `-> freeze_super() // freeze (kernel call)
> > > `-> do xfs rmap
> > > ` -> mf_dax_kill_procs()
> > > ` -> collect_procs_fsdax() // all associated processes
> > > ` -> unmap_and_kill()
> > > ` -> invalidate_inode_pages2_range() // drop file's cache
> > > `-> thaw_super() // thaw (both kernel & user call)
> > >
> > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > > event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> > > new dax mapping from being created. Do not shutdown filesystem directly
> > > if configuration is not supported, or if failure range includes metadata
> > > area. Make sure all files and processes(not only the current progress)
> > > are handled correctly. Also drop the cache of associated files before
> > > pmem is removed.
> > >
> > > [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
> > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> > >
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > ---
> > > drivers/dax/super.c | 3 +-
> > > fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
> > > include/linux/mm.h | 1 +
> > > mm/memory-failure.c | 17 +++++--
> > > 4 files changed, 109 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > index c4c4728a36e4..2e1a35e82fce 100644
> > > --- a/drivers/dax/super.c
> > > +++ b/drivers/dax/super.c
> > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > > return;
> > > if (dax_dev->holder_data != NULL)
> > > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > + MF_MEM_PRE_REMOVE);
> > > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > synchronize_srcu(&dax_srcu);
> > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > index 4a9bbd3fe120..6496c32a9172 100644
> > > --- a/fs/xfs/xfs_notify_failure.c
> > > +++ b/fs/xfs/xfs_notify_failure.c
> > > @@ -22,6 +22,7 @@
> > > #include <linux/mm.h>
> > > #include <linux/dax.h>
> > > +#include <linux/fs.h>
> > > struct xfs_failure_info {
> > > xfs_agblock_t startblock;
> > > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > > struct xfs_mount *mp = cur->bc_mp;
> > > struct xfs_inode *ip;
> > > struct xfs_failure_info *notify = data;
> > > + struct address_space *mapping;
> > > + pgoff_t pgoff;
> > > + unsigned long pgcnt;
> > > int error = 0;
> > > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > > + /* Continue the query because this isn't a failure. */
> > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > + return 0;
> > > notify->want_shutdown = true;
> > > return 0;
> > > }
> > > @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
> > > return 0;
> > > }
> > > - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > > - xfs_failure_pgoff(mp, rec, notify),
> > > - xfs_failure_pgcnt(mp, rec, notify),
> > > - notify->mf_flags);
> > > + mapping = VFS_I(ip)->i_mapping;
> > > + pgoff = xfs_failure_pgoff(mp, rec, notify);
> > > + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > > +
> > > + /* Continue the rmap query if the inode isn't a dax file. */
> > > + if (dax_mapping(mapping))
> > > + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > > + notify->mf_flags);
> > > +
> > > + /* Invalidate the cache in dax pages. */
> > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > + invalidate_inode_pages2_range(mapping, pgoff,
> > > + pgoff + pgcnt - 1);
> > > +
> > > xfs_irele(ip);
> > > return error;
> > > }
> > > +static int
> > > +xfs_dax_notify_failure_freeze(
> > > + struct xfs_mount *mp)
> > > +{
> > > + struct super_block *sb = mp->m_super;
> > > + int error;
> > > +
> > > + error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> > > + if (error)
> > > + xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> > > +
> > > + return error;
> > > +}
> > > +
> > > +static void
> > > +xfs_dax_notify_failure_thaw(
> > > + struct xfs_mount *mp,
> > > + bool kernel_frozen)
> > > +{
> > > + struct super_block *sb = mp->m_super;
> > > + int error;
> > > +
> > > + if (!kernel_frozen) {
> > > + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > > + if (error)
> > > + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > > + error);
> > > + }
> > > +
> > > + /*
> > > + * Also thaw userspace call anyway because the device is about to be
> > > + * removed immediately.
> >
> > Does a userspace freeze inhibit or otherwise break device removal?
>
> It doesn't. Device can be removed. But after that, the mount point still
> exists, and `umount /mnt/scratch` fails with "target is busy." `xfs_freeze
> -u /mnt/scratch` cannot work too.

Yes, that's true, but that's long been the case for removing block
devices. Should block device removal (since we now have hooks for
that!) also be breaking freezes?

> So, I think thaw_super() anyway here is needed.
>
>
> >
> > > + */
> > > + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > > +}
> > > +
> > > static int
> > > xfs_dax_notify_ddev_failure(
> > > struct xfs_mount *mp,
> > > @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
> > > struct xfs_btree_cur *cur = NULL;
> > > struct xfs_buf *agf_bp = NULL;
> > > int error = 0;
> > > + bool kernel_frozen = false;
> > > xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
> > > xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
> > > xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
> > > daddr + bblen - 1);
> > > xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
> > > + if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > + xfs_info(mp, "Device is about to be removed!");
> > > + /* Freeze fs to prevent new mappings from being created. */
> > > + error = xfs_dax_notify_failure_freeze(mp);
> > > + if (error) {
> > > + /* Keep going on if filesystem is frozen by kernel. */
> > > + if (error == -EBUSY)
> > > + kernel_frozen = true;
> >
> > EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
> > kernel-freezing the fs. Someone else did, and they're expecting that
> > thaw_super will undo that.
> >
> > switch (error) {
> > case -EBUSY:
> > /* someone else froze the fs, keep going */
> > break;
> > case 0:
> > /* we froze the fs */
> > kernel_frozen = true;
> > break;
> > default:
> > /* something else broke, should we continue anyway? */
> > return error;
> > }
> >
> > TBH I wonder why all that isn't just:
> >
> > kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
> >
> > Since we'd want to keep going even if (say) the pmem was already
> > starting to fail and the freeze actually failed due to EIO, right?
>
> Yes. So we can say it is a *try* to _freeze() here. No matter what its
> result is, we continue.
>
> Then I think the `kernel_frozen` becomes useless as well. Because we should
> try to call both _thaw(KERNEL_CALL) and _thaw(USER_CALL) to make sure umount
> can work after device is gone.

I disagree -- unlike the mess that is userspace freezing, kernel code
that obtained a kernel freeze will get very confused and potentially do
Seriously Bad Things if the kernel freeze is yanked out from under them.
Kernel code is not supposed to release things that they did not
themselves obtain.

That might not ultimately matter for the narrow case of the device going
away, but the two other usecases (online fsck and suspend) will
malfunction if you drop a kernel freeze that they obtained.

I don't mind if PREREMOVE can't get a freeze and keeps going with the
invalidations anyway. We did our best, and when the pmem goes away we
can just kill -9 down the processes.

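For illustration only, here is a minimal userspace sketch of the ownership
rule argued above: try to take the kernel freeze, carry on regardless of the
result, and thaw only if this path was the one that froze. Everything in it
(struct mock_sb, mock_freeze_kernel(), mock_thaw_kernel(),
pre_remove_sketch()) is a hypothetical stand-in, not the real
freeze_super()/thaw_super() API or the XFS code; it assumes only that an
exclusive kernel freeze reports -EBUSY when another holder already has it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a superblock's kernel-freeze state. */
struct mock_sb {
	bool kernel_frozen;	/* some kernel holder owns the freeze */
};

/* Stand-in for an exclusive kernel freeze: -EBUSY if already held. */
static int mock_freeze_kernel(struct mock_sb *sb)
{
	if (sb->kernel_frozen)
		return -EBUSY;
	sb->kernel_frozen = true;
	return 0;
}

/* Stand-in for the matching thaw; only valid while the freeze is held. */
static void mock_thaw_kernel(struct mock_sb *sb)
{
	assert(sb->kernel_frozen);
	sb->kernel_frozen = false;
}

/*
 * Pre-remove path: the freeze is best effort. Returns true if this
 * call obtained (and therefore released) the kernel freeze itself.
 */
static bool pre_remove_sketch(struct mock_sb *sb)
{
	/* Remember whether *we* froze the fs; any error means we did not. */
	bool we_froze = mock_freeze_kernel(sb) == 0;

	/* ... rmap walk, unmap and invalidation would run here ... */

	/* Release only what we obtained; leave other holders' freezes alone. */
	if (we_froze)
		mock_thaw_kernel(sb);
	return we_froze;
}
```

With an idle filesystem the sketch freezes and thaws it again; with a freeze
already held by someone else (say, online fsck) it does its work and leaves
that freeze in place.
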
--D
> Then, I think it's better to change them:
> `static int xfs_dax_notify_failure_freeze()`,
> `static void xfs_dax_notify_failure_thaw()`
> to
> `static void xfs_dax_notify_failure_try_freeze()`,
> `static void xfs_dax_notify_failure_try_thaw()`.
>
>
> --
> Thanks,
> Ruan.
>
> >
> > --D
> >
> > > + else
> > > + return error;
> > > + }
> > > + }
> > > +
> > > error = xfs_trans_alloc_empty(mp, &tp);
> > > if (error)
> > > - return error;
> > > + goto out;
> > > for (; agno <= end_agno; agno++) {
> > > struct xfs_rmap_irec ri_low = { };
> > > @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
> > > }
> > > xfs_trans_cancel(tp);
> > > +
> > > + /*
> > > + * Determine how to shutdown the filesystem according to the
> > > + * error code and flags.
> > > + */
> > > if (error || notify.want_shutdown) {
> > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > if (!error)
> > > error = -EFSCORRUPTED;
> > > - }
> > > + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> > > + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > > +
> > > +out:
> > > + /* Thaw the fs if it is frozen before. */
> > > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > > + xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> > > +
> > > return error;
> > > }
> > > @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
> > > if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > > mp->m_logdev_targp != mp->m_ddev_targp) {
> > > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > > + return 0;
> > > xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > return -EFSCORRUPTED;
> > > @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
> > > ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > > ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > > + /* Notify failure on the whole device. */
> > > + if (offset == 0 && len == U64_MAX) {
> > > + offset = ddev_start;
> > > + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > > + }
> > > +
> > > /* Ignore the range out of filesystem area */
> > > if (offset + len - 1 < ddev_start)
> > > return -ENXIO;
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 799836e84840..944a1165a321 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -3577,6 +3577,7 @@ enum mf_flags {
> > > MF_UNPOISON = 1 << 4,
> > > MF_SW_SIMULATED = 1 << 5,
> > > MF_NO_RETRY = 1 << 6,
> > > + MF_MEM_PRE_REMOVE = 1 << 7,
> > > };
> > > int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > unsigned long count, int mf_flags);
> > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > index dc5ff7dd4e50..92f18c9e0aaf 100644
> > > --- a/mm/memory-failure.c
> > > +++ b/mm/memory-failure.c
> > > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> > > */
> > > static void collect_procs_fsdax(struct page *page,
> > > struct address_space *mapping, pgoff_t pgoff,
> > > - struct list_head *to_kill)
> > > + struct list_head *to_kill, bool pre_remove)
> > > {
> > > struct vm_area_struct *vma;
> > > struct task_struct *tsk;
> > > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> > > i_mmap_lock_read(mapping);
> > > read_lock(&tasklist_lock);
> > > for_each_process(tsk) {
> > > - struct task_struct *t = task_early_kill(tsk, true);
> > > + struct task_struct *t = tsk;
> > > + /*
> > > + * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> > > + * the current may not be the one accessing the fsdax page.
> > > + * Otherwise, search for the current task.
> > > + */
> > > + if (!pre_remove)
> > > + t = task_early_kill(tsk, true);
> > > if (!t)
> > > continue;
> > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > dax_entry_t cookie;
> > > struct page *page;
> > > size_t end = index + count;
> > > + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > > mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > if (!page)
> > > goto unlock;
> > > - SetPageHWPoison(page);
> > > + if (!pre_remove)
> > > + SetPageHWPoison(page);
> > > - collect_procs_fsdax(page, mapping, index, &to_kill);
> > > + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > > unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > > index, mf_flags);
> > > unlock:
> > > --
> > > 2.41.0
> > >
* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-08-24 23:57 ` Darrick J. Wong
@ 2023-08-25 3:52 ` Shiyang Ruan
2023-08-26 0:17 ` Darrick J. Wong
0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-25 3:52 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On 2023/8/25 7:57, Darrick J. Wong wrote:
> On Thu, Aug 24, 2023 at 05:41:50PM +0800, Shiyang Ruan wrote:
>>
>>
>> On 2023/8/24 7:36, Darrick J. Wong wrote:
>>> On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
>>>> ====
>>>> Changes since v12:
>>>> 1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
>>>> 2. complete the behavior when fs has already frozen by kernel call
>>>> NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
>>>> I tried this proposal[0].
>>>> 3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
>>>> 4. rebase on: xfs/xfs-linux.git vfs-for-next
>>>> ====
>>>>
>>>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>>>> contains FSDAX while programs are still accessing data in this device,
>>>> e.g.:
>>>> ```
>>>> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>>> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>>> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>>>> ```
>>>> it could come into an unacceptable state:
>>>> 1. device has gone but mount point still exists, and umount will fail
>>>> with "target is busy"
>>>> 2. programs will hang and cannot be killed
>>>> 3. may crash with NULL pointer dereference
>>>>
>>>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>>>> are going to remove the whole device, and make sure all related processes
>>>> could be notified so that they could end up gracefully.
>>>>
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> on it to unmap all files in use, and notify processes who are using
>>>> those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>> -> unbind_store()
>>>> -> ... (skip)
>>>> -> devres_release_all()
>>>> -> kill_dax()
>>>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>> -> xfs_dax_notify_failure()
>>>> `-> freeze_super() // freeze (kernel call)
>>>> `-> do xfs rmap
>>>> ` -> mf_dax_kill_procs()
>>>> ` -> collect_procs_fsdax() // all associated processes
>>>> ` -> unmap_and_kill()
>>>> ` -> invalidate_inode_pages2_range() // drop file's cache
>>>> `-> thaw_super() // thaw (both kernel & user call)
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>> new dax mapping from being created. Do not shutdown filesystem directly
>>>> if configuration is not supported, or if failure range includes metadata
>>>> area. Make sure all files and processes(not only the current progress)
>>>> are handled correctly. Also drop the cache of associated files before
>>>> pmem is removed.
>>>>
>>>> [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> ---
>>>> drivers/dax/super.c | 3 +-
>>>> fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>>>> include/linux/mm.h | 1 +
>>>> mm/memory-failure.c | 17 +++++--
>>>> 4 files changed, 109 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>>> index c4c4728a36e4..2e1a35e82fce 100644
>>>> --- a/drivers/dax/super.c
>>>> +++ b/drivers/dax/super.c
>>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>>> return;
>>>> if (dax_dev->holder_data != NULL)
>>>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>>> + MF_MEM_PRE_REMOVE);
>>>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>>> synchronize_srcu(&dax_srcu);
>>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>>> index 4a9bbd3fe120..6496c32a9172 100644
>>>> --- a/fs/xfs/xfs_notify_failure.c
>>>> +++ b/fs/xfs/xfs_notify_failure.c
>>>> @@ -22,6 +22,7 @@
>>>> #include <linux/mm.h>
>>>> #include <linux/dax.h>
>>>> +#include <linux/fs.h>
>>>> struct xfs_failure_info {
>>>> xfs_agblock_t startblock;
>>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>>> struct xfs_mount *mp = cur->bc_mp;
>>>> struct xfs_inode *ip;
>>>> struct xfs_failure_info *notify = data;
>>>> + struct address_space *mapping;
>>>> + pgoff_t pgoff;
>>>> + unsigned long pgcnt;
>>>> int error = 0;
>>>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>>>> + /* Continue the query because this isn't a failure. */
>>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> + return 0;
>>>> notify->want_shutdown = true;
>>>> return 0;
>>>> }
>>>> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>>>> return 0;
>>>> }
>>>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>>> - xfs_failure_pgoff(mp, rec, notify),
>>>> - xfs_failure_pgcnt(mp, rec, notify),
>>>> - notify->mf_flags);
>>>> + mapping = VFS_I(ip)->i_mapping;
>>>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>>>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>>> +
>>>> + /* Continue the rmap query if the inode isn't a dax file. */
>>>> + if (dax_mapping(mapping))
>>>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>>> + notify->mf_flags);
>>>> +
>>>> + /* Invalidate the cache in dax pages. */
>>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> + invalidate_inode_pages2_range(mapping, pgoff,
>>>> + pgoff + pgcnt - 1);
>>>> +
>>>> xfs_irele(ip);
>>>> return error;
>>>> }
>>>> +static int
>>>> +xfs_dax_notify_failure_freeze(
>>>> + struct xfs_mount *mp)
>>>> +{
>>>> + struct super_block *sb = mp->m_super;
>>>> + int error;
>>>> +
>>>> + error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>>>> + if (error)
>>>> + xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>>>> +
>>>> + return error;
>>>> +}
>>>> +
>>>> +static void
>>>> +xfs_dax_notify_failure_thaw(
>>>> + struct xfs_mount *mp,
>>>> + bool kernel_frozen)
>>>> +{
>>>> + struct super_block *sb = mp->m_super;
>>>> + int error;
>>>> +
>>>> + if (!kernel_frozen) {
>>>> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>>>> + if (error)
>>>> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
>>>> + error);
>>>> + }
>>>> +
>>>> + /*
>>>> + * Also thaw userspace call anyway because the device is about to be
>>>> + * removed immediately.
>>>
>>> Does a userspace freeze inhibit or otherwise break device removal?
>>
>> It doesn't. Device can be removed. But after that, the mount point still
>> exists, and `umount /mnt/scratch` fails with "target is busy." `xfs_freeze
>> -u /mnt/scratch` cannot work too.
>
> Yes, that's true, but that's long been the case for removing block
> devices. Should block device removal (since we now have hooks for
> that!) also be breaking freezes?

I think so, but it may take more time to implement. Shall we leave it
for a later optimization?

>
>> So, I think thaw_super() anyway here is needed.
>>
>>
>>>
>>>> + */
>>>> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>>>> +}
>>>> +
>>>> static int
>>>> xfs_dax_notify_ddev_failure(
>>>> struct xfs_mount *mp,
>>>> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>>>> struct xfs_btree_cur *cur = NULL;
>>>> struct xfs_buf *agf_bp = NULL;
>>>> int error = 0;
>>>> + bool kernel_frozen = false;
>>>> xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>>>> xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
>>>> xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
>>>> daddr + bblen - 1);
>>>> xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>>>> + xfs_info(mp, "Device is about to be removed!");
>>>> + /* Freeze fs to prevent new mappings from being created. */
>>>> + error = xfs_dax_notify_failure_freeze(mp);
>>>> + if (error) {
>>>> + /* Keep going on if filesystem is frozen by kernel. */
>>>> + if (error == -EBUSY)
>>>> + kernel_frozen = true;
>>>
>>> EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
>>> kernel-freezing the fs. Someone else did, and they're expecting that
>>> thaw_super will undo that.
>>>
>>> switch (error) {
>>> case -EBUSY:
>>> /* someone else froze the fs, keep going */
>>> break;
>>> case 0:
>>> /* we froze the fs */
>>> kernel_frozen = true;
>>> break;
>>> default:
>>> /* something else broke, should we continue anyway? */
>>> return error;
>>> }
>>>
>>> TBH I wonder why all that isn't just:
>>>
>>> kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
>>>
>>> Since we'd want to keep going even if (say) the pmem was already
>>> starting to fail and the freeze actually failed due to EIO, right?
>>
>> Yes. So we can say it is a *try* to _freeze() here. No matter what its
>> result is, we continue.
>>
>> Then I think the `kernel_frozen` becomes useless as well. Because we should
>> try to call both _thaw(KERNEL_CALL) and _thaw(USER_CALL) to make sure umount
>> can work after device is gone.
>
> I disagree -- unlike the mess that is userspace freezing, kernel code
> that obtained a kernel freeze will get very confused and potentially do
> Seriously Bad Things if the kernel freeze is yanked out from under them.
> Kernel code is not supposed to release things that they did not
> themselves obtain.
>
> That might not ultimately matter for the narrow case of the device going
> away, but the two other usecases (online fsck and suspend) will
> malfunction if you drop a kernel freeze that they obtained.

Could online fsck and suspend keep working even after
`xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);` has been called?

>
> I don't mind if PREREMOVE can't get a freeze and keeps going with the
> invalidations anyway. We did our best, and when the pmem goes away we
> can just kill -9 down the processes.

OK, I agree.

Then, one last thing I want to confirm:
On my host, if the filesystem is not thawed (via _thaw()) after the device
is gone, the processes keep waiting and cannot be killed manually with
`kill -9`. Is there another way to get these processes killed?

--
Thanks,
Ruan.
>
> --D
>
>> Then, I think it's better to change them:
>> `static int xfs_dax_notify_failure_freeze()`,
>> `static void xfs_dax_notify_failure_thaw()`
>> to
>> `static void xfs_dax_notify_failure_try_freeze()`,
>> `static void xfs_dax_notify_failure_try_thaw()`.
>>
>>
>> --
>> Thanks,
>> Ruan.
>>
>>>
>>> --D
>>>
>>>> + else
>>>> + return error;
>>>> + }
>>>> + }
>>>> +
>>>> error = xfs_trans_alloc_empty(mp, &tp);
>>>> if (error)
>>>> - return error;
>>>> + goto out;
>>>> for (; agno <= end_agno; agno++) {
>>>> struct xfs_rmap_irec ri_low = { };
>>>> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>>>> }
>>>> xfs_trans_cancel(tp);
>>>> +
>>>> + /*
>>>> + * Determine how to shutdown the filesystem according to the
>>>> + * error code and flags.
>>>> + */
>>>> if (error || notify.want_shutdown) {
>>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>> if (!error)
>>>> error = -EFSCORRUPTED;
>>>> - }
>>>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>>> +
>>>> +out:
>>>> + /* Thaw the fs if it is frozen before. */
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + xfs_dax_notify_failure_thaw(mp, kernel_frozen);
>>>> +
>>>> return error;
>>>> }
>>>> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>>>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>>> mp->m_logdev_targp != mp->m_ddev_targp) {
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + return 0;
>>>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>> return -EFSCORRUPTED;
>>>> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>>>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>>> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>>> + /* Notify failure on the whole device. */
>>>> + if (offset == 0 && len == U64_MAX) {
>>>> + offset = ddev_start;
>>>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>>> + }
>>>> +
>>>> /* Ignore the range out of filesystem area */
>>>> if (offset + len - 1 < ddev_start)
>>>> return -ENXIO;
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 799836e84840..944a1165a321 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3577,6 +3577,7 @@ enum mf_flags {
>>>> MF_UNPOISON = 1 << 4,
>>>> MF_SW_SIMULATED = 1 << 5,
>>>> MF_NO_RETRY = 1 << 6,
>>>> + MF_MEM_PRE_REMOVE = 1 << 7,
>>>> };
>>>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> unsigned long count, int mf_flags);
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index dc5ff7dd4e50..92f18c9e0aaf 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>>> */
>>>> static void collect_procs_fsdax(struct page *page,
>>>> struct address_space *mapping, pgoff_t pgoff,
>>>> - struct list_head *to_kill)
>>>> + struct list_head *to_kill, bool pre_remove)
>>>> {
>>>> struct vm_area_struct *vma;
>>>> struct task_struct *tsk;
>>>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>>> i_mmap_lock_read(mapping);
>>>> read_lock(&tasklist_lock);
>>>> for_each_process(tsk) {
>>>> - struct task_struct *t = task_early_kill(tsk, true);
>>>> + struct task_struct *t = tsk;
>>>> + /*
>>>> + * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
>>>> + * the current may not be the one accessing the fsdax page.
>>>> + * Otherwise, search for the current task.
>>>> + */
>>>> + if (!pre_remove)
>>>> + t = task_early_kill(tsk, true);
>>>> if (!t)
>>>> continue;
>>>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> dax_entry_t cookie;
>>>> struct page *page;
>>>> size_t end = index + count;
>>>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> if (!page)
>>>> goto unlock;
>>>> - SetPageHWPoison(page);
>>>> + if (!pre_remove)
>>>> + SetPageHWPoison(page);
>>>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>>>> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>>> index, mf_flags);
>>>> unlock:
>>>> --
>>>> 2.41.0
>>>>
* Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-08-25 3:52 ` Shiyang Ruan
@ 2023-08-26 0:17 ` Darrick J. Wong
0 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-26 0:17 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On Fri, Aug 25, 2023 at 11:52:35AM +0800, Shiyang Ruan wrote:
>
>
> On 2023/8/25 7:57, Darrick J. Wong wrote:
> > On Thu, Aug 24, 2023 at 05:41:50PM +0800, Shiyang Ruan wrote:
> > >
> > >
> > > On 2023/8/24 7:36, Darrick J. Wong wrote:
> > > > On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
> > > > > ====
> > > > > Changes since v12:
> > > > > 1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
> > > > > 2. complete the behavior when fs has already frozen by kernel call
> > > > > NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
> > > > > I tried this proposal[0].
> > > > > 3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
> > > > > 4. rebase on: xfs/xfs-linux.git vfs-for-next
> > > > > ====
> > > > >
> > > > > Now, if we suddenly remove a PMEM device(by calling unbind) which
> > > > > contains FSDAX while programs are still accessing data in this device,
> > > > > e.g.:
> > > > > ```
> > > > > $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> > > > > # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> > > > > echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> > > > > ```
> > > > > > the system can end up in an unacceptable state:
> > > > > > 1. the device is gone but the mount point still exists, and umount fails
> > > > > > with "target is busy"
> > > > > > 2. programs hang and cannot be killed
> > > > > > 3. the kernel may crash with a NULL pointer dereference
> > > > > >
> > > > > > To fix this, introduce a MF_MEM_PRE_REMOVE flag to let the filesystem know
> > > > > > that the whole device is about to be removed, and make sure all related
> > > > > > processes are notified so that they can exit gracefully.
> > > > > >
> > > > > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > > > > dev_pagemap_failure()"[1]. With the help of the dax_holder and
> > > > > > ->notify_failure() mechanism, the pmem driver can ask the filesystem
> > > > > > on it to unmap all files in use and to notify the processes that are
> > > > > > using those files.
> > > > >
> > > > > Call trace:
> > > > > trigger unbind
> > > > > -> unbind_store()
> > > > > -> ... (skip)
> > > > > -> devres_release_all()
> > > > > -> kill_dax()
> > > > > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > > > > -> xfs_dax_notify_failure()
> > > > > `-> freeze_super() // freeze (kernel call)
> > > > > `-> do xfs rmap
> > > > > ` -> mf_dax_kill_procs()
> > > > > ` -> collect_procs_fsdax() // all associated processes
> > > > > ` -> unmap_and_kill()
> > > > > ` -> invalidate_inode_pages2_range() // drop file's cache
> > > > > `-> thaw_super() // thaw (both kernel & user call)
> > > > >
> > > > > > Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
> > > > > > event. Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
> > > > > > new dax mappings from being created. Do not shut down the filesystem
> > > > > > directly if the configuration is not supported, or if the failure range
> > > > > > includes the metadata area. Make sure all files and processes (not only
> > > > > > the current process) are handled correctly. Also drop the cache of
> > > > > > associated files before the pmem is removed.
> > > > >
> > > > > [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
> > > > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > > > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> > > > >
> > > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > > > ---
> > > > > drivers/dax/super.c | 3 +-
> > > > > fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
> > > > > include/linux/mm.h | 1 +
> > > > > mm/memory-failure.c | 17 +++++--
> > > > > 4 files changed, 109 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > > > index c4c4728a36e4..2e1a35e82fce 100644
> > > > > --- a/drivers/dax/super.c
> > > > > +++ b/drivers/dax/super.c
> > > > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > > > > return;
> > > > > if (dax_dev->holder_data != NULL)
> > > > > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > > > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > > > + MF_MEM_PRE_REMOVE);
> > > > > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > > > synchronize_srcu(&dax_srcu);
> > > > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > > > index 4a9bbd3fe120..6496c32a9172 100644
> > > > > --- a/fs/xfs/xfs_notify_failure.c
> > > > > +++ b/fs/xfs/xfs_notify_failure.c
> > > > > @@ -22,6 +22,7 @@
> > > > > #include <linux/mm.h>
> > > > > #include <linux/dax.h>
> > > > > +#include <linux/fs.h>
> > > > > struct xfs_failure_info {
> > > > > xfs_agblock_t startblock;
> > > > > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > > > > struct xfs_mount *mp = cur->bc_mp;
> > > > > struct xfs_inode *ip;
> > > > > struct xfs_failure_info *notify = data;
> > > > > + struct address_space *mapping;
> > > > > + pgoff_t pgoff;
> > > > > + unsigned long pgcnt;
> > > > > int error = 0;
> > > > > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > > > > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > > > > + /* Continue the query because this isn't a failure. */
> > > > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + return 0;
> > > > > notify->want_shutdown = true;
> > > > > return 0;
> > > > > }
> > > > > @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
> > > > > return 0;
> > > > > }
> > > > > - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > > > > - xfs_failure_pgoff(mp, rec, notify),
> > > > > - xfs_failure_pgcnt(mp, rec, notify),
> > > > > - notify->mf_flags);
> > > > > + mapping = VFS_I(ip)->i_mapping;
> > > > > + pgoff = xfs_failure_pgoff(mp, rec, notify);
> > > > > + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > > > > +
> > > > > + /* Continue the rmap query if the inode isn't a dax file. */
> > > > > + if (dax_mapping(mapping))
> > > > > + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > > > > + notify->mf_flags);
> > > > > +
> > > > > + /* Invalidate the cache in dax pages. */
> > > > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + invalidate_inode_pages2_range(mapping, pgoff,
> > > > > + pgoff + pgcnt - 1);
> > > > > +
> > > > > xfs_irele(ip);
> > > > > return error;
> > > > > }
> > > > > +static int
> > > > > +xfs_dax_notify_failure_freeze(
> > > > > + struct xfs_mount *mp)
> > > > > +{
> > > > > + struct super_block *sb = mp->m_super;
> > > > > + int error;
> > > > > +
> > > > > + error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> > > > > + if (error)
> > > > > + xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> > > > > +
> > > > > + return error;
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +xfs_dax_notify_failure_thaw(
> > > > > + struct xfs_mount *mp,
> > > > > + bool kernel_frozen)
> > > > > +{
> > > > > + struct super_block *sb = mp->m_super;
> > > > > + int error;
> > > > > +
> > > > > + if (!kernel_frozen) {
> > > > > + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> > > > > + if (error)
> > > > > + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> > > > > + error);
> > > > > + }
> > > > > +
> > > > > + /*
> > > > > + * Also thaw userspace call anyway because the device is about to be
> > > > > + * removed immediately.
> > > >
> > > > Does a userspace freeze inhibit or otherwise break device removal?
> > >
> > > It doesn't. The device can still be removed. But after that, the mount
> > > point still exists, and `umount /mnt/scratch` fails with "target is
> > > busy." `xfs_freeze -u /mnt/scratch` does not work either.
> >
> > Yes, that's true, but that's long been the case for removing block
> > devices. Should block device removal (since we now have hooks for
> > that!) also be breaking freezes?
>
> I think so. But it may need more time to accomplish. Shall we leave it for
> later optimization?
Yeah, I think patching the block layer is a separate patch.
> >
> > > So, I think thaw_super() anyway here is needed.
> > >
> > >
> > > >
> > > > > + */
> > > > > + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> > > > > +}
> > > > > +
> > > > > static int
> > > > > xfs_dax_notify_ddev_failure(
> > > > > struct xfs_mount *mp,
> > > > > @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
> > > > > struct xfs_btree_cur *cur = NULL;
> > > > > struct xfs_buf *agf_bp = NULL;
> > > > > int error = 0;
> > > > > + bool kernel_frozen = false;
> > > > > xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
> > > > > xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
> > > > > xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
> > > > > daddr + bblen - 1);
> > > > > xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
> > > > > + if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > > > + xfs_info(mp, "Device is about to be removed!");
> > > > > + /* Freeze fs to prevent new mappings from being created. */
> > > > > + error = xfs_dax_notify_failure_freeze(mp);
> > > > > + if (error) {
> > > > > + /* Keep going on if filesystem is frozen by kernel. */
> > > > > + if (error == -EBUSY)
> > > > > + kernel_frozen = true;
> > > >
> > > > EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
> > > > kernel-freezing the fs. Someone else did, and they're expecting that
> > > > thaw_super will undo that.
> > > >
> > > > switch (error) {
> > > > case -EBUSY:
> > > > /* someone else froze the fs, keep going */
> > > > break;
> > > > case 0:
> > > > /* we froze the fs */
> > > > kernel_frozen = true;
> > > > break;
> > > > default:
> > > > /* something else broke, should we continue anyway? */
> > > > return error;
> > > > }
> > > >
> > > > TBH I wonder why all that isn't just:
> > > >
> > > > kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
> > > >
> > > > Since we'd want to keep going even if (say) the pmem was already
> > > > starting to fail and the freeze actually failed due to EIO, right?
> > >
> > > Yes. So we can say it is a *try* to _freeze() here. No matter what its
> > > result is, we continue.
> > >
> > > Then I think the `kernel_frozen` becomes useless as well. Because we should
> > > try to call both _thaw(KERNEL_CALL) and _thaw(USER_CALL) to make sure umount
> > > can work after device is gone.
> >
> > I disagree -- unlike the mess that is userspace freezing, kernel code
> > that obtained a kernel freeze will get very confused and potentially do
> > Seriously Bad Things if the kernel freeze is yanked out from under them.
> > Kernel code is not supposed to release things that they did not
> > themselves obtain.
> >
> > That might not ultimately matter for the narrow case of the device going
> > away, but the two other usecases (online fsck and suspend) will
> > malfunction if you drop a kernel freeze that they obtained.
>
> Could online fsck and suspend keep working even after
> `xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);` has been called?
It's likely to go down with the filesystem, but the point of the kernel
freeze is that the freeze should be brief and undone by the same
function on its way out. Hence PREREMOVE shouldn't be releasing
something that was obtained by another (running) thread, just like any
other resource.
> >
> > I don't mind if PREREMOVE can't get a freeze and keeps going with the
> > invalidations anyway. We did our best, and when the pmem goes away we
> > can just kill -9 down the processes.
>
> Ok, I agree.
>
> Then, one last thing I would like to confirm:
> On my host, if the frozen state isn't thawed after the device is gone,
> the processes keep waiting and cannot be killed manually with `kill -9`.
> Is there another way to get the processes killed?
No, I don't think there is. FWIW I'm ok with you moving on to the
invalidation part if something else has frozen the fs; and I'm also ok
with the unconditional thaw_super(sb, FREEZE_HOLDER_USERSPACE).
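To make the rule discussed above concrete, here is a minimal userspace
sketch, not kernel code. It assumes a toy superblock that tracks at most
one freeze per holder type; every model_* name is hypothetical and only
loosely mirrors freeze_super()/thaw_super() with FREEZE_HOLDER_KERNEL and
FREEZE_HOLDER_USERSPACE. It illustrates that the PRE_REMOVE path thaws the
kernel freeze only if it obtained that freeze itself, while the userspace
freeze is dropped unconditionally.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two freeze holder types. */
enum model_holder {
	MODEL_HOLDER_KERNEL    = 1 << 0,
	MODEL_HOLDER_USERSPACE = 1 << 1,
};

/* Toy superblock: one bit per holder type that currently holds a freeze. */
struct model_sb {
	unsigned int holders;
};

/* Fails with -EBUSY if this holder type already holds a freeze. */
static int model_freeze(struct model_sb *sb, enum model_holder who)
{
	if (sb->holders & who)
		return -EBUSY;
	sb->holders |= who;
	return 0;
}

/* Fails with -EINVAL if this holder type does not hold a freeze. */
static int model_thaw(struct model_sb *sb, enum model_holder who)
{
	if (!(sb->holders & who))
		return -EINVAL;
	sb->holders &= ~(unsigned int)who;
	return 0;
}

/* Try to freeze; report only whether *we* obtained the kernel freeze. */
static bool model_pre_remove_freeze(struct model_sb *sb)
{
	return model_freeze(sb, MODEL_HOLDER_KERNEL) == 0;
}

static void model_pre_remove_thaw(struct model_sb *sb, bool kernel_frozen)
{
	if (kernel_frozen) {
		int error = model_thaw(sb, MODEL_HOLDER_KERNEL);

		/* In this toy model nobody else can drop our freeze. */
		assert(error == 0);
		(void)error;
	}

	/* Drop a userspace freeze regardless; ignore "was not frozen". */
	(void)model_thaw(sb, MODEL_HOLDER_USERSPACE);
}
```

If another kernel user already holds the freeze, model_pre_remove_freeze()
returns false and model_pre_remove_thaw() leaves that freeze alone, which
is the behaviour asked for above for online fsck and suspend.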
--D
>
>
> --
> Thanks,
> Ruan.
>
> >
> > --D
> >
> > > Then, I think it's better to change them:
> > > `static int xfs_dax_notify_failure_freeze()`,
> > > `static void xfs_dax_notify_failure_thaw()`
> > > to
> > > `static void xfs_dax_notify_failure_try_freeze()`,
> > > `static void xfs_dax_notify_failure_try_thaw()`.
> > >
> > >
> > > --
> > > Thanks,
> > > Ruan.
> > >
> > > >
> > > > --D
> > > >
> > > > > + else
> > > > > + return error;
> > > > > + }
> > > > > + }
> > > > > +
> > > > > error = xfs_trans_alloc_empty(mp, &tp);
> > > > > if (error)
> > > > > - return error;
> > > > > + goto out;
> > > > > for (; agno <= end_agno; agno++) {
> > > > > struct xfs_rmap_irec ri_low = { };
> > > > > @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
> > > > > }
> > > > > xfs_trans_cancel(tp);
> > > > > +
> > > > > + /*
> > > > > + * Determine how to shutdown the filesystem according to the
> > > > > + * error code and flags.
> > > > > + */
> > > > > if (error || notify.want_shutdown) {
> > > > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > > > if (!error)
> > > > > error = -EFSCORRUPTED;
> > > > > - }
> > > > > + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > > > > +
> > > > > +out:
> > > > > + /* Thaw the fs if it is frozen before. */
> > > > > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> > > > > +
> > > > > return error;
> > > > > }
> > > > > @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
> > > > > if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > > > > mp->m_logdev_targp != mp->m_ddev_targp) {
> > > > > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > > > > + return 0;
> > > > > xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > > > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > > > return -EFSCORRUPTED;
> > > > > @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
> > > > > ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > > > > ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > > > > + /* Notify failure on the whole device. */
> > > > > + if (offset == 0 && len == U64_MAX) {
> > > > > + offset = ddev_start;
> > > > > + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > > > > + }
> > > > > +
> > > > > /* Ignore the range out of filesystem area */
> > > > > if (offset + len - 1 < ddev_start)
> > > > > return -ENXIO;
> > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > index 799836e84840..944a1165a321 100644
> > > > > --- a/include/linux/mm.h
> > > > > +++ b/include/linux/mm.h
> > > > > @@ -3577,6 +3577,7 @@ enum mf_flags {
> > > > > MF_UNPOISON = 1 << 4,
> > > > > MF_SW_SIMULATED = 1 << 5,
> > > > > MF_NO_RETRY = 1 << 6,
> > > > > + MF_MEM_PRE_REMOVE = 1 << 7,
> > > > > };
> > > > > int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > > > unsigned long count, int mf_flags);
> > > > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > > > index dc5ff7dd4e50..92f18c9e0aaf 100644
> > > > > --- a/mm/memory-failure.c
> > > > > +++ b/mm/memory-failure.c
> > > > > @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> > > > > */
> > > > > static void collect_procs_fsdax(struct page *page,
> > > > > struct address_space *mapping, pgoff_t pgoff,
> > > > > - struct list_head *to_kill)
> > > > > + struct list_head *to_kill, bool pre_remove)
> > > > > {
> > > > > struct vm_area_struct *vma;
> > > > > struct task_struct *tsk;
> > > > > @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
> > > > > i_mmap_lock_read(mapping);
> > > > > read_lock(&tasklist_lock);
> > > > > for_each_process(tsk) {
> > > > > - struct task_struct *t = task_early_kill(tsk, true);
> > > > > + struct task_struct *t = tsk;
> > > > > + /*
> > > > > + * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> > > > > > + * the current task may not be the one accessing the fsdax page.
> > > > > + * Otherwise, search for the current task.
> > > > > + */
> > > > > + if (!pre_remove)
> > > > > + t = task_early_kill(tsk, true);
> > > > > if (!t)
> > > > > continue;
> > > > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > > > @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > > > dax_entry_t cookie;
> > > > > struct page *page;
> > > > > size_t end = index + count;
> > > > > + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > > > > mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > > > @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > > > if (!page)
> > > > > goto unlock;
> > > > > - SetPageHWPoison(page);
> > > > > + if (!pre_remove)
> > > > > + SetPageHWPoison(page);
> > > > > - collect_procs_fsdax(page, mapping, index, &to_kill);
> > > > > + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > > > > unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > > > > index, mf_flags);
> > > > > unlock:
> > > > > --
> > > > > 2.41.0
> > > > >
* [PATCH v14] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
` (4 preceding siblings ...)
2023-08-23 8:17 ` [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE " Shiyang Ruan
@ 2023-08-28 6:57 ` Shiyang Ruan
2023-08-30 15:34 ` Darrick J. Wong
` (2 more replies)
5 siblings, 3 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-08-28 6:57 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof
====
Changes since v13:
1. don't return an error if _freeze(FREEZE_HOLDER_KERNEL) fails with an
error other than -EBUSY
====
Currently, if we suddenly remove a PMEM device (by calling unbind) that
contains FSDAX while programs are still accessing data on this device,
e.g.:
```
$FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
# $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
the system can end up in an unacceptable state:
1. the device is gone but the mount point still exists, and umount fails
with "target is busy"
2. programs hang and cannot be killed
3. the kernel may crash with a NULL pointer dereference
To fix this, introduce a MF_MEM_PRE_REMOVE flag to let the filesystem know
that the whole device is about to be removed, and make sure all related
processes are notified so that they can exit gracefully.
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver can ask the filesystem
on it to unmap all files in use and to notify the processes that are
using those files.
Call trace:
trigger unbind
-> unbind_store()
-> ... (skip)
-> devres_release_all()
-> kill_dax()
-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()
`-> freeze_super() // freeze (kernel call)
`-> do xfs rmap
` -> mf_dax_kill_procs()
` -> collect_procs_fsdax() // all associated processes
` -> unmap_and_kill()
` -> invalidate_inode_pages2_range() // drop file's cache
`-> thaw_super() // thaw (both kernel & user call)
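The call trace passes (0, U64_MAX) as the failure range, which the
filesystem has to turn into the real extent of its data device. A minimal
userspace sketch of that step follows, assuming byte offsets; struct
model_range and the model_* helpers are hypothetical names used only for
illustration, and the inclusive end follows the "end = start + length - 1"
convention from patch 1 of v12.

```c
#include <stdint.h>

/* Hypothetical byte range as handed to the ->notify_failure() callback. */
struct model_range {
	uint64_t offset;
	uint64_t len;
};

/* Inclusive end of a non-empty range: start + length - 1. */
static uint64_t model_range_end(struct model_range r)
{
	return r.offset + r.len - 1;
}

/*
 * (0, UINT64_MAX) means "the whole device": replace it with the actual
 * start and size of the data device. Any other range is left untouched.
 */
static struct model_range model_whole_device(struct model_range r,
					     uint64_t ddev_start,
					     uint64_t ddev_bytes)
{
	if (r.offset == 0 && r.len == UINT64_MAX) {
		r.offset = ddev_start;
		r.len = ddev_bytes;
	}
	return r;
}
```

For example, with a data device that starts at byte 4096 and is 1 MiB
long, the whole-device request becomes offset 4096 and len 1048576, so its
inclusive end is 4096 + 1048576 - 1.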
Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event. Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
new dax mappings from being created. Do not shut down the filesystem
directly if the configuration is not supported, or if the failure range
includes the metadata area. Make sure all files and processes (not only
the current process) are handled correctly. Also drop the cache of
associated files before the pmem is removed.
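The "not only the current process" part can be pictured with a small
userspace model, given here only as a sketch. struct model_task and
model_collect() are hypothetical; the early_kill field stands in for
"task_early_kill() would return a task", and maps_page stands in for the
vma walk that checks whether the task maps the affected page.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of a task for this illustration. */
struct model_task {
	int  pid;
	bool maps_page;   /* task has the affected fsdax page mapped */
	bool early_kill;  /* task would pass the early-kill filter */
};

/*
 * Collect the pids to notify. Without pre_remove only early-kill tasks
 * are considered; with pre_remove every task mapping the page is taken.
 * pids_out must have room for ntasks entries. Returns the number stored.
 */
static size_t model_collect(const struct model_task *tasks, size_t ntasks,
			    bool pre_remove, int *pids_out)
{
	size_t nr = 0;

	for (size_t i = 0; i < ntasks; i++) {
		if (!pre_remove && !tasks[i].early_kill)
			continue;
		if (!tasks[i].maps_page)
			continue;
		pids_out[nr++] = tasks[i].pid;
	}
	return nr;
}
```

In this model a task that maps the page but is not an early-kill candidate
is skipped for an ordinary memory failure and picked up once pre_remove is
set.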
[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
drivers/dax/super.c | 3 +-
fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
include/linux/mm.h | 1 +
mm/memory-failure.c | 17 +++++--
4 files changed, 109 insertions(+), 11 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..f4b635526345 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
if (dax_dev->holder_data != NULL)
- dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+ dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+ MF_MEM_PRE_REMOVE);
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 4a9bbd3fe120..79586abc75bf 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/dax.h>
+#include <linux/fs.h>
struct xfs_failure_info {
xfs_agblock_t startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
struct xfs_mount *mp = cur->bc_mp;
struct xfs_inode *ip;
struct xfs_failure_info *notify = data;
+ struct address_space *mapping;
+ pgoff_t pgoff;
+ unsigned long pgcnt;
int error = 0;
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+ /* Continue the query because this isn't a failure. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
notify->want_shutdown = true;
return 0;
}
@@ -92,14 +99,60 @@ xfs_dax_failure_fn(
return 0;
}
- error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
- xfs_failure_pgoff(mp, rec, notify),
- xfs_failure_pgcnt(mp, rec, notify),
- notify->mf_flags);
+ mapping = VFS_I(ip)->i_mapping;
+ pgoff = xfs_failure_pgoff(mp, rec, notify);
+ pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+ /* Continue the rmap query if the inode isn't a dax file. */
+ if (dax_mapping(mapping))
+ error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+ notify->mf_flags);
+
+ /* Invalidate the cache in dax pages. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ invalidate_inode_pages2_range(mapping, pgoff,
+ pgoff + pgcnt - 1);
+
xfs_irele(ip);
return error;
}
+static int
+xfs_dax_notify_failure_freeze(
+ struct xfs_mount *mp)
+{
+ struct super_block *sb = mp->m_super;
+ int error;
+
+ error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+ if (error)
+ xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+ return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+ struct xfs_mount *mp,
+ bool kernel_frozen)
+{
+ struct super_block *sb = mp->m_super;
+ int error;
+
+ if (kernel_frozen) {
+ error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+ if (error)
+ xfs_emerg(mp, "still frozen after notify failure, err=%d",
+ error);
+ }
+
+ /*
+ * Also thaw userspace call anyway because the device is about to be
+ * removed immediately.
+ */
+ thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
static int
xfs_dax_notify_ddev_failure(
struct xfs_mount *mp,
@@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
struct xfs_btree_cur *cur = NULL;
struct xfs_buf *agf_bp = NULL;
int error = 0;
+ bool kernel_frozen = false;
xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
daddr + bblen - 1);
xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
+ if (mf_flags & MF_MEM_PRE_REMOVE) {
+ xfs_info(mp, "Device is about to be removed!");
+ /*
+ * Freeze fs to prevent new mappings from being created.
+ * - Keep going if others already hold the kernel freeze.
+ * - Keep going on other errors too, because this device is
+ * starting to fail.
+ * - If the kernel freeze is obtained successfully here, thaw it
+ * here as well at the end.
+ */
+ kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
+ }
+
error = xfs_trans_alloc_empty(mp, &tp);
if (error)
- return error;
+ goto out;
for (; agno <= end_agno; agno++) {
struct xfs_rmap_irec ri_low = { };
@@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
}
xfs_trans_cancel(tp);
+
+ /*
+ * Determine how to shutdown the filesystem according to the
+ * error code and flags.
+ */
if (error || notify.want_shutdown) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
if (!error)
error = -EFSCORRUPTED;
- }
+ } else if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
+out:
+ /* Thaw the fs if it is frozen before. */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
return error;
}
@@ -197,6 +276,8 @@ xfs_dax_notify_failure(
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -210,6 +291,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
+ /* Notify failure on the whole device. */
+ if (offset == 0 && len == U64_MAX) {
+ offset = ddev_start;
+ len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+ }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..a10c75bebd6d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3665,6 +3665,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+ MF_MEM_PRE_REMOVE = 1 << 7,
};
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e245191e6b04..e71616ccc643 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
*/
static void collect_procs_fsdax(struct page *page,
struct address_space *mapping, pgoff_t pgoff,
- struct list_head *to_kill)
+ struct list_head *to_kill, bool pre_remove)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
i_mmap_lock_read(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
- struct task_struct *t = task_early_kill(tsk, true);
+ struct task_struct *t = tsk;
+ /*
+ * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+ * the current task may not be the one accessing the fsdax page.
+ * Otherwise, search for the current task.
+ */
+ if (!pre_remove)
+ t = task_early_kill(tsk, true);
if (!t)
continue;
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
dax_entry_t cookie;
struct page *page;
size_t end = index + count;
+ bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
@@ -1799,9 +1807,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
if (!page)
goto unlock;
- SetPageHWPoison(page);
+ if (!pre_remove)
+ SetPageHWPoison(page);
- collect_procs_fsdax(page, mapping, index, &to_kill);
+ collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
index, mf_flags);
unlock:
--
2.41.0
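As a reading aid for the shutdown handling in the patch above, here is a
minimal userspace sketch of the two decisions it makes, under the
assumption that they can be modelled as pure functions. All MODEL_* and
model_* names are hypothetical; only the bit position of the flag is taken
from the mm.h hunk.

```c
#include <stdbool.h>

/* Same bit position as MF_MEM_PRE_REMOVE in the mm.h hunk. */
#define MODEL_MF_MEM_PRE_REMOVE (1 << 7)

/* Hypothetical names for "no shutdown" and the two shutdown reasons. */
enum model_shutdown {
	MODEL_NO_SHUTDOWN,
	MODEL_SHUTDOWN_CORRUPT_ONDISK,
	MODEL_SHUTDOWN_FORCE_UMOUNT,
};

/*
 * rmap callback side: a record not owned by file data (metadata, attr
 * fork, bmbt block) requests a shutdown for a real memory failure, but
 * not for a pre-remove event, where nothing is actually corrupt yet.
 */
static bool model_wants_shutdown(bool non_file_data_owner, int mf_flags)
{
	if (!non_file_data_owner)
		return false;
	return !(mf_flags & MODEL_MF_MEM_PRE_REMOVE);
}

/* After the rmap walk: choose how to shut the filesystem down, if at all. */
static enum model_shutdown model_pick_shutdown(int error, bool want_shutdown,
					       int mf_flags)
{
	if (error || want_shutdown)
		return MODEL_SHUTDOWN_CORRUPT_ONDISK;
	if (mf_flags & MODEL_MF_MEM_PRE_REMOVE)
		return MODEL_SHUTDOWN_FORCE_UMOUNT;
	return MODEL_NO_SHUTDOWN;
}
```

Note that an error during the walk still takes the corruption path even
when the pre-remove flag is set, matching the order of the checks in the
patch.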
* Re: [PATCH v14] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-08-28 6:57 ` [PATCH v14] " Shiyang Ruan
@ 2023-08-30 15:34 ` Darrick J. Wong
2023-09-27 8:17 ` Dan Williams
2023-09-28 10:32 ` [PATCH v15] " Shiyang Ruan
2 siblings, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2023-08-30 15:34 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, mcgrof
On Mon, Aug 28, 2023 at 02:57:44PM +0800, Shiyang Ruan wrote:
> ====
> Changes since v13:
> 1. don't return an error if _freeze(FREEZE_HOLDER_KERNEL) fails with an
> error other than -EBUSY
> ====
>
> Currently, if we suddenly remove a PMEM device (by calling unbind) that
> contains FSDAX while programs are still accessing data on this device,
> e.g.:
> ```
> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> the system can end up in an unacceptable state:
> 1. the device is gone but the mount point still exists, and umount fails
> with "target is busy"
> 2. programs hang and cannot be killed
> 3. the kernel may crash with a NULL pointer dereference
>
> To fix this, introduce a MF_MEM_PRE_REMOVE flag to let the filesystem know
> that the whole device is about to be removed, and make sure all related
> processes are notified so that they can exit gracefully.
>
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of the dax_holder and
> ->notify_failure() mechanism, the pmem driver can ask the filesystem
> on it to unmap all files in use and to notify the processes that are
> using those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super() // freeze (kernel call)
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated processes
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2_range() // drop file's cache
> `-> thaw_super() // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
> event. Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
> new dax mappings from being created. Do not shut down the filesystem
> directly if the configuration is not supported, or if the failure range
> includes the metadata area. Make sure all files and processes (not only
> the current process) are handled correctly. Also drop the cache of
> associated files before the pmem is removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Looks good, now who wants to take this patch?
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 +++++--
> 4 files changed, 109 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0da9232ea175..f4b635526345 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
> return;
>
> if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>
> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..79586abc75bf 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>
> #include <linux/mm.h>
> #include <linux/dax.h>
> +#include <linux/fs.h>
>
> struct xfs_failure_info {
> xfs_agblock_t startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> struct xfs_mount *mp = cur->bc_mp;
> struct xfs_inode *ip;
> struct xfs_failure_info *notify = data;
> + struct address_space *mapping;
> + pgoff_t pgoff;
> + unsigned long pgcnt;
> int error = 0;
>
> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Continue the query because this isn't a failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> notify->want_shutdown = true;
> return 0;
> }
> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
> return 0;
> }
>
> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> - xfs_failure_pgoff(mp, rec, notify),
> - xfs_failure_pgcnt(mp, rec, notify),
> - notify->mf_flags);
> + mapping = VFS_I(ip)->i_mapping;
> + pgoff = xfs_failure_pgoff(mp, rec, notify);
> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> + /* Continue the rmap query if the inode isn't a dax file. */
> + if (dax_mapping(mapping))
> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> + notify->mf_flags);
> +
> + /* Invalidate the cache in dax pages. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + invalidate_inode_pages2_range(mapping, pgoff,
> + pgoff + pgcnt - 1);
> +
> xfs_irele(ip);
> return error;
> }
>
> +static int
> +xfs_dax_notify_failure_freeze(
> + struct xfs_mount *mp)
> +{
> + struct super_block *sb = mp->m_super;
> + int error;
> +
> + error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> + if (error)
> + xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> +
> + return error;
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> + struct xfs_mount *mp,
> + bool kernel_frozen)
> +{
> + struct super_block *sb = mp->m_super;
> + int error;
> +
> + if (kernel_frozen) {
> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> + if (error)
> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> + error);
> + }
> +
> + /*
> + * Also thaw userspace call anyway because the device is about to be
> + * removed immediately.
> + */
> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> +}
> +
> static int
> xfs_dax_notify_ddev_failure(
> struct xfs_mount *mp,
> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
> struct xfs_btree_cur *cur = NULL;
> struct xfs_buf *agf_bp = NULL;
> int error = 0;
> + bool kernel_frozen = false;
> xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
> xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
> xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
> daddr + bblen - 1);
> xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "Device is about to be removed!");
> + /*
> + * Freeze fs to prevent new mappings from being created.
> + * - Keep going on if others already hold the kernel forzen.
> + * - Keep going on if other errors too because this device is
> + * starting to fail.
> + * - If kernel frozen state is hold successfully here, thaw it
> + * here as well at the end.
> + */
> + kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
> + }
> +
> error = xfs_trans_alloc_empty(mp, &tp);
> if (error)
> - return error;
> + goto out;
>
> for (; agno <= end_agno; agno++) {
> struct xfs_rmap_irec ri_low = { };
> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
> }
>
> xfs_trans_cancel(tp);
> +
> + /*
> + * Determine how to shutdown the filesystem according to the
> + * error code and flags.
> + */
> if (error || notify.want_shutdown) {
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> if (!error)
> error = -EFSCORRUPTED;
> - }
> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> + /* Thaw the fs if it is frozen before. */
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> +
> return error;
> }
>
> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>
> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> return -EFSCORRUPTED;
> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>
> + /* Notify failure on the whole device. */
> + if (offset == 0 && len == U64_MAX) {
> + offset = ddev_start;
> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> + }
> +
> /* Ignore the range out of filesystem area */
> if (offset + len - 1 < ddev_start)
> return -ENXIO;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..a10c75bebd6d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3665,6 +3665,7 @@ enum mf_flags {
> MF_UNPOISON = 1 << 4,
> MF_SW_SIMULATED = 1 << 5,
> MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
> };
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index e245191e6b04..e71616ccc643 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> */
> static void collect_procs_fsdax(struct page *page,
> struct address_space *mapping, pgoff_t pgoff,
> - struct list_head *to_kill)
> + struct list_head *to_kill, bool pre_remove)
> {
> struct vm_area_struct *vma;
> struct task_struct *tsk;
> @@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
> i_mmap_lock_read(mapping);
> read_lock(&tasklist_lock);
> for_each_process(tsk) {
> - struct task_struct *t = task_early_kill(tsk, true);
> + struct task_struct *t = tsk;
>
> + /*
> + * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> + * the current may not be the one accessing the fsdax page.
> + * Otherwise, search for the current task.
> + */
> + if (!pre_remove)
> + t = task_early_kill(tsk, true);
> if (!t)
> continue;
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> dax_entry_t cookie;
> struct page *page;
> size_t end = index + count;
> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>
> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>
> @@ -1799,9 +1807,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> if (!page)
> goto unlock;
>
> - SetPageHWPoison(page);
> + if (!pre_remove)
> + SetPageHWPoison(page);
>
> - collect_procs_fsdax(page, mapping, index, &to_kill);
> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> index, mf_flags);
> unlock:
> --
> 2.41.0
>
* RE: [PATCH v14] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-08-28 6:57 ` [PATCH v14] " Shiyang Ruan
2023-08-30 15:34 ` Darrick J. Wong
@ 2023-09-27 8:17 ` Dan Williams
2023-09-27 9:18 ` Shiyang Ruan
2023-09-28 10:32 ` [PATCH v15] " Shiyang Ruan
2 siblings, 1 reply; 37+ messages in thread
From: Dan Williams @ 2023-09-27 8:17 UTC (permalink / raw)
To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof
Shiyang Ruan wrote:
> ====
> Changes since v13:
> 1. don't return error if _freeze(FREEZE_HOLDER_KERNEL) got other error
> ====
>
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
> 1. device has gone but mount point still exists, and umount will fail
> with "target is busy"
> 2. programs will hang and cannot be killed
> 3. may crash with NULL pointer dereference
Thanks, this addresses my main concern: this new capability is needed
because otherwise DAX regresses the survivability of the kernel, compared
to a non-DAX capable block device, when a device is removed from
underneath the mounted filesystem.
>
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
>
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super() // freeze (kernel call)
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated processes
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2_range() // drop file's cache
> `-> thaw_super() // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created. Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area. Make sure all files and processes(not only the current progress)
> are handled correctly. Also drop the cache of associated files before
> pmem is removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
I only have some questions and comment suggestions below, but otherwise
consider this:
Acked-by: Dan Williams <dan.j.williams@intel.com>
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 +++++--
> 4 files changed, 109 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0da9232ea175..f4b635526345 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
> return;
>
> if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>
> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 4a9bbd3fe120..79586abc75bf 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>
> #include <linux/mm.h>
> #include <linux/dax.h>
> +#include <linux/fs.h>
>
> struct xfs_failure_info {
> xfs_agblock_t startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> struct xfs_mount *mp = cur->bc_mp;
> struct xfs_inode *ip;
> struct xfs_failure_info *notify = data;
> + struct address_space *mapping;
> + pgoff_t pgoff;
> + unsigned long pgcnt;
> int error = 0;
>
> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Continue the query because this isn't a failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> notify->want_shutdown = true;
> return 0;
> }
> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
> return 0;
> }
>
> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> - xfs_failure_pgoff(mp, rec, notify),
> - xfs_failure_pgcnt(mp, rec, notify),
> - notify->mf_flags);
> + mapping = VFS_I(ip)->i_mapping;
> + pgoff = xfs_failure_pgoff(mp, rec, notify);
> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> + /* Continue the rmap query if the inode isn't a dax file. */
> + if (dax_mapping(mapping))
> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> + notify->mf_flags);
> +
> + /* Invalidate the cache in dax pages. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + invalidate_inode_pages2_range(mapping, pgoff,
> + pgoff + pgcnt - 1);
> +
> xfs_irele(ip);
> return error;
> }
>
> +static int
> +xfs_dax_notify_failure_freeze(
> + struct xfs_mount *mp)
> +{
> + struct super_block *sb = mp->m_super;
> + int error;
> +
> + error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> + if (error)
> + xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> +
> + return error;
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> + struct xfs_mount *mp,
> + bool kernel_frozen)
> +{
> + struct super_block *sb = mp->m_super;
> + int error;
> +
> + if (kernel_frozen) {
> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> + if (error)
> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
> + error);
> + }
> +
> + /*
> + * Also thaw userspace call anyway because the device is about to be
> + * removed immediately.
> + */
> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
I don't understand why this is not paired with a freeze in
xfs_dax_notify_failure_freeze()?
> +}
> +
> static int
> xfs_dax_notify_ddev_failure(
> struct xfs_mount *mp,
> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
> struct xfs_btree_cur *cur = NULL;
> struct xfs_buf *agf_bp = NULL;
> int error = 0;
> + bool kernel_frozen = false;
> xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
> xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
> xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
> daddr + bblen - 1);
> xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "Device is about to be removed!");
> + /*
> + * Freeze fs to prevent new mappings from being created.
> + * - Keep going on if others already hold the kernel forzen.
> + * - Keep going on if other errors too because this device is
> + * starting to fail.
> + * - If kernel frozen state is hold successfully here, thaw it
> + * here as well at the end.
> + */
> + kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
> + }
> +
> error = xfs_trans_alloc_empty(mp, &tp);
> if (error)
> - return error;
> + goto out;
>
> for (; agno <= end_agno; agno++) {
> struct xfs_rmap_irec ri_low = { };
> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
> }
>
> xfs_trans_cancel(tp);
> +
> + /*
> + * Determine how to shutdown the filesystem according to the
> + * error code and flags.
> + */
This comment is not adding any value. It would be better if it clarified
why want_shutdown will be false in the pre-remove case.
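
To make the point concrete, here is a minimal userspace sketch of the logic in
the hunks quoted in this mail. It is not the XFS code: every `model_*` and
`MODEL_*` name is hypothetical, and only the control flow is taken from the
patch. It shows that the rmap callback returns before setting want_shutdown
when MF_MEM_PRE_REMOVE is set, so the pre-remove case falls through to the
forced-unmount shutdown unless the rmap query itself returned an error.

```c
#include <assert.h>
#include <stdbool.h>

/* Same bit as MF_MEM_PRE_REMOVE in the enum mf_flags hunk of this patch. */
#define MODEL_MF_MEM_PRE_REMOVE	(1 << 7)

/* Hypothetical stand-ins for the SHUTDOWN_* reasons used in the patch. */
enum model_shutdown {
	MODEL_NO_SHUTDOWN,
	MODEL_SHUTDOWN_CORRUPT_ONDISK,
	MODEL_SHUTDOWN_FORCE_UMOUNT,
};

/*
 * Models the metadata branch of the rmap callback: a record owned by
 * metadata only requests a shutdown for a real failure, not for pre-remove.
 */
static void model_metadata_record(bool *want_shutdown, int mf_flags)
{
	assert(want_shutdown);

	if (mf_flags & MODEL_MF_MEM_PRE_REMOVE)
		return;		/* keep scanning, this is not a failure */
	*want_shutdown = true;
}

/* Models the decision made after the rmap query has finished. */
static enum model_shutdown model_pick_shutdown(int error, bool want_shutdown,
					       int mf_flags)
{
	if (error || want_shutdown)
		return MODEL_SHUTDOWN_CORRUPT_ONDISK;
	if (mf_flags & MODEL_MF_MEM_PRE_REMOVE)
		return MODEL_SHUTDOWN_FORCE_UMOUNT;
	return MODEL_NO_SHUTDOWN;
}
```

In this model, a whole-device pre-remove notification that walks over metadata
records still ends in MODEL_SHUTDOWN_FORCE_UMOUNT, while the same records seen
with mf_flags == 0 end in MODEL_SHUTDOWN_CORRUPT_ONDISK.
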
> if (error || notify.want_shutdown) {
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> if (!error)
> error = -EFSCORRUPTED;
> - }
> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> +out:
> + /* Thaw the fs if it is frozen before. */
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_dax_notify_failure_thaw(mp, kernel_frozen);
> +
> return error;
> }
>
> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>
> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> mp->m_logdev_targp != mp->m_ddev_targp) {
Maybe a comment:
/*
* In the pre-remove case the failure notification is attempting to
* trigger a force unmount, the expectation is that the device is still
* present, but its removal is in progress and can not be cancelled,
* proceed with accessing the log device.
*/
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> return -EFSCORRUPTED;
> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>
> + /* Notify failure on the whole device. */
> + if (offset == 0 && len == U64_MAX) {
> + offset = ddev_start;
> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> + }
> +
> /* Ignore the range out of filesystem area */
> if (offset + len - 1 < ddev_start)
> return -ENXIO;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..a10c75bebd6d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3665,6 +3665,7 @@ enum mf_flags {
> MF_UNPOISON = 1 << 4,
> MF_SW_SIMULATED = 1 << 5,
> MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
> };
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index e245191e6b04..e71616ccc643 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
> */
> static void collect_procs_fsdax(struct page *page,
> struct address_space *mapping, pgoff_t pgoff,
> - struct list_head *to_kill)
> + struct list_head *to_kill, bool pre_remove)
> {
> struct vm_area_struct *vma;
> struct task_struct *tsk;
> @@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
> i_mmap_lock_read(mapping);
> read_lock(&tasklist_lock);
> for_each_process(tsk) {
> - struct task_struct *t = task_early_kill(tsk, true);
> + struct task_struct *t = tsk;
>
> + /*
> + * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
> + * the current may not be the one accessing the fsdax page.
> + * Otherwise, search for the current task.
> + */
> + if (!pre_remove)
> + t = task_early_kill(tsk, true);
> if (!t)
> continue;
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> dax_entry_t cookie;
> struct page *page;
> size_t end = index + count;
> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>
> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>
> @@ -1799,9 +1807,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> if (!page)
> goto unlock;
>
> - SetPageHWPoison(page);
> + if (!pre_remove)
> + SetPageHWPoison(page);
This probably wants a comment like:
/*
* The pre_remove case is revoking access, the memory is still good and
* could theoretically be put back into service
*/
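
As a rough illustration of the two behaviour changes in mm/memory-failure.c,
here is a small userspace model. It is a sketch under stated assumptions, not
the kernel code: `struct model_task`, `struct model_page` and
`model_kill_procs()` are hypothetical, and the `early_kill` field merely stands
in for "task_early_kill() returned a task". In the pre-remove case every task
mapping the page is collected and the page is not marked poisoned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task record for the model. */
struct model_task {
	int	pid;
	bool	early_kill;	/* stands in for task_early_kill() != NULL */
	bool	maps_page;	/* has a VMA covering the affected offset */
};

/* Hypothetical page record; hwpoison models SetPageHWPoison(). */
struct model_page {
	bool	hwpoison;
};

/* Collect the pids to notify; returns how many were written to pids_out. */
static size_t model_kill_procs(struct model_page *page,
			       const struct model_task *tasks, size_t ntasks,
			       bool pre_remove, int *pids_out)
{
	size_t i, n = 0;

	assert(page && tasks && pids_out);

	/* Pre-remove only revokes access; the memory itself is still good. */
	if (!pre_remove)
		page->hwpoison = true;

	for (i = 0; i < ntasks; i++) {
		/* Pre-remove considers every task, not only early-kill ones. */
		if (!pre_remove && !tasks[i].early_kill)
			continue;
		if (tasks[i].maps_page)
			pids_out[n++] = tasks[i].pid;
	}
	return n;
}
```
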
>
> - collect_procs_fsdax(page, mapping, index, &to_kill);
> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> index, mf_flags);
> unlock:
> --
> 2.41.0
>
* Re: [PATCH v14] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-09-27 8:17 ` Dan Williams
@ 2023-09-27 9:18 ` Shiyang Ruan
0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-09-27 9:18 UTC (permalink / raw)
To: Dan Williams, linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: Chandan Babu R, djwong, Andrew Morton, willy, jack, akpm, mcgrof
On 2023/9/27 16:17, Dan Williams wrote:
> Shiyang Ruan wrote:
>> ====
>> Changes since v13:
>> 1. don't return error if _freeze(FREEZE_HOLDER_KERNEL) got other error
>> ====
>>
>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>> contains FSDAX while programs are still accessing data in this device,
>> e.g.:
>> ```
>> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>> ```
>> it could come into an unacceptable state:
>> 1. device has gone but mount point still exists, and umount will fail
>> with "target is busy"
>> 2. programs will hang and cannot be killed
>> 3. may crash with NULL pointer dereference
>
> Thanks, this addresses my main concern that this new capability is needed
> otherwise DAX regresses the survivability of the kernel when removing a
> device from underneath the mounted filesystem compared to removing a
> non-DAX capable block device.
>
>>
>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>> are going to remove the whole device, and make sure all related processes
>> could be notified so that they could end up gracefully.
>>
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> on it to unmap all files in use, and notify processes who are using
>> those files.
>>
>> Call trace:
>> trigger unbind
>> -> unbind_store()
>> -> ... (skip)
>> -> devres_release_all()
>> -> kill_dax()
>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>> -> xfs_dax_notify_failure()
>> `-> freeze_super() // freeze (kernel call)
>> `-> do xfs rmap
>> ` -> mf_dax_kill_procs()
>> ` -> collect_procs_fsdax() // all associated processes
>> ` -> unmap_and_kill()
>> ` -> invalidate_inode_pages2_range() // drop file's cache
>> `-> thaw_super() // thaw (both kernel & user call)
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> new dax mapping from being created. Do not shutdown filesystem directly
>> if configuration is not supported, or if failure range includes metadata
>> area. Make sure all files and processes(not only the current progress)
>> are handled correctly. Also drop the cache of associated files before
>> pmem is removed.
>>
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>
> I only have some questions and comment suggestions below, but otherwise
> consider this:
>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
>
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>> drivers/dax/super.c | 3 +-
>> fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>> include/linux/mm.h | 1 +
>> mm/memory-failure.c | 17 +++++--
>> 4 files changed, 109 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 0da9232ea175..f4b635526345 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
>> return;
>>
>> if (dax_dev->holder_data != NULL)
>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> + MF_MEM_PRE_REMOVE);
>>
>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>> synchronize_srcu(&dax_srcu);
>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>> index 4a9bbd3fe120..79586abc75bf 100644
>> --- a/fs/xfs/xfs_notify_failure.c
>> +++ b/fs/xfs/xfs_notify_failure.c
>> @@ -22,6 +22,7 @@
>>
>> #include <linux/mm.h>
>> #include <linux/dax.h>
>> +#include <linux/fs.h>
>>
>> struct xfs_failure_info {
>> xfs_agblock_t startblock;
>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>> struct xfs_mount *mp = cur->bc_mp;
>> struct xfs_inode *ip;
>> struct xfs_failure_info *notify = data;
>> + struct address_space *mapping;
>> + pgoff_t pgoff;
>> + unsigned long pgcnt;
>> int error = 0;
>>
>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>> + /* Continue the query because this isn't a failure. */
>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> + return 0;
>> notify->want_shutdown = true;
>> return 0;
>> }
>> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>> return 0;
>> }
>>
>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>> - xfs_failure_pgoff(mp, rec, notify),
>> - xfs_failure_pgcnt(mp, rec, notify),
>> - notify->mf_flags);
>> + mapping = VFS_I(ip)->i_mapping;
>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>> +
>> + /* Continue the rmap query if the inode isn't a dax file. */
>> + if (dax_mapping(mapping))
>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>> + notify->mf_flags);
>> +
>> + /* Invalidate the cache in dax pages. */
>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> + invalidate_inode_pages2_range(mapping, pgoff,
>> + pgoff + pgcnt - 1);
>> +
>> xfs_irele(ip);
>> return error;
>> }
>>
>> +static int
>> +xfs_dax_notify_failure_freeze(
>> + struct xfs_mount *mp)
>> +{
>> + struct super_block *sb = mp->m_super;
>> + int error;
>> +
>> + error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>> + if (error)
>> + xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>> +
>> + return error;
>> +}
>> +
>> +static void
>> +xfs_dax_notify_failure_thaw(
>> + struct xfs_mount *mp,
>> + bool kernel_frozen)
>> +{
>> + struct super_block *sb = mp->m_super;
>> + int error;
>> +
>> + if (kernel_frozen) {
>> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>> + if (error)
>> + xfs_emerg(mp, "still frozen after notify failure, err=%d",
>> + error);
>> + }
>> +
>> + /*
>> + * Also thaw userspace call anyway because the device is about to be
>> + * removed immediately.
>> + */
>> + thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>
> I don't understand why this is not paired with a freeze in
> xfs_dax_notify_failure_freeze()?
What we want to do is freeze the filesystem, so freeze_super(sb,
FREEZE_HOLDER_KERNEL) alone is actually enough. The extra
thaw_super(sb, FREEZE_HOLDER_USERSPACE) is there to make sure the mount
point can still be unmounted after unbind when some userspace program is
holding the freeze state of this filesystem. Otherwise, after unbind,
the mount point still exists and `umount /mnt/scratch` fails with
"target is busy". `xfs_freeze -u /mnt/scratch` doesn't work either.
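
To illustrate the holder accounting behind this, here is a minimal userspace
sketch. It is a simplified model, not the VFS implementation: `struct
sb_model`, `model_freeze()`, `model_thaw()` and the `MODEL_HOLDER_*` values are
hypothetical stand-ins for the exclusive freeze/thaw API, and the error codes
are assumptions of the model. The point is that dropping only the kernel
holder leaves the filesystem frozen while a userspace holder remains.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for FREEZE_HOLDER_KERNEL / FREEZE_HOLDER_USERSPACE. */
enum model_holder {
	MODEL_HOLDER_KERNEL	= 1 << 0,
	MODEL_HOLDER_USERSPACE	= 1 << 1,
};

/* Toy superblock: frozen as long as at least one holder bit is set. */
struct sb_model {
	unsigned int holders;
};

static bool model_is_frozen(const struct sb_model *sb)
{
	assert(sb);
	return sb->holders != 0;
}

/* Each holder may freeze once; a second freeze by the same holder fails. */
static int model_freeze(struct sb_model *sb, enum model_holder who)
{
	assert(sb);
	if (sb->holders & who)
		return -EBUSY;
	sb->holders |= who;
	return 0;
}

/* Dropping a holder thaws the fs only when no other holder remains. */
static int model_thaw(struct sb_model *sb, enum model_holder who)
{
	assert(sb);
	if (!(sb->holders & who))
		return -EINVAL;
	sb->holders &= ~who;
	return 0;
}
```

In this model, if userspace froze the filesystem before the unbind, thawing
MODEL_HOLDER_KERNEL alone leaves model_is_frozen() true; only after
MODEL_HOLDER_USERSPACE is dropped as well does the filesystem thaw.
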
>
>> +}
>> +
>> static int
>> xfs_dax_notify_ddev_failure(
>> struct xfs_mount *mp,
>> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>> struct xfs_btree_cur *cur = NULL;
>> struct xfs_buf *agf_bp = NULL;
>> int error = 0;
>> + bool kernel_frozen = false;
>> xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>> xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
>> xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
>> daddr + bblen - 1);
>> xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>>
>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>> + xfs_info(mp, "Device is about to be removed!");
>> + /*
>> + * Freeze fs to prevent new mappings from being created.
>> + * - Keep going on if others already hold the kernel forzen.
>> + * - Keep going on if other errors too because this device is
>> + * starting to fail.
>> + * - If kernel frozen state is hold successfully here, thaw it
>> + * here as well at the end.
>> + */
>> + kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
>> + }
>> +
>> error = xfs_trans_alloc_empty(mp, &tp);
>> if (error)
>> - return error;
>> + goto out;
>>
>> for (; agno <= end_agno; agno++) {
>> struct xfs_rmap_irec ri_low = { };
>> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>> }
>>
>> xfs_trans_cancel(tp);
>> +
>> + /*
>> + * Determine how to shutdown the filesystem according to the
>> + * error code and flags.
>> + */
>
> This comment is not adding any value. It would be better if it clarified
> why want_shutdown will be false in the pre-remove case?
>
>> if (error || notify.want_shutdown) {
>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>> if (!error)
>> error = -EFSCORRUPTED;
>> - }
>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>> +
>> +out:
>> + /* Thaw the fs if it is frozen before. */
>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>> + xfs_dax_notify_failure_thaw(mp, kernel_frozen);
>> +
>> return error;
>> }
>>
>> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>>
>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>> mp->m_logdev_targp != mp->m_ddev_targp) {
>
> Maybe a comment:
>
> /*
> * In the pre-remove case the failure notification is attempting to
> * trigger a force unmount, the expectation is that the device is still
> * present, but its removal is in progress and can not be cancelled,
> * proceed with accessing the log device.
> */
>
>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>> + return 0;
>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>> return -EFSCORRUPTED;
>> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>
>> + /* Notify failure on the whole device. */
>> + if (offset == 0 && len == U64_MAX) {
>> + offset = ddev_start;
>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>> + }
>> +
>> /* Ignore the range out of filesystem area */
>> if (offset + len - 1 < ddev_start)
>> return -ENXIO;
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 2dd73e4f3d8e..a10c75bebd6d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3665,6 +3665,7 @@ enum mf_flags {
>> MF_UNPOISON = 1 << 4,
>> MF_SW_SIMULATED = 1 << 5,
>> MF_NO_RETRY = 1 << 6,
>> + MF_MEM_PRE_REMOVE = 1 << 7,
>> };
>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> unsigned long count, int mf_flags);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index e245191e6b04..e71616ccc643 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>> */
>> static void collect_procs_fsdax(struct page *page,
>> struct address_space *mapping, pgoff_t pgoff,
>> - struct list_head *to_kill)
>> + struct list_head *to_kill, bool pre_remove)
>> {
>> struct vm_area_struct *vma;
>> struct task_struct *tsk;
>> @@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
>> i_mmap_lock_read(mapping);
>> read_lock(&tasklist_lock);
>> for_each_process(tsk) {
>> - struct task_struct *t = task_early_kill(tsk, true);
>> + struct task_struct *t = tsk;
>>
>> + /*
>> + * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
>> + * the current may not be the one accessing the fsdax page.
>> + * Otherwise, search for the current task.
>> + */
>> + if (!pre_remove)
>> + t = task_early_kill(tsk, true);
>> if (!t)
>> continue;
>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> @@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> dax_entry_t cookie;
>> struct page *page;
>> size_t end = index + count;
>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>
>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>
>> @@ -1799,9 +1807,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> if (!page)
>> goto unlock;
>>
>> - SetPageHWPoison(page);
>> + if (!pre_remove)
>> + SetPageHWPoison(page);
>
> This probably wants a comment like:
>
> /*
> * The pre_remove case is revoking access, the memory is still good and
> * could theoretically be put back into service
> */
>
>>
>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>> index, mf_flags);
>> unlock:
I'll add/modify these comments as you suggested. Thanks!
--
Ruan.
>> --
>> 2.41.0
>>
* [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-08-28 6:57 ` [PATCH v14] " Shiyang Ruan
2023-08-30 15:34 ` Darrick J. Wong
2023-09-27 8:17 ` Dan Williams
@ 2023-09-28 10:32 ` Shiyang Ruan
2023-09-29 18:31 ` Dan Williams
` (3 more replies)
2 siblings, 4 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-09-28 10:32 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof, chandanbabu
====
Changes since v14:
1. added/fixed code comments per Dan's comments
====
Currently, if we suddenly remove a PMEM device (by calling unbind) that
contains an FSDAX filesystem while programs are still accessing data on
this device, e.g.:
```
$FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
# $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
the system can end up in an unacceptable state:
1. the device is gone but the mount point still exists, and umount fails
 with "target is busy"
2. programs hang and cannot be killed
3. the kernel may crash with a NULL pointer dereference
To fix this, introduce an MF_MEM_PRE_REMOVE flag to let the filesystem
know that the whole device is about to be removed, and make sure all
related processes are notified so that they can exit gracefully.
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the
filesystem on it to unmap all files in use, and to notify the processes
that are using those files.
Call trace:
trigger unbind
-> unbind_store()
-> ... (skip)
-> devres_release_all()
-> kill_dax()
-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()
`-> freeze_super() // freeze (kernel call)
`-> do xfs rmap
` -> mf_dax_kill_procs()
` -> collect_procs_fsdax() // all associated processes
` -> unmap_and_kill()
` -> invalidate_inode_pages2_range() // drop file's cache
`-> thaw_super() // thaw (both kernel & user call)
Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event. Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
new dax mappings from being created. Do not shut down the filesystem
directly if the configuration is not supported, or if the failure range
includes a metadata area. Make sure all files and processes (not only the
current process) are handled correctly. Also drop the cache of associated
files before pmem is removed.
[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
---
drivers/dax/super.c | 3 +-
fs/xfs/xfs_notify_failure.c | 108 ++++++++++++++++++++++++++++++++++--
include/linux/mm.h | 1 +
mm/memory-failure.c | 21 +++++--
4 files changed, 122 insertions(+), 11 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..f4b635526345 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
if (dax_dev->holder_data != NULL)
- dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+ dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+ MF_MEM_PRE_REMOVE);
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 4a9bbd3fe120..30e9f4e09f76 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/dax.h>
+#include <linux/fs.h>
struct xfs_failure_info {
xfs_agblock_t startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
struct xfs_mount *mp = cur->bc_mp;
struct xfs_inode *ip;
struct xfs_failure_info *notify = data;
+ struct address_space *mapping;
+ pgoff_t pgoff;
+ unsigned long pgcnt;
int error = 0;
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+ /* Continue the query because this isn't a failure. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
notify->want_shutdown = true;
return 0;
}
@@ -92,14 +99,60 @@ xfs_dax_failure_fn(
return 0;
}
- error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
- xfs_failure_pgoff(mp, rec, notify),
- xfs_failure_pgcnt(mp, rec, notify),
- notify->mf_flags);
+ mapping = VFS_I(ip)->i_mapping;
+ pgoff = xfs_failure_pgoff(mp, rec, notify);
+ pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+ /* Continue the rmap query if the inode isn't a dax file. */
+ if (dax_mapping(mapping))
+ error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+ notify->mf_flags);
+
+ /* Invalidate the cache in dax pages. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ invalidate_inode_pages2_range(mapping, pgoff,
+ pgoff + pgcnt - 1);
+
xfs_irele(ip);
return error;
}
+static int
+xfs_dax_notify_failure_freeze(
+ struct xfs_mount *mp)
+{
+ struct super_block *sb = mp->m_super;
+ int error;
+
+ error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+ if (error)
+ xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+ return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+ struct xfs_mount *mp,
+ bool kernel_frozen)
+{
+ struct super_block *sb = mp->m_super;
+ int error;
+
+ if (kernel_frozen) {
+ error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+ if (error)
+ xfs_emerg(mp, "still frozen after notify failure, err=%d",
+ error);
+ }
+
+ /*
+ * Also thaw userspace call anyway because the device is about to be
+ * removed immediately.
+ */
+ thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
static int
xfs_dax_notify_ddev_failure(
struct xfs_mount *mp,
@@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
struct xfs_btree_cur *cur = NULL;
struct xfs_buf *agf_bp = NULL;
int error = 0;
+ bool kernel_frozen = false;
xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
daddr + bblen - 1);
xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
+ if (mf_flags & MF_MEM_PRE_REMOVE) {
+ xfs_info(mp, "Device is about to be removed!");
+ /*
+ * Freeze fs to prevent new mappings from being created.
+ * - Keep going if others already hold the kernel freeze.
+ * - Keep going on other errors too, because this device is
+ * starting to fail.
+ * - If the kernel freeze is taken successfully here, thaw it
+ * here as well at the end.
+ */
+ kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
+ }
+
error = xfs_trans_alloc_empty(mp, &tp);
if (error)
- return error;
+ goto out;
for (; agno <= end_agno; agno++) {
struct xfs_rmap_irec ri_low = { };
@@ -165,11 +232,26 @@ xfs_dax_notify_ddev_failure(
}
xfs_trans_cancel(tp);
- if (error || notify.want_shutdown) {
+
+ /*
+ * Shutdown fs from a force umount in pre-remove case which won't fail,
+ * so errors can be ignored. Otherwise, shutdown the filesystem with
+ * CORRUPT flag if error occurred or notify.want_shutdown was set during
+ * RMAP querying.
+ */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+ else if (error || notify.want_shutdown) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
if (!error)
error = -EFSCORRUPTED;
}
+
+out:
+ /* Thaw the fs if it has been frozen before. */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
return error;
}
@@ -197,6 +279,14 @@ xfs_dax_notify_failure(
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+ /*
+ * In the pre-remove case the failure notification is attempting
+ * to trigger a force unmount. The expectation is that the
+ * device is still present, but its removal is in progress and
+ * can not be cancelled, proceed with accessing the log device.
+ */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -210,6 +300,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
+ /* Notify failure on the whole device. */
+ if (offset == 0 && len == U64_MAX) {
+ offset = ddev_start;
+ len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+ }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..a10c75bebd6d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3665,6 +3665,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+ MF_MEM_PRE_REMOVE = 1 << 7,
};
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e245191e6b04..955edea9837f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -683,7 +683,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
*/
static void collect_procs_fsdax(struct page *page,
struct address_space *mapping, pgoff_t pgoff,
- struct list_head *to_kill)
+ struct list_head *to_kill, bool pre_remove)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -691,8 +691,15 @@ static void collect_procs_fsdax(struct page *page,
i_mmap_lock_read(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
- struct task_struct *t = task_early_kill(tsk, true);
+ struct task_struct *t = tsk;
+ /*
+ * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+ * the current task may not be the one accessing the fsdax page.
+ * Otherwise, search for the current task.
+ */
+ if (!pre_remove)
+ t = task_early_kill(tsk, true);
if (!t)
continue;
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1788,6 +1795,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
dax_entry_t cookie;
struct page *page;
size_t end = index + count;
+ bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
@@ -1799,9 +1807,14 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
if (!page)
goto unlock;
- SetPageHWPoison(page);
+ if (!pre_remove)
+ SetPageHWPoison(page);
- collect_procs_fsdax(page, mapping, index, &to_kill);
+ /*
+ * The pre_remove case is revoking access, the memory is still
+ * good and could theoretically be put back into service.
+ */
+ collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
index, mf_flags);
unlock:
--
2.42.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* RE: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-09-28 10:32 ` [PATCH v15] " Shiyang Ruan
@ 2023-09-29 18:31 ` Dan Williams
2023-10-01 1:43 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 0 replies; 37+ messages in thread
From: Dan Williams @ 2023-09-29 18:31 UTC (permalink / raw)
To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof, chandanbabu
Shiyang Ruan wrote:
> ====
> Changes since v14:
> 1. added/fixed code comments per Dan's comments
> ====
>
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
> 1. device has gone but mount point still exists, and umount will fail
> with "target is busy"
> 2. programs will hang and cannot be killed
> 3. may crash with NULL pointer dereference
>
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
>
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super() // freeze (kernel call)
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated processes
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2_range() // drop file's cache
> `-> thaw_super() // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created. Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area. Make sure all files and processes(not only the current progress)
> are handled correctly. Also drop the cache of associated files before
> pmem is removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
This version addresses my feedback; you can upgrade that Acked-by: to
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-09-28 10:32 ` [PATCH v15] " Shiyang Ruan
2023-09-29 18:31 ` Dan Williams
@ 2023-10-01 1:43 ` kernel test robot
2023-10-02 11:57 ` Shiyang Ruan
2023-10-20 9:56 ` Chandan Babu R
2023-10-23 7:20 ` [PATCH v15.1] " Shiyang Ruan
3 siblings, 1 reply; 37+ messages in thread
From: kernel test robot @ 2023-10-01 1:43 UTC (permalink / raw)
To: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: llvm, oe-kbuild-all, dan.j.williams, willy, jack, akpm, djwong,
mcgrof, chandanbabu
Hi Shiyang,
kernel test robot noticed the following build errors:
url: https://github.com/intel-lab-lkp/linux/commits/UPDATE-20230928-183310/Shiyang-Ruan/xfs-fix-the-calculation-for-end-and-length/20230629-161913
base: the 2th patch of https://lore.kernel.org/r/20230629081651.253626-3-ruansy.fnst%40fujitsu.com
patch link: https://lore.kernel.org/r/20230928103227.250550-1-ruansy.fnst%40fujitsu.com
patch subject: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310010955.feI4HCwZ-lkp@intel.com/
All errors (new ones prefixed by >>):
>> fs/xfs/xfs_notify_failure.c:127:27: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
^
fs/xfs/xfs_notify_failure.c:143:26: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
^
>> fs/xfs/xfs_notify_failure.c:153:17: error: use of undeclared identifier 'FREEZE_HOLDER_USERSPACE'
thaw_super(sb, FREEZE_HOLDER_USERSPACE);
^
3 errors generated.
vim +/FREEZE_HOLDER_KERNEL +127 fs/xfs/xfs_notify_failure.c
119
120 static int
121 xfs_dax_notify_failure_freeze(
122 struct xfs_mount *mp)
123 {
124 struct super_block *sb = mp->m_super;
125 int error;
126
> 127 error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
128 if (error)
129 xfs_emerg(mp, "already frozen by kernel, err=%d", error);
130
131 return error;
132 }
133
134 static void
135 xfs_dax_notify_failure_thaw(
136 struct xfs_mount *mp,
137 bool kernel_frozen)
138 {
139 struct super_block *sb = mp->m_super;
140 int error;
141
142 if (kernel_frozen) {
143 error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
144 if (error)
145 xfs_emerg(mp, "still frozen after notify failure, err=%d",
146 error);
147 }
148
149 /*
150 * Also thaw userspace call anyway because the device is about to be
151 * removed immediately.
152 */
> 153 thaw_super(sb, FREEZE_HOLDER_USERSPACE);
154 }
155
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-10-01 1:43 ` kernel test robot
@ 2023-10-02 11:57 ` Shiyang Ruan
0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-10-02 11:57 UTC (permalink / raw)
To: kernel test robot
Cc: llvm, oe-kbuild-all, dan.j.williams, willy, jack, akpm, djwong,
mcgrof, chandanbabu, linux-fsdevel, nvdimm, linux-xfs, linux-mm
On 2023/10/1 9:43, kernel test robot wrote:
> Hi Shiyang,
>
> kernel test robot noticed the following build errors:
>
>
>
> url: https://github.com/intel-lab-lkp/linux/commits/UPDATE-20230928-183310/Shiyang-Ruan/xfs-fix-the-calculation-for-end-and-length/20230629-161913
> base: the 2th patch of https://lore.kernel.org/r/20230629081651.253626-3-ruansy.fnst%40fujitsu.com
> patch link: https://lore.kernel.org/r/20230928103227.250550-1-ruansy.fnst%40fujitsu.com
> patch subject: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
> config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/config)
> compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231001/202310010955.feI4HCwZ-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202310010955.feI4HCwZ-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
>>> fs/xfs/xfs_notify_failure.c:127:27: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
> error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> ^
> fs/xfs/xfs_notify_failure.c:143:26: error: use of undeclared identifier 'FREEZE_HOLDER_KERNEL'
> error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> ^
>>> fs/xfs/xfs_notify_failure.c:153:17: error: use of undeclared identifier 'FREEZE_HOLDER_USERSPACE'
> thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> ^
> 3 errors generated.
>
The two enum values were introduced by 880b9577855e ("fs: distinguish
between user initiated freeze and kernel initiated freeze"), in v6.6-rc1.
I also compiled my patches on top of v6.6-rc1 with your config file, and
the build passed with no errors.
So, which kernel version were you testing?
--
Thanks,
Ruan.
>
> vim +/FREEZE_HOLDER_KERNEL +127 fs/xfs/xfs_notify_failure.c
>
> 119
> 120 static int
> 121 xfs_dax_notify_failure_freeze(
> 122 struct xfs_mount *mp)
> 123 {
> 124 struct super_block *sb = mp->m_super;
> 125 int error;
> 126
> > 127 error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
> 128 if (error)
> 129 xfs_emerg(mp, "already frozen by kernel, err=%d", error);
> 130
> 131 return error;
> 132 }
> 133
> 134 static void
> 135 xfs_dax_notify_failure_thaw(
> 136 struct xfs_mount *mp,
> 137 bool kernel_frozen)
> 138 {
> 139 struct super_block *sb = mp->m_super;
> 140 int error;
> 141
> 142 if (kernel_frozen) {
> 143 error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
> 144 if (error)
> 145 xfs_emerg(mp, "still frozen after notify failure, err=%d",
> 146 error);
> 147 }
> 148
> 149 /*
> 150 * Also thaw userspace call anyway because the device is about to be
> 151 * removed immediately.
> 152 */
> > 153 thaw_super(sb, FREEZE_HOLDER_USERSPACE);
> 154 }
> 155
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-09-28 10:32 ` [PATCH v15] " Shiyang Ruan
2023-09-29 18:31 ` Dan Williams
2023-10-01 1:43 ` kernel test robot
@ 2023-10-20 9:56 ` Chandan Babu R
2023-10-20 15:40 ` Darrick J. Wong
2023-10-23 7:20 ` [PATCH v15.1] " Shiyang Ruan
3 siblings, 1 reply; 37+ messages in thread
From: Chandan Babu R @ 2023-10-20 9:56 UTC (permalink / raw)
To: akpm
Cc: Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
dan.j.williams, willy, jack, djwong, mcgrof
On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
> ====
> Changes since v14:
> 1. added/fixed code comments per Dan's comments
> ====
>
> Now, if we suddenly remove a PMEM device(by calling unbind) which
> contains FSDAX while programs are still accessing data in this device,
> e.g.:
> ```
> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> ```
> it could come into an unacceptable state:
> 1. device has gone but mount point still exists, and umount will fail
> with "target is busy"
> 2. programs will hang and cannot be killed
> 3. may crash with NULL pointer dereference
>
> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> are going to remove the whole device, and make sure all related processes
> could be notified so that they could end up gracefully.
>
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super() // freeze (kernel call)
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated processes
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2_range() // drop file's cache
> `-> thaw_super() // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created. Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area. Make sure all files and processes(not only the current progress)
> are handled correctly. Also drop the cache of associated files before
> pmem is removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
Hi Andrew,
Shiyang had indicated that this patch has been added to
akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
that branch.
I am about to start collecting XFS patches for v6.7 cycle. Please let me know
if you have any objections with me taking this patch via the XFS tree.
--
Chandan
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-10-20 9:56 ` Chandan Babu R
@ 2023-10-20 15:40 ` Darrick J. Wong
2023-10-23 6:40 ` Chandan Babu R
0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2023-10-20 15:40 UTC (permalink / raw)
To: Chandan Babu R
Cc: akpm, Shiyang Ruan, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
dan.j.williams, willy, jack, mcgrof
On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
> > ====
> > Changes since v14:
> > 1. added/fixed code comments per Dan's comments
> > ====
> >
> > Now, if we suddenly remove a PMEM device(by calling unbind) which
> > contains FSDAX while programs are still accessing data in this device,
> > e.g.:
> > ```
> > $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
> > # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
> > echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
> > ```
> > it could come into an unacceptable state:
> > 1. device has gone but mount point still exists, and umount will fail
> > with "target is busy"
> > 2. programs will hang and cannot be killed
> > 3. may crash with NULL pointer dereference
> >
> > To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
> > are going to remove the whole device, and make sure all related processes
> > could be notified so that they could end up gracefully.
> >
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1]. With the help of dax_holder and
> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > on it to unmap all files in use, and notify processes who are using
> > those files.
> >
> > Call trace:
> > trigger unbind
> > -> unbind_store()
> > -> ... (skip)
> > -> devres_release_all()
> > -> kill_dax()
> > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > -> xfs_dax_notify_failure()
> > `-> freeze_super() // freeze (kernel call)
> > `-> do xfs rmap
> > ` -> mf_dax_kill_procs()
> > ` -> collect_procs_fsdax() // all associated processes
> > ` -> unmap_and_kill()
> > ` -> invalidate_inode_pages2_range() // drop file's cache
> > `-> thaw_super() // thaw (both kernel & user call)
> >
> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> > new dax mapping from being created. Do not shutdown filesystem directly
> > if configuration is not supported, or if failure range includes metadata
> > area. Make sure all files and processes(not only the current progress)
> > are handled correctly. Also drop the cache of associated files before
> > pmem is removed.
> >
> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > Acked-by: Dan Williams <dan.j.williams@intel.com>
>
> Hi Andrew,
>
> Shiyang had indicated that this patch has been added to
> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
> that branch.
>
> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
> if you have any objections with me taking this patch via the XFS tree.
V15 was dropped from his tree on 28 Sept., you might as well pull it
into your own tree for 6.7. It's been testing fine on my trees for the
past 3 weeks.
https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/
--D
>
> --
> Chandan
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-10-20 15:40 ` Darrick J. Wong
@ 2023-10-23 6:40 ` Chandan Babu R
2023-10-23 7:26 ` Shiyang Ruan
0 siblings, 1 reply; 37+ messages in thread
From: Chandan Babu R @ 2023-10-23 6:40 UTC (permalink / raw)
To: Shiyang Ruan
Cc: akpm, Darrick J. Wong, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
dan.j.williams, willy, jack, mcgrof
On Fri, Oct 20, 2023 at 08:40:09 AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
>> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
>> > ====
>> > Changes since v14:
>> > 1. added/fixed code comments per Dan's comments
>> > ====
>> >
>> > Now, if we suddenly remove a PMEM device(by calling unbind) which
>> > contains FSDAX while programs are still accessing data in this device,
>> > e.g.:
>> > ```
>> > $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>> > # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>> > echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>> > ```
>> > it could come into an unacceptable state:
>> > 1. device has gone but mount point still exists, and umount will fail
>> > with "target is busy"
>> > 2. programs will hang and cannot be killed
>> > 3. may crash with NULL pointer dereference
>> >
>> > To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>> > are going to remove the whole device, and make sure all related processes
>> > could be notified so that they could end up gracefully.
>> >
>> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> > dev_pagemap_failure()"[1]. With the help of dax_holder and
>> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> > on it to unmap all files in use, and notify processes who are using
>> > those files.
>> >
>> > Call trace:
>> > trigger unbind
>> > -> unbind_store()
>> > -> ... (skip)
>> > -> devres_release_all()
>> > -> kill_dax()
>> > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>> > -> xfs_dax_notify_failure()
>> > `-> freeze_super() // freeze (kernel call)
>> > `-> do xfs rmap
>> > ` -> mf_dax_kill_procs()
>> > ` -> collect_procs_fsdax() // all associated processes
>> > ` -> unmap_and_kill()
>> > ` -> invalidate_inode_pages2_range() // drop file's cache
>> > `-> thaw_super() // thaw (both kernel & user call)
>> >
>> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> > event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> > new dax mapping from being created. Do not shutdown filesystem directly
>> > if configuration is not supported, or if failure range includes metadata
>> > area. Make sure all files and processes(not only the current progress)
>> > are handled correctly. Also drop the cache of associated files before
>> > pmem is removed.
>> >
>> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> > [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>> >
>> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>> > Acked-by: Dan Williams <dan.j.williams@intel.com>
>>
>> Hi Andrew,
>>
>> Shiyang had indicated that this patch has been added to
>> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
>> that branch.
>>
>> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
>> if you have any objections with me taking this patch via the XFS tree.
>
> V15 was dropped from his tree on 28 Sept., you might as well pull it
> into your own tree for 6.7. It's been testing fine on my trees for the
> past 3 weeks.
>
> https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/
Shiyang, this patch does not apply cleanly on v6.6-rc7. Can you please rebase
the patch on v6.6-rc7 and send it to the mailing list?
--
Chandan
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH v15.1] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-09-28 10:32 ` [PATCH v15] " Shiyang Ruan
` (2 preceding siblings ...)
2023-10-20 9:56 ` Chandan Babu R
@ 2023-10-23 7:20 ` Shiyang Ruan
3 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2023-10-23 7:20 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm, chandanbabu
Cc: dan.j.williams, willy, jack, akpm, djwong, mcgrof
Changes since v15:
1. Rebased on v6.6-rc7
Currently, if we suddenly remove a PMEM device (by calling unbind) that
contains FSDAX while programs are still accessing data on this device,
e.g.:
```
$FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
# $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
the system can end up in an unacceptable state:
1. the device is gone but the mount point still exists, and umount fails
with "target is busy"
2. programs hang and cannot be killed
3. the kernel may crash with a NULL pointer dereference
To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let the filesystem know
that the whole device is about to be removed, and make sure all related
processes are notified so that they can exit gracefully.
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of the dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the filesystem
on it to unmap all files in use, and to notify the processes that are using
those files.
Call trace:
trigger unbind
-> unbind_store()
-> ... (skip)
-> devres_release_all()
-> kill_dax()
-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()
`-> freeze_super() // freeze (kernel call)
`-> do xfs rmap
` -> mf_dax_kill_procs()
` -> collect_procs_fsdax() // all associated processes
` -> unmap_and_kill()
` -> invalidate_inode_pages2_range() // drop file's cache
`-> thaw_super() // thaw (both kernel & user call)
Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event. Use the exclusive freeze/thaw[2] to lock the filesystem and prevent
new dax mappings from being created. Do not shut down the filesystem directly
if the configuration is not supported, or if the failure range includes a
metadata area. Make sure all files and processes (not only the current
process) are handled correctly. Also drop the cache of associated files
before the pmem is removed.
[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
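Note on the range handling: kill_dax() reports the whole device as
(offset 0, len U64_MAX), and xfs_dax_notify_failure() expands that sentinel
to the real data device extent before the range checks. The sketch below
is a minimal user-space illustration of that step and of the inclusive-end
convention (end = start + length - 1) used throughout the file. struct
mf_range and the mf_range_* helpers are hypothetical names for
illustration only; they do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the (offset, len) pair given to ->notify_failure(). */
struct mf_range {
	uint64_t offset;	/* first affected byte */
	uint64_t len;		/* number of affected bytes */
};

/*
 * Expand the "whole device" sentinel (offset 0, len U64_MAX) to the real
 * device extent; any other range is returned unchanged.
 */
static struct mf_range
mf_range_normalize(struct mf_range r, uint64_t ddev_start, uint64_t ddev_bytes)
{
	if (r.offset == 0 && r.len == UINT64_MAX) {
		r.offset = ddev_start;
		r.len = ddev_bytes;
	}
	return r;
}

/* Inclusive last byte of a non-empty range: start + length - 1. */
static uint64_t
mf_range_last(struct mf_range r)
{
	assert(r.len > 0);
	return r.offset + r.len - 1;
}
```

For example, with a data device that starts at byte 4096 and is 1 MiB
long, the sentinel normalizes to offset 4096 and length 1048576, and the
inclusive last byte is 4096 + 1048576 - 1.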
drivers/dax/super.c | 3 +-
fs/xfs/xfs_notify_failure.c | 108 ++++++++++++++++++++++++++++++++++--
include/linux/mm.h | 1 +
mm/memory-failure.c | 21 +++++--
4 files changed, 122 insertions(+), 11 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..f4b635526345 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -326,7 +326,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
if (dax_dev->holder_data != NULL)
- dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+ dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+ MF_MEM_PRE_REMOVE);
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index a7daa522e00f..fa50e5308292 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/dax.h>
+#include <linux/fs.h>
struct xfs_failure_info {
xfs_agblock_t startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
struct xfs_mount *mp = cur->bc_mp;
struct xfs_inode *ip;
struct xfs_failure_info *notify = data;
+ struct address_space *mapping;
+ pgoff_t pgoff;
+ unsigned long pgcnt;
int error = 0;
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+ /* Continue the query because this isn't a failure. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
notify->want_shutdown = true;
return 0;
}
@@ -92,14 +99,60 @@ xfs_dax_failure_fn(
return 0;
}
- error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
- xfs_failure_pgoff(mp, rec, notify),
- xfs_failure_pgcnt(mp, rec, notify),
- notify->mf_flags);
+ mapping = VFS_I(ip)->i_mapping;
+ pgoff = xfs_failure_pgoff(mp, rec, notify);
+ pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+ /* Continue the rmap query if the inode isn't a dax file. */
+ if (dax_mapping(mapping))
+ error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+ notify->mf_flags);
+
+ /* Invalidate the cache in dax pages. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ invalidate_inode_pages2_range(mapping, pgoff,
+ pgoff + pgcnt - 1);
+
xfs_irele(ip);
return error;
}
+static int
+xfs_dax_notify_failure_freeze(
+ struct xfs_mount *mp)
+{
+ struct super_block *sb = mp->m_super;
+ int error;
+
+ error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
+ if (error)
+ xfs_emerg(mp, "already frozen by kernel, err=%d", error);
+
+ return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+ struct xfs_mount *mp,
+ bool kernel_frozen)
+{
+ struct super_block *sb = mp->m_super;
+ int error;
+
+ if (kernel_frozen) {
+ error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
+ if (error)
+ xfs_emerg(mp, "still frozen after notify failure, err=%d",
+ error);
+ }
+
+ /*
+ * Also thaw userspace call anyway because the device is about to be
+ * removed immediately.
+ */
+ thaw_super(sb, FREEZE_HOLDER_USERSPACE);
+}
+
static int
xfs_dax_notify_ddev_failure(
struct xfs_mount *mp,
@@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
struct xfs_btree_cur *cur = NULL;
struct xfs_buf *agf_bp = NULL;
int error = 0;
+ bool kernel_frozen = false;
xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp,
daddr + bblen - 1);
xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
+ if (mf_flags & MF_MEM_PRE_REMOVE) {
+ xfs_info(mp, "Device is about to be removed!");
+ /*
+ * Freeze fs to prevent new mappings from being created.
+ * - Keep going if others already hold the kernel freeze.
+ * - Keep going on other errors too, because this device is
+ * starting to fail.
+ * - If the kernel freeze is taken successfully here, thaw it
+ * here as well at the end.
+ */
+ kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
+ }
+
error = xfs_trans_alloc_empty(mp, &tp);
if (error)
- return error;
+ goto out;
for (; agno <= end_agno; agno++) {
struct xfs_rmap_irec ri_low = { };
@@ -165,11 +232,26 @@ xfs_dax_notify_ddev_failure(
}
xfs_trans_cancel(tp);
- if (error || notify.want_shutdown) {
+
+ /*
+ * In the pre-remove case, shut down the fs with a forced umount, which
+ * won't fail, so errors can be ignored. Otherwise, shut down the
+ * filesystem with the CORRUPT flag if an error occurred or
+ * notify.want_shutdown was set during the rmap query.
+ */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+ else if (error || notify.want_shutdown) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
if (!error)
error = -EFSCORRUPTED;
}
+
+out:
+ /* Thaw the fs if it has been frozen before. */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_dax_notify_failure_thaw(mp, kernel_frozen);
+
return error;
}
@@ -197,6 +279,14 @@ xfs_dax_notify_failure(
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+ /*
+ * In the pre-remove case the failure notification is attempting
+ * to trigger a force unmount. The expectation is that the
+ * device is still present, but its removal is in progress and
+ * can not be cancelled, proceed with accessing the log device.
+ */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -210,6 +300,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
+ /* Notify failure on the whole device. */
+ if (offset == 0 && len == U64_MAX) {
+ offset = ddev_start;
+ len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+ }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..385eee0d05a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3831,6 +3831,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+ MF_MEM_PRE_REMOVE = 1 << 7,
};
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c88489..6e43ae369fef 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -679,7 +679,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
*/
static void collect_procs_fsdax(struct page *page,
struct address_space *mapping, pgoff_t pgoff,
- struct list_head *to_kill)
+ struct list_head *to_kill, bool pre_remove)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -687,8 +687,15 @@ static void collect_procs_fsdax(struct page *page,
i_mmap_lock_read(mapping);
rcu_read_lock();
for_each_process(tsk) {
- struct task_struct *t = task_early_kill(tsk, true);
+ struct task_struct *t = tsk;
+ /*
+ * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
+ * the current may not be the one accessing the fsdax page.
+ * Otherwise, search for the current task.
+ */
+ if (!pre_remove)
+ t = task_early_kill(tsk, true);
if (!t)
continue;
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1792,6 +1799,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
dax_entry_t cookie;
struct page *page;
size_t end = index + count;
+ bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
@@ -1803,9 +1811,14 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
if (!page)
goto unlock;
- SetPageHWPoison(page);
+ if (!pre_remove)
+ SetPageHWPoison(page);
- collect_procs_fsdax(page, mapping, index, &to_kill);
+ /*
+ * The pre_remove case is revoking access, the memory is still
+ * good and could theoretically be put back into service.
+ */
+ collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
index, mf_flags);
unlock:
--
2.42.0
* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-10-23 6:40 ` Chandan Babu R
@ 2023-10-23 7:26 ` Shiyang Ruan
2023-10-23 12:21 ` Chandan Babu R
0 siblings, 1 reply; 37+ messages in thread
From: Shiyang Ruan @ 2023-10-23 7:26 UTC (permalink / raw)
To: Chandan Babu R
Cc: akpm, Darrick J. Wong, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
dan.j.williams, willy, jack, mcgrof
On 2023/10/23 14:40, Chandan Babu R wrote:
>
> On Fri, Oct 20, 2023 at 08:40:09 AM -0700, Darrick J. Wong wrote:
>> On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
>>> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
>>>> ====
>>>> Changes since v14:
>>>> 1. added/fixed code comments per Dan's comments
>>>> ====
>>>>
>>>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>>>> contains FSDAX while programs are still accessing data in this device,
>>>> e.g.:
>>>> ```
>>>> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>>> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>>> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>>>> ```
>>>> it could come into an unacceptable state:
>>>> 1. device has gone but mount point still exists, and umount will fail
>>>> with "target is busy"
>>>> 2. programs will hang and cannot be killed
>>>> 3. may crash with NULL pointer dereference
>>>>
>>>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>>>> are going to remove the whole device, and make sure all related processes
>>>> could be notified so that they could end up gracefully.
>>>>
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> on it to unmap all files in use, and notify processes who are using
>>>> those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>> -> unbind_store()
>>>> -> ... (skip)
>>>> -> devres_release_all()
>>>> -> kill_dax()
>>>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>> -> xfs_dax_notify_failure()
>>>> `-> freeze_super() // freeze (kernel call)
>>>> `-> do xfs rmap
>>>> ` -> mf_dax_kill_procs()
>>>> ` -> collect_procs_fsdax() // all associated processes
>>>> ` -> unmap_and_kill()
>>>> ` -> invalidate_inode_pages2_range() // drop file's cache
>>>> `-> thaw_super() // thaw (both kernel & user call)
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>> new dax mapping from being created. Do not shutdown filesystem directly
>>>> if configuration is not supported, or if failure range includes metadata
>>>> area. Make sure all files and processes(not only the current progress)
>>>> are handled correctly. Also drop the cache of associated files before
>>>> pmem is removed.
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>>> Acked-by: Dan Williams <dan.j.williams@intel.com>
>>>
>>> Hi Andrew,
>>>
>>> Shiyang had indicated that this patch has been added to
>>> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
>>> that branch.
>>>
>>> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
>>> if you have any objections with me taking this patch via the XFS tree.
>>
>> V15 was dropped from his tree on 28 Sept., you might as well pull it
>> into your own tree for 6.7. It's been testing fine on my trees for the
>> past 3 weeks.
>>
>> https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/
>
> Shiyang, this patch does not apply cleanly on v6.6-rc7. Can you please rebase
> the patch on v6.6-rc7 and send it to the mailing list?
Sure. I have rebased it and sent a v15.1. Please check it:
https://lore.kernel.org/linux-xfs/20231023072046.1626474-1-ruansy.fnst@fujitsu.com/
--
Thanks,
Ruan.
>
* Re: [PATCH v15] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
2023-10-23 7:26 ` Shiyang Ruan
@ 2023-10-23 12:21 ` Chandan Babu R
0 siblings, 0 replies; 37+ messages in thread
From: Chandan Babu R @ 2023-10-23 12:21 UTC (permalink / raw)
To: Shiyang Ruan
Cc: akpm, Darrick J. Wong, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
dan.j.williams, willy, jack, mcgrof
On Mon, Oct 23, 2023 at 03:26:52 PM +0800, Shiyang Ruan wrote:
> On 2023/10/23 14:40, Chandan Babu R wrote:
>> On Fri, Oct 20, 2023 at 08:40:09 AM -0700, Darrick J. Wong wrote:
>>> On Fri, Oct 20, 2023 at 03:26:32PM +0530, Chandan Babu R wrote:
>>>> On Thu, Sep 28, 2023 at 06:32:27 PM +0800, Shiyang Ruan wrote:
>>>>> ====
>>>>> Changes since v14:
>>>>> 1. added/fixed code comments per Dan's comments
>>>>> ====
>>>>>
>>>>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>>>>> contains FSDAX while programs are still accessing data in this device,
>>>>> e.g.:
>>>>> ```
>>>>> $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>>>> # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>>>> echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>>>>> ```
>>>>> it could come into an unacceptable state:
>>>>> 1. device has gone but mount point still exists, and umount will fail
>>>>> with "target is busy"
>>>>> 2. programs will hang and cannot be killed
>>>>> 3. may crash with NULL pointer dereference
>>>>>
>>>>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>>>>> are going to remove the whole device, and make sure all related processes
>>>>> could be notified so that they could end up gracefully.
>>>>>
>>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>>> on it to unmap all files in use, and notify processes who are using
>>>>> those files.
>>>>>
>>>>> Call trace:
>>>>> trigger unbind
>>>>> -> unbind_store()
>>>>> -> ... (skip)
>>>>> -> devres_release_all()
>>>>> -> kill_dax()
>>>>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>>> -> xfs_dax_notify_failure()
>>>>> `-> freeze_super() // freeze (kernel call)
>>>>> `-> do xfs rmap
>>>>> ` -> mf_dax_kill_procs()
>>>>> ` -> collect_procs_fsdax() // all associated processes
>>>>> ` -> unmap_and_kill()
>>>>> ` -> invalidate_inode_pages2_range() // drop file's cache
>>>>> `-> thaw_super() // thaw (both kernel & user call)
>>>>>
>>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>>> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>>>>> new dax mapping from being created. Do not shutdown filesystem directly
>>>>> if configuration is not supported, or if failure range includes metadata
>>>>> area. Make sure all files and processes(not only the current progress)
>>>>> are handled correctly. Also drop the cache of associated files before
>>>>> pmem is removed.
>>>>>
>>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>>>>
>>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>>>>> Acked-by: Dan Williams <dan.j.williams@intel.com>
>>>>
>>>> Hi Andrew,
>>>>
>>>> Shiyang had indicated that this patch has been added to
>>>> akpm/mm-hotfixes-unstable branch. However, I don't see the patch listed in
>>>> that branch.
>>>>
>>>> I am about to start collecting XFS patches for v6.7 cycle. Please let me know
>>>> if you have any objections with me taking this patch via the XFS tree.
>>>
>>> V15 was dropped from his tree on 28 Sept., you might as well pull it
>>> into your own tree for 6.7. It's been testing fine on my trees for the
>>> past 3 weeks.
>>>
>>> https://lore.kernel.org/mm-commits/20230928172815.EE6AFC433C8@smtp.kernel.org/
>> Shiyang, this patch does not apply cleanly on v6.6-rc7. Can you
>> please rebase
>> the patch on v6.6-rc7 and send it to the mailing list?
>
> Sure. I have rebased it and sent a v15.1. Please check it:
>
> https://lore.kernel.org/linux-xfs/20231023072046.1626474-1-ruansy.fnst@fujitsu.com/
Thank you. I have applied the patch to my local Git tree.
--
Chandan
* Re: [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-06-29 8:16 [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-06-29 8:16 ` [PATCH v12 1/2] xfs: fix the calculation for "end" and "length" Shiyang Ruan
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2024-01-11 22:24 ` Bill O'Donnell
2024-01-12 1:56 ` Shiyang Ruan
2 siblings, 1 reply; 37+ messages in thread
From: Bill O'Donnell @ 2024-01-11 22:24 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, djwong, mcgrof
On Thu, Jun 29, 2023 at 04:16:49PM +0800, Shiyang Ruan wrote:
> This patchset is to add gracefully unbind support for pmem.
> Patch1 corrects the calculation of length and end of a given range.
> Patch2 introduces a new flag call MF_MEM_REMOVE, to let dax holder know
> it is a remove event. With the help of notify_failure mechanism, we are
> able to shutdown the filesystem on the pmem gracefully.
What is the status of this patch?
Thanks-
Bill
>
> Changes since v11:
> Patch1:
> 1. correct the count calculation in xfs_failure_pgcnt().
> (was a wrong fix in v11)
> Patch2:
> 1. use new exclusive freeze_super/thaw_super API, to make sure the unbind
> progress won't be disturbed by any other freezer.
>
> Shiyang Ruan (2):
> xfs: fix the calculation for "end" and "length"
> mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
>
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 95 +++++++++++++++++++++++++++++++++----
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 +++++--
> 4 files changed, 101 insertions(+), 15 deletions(-)
>
> --
> 2.40.1
>
* Re: [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2024-01-11 22:24 ` [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE " Bill O'Donnell
@ 2024-01-12 1:56 ` Shiyang Ruan
0 siblings, 0 replies; 37+ messages in thread
From: Shiyang Ruan @ 2024-01-12 1:56 UTC (permalink / raw)
To: Bill O'Donnell
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy,
jack, akpm, djwong, mcgrof
On 2024/1/12 6:24, Bill O'Donnell wrote:
> On Thu, Jun 29, 2023 at 04:16:49PM +0800, Shiyang Ruan wrote:
>> This patchset is to add gracefully unbind support for pmem.
>> Patch1 corrects the calculation of length and end of a given range.
>> Patch2 introduces a new flag call MF_MEM_REMOVE, to let dax holder know
>> it is a remove event. With the help of notify_failure mechanism, we are
>> able to shutdown the filesystem on the pmem gracefully.
>
> What is the status of this patch?
Hi Bill,
This patch has just been merged. You can find it here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa422b353d212373fb2b2857a5ea5a6fa4876f9c
--
Thanks,
Ruan.
> Thanks-
> Bill
>
>
>>
>> Changes since v11:
>> Patch1:
>> 1. correct the count calculation in xfs_failure_pgcnt().
>> (was a wrong fix in v11)
>> Patch2:
>> 1. use new exclusive freeze_super/thaw_super API, to make sure the unbind
>> progress won't be disturbed by any other freezer.
>>
>> Shiyang Ruan (2):
>> xfs: fix the calculation for "end" and "length"
>> mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
>>
>> drivers/dax/super.c | 3 +-
>> fs/xfs/xfs_notify_failure.c | 95 +++++++++++++++++++++++++++++++++----
>> include/linux/mm.h | 1 +
>> mm/memory-failure.c | 17 +++++--
>> 4 files changed, 101 insertions(+), 15 deletions(-)
>>
>> --
>> 2.40.1
>>
>
Thread overview: 37+ messages
2023-06-29 8:16 [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-06-29 8:16 ` [PATCH v12 1/2] xfs: fix the calculation for "end" and "length" Shiyang Ruan
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-06-29 12:02 ` kernel test robot
2023-07-14 9:07 ` Shiyang Ruan
2023-07-14 14:18 ` Darrick J. Wong
2023-07-20 1:50 ` Shiyang Ruan
2023-07-29 10:01 ` Shiyang Ruan
2023-07-29 15:15 ` Darrick J. Wong
2023-07-29 15:15 ` Darrick J. Wong
2023-07-31 9:36 ` Shiyang Ruan
2023-08-01 3:25 ` Darrick J. Wong
2023-08-03 10:44 ` Shiyang Ruan
2023-08-08 0:31 ` Dan Williams
2023-08-23 8:36 ` Shiyang Ruan
2023-08-23 8:17 ` [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE " Shiyang Ruan
2023-08-23 23:36 ` Darrick J. Wong
2023-08-24 9:41 ` Shiyang Ruan
2023-08-24 23:57 ` Darrick J. Wong
2023-08-25 3:52 ` Shiyang Ruan
2023-08-26 0:17 ` Darrick J. Wong
2023-08-28 6:57 ` [PATCH v14] " Shiyang Ruan
2023-08-30 15:34 ` Darrick J. Wong
2023-09-27 8:17 ` Dan Williams
2023-09-27 9:18 ` Shiyang Ruan
2023-09-28 10:32 ` [PATCH v15] " Shiyang Ruan
2023-09-29 18:31 ` Dan Williams
2023-10-01 1:43 ` kernel test robot
2023-10-02 11:57 ` Shiyang Ruan
2023-10-20 9:56 ` Chandan Babu R
2023-10-20 15:40 ` Darrick J. Wong
2023-10-23 6:40 ` Chandan Babu R
2023-10-23 7:26 ` Shiyang Ruan
2023-10-23 12:21 ` Chandan Babu R
2023-10-23 7:20 ` [PATCH v15.1] " Shiyang Ruan
2024-01-11 22:24 ` [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE " Bill O'Donnell
2024-01-12 1:56 ` Shiyang Ruan