* [PATCH 0/2] Fix loopback mounted filesystems on NFS
From: Benjamin Coddington @ 2025-07-07 18:46 UTC
To: Trond Myklebust, Anna Schumaker, Tejun Heo, Lai Jiangshan
Cc: linux-nfs, linux-kernel, djeffery, loberman
We've been investigating new reports of filesystem corruption on
loopback images on NFS clients. It appears that during writeback the
loopback driver encounters allocation failures in NFS and fails to write
dirty pages to the backing file.
We believe the problem is due to the loopback driver performing writeback
from a workqueue (so PF_WQ_WORKER is set); however, ever since the work to
improve NFS' memory allocation strategies [1], it's possible for NFS to
incorrectly assume that if PF_WQ_WORKER is set then the writeback context
is nfsiod. To make things worse, NFS does not expect PF_WQ_WORKER to be set
along with other PF_ flags such as PF_MEMALLOC_NOIO, but it cannot really
know (without checking them all) which other allocation flags are in effect
when writeback is entered from an NFS-external workqueue worker.
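For reference, the check in question in fs/nfs/internal.h currently treats
any workqueue worker as if it were nfsiod (this is the code that patch 2
below changes):

static inline gfp_t nfs_io_gfp_mask(void)
{
	/* PF_WQ_WORKER is set for any workqueue worker, including loop's */
	if (current->flags & PF_WQ_WORKER)
		return GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
	return GFP_KERNEL;
}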
To fix this, I'd like to introduce a way to check which specific workqueue
is being served by a worker (in patch 1), so that NFS can ensure that it
sets certain allocation flags only for the nfsiod workqueue workers (in
patch 2).
[1]: https://lore.kernel.org/linux-nfs/20220322011618.1052288-1-trondmy@kernel.org/
Benjamin Coddington (2):
workqueue: Add a helper to identify current workqueue
NFS: Improve nfsiod workqueue detection for allocation flags
fs/nfs/internal.h | 12 +++++++++++-
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 18 ++++++++++++++++++
3 files changed, 30 insertions(+), 1 deletion(-)
--
2.47.0
* [PATCH 1/2] workqueue: Add a helper to identify current workqueue
From: Benjamin Coddington @ 2025-07-07 18:46 UTC
To: Trond Myklebust, Anna Schumaker, Tejun Heo, Lai Jiangshan
Cc: linux-nfs, linux-kernel, djeffery, loberman
Introduce a new helper, current_workqueue(), which returns the current
task's workqueue pointer, or NULL if the task is not a workqueue worker.

This will allow the NFS client to recognize whether writeback is occurring
within the nfsiod workqueue or is being submitted directly. NFS would like
to change the GFP_ flags for memory allocation, based on the context in
which writeback is occurring, to avoid stalls or cycles in memory pools.
In a following patch, this helper is used to detect that case instead of
checking the PF_WQ_WORKER flag, which can also be set when running from
another workqueue's worker.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 18 ++++++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 6e30f275da77..29e1096e6dfa 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -623,6 +623,7 @@ extern void workqueue_set_max_active(struct workqueue_struct *wq,
 extern void workqueue_set_min_active(struct workqueue_struct *wq,
				     int min_active);
 extern struct work_struct *current_work(void);
+extern struct workqueue_struct *current_workqueue(void);
 extern bool current_is_workqueue_rescuer(void);
 extern bool workqueue_congested(int cpu, struct workqueue_struct *wq);
 extern unsigned int work_busy(struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9f9148075828..a96eb209d5e0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6009,6 +6009,24 @@ struct work_struct *current_work(void)
 }
 EXPORT_SYMBOL(current_work);
 
+/**
+ * current_workqueue - retrieve %current task's work queue
+ *
+ * Determine if %current task is a workqueue worker and what workqueue it's
+ * working on. Useful to find out the context that the %current task is
+ * running in.
+ *
+ * Return: workqueue_struct if %current task is a workqueue worker, %NULL
+ * otherwise.
+ */
+struct workqueue_struct *current_workqueue(void)
+{
+	struct worker *worker = current_wq_worker();
+
+	return worker ? worker->current_pwq->wq : NULL;
+}
+EXPORT_SYMBOL(current_workqueue);
+
 /**
  * current_is_workqueue_rescuer - is %current workqueue rescuer?
  *
--
2.47.0
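As a usage sketch (outside the patch itself): the helper is meant to be
compared directly against a known workqueue pointer, and it returns NULL
when the current task is not a workqueue worker, so the bare comparison is
safe in any context. gfp_flags below is just an illustrative local; patch 2
wraps this comparison in an is_nfsiod() helper:

	/* Sketch: true only when running as a worker of the nfsiod workqueue */
	if (current_workqueue() == nfsiod_workqueue)
		gfp_flags |= __GFP_NORETRY | __GFP_NOWARN;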
* [PATCH 2/2] NFS: Improve nfsiod workqueue detection for allocation flags
From: Benjamin Coddington @ 2025-07-07 18:46 UTC
To: Trond Myklebust, Anna Schumaker, Tejun Heo, Lai Jiangshan
Cc: linux-nfs, linux-kernel, djeffery, loberman
The NFS client writeback paths change which flags are passed to their
memory allocation calls based on whether or not the current task is running
from within a workqueue. More specifically, during writeback, allocations
made with PF_WQ_WORKER set in current->flags will add __GFP_NORETRY |
__GFP_NOWARN. Presumably this is because nfsiod can simply fail quickly
and retry writing back that specific page later should the allocation fail.
However, the check for PF_WQ_WORKER is too general because tasks can enter NFS
writeback paths from other workqueues. Specifically, the loopback driver
tends to perform writeback into backing files on NFS with PF_WQ_WORKER set,
and additionally sets PF_MEMALLOC_NOIO. The combination of
PF_MEMALLOC_NOIO with __GFP_NORETRY can easily result in allocation
failures and the loopback driver has no retry functionality. As a result,
after commit 0bae835b63c5 ("NFS: Avoid writeback threads getting stuck in
mempool_alloc()") users are seeing corrupted loop-mounted filesystems backed
by image files on NFS.
In a preceding patch, we introduced a function to allow NFS to detect if
the task is executing within a specific workqueue. Here we use that helper
to set __GFP_NORETRY | __GFP_NOWARN only if the workqueue is nfsiod.
Fixes: 0bae835b63c5 ("NFS: Avoid writeback threads getting stuck in mempool_alloc()")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
---
fs/nfs/internal.h | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 69c2c10ee658..173172afa3f5 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -12,6 +12,7 @@
 #include <linux/nfs_page.h>
 #include <linux/nfslocalio.h>
 #include <linux/wait_bit.h>
+#include <linux/workqueue.h>
 
 #define NFS_SB_MASK (SB_NOSUID|SB_NODEV|SB_NOEXEC|SB_SYNCHRONOUS)
 
@@ -669,9 +670,18 @@ nfs_write_match_verf(const struct nfs_writeverf *verf,
 		!nfs_write_verifier_cmp(&req->wb_verf, &verf->verifier);
 }
 
+static inline bool is_nfsiod(void)
+{
+	struct workqueue_struct *current_wq = current_workqueue();
+
+	if (current_wq)
+		return current_wq == nfsiod_workqueue;
+	return false;
+}
+
 static inline gfp_t nfs_io_gfp_mask(void)
 {
-	if (current->flags & PF_WQ_WORKER)
+	if (is_nfsiod())
 		return GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
 	return GFP_KERNEL;
 }
--
2.47.0
* Re: [PATCH 0/2] Fix loopback mounted filesystems on NFS
From: Jeff Layton @ 2025-07-07 19:15 UTC
To: Benjamin Coddington, Trond Myklebust, Anna Schumaker, Tejun Heo,
Lai Jiangshan
Cc: linux-nfs, linux-kernel, djeffery, loberman
On Mon, 2025-07-07 at 14:46 -0400, Benjamin Coddington wrote:
> We've been investigating new reports of filesystem corruption on
> loopback images on NFS clients. It appears that during writeback the
> loopback driver encounters allocation failures in NFS and fails to write
> dirty pages to the backing file.
>
> We believe the problem is due to the loopback driver performing writeback
> from a workqueue (so PF_WQ_WORKER is set), however ever since work to
> improve NFS' memory allocation strategies [1] its possible that NFS
> incorrectly assumes that if PF_WQ_WORKER is set then the writeback context
> is nfsiod. To make things worse, NFS does not expect PF_WQ_WORKER to be set
> along with other PF_ flags such as PF_MEMALLOC_NOIO, but cannot really know
> (without checking them all) which other allocation flags are set should
> writeback be entered from a NFS-external workqueue worker.
>
> To fix this, I'd like to introduce a way to check which specific workqueue
> is being served by a worker (in patch 1), so that NFS can ensure that it
> sets certain allocation flags only for the nfsiod workqueue workers (in
> patch 2).
>
> [1]: https://lore.kernel.org/linux-nfs/20220322011618.1052288-1-trondmy@kernel.org/
>
> Benjamin Coddington (2):
> workqueue: Add a helper to identify current workqueue
> NFS: Improve nfsiod workqueue detection for allocation flags
>
> fs/nfs/internal.h | 12 +++++++++++-
> include/linux/workqueue.h | 1 +
> kernel/workqueue.c | 18 ++++++++++++++++++
> 3 files changed, 30 insertions(+), 1 deletion(-)
Looks like a nice simple solution, and the workqueue helper seems
reasonable.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 2/2] NFS: Improve nfsiod workqueue detection for allocation flags
From: Trond Myklebust @ 2025-07-07 19:25 UTC
To: Benjamin Coddington, Anna Schumaker, Tejun Heo, Lai Jiangshan
Cc: linux-nfs, linux-kernel, djeffery, loberman
On Mon, 2025-07-07 at 14:46 -0400, Benjamin Coddington wrote:
> The NFS client writeback paths change which flags are passed to their
> memory allocation calls based on whether the current task is running
> from
> within a workqueue or not. More specifically, it appears that during
> writeback allocations with PF_WQ_WORKER set on current->flags will
> add
> __GFP_NORETRY | __GFP_NOWARN. Presumably this is because nfsiod can
> simply fail quickly and later retry to write back that specific page
> should
> the allocation fail.
>
> However, the check for PF_WQ_WORKER is too general because tasks can
> enter NFS
> writeback paths from other workqueues. Specifically, the loopback
> driver
> tends to perform writeback into backing files on NFS with
> PF_WQ_WORKER set,
> and additionally sets PF_MEMALLOC_NOIO. The combination of
> PF_MEMALLOC_NOIO with __GFP_NORETRY can easily result in allocation
> failures and the loopback driver has no retry functionality. As a
> result,
> after commit 0bae835b63c5 ("NFS: Avoid writeback threads getting
> stuck in
> mempool_alloc()") users are seeing corrupted loop-mounted filesystems
> backed
> by image files on NFS.
>
> In a preceding patch, we introduced a function to allow NFS to detect
> if
> the task is executing within a specific workqueue. Here we use that
> helper
> to set __GFP_NORETRY | __GFP_NOWARN only if the workqueue is nfsiod.
>
> Fixes: 0bae835b63c5 ("NFS: Avoid writeback threads getting stuck in
> mempool_alloc()")
> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> ---
> fs/nfs/internal.h | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index 69c2c10ee658..173172afa3f5 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -12,6 +12,7 @@
> #include <linux/nfs_page.h>
> #include <linux/nfslocalio.h>
> #include <linux/wait_bit.h>
> +#include <linux/workqueue.h>
>
> #define NFS_SB_MASK (SB_NOSUID|SB_NODEV|SB_NOEXEC|SB_SYNCHRONOUS)
>
> @@ -669,9 +670,18 @@ nfs_write_match_verf(const struct nfs_writeverf
> *verf,
> !nfs_write_verifier_cmp(&req->wb_verf, &verf-
> >verifier);
> }
>
> +static inline bool is_nfsiod(void)
> +{
> + struct workqueue_struct *current_wq = current_workqueue();
> +
> + if (current_wq)
> + return current_wq == nfsiod_workqueue;
> + return false;
> +}
> +
> static inline gfp_t nfs_io_gfp_mask(void)
> {
> - if (current->flags & PF_WQ_WORKER)
> + if (is_nfsiod())
> return GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> return GFP_KERNEL;
> }
Instead of trying to identify the nfsiod_workqueue, why not apply
current_gfp_context() in order to weed out callers that set
PF_MEMALLOC_NOIO and PF_MEMALLOC_NOFS?
i.e.
static inline gfp_t nfs_io_gfp_mask(void)
{
	gfp_t ret = current_gfp_context(GFP_KERNEL);

	if ((current->flags & PF_WQ_WORKER) && ret == GFP_KERNEL)
		ret |= __GFP_NORETRY | __GFP_NOWARN;
	return ret;
}
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com
* Re: [PATCH 2/2] NFS: Improve nfsiod workqueue detection for allocation flags
From: Benjamin Coddington @ 2025-07-07 20:12 UTC
To: Trond Myklebust
Cc: Anna Schumaker, Tejun Heo, Lai Jiangshan, linux-nfs, linux-kernel,
djeffery, loberman
On 7 Jul 2025, at 15:25, Trond Myklebust wrote:
> On Mon, 2025-07-07 at 14:46 -0400, Benjamin Coddington wrote:
>> The NFS client writeback paths change which flags are passed to their
>> memory allocation calls based on whether the current task is running
>> from
>> within a workqueue or not. More specifically, it appears that during
>> writeback allocations with PF_WQ_WORKER set on current->flags will
>> add
>> __GFP_NORETRY | __GFP_NOWARN. Presumably this is because nfsiod can
>> simply fail quickly and later retry to write back that specific page
>> should
>> the allocation fail.
>>
>> However, the check for PF_WQ_WORKER is too general because tasks can
>> enter NFS
>> writeback paths from other workqueues. Specifically, the loopback
>> driver
>> tends to perform writeback into backing files on NFS with
>> PF_WQ_WORKER set,
>> and additionally sets PF_MEMALLOC_NOIO. The combination of
>> PF_MEMALLOC_NOIO with __GFP_NORETRY can easily result in allocation
>> failures and the loopback driver has no retry functionality. As a
>> result,
>> after commit 0bae835b63c5 ("NFS: Avoid writeback threads getting
>> stuck in
>> mempool_alloc()") users are seeing corrupted loop-mounted filesystems
>> backed
>> by image files on NFS.
>>
>> In a preceding patch, we introduced a function to allow NFS to detect
>> if
>> the task is executing within a specific workqueue. Here we use that
>> helper
>> to set __GFP_NORETRY | __GFP_NOWARN only if the workqueue is nfsiod.
>>
>> Fixes: 0bae835b63c5 ("NFS: Avoid writeback threads getting stuck in
>> mempool_alloc()")
>> Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
>> ---
>> fs/nfs/internal.h | 12 +++++++++++-
>> 1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
>> index 69c2c10ee658..173172afa3f5 100644
>> --- a/fs/nfs/internal.h
>> +++ b/fs/nfs/internal.h
>> @@ -12,6 +12,7 @@
>> #include <linux/nfs_page.h>
>> #include <linux/nfslocalio.h>
>> #include <linux/wait_bit.h>
>> +#include <linux/workqueue.h>
>>
>> #define NFS_SB_MASK (SB_NOSUID|SB_NODEV|SB_NOEXEC|SB_SYNCHRONOUS)
>>
>> @@ -669,9 +670,18 @@ nfs_write_match_verf(const struct nfs_writeverf
>> *verf,
>> !nfs_write_verifier_cmp(&req->wb_verf, &verf-
>>> verifier);
>> }
>>
>> +static inline bool is_nfsiod(void)
>> +{
>> + struct workqueue_struct *current_wq = current_workqueue();
>> +
>> + if (current_wq)
>> + return current_wq == nfsiod_workqueue;
>> + return false;
>> +}
>> +
>> static inline gfp_t nfs_io_gfp_mask(void)
>> {
>> - if (current->flags & PF_WQ_WORKER)
>> + if (is_nfsiod())
>> return GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
>> return GFP_KERNEL;
>> }
>
>
> Instead of trying to identify the nfsiod_workqueue, why not apply
> current_gfp_context() in order to weed out callers that set
> PF_MEMALLOC_NOIO and PF_MEMALLOC_NOFS?
>
> i.e.
>
>
> static inline gfp_t nfs_io_gfp_mask(void)
> {
> gfp_t ret = current_gfp_context(GFP_KERNEL);
>
> if ((current->flags & PF_WQ_WORKER) && ret == GFP_KERNEL)
> ret |= __GFP_NORETRY | __GFP_NOWARN;
> return ret;
> }
This would fix the problem we see, but we'll also end up carrying the
flags from the layer above NFS into the client's current allocation
strategy. That seems to work against part of the original intent - we have
static, known flags for NFS' allocations in either context.
On the other hand, perhaps we want to honor those flags if the upper layer
is setting them, because it should have a good reason -- to avoid deadlocks.
We originally considered your suggested flag-checking approach, but went the
"is_nfsiod" way because that seems like the actual intent of checking for
PF_WQ_WORKER. The code then clarifies what's actually wanted, and we don't
end up with future problems (what if nfsiod changes its PF_ flags in the
future but the author doesn't know to update this function?)
Do you prefer this approach? I can send it with your Suggested-by or
authorship.
The other way to fix this might be to create a mempool for nfs_page - which
is the one place that uses nfs_io_gfp_mask() that doesn't fall back to a
mempool. We haven't tested that.
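Untested, but presumably it would look something like the existing NFS
write/commit mempools - a rough sketch, where NFS_PAGE_POOL_SIZE and
nfs_page_mempool are just assumed names:

#include <linux/mempool.h>

/* Rough sketch: layer a mempool over the existing nfs_page slab cache */
#define NFS_PAGE_POOL_SIZE 16

static mempool_t *nfs_page_mempool;

static int nfs_page_mempool_init(struct kmem_cache *nfs_page_cachep)
{
	nfs_page_mempool = mempool_create_slab_pool(NFS_PAGE_POOL_SIZE,
						    nfs_page_cachep);
	return nfs_page_mempool ? 0 : -ENOMEM;
}

Allocation would then sleep on the pool instead of failing outright, e.g.
req = mempool_alloc(nfs_page_mempool, nfs_io_gfp_mask());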
Thanks for the look.
Ben
* Re: [PATCH 2/2] NFS: Improve nfsiod workqueue detection for allocation flags
From: Laurence Oberman @ 2025-07-07 20:28 UTC
To: Trond Myklebust, Benjamin Coddington, Anna Schumaker, Tejun Heo,
Lai Jiangshan
Cc: linux-nfs, linux-kernel, djeffery
On Mon, 2025-07-07 at 12:25 -0700, Trond Myklebust wrote:
> On Mon, 2025-07-07 at 14:46 -0400, Benjamin Coddington wrote:
> > The NFS client writeback paths change which flags are passed to
> > their
> > memory allocation calls based on whether the current task is
> > running
> > from
> > within a workqueue or not. More specifically, it appears that
> > during
> > writeback allocations with PF_WQ_WORKER set on current->flags will
> > add
> > __GFP_NORETRY | __GFP_NOWARN. Presumably this is because nfsiod
> > can
> > simply fail quickly and later retry to write back that specific
> > page
> > should
> > the allocation fail.
> >
> > However, the check for PF_WQ_WORKER is too general because tasks
> > can
> > enter NFS
> > writeback paths from other workqueues. Specifically, the loopback
> > driver
> > tends to perform writeback into backing files on NFS with
> > PF_WQ_WORKER set,
> > and additionally sets PF_MEMALLOC_NOIO. The combination of
> > PF_MEMALLOC_NOIO with __GFP_NORETRY can easily result in allocation
> > failures and the loopback driver has no retry functionality. As a
> > result,
> > after commit 0bae835b63c5 ("NFS: Avoid writeback threads getting
> > stuck in
> > mempool_alloc()") users are seeing corrupted loop-mounted
> > filesystems
> > backed
> > by image files on NFS.
> >
> > In a preceding patch, we introduced a function to allow NFS to
> > detect
> > if
> > the task is executing within a specific workqueue. Here we use
> > that
> > helper
> > to set __GFP_NORETRY | __GFP_NOWARN only if the workqueue is
> > nfsiod.
> >
> > Fixes: 0bae835b63c5 ("NFS: Avoid writeback threads getting stuck in
> > mempool_alloc()")
> > Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> > ---
> > fs/nfs/internal.h | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> > index 69c2c10ee658..173172afa3f5 100644
> > --- a/fs/nfs/internal.h
> > +++ b/fs/nfs/internal.h
> > @@ -12,6 +12,7 @@
> > #include <linux/nfs_page.h>
> > #include <linux/nfslocalio.h>
> > #include <linux/wait_bit.h>
> > +#include <linux/workqueue.h>
> >
> > #define NFS_SB_MASK (SB_NOSUID|SB_NODEV|SB_NOEXEC|SB_SYNCHRONOUS)
> >
> > @@ -669,9 +670,18 @@ nfs_write_match_verf(const struct
> > nfs_writeverf
> > *verf,
> > !nfs_write_verifier_cmp(&req->wb_verf, &verf-
> > > verifier);
> > }
> >
> > +static inline bool is_nfsiod(void)
> > +{
> > + struct workqueue_struct *current_wq = current_workqueue();
> > +
> > + if (current_wq)
> > + return current_wq == nfsiod_workqueue;
> > + return false;
> > +}
> > +
> > static inline gfp_t nfs_io_gfp_mask(void)
> > {
> > - if (current->flags & PF_WQ_WORKER)
> > + if (is_nfsiod())
> > return GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> > return GFP_KERNEL;
> > }
>
>
> Instead of trying to identify the nfsiod_workqueue, why not apply
> current_gfp_context() in order to weed out callers that set
> PF_MEMALLOC_NOIO and PF_MEMALLOC_NOFS?
>
> i.e.
>
>
> static inline gfp_t nfs_io_gfp_mask(void)
> {
> gfp_t ret = current_gfp_context(GFP_KERNEL);
>
> if ((current->flags & PF_WQ_WORKER) && ret == GFP_KERNEL)
> ret |= __GFP_NORETRY | __GFP_NOWARN;
> return ret;
> }
>
>
I am testing both patch options to see if both prevent the failed write
with no other impact and will report back.
The test is confined to the use case of an XFS filesystem served from an
image located on NFS, as that is where the failed writes were seen.
* Re: [PATCH 2/2] NFS: Improve nfsiod workqueue detection for allocation flags
From: Trond Myklebust @ 2025-07-07 20:42 UTC
To: Benjamin Coddington
Cc: Anna Schumaker, Tejun Heo, Lai Jiangshan, linux-nfs, linux-kernel,
djeffery, loberman
On Mon, 2025-07-07 at 16:12 -0400, Benjamin Coddington wrote:
> On 7 Jul 2025, at 15:25, Trond Myklebust wrote:
>
> > On Mon, 2025-07-07 at 14:46 -0400, Benjamin Coddington wrote:
> > > The NFS client writeback paths change which flags are passed to
> > > their
> > > memory allocation calls based on whether the current task is
> > > running
> > > from
> > > within a workqueue or not. More specifically, it appears that
> > > during
> > > writeback allocations with PF_WQ_WORKER set on current->flags
> > > will
> > > add
> > > __GFP_NORETRY | __GFP_NOWARN. Presumably this is because nfsiod
> > > can
> > > simply fail quickly and later retry to write back that specific
> > > page
> > > should
> > > the allocation fail.
> > >
> > > However, the check for PF_WQ_WORKER is too general because tasks
> > > can
> > > enter NFS
> > > writeback paths from other workqueues. Specifically, the
> > > loopback
> > > driver
> > > tends to perform writeback into backing files on NFS with
> > > PF_WQ_WORKER set,
> > > and additionally sets PF_MEMALLOC_NOIO. The combination of
> > > PF_MEMALLOC_NOIO with __GFP_NORETRY can easily result in
> > > allocation
> > > failures and the loopback driver has no retry functionality. As
> > > a
> > > result,
> > > after commit 0bae835b63c5 ("NFS: Avoid writeback threads getting
> > > stuck in
> > > mempool_alloc()") users are seeing corrupted loop-mounted
> > > filesystems
> > > backed
> > > by image files on NFS.
> > >
> > > In a preceding patch, we introduced a function to allow NFS to
> > > detect
> > > if
> > > the task is executing within a specific workqueue. Here we use
> > > that
> > > helper
> > > to set __GFP_NORETRY | __GFP_NOWARN only if the workqueue is
> > > nfsiod.
> > >
> > > Fixes: 0bae835b63c5 ("NFS: Avoid writeback threads getting stuck
> > > in
> > > mempool_alloc()")
> > > Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> > > ---
> > > fs/nfs/internal.h | 12 +++++++++++-
> > > 1 file changed, 11 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> > > index 69c2c10ee658..173172afa3f5 100644
> > > --- a/fs/nfs/internal.h
> > > +++ b/fs/nfs/internal.h
> > > @@ -12,6 +12,7 @@
> > > #include <linux/nfs_page.h>
> > > #include <linux/nfslocalio.h>
> > > #include <linux/wait_bit.h>
> > > +#include <linux/workqueue.h>
> > >
> > > #define NFS_SB_MASK
> > > (SB_NOSUID|SB_NODEV|SB_NOEXEC|SB_SYNCHRONOUS)
> > >
> > > @@ -669,9 +670,18 @@ nfs_write_match_verf(const struct
> > > nfs_writeverf
> > > *verf,
> > > !nfs_write_verifier_cmp(&req->wb_verf, &verf-
> > > > verifier);
> > > }
> > >
> > > +static inline bool is_nfsiod(void)
> > > +{
> > > + struct workqueue_struct *current_wq =
> > > current_workqueue();
> > > +
> > > + if (current_wq)
> > > + return current_wq == nfsiod_workqueue;
> > > + return false;
> > > +}
> > > +
> > > static inline gfp_t nfs_io_gfp_mask(void)
> > > {
> > > - if (current->flags & PF_WQ_WORKER)
> > > + if (is_nfsiod())
> > > return GFP_KERNEL | __GFP_NORETRY |
> > > __GFP_NOWARN;
> > > return GFP_KERNEL;
> > > }
> >
> >
> > Instead of trying to identify the nfsiod_workqueue, why not apply
> > current_gfp_context() in order to weed out callers that set
> > PF_MEMALLOC_NOIO and PF_MEMALLOC_NOFS?
> >
> > i.e.
> >
> >
> > static inline gfp_t nfs_io_gfp_mask(void)
> > {
> > gfp_t ret = current_gfp_context(GFP_KERNEL);
> >
> > if ((current->flags & PF_WQ_WORKER) && ret == GFP_KERNEL)
> > ret |= __GFP_NORETRY | __GFP_NOWARN;
> > return ret;
> > }
>
> This would fix the problem we see, but we'll also end up carrying the
> flags
> from the layer above NFS into the client's current allocation
> strategy.
> That seems fragile to part of the original intent - we have static
> known
> flags for NFS' allocation in either context.
Yes, but if the PF_MEMALLOC_NOIO or PF_MEMALLOC_NOFS flags are set, the
memory manager will in any case water those flags down using the same
call to current_gfp_context().
This is really just making sure that we don't set the __GFP_NORETRY
flag in that case, because in a low memory situation that could end up
deadlocking due to being unable to kick off I/O in order to free up
memory.
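For reference, current_gfp_context() (include/linux/sched/mm.h) waters the
mask down roughly like this - a simplified sketch of the real helper, not
the verbatim kernel source:

static inline gfp_t current_gfp_context(gfp_t flags)
{
	unsigned int pflags = READ_ONCE(current->flags);

	/* NOIO is the stronger restriction, so it takes precedence */
	if (pflags & PF_MEMALLOC_NOIO)
		flags &= ~(__GFP_IO | __GFP_FS);
	else if (pflags & PF_MEMALLOC_NOFS)
		flags &= ~__GFP_FS;
	return flags;
}

So the ret == GFP_KERNEL test above only passes when neither
PF_MEMALLOC_NOIO nor PF_MEMALLOC_NOFS is set on the caller.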
> On the other hand, perhaps we want to honor those flags if the upper
> layer
> is setting them, because it should have a good reason -- to avoid
> deadlocks.
>
> We originally considered your suggested flag-checking approach, but
> went the
> "is_nfsiod" way because that seems like the actual intent of checking
> for
> PF_WQ_WORKER. The code then clarifies what's actually wanted, and we
> don't
> end up with future problems (what if nfsiod changes its PF_ flags in
> the
> future but the author doesn't know to update this function?)
If that were ever to happen, then we'd be well up the creek and without
a paddle.
Firstly, there is so much VFS work going on in nfsiod (dput(),
path_put(), iput(),....), that we really do not want to encumber it
with any PF_MEMALLOC restrictions.
Secondly, if we were to do so anyway, we definitely would want to
revisit this function, in addition to all those RPC callbacks that
would be affected.
> Do you prefer this approach? I can send it with your Suggested-by or
> authorship.
>
> The other way to fix this might be to create a mempool for nfs_page -
> which
> is the one place that uses nfs_io_gfp_mask() that doesn't fall back
> to a
> mempool. We haven't tested that.
I think I prefer an approach that is aware of the limitations imposed
by the memory manager rather than one that just worries about which
workqueue we're on.
Note that one of the main differences between rpciod and nfsiod is
precisely the PF_MEMALLOC settings.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com
* Re: [PATCH 1/2] workqueue: Add a helper to identify current workqueue
From: Tejun Heo @ 2025-07-08 4:37 UTC
To: Benjamin Coddington
Cc: Trond Myklebust, Anna Schumaker, Lai Jiangshan, linux-nfs,
linux-kernel, djeffery, loberman
On Mon, Jul 07, 2025 at 02:46:03PM -0400, Benjamin Coddington wrote:
> Introduce a new helper current_workqueue() which returns the current task's
> workqueue pointer or NULL if not a workqueue worker.
>
> This will allow the NFS client to recognize the case where writeback occurs
> within the nfsiod workqueue or is being submitted directly. NFS would like
> to change the GFP_ flags for memory allocation to avoid stalls or cycles in
> memory pools based on which context writeback is occurring. In a following
> patch, this helper detects the case rather than checking the PF_WQ_WORKER
> flag which can be passed along from another workqueue worker.
There's already current_work(). Wouldn't that be enough for identifying
the current work item?
Thanks.
--
tejun
* Re: [PATCH 1/2] workqueue: Add a helper to identify current workqueue
From: Benjamin Coddington @ 2025-07-08 10:25 UTC
To: Tejun Heo
Cc: Trond Myklebust, Anna Schumaker, Lai Jiangshan, linux-nfs,
linux-kernel, djeffery, loberman
On 8 Jul 2025, at 0:37, Tejun Heo wrote:
> On Mon, Jul 07, 2025 at 02:46:03PM -0400, Benjamin Coddington wrote:
>> Introduce a new helper current_workqueue() which returns the current task's
>> workqueue pointer or NULL if not a workqueue worker.
>>
>> This will allow the NFS client to recognize the case where writeback occurs
>> within the nfsiod workqueue or is being submitted directly. NFS would like
>> to change the GFP_ flags for memory allocation to avoid stalls or cycles in
>> memory pools based on which context writeback is occurring. In a following
>> patch, this helper detects the case rather than checking the PF_WQ_WORKER
>> flag which can be passed along from another workqueue worker.
>
> There's already current_work(). Wouldn't that be enough for identifying
> whether the current work item?
NFS submits different work items to the same workqueue, so comparing the
workqueue instead of the work items made more sense.
After discussion on patch 2 yesterday, I think we're going to try to fix
this in NFS using a different approach that won't need this helper now.
Thanks for the look Tejun.
Ben
* Re: [PATCH 2/2] NFS: Improve nfsiod workqueue detection for allocation flags
From: Laurence Oberman @ 2025-07-08 16:50 UTC
To: Trond Myklebust, Benjamin Coddington, Anna Schumaker, Tejun Heo,
Lai Jiangshan
Cc: linux-nfs, linux-kernel, djeffery
On Mon, 2025-07-07 at 16:28 -0400, Laurence Oberman wrote:
> On Mon, 2025-07-07 at 12:25 -0700, Trond Myklebust wrote:
> > On Mon, 2025-07-07 at 14:46 -0400, Benjamin Coddington wrote:
> > > The NFS client writeback paths change which flags are passed to
> > > their
> > > memory allocation calls based on whether the current task is
> > > running
> > > from
> > > within a workqueue or not. More specifically, it appears that
> > > during
> > > writeback allocations with PF_WQ_WORKER set on current->flags
> > > will
> > > add
> > > __GFP_NORETRY | __GFP_NOWARN. Presumably this is because nfsiod
> > > can
> > > simply fail quickly and later retry to write back that specific
> > > page
> > > should
> > > the allocation fail.
> > >
> > > However, the check for PF_WQ_WORKER is too general because tasks
> > > can
> > > enter NFS
> > > writeback paths from other workqueues. Specifically, the
> > > loopback
> > > driver
> > > tends to perform writeback into backing files on NFS with
> > > PF_WQ_WORKER set,
> > > and additionally sets PF_MEMALLOC_NOIO. The combination of
> > > PF_MEMALLOC_NOIO with __GFP_NORETRY can easily result in
> > > allocation
> > > failures and the loopback driver has no retry functionality. As
> > > a
> > > result,
> > > after commit 0bae835b63c5 ("NFS: Avoid writeback threads getting
> > > stuck in
> > > mempool_alloc()") users are seeing corrupted loop-mounted
> > > filesystems
> > > backed
> > > by image files on NFS.
> > >
> > > In a preceding patch, we introduced a function to allow NFS to
> > > detect
> > > if
> > > the task is executing within a specific workqueue. Here we use
> > > that
> > > helper
> > > to set __GFP_NORETRY | __GFP_NOWARN only if the workqueue is
> > > nfsiod.
> > >
> > > Fixes: 0bae835b63c5 ("NFS: Avoid writeback threads getting stuck
> > > in
> > > mempool_alloc()")
> > > Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
> > > ---
> > > fs/nfs/internal.h | 12 +++++++++++-
> > > 1 file changed, 11 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> > > index 69c2c10ee658..173172afa3f5 100644
> > > --- a/fs/nfs/internal.h
> > > +++ b/fs/nfs/internal.h
> > > @@ -12,6 +12,7 @@
> > > #include <linux/nfs_page.h>
> > > #include <linux/nfslocalio.h>
> > > #include <linux/wait_bit.h>
> > > +#include <linux/workqueue.h>
> > >
> > > #define NFS_SB_MASK
> > > (SB_NOSUID|SB_NODEV|SB_NOEXEC|SB_SYNCHRONOUS)
> > >
> > > @@ -669,9 +670,18 @@ nfs_write_match_verf(const struct
> > > nfs_writeverf
> > > *verf,
> > > !nfs_write_verifier_cmp(&req->wb_verf, &verf-
> > > > verifier);
> > > }
> > >
> > > +static inline bool is_nfsiod(void)
> > > +{
> > > + struct workqueue_struct *current_wq =
> > > current_workqueue();
> > > +
> > > + if (current_wq)
> > > + return current_wq == nfsiod_workqueue;
> > > + return false;
> > > +}
> > > +
> > > static inline gfp_t nfs_io_gfp_mask(void)
> > > {
> > > - if (current->flags & PF_WQ_WORKER)
> > > + if (is_nfsiod())
> > > return GFP_KERNEL | __GFP_NORETRY |
> > > __GFP_NOWARN;
> > > return GFP_KERNEL;
> > > }
> >
> >
> > Instead of trying to identify the nfsiod_workqueue, why not apply
> > current_gfp_context() in order to weed out callers that set
> > PF_MEMALLOC_NOIO and PF_MEMALLOC_NOFS?
> >
> > i.e.
> >
> >
> > static inline gfp_t nfs_io_gfp_mask(void)
> > {
> > gfp_t ret = current_gfp_context(GFP_KERNEL);
> >
> > if ((current->flags & PF_WQ_WORKER) && ret == GFP_KERNEL)
> > ret |= __GFP_NORETRY | __GFP_NOWARN;
> > return ret;
> > }
> >
> >
>
> I am testing both patch options to see if both prevent the failed
> write
> with no other impact and will report back.
>
> The test is confined to the use case of an XFS file system served by
> an
> image that is located on NFS. as that is where the failed writes were
> seen.
>
>
>
Both Ben's patch and Trond's fix the failing write issue, so I guess we
need to decide what the final fix will be.
For both solutions
Tested-by: Laurence Oberman <loberman@redhat.com>
* Re: [PATCH 2/2] NFS: Improve nfsiod workqueue detection for allocation flags
From: Benjamin Coddington @ 2025-07-08 17:03 UTC
To: Laurence Oberman
Cc: Trond Myklebust, Anna Schumaker, Tejun Heo, Lai Jiangshan,
linux-nfs, linux-kernel, djeffery
On 8 Jul 2025, at 12:50, Laurence Oberman wrote:
> Both Ben's patch and Trond's fix the failing write issue so I guess we
> need to decide what the final fix will be.
>
> For both solutions
> Tested-by: Laurence Oberman <loberman@redhat.com>
Thanks Laurence! I think we'll leave these two patches behind.
I'm persuaded by Trond's arguments, and along with not needing to add the
workqueue helper, I've properly posted that approach here after some minimal
testing:
https://lore.kernel.org/linux-nfs/6892807b15cb401f3015e2acdaf1c2ba2bcae130.1751975813.git.bcodding@redhat.com/T/#u
There's only a difference of a comment, so it should be safe to reply with
your Tested-by there.
Ben
* Re: [PATCH 2/2] NFS: Improve nfsiod workqueue detection for allocation flags
From: Laurence Oberman @ 2025-07-08 17:09 UTC
To: Benjamin Coddington
Cc: Trond Myklebust, Anna Schumaker, Tejun Heo, Lai Jiangshan,
linux-nfs, linux-kernel, djeffery
On Tue, 2025-07-08 at 13:03 -0400, Benjamin Coddington wrote:
> On 8 Jul 2025, at 12:50, Laurence Oberman wrote:
>
> > Both Ben's patch and Trond's fix the failing write issue so I guess
> > we
> > need to decide what the final fix will be.
> >
> > For both solutions
> > Tested-by: Laurence Oberman <loberman@redhat.com>
>
> Thanks Laurence! I think we'll leave these two patches behind.
>
> I'm persuaded by Trond's arguments, and along with not needing to add
> the
> workqueue helper, I've properly posted that approach here after some
> minimal
> testing:
>
> https://lore.kernel.org/linux-nfs/6892807b15cb401f3015e2acdaf1c2ba2bcae130.1751975813.git.bcodding@redhat.com/T/#u
>
> There's only a difference of a comment, so it should be safe to reply
> with
> your Tested-by there.
>
> Ben
>
Thank you Ben and Trond.
Confirming that this patch works to correct this issue.
Looks good.
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>