Re: [RFC net] Should sk_page_frag() also look at the current GFP context?

From: Guillaume Nault <gnault@redhat.com>
To: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	netdev <netdev@vger.kernel.org>,
	Chuck Lever <chuck.lever@oracle.com>,
	Jeff Layton <jlayton@kernel.org>,
	Trond Myklebust <trond.myklebust@hammerspace.com>,
	Anna Schumaker <anna@kernel.org>,
	Steve French <sfrench@samba.org>,
	Josef Bacik <josef@toxicpanda.com>,
	Scott Mayhew <smayhew@redhat.com>,
	Benjamin Coddington <bcodding@redhat.com>,
	Tejun Heo <tj@kernel.org>
Subject: Re: [RFC net] Should sk_page_frag() also look at the current GFP context?
Date: Fri, 8 Jul 2022 19:51:47 +0200
Message-ID: <20220708175147.GA3166@debian.home>
In-Reply-To: <CANn89i+=GyHjkrHMZAftB-toEhi9GcAQom1_bpT+S0qMvCz0DQ@mail.gmail.com>

On Thu, Jul 07, 2022 at 06:29:03PM +0200, Eric Dumazet wrote:
> On Fri, Jul 1, 2022 at 8:41 PM Guillaume Nault <gnault@redhat.com> wrote:
> >
> > I'm investigating a kernel oops that looks similar to
> > 20eb4f29b602 ("net: fix sk_page_frag() recursion from memory reclaim")
> > and dacb5d8875cc ("tcp: fix page frag corruption on page fault").
> >
> > This time the problem happens on an NFS client, while the previous bzs
> > respectively used NBD and CIFS. While NBD and CIFS clear __GFP_FS in
> > their socket's ->sk_allocation field (using GFP_NOIO or GFP_NOFS), NFS
> > leaves sk_allocation to its default value since commit a1231fda7e94
> > ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs").
> >
> > To recap the original problems, in commits 20eb4f29b602 and dacb5d8875cc,
> > memory reclaim happened while executing tcp_sendmsg_locked(). The code
> > path entered tcp_sendmsg_locked() recursively as pages to be reclaimed
> > were backed by files on the network. The problem was that both the
> > outer and the inner tcp_sendmsg_locked() calls used current->task_frag,
> > thus leaving it in an inconsistent state. The fix was to use the
> > socket's ->sk_frag instead for the file system socket, so that the
> > inner and outer calls wouldn't step on each other's toes.
> >
> > But now that NFS doesn't modify ->sk_allocation anymore, sk_page_frag()
> > sees sunrpc sockets as plain TCP ones and returns ->task_frag in the
> > inner tcp_sendmsg_locked() call.
> >
> > Also it looks like the trend is to avoid GFP_NOFS and GFP_NOIO and use
> > memalloc_no{fs,io}_save() instead. So maybe other network file systems
> > will also stop setting ->sk_allocation in the future and we should
> > teach sk_page_frag() to look at the current GFP flags. Or should we
> > stick to ->sk_allocation and make NFS drop __GFP_FS again?
> >
> > Signed-off-by: Guillaume Nault <gnault@redhat.com>
> 
> Can you provide a Fixes: tag ?

Fixes: a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs")

> > ---
> >  include/net/sock.h | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 72ca97ccb460..b934c9851058 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -46,6 +46,7 @@
> >  #include <linux/netdevice.h>
> >  #include <linux/skbuff.h>      /* struct sk_buff */
> >  #include <linux/mm.h>
> > +#include <linux/sched/mm.h>
> >  #include <linux/security.h>
> >  #include <linux/slab.h>
> >  #include <linux/uaccess.h>
> > @@ -2503,14 +2504,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
> >   * socket operations and end up recursing into sk_page_frag()
> >   * while it's already in use: explicitly avoid task page_frag
> >   * usage if the caller is potentially doing any of them.
> > - * This assumes that page fault handlers use the GFP_NOFS flags.
> > + * This assumes that page fault handlers use the GFP_NOFS flags
> > + * or run under memalloc_nofs_save() protection.
> >   *
> >   * Return: a per task page_frag if context allows that,
> >   * otherwise a per socket one.
> >   */
> >  static inline struct page_frag *sk_page_frag(struct sock *sk)
> >  {
> > -       if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
> > +       gfp_t gfp_mask = current_gfp_context(sk->sk_allocation);
> 
> This is slowing down TCP sendmsg() fast path, reading current->flags,
> possibly cold value.
> 
> I would suggest using one bit in sk, close to sk->sk_allocation to
> make the decision,
> instead of testing sk->sk_allocation for various flags.

current_gfp_context() looked quite elegant to me as it avoided the need
to duplicate the NOFS/NOIO flag in the socket. But I understand the
performance concern.
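
As a rough illustration of what the patch changes, the test can be
modelled in userspace as below. This is only a sketch: the MODEL_* bit
values and the model_* names are made up for the example and do not
match the kernel's definitions; only the NOIO/NOFS masking and the final
comparison are meant to mirror current_gfp_context() and the proposed
sk_page_frag() test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only: the kernel's real __GFP_* and PF_* values differ. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_IO             0x01u
#define MODEL_GFP_FS             0x02u
#define MODEL_GFP_DIRECT_RECLAIM 0x04u
#define MODEL_GFP_MEMALLOC       0x08u
/* Simplified stand-in for GFP_KERNEL (the real one also allows kswapd reclaim). */
#define MODEL_GFP_KERNEL \
	(MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)

#define MODEL_PF_MEMALLOC_NOIO 0x1u /* set inside a memalloc_noio_save() scope */
#define MODEL_PF_MEMALLOC_NOFS 0x2u /* set inside a memalloc_nofs_save() scope */

/* Hypothetical stand-ins for struct task_struct and struct sock. */
struct model_task { unsigned int flags; };
struct model_sock { model_gfp_t sk_allocation; };

/* Models the NOIO/NOFS masking that current_gfp_context() applies. */
static model_gfp_t model_gfp_context(const struct model_task *task,
				     model_gfp_t flags)
{
	assert(task != NULL);
	if (task->flags & MODEL_PF_MEMALLOC_NOIO)
		flags &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (task->flags & MODEL_PF_MEMALLOC_NOFS)
		flags &= ~MODEL_GFP_FS;
	return flags;
}

/* True when the per-task frag may be used, false when the per-socket one must be. */
static bool model_use_task_frag(const struct model_task *task,
				const struct model_sock *sk)
{
	model_gfp_t gfp_mask;

	assert(sk != NULL);
	gfp_mask = model_gfp_context(task, sk->sk_allocation);
	return (gfp_mask & (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_MEMALLOC |
			    MODEL_GFP_FS)) ==
	       (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_FS);
}
```

In this model, a socket that keeps the default allocation mask still
gets the per-socket frag whenever the calling task runs inside a NOFS or
NOIO scope, which is the NFS case. A test on sk->sk_allocation alone
cannot see that scope.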

> Not sure if we have available holes.

Nothing in the same cache line at least. There's a 1-bit hole in
struct sock_common after skc_net_refcnt. And it should be hot because
of sk->sk_state. We could add a "skc_use_task_frag" bit there, but I'm
not sure if it's worth using this last available bit for this.
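
For comparison, the dedicated-bit approach could look roughly like the
userspace sketch below. Everything here is hypothetical: the model_*
types and helpers are invented for the example, the bit is named after
the "skc_use_task_frag" idea above, and the assumption is that a socket
owner which can be called from memory reclaim would clear the bit once
at socket setup time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page_frag. */
struct model_page_frag { unsigned int offset; };

/* Hypothetical stand-in for the part of task_struct that holds task_frag. */
struct model_task { struct model_page_frag task_frag; };

/* Hypothetical stand-in for struct sock, with the extra bit next to hot fields. */
struct model_sock {
	unsigned char state;            /* stands in for sk->sk_state */
	unsigned int net_refcnt : 1;
	unsigned int use_task_frag : 1; /* hypothetical new bit */
	struct model_page_frag sk_frag;
};

/* Default for ordinary sockets: the per-task frag is allowed. */
static void model_sock_init(struct model_sock *sk)
{
	assert(sk != NULL);
	sk->state = 0;
	sk->net_refcnt = 1;
	sk->use_task_frag = 1;
	sk->sk_frag.offset = 0;
}

/* A socket owner that may run under memory reclaim opts out once. */
static void model_sock_forbid_task_frag(struct model_sock *sk)
{
	assert(sk != NULL);
	sk->use_task_frag = 0;
}

/* The fast path then reads a single socket bit and never touches task flags. */
static struct model_page_frag *model_sk_page_frag(struct model_task *cur,
						  struct model_sock *sk)
{
	assert(cur != NULL && sk != NULL);
	return sk->use_task_frag ? &cur->task_frag : &sk->sk_frag;
}
```

The trade-off compared to current_gfp_context() is that the NOFS/NOIO
information is duplicated in the socket and has to be set by each user,
but the decision no longer depends on current->flags.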

Otherwise, the next available hole is right after sk_bind_phc.
According to pahole, it's two cache lines away from sk_allocation on my
x86_64 build, but that will depend on the size of spinlock_t and thus
on CONFIG_ options. It doesn't look very natural to add a no-reclaim
bit there.

Or maybe we could base the test on sk_kern_sock since the problem
happens on kernel sockets. But that looks like a hack to me, and it
might impact MPTCP, which also creates kernel TCP sockets but shouldn't
have the same constraints as NFS.

> > +
> > +       if ((gfp_mask & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
> >             (__GFP_DIRECT_RECLAIM | __GFP_FS))
> >                 return &current->task_frag;
> >
> > --
> > 2.21.3
> >
>

Thread overview: 9+ messages
2022-07-01 18:41 [RFC net] Should sk_page_frag() also look at the current GFP context? Guillaume Nault
2022-07-07 15:31 ` Benjamin Coddington
2022-07-07 16:29 ` Eric Dumazet
2022-07-08 17:51   ` Guillaume Nault [this message]
2022-07-08 18:10   ` Benjamin Coddington
2022-07-08 20:04     ` Trond Myklebust
2022-07-11 14:07       ` Benjamin Coddington
2022-07-11 15:31         ` Eric Dumazet
2022-09-20 18:50           ` Guillaume Nault
