From: Florian Westphal <fw@strlen.de>
To: Eric Dumazet <edumazet@google.com>
Cc: Ilya Maximets <i.maximets@ovn.org>,
netdev <netdev@vger.kernel.org>,
"David S. Miller" <davem@davemloft.net>,
dev@openvswitch.org, LKML <linux-kernel@vger.kernel.org>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Florian Westphal <fw@strlen.de>
Subject: Re: [PATCH net] net: ensure all external references are released in deferred skbuffs
Date: Wed, 22 Jun 2022 12:28:13 +0200 [thread overview]
Message-ID: <20220622102813.GA24844@breakpoint.cc> (raw)
In-Reply-To: <CANn89iL_EmkEgPAVdhNW4tyzwQbARyji93mUQ9E2MRczWpNm7g@mail.gmail.com>
Eric Dumazet <edumazet@google.com> wrote:
> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >
> > Open vSwitch system test suite is broken due to inability to
> > load/unload netfilter modules. kworker thread is getting trapped
> > in the infinite loop while running a net cleanup inside the
> > nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> > holding nfct references and not being freed by their CPU cores.
> >
> > In general, the idea that we will have an rx interrupt on every
> > CPU core at some point in a near future doesn't seem correct.
> > Devices are getting created and destroyed, interrupts are getting
> > re-scheduled, CPUs are going online and offline dynamically.
> > Any of these events may leave packets stuck in defer list for a
> > long time. It might be OK, if they are just a piece of memory,
> > but we can't afford them holding references to any other resources.
> >
> > In case of OVS, nfct reference keeps the kernel thread in busy loop
> > while holding a 'pernet_ops_rwsem' semaphore. That blocks the
> > later modprobe request from user space:
> >
> > # ps
> > 299 root R 99.3 200:25.89 kworker/u96:4+
> >
> > # journalctl
> > INFO: task modprobe:11787 blocked for more than 1228 seconds.
> > Not tainted 5.19.0-rc2 #8
> > task:modprobe state:D
> > Call Trace:
> > <TASK>
> > __schedule+0x8aa/0x21d0
> > schedule+0xcc/0x200
> > rwsem_down_write_slowpath+0x8e4/0x1580
> > down_write+0xfc/0x140
> > register_pernet_subsys+0x15/0x40
> > nf_nat_init+0xb6/0x1000 [nf_nat]
> > do_one_initcall+0xbb/0x410
> > do_init_module+0x1b4/0x640
> > load_module+0x4c1b/0x58d0
> > __do_sys_init_module+0x1d7/0x220
> > do_syscall_64+0x3a/0x80
> > entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >
> > At this point OVS testsuite is unresponsive and never recover,
> > because these skbuffs are never freed.
> >
> > Solution is to make sure no external references attached to skb
> > before pushing it to the defer list. Using skb_release_head_state()
> > for that purpose. The function modified to be re-enterable, as it
> > will be called again during the defer list flush.
> >
> > Another approach that can fix the OVS use-case, is to kick all
> > cores while waiting for references to be released during the net
> > cleanup. But that sounds more like a workaround for a current
> > issue rather than a proper solution and will not cover possible
> > issues in other parts of the code.
> >
> > Additionally checking for skb_zcopy() while deferring. This might
> > not be necessary, as I'm not sure if we can actually have zero copy
> > packets on this path, but seems worth having for completeness as we
> > should never defer such packets regardless.
> >
> > CC: Eric Dumazet <edumazet@google.com>
> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> > ---
> > net/core/skbuff.c | 16 +++++++++++-----
> > 1 file changed, 11 insertions(+), 5 deletions(-)
>
> I do not think this patch is doing the right thing.
>
> Packets sitting in TCP receive queues should not hold state that is
> not relevant for TCP recvmsg().
Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
not be possible to remove nf_conntrack module in practice.
I wonder where the deferred skbs are coming from, any and all
queued skbs need the conntrack state dropped.
I don't mind a new helper that does a combined dst+ct release though.
next prev parent reply other threads:[~2022-06-22 10:41 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-19 0:39 [PATCH net] net: ensure all external references are released in deferred skbuffs Ilya Maximets
2022-06-22 10:02 ` Eric Dumazet
2022-06-22 10:15 ` Eric Dumazet
2022-06-22 10:28 ` Florian Westphal [this message]
2022-06-22 10:36 ` Eric Dumazet
2022-06-22 11:32 ` Ilya Maximets
2022-06-22 11:43 ` Eric Dumazet
2022-06-22 14:26 ` Ilya Maximets
2022-06-22 14:35 ` Eric Dumazet
2022-06-22 18:09 ` Ilya Maximets
2022-06-22 16:29 ` Eric Dumazet
2022-06-22 16:39 ` Eric Dumazet
2022-06-22 16:47 ` Eric Dumazet
2022-06-22 17:03 ` Eric Dumazet
2022-06-22 18:19 ` Ilya Maximets
2022-06-22 19:04 ` Eric Dumazet
2022-06-22 19:27 ` Eric Dumazet
2022-06-22 21:12 ` Ilya Maximets
2023-09-22 13:26 ` Ilya Maximets
2023-09-22 19:07 ` Ilya Maximets
2023-09-22 21:09 ` Ilya Maximets
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220622102813.GA24844@breakpoint.cc \
--to=fw@strlen.de \
--cc=davem@davemloft.net \
--cc=dev@openvswitch.org \
--cc=edumazet@google.com \
--cc=i.maximets@ovn.org \
--cc=kuba@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.