From: Florian Westphal <fw@strlen.de>
To: Will Deacon <will@kernel.org>
Cc: Florian Westphal <fw@strlen.de>,
subashab@codeaurora.org, pablo@netfilter.org,
Sean Tranchetti <stranche@codeaurora.org>,
netfilter-devel@vger.kernel.org, peterz@infradead.org,
tglx@linutronix.de
Subject: Re: [PATCH nf] x_tables: Properly close read section with read_seqcount_retry
Date: Wed, 18 Nov 2020 13:42:28 +0100
Message-ID: <20201118124228.GJ22792@breakpoint.cc>
In-Reply-To: <20201118121322.GA1821@willie-the-truck>
Will Deacon <will@kernel.org> wrote:
> On Mon, Nov 16, 2020 at 07:20:28PM +0100, Florian Westphal wrote:
> > subashab@codeaurora.org <subashab@codeaurora.org> wrote:
> > > > > Unfortunately we are seeing it on ARM64 regression systems
> > > > > which run a variety of usecases, so the exact steps are not
> > > > > known.
> > > >
> > > > Ok. Would you be willing to run some of those with your suggested
> > > > change to see if that resolves the crashes, or is it so rare that
> > > > this isn't practical?
> > >
> > > I can try that out. Let me know if you have any other suggestions as well
> > > and I can try that too.
> > >
> > > I assume we can't add locks here as it would be in the packet
> > > processing path.
> >
> > Yes. We can add a synchronize_net() in xt_replace_table if needed
> > though, before starting to put the references on the old ruleset.
> > This would avoid the free of the jumpstack while skbs are still
> > in-flight.
>
> I tried to make sense of what's going on here, and it looks to me like
> the interaction between xt_replace_table() and ip6t_do_table() relies on
> store->load order that is _not_ enforced in the code.
>
> xt_replace_table() does:
>
>	table->private = newinfo;
>
>	/* make sure all cpus see new ->private value */
>	smp_wmb();
>
>	/*
>	 * Even though table entries have now been swapped, other CPU's
>	 * may still be using the old entries...
>	 */
>	local_bh_enable();
>
>	/* ... so wait for even xt_recseq on all cpus */
>	for_each_possible_cpu(cpu) {
>		seqcount_t *s = &per_cpu(xt_recseq, cpu);
>		u32 seq = raw_read_seqcount(s);
>
>		if (seq & 1) {
>			do {
>				cond_resched();
>				cpu_relax();
>			} while (seq == raw_read_seqcount(s));
>		}
>	}
>
> and I think the idea here is that we swizzle the table->private pointer
> to point to the new data, and then wait for all pre-existing readers to
> finish with the old table (isn't this what RCU is for?).
Yes and yes. Before the current 'for_each_possible_cpu() + seqcount'
abuse the abuse was just implicit: xt_replace_table() did NOT wait for
readers to go away and relied on arp, eb and ip(6)tables to copy the
counters back to userspace after xt_replace_table(). This meant a
read_seqcount_begin/retry pair for each counter and eventually drew
complaints from k8s users that x_tables rule replacement was too slow.
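
For reference, the per-counter retry looked roughly like this (a
from-memory sketch of the old ip(6)tables get_counters() walk, not a
verbatim copy; helper names as in the iptables sources):

static void get_counters(const struct xt_table_info *t,
			 struct xt_counters counters[])
{
	struct ipt_entry *iter;
	unsigned int cpu, i;

	for_each_possible_cpu(cpu) {
		seqcount_t *s = &per_cpu(xt_recseq, cpu);

		i = 0;
		xt_entry_foreach(iter, t->entries, t->size) {
			struct xt_counters *tmp;
			u64 bcnt, pcnt;
			unsigned int start;

			tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
			do {
				/* retry while this cpu is mid-traversal */
				start = read_seqcount_begin(s);
				bcnt = tmp->bcnt;
				pcnt = tmp->pcnt;
			} while (read_seqcount_retry(s, start));

			ADD_COUNTER(counters[i], bcnt, pcnt);
			++i;
		}
	}
}

With large rulesets that is one retry loop per rule per cpu, hence the
move to the single wait in xt_replace_table().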
[..]
> Given that this appears to be going wrong in practice, I've included a quick
> hack below which should fix the ordering. However, I think it would be
> better if we could avoid abusing the seqcount code for this sort of thing.
I'd be ok with going via the simpler solution & waiting to see if k8s
users complain that it's too slow. Those memory blobs can be huge, so I
would not use call_rcu here.
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index af22dbe85e2c..b5911985d1eb 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1384,7 +1384,7 @@ xt_replace_table(struct xt_table *table,
 	 * private.
 	 */
 	smp_wmb();
-	table->private = newinfo;
+	WRITE_ONCE(table->private, newinfo);
 
 	/* make sure all cpus see new ->private value */
 	smp_wmb();
@@ -1394,19 +1394,7 @@ xt_replace_table(struct xt_table *table,
 	 * may still be using the old entries...
 	 */
 	local_bh_enable();
-
-	/* ... so wait for even xt_recseq on all cpus */
-	for_each_possible_cpu(cpu) {
-		seqcount_t *s = &per_cpu(xt_recseq, cpu);
-		u32 seq = raw_read_seqcount(s);
-
-		if (seq & 1) {
-			do {
-				cond_resched();
-				cpu_relax();
-			} while (seq == raw_read_seqcount(s));
-		}
-	}
+	synchronize_rcu();
 
 	audit_log_nfcfg(table->name, table->af, private->number,
 			!private->number ? AUDIT_XT_OP_REGISTER :
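
Untested sketch of the reader side that would pair with the
synchronize_rcu() above (assumes ->private gets annotated __rcu; the
diff only touches the writer, and do_table_sketch() is a made-up name
standing in for the ip(6)t_do_table packet walk):

static unsigned int do_table_sketch(struct sk_buff *skb,
				    struct xt_table *table)
{
	const struct xt_table_info *private;
	unsigned int verdict;

	/* the hook already runs with BH disabled, which is an RCU
	 * read-side section since the flavor consolidation; the
	 * explicit disable here just documents the pairing.
	 */
	local_bh_disable();
	private = rcu_dereference(table->private);

	/* ... walk private->entries using this cpu's jumpstack ... */
	verdict = NF_ACCEPT;

	local_bh_enable();
	return verdict;
}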