From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.2 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6CFF2C63697 for ; Wed, 18 Nov 2020 12:42:43 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EF12324199 for ; Wed, 18 Nov 2020 12:42:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726273AbgKRMmj (ORCPT ); Wed, 18 Nov 2020 07:42:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50778 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726136AbgKRMmj (ORCPT ); Wed, 18 Nov 2020 07:42:39 -0500 Received: from Chamillionaire.breakpoint.cc (Chamillionaire.breakpoint.cc [IPv6:2a0a:51c0:0:12e:520::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 625DDC0613D6 for ; Wed, 18 Nov 2020 04:42:39 -0800 (PST) Received: from fw by Chamillionaire.breakpoint.cc with local (Exim 4.92) (envelope-from ) id 1kfMn6-0003Br-P7; Wed, 18 Nov 2020 13:42:28 +0100 Date: Wed, 18 Nov 2020 13:42:28 +0100 From: Florian Westphal To: Will Deacon Cc: Florian Westphal , subashab@codeaurora.org, pablo@netfilter.org, Sean Tranchetti , netfilter-devel@vger.kernel.org, peterz@infradead.org, tglx@linutronix.de Subject: Re: [PATCH nf] x_tables: Properly close read section with read_seqcount_retry Message-ID: <20201118124228.GJ22792@breakpoint.cc> References: <1605320516-17810-1-git-send-email-stranche@codeaurora.org> <20201114165330.GM23619@breakpoint.cc> <2ab4bcb63cbacba12ad927621fb56aab@codeaurora.org> <20201116141810.GB22792@breakpoint.cc> <8256f40ba9b73181f121baafe12cac61@codeaurora.org> <20201116170440.GA26150@breakpoint.cc> <983d178e6f3aac81d491362ab60db61f@codeaurora.org> <20201116182028.GE22792@breakpoint.cc> <20201118121322.GA1821@willie-the-truck> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20201118121322.GA1821@willie-the-truck> User-Agent: Mutt/1.10.1 (2018-07-13) Precedence: bulk List-ID: X-Mailing-List: netfilter-devel@vger.kernel.org Will Deacon wrote: > On Mon, Nov 16, 2020 at 07:20:28PM +0100, Florian Westphal wrote: > > subashab@codeaurora.org wrote: > > > > > Unfortunately we are seeing it on ARM64 regression systems which > > > > > runs a > > > > > variety of > > > > > usecases so the exact steps are not known. > > > > > > > > Ok. Would you be willing to run some of those with your suggested > > > > change to see if that resolves the crashes or is that so rare that this > > > > isn't practical? > > > > > > I can try that out. Let me know if you have any other suggestions as well > > > and I can try that too. > > > > > > I assume we cant add locks here as it would be in the packet processing > > > path. > > > > Yes. We can add a synchronize_net() in xt_replace_table if needed > > though, before starting to put the references on the old ruleset > > This would avoid the free of the jumpstack while skbs are still > > in-flight. > > I tried to make sense of what's going on here, and it looks to me like > the interaction between xt_replace_table() and ip6t_do_table() relies on > store->load order that is _not_ enforced in the code. > > xt_replace_table() does: > > table->private = newinfo; > > /* make sure all cpus see new ->private value */ > smp_wmb(); > > /* > * Even though table entries have now been swapped, other CPU's > * may still be using the old entries... > */ > local_bh_enable(); > > /* ... so wait for even xt_recseq on all cpus */ > for_each_possible_cpu(cpu) { > seqcount_t *s = &per_cpu(xt_recseq, cpu); > u32 seq = raw_read_seqcount(s); > > if (seq & 1) { > do { > cond_resched(); > cpu_relax(); > } while (seq == raw_read_seqcount(s)); > } > } > > and I think the idea here is that we swizzle the table->private pointer > to point to the new data, and then wait for all pre-existing readers to > finish with the old table (isn't this what RCU is for?). Yes and yes. Before the current 'for_each_possible_cpu() + seqcount abuse the abuse was just implicit; xt_replace_table did NOT wait for readers to go away and relied on arp, eb and ip(6)tables to store the counters back to userspace after xt_replace_table(). This meant a read_seqcount_begin/retry for each counter & eventually gave complaits from k8s users that x_tables rule replacement was too slow. [..] > Given that this appears to be going wrong in practice, I've included a quick > hack below which should fix the ordering. However, I think it would be > better if we could avoid abusing the seqcount code for this sort of thing. I'd be ok with going via the simpler solution & wait if k8s users complain that its too slow. Those memory blobs can be huge so I would not use call_rcu here. diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index af22dbe85e2c..b5911985d1eb 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1384,7 +1384,7 @@ xt_replace_table(struct xt_table *table, * private. */ smp_wmb(); - table->private = newinfo; + WRITE_ONCE(table->private, newinfo); /* make sure all cpus see new ->private value */ smp_wmb(); @@ -1394,19 +1394,7 @@ xt_replace_table(struct xt_table *table, * may still be using the old entries... */ local_bh_enable(); - - /* ... so wait for even xt_recseq on all cpus */ - for_each_possible_cpu(cpu) { - seqcount_t *s = &per_cpu(xt_recseq, cpu); - u32 seq = raw_read_seqcount(s); - - if (seq & 1) { - do { - cond_resched(); - cpu_relax(); - } while (seq == raw_read_seqcount(s)); - } - } + synchronize_rcu(); audit_log_nfcfg(table->name, table->af, private->number, !private->number ? AUDIT_XT_OP_REGISTER :