From: Willy Tarreau <w@1wt.eu>
To: Kajetan Puchalski <kajetan.puchalski@arm.com>
Cc: Florian Westphal <fw@strlen.de>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	Jozsef Kadlecsik <kadlec@netfilter.org>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Mel Gorman <mgorman@suse.de>,
	lukasz.luba@arm.com, dietmar.eggemann@arm.com,
	netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
	netdev@vger.kernel.org, stable@vger.kernel.org,
	regressions@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [Regression] stress-ng udp-flood causes kernel panic on Ampere Altra
Date: Sat, 2 Jul 2022 13:54:46 +0200
Message-ID: <20220702115446.GA25840@1wt.eu>
In-Reply-To: <YsAnPhPfWRjpkdmn@e126311.manchester.arm.com>

On Sat, Jul 02, 2022 at 12:08:46PM +0100, Kajetan Puchalski wrote:
> On Fri, Jul 01, 2022 at 10:01:10PM +0200, Florian Westphal wrote:
> > Kajetan Puchalski <kajetan.puchalski@arm.com> wrote:
> > > While running the udp-flood test from stress-ng on Ampere Altra (Mt.
> > > Jade platform) I encountered a kernel panic caused by NULL pointer
> > > dereference within nf_conntrack.
> > > 
> > > The issue is present in the latest mainline (5.19-rc4), latest stable
> > > (5.18.8), as well as multiple older stable versions. The last working
> > > stable version I found was 5.15.40.
> > 
> > Do I need a special setup for conntrack?
> 
> I don't think there was any special setup involved, the config I started
> from was a generic distribution config and I didn't change any
> networking-specific options. In case that's helpful here's the .config I
> used.
> 
> https://pastebin.com/Bb2wttdx
> 
> > 
> > No crashes after more than one hour of stress-ng on
> > 1. 4 core amd64 Fedora 5.17 kernel
> > 2. 16 core amd64, linux stable 5.17.15
> > 3. 12 core intel, Fedora 5.18 kernel
> > 4. 3 core aarch64 vm, 5.18.7-200.fc36.aarch64
> > 
> 
> That would make sense, from further experiments I ran it somehow seems
> to be related to the number of workers being spawned by stress-ng along
> with the CPUs/cores involved.
> 
> For instance, running the test with <=25 workers (--udp-flood 25 etc.)
> results in the test running fine for at least 15 minutes.

Another point to keep in mind is that modern ARM processors (ARMv8.1 and
above) have a more relaxed memory model than older ones (and x86), which
can easily expose a missing barrier somewhere. I already ran into this
the first time I ran my code on Graviton2: it caused crashes that would
never happen on A53/A72/A73 cores or on x86.
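
To illustrate what a missing barrier looks like, here is a minimal
userspace C11 sketch. It is not code from the kernel or from
nf_conntrack, and the names payload, ready and run_once are made up for
the example. A writer publishes data and then sets a flag; a reader
waits for the flag before using the data. The release/acquire pair is
what guarantees that the reader sees the data once it sees the flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical names, for illustration only. */
static int payload;       /* plain data published by the writer */
static atomic_int ready;  /* flag telling the reader the data is valid */

static void *writer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Release: the payload store above may not be reordered after
     * this flag store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg)
{
    int *out = arg;

    /* Acquire: the payload load below may not be reordered before
     * the flag load that observes 1. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    *out = payload;
    return NULL;
}

/* Runs one writer/reader pair and returns what the reader saw. */
int run_once(void)
{
    pthread_t w, r;
    int seen = 0;
    int rc;

    payload = 0;
    atomic_store(&ready, 0);

    rc = pthread_create(&r, NULL, reader, &seen);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

If both orderings were weakened to memory_order_relaxed, nothing would
force the payload to become visible before the flag. x86 hardware tends
to hide that kind of mistake because it keeps stores in order and loads
in order, whereas a weakly ordered CPU may let the reader pick up a
stale payload.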

ARMv8.1 SoCs are not yet widely available for end users like us. A76
is only just arriving, and A55 has now been available for a bit more
than a year. So testing on regular ARM devices such as the RPi may not
reveal such differences.

> Running the test with 30 workers results in a panic sometime before it
> hits the 15 minute mark.
> Based on observations there seems to be a correlation between the number
> of workers and how quickly the panic occurs, ie with 30 it takes a few
> minutes, with 160 it consistently happens almost immediately. That also
> holds for various numbers of workers in between.
> 
> On the CPU/core side of things, the machine in question has two CPU
> sockets with 80 identical cores each. All the panics I've encountered
> happened when stress-ng was run directly and unbound.
> When I tried using hwloc-bind to bind the process to one of the CPU
> sockets, the test ran for 15 mins with 80 and 160 workers with no issues,
> no matter which CPU it was bound to.
> 
> Ie the specific circumstances under which it seems to occur are when the
> test is able to run across multiple CPU sockets with a large number
> of workers being spawned.

This could further support the possibility explained above.

Regards,
Willy
