From: Joel Fernandes <joel@joelfernandes.org>
To: Zhouyi Zhou <zhouzhouyi@gmail.com>
Cc: "moderated list:ARM/STM32 ARCHITECTURE" 
	<linux-arm-kernel@lists.infradead.org>,
	Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	rcu <rcu@vger.kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>
Subject: Re: arm64 torture test hotplug failures (offlining causes -EBUSY)
Date: Tue, 17 Jan 2023 01:45:16 +0000
Message-ID: <Y8X9rL4ZB9E7gAN9@google.com>
In-Reply-To: <CAABZP2z50Fbs2rh4j2ToH-0hGfDamzQF3SoxxniC5LrXZ=Ja+A@mail.gmail.com>

On Tue, Jan 17, 2023 at 08:37:16AM +0800, Zhouyi Zhou wrote:
> On Tue, Jan 17, 2023 at 8:15 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
> > > Hi Zhouyi,
> > >
> > > On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:
> > > >
> > > [..]
> > > > On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > >
> > > > > Hello,
> > > > > I am seeing -EBUSY returned a lot during torture_onoff() when running
> > > > > rcutorture on arm64. This causes hotplug failure 30% of the time. I am
> > > > > > also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
> > > > >
> > > > > This causes warnings in torture tests:
> > > > > [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > > [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> > > > >
> > > > > Full kernel log here:
> > > > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
> > > > >
> > > > > Any ideas on why this is happening and only for CPU 0 (presumably the
> > > > > boot CPU)? I'd personally need these warnings to go away for my tests
> > > > > as this causes rcutorture's tests to not cleanly pass for me. It
> > > > > appears remove_cpu() -> device_offline() is what returns the error.
> > > > >
> > > > > I guess this is probably because CPU 0 is the tick_do_timer_cpu in
> > > > > nohz_full mode, which prevents that CPU from
> > > > > going offline [1]. We have discussed this topic, but there is no
> > > > > agreement on how to solve it yet.
> > >
> > > But I am seeing the issue in TRACE02 config which is:
> > > CONFIG_NO_HZ_IDLE=y
> > > # CONFIG_NO_HZ_FULL is not set
> > >
> > > So that is not NO_HZ_FULL:
> > > http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
> > > However, I can't seem to find the full kernel logs for that.
> > >
> > > Also, other than the TRACE02 fail, I only see the issue with configs
> > > with CONFIG_NO_HZ_FULL=y
> > >
> > > Can you try TRACE02 specifically, and see if you can reproduce the
> > > same issue on your setup? Meanwhile, I'll try to trace what is
> > > returning the -EBUSY.
> I am trying TRACE02 on my x86_64 machine now, using a cross compiler and
> qemu-system-aarch64. My equipment is limited, but I hope I can be of
> benefit to the community ;-)

Cool, I am assuming you are trying the patch you shared which you wrote in
November. I bet you will still see the issue.

> >
> > How about something simple like the following? (untested)
> >
> > ---8<-----------------------
> >
> > diff --git a/kernel/torture.c b/kernel/torture.c
> > index bc8fb361efc0..cd64110694c0 100644
> > --- a/kernel/torture.c
> > +++ b/kernel/torture.c
> > @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> >                         // PCI probe frequently disables hotplug during boot.
> >                         (*n_offl_attempts)--;
> >                         s = " (-EBUSY forgiven during boot)";
> > +               } else if (tick_nohz_full_running && ret == -EBUSY) {
> > +                       (*n_offl_attempts)--;
> > +                       s = " (-EBUSY forgiven if nohz_full is running)";
>  Fantastic fix!! This way we can fix the timekeeper CPU torture problem
> without touching the timekeeper code.

Thanks. Unfortunately this does not fix the issue for TRACE02 and the patch
you shared does not fix it either -- because TRACE02 is not a no-hz-full
test. :-(

We will need to do a bit of tracing to figure out where the -EBUSY is coming
from for TRACE02.

I wonder if we should ignore -EBUSY altogether, since as Thomas mentioned,
hotplug failure is "normal". Thoughts?
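
For reference, here is a minimal userspace sketch of what forgiving -EBUSY
unconditionally would mean for the attempt accounting. It is a model, not the
actual kernel/torture.c code: struct offl_stats, record_offline() and
offl_failures() are hypothetical names made up for illustration. Only the
"decrement the attempt count on a forgiven -EBUSY" step is taken from the
patch quoted above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of the offline accounting; not kernel code. */
struct offl_stats {
	long n_attempts;	/* offline attempts that count toward the result */
	long n_successes;	/* offline attempts that returned 0 */
};

/*
 * Record the result of one offline attempt.
 *
 * 'ret' is what the offline call returned: 0 on success or a negative
 * errno.  When 'forgive_ebusy' is true, an -EBUSY result is not counted
 * as an attempt at all, mirroring the "(*n_offl_attempts)--" pattern in
 * the quoted patch.  Any other error still counts as a failed attempt.
 *
 * Returns true if the attempt succeeded.
 */
bool record_offline(struct offl_stats *st, int ret, bool forgive_ebusy)
{
	st->n_attempts++;
	if (ret == 0) {
		st->n_successes++;
		return true;
	}
	if (ret == -EBUSY && forgive_ebusy)
		st->n_attempts--;	/* forgiven: as if we never tried */
	return false;
}

/* Failures are the counted attempts that did not succeed. */
long offl_failures(const struct offl_stats *st)
{
	return st->n_attempts - st->n_successes;
}
```

In this model, with forgive_ebusy set, any number of -EBUSY results leaves
offl_failures() at zero, while other errnos still count. The flip side is that
a CPU which can never be offlined would no longer show up as a failure.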

thanks,

 - Joel

