From: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
To: "Jürgen Groß" <jgross@suse.com>
Cc: Juergen Gross <jgross@suse.de>,
	Dario Faggioli <dfaggioli@suse.com>,
	Jan Beulich <jbeulich@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer
Date: Sat, 9 Oct 2021 18:28:17 +0200
Message-ID: <YWHDIQC3K8J3LD8+@mail-itl>
In-Reply-To: <20210131021526.GB6354@mail-itl>

On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:
> On Tue, Sep 29, 2020 at 05:27:48PM +0200, Jürgen Groß wrote:
> > On 29.09.20 17:16, Marek Marczykowski-Górecki wrote:
> > > On Tue, Sep 29, 2020 at 05:07:11PM +0200, Jürgen Groß wrote:
> > > > On 29.09.20 16:27, Marek Marczykowski-Górecki wrote:
> > > > > On Mon, Mar 23, 2020 at 01:09:49AM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > On Thu, Mar 19, 2020 at 01:28:10AM +0100, Dario Faggioli wrote:
> > > > > > > [Adding Juergen]
> > > > > > > 
> > > > > > > On Wed, 2020-03-18 at 23:10 +0100, Marek Marczykowski-Górecki wrote:
> > > > > > > > On Wed, Mar 18, 2020 at 02:50:52PM +0000, Andrew Cooper wrote:
> > > > > > > > > On 18/03/2020 14:16, Marek Marczykowski-Górecki wrote:
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > In my test setup (inside KVM with nested virt enabled), I rather
> > > > > > > > > > frequently get Xen crash on resume from S3. Full message below.
> > > > > > > > > > 
> > > > > > > > > > > This is Xen 4.13.0, with some patches, including "sched: fix
> > > > > > > > > > > resuming from S3 with smt=0".
> > > > > > > > > > > 
> > > > > > > > > > > Contrary to the previous issue, this one does not always happen - I
> > > > > > > > > > > would say in about 40% of cases on this setup, but very rarely on a
> > > > > > > > > > > physical setup.
> > > > > > > > > > 
> > > > > > > > > > This is _without_ core scheduling enabled, and also with smt=off.
> > > > > > > > > > 
> > > > > > > > > > > Do you think it would be any different on xen-unstable? I can try,
> > > > > > > > > > > but it isn't trivial in this setup, so I'd ask first.
> > > > > > > > > > 
> > > > > > > Well, Juergen has fixed quite a few issues.
> > > > > > > 
> > > > > > > Most of them were triggering with core-scheduling enabled, and I don't
> > > > > > > recall any of them which looked similar or related to this.
> > > > > > > 
> > > > > > > Still, it's possible that the same issue causes different symptoms, and
> > > > > > > hence that maybe one of the patches would fix this too.
> > > > > > 
> > > > > > I've tested on master (d094e95fb7c), and reproduced exactly the same crash
> > > > > > (pasted below for completeness).
> > > > > > But there is more: additionally, in most (all?) cases after resume I've got
> > > > > > a soft lockup in Linux dom0 in smp_call_function_single() - see below. It
> > > > > > didn't happen before and the only change was Xen 4.13 -> master.
> > > > > > 
> > > > > > Xen crash:
> > > > > > 
> > > > > > (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at credit2.c:2133
> > > > > 
> > > > > Juergen, any idea about this one? This is also happening on the current
> > > > > stable-4.14 (28855ebcdbfa).
> > > > > 
> > > > 
> > > > Oh, sorry I didn't come back to this issue.
> > > > 
> > > > I suspect this is related to stop_machine_run() being called during
> > > > suspend(), as I'm seeing very sporadic issues when offlining and then
> > > > onlining cpus with core scheduling being active (it seems as if the
> > > > dom0 vcpu doing the cpu online activity sometimes is using an old
> > > > vcpu state).
> > > 
> > > Note this is default Xen 4.14 start, so core scheduling is _not_ active:
> > 
> > The similarity in the two failure cases is that multiple cpus are
> > affected by the operations during stop_machine_run().
> > 
> > > 
> > >      (XEN) Brought up 2 CPUs
> > >      (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
> > >      (XEN) Adding cpu 0 to runqueue 0
> > >      (XEN)  First cpu on runqueue, activating
> > >      (XEN) Adding cpu 1 to runqueue 1
> > >      (XEN)  First cpu on runqueue, activating
> > > 
> > > > I wasn't able to catch the real problem despite having tried lots
> > > > of approaches using debug patches.
> > > > 
> > > > Recently I suspected the whole problem could be somehow related to
> > > > RCU handling, as stop_machine_run() is relying on tasklets which are
> > > > executing in idle context, and RCU handling is done in idle context,
> > > > too. So there might be some kind of use after free scenario in case
> > > > some memory is freed via RCU despite it still being used by a tasklet.
> > > 
> > > That sounds plausible, even though I don't really know this area of Xen.
> > > 
> > > > I "just" need to find some time to verify this suspicion. Any help doing
> > > > this would be appreciated. :-)
> > > 
> > > I do have a setup where I can easily-ish reproduce the issue. If there
> > > is some debug patch you'd like me to try, I can do that.
> > 
> > Thanks. I might come back to that offer as you are seeing a crash which
> > will be much easier to analyze. Catching my error case is much harder as
> > it surfaces some time after the real problem in a non-destructive way
> > (usually I'm seeing a failure to load a library in a program which has
> > just done its job using exactly the library reported as not loadable).
> 
> Hi,
> 
> I'm resurrecting this thread as it was recently mentioned elsewhere. I
> can still reproduce the issue on the recent staging branch (9dc687f155).
> 
> It fails after the first resume (not always, but frequently enough to
> debug it). At least one guest needs to be running - with just (PV) dom0
> the crash doesn't happen (at least for the ~8 times in a row I tried).
> If the first resume works, the second will (almost?) always fail, but
> with different symptoms - dom0 kernel lockups (at least on some of its
> vcpus). I haven't debugged this one at all yet.
> 
> Any help will be appreciated, I can apply some debug patches, change
> configuration etc.

This still happens on 4.14.3. Maybe it is related to freeing percpu
areas, as it caused other issues with suspend too? Just a thought...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Thread overview: 26+ messages
2020-03-18 14:16 [Xen-devel] Xen crash after S3 suspend - Xen 4.13 Marek Marczykowski-Górecki
2020-03-18 14:50 ` Andrew Cooper
2020-03-18 22:10   ` Marek Marczykowski-Górecki
2020-03-19  0:28     ` Dario Faggioli
2020-03-19  0:59       ` Marek Marczykowski-Górecki
2020-03-23  0:09       ` Marek Marczykowski-Górecki
2020-03-23  8:14         ` Jan Beulich
2020-09-29 14:27         ` Marek Marczykowski-Górecki
2020-09-29 15:07           ` Jürgen Groß
2020-09-29 15:16             ` Marek Marczykowski-Górecki
2020-09-29 15:27               ` Jürgen Groß
2021-01-31  2:15                 ` [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer Marek Marczykowski-Górecki
2021-10-09 16:28                   ` Marek Marczykowski-Górecki [this message]
2022-08-21 16:14                     ` Marek Marczykowski-Górecki
2022-08-22  9:53                       ` Jan Beulich
2022-08-22 10:00                         ` Marek Marczykowski-Górecki
2022-09-20 10:22                           ` Marek Marczykowski-Górecki
2022-09-20 14:30                             ` Jan Beulich
2022-10-11 11:22                               ` Marek Marczykowski-Górecki
2022-10-14 16:42                             ` George Dunlap
2022-10-21  6:41                             ` Juergen Gross
2022-08-22 15:34                       ` Juergen Gross
2022-09-06 11:46                         ` Juergen Gross
2022-09-06 12:35                           ` Marek Marczykowski-Górecki
2022-09-07 12:21                             ` Dario Faggioli
2022-09-07 15:07                               ` marmarek
