From: "Cédric Le Goater" <clg@kaod.org>
To: Stephen Longfield <slongfield@google.com>,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	andrew@codeconstruct.com.au, joel@jms.id.au
Cc: Joe Komlodi <komlodi@google.com>,
	Patrick Venture <venture@google.com>,
	Ryan Chen <ryan_chen@aspeedtech.com>,
	Jamin Lin <jamin_lin@aspeedtech.com>,
	Troy Lee <troy_lee@aspeedtech.com>
Subject: Re: Possible race condition in aspeed ast2600 smp boot on TCG QEMU
Date: Fri, 12 Jan 2024 17:33:20 +0100
Message-ID: <484ebf77-6b62-418c-8319-d69ccaf90c17@kaod.org>
In-Reply-To: <CAK_0=F+RznDdq27z3r3H1d4pj=QTD-9WZP8xH7jOP75QXJhHpw@mail.gmail.com>

Adding Aspeed Engineers. This reminds me of a discussion a while ago.

On 1/11/24 18:38, Stephen Longfield wrote:
> We’ve noticed inconsistent behavior when running a large number of aspeed ast2600 executions that seems to be tied to a race condition in the SMP boot when executing on TCG-QEMU, and were wondering what a good mitigation strategy might be.
> 
> The problem first shows up as part of SMP boot. On a run that’s likely to later run into issues, we’ll see something like:
> 
> ```
> [    0.008350] smp: Bringing up secondary CPUs ...
> [    1.168584] CPU1: failed to come online
> [    1.187277] smp: Brought up 1 node, 1 CPU
> ```
> 
> Compared to the more likely to succeed:
> 
> ```
> [    0.080313] smp: Bringing up secondary CPUs ...
> [    0.093166] smp: Brought up 1 node, 2 CPUs
> [    0.093345] SMP: Total of 2 processors activated (4800.00 BogoMIPS).
> ```
> 
> It’s somewhat reliably reproducible by running the ast2600-evb with an OpenBMC image, using ‘-icount auto’ to slow execution and make the race condition more frequent (it also happens without this option, but it is easier to debug when we can reproduce it):
> 
> 
> ```
> ./aarch64-softmmu/qemu-system-aarch64 -machine ast2600-evb -nographic -drive file=~/bmc-bin/image-obmc-ast2600,if=mtd,bus=0,unit=0,snapshot=on -nic user -icount auto
> ```
> 
> Our current hypothesis is that the problem comes up in the platform u-boot. As part of the boot, the secondary core waits for the SMP mailbox to get a magic number written by the primary core:
> 
> https://github.com/AspeedTech-BMC/u-boot/blob/aspeed-master-v2019.04/arch/arm/mach-aspeed/ast2600/platform.S#L168
> 
> However, this memory address is cleared on boot:
> 
> https://github.com/AspeedTech-BMC/u-boot/blob/aspeed-master-v2019.04/arch/arm/mach-aspeed/ast2600/platform.S#L146
> 
> The race condition occurs if the primary core runs far ahead of the secondary core: if the primary core gets to the point where it signals the secondary core’s mailbox before the secondary core gets past the point where it does the initial reset and starts waiting, the reset will clear the signal, and then the secondary core will never get past the point where it’s looping in `poll_smp_mbox_ready`.
> 
> We’ve observed this race happening by dumping all SCU reads and writes, and validated that this is the problem by using a modified `platform.S` that doesn’t clear the =SCU_SMP_READY mailbox on reset, but would rather not have to use a modified version of SMP boot just for QEMU-TCG execution.

You could use '-trace aspeed_scu*' to collect the MMIO accesses on
the SCU unit. A TCG plugin would also work.
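
To make the ordering concrete, here is a minimal sketch of the lost
signal described above. It is only a model, not QEMU or u-boot code:
the magic value and the step names are made up for illustration, and
the two cores are reduced to the two mailbox writes that matter.

```python
# Hypothetical magic value; the real one is defined in platform.S.
SMP_READY_MAGIC = 0xCAFEF00D


def secondary_wakes(order):
    """Replay the mailbox writes in `order` and report whether the
    secondary core's poll loop would ever see the magic value."""
    mbox = 0
    for step in order:
        if step == "secondary_clear":
            # Secondary core clears the mailbox early in its boot path.
            mbox = 0
        elif step == "primary_signal":
            # Primary core writes the magic number to release the secondary.
            mbox = SMP_READY_MAGIC
        else:
            raise ValueError(f"unknown step: {step}")
    # After both writes, the secondary just polls this value forever.
    return mbox == SMP_READY_MAGIC


if __name__ == "__main__":
    expected = ("secondary_clear", "primary_signal")
    racy = ("primary_signal", "secondary_clear")
    for order in (expected, racy):
        state = "CPU1 online" if secondary_wakes(order) else "CPU1 stuck polling"
        print(" -> ".join(order), ":", state)
```

In the second ordering the clear lands after the signal, so the poll
loop never sees the magic value. That would be consistent with the
"CPU1: failed to come online" log.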

> Is there a way to have QEMU insert a barrier synchronization at some point in the bootloader?  I think getting both cores past the =SCU_SMP_READY reset would get rid of this race, but I’m not aware of a way to do that kind of thing in QEMU-TCG.
> 
> Thanks for any insights!

Could we change the default value of registers 0x180 ... 0x18C in
hw/misc/aspeed_scu.c to make sure the SMP regs are immune to the
race?

Thanks,

C.





