Re: Kernel crash while doing chroot'ed grub2-mkconfig on qemu-emulated Nehalem CPU since late November 6.13 snapshot

From: Mike Rapoport <rppt@kernel.org>
To: Adam Williamson <awilliam@redhat.com>
Cc: linux-kernel@vger.kernel.org, jforbes@redhat.com, mcgrof@kernel.org
Subject: Re: Kernel crash while doing chroot'ed grub2-mkconfig on qemu-emulated Nehalem CPU since late November 6.13 snapshot
Date: Fri, 10 Jan 2025 11:57:03 +0200
Message-ID: <Z4Du72UA_ppOzXsD@kernel.org>
In-Reply-To: <ab8c63b1d722f68f22b1b9449a40406f6a2bcff2.camel@redhat.com>

Hi Adam,

On Thu, Jan 02, 2025 at 12:16:03PM -0800, Adam Williamson wrote:
> On Wed, 2024-12-11 at 08:51 -0800, Adam Williamson wrote:
> > Hi, folks. Please CC me on replies, I'm not subscribed to the list. The
> > downstream bug report for this is
> > https://bugzilla.redhat.com/show_bug.cgi?id=2329581 . I also filed
> > https://bugzilla.kernel.org/show_bug.cgi?id=219554 but it looks like
> > nobody is monitoring that ATM, hence this email. Sorry, I don't know
> > where to send it that would be more targeted.
> > 
> > I maintain Fedora's openQA instance - https://openqa.fedoraproject.org/
> > (openQA is an automated testing system which runs jobs on qemu VMs,
> > inputting keyboard and mouse events via VNC, and monitoring results via
> > screenshots and the serial console).
> > 
> > In openQA testing we've noticed a lot of failures of install tests
> > since kernel-6.13.0-0.rc0.20241125git9f16d5e6f220.8.fc42 landed in
> > Rawhide - that is, a snapshot of upstream git 9f16d5e6f220 . The
> > previous build, kernel-6.13.0-0.rc0.20241119git158f238aa69d.2.fc42 - a
> > snapshot of upstream 158f238aa69d - did not show this problem. The
> > problems persist with the latest kernel build, kernel-6.13.0-
> > 0.rc2.22.fc42 (a build of 6.13 rc2 exactly).
> > 
> > Both BIOS and UEFI x86_64 installs are frequently hitting kernel
> > crashes when the Fedora installer runs grub2-mkconfig as part of the
> > install process. In the BIOS case, this causes the system to hang
> > permanently. In the UEFI case, the system hangs for a while then
> > reboots, and fails to boot properly as the installation did not
> > complete.
> > 
> > I've reproduced both BIOS and UEFI failures locally with a qemu VM
> > configured like the one we use in the affected tests: 2 vCPUs, 4G RAM,
> > and CPU model Nehalem - that's the `-cpu Nehalem` argument to qemu. If I
> > use host CPU config instead, the bug doesn't happen. We intentionally
> > use the Nehalem model in this testing to ensure Fedora doesn't
> > inadvertently stop supporting the CPU baseline it intends to support.
> > 
> > This happens on more than 50% of install attempts, but not all of them
> > (sometimes they work; I've set our test system to retry failures five
> > times for now to mitigate the effects of this bug).
> > 
> > The details of the traces we get in the kernel logs differ between
> > occurrences and also between BIOS and UEFI, which someone suggested
> > indicates this may be some kind of memory corruption issue. But the
> > broad shape is consistent: the installer reaches grub2-mkconfig and we
> > get a kernel crash.
> > 
> > I did also try reproducing this by running `grub2-mkconfig -o
> > /boot/grub/grub2.cfg` multiple times on an *installed* VM with the same
> > kernel and VM config, but could not trigger a crash in this case. There
> > must be something specific about how this happens in the installer
> > environment (for one thing, the installer runs the command chroot'ed
> > into the installed system environment).
> > 
> > I'll attach sample logs from a UEFI failure and a BIOS failure.
> > 
> > I haven't attempted to bisect this yet as I find bisecting kernel
> > issues pretty painful (the Fedora kernel package spec is a bit weird if
> > you're not used to it, building a full kernel takes a long time, I
> > don't know how to do intermittent builds with the Fedora kernel spec,
> > and since I can't yet reproduce this outside the installer I then have
> > to build an installer image with the kernel build in to test it...).
> > But if needs must I'll bite the bullet and do it. If anyone could e.g.
> > guess at a commit or commit series that might be causing this so I
> > could try a targeted reversion, though, that'd be great.
> 
> Update on this: over the holidays, I bisected it to
> 5185e7f9f3bd754ab60680814afd714e2673ef88 . A kernel with that commit
> reverted does not hit the bug.
> 
> I also did some testing with various CPU model configurations. I think
> this actually isn't to do with Nehalem per se, but "virtual machines
> where the CPU configuration does not exactly match the host", or
> something like that.
> 
> I tried a bunch of qemu CPU model settings - nehalem, sandybridge,
> haswell, Skylake-Client and Cascadelake-Server - and got failures with
> all of them, but when I set the model to "host", all tests passed.
> 
> The tests get farmed out to a cluster of systems which have different
> CPUs - one is Broadwell, one is Skylake, one is Cascade Lake - so I
> think when I set the model to anything specific, it will match the host
> CPU on some or none of those systems, but never *all* of them, so the
> bug will always show up.
> 
> I have emailed the author and reviewer of
> 5185e7f9f3bd754ab60680814afd714e2673ef88 (also CCed on this mail) but
> have not heard back from them yet. I've sunk over a week into this bug
> at this point so it'd be great if someone could look at it. It's not
> the biggest regression in the world, but it is a bit awkward for our
> automated testing (I'll have to fiddle around to try and set CPU model
> 'host' for the most badly-affected tests but ensure we still have
> enough tests with 'nehalem' to confirm our baseline isn't moved).
> 
> Thanks, and happy new year!

Can you please test this patch:

diff --git a/mm/execmem.c b/mm/execmem.c
index be6b234c032e..0090a6f422aa 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -266,6 +266,7 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
 	unsigned long vm_flags = VM_ALLOW_HUGE_VMAP;
 	struct execmem_area *area;
 	unsigned long start, end;
+	unsigned int page_shift;
 	struct vm_struct *vm;
 	size_t alloc_size;
 	int err = -ENOMEM;
@@ -296,8 +297,9 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
 	if (err)
 		goto err_free_mem;
 
+	page_shift = get_vm_area_page_order(vm) + PAGE_SHIFT;
 	err = vmap_pages_range_noflush(start, end, range->pgprot, vm->pages,
-				       PMD_SHIFT);
+				       page_shift);
 	if (err)
 		goto err_free_mem;
 
-- 
2.45.2
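
For readers following the arithmetic in the patch, here is a minimal sketch (not kernel code) of how the mapping shift relates to the page order. It assumes x86_64 values (PAGE_SHIFT of 12, PMD_SHIFT of 21); the DEMO_* constants and demo_* helpers are hypothetical names made up for this illustration. The reading that the old code goes wrong when the area ends up backed by pages smaller than a PMD is an interpretation of the diff, not something stated in this mail.

```c
#include <assert.h>

/* Assumed x86_64 values, for illustration only; other architectures differ. */
#define DEMO_PAGE_SHIFT 12u	/* 4 KiB base pages */
#define DEMO_PMD_SHIFT  21u	/* 2 MiB PMD-level mappings */

/*
 * Mirrors the expression added by the patch:
 *   page_shift = get_vm_area_page_order(vm) + PAGE_SHIFT;
 * page_order is the order of the pages actually backing the area.
 */
static unsigned int demo_page_shift(unsigned int page_order)
{
	/* Keep the result usable as a shift count for a 64-bit value. */
	assert(page_order <= 63u - DEMO_PAGE_SHIFT);
	return page_order + DEMO_PAGE_SHIFT;
}

/* Size in bytes of one mapping granule for a given shift. */
static unsigned long long demo_granule_bytes(unsigned int shift)
{
	return 1ULL << shift;
}
```

With these assumed values, demo_page_shift(9) equals DEMO_PMD_SHIFT, so the patched call passes the same shift as the old one when the area is backed by 2 MiB pages, while demo_page_shift(0) yields a 4 KiB granule instead of the hard-coded 2 MiB one.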


-- 
Sincerely yours,
Mike.
