From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 10 Jan 2025 11:57:03 +0200
From: Mike Rapoport <rppt@kernel.org>
To: Adam Williamson <awilliam@redhat.com>
Cc: linux-kernel@vger.kernel.org, jforbes@redhat.com, mcgrof@kernel.org
Subject: Re: Kernel crash while doing chroot'ed grub2-mkconfig on
 qemu-emulated Nehalem CPU since late November 6.13 snapshot
Message-ID: <Z4Du72UA_ppOzXsD@kernel.org>
References: <9a3c747052fb82274ab3c4a84eaf64c1273117ce.camel@redhat.com>
 <ab8c63b1d722f68f22b1b9449a40406f6a2bcff2.camel@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <ab8c63b1d722f68f22b1b9449a40406f6a2bcff2.camel@redhat.com>

Hi Adam,

On Thu, Jan 02, 2025 at 12:16:03PM -0800, Adam Williamson wrote:
> On Wed, 2024-12-11 at 08:51 -0800, Adam Williamson wrote:
> > Hi, folks. Please CC me on replies, I'm not subscribed to the list. The
> > downstream bug report for this is
> > https://bugzilla.redhat.com/show_bug.cgi?id=2329581 . I also filed
> > https://bugzilla.kernel.org/show_bug.cgi?id=219554 but it looks like
> > nobody is monitoring that ATM, hence this email. Sorry, I don't know
> > where to send it that would be more targeted.
> > 
> > I maintain Fedora's openQA instance - https://openqa.fedoraproject.org/
> > (openQA is an automated testing system which runs jobs on qemu VMs,
> > inputting keyboard and mouse events via VNC, and monitoring results via
> > screenshots and the serial console).
> > 
> > In openQA testing we've noticed a lot of failures of install tests
> > since kernel-6.13.0-0.rc0.20241125git9f16d5e6f220.8.fc42 landed in
> > Rawhide - that is, a snapshot of upstream git 9f16d5e6f220 . The
> > previous build, kernel-6.13.0-0.rc0.20241119git158f238aa69d.2.fc42 - a
> > snapshot of upstream 158f238aa69d - did not show this problem. The
> > problems persist with the latest kernel build,
> > kernel-6.13.0-0.rc2.22.fc42 (a build of 6.13 rc2 exactly).
> > 
> > Both BIOS and UEFI x86_64 installs are frequently hitting kernel
> > crashes when the Fedora installer runs grub2-mkconfig as part of the
> > install process. In the BIOS case, this causes the system to hang
> > permanently. In the UEFI case, the system hangs for a while then
> > reboots, and fails to boot properly as the installation did not
> > complete.
> > 
> > I've reproduced both BIOS and UEFI failures locally with a qemu VM
> > configured like the one we use in the affected tests: 2 vCPUs, 4G RAM,
> > and CPU model Nehalem - that's `-cpu Nehalem` argument to qemu. If I
> > use host CPU config instead, the bug doesn't happen. We intentionally
> > use the Nehalem model in this testing to ensure Fedora doesn't
> > inadvertently stop supporting the CPU baseline it intends to support.
> > 
> > This happens on more than 50% of install attempts, but not all of them
> > (sometimes they work; I've set our test system to retry failures five
> > times for now to mitigate the effects of this bug).
> > 
> > The details of the traces we get in the kernel logs differ between
> > occurrences and also between BIOS and UEFI, which someone suggested
> > indicates this may be some kind of memory corruption issue. But the
> > broad shape is consistent: the installer reaches grub2-mkconfig and we
> > get a kernel crash.
> > 
> > I did also try reproducing this by running `grub2-mkconfig -o
> > /boot/grub/grub2.cfg` multiple times on an *installed* VM with the same
> > kernel and VM config, but could not trigger a crash in this case. There
> > must be something specific about how this happens in the installer
> > environment (for one thing, the installer runs the command chroot'ed
> > into the installed system environment).
> > 
> > I'll attach sample logs from a UEFI failure and a BIOS failure.
> > 
> > I haven't attempted to bisect this yet as I find bisecting kernel
> > issues pretty painful (the Fedora kernel package spec is a bit weird if
> > you're not used to it, building a full kernel takes a long time, I
> > don't know how to do intermittent builds with the Fedora kernel spec,
> > and since I can't yet reproduce this outside the installer I then have
> > to build an installer image with the kernel build in to test it...).
> > But if needs must I'll bite the bullet and do it. If anyone could e.g.
> > guess at a commit or commit series that might be causing this so I
> > could try a targeted reversion, though, that'd be great.
> 
> Update on this: over the holidays, I bisected it to
> 5185e7f9f3bd754ab60680814afd714e2673ef88 . A kernel with that commit
> reverted does not hit the bug.
> 
> I also did some testing with various CPU model configurations. I think
> this actually isn't to do with Nehalem per se, but "virtual machines
> where the CPU configuration does not exactly match the host", or
> something like that.
> 
> I tried a bunch of qemu CPU model settings - nehalem, sandybridge,
> haswell, Skylake-Client and Cascadelake-Server - and got failures with
> all of them, but when I set the model to "host", all tests passed.
> 
> The tests get farmed out to a cluster of systems which have different
> CPUs - one is Broadwell, one is Skylake, one is Cascade Lake - so I
> think when I set the model to anything specific, it will match the host
> CPU on some or none of those systems, but never *all* of them, so the
> bug will always show up.
> 
> I have emailed the author and reviewer of
> 5185e7f9f3bd754ab60680814afd714e2673ef88 (also CCed on this mail) but
> have not heard back from them yet. I've sunk over a week into this bug
> at this point so it'd be great if someone could look at it. It's not
> the biggest regression in the world, but it is a bit awkward for our
> automated testing (I'll have to fiddle around to try and set CPU model
> 'host' for the most badly-affected tests but ensure we still have
> enough tests with 'nehalem' to confirm our baseline isn't moved).
> 
> Thanks, and happy new year!

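Can you please test the patch below? It maps the execmem cache using the
page order recorded for the vmalloc area (get_vm_area_page_order(vm) +
PAGE_SHIFT) instead of always assuming PMD_SHIFT.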

diff --git a/mm/execmem.c b/mm/execmem.c
index be6b234c032e..0090a6f422aa 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -266,6 +266,7 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
 	unsigned long vm_flags = VM_ALLOW_HUGE_VMAP;
 	struct execmem_area *area;
 	unsigned long start, end;
+	unsigned int page_shift;
 	struct vm_struct *vm;
 	size_t alloc_size;
 	int err = -ENOMEM;
@@ -296,8 +297,9 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
 	if (err)
 		goto err_free_mem;
 
+	page_shift = get_vm_area_page_order(vm) + PAGE_SHIFT;
 	err = vmap_pages_range_noflush(start, end, range->pgprot, vm->pages,
-				       PMD_SHIFT);
+				       page_shift);
 	if (err)
 		goto err_free_mem;
 
-- 
2.45.2


-- 
Sincerely yours,
Mike.