Date: Fri, 10 Jan 2025 11:57:03 +0200
From: Mike Rapoport
To: Adam Williamson
Cc: linux-kernel@vger.kernel.org, jforbes@redhat.com, mcgrof@kernel.org
Subject: Re: Kernel crash while doing chroot'ed grub2-mkconfig on qemu-emulated Nehalem CPU since late November 6.13 snapshot
References: <9a3c747052fb82274ab3c4a84eaf64c1273117ce.camel@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Adam,

On Thu, Jan 02, 2025 at 12:16:03PM -0800, Adam Williamson wrote:
> On Wed, 2024-12-11 at 08:51 -0800, Adam Williamson wrote:
> > Hi, folks. Please CC me on replies, I'm not subscribed to the list. The
> > downstream bug report for this is
> > https://bugzilla.redhat.com/show_bug.cgi?id=2329581 . I also filed
> > https://bugzilla.kernel.org/show_bug.cgi?id=219554 but it looks like
> > nobody is monitoring that ATM, hence this email. Sorry, I don't know
> > where to send it that would be more targeted.
> >
> > I maintain Fedora's openQA instance - https://openqa.fedoraproject.org/
> > (openQA is an automated testing system which runs jobs on qemu VMs,
> > inputting keyboard and mouse events via VNC, and monitoring results via
> > screenshots and the serial console).
> >
> > In openQA testing we've noticed a lot of failures of install tests
> > since kernel-6.13.0-0.rc0.20241125git9f16d5e6f220.8.fc42 landed in
> > Rawhide - that is, a snapshot of upstream git 9f16d5e6f220 . The
> > previous build, kernel-6.13.0-0.rc0.20241119git158f238aa69d.2.fc42 - a
> > snapshot of upstream 158f238aa69d - did not show this problem. The
> > problems persist with the latest kernel build,
> > kernel-6.13.0-0.rc2.22.fc42 (a build of 6.13 rc2 exactly).
> >
> > Both BIOS and UEFI x86_64 installs are frequently hitting kernel
> > crashes when the Fedora installer runs grub2-mkconfig as part of the
> > install process. In the BIOS case, this causes the system to hang
> > permanently. In the UEFI case, the system hangs for a while then
> > reboots, and fails to boot properly as the installation did not
> > complete.
> >
> > I've reproduced both BIOS and UEFI failures locally with a qemu VM
> > configured like the one we use in the affected tests: 2 vCPUs, 4G RAM,
> > and CPU model Nehalem - that's `-cpu Nehalem` argument to qemu. If I
> > use host CPU config instead, the bug doesn't happen. We intentionally
> > use the Nehalem model in this testing to ensure Fedora doesn't
> > inadvertently stop supporting the CPU baseline it intends to support.
> >
> > This happens on more than 50% of install attempts, but not all of them
> > (sometimes they work; I've set our test system to retry failures five
> > times for now to mitigate the effects of this bug).
> >
> > The details of the traces we get in the kernel logs differ between
> > occurrences and also between BIOS and UEFI, which someone suggested
> > indicate this may be some kind of memory corruption issue. But the
> > broad shape is consistent: the installer reaches grub2-mkconfig and we
> > get a kernel crash.
> >
> > I did also try reproducing this by running `grub2-mkconfig -o
> > /boot/grub/grub2.cfg` multiple times on an *installed* VM with the same
> > kernel and VM config, but could not trigger a crash in this case. There
> > must be something specific about how this happens in the installer
> > environment (for one thing, the installer runs the command chroot'ed
> > into the installed system environment).
> >
> > I'll attach sample logs from a UEFI failure and a BIOS failure.
> >
> > I haven't attempted to bisect this yet as I find bisecting kernel
> > issues pretty painful (the Fedora kernel package spec is a bit weird if
> > you're not used to it, building a full kernel takes a long time, I
> > don't know how to do intermittent builds with the Fedora kernel spec,
> > and since I can't yet reproduce this outside the installer I then have
> > to build an installer image with the kernel build in to test it...).
> > But if needs must I'll bite the bullet and do it. If anyone could e.g.
> > guess at a commit or commit series that might be causing this so I
> > could try a targeted reversion, though, that'd be great.
>
> Update on this: over the holidays, I bisected it to
> 5185e7f9f3bd754ab60680814afd714e2673ef88 . A kernel with that commit
> reverted does not hit the bug.
>
> I also did some testing with various CPU model configurations. I think
> this actually isn't to do with Nehalem per se, but "virtual machines
> where the CPU configuration does not exactly match the host", or
> something like that.
>
> I tried a bunch of qemu CPU model settings - nehalem, sandybridge,
> haswell, Skylake-Client and Cascadelake-Server - and got failures with
> all of them, but when I set the model to "host", all tests passed.
>
> The tests get farmed out to a cluster of systems which have different
> CPUs - one is Broadwell, one is Skylake, one is Cascade Lake - so I
> think when I set the model to anything specific, it will match the host
> CPU on some or none of those systems, but never *all* of them, so the
> bug will always show up.
>
> I have emailed the author and reviewer of
> 5185e7f9f3bd754ab60680814afd714e2673ef88 (also CCed on this mail) but
> have not heard back from them yet. I've sunk over a week into this bug
> at this point so it'd be great if someone could look at it. It's not
> the biggest regression in the world, but it is a bit awkward for our
> automated testing (I'll have to fiddle around to try and set CPU model
> 'host' for the most badly-affected tests but ensure we still have
> enough tests with 'nehalem' to confirm our baseline isn't moved).
>
> Thanks, and happy new year!

Can you please test this patch:

diff --git a/mm/execmem.c b/mm/execmem.c
index be6b234c032e..0090a6f422aa 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -266,6 +266,7 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
 	unsigned long vm_flags = VM_ALLOW_HUGE_VMAP;
 	struct execmem_area *area;
 	unsigned long start, end;
+	unsigned int page_shift;
 	struct vm_struct *vm;
 	size_t alloc_size;
 	int err = -ENOMEM;
@@ -296,8 +297,9 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
 	if (err)
 		goto err_free_mem;
 
+	page_shift = get_vm_area_page_order(vm) + PAGE_SHIFT;
 	err = vmap_pages_range_noflush(start, end, range->pgprot, vm->pages,
-				       PMD_SHIFT);
+				       page_shift);
 	if (err)
 		goto err_free_mem;
 
-- 
2.45.2

-- 
Sincerely yours,
Mike.
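
A note on the arithmetic in the patch, as a userspace sketch rather than kernel code. The patch stops passing a fixed PMD_SHIFT to vmap_pages_range_noflush() and instead derives the shift from the page order recorded for the area. One plausible reading, which the mail itself does not state, is that the area can end up backed by base pages rather than PMD-sized pages, in which case PMD_SHIFT would describe the backing pages incorrectly. In the sketch below, struct vm_model and the model_* helpers are hypothetical stand-ins, and the constants are the x86_64 values for 4 KiB base pages.

```c
#include <assert.h>

/* x86_64 values with 4 KiB base pages; other configurations differ. */
#define PAGE_SHIFT 12
#define PMD_SHIFT  21

/*
 * Hypothetical stand-in for the kernel's struct vm_struct, reduced to
 * the one field this sketch needs.
 */
struct vm_model {
	unsigned int page_order;	/* log2 of base pages per backing page */
};

/* Models get_vm_area_page_order(): the order the area was allocated with. */
static unsigned int model_page_order(const struct vm_model *vm)
{
	return vm->page_order;
}

/* The shift the patched code computes before mapping the pages. */
static unsigned int model_page_shift(const struct vm_model *vm)
{
	return model_page_order(vm) + PAGE_SHIFT;
}
```

With page_order equal to PMD_SHIFT - PAGE_SHIFT the result is the same PMD_SHIFT the old code hard-wired; with page_order 0 it is PAGE_SHIFT, which the old code could never pass.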