kvm.vger.kernel.org archive mirror
* [PATCH 00/11] x86: 32-bit cleanups
@ 2024-12-04 10:30 Arnd Bergmann
  2024-12-04 10:30 ` [PATCH 01/11] x86/Kconfig: Geode CPU has cmpxchg8b Arnd Bergmann
                   ` (10 more replies)
  0 siblings, 11 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

While looking at 32-bit arm cleanups, I came across some related topics
on x86 and ended up making a series for those as well.

Primarily this is about running 32-bit kernels on 64-bit hardware,
which usually works but should probably be discouraged more clearly by
only providing support for features that are used on real 32-bit hardware:

I found only a few 2003-era high-end servers (HP DL740 and DL760 G2)
that were the only possible remaining users of HIGHMEM64G and BIGSMP after
the removal of 32-bit NUMA machines in 2014. Similarly, there is only
one generation of 32-bit-only hardware with support for VT-x.  All these
features can be removed without hurting users.

In the CPU selection menu, building a 32-bit kernel optimized for AMD K8
or Intel Core 2 is anachronistic, so only 32-bit CPU types are now
offered as optimization targets. The "generic" target on 64-bit
turned out to be slightly broken, so I included a fix for that as well,
replacing the compiler default target with an explicit selection
between the useful levels.

Arnd Bergmann (11):
  x86/Kconfig: Geode CPU has cmpxchg8b
  x86: drop 32-bit "bigsmp" machine support
  x86: Kconfig.cpu: split out 64-bit atom
  x86: split CPU selection into 32-bit and 64-bit
  x86: remove HIGHMEM64G support
  x86: drop SWIOTLB and PHYS_ADDR_T_64BIT for PAE
  x86: drop support for CONFIG_HIGHPTE
  x86: document X86_INTEL_MID as 64-bit-only
  x86: rework CONFIG_GENERIC_CPU compiler flags
  x86: remove old STA2x11 support
  x86: drop 32-bit KVM host support

 Documentation/admin-guide/kdump/kdump.rst     |   4 -
 .../admin-guide/kernel-parameters.txt         |  11 -
 Documentation/arch/x86/usb-legacy-support.rst |  11 +-
 arch/x86/Kconfig                              | 119 ++-------
 arch/x86/Kconfig.cpu                          | 130 +++++++---
 arch/x86/Makefile                             |  10 +-
 arch/x86/Makefile_32.cpu                      |   3 +-
 arch/x86/configs/xen.config                   |   2 -
 arch/x86/include/asm/page_32_types.h          |   4 +-
 arch/x86/include/asm/pgalloc.h                |   5 -
 arch/x86/include/asm/sta2x11.h                |  13 -
 arch/x86/include/asm/vermagic.h               |   4 -
 arch/x86/kernel/apic/Makefile                 |   3 -
 arch/x86/kernel/apic/apic.c                   |   3 -
 arch/x86/kernel/apic/bigsmp_32.c              | 105 --------
 arch/x86/kernel/apic/local.h                  |  13 -
 arch/x86/kernel/apic/probe_32.c               |  29 ---
 arch/x86/kernel/head32.c                      |   3 -
 arch/x86/kvm/Kconfig                          |   6 +-
 arch/x86/kvm/Makefile                         |   4 +-
 arch/x86/kvm/cpuid.c                          |   9 +-
 arch/x86/kvm/emulate.c                        |  34 +--
 arch/x86/kvm/fpu.h                            |   4 -
 arch/x86/kvm/hyperv.c                         |   5 +-
 arch/x86/kvm/i8254.c                          |   4 -
 arch/x86/kvm/kvm_cache_regs.h                 |   2 -
 arch/x86/kvm/kvm_emulate.h                    |   8 -
 arch/x86/kvm/lapic.c                          |   4 -
 arch/x86/kvm/mmu.h                            |   4 -
 arch/x86/kvm/mmu/mmu.c                        | 134 ----------
 arch/x86/kvm/mmu/mmu_internal.h               |   9 -
 arch/x86/kvm/mmu/paging_tmpl.h                |   9 -
 arch/x86/kvm/mmu/spte.h                       |   5 -
 arch/x86/kvm/mmu/tdp_mmu.h                    |   4 -
 arch/x86/kvm/smm.c                            |  19 --
 arch/x86/kvm/svm/sev.c                        |   2 -
 arch/x86/kvm/svm/svm.c                        |  23 +-
 arch/x86/kvm/svm/vmenter.S                    |  20 --
 arch/x86/kvm/trace.h                          |   4 -
 arch/x86/kvm/vmx/main.c                       |   2 -
 arch/x86/kvm/vmx/nested.c                     |  24 +-
 arch/x86/kvm/vmx/vmcs.h                       |   2 -
 arch/x86/kvm/vmx/vmenter.S                    |  25 +-
 arch/x86/kvm/vmx/vmx.c                        | 117 +--------
 arch/x86/kvm/vmx/vmx.h                        |  23 +-
 arch/x86/kvm/vmx/vmx_ops.h                    |   7 -
 arch/x86/kvm/vmx/x86_ops.h                    |   2 -
 arch/x86/kvm/x86.c                            |  74 +-----
 arch/x86/kvm/x86.h                            |   4 -
 arch/x86/kvm/xen.c                            |  61 ++---
 arch/x86/mm/init_32.c                         |   9 +-
 arch/x86/mm/pgtable.c                         |  29 +--
 arch/x86/pci/Makefile                         |   2 -
 arch/x86/pci/sta2x11-fixup.c                  | 233 ------------------
 include/linux/mm.h                            |   2 +-
 55 files changed, 185 insertions(+), 1216 deletions(-)
 delete mode 100644 arch/x86/include/asm/sta2x11.h
 delete mode 100644 arch/x86/kernel/apic/bigsmp_32.c
 delete mode 100644 arch/x86/pci/sta2x11-fixup.c

-- 
2.39.5

To: x86@kernel.org 
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sean Christopherson <seanjc@google.com> 
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 01/11] x86/Kconfig: Geode CPU has cmpxchg8b
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 10:30 ` [PATCH 02/11] x86: drop 32-bit "bigsmp" machine support Arnd Bergmann
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm, stable

From: Arnd Bergmann <arnd@arndb.de>

An older cleanup of mine inadvertently removed geode-gx1 and geode-lx
from the list of CPUs that are known to support a working cmpxchg8b.

Fixes: 88a2b4edda3d ("x86/Kconfig: Rework CONFIG_X86_PAE dependency")
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/x86/Kconfig.cpu | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 2a7279d80460..42e6a40876ea 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -368,7 +368,7 @@ config X86_HAVE_PAE
 
 config X86_CMPXCHG64
 	def_bool y
-	depends on X86_HAVE_PAE || M586TSC || M586MMX || MK6 || MK7
+	depends on X86_HAVE_PAE || M586TSC || M586MMX || MK6 || MK7 || MGEODEGX1 || MGEODE_LX
 
 # this should be set for all -march=.. options where the compiler
 # generates cmov.
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 02/11] x86: drop 32-bit "bigsmp" machine support
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
  2024-12-04 10:30 ` [PATCH 01/11] x86/Kconfig: Geode CPU has cmpxchg8b Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 10:30 ` [PATCH 03/11] x86: Kconfig.cpu: split out 64-bit atom Arnd Bergmann
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

The x86-32 kernel used to support multiple platforms with more than eight
logical CPUs, from the 1999-2003 timeframe: Sequent NUMA-Q, IBM Summit,
Unisys ES7000 and HP F8. Support for all except the latter was dropped
back in 2014, leaving only the F8 based DL740 and DL760 G2 machines in
this category, with up to eight single-core Socket-603 Xeon-MP processors
with hyperthreading.

Like the already removed machines, the HP F8 servers cost upwards
of $100k in typical configurations, but were quickly obsoleted by
their 64-bit Socket-604 cousins and the AMD Opteron.

Earlier servers with up to eight Pentium Pro or Xeon processors remain
fully supported as they had no hyperthreading. Similarly, the more
common four-socket Xeon-MP machines with hyperthreading that use Intel
or ServerWorks chipsets continue to work without bigsmp, and all
multi-core Xeon processors can also run 64-bit kernels.

While the "bigsmp" support can also be used to run on later 64-bit
machines (including VM guests), it seems best to discourage that
and get any remaining users to update their kernels to 64-bit builds
on these. As a side-effect of this, there is also no more need to
support NUMA configurations on 32-bit x86, as all true 32-bit
NUMA platforms are already gone.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 .../admin-guide/kernel-parameters.txt         |   4 -
 arch/x86/Kconfig                              |  20 +---
 arch/x86/kernel/apic/Makefile                 |   3 -
 arch/x86/kernel/apic/apic.c                   |   3 -
 arch/x86/kernel/apic/bigsmp_32.c              | 105 ------------------
 arch/x86/kernel/apic/local.h                  |  13 ---
 arch/x86/kernel/apic/probe_32.c               |  29 -----
 7 files changed, 4 insertions(+), 173 deletions(-)
 delete mode 100644 arch/x86/kernel/apic/bigsmp_32.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index dc663c0ca670..eca370e99844 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -410,10 +410,6 @@
 			Format: { quiet (default) | verbose | debug }
 			Change the amount of debugging information output
 			when initialising the APIC and IO-APIC components.
-			For X86-32, this can also be used to specify an APIC
-			driver name.
-			Format: apic=driver_name
-			Examples: apic=bigsmp
 
 	apic_extnmi=	[APIC,X86,EARLY] External NMI delivery setting
 			Format: { bsp (default) | all | none }
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9d7bd0ae48c4..42494739344d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -526,12 +526,6 @@ config X86_FRED
 	  ring transitions and exception/interrupt handling if the
 	  system supports it.
 
-config X86_BIGSMP
-	bool "Support for big SMP systems with more than 8 CPUs"
-	depends on SMP && X86_32
-	help
-	  This option is needed for the systems that have more than 8 CPUs.
-
 config X86_EXTENDED_PLATFORM
 	bool "Support for extended (non-PC) x86 platforms"
 	default y
@@ -730,8 +724,8 @@ config X86_32_NON_STANDARD
 	depends on X86_32 && SMP
 	depends on X86_EXTENDED_PLATFORM
 	help
-	  This option compiles in the bigsmp and STA2X11 default
-	  subarchitectures.  It is intended for a generic binary
+	  This option compiles in the STA2X11 default
+	  subarchitecture.  It is intended for a generic binary
 	  kernel. If you select them all, kernel will probe it one by
 	  one and will fallback to default.
 
@@ -1008,8 +1002,7 @@ config NR_CPUS_RANGE_BEGIN
 config NR_CPUS_RANGE_END
 	int
 	depends on X86_32
-	default   64 if  SMP &&  X86_BIGSMP
-	default    8 if  SMP && !X86_BIGSMP
+	default    8 if  SMP
 	default    1 if !SMP
 
 config NR_CPUS_RANGE_END
@@ -1022,7 +1015,6 @@ config NR_CPUS_RANGE_END
 config NR_CPUS_DEFAULT
 	int
 	depends on X86_32
-	default   32 if  X86_BIGSMP
 	default    8 if  SMP
 	default    1 if !SMP
 
@@ -1568,8 +1560,7 @@ config AMD_MEM_ENCRYPT
 config NUMA
 	bool "NUMA Memory Allocation and Scheduler Support"
 	depends on SMP
-	depends on X86_64 || (X86_32 && HIGHMEM64G && X86_BIGSMP)
-	default y if X86_BIGSMP
+	depends on X86_64
 	select USE_PERCPU_NUMA_NODE_ID
 	select OF_NUMA if OF
 	help
@@ -1582,9 +1573,6 @@ config NUMA
 	  For 64-bit this is recommended if the system is Intel Core i7
 	  (or later), AMD Opteron, or EM64T NUMA.
 
-	  For 32-bit this is only needed if you boot a 32-bit
-	  kernel on a 64-bit NUMA platform.
-
 	  Otherwise, you should say N.
 
 config AMD_NUMA
diff --git a/arch/x86/kernel/apic/Makefile b/arch/x86/kernel/apic/Makefile
index 3bf0487cf3b7..52d1808ee360 100644
--- a/arch/x86/kernel/apic/Makefile
+++ b/arch/x86/kernel/apic/Makefile
@@ -23,8 +23,5 @@ obj-$(CONFIG_X86_X2APIC)	+= x2apic_cluster.o
 obj-y				+= apic_flat_64.o
 endif
 
-# APIC probe will depend on the listing order here
-obj-$(CONFIG_X86_BIGSMP)	+= bigsmp_32.o
-
 # For 32bit, probe_32 need to be listed last
 obj-$(CONFIG_X86_LOCAL_APIC)	+= probe_$(BITS).o
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index c5fb28e6451a..cb453bacf281 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1371,8 +1371,6 @@ void __init apic_intr_mode_init(void)
 
 	x86_64_probe_apic();
 
-	x86_32_install_bigsmp();
-
 	if (x86_platform.apic_post_init)
 		x86_platform.apic_post_init();
 
@@ -1674,7 +1672,6 @@ static __init void apic_read_boot_cpu_id(bool x2apic)
 		boot_cpu_apic_version = GET_APIC_VERSION(apic_read(APIC_LVR));
 	}
 	topology_register_boot_apic(boot_cpu_physical_apicid);
-	x86_32_probe_bigsmp_early();
 }
 
 #ifdef CONFIG_X86_X2APIC
diff --git a/arch/x86/kernel/apic/bigsmp_32.c b/arch/x86/kernel/apic/bigsmp_32.c
deleted file mode 100644
index 9285d500d5b4..000000000000
--- a/arch/x86/kernel/apic/bigsmp_32.c
+++ /dev/null
@@ -1,105 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * APIC driver for "bigsmp" xAPIC machines with more than 8 virtual CPUs.
- *
- * Drives the local APIC in "clustered mode".
- */
-#include <linux/cpumask.h>
-#include <linux/dmi.h>
-#include <linux/smp.h>
-
-#include <asm/apic.h>
-#include <asm/io_apic.h>
-
-#include "local.h"
-
-static u32 bigsmp_get_apic_id(u32 x)
-{
-	return (x >> 24) & 0xFF;
-}
-
-static void bigsmp_send_IPI_allbutself(int vector)
-{
-	default_send_IPI_mask_allbutself_phys(cpu_online_mask, vector);
-}
-
-static void bigsmp_send_IPI_all(int vector)
-{
-	default_send_IPI_mask_sequence_phys(cpu_online_mask, vector);
-}
-
-static int dmi_bigsmp; /* can be set by dmi scanners */
-
-static int hp_ht_bigsmp(const struct dmi_system_id *d)
-{
-	printk(KERN_NOTICE "%s detected: force use of apic=bigsmp\n", d->ident);
-	dmi_bigsmp = 1;
-
-	return 0;
-}
-
-
-static const struct dmi_system_id bigsmp_dmi_table[] = {
-	{ hp_ht_bigsmp, "HP ProLiant DL760 G2",
-		{	DMI_MATCH(DMI_BIOS_VENDOR, "HP"),
-			DMI_MATCH(DMI_BIOS_VERSION, "P44-"),
-		}
-	},
-
-	{ hp_ht_bigsmp, "HP ProLiant DL740",
-		{	DMI_MATCH(DMI_BIOS_VENDOR, "HP"),
-			DMI_MATCH(DMI_BIOS_VERSION, "P47-"),
-		}
-	},
-	{ } /* NULL entry stops DMI scanning */
-};
-
-static int probe_bigsmp(void)
-{
-	return dmi_check_system(bigsmp_dmi_table);
-}
-
-static struct apic apic_bigsmp __ro_after_init = {
-
-	.name				= "bigsmp",
-	.probe				= probe_bigsmp,
-
-	.dest_mode_logical		= false,
-
-	.disable_esr			= 1,
-
-	.cpu_present_to_apicid		= default_cpu_present_to_apicid,
-
-	.max_apic_id			= 0xFE,
-	.get_apic_id			= bigsmp_get_apic_id,
-
-	.calc_dest_apicid		= apic_default_calc_apicid,
-
-	.send_IPI			= default_send_IPI_single_phys,
-	.send_IPI_mask			= default_send_IPI_mask_sequence_phys,
-	.send_IPI_mask_allbutself	= NULL,
-	.send_IPI_allbutself		= bigsmp_send_IPI_allbutself,
-	.send_IPI_all			= bigsmp_send_IPI_all,
-	.send_IPI_self			= default_send_IPI_self,
-
-	.read				= native_apic_mem_read,
-	.write				= native_apic_mem_write,
-	.eoi				= native_apic_mem_eoi,
-	.icr_read			= native_apic_icr_read,
-	.icr_write			= native_apic_icr_write,
-	.wait_icr_idle			= apic_mem_wait_icr_idle,
-	.safe_wait_icr_idle		= apic_mem_wait_icr_idle_timeout,
-};
-
-bool __init apic_bigsmp_possible(bool cmdline_override)
-{
-	return apic == &apic_bigsmp || !cmdline_override;
-}
-
-void __init apic_bigsmp_force(void)
-{
-	if (apic != &apic_bigsmp)
-		apic_install_driver(&apic_bigsmp);
-}
-
-apic_driver(apic_bigsmp);
diff --git a/arch/x86/kernel/apic/local.h b/arch/x86/kernel/apic/local.h
index 842fe28496be..bdcf609eb283 100644
--- a/arch/x86/kernel/apic/local.h
+++ b/arch/x86/kernel/apic/local.h
@@ -65,17 +65,4 @@ void default_send_IPI_self(int vector);
 void default_send_IPI_mask_sequence_logical(const struct cpumask *mask, int vector);
 void default_send_IPI_mask_allbutself_logical(const struct cpumask *mask, int vector);
 void default_send_IPI_mask_logical(const struct cpumask *mask, int vector);
-void x86_32_probe_bigsmp_early(void);
-void x86_32_install_bigsmp(void);
-#else
-static inline void x86_32_probe_bigsmp_early(void) { }
-static inline void x86_32_install_bigsmp(void) { }
-#endif
-
-#ifdef CONFIG_X86_BIGSMP
-bool apic_bigsmp_possible(bool cmdline_selected);
-void apic_bigsmp_force(void);
-#else
-static inline bool apic_bigsmp_possible(bool cmdline_selected) { return false; };
-static inline void apic_bigsmp_force(void) { }
 #endif
diff --git a/arch/x86/kernel/apic/probe_32.c b/arch/x86/kernel/apic/probe_32.c
index f75ee345c02d..87bc9e7ca5d6 100644
--- a/arch/x86/kernel/apic/probe_32.c
+++ b/arch/x86/kernel/apic/probe_32.c
@@ -93,35 +93,6 @@ static int __init parse_apic(char *arg)
 }
 early_param("apic", parse_apic);
 
-void __init x86_32_probe_bigsmp_early(void)
-{
-	if (nr_cpu_ids <= 8 || xen_pv_domain())
-		return;
-
-	if (IS_ENABLED(CONFIG_X86_BIGSMP)) {
-		switch (boot_cpu_data.x86_vendor) {
-		case X86_VENDOR_INTEL:
-			if (!APIC_XAPIC(boot_cpu_apic_version))
-				break;
-			/* P4 and above */
-			fallthrough;
-		case X86_VENDOR_HYGON:
-		case X86_VENDOR_AMD:
-			if (apic_bigsmp_possible(cmdline_apic))
-				return;
-			break;
-		}
-	}
-	pr_info("Limiting to 8 possible CPUs\n");
-	set_nr_cpu_ids(8);
-}
-
-void __init x86_32_install_bigsmp(void)
-{
-	if (nr_cpu_ids > 8 && !xen_pv_domain())
-		apic_bigsmp_force();
-}
-
 void __init x86_32_probe_apic(void)
 {
 	if (!cmdline_apic) {
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 03/11] x86: Kconfig.cpu: split out 64-bit atom
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
  2024-12-04 10:30 ` [PATCH 01/11] x86/Kconfig: Geode CPU has cmpxchg8b Arnd Bergmann
  2024-12-04 10:30 ` [PATCH 02/11] x86: drop 32-bit "bigsmp" machine support Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 13:16   ` Thomas Gleixner
  2024-12-04 10:30 ` [PATCH 04/11] x86: split CPU selection into 32-bit and 64-bit Arnd Bergmann
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

Both 32-bit and 64-bit builds allow optimizing using "-march=atom", but
this is somewhat suboptimal, as gcc and clang use this option to refer
to the original in-order "Bonnell" microarchitecture used in the early
"Diamondville" and "Silverthorne" processors that were mostly 32-bit only.

The later 22nm "Silvermont" architecture saw a significant redesign to
an out-of-order architecture that is reflected in the -mtune=silvermont
flag in the compilers, and all of these are 64-bit capable. Variations
of this microarchitecture were in CPUs launched from 2014 to 2021 and
are still common in 2024.

Split this up so that 32-bit targets keep building with -march=atom,
but 64-bit ones get the more useful silvermont optimization. On modern
"tremont" and newer CPUs, using -mtune=generic or -march=tremont
would be even better, but the silvermont optimization is still an
improvement.
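Condensed from the Makefile changes in this patch, the resulting logic
can be pictured as (simplified sketch; the real code lives in
arch/x86/Makefile and arch/x86/Makefile_32.cpu):

```make
# Each bitness now picks its own -march: the in-order Bonnell target
# is 32-bit-only, while 64-bit Atom kernels optimize for Silvermont.
ifeq ($(CONFIG_X86_32),y)
cflags-$(CONFIG_MATOM)       += -march=atom        # Bonnell, in-order
else
cflags-$(CONFIG_MSILVERMONT) += -march=silvermont  # out-of-order, 64-bit
endif
```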

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/x86/Kconfig.cpu     | 28 ++++++++++++++++++++--------
 arch/x86/Makefile        |  4 ++--
 arch/x86/Makefile_32.cpu |  3 +--
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 42e6a40876ea..05a3f57ac20b 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -155,6 +155,15 @@ config MPENTIUM4
 		-Paxville
 		-Dempsey
 
+config MATOM
+	bool "Intel Atom (Bonnell)"
+	help
+
+	  Select this for the Intel Atom platform. Intel Atom CPUs have an
+	  in-order pipelining architecture and thus can benefit from
+	  accordingly optimized code.
+	  This includes all the 32-bit-only Atom chips such as N2xx and
+	  Z5xx/Z6xx.
 
 config MK6
 	bool "K6/K6-II/K6-III"
@@ -278,14 +287,17 @@ config MCORE2
 	  family in /proc/cpuinfo. Newer ones have 6 and older ones 15
 	  (not a typo)
 
-config MATOM
-	bool "Intel Atom"
+config MSILVERMONT
+	bool "Intel Atom (Silvermont/Goldmont)"
+	depends on X86_64
 	help
-
-	  Select this for the Intel Atom platform. Intel Atom CPUs have an
-	  in-order pipelining architecture and thus can benefit from
-	  accordingly optimized code. Use a recent GCC with specific Atom
-	  support in order to fully benefit from selecting this option.
+	  Select this to optimize for the 64-bit Intel Atom platform
+	  of the 22nm Silvermont microarchitecture and its 14nm
+	  Goldmont shrink (e.g. Atom C2xxx, Atom Z3xxx, Celeron
+	  N2xxx/J1xxx, Pentium N3xxx/J2xxx).
+	  Kernels built with this option are incompatible with very
+	  early Atom CPUs based on the Bonnell microarchitecture,
+	  such as Atom 230/330, D4xx/D5xx, D2xxx, N2xxx or Z2xxx.
 
 config GENERIC_CPU
 	bool "Generic-x86-64"
@@ -318,7 +330,7 @@ config X86_INTERNODE_CACHE_SHIFT
 config X86_L1_CACHE_SHIFT
 	int
 	default "7" if MPENTIUM4 || MPSC
-	default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
+	default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MSILVERMONT || MVIAC7 || X86_GENERIC || GENERIC_CPU
 	default "4" if MELAN || M486SX || M486 || MGEODEGX1
 	default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
 
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5b773b34768d..05887ae282f5 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -182,14 +182,14 @@ else
         cflags-$(CONFIG_MK8)		+= -march=k8
         cflags-$(CONFIG_MPSC)		+= -march=nocona
         cflags-$(CONFIG_MCORE2)		+= -march=core2
-        cflags-$(CONFIG_MATOM)		+= -march=atom
+        cflags-$(CONFIG_MSILVERMONT)	+= -march=silvermont
         cflags-$(CONFIG_GENERIC_CPU)	+= -mtune=generic
         KBUILD_CFLAGS += $(cflags-y)
 
         rustflags-$(CONFIG_MK8)		+= -Ctarget-cpu=k8
         rustflags-$(CONFIG_MPSC)	+= -Ctarget-cpu=nocona
         rustflags-$(CONFIG_MCORE2)	+= -Ctarget-cpu=core2
-        rustflags-$(CONFIG_MATOM)	+= -Ctarget-cpu=atom
+        rustflags-$(CONFIG_MSILVERMONT)	+= -Ctarget-cpu=silvermont
         rustflags-$(CONFIG_GENERIC_CPU)	+= -Ztune-cpu=generic
         KBUILD_RUSTFLAGS += $(rustflags-y)
 
diff --git a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu
index 94834c4b5e5e..0adc3a59520a 100644
--- a/arch/x86/Makefile_32.cpu
+++ b/arch/x86/Makefile_32.cpu
@@ -33,8 +33,7 @@ cflags-$(CONFIG_MCYRIXIII)	+= $(call cc-option,-march=c3,-march=i486) $(align)
 cflags-$(CONFIG_MVIAC3_2)	+= $(call cc-option,-march=c3-2,-march=i686)
 cflags-$(CONFIG_MVIAC7)		+= -march=i686
 cflags-$(CONFIG_MCORE2)		+= -march=i686 $(call tune,core2)
-cflags-$(CONFIG_MATOM)		+= $(call cc-option,-march=atom,$(call cc-option,-march=core2,-march=i686)) \
-	$(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
+cflags-$(CONFIG_MATOM)		+= -march=atom
 
 # AMD Elan support
 cflags-$(CONFIG_MELAN)		+= -march=i486
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 04/11] x86: split CPU selection into 32-bit and 64-bit
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
                   ` (2 preceding siblings ...)
  2024-12-04 10:30 ` [PATCH 03/11] x86: Kconfig.cpu: split out 64-bit atom Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 18:31   ` Andy Shevchenko
  2024-12-04 10:30 ` [PATCH 05/11] x86: remove HIGHMEM64G support Arnd Bergmann
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

The x86 CPU selection menu is confusing for a number of reasons.
One of them is how it's possible to build a 32-bit kernel for
a small number of early 64-bit microarchitectures (K8, Core2)
but not the regular generic 64-bit target that is the normal
default.

There is no longer a reason to run 32-bit kernels on production
64-bit systems, so simplify the configuration menu by completely
splitting the two into 32-bit-only and 64-bit-only machines.

Testing generic 32-bit kernels on 64-bit hardware remains
possible, just not building a 32-bit kernel that requires
a 64-bit CPU.
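Abridged from the Kconfig.cpu diff in this patch, the new structure is
two mutually exclusive choice blocks, each gated on the kernel bitness
(sketch only; the full option lists are in the diff):

```
choice
	prompt "x86-32 Processor family"
	depends on X86_32
	default M686
	# ... M486SX through MVIAC7, MATOM ...
endchoice

choice
	prompt "x86-64 Processor family"
	depends on X86_64
	default GENERIC_CPU
	# ... MK8, MPSC, MCORE2, MSILVERMONT, GENERIC_CPU ...
endchoice
```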

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/x86/Kconfig.cpu            | 65 ++++++++++++++++++++-------------
 arch/x86/include/asm/vermagic.h |  4 --
 2 files changed, 40 insertions(+), 29 deletions(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 05a3f57ac20b..139db904e564 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -1,9 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
 # Put here option for CPU selection and depending optimization
 choice
-	prompt "Processor family"
-	default M686 if X86_32
-	default GENERIC_CPU if X86_64
+	prompt "x86-32 Processor family"
+	depends on X86_32
+	default M686
 	help
 	  This is the processor type of your CPU. This information is
 	  used for optimizing purposes. In order to compile a kernel
@@ -31,7 +31,6 @@ choice
 	  - "Pentium-4" for the Intel Pentium 4 or P4-based Celeron.
 	  - "K6" for the AMD K6, K6-II and K6-III (aka K6-3D).
 	  - "Athlon" for the AMD K7 family (Athlon/Duron/Thunderbird).
-	  - "Opteron/Athlon64/Hammer/K8" for all K8 and newer AMD CPUs.
 	  - "Crusoe" for the Transmeta Crusoe series.
 	  - "Efficeon" for the Transmeta Efficeon series.
 	  - "Winchip-C6" for original IDT Winchip.
@@ -42,13 +41,10 @@ choice
 	  - "CyrixIII/VIA C3" for VIA Cyrix III or VIA C3.
 	  - "VIA C3-2" for VIA C3-2 "Nehemiah" (model 9 and above).
 	  - "VIA C7" for VIA C7.
-	  - "Intel P4" for the Pentium 4/Netburst microarchitecture.
-	  - "Core 2/newer Xeon" for all core2 and newer Intel CPUs.
 	  - "Intel Atom" for the Atom-microarchitecture CPUs.
-	  - "Generic-x86-64" for a kernel which runs on any x86-64 CPU.
 
 	  See each option's help text for additional details. If you don't know
-	  what to do, choose "486".
+	  what to do, choose "Pentium-Pro".
 
 config M486SX
 	bool "486SX"
@@ -114,11 +110,11 @@ config MPENTIUMIII
 	  extensions.
 
 config MPENTIUMM
-	bool "Pentium M"
+	bool "Pentium M/Pentium Dual Core/Core Solo/Core Duo"
 	depends on X86_32
 	help
 	  Select this for Intel Pentium M (not Pentium-4 M)
-	  notebook chips.
+	  "Merom" Core Solo/Duo notebook chips
 
 config MPENTIUM4
 	bool "Pentium-4/Celeron(P4-based)/Pentium-4 M/older Xeon"
@@ -181,13 +177,6 @@ config MK7
 	  some extended instructions, and passes appropriate optimization
 	  flags to GCC.
 
-config MK8
-	bool "Opteron/Athlon64/Hammer/K8"
-	help
-	  Select this for an AMD Opteron or Athlon64 Hammer-family processor.
-	  Enables use of some extended instructions, and passes appropriate
-	  optimization flags to GCC.
-
 config MCRUSOE
 	bool "Crusoe"
 	depends on X86_32
@@ -266,10 +255,37 @@ config MVIAC7
 	help
 	  Select this for a VIA C7.  Selecting this uses the correct cache
 	  shift and tells gcc to treat the CPU as a 686.
+endchoice
+
+choice
+	prompt "x86-64 Processor family"
+	depends on X86_64
+	default GENERIC_CPU
+	help
+	  This is the processor type of your CPU. This information is
+	  used for optimizing purposes. In order to compile a kernel
+	  that can run on all supported x86 CPU types (albeit not
+	  optimally fast), you can specify "Generic-x86-64" here.
+
+	  Here are the settings recommended for greatest speed:
+	  - "Opteron/Athlon64/Hammer/K8" for all K8 and newer AMD CPUs.
+	  - "Intel P4" for the Pentium 4/Netburst microarchitecture.
+	  - "Core 2/newer Xeon" for all core2 and newer Intel CPUs.
+	  - "Intel Atom" for the Atom-microarchitecture CPUs.
+	  - "Generic-x86-64" for a kernel which runs on any x86-64 CPU.
+
+	  See each option's help text for additional details. If you don't know
+	  what to do, choose "Generic-x86-64".
+
+config MK8
+	bool "Opteron/Athlon64/Hammer/K8"
+	help
+	  Select this for an AMD Opteron or Athlon64 Hammer-family processor.
+	  Enables use of some extended instructions, and passes appropriate
+	  optimization flags to GCC.
 
 config MPSC
 	bool "Intel P4 / older Netburst based Xeon"
-	depends on X86_64
 	help
 	  Optimize for Intel Pentium 4, Pentium D and older Nocona/Dempsey
 	  Xeon CPUs with Intel 64bit which is compatible with x86-64.
@@ -281,7 +297,6 @@ config MPSC
 config MCORE2
 	bool "Core 2/newer Xeon"
 	help
-
 	  Select this for Intel Core 2 and newer Core 2 Xeons (Xeon 51xx and
 	  53xx) CPUs. You can distinguish newer from older Xeons by the CPU
 	  family in /proc/cpuinfo. Newer ones have 6 and older ones 15
@@ -348,11 +363,11 @@ config X86_ALIGNMENT_16
 
 config X86_INTEL_USERCOPY
 	def_bool y
-	depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK7 || MEFFICEON || MCORE2
+	depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK7 || MEFFICEON
 
 config X86_USE_PPRO_CHECKSUM
 	def_bool y
-	depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MATOM
+	depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MATOM
 
 #
 # P6_NOPs are a relatively minor optimization that require a family >=
@@ -372,11 +387,11 @@ config X86_P6_NOP
 
 config X86_TSC
 	def_bool y
-	depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MATOM) || X86_64
+	depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MATOM) || X86_64
 
 config X86_HAVE_PAE
 	def_bool y
-	depends on MCRUSOE || MEFFICEON || MCYRIXIII || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC7 || MCORE2 || MATOM || X86_64
+	depends on MCRUSOE || MEFFICEON || MCYRIXIII || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC7 || MATOM || X86_64
 
 config X86_CMPXCHG64
 	def_bool y
@@ -386,12 +401,12 @@ config X86_CMPXCHG64
 # generates cmov.
 config X86_CMOV
 	def_bool y
-	depends on (MK8 || MK7 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MATOM || MGEODE_LX)
+	depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || MATOM || MGEODE_LX || X86_64)
 
 config X86_MINIMUM_CPU_FAMILY
 	int
 	default "64" if X86_64
-	default "6" if X86_32 && (MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MATOM || MCORE2 || MK7 || MK8)
+	default "6" if X86_32 && (MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MEFFICEON || MATOM || MK7)
 	default "5" if X86_32 && X86_CMPXCHG64
 	default "4"
 
diff --git a/arch/x86/include/asm/vermagic.h b/arch/x86/include/asm/vermagic.h
index 75884d2cdec3..5d471253c755 100644
--- a/arch/x86/include/asm/vermagic.h
+++ b/arch/x86/include/asm/vermagic.h
@@ -15,8 +15,6 @@
 #define MODULE_PROC_FAMILY "586TSC "
 #elif defined CONFIG_M586MMX
 #define MODULE_PROC_FAMILY "586MMX "
-#elif defined CONFIG_MCORE2
-#define MODULE_PROC_FAMILY "CORE2 "
 #elif defined CONFIG_MATOM
 #define MODULE_PROC_FAMILY "ATOM "
 #elif defined CONFIG_M686
@@ -33,8 +31,6 @@
 #define MODULE_PROC_FAMILY "K6 "
 #elif defined CONFIG_MK7
 #define MODULE_PROC_FAMILY "K7 "
-#elif defined CONFIG_MK8
-#define MODULE_PROC_FAMILY "K8 "
 #elif defined CONFIG_MELAN
 #define MODULE_PROC_FAMILY "ELAN "
 #elif defined CONFIG_MCRUSOE
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
                   ` (3 preceding siblings ...)
  2024-12-04 10:30 ` [PATCH 04/11] x86: split CPU selection into 32-bit and 64-bit Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 13:29   ` Brian Gerst
  2025-04-11 23:44   ` Dave Hansen
  2024-12-04 10:30 ` [PATCH 06/11] x86: drop SWIOTLB and PHYS_ADDR_T_64BIT for PAE Arnd Bergmann
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

The HIGHMEM64G support was added in linux-2.3.25 to support (then)
high-end Pentium Pro and Pentium III Xeon servers with more than 4GB of
addressing, at a time when NUMA and PCI-X slots started appearing.

I have found no evidence of this ever being used in regular dual-socket
servers or consumer devices; all the remaining users seem obsolete these
days, even by i386 standards:

 - Support for NUMA servers (NUMA-Q, IBM x440, unisys) was already
   removed ten years ago.

 - 4+ socket non-NUMA servers based on Intel 450GX/450NX, HP F8 and
   ServerWorks ServerSet/GrandChampion could theoretically still work
   with 8GB, but these were exceptionally rare even 20 years ago and
   would usually have been equipped with less than the maximum amount
   of RAM.

 - Some SKUs of the Celeron D from 2004 had 64-bit mode fused off but
   could still work in a Socket 775 mainboard designed for the later
   Core 2 Duo, with up to 8GB of RAM. Apparently most BIOSes at the
   time only allowed 64-bit CPUs.

 - In the early days of x86-64 hardware, there was sometimes the need
   to run a 32-bit kernel to work around bugs in the hardware drivers,
   or in the syscall emulation for 32-bit userspace. This likely still
   works but there should never be a need for this any more.

Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
PAE mode is still required to get access to the 'NX' bit on Atom,
'Pentium M' and 'Core Duo' CPUs.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 Documentation/admin-guide/kdump/kdump.rst     |  4 --
 Documentation/arch/x86/usb-legacy-support.rst | 11 +----
 arch/x86/Kconfig                              | 46 +++----------------
 arch/x86/configs/xen.config                   |  2 -
 arch/x86/include/asm/page_32_types.h          |  4 +-
 arch/x86/mm/init_32.c                         |  9 +---
 6 files changed, 11 insertions(+), 65 deletions(-)

diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 5376890adbeb..1f7f14c6e184 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -180,10 +180,6 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
 1) On i386, enable high memory support under "Processor type and
    features"::
 
-	CONFIG_HIGHMEM64G=y
-
-   or::
-
 	CONFIG_HIGHMEM4G
 
 2) With CONFIG_SMP=y, usually nr_cpus=1 need specified on the kernel
diff --git a/Documentation/arch/x86/usb-legacy-support.rst b/Documentation/arch/x86/usb-legacy-support.rst
index e01c08b7c981..b17bf122270a 100644
--- a/Documentation/arch/x86/usb-legacy-support.rst
+++ b/Documentation/arch/x86/usb-legacy-support.rst
@@ -20,11 +20,7 @@ It has several drawbacks, though:
    features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may
    not be available.
 
-2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause
-   system crashes, because the SMM BIOS is not expecting to be in PAE mode.
-   The Intel E7505 is a typical machine where this happens.
-
-3) If AMD64 64-bit mode is enabled, again system crashes often happen,
+2) If AMD64 64-bit mode is enabled, again system crashes often happen,
    because the SMM BIOS isn't expecting the CPU to be in 64-bit mode.  The
    BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit
    yet.
@@ -38,11 +34,6 @@ Problem 1)
   compiled-in, too.
 
 Problem 2)
-  can currently only be solved by either disabling HIGHMEM64G
-  in the kernel config or USB Legacy support in the BIOS. A BIOS update
-  could help, but so far no such update exists.
-
-Problem 3)
   is usually fixed by a BIOS update. Check the board
   manufacturers web site. If an update is not available, disable USB
   Legacy support in the BIOS. If this alone doesn't help, try also adding
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 42494739344d..b373db8a8176 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1383,15 +1383,11 @@ config X86_CPUID
 	  with major 203 and minors 0 to 31 for /dev/cpu/0/cpuid to
 	  /dev/cpu/31/cpuid.
 
-choice
-	prompt "High Memory Support"
-	default HIGHMEM4G
+config HIGHMEM4G
+	bool "High Memory Support"
 	depends on X86_32
-
-config NOHIGHMEM
-	bool "off"
 	help
-	  Linux can use up to 64 Gigabytes of physical memory on x86 systems.
+	  Linux can use up to 4 Gigabytes of physical memory on x86 systems.
 	  However, the address space of 32-bit x86 processors is only 4
 	  Gigabytes large. That means that, if you have a large amount of
 	  physical memory, not all of it can be "permanently mapped" by the
@@ -1407,38 +1403,9 @@ config NOHIGHMEM
 	  possible.
 
 	  If the machine has between 1 and 4 Gigabytes physical RAM, then
-	  answer "4GB" here.
+	  answer "Y" here.
 
-	  If more than 4 Gigabytes is used then answer "64GB" here. This
-	  selection turns Intel PAE (Physical Address Extension) mode on.
-	  PAE implements 3-level paging on IA32 processors. PAE is fully
-	  supported by Linux, PAE mode is implemented on all recent Intel
-	  processors (Pentium Pro and better). NOTE: If you say "64GB" here,
-	  then the kernel will not boot on CPUs that don't support PAE!
-
-	  The actual amount of total physical memory will either be
-	  auto detected or can be forced by using a kernel command line option
-	  such as "mem=256M". (Try "man bootparam" or see the documentation of
-	  your boot loader (lilo or loadlin) about how to pass options to the
-	  kernel at boot time.)
-
-	  If unsure, say "off".
-
-config HIGHMEM4G
-	bool "4GB"
-	help
-	  Select this if you have a 32-bit processor and between 1 and 4
-	  gigabytes of physical RAM.
-
-config HIGHMEM64G
-	bool "64GB"
-	depends on X86_HAVE_PAE
-	select X86_PAE
-	help
-	  Select this if you have a 32-bit processor and more than 4
-	  gigabytes of physical RAM.
-
-endchoice
+	  If unsure, say N.
 
 choice
 	prompt "Memory split" if EXPERT
@@ -1484,8 +1451,7 @@ config PAGE_OFFSET
 	depends on X86_32
 
 config HIGHMEM
-	def_bool y
-	depends on X86_32 && (HIGHMEM64G || HIGHMEM4G)
+	def_bool HIGHMEM4G
 
 config X86_PAE
 	bool "PAE (Physical Address Extension) Support"
diff --git a/arch/x86/configs/xen.config b/arch/x86/configs/xen.config
index 581296255b39..d5d091e03bd3 100644
--- a/arch/x86/configs/xen.config
+++ b/arch/x86/configs/xen.config
@@ -1,6 +1,4 @@
 # global x86 required specific stuff
-# On 32-bit HIGHMEM4G is not allowed
-CONFIG_HIGHMEM64G=y
 CONFIG_64BIT=y
 
 # These enable us to allow some of the
diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
index faf9cc1c14bb..25c32652f404 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -11,8 +11,8 @@
  * a virtual address space of one gigabyte, which limits the
  * amount of physical memory you can use to about 950MB.
  *
- * If you want more physical memory than this then see the CONFIG_HIGHMEM4G
- * and CONFIG_HIGHMEM64G options in the kernel configuration.
+ * If you want more physical memory than this then see the CONFIG_VMSPLIT_2G
+ * and CONFIG_HIGHMEM4G options in the kernel configuration.
  */
 #define __PAGE_OFFSET_BASE	_AC(CONFIG_PAGE_OFFSET, UL)
 #define __PAGE_OFFSET		__PAGE_OFFSET_BASE
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index ac41b1e0940d..f288aad8dc74 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -582,7 +582,7 @@ static void __init lowmem_pfn_init(void)
 	"only %luMB highmem pages available, ignoring highmem size of %luMB!\n"
 
 #define MSG_HIGHMEM_TRIMMED \
-	"Warning: only 4GB will be used. Use a HIGHMEM64G enabled kernel!\n"
+	"Warning: only 4GB will be used. Support for CONFIG_HIGHMEM64G was removed!\n"
 /*
  * We have more RAM than fits into lowmem - we try to put it into
  * highmem, also taking the highmem=x boot parameter into account:
@@ -606,18 +606,13 @@ static void __init highmem_pfn_init(void)
 #ifndef CONFIG_HIGHMEM
 	/* Maximum memory usable is what is directly addressable */
 	printk(KERN_WARNING "Warning only %ldMB will be used.\n", MAXMEM>>20);
-	if (max_pfn > MAX_NONPAE_PFN)
-		printk(KERN_WARNING "Use a HIGHMEM64G enabled kernel.\n");
-	else
-		printk(KERN_WARNING "Use a HIGHMEM enabled kernel.\n");
+	printk(KERN_WARNING "Use a HIGHMEM enabled kernel.\n");
 	max_pfn = MAXMEM_PFN;
 #else /* !CONFIG_HIGHMEM */
-#ifndef CONFIG_HIGHMEM64G
 	if (max_pfn > MAX_NONPAE_PFN) {
 		max_pfn = MAX_NONPAE_PFN;
 		printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
 	}
-#endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
 }
 
-- 
2.39.5



* [PATCH 06/11] x86: drop SWIOTLB and PHYS_ADDR_T_64BIT for PAE
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
                   ` (4 preceding siblings ...)
  2024-12-04 10:30 ` [PATCH 05/11] x86: remove HIGHMEM64G support Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 18:41   ` Andy Shevchenko
  2024-12-04 10:30 ` [PATCH 07/11] x86: drop support for CONFIG_HIGHPTE Arnd Bergmann
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

Since kernels with and without CONFIG_X86_PAE are now limited
to the low 4GB of physical address space, there is no need to
use either swiotlb or 64-bit phys_addr_t any more, so stop
selecting these and fix up the build warnings from that.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/x86/Kconfig      | 2 --
 arch/x86/mm/pgtable.c | 2 +-
 include/linux/mm.h    | 2 +-
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b373db8a8176..d0d055f6f56e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1456,8 +1456,6 @@ config HIGHMEM
 config X86_PAE
 	bool "PAE (Physical Address Extension) Support"
 	depends on X86_32 && X86_HAVE_PAE
-	select PHYS_ADDR_T_64BIT
-	select SWIOTLB
 	help
 	  PAE is required for NX support, and furthermore enables
 	  larger swapspace support for non-overcommit purposes. It
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241..bdf63524e30a 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -769,7 +769,7 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
 	mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
 	if (!uniform) {
 		pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
-			     __func__, addr, addr + PMD_SIZE);
+			     __func__, (u64)addr, (u64)addr + PMD_SIZE);
 		return 0;
 	}
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c39c4945946c..7725e9e46e90 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -99,7 +99,7 @@ extern int mmap_rnd_compat_bits __read_mostly;
 
 #ifndef DIRECT_MAP_PHYSMEM_END
 # ifdef MAX_PHYSMEM_BITS
-# define DIRECT_MAP_PHYSMEM_END	((1ULL << MAX_PHYSMEM_BITS) - 1)
+# define DIRECT_MAP_PHYSMEM_END	(phys_addr_t)((1ULL << MAX_PHYSMEM_BITS) - 1)
 # else
 # define DIRECT_MAP_PHYSMEM_END	(((phys_addr_t)-1)&~(1ULL<<63))
 # endif
-- 
2.39.5



* [PATCH 07/11] x86: drop support for CONFIG_HIGHPTE
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
                   ` (5 preceding siblings ...)
  2024-12-04 10:30 ` [PATCH 06/11] x86: drop SWIOTLB and PHYS_ADDR_T_64BIT for PAE Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 10:30 ` [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only Arnd Bergmann
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

With the maximum amount of RAM now being 4GB, there is very little
point in still having PTE pages in highmem. Drop this for
simplification.

The only other architecture supporting HIGHPTE is 32-bit arm, and
once that feature is removed there as well, the highpte logic can be
dropped from common code entirely.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 .../admin-guide/kernel-parameters.txt         |  7 -----
 arch/x86/Kconfig                              |  9 -------
 arch/x86/include/asm/pgalloc.h                |  5 ----
 arch/x86/mm/pgtable.c                         | 27 +------------------
 4 files changed, 1 insertion(+), 47 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index eca370e99844..cf25853a5c4a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7341,13 +7341,6 @@
 				16 - SIGBUS faults
 			Example: user_debug=31
 
-	userpte=
-			[X86,EARLY] Flags controlling user PTE allocations.
-
-				nohigh = do not allocate PTE pages in
-					HIGHMEM regardless of setting
-					of CONFIG_HIGHPTE.
-
 	vdso=		[X86,SH,SPARC]
 			On X86_32, this is an alias for vdso32=.  Otherwise:
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d0d055f6f56e..d8a8bf9ea9b9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1621,15 +1621,6 @@ config X86_PMEM_LEGACY
 
 	  Say Y if unsure.
 
-config HIGHPTE
-	bool "Allocate 3rd-level pagetables from highmem"
-	depends on HIGHMEM
-	help
-	  The VM uses one page table entry for each page of physical memory.
-	  For systems with a lot of RAM, this can be wasteful of precious
-	  low memory.  Setting this option will put user-space page table
-	  entries in high memory.
-
 config X86_CHECK_BIOS_CORRUPTION
 	bool "Check for low memory corruption"
 	help
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index dcd836b59beb..582cf5b7ec8c 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -29,11 +29,6 @@ static inline void paravirt_release_pud(unsigned long pfn) {}
 static inline void paravirt_release_p4d(unsigned long pfn) {}
 #endif
 
-/*
- * Flags to use when allocating a user page table page.
- */
-extern gfp_t __userpte_alloc_gfp;
-
 #ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
 /*
  * Instead of one PGD, we acquire two PGDs.  Being order-1, it is
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index bdf63524e30a..895d91e879b4 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -12,12 +12,6 @@ phys_addr_t physical_mask __ro_after_init = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
 EXPORT_SYMBOL(physical_mask);
 #endif
 
-#ifdef CONFIG_HIGHPTE
-#define PGTABLE_HIGHMEM __GFP_HIGHMEM
-#else
-#define PGTABLE_HIGHMEM 0
-#endif
-
 #ifndef CONFIG_PARAVIRT
 static inline
 void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
@@ -26,29 +20,10 @@ void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
 }
 #endif
 
-gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
-
 pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-	return __pte_alloc_one(mm, __userpte_alloc_gfp);
-}
-
-static int __init setup_userpte(char *arg)
-{
-	if (!arg)
-		return -EINVAL;
-
-	/*
-	 * "userpte=nohigh" disables allocation of user pagetables in
-	 * high memory.
-	 */
-	if (strcmp(arg, "nohigh") == 0)
-		__userpte_alloc_gfp &= ~__GFP_HIGHMEM;
-	else
-		return -EINVAL;
-	return 0;
+	return __pte_alloc_one(mm, GFP_PGTABLE_USER);
 }
-early_param("userpte", setup_userpte);
 
 void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
-- 
2.39.5



* [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
                   ` (6 preceding siblings ...)
  2024-12-04 10:30 ` [PATCH 07/11] x86: drop support for CONFIG_HIGHPTE Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 18:55   ` Andy Shevchenko
  2024-12-04 10:30 ` [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags Arnd Bergmann
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

The X86_INTEL_MID code was originally introduced for the 32-bit
Moorestown/Medfield/Clovertrail platforms; later the 64-bit
Merrifield/Moorefield variant was added, but the final
Morganfield/Broxton 14nm chips were canceled before they hit
the market.

To help users understand what the option actually refers to,
update the help text, and make it a hard dependency on 64-bit
kernels. While these devices could theoretically run a 32-bit
kernel, they originally shipped with a 64-bit one in 2015, so that
was probably never tested.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/x86/Kconfig         | 16 ++++++++++------
 arch/x86/kernel/head32.c |  3 ---
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d8a8bf9ea9b9..fa6dd9ec4bdf 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -544,12 +544,12 @@ config X86_EXTENDED_PLATFORM
 		RDC R-321x SoC
 		SGI 320/540 (Visual Workstation)
 		STA2X11-based (e.g. Northville)
-		Moorestown MID devices
 
 	  64-bit platforms (CONFIG_64BIT=y):
 		Numascale NumaChip
 		ScaleMP vSMP
 		SGI Ultraviolet
+		Merrifield/Moorefield MID devices
 
 	  If you have one of these systems, or if you want to build a
 	  generic distribution kernel, say Y here - otherwise say N.
@@ -621,11 +621,11 @@ config X86_INTEL_CE
 	  boxes and media devices.
 
 config X86_INTEL_MID
-	bool "Intel MID platform support"
+	bool "Intel Z34xx/Z35xx MID platform support"
 	depends on X86_EXTENDED_PLATFORM
 	depends on X86_PLATFORM_DEVICES
 	depends on PCI
-	depends on X86_64 || (PCI_GOANY && X86_32)
+	depends on X86_64
 	depends on X86_IO_APIC
 	select I2C
 	select DW_APB_TIMER
@@ -633,10 +633,14 @@ config X86_INTEL_MID
 	help
 	  Select to build a kernel capable of supporting Intel MID (Mobile
 	  Internet Device) platform systems which do not have the PCI legacy
-	  interfaces. If you are building for a PC class system say N here.
+	  interfaces.
+
+	  The only supported devices are the 22nm Merrifield (Z34xx) and
+	  Moorefield (Z35xx) SoCs used in Android devices such as the
+	  Asus Zenfone 2, Asus FonePad 8 and Dell Venue 7.
 
-	  Intel MID platforms are based on an Intel processor and chipset which
-	  consume less power than most of the x86 derivatives.
+	  If you are building for a PC class system or non-MID tablet
+	  SoCs like Bay Trail (Z36xx/Z37xx), say N here.
 
 config X86_INTEL_QUARK
 	bool "Intel Quark platform support"
diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index de001b2146ab..4f69239556e4 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -65,9 +65,6 @@ asmlinkage __visible void __init __noreturn i386_start_kernel(void)
 
 	/* Call the subarch specific early setup function */
 	switch (boot_params.hdr.hardware_subarch) {
-	case X86_SUBARCH_INTEL_MID:
-		x86_intel_mid_early_setup();
-		break;
 	case X86_SUBARCH_CE4100:
 		x86_ce4100_early_setup();
 		break;
-- 
2.39.5



* [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
                   ` (7 preceding siblings ...)
  2024-12-04 10:30 ` [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 15:36   ` Tor Vic
                     ` (3 more replies)
  2024-12-04 10:30 ` [PATCH 10/11] x86: remove old STA2x11 support Arnd Bergmann
  2024-12-04 10:30 ` [PATCH 11/11] x86: drop 32-bit KVM host support Arnd Bergmann
  10 siblings, 4 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

Building an x86-64 kernel with CONFIG_GENERIC_CPU is documented to
run on all CPUs, but the Makefile does not actually pass an -march=
argument, instead relying on the default that was used to configure
the toolchain.

In many cases, gcc will be configured to -march=x86-64 or -march=k8
for maximum compatibility, but in other cases a distribution default
may be either raised to a more recent ISA, or set to -march=native
to build for the CPU used for compilation. This still works in the
case of building a custom kernel for the local machine.

The point where it breaks down is building a kernel for another
machine that is older than the default target. Changing the default
to -march=x86-64 would make it work reliably, but possibly produce
worse code on distros that intentionally default to a newer ISA.

To allow reliably building a kernel for either the oldest x86-64
CPUs or a more recent level, add three separate options for
v1, v2 and v3 of the architecture as defined by gcc and clang
and make them all turn on CONFIG_GENERIC_CPU. Based on this it
should be possible to change runtime feature detection into
build-time detection for things like cmpxchg16b, or possibly
gate features that are only available on older architectures.

Link: https://lists.llvm.org/pipermail/llvm-dev/2020-July/143289.html
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/x86/Kconfig.cpu | 39 ++++++++++++++++++++++++++++++++++-----
 arch/x86/Makefile    |  6 ++++++
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 139db904e564..1461a739237b 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -260,7 +260,7 @@ endchoice
 choice
 	prompt "x86-64 Processor family"
 	depends on X86_64
-	default GENERIC_CPU
+	default X86_64_V2
 	help
 	  This is the processor type of your CPU. This information is
 	  used for optimizing purposes. In order to compile a kernel
@@ -314,15 +314,44 @@ config MSILVERMONT
 	  early Atom CPUs based on the Bonnell microarchitecture,
 	  such as Atom 230/330, D4xx/D5xx, D2xxx, N2xxx or Z2xxx.
 
-config GENERIC_CPU
-	bool "Generic-x86-64"
+config X86_64_V1
+	bool "Generic x86-64"
 	depends on X86_64
 	help
-	  Generic x86-64 CPU.
-	  Run equally well on all x86-64 CPUs.
+	  Generic x86-64-v1 CPU.
+	  Run equally well on all x86-64 CPUs, including early Pentium-4
+	  variants lacking the sahf and cmpxchg16b instructions as well
+	  as the AMD K8 and Intel Core 2 lacking popcnt.
+
+config X86_64_V2
+	bool "Generic x86-64 v2"
+	depends on X86_64
+	help
+	  Generic x86-64-v2 CPU.
+	  Run equally well on all x86-64 CPUs that meet the x86-64-v2
+	  definition as well as those that only miss the optional
+	  SSE3/SSSE3/SSE4.1 portions.
+	  Examples of this include Intel Nehalem and Silvermont,
+	  AMD Bulldozer (K10) and Jaguar as well as VIA Nano that
+	  include popcnt, cmpxchg16b and sahf.
+
+config X86_64_V3
+	bool "Generic x86-64 v3"
+	depends on X86_64
+	help
+	  Generic x86-64-v3 CPU.
+	  Run equally well on all x86-64 CPUs that meet the x86-64-v3
+	  definition as well as those that only miss the optional
+	  AVX/AVX2 portions.
+	  Examples of this include the Intel Haswell and AMD Excavator
+	  microarchitectures that include the bmi1/bmi2, lzcnt, movbe
+	  and xsave instruction set extensions.
 
 endchoice
 
+config GENERIC_CPU
+	def_bool X86_64_V1 || X86_64_V2 || X86_64_V3
+
 config X86_GENERIC
 	bool "Generic x86 support"
 	depends on X86_32
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 05887ae282f5..1fdc3fc6a54e 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -183,6 +183,9 @@ else
         cflags-$(CONFIG_MPSC)		+= -march=nocona
         cflags-$(CONFIG_MCORE2)		+= -march=core2
         cflags-$(CONFIG_MSILVERMONT)	+= -march=silvermont
+        cflags-$(CONFIG_MX86_64_V1)	+= -march=x86-64
+        cflags-$(CONFIG_MX86_64_V2)	+= $(call cc-option,-march=x86-64-v2,-march=x86-64)
+        cflags-$(CONFIG_MX86_64_V3)	+= $(call cc-option,-march=x86-64-v3,-march=x86-64)
         cflags-$(CONFIG_GENERIC_CPU)	+= -mtune=generic
         KBUILD_CFLAGS += $(cflags-y)
 
@@ -190,6 +193,9 @@ else
         rustflags-$(CONFIG_MPSC)	+= -Ctarget-cpu=nocona
         rustflags-$(CONFIG_MCORE2)	+= -Ctarget-cpu=core2
         rustflags-$(CONFIG_MSILVERMONT)	+= -Ctarget-cpu=silvermont
+        rustflags-$(CONFIG_MX86_64_V1)	+= -Ctarget-cpu=x86-64
+        rustflags-$(CONFIG_MX86_64_V2)	+= -Ctarget-cpu=x86-64-v2
+        rustflags-$(CONFIG_MX86_64_V3)	+= -Ctarget-cpu=x86-64-v3
         rustflags-$(CONFIG_GENERIC_CPU)	+= -Ztune-cpu=generic
         KBUILD_RUSTFLAGS += $(rustflags-y)
 
-- 
2.39.5



* [PATCH 10/11] x86: remove old STA2x11 support
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
                   ` (8 preceding siblings ...)
  2024-12-04 10:30 ` [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-05  7:35   ` Davide Ciminaghi
  2024-12-04 10:30 ` [PATCH 11/11] x86: drop 32-bit KVM host support Arnd Bergmann
  10 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

ST ConneXt STA2x11 was an interface chip for Atom E6xx processors,
using a number of components usually found on Arm SoCs. Most of this
was merged upstream, but it was never complete enough to actually work
and has been abandoned for many years.

We already had an agreement on removing it in 2022, but nobody ever
submitted the patch to do it.

Without STA2x11, the CONFIG_X86_32_NON_STANDARD option no longer has
any use.

Link: https://lore.kernel.org/lkml/Yw3DKCuDoPkCaqxE@arcana.i.gnudd.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/x86/Kconfig               |  30 +----
 arch/x86/include/asm/sta2x11.h |  13 --
 arch/x86/pci/Makefile          |   2 -
 arch/x86/pci/sta2x11-fixup.c   | 233 ---------------------------------
 4 files changed, 3 insertions(+), 275 deletions(-)
 delete mode 100644 arch/x86/include/asm/sta2x11.h
 delete mode 100644 arch/x86/pci/sta2x11-fixup.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fa6dd9ec4bdf..6b5144f10ad7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -543,7 +543,6 @@ config X86_EXTENDED_PLATFORM
 		AMD Elan
 		RDC R-321x SoC
 		SGI 320/540 (Visual Workstation)
-		STA2X11-based (e.g. Northville)
 
 	  64-bit platforms (CONFIG_64BIT=y):
 		Numascale NumaChip
@@ -723,16 +722,6 @@ config X86_RDC321X
 	  as R-8610-(G).
 	  If you don't have one of these chips, you should say N here.
 
-config X86_32_NON_STANDARD
-	bool "Support non-standard 32-bit SMP architectures"
-	depends on X86_32 && SMP
-	depends on X86_EXTENDED_PLATFORM
-	help
-	  This option compiles in the STA2X11 default
-	  subarchitecture.  It is intended for a generic binary
-	  kernel. If you select them all, kernel will probe it one by
-	  one and will fallback to default.
-
 # Alphabetically sorted list of Non standard 32 bit platforms
 
 config X86_SUPPORTS_MEMORY_FAILURE
@@ -744,19 +733,6 @@ config X86_SUPPORTS_MEMORY_FAILURE
 	depends on X86_64 || !SPARSEMEM
 	select ARCH_SUPPORTS_MEMORY_FAILURE
 
-config STA2X11
-	bool "STA2X11 Companion Chip Support"
-	depends on X86_32_NON_STANDARD && PCI
-	select SWIOTLB
-	select MFD_STA2X11
-	select GPIOLIB
-	help
-	  This adds support for boards based on the STA2X11 IO-Hub,
-	  a.k.a. "ConneXt". The chip is used in place of the standard
-	  PC chipset, so all "standard" peripherals are missing. If this
-	  option is selected the kernel will still be able to boot on
-	  standard PC machines.
-
 config X86_32_IRIS
 	tristate "Eurobraille/Iris poweroff module"
 	depends on X86_32
@@ -1094,7 +1070,7 @@ config UP_LATE_INIT
 config X86_UP_APIC
 	bool "Local APIC support on uniprocessors" if !PCI_MSI
 	default PCI_MSI
-	depends on X86_32 && !SMP && !X86_32_NON_STANDARD
+	depends on X86_32 && !SMP
 	help
 	  A local APIC (Advanced Programmable Interrupt Controller) is an
 	  integrated interrupt controller in the CPU. If you have a single-CPU
@@ -1119,7 +1095,7 @@ config X86_UP_IOAPIC
 
 config X86_LOCAL_APIC
 	def_bool y
-	depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
+	depends on X86_64 || SMP || X86_UP_APIC || PCI_MSI
 	select IRQ_DOMAIN_HIERARCHY
 
 config ACPI_MADT_WAKEUP
@@ -1579,7 +1555,7 @@ config ARCH_FLATMEM_ENABLE
 
 config ARCH_SPARSEMEM_ENABLE
 	def_bool y
-	depends on X86_64 || NUMA || X86_32 || X86_32_NON_STANDARD
+	depends on X86_64 || NUMA || X86_32
 	select SPARSEMEM_STATIC if X86_32
 	select SPARSEMEM_VMEMMAP_ENABLE if X86_64
 
diff --git a/arch/x86/include/asm/sta2x11.h b/arch/x86/include/asm/sta2x11.h
deleted file mode 100644
index e0975e9c4f47..000000000000
--- a/arch/x86/include/asm/sta2x11.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * Header file for STMicroelectronics ConneXt (STA2X11) IOHub
- */
-#ifndef __ASM_STA2X11_H
-#define __ASM_STA2X11_H
-
-#include <linux/pci.h>
-
-/* This needs to be called from the MFD to configure its sub-devices */
-struct sta2x11_instance *sta2x11_get_instance(struct pci_dev *pdev);
-
-#endif /* __ASM_STA2X11_H */
diff --git a/arch/x86/pci/Makefile b/arch/x86/pci/Makefile
index 48bcada5cabe..4933fb337983 100644
--- a/arch/x86/pci/Makefile
+++ b/arch/x86/pci/Makefile
@@ -12,8 +12,6 @@ obj-$(CONFIG_X86_INTEL_CE)      += ce4100.o
 obj-$(CONFIG_ACPI)		+= acpi.o
 obj-y				+= legacy.o irq.o
 
-obj-$(CONFIG_STA2X11)           += sta2x11-fixup.o
-
 obj-$(CONFIG_X86_NUMACHIP)	+= numachip.o
 
 obj-$(CONFIG_X86_INTEL_MID)	+= intel_mid_pci.o
diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
deleted file mode 100644
index 8c8ddc4dcc08..000000000000
--- a/arch/x86/pci/sta2x11-fixup.c
+++ /dev/null
@@ -1,233 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * DMA translation between STA2x11 AMBA memory mapping and the x86 memory mapping
- *
- * ST Microelectronics ConneXt (STA2X11/STA2X10)
- *
- * Copyright (c) 2010-2011 Wind River Systems, Inc.
- */
-
-#include <linux/pci.h>
-#include <linux/pci_ids.h>
-#include <linux/export.h>
-#include <linux/list.h>
-#include <linux/dma-map-ops.h>
-#include <linux/swiotlb.h>
-#include <asm/iommu.h>
-#include <asm/sta2x11.h>
-
-#define STA2X11_SWIOTLB_SIZE (4*1024*1024)
-
-/*
- * We build a list of bus numbers that are under the ConneXt. The
- * main bridge hosts 4 busses, which are the 4 endpoints, in order.
- */
-#define STA2X11_NR_EP		4	/* 0..3 included */
-#define STA2X11_NR_FUNCS	8	/* 0..7 included */
-#define STA2X11_AMBA_SIZE	(512 << 20)
-
-struct sta2x11_ahb_regs { /* saved during suspend */
-	u32 base, pexlbase, pexhbase, crw;
-};
-
-struct sta2x11_mapping {
-	int is_suspended;
-	struct sta2x11_ahb_regs regs[STA2X11_NR_FUNCS];
-};
-
-struct sta2x11_instance {
-	struct list_head list;
-	int bus0;
-	struct sta2x11_mapping map[STA2X11_NR_EP];
-};
-
-static LIST_HEAD(sta2x11_instance_list);
-
-/* At probe time, record new instances of this bridge (likely one only) */
-static void sta2x11_new_instance(struct pci_dev *pdev)
-{
-	struct sta2x11_instance *instance;
-
-	instance = kzalloc(sizeof(*instance), GFP_ATOMIC);
-	if (!instance)
-		return;
-	/* This has a subordinate bridge, with 4 more-subordinate ones */
-	instance->bus0 = pdev->subordinate->number + 1;
-
-	if (list_empty(&sta2x11_instance_list)) {
-		int size = STA2X11_SWIOTLB_SIZE;
-		/* First instance: register your own swiotlb area */
-		dev_info(&pdev->dev, "Using SWIOTLB (size %i)\n", size);
-		if (swiotlb_init_late(size, GFP_DMA, NULL))
-			dev_emerg(&pdev->dev, "init swiotlb failed\n");
-	}
-	list_add(&instance->list, &sta2x11_instance_list);
-}
-DECLARE_PCI_FIXUP_ENABLE(PCI_VENDOR_ID_STMICRO, 0xcc17, sta2x11_new_instance);
-
-/*
- * Utility functions used in this file from below
- */
-static struct sta2x11_instance *sta2x11_pdev_to_instance(struct pci_dev *pdev)
-{
-	struct sta2x11_instance *instance;
-	int ep;
-
-	list_for_each_entry(instance, &sta2x11_instance_list, list) {
-		ep = pdev->bus->number - instance->bus0;
-		if (ep >= 0 && ep < STA2X11_NR_EP)
-			return instance;
-	}
-	return NULL;
-}
-
-static int sta2x11_pdev_to_ep(struct pci_dev *pdev)
-{
-	struct sta2x11_instance *instance;
-
-	instance = sta2x11_pdev_to_instance(pdev);
-	if (!instance)
-		return -1;
-
-	return pdev->bus->number - instance->bus0;
-}
-
-/* This is exported, as some devices need to access the MFD registers */
-struct sta2x11_instance *sta2x11_get_instance(struct pci_dev *pdev)
-{
-	return sta2x11_pdev_to_instance(pdev);
-}
-EXPORT_SYMBOL(sta2x11_get_instance);
-
-/* At setup time, we use our own ops if the device is a ConneXt one */
-static void sta2x11_setup_pdev(struct pci_dev *pdev)
-{
-	struct sta2x11_instance *instance = sta2x11_pdev_to_instance(pdev);
-
-	if (!instance) /* either a sta2x11 bridge or another ST device */
-		return;
-
-	/* We must enable all devices as master, for audio DMA to work */
-	pci_set_master(pdev);
-}
-DECLARE_PCI_FIXUP_ENABLE(PCI_VENDOR_ID_STMICRO, PCI_ANY_ID, sta2x11_setup_pdev);
-
-/*
- * At boot we must set up the mappings for the pcie-to-amba bridge.
- * It involves device access, and the same happens at suspend/resume time
- */
-
-#define AHB_MAPB		0xCA4
-#define AHB_CRW(i)		(AHB_MAPB + 0  + (i) * 0x10)
-#define AHB_CRW_SZMASK			0xfffffc00UL
-#define AHB_CRW_ENABLE			(1 << 0)
-#define AHB_CRW_WTYPE_MEM		(2 << 1)
-#define AHB_CRW_ROE			(1UL << 3)	/* Relax Order Ena */
-#define AHB_CRW_NSE			(1UL << 4)	/* No Snoop Enable */
-#define AHB_BASE(i)		(AHB_MAPB + 4  + (i) * 0x10)
-#define AHB_PEXLBASE(i)		(AHB_MAPB + 8  + (i) * 0x10)
-#define AHB_PEXHBASE(i)		(AHB_MAPB + 12 + (i) * 0x10)
-
-/* At probe time, enable mapping for each endpoint, using the pdev */
-static void sta2x11_map_ep(struct pci_dev *pdev)
-{
-	struct sta2x11_instance *instance = sta2x11_pdev_to_instance(pdev);
-	struct device *dev = &pdev->dev;
-	u32 amba_base, max_amba_addr;
-	int i, ret;
-
-	if (!instance)
-		return;
-
-	pci_read_config_dword(pdev, AHB_BASE(0), &amba_base);
-	max_amba_addr = amba_base + STA2X11_AMBA_SIZE - 1;
-
-	ret = dma_direct_set_offset(dev, 0, amba_base, STA2X11_AMBA_SIZE);
-	if (ret)
-		dev_err(dev, "sta2x11: could not set DMA offset\n");
-
-	dev->bus_dma_limit = max_amba_addr;
-	dma_set_mask_and_coherent(&pdev->dev, max_amba_addr);
-
-	/* Configure AHB mapping */
-	pci_write_config_dword(pdev, AHB_PEXLBASE(0), 0);
-	pci_write_config_dword(pdev, AHB_PEXHBASE(0), 0);
-	pci_write_config_dword(pdev, AHB_CRW(0), STA2X11_AMBA_SIZE |
-			       AHB_CRW_WTYPE_MEM | AHB_CRW_ENABLE);
-
-	/* Disable all the other windows */
-	for (i = 1; i < STA2X11_NR_FUNCS; i++)
-		pci_write_config_dword(pdev, AHB_CRW(i), 0);
-
-	dev_info(&pdev->dev,
-		 "sta2x11: Map EP %i: AMBA address %#8x-%#8x\n",
-		 sta2x11_pdev_to_ep(pdev), amba_base, max_amba_addr);
-}
-DECLARE_PCI_FIXUP_ENABLE(PCI_VENDOR_ID_STMICRO, PCI_ANY_ID, sta2x11_map_ep);
-
-#ifdef CONFIG_PM /* Some register values must be saved and restored */
-
-static struct sta2x11_mapping *sta2x11_pdev_to_mapping(struct pci_dev *pdev)
-{
-	struct sta2x11_instance *instance;
-	int ep;
-
-	instance = sta2x11_pdev_to_instance(pdev);
-	if (!instance)
-		return NULL;
-	ep = sta2x11_pdev_to_ep(pdev);
-	return instance->map + ep;
-}
-
-static void suspend_mapping(struct pci_dev *pdev)
-{
-	struct sta2x11_mapping *map = sta2x11_pdev_to_mapping(pdev);
-	int i;
-
-	if (!map)
-		return;
-
-	if (map->is_suspended)
-		return;
-	map->is_suspended = 1;
-
-	/* Save all window configs */
-	for (i = 0; i < STA2X11_NR_FUNCS; i++) {
-		struct sta2x11_ahb_regs *regs = map->regs + i;
-
-		pci_read_config_dword(pdev, AHB_BASE(i), &regs->base);
-		pci_read_config_dword(pdev, AHB_PEXLBASE(i), &regs->pexlbase);
-		pci_read_config_dword(pdev, AHB_PEXHBASE(i), &regs->pexhbase);
-		pci_read_config_dword(pdev, AHB_CRW(i), &regs->crw);
-	}
-}
-DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_STMICRO, PCI_ANY_ID, suspend_mapping);
-
-static void resume_mapping(struct pci_dev *pdev)
-{
-	struct sta2x11_mapping *map = sta2x11_pdev_to_mapping(pdev);
-	int i;
-
-	if (!map)
-		return;
-
-
-	if (!map->is_suspended)
-		goto out;
-	map->is_suspended = 0;
-
-	/* Restore all window configs */
-	for (i = 0; i < STA2X11_NR_FUNCS; i++) {
-		struct sta2x11_ahb_regs *regs = map->regs + i;
-
-		pci_write_config_dword(pdev, AHB_BASE(i), regs->base);
-		pci_write_config_dword(pdev, AHB_PEXLBASE(i), regs->pexlbase);
-		pci_write_config_dword(pdev, AHB_PEXHBASE(i), regs->pexhbase);
-		pci_write_config_dword(pdev, AHB_CRW(i), regs->crw);
-	}
-out:
-	pci_set_master(pdev); /* Like at boot, enable master on all devices */
-}
-DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_STMICRO, PCI_ANY_ID, resume_mapping);
-
-#endif /* CONFIG_PM */
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 11/11] x86: drop 32-bit KVM host support
  2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
                   ` (9 preceding siblings ...)
  2024-12-04 10:30 ` [PATCH 10/11] x86: remove old STA2x11 support Arnd Bergmann
@ 2024-12-04 10:30 ` Arnd Bergmann
  2024-12-04 15:30   ` Sean Christopherson
  10 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 10:30 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

From: Arnd Bergmann <arnd@arndb.de>

There are very few 32-bit machines that support KVM; the main exceptions
are the "Yonah" generation Xeon-LV and Core Duo from 2006 and the Atom
Z5xx "Silverthorne" from 2008, all of which were released just before
their 64-bit counterparts.

Running a 32-bit kernel as a KVM host on a 64-bit CPU generally works
fine, but is rather pointless since 64-bit kernels are much better
supported and deal better with the memory requirements of VM guests.

Drop all the 32-bit-only portions and the "#ifdef CONFIG_X86_64" checks
of the x86 KVM code and add a Kconfig dependency to only allow building
this on 64-bit kernels.

Support for 32-bit guests is of course untouched by this.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 arch/x86/kvm/Kconfig            |   6 +-
 arch/x86/kvm/Makefile           |   4 +-
 arch/x86/kvm/cpuid.c            |   9 +--
 arch/x86/kvm/emulate.c          |  34 ++------
 arch/x86/kvm/fpu.h              |   4 -
 arch/x86/kvm/hyperv.c           |   5 +-
 arch/x86/kvm/i8254.c            |   4 -
 arch/x86/kvm/kvm_cache_regs.h   |   2 -
 arch/x86/kvm/kvm_emulate.h      |   8 --
 arch/x86/kvm/lapic.c            |   4 -
 arch/x86/kvm/mmu.h              |   4 -
 arch/x86/kvm/mmu/mmu.c          | 134 --------------------------------
 arch/x86/kvm/mmu/mmu_internal.h |   9 ---
 arch/x86/kvm/mmu/paging_tmpl.h  |   9 ---
 arch/x86/kvm/mmu/spte.h         |   5 --
 arch/x86/kvm/mmu/tdp_mmu.h      |   4 -
 arch/x86/kvm/smm.c              |  19 -----
 arch/x86/kvm/svm/sev.c          |   2 -
 arch/x86/kvm/svm/svm.c          |  23 +-----
 arch/x86/kvm/svm/vmenter.S      |  20 -----
 arch/x86/kvm/trace.h            |   4 -
 arch/x86/kvm/vmx/main.c         |   2 -
 arch/x86/kvm/vmx/nested.c       |  24 +-----
 arch/x86/kvm/vmx/vmcs.h         |   2 -
 arch/x86/kvm/vmx/vmenter.S      |  25 +-----
 arch/x86/kvm/vmx/vmx.c          | 117 +---------------------------
 arch/x86/kvm/vmx/vmx.h          |  23 +-----
 arch/x86/kvm/vmx/vmx_ops.h      |   7 --
 arch/x86/kvm/vmx/x86_ops.h      |   2 -
 arch/x86/kvm/x86.c              |  74 ++----------------
 arch/x86/kvm/x86.h              |   4 -
 arch/x86/kvm/xen.c              |  61 ++++++---------
 32 files changed, 54 insertions(+), 600 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ea2c4f21c1ca..7bdc7639aa8d 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -7,6 +7,7 @@ source "virt/kvm/Kconfig"
 
 menuconfig VIRTUALIZATION
 	bool "Virtualization"
+	depends on X86_64
 	default y
 	help
 	  Say Y here to get to see options for using your Linux host to run other
@@ -50,7 +51,6 @@ config KVM_X86
 
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
-	depends on X86_LOCAL_APIC
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
@@ -82,7 +82,7 @@ config KVM_WERROR
 config KVM_SW_PROTECTED_VM
 	bool "Enable support for KVM software-protected VMs"
 	depends on EXPERT
-	depends on KVM && X86_64
+	depends on KVM
 	help
 	  Enable support for KVM software-protected VMs.  Currently, software-
 	  protected VMs are purely a development and testing vehicle for
@@ -141,7 +141,7 @@ config KVM_AMD
 config KVM_AMD_SEV
 	bool "AMD Secure Encrypted Virtualization (SEV) support"
 	default y
-	depends on KVM_AMD && X86_64
+	depends on KVM_AMD
 	depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
 	select ARCH_HAS_CC_PLATFORM
 	select KVM_GENERIC_PRIVATE_MEM
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f9dddb8cb466..46654dc0428f 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -8,9 +8,7 @@ include $(srctree)/virt/kvm/Makefile.kvm
 kvm-y			+= x86.o emulate.o i8259.o irq.o lapic.o \
 			   i8254.o ioapic.o irq_comm.o cpuid.o pmu.o mtrr.o \
 			   debugfs.o mmu/mmu.o mmu/page_track.o \
-			   mmu/spte.o
-
-kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
+			   mmu/spte.o mmu/tdp_iter.o mmu/tdp_mmu.o
 kvm-$(CONFIG_KVM_HYPERV) += hyperv.o
 kvm-$(CONFIG_KVM_XEN)	+= xen.o
 kvm-$(CONFIG_KVM_SMM)	+= smm.o
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 097bdc022d0f..d34b6e276ba1 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -606,15 +606,10 @@ static __always_inline void kvm_cpu_cap_mask(enum cpuid_leafs leaf, u32 mask)
 
 void kvm_set_cpu_caps(void)
 {
-#ifdef CONFIG_X86_64
 	unsigned int f_gbpages = F(GBPAGES);
 	unsigned int f_lm = F(LM);
 	unsigned int f_xfd = F(XFD);
-#else
-	unsigned int f_gbpages = 0;
-	unsigned int f_lm = 0;
-	unsigned int f_xfd = 0;
-#endif
+
 	memset(kvm_cpu_caps, 0, sizeof(kvm_cpu_caps));
 
 	BUILD_BUG_ON(sizeof(kvm_cpu_caps) - (NKVMCAPINTS * sizeof(*kvm_cpu_caps)) >
@@ -746,7 +741,7 @@ void kvm_set_cpu_caps(void)
 		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW)
 	);
 
-	if (!tdp_enabled && IS_ENABLED(CONFIG_X86_64))
+	if (!tdp_enabled)
 		kvm_cpu_cap_set(X86_FEATURE_GBPAGES);
 
 	kvm_cpu_cap_init_kvm_defined(CPUID_8000_0007_EDX,
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 60986f67c35a..ebac76a10fbd 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -265,12 +265,6 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
 #define EFLAGS_MASK (X86_EFLAGS_OF|X86_EFLAGS_SF|X86_EFLAGS_ZF|X86_EFLAGS_AF|\
 		     X86_EFLAGS_PF|X86_EFLAGS_CF)
 
-#ifdef CONFIG_X86_64
-#define ON64(x) x
-#else
-#define ON64(x)
-#endif
-
 /*
  * fastop functions have a special calling convention:
  *
@@ -341,7 +335,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, fastop_t fop);
 	FOP1E(op##b, al) \
 	FOP1E(op##w, ax) \
 	FOP1E(op##l, eax) \
-	ON64(FOP1E(op##q, rax))	\
+	FOP1E(op##q, rax) \
 	FOP_END
 
 /* 1-operand, using src2 (for MUL/DIV r/m) */
@@ -350,7 +344,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, fastop_t fop);
 	FOP1E(op, cl) \
 	FOP1E(op, cx) \
 	FOP1E(op, ecx) \
-	ON64(FOP1E(op, rcx)) \
+	FOP1E(op, rcx) \
 	FOP_END
 
 /* 1-operand, using src2 (for MUL/DIV r/m), with exceptions */
@@ -359,7 +353,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, fastop_t fop);
 	FOP1EEX(op, cl) \
 	FOP1EEX(op, cx) \
 	FOP1EEX(op, ecx) \
-	ON64(FOP1EEX(op, rcx)) \
+	FOP1EEX(op, rcx) \
 	FOP_END
 
 #define FOP2E(op,  dst, src)	   \
@@ -372,7 +366,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, fastop_t fop);
 	FOP2E(op##b, al, dl) \
 	FOP2E(op##w, ax, dx) \
 	FOP2E(op##l, eax, edx) \
-	ON64(FOP2E(op##q, rax, rdx)) \
+	FOP2E(op##q, rax, rdx) \
 	FOP_END
 
 /* 2 operand, word only */
@@ -381,7 +375,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, fastop_t fop);
 	FOPNOP() \
 	FOP2E(op##w, ax, dx) \
 	FOP2E(op##l, eax, edx) \
-	ON64(FOP2E(op##q, rax, rdx)) \
+	FOP2E(op##q, rax, rdx) \
 	FOP_END
 
 /* 2 operand, src is CL */
@@ -390,7 +384,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, fastop_t fop);
 	FOP2E(op##b, al, cl) \
 	FOP2E(op##w, ax, cl) \
 	FOP2E(op##l, eax, cl) \
-	ON64(FOP2E(op##q, rax, cl)) \
+	FOP2E(op##q, rax, cl) \
 	FOP_END
 
 /* 2 operand, src and dest are reversed */
@@ -399,7 +393,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, fastop_t fop);
 	FOP2E(op##b, dl, al) \
 	FOP2E(op##w, dx, ax) \
 	FOP2E(op##l, edx, eax) \
-	ON64(FOP2E(op##q, rdx, rax)) \
+	FOP2E(op##q, rdx, rax) \
 	FOP_END
 
 #define FOP3E(op,  dst, src, src2) \
@@ -413,7 +407,7 @@ static int fastop(struct x86_emulate_ctxt *ctxt, fastop_t fop);
 	FOPNOP() \
 	FOP3E(op##w, ax, dx, cl) \
 	FOP3E(op##l, eax, edx, cl) \
-	ON64(FOP3E(op##q, rax, rdx, cl)) \
+	FOP3E(op##q, rax, rdx, cl) \
 	FOP_END
 
 /* Special case for SETcc - 1 instruction per cc */
@@ -1508,7 +1502,6 @@ static int get_descriptor_ptr(struct x86_emulate_ctxt *ctxt,
 
 	addr = dt.address + index * 8;
 
-#ifdef CONFIG_X86_64
 	if (addr >> 32 != 0) {
 		u64 efer = 0;
 
@@ -1516,7 +1509,6 @@ static int get_descriptor_ptr(struct x86_emulate_ctxt *ctxt,
 		if (!(efer & EFER_LMA))
 			addr &= (u32)-1;
 	}
-#endif
 
 	*desc_addr_p = addr;
 	return X86EMUL_CONTINUE;
@@ -2399,7 +2391,6 @@ static int em_syscall(struct x86_emulate_ctxt *ctxt)
 
 	*reg_write(ctxt, VCPU_REGS_RCX) = ctxt->_eip;
 	if (efer & EFER_LMA) {
-#ifdef CONFIG_X86_64
 		*reg_write(ctxt, VCPU_REGS_R11) = ctxt->eflags;
 
 		ops->get_msr(ctxt,
@@ -2410,7 +2401,6 @@ static int em_syscall(struct x86_emulate_ctxt *ctxt)
 		ops->get_msr(ctxt, MSR_SYSCALL_MASK, &msr_data);
 		ctxt->eflags &= ~msr_data;
 		ctxt->eflags |= X86_EFLAGS_FIXED;
-#endif
 	} else {
 		/* legacy mode */
 		ops->get_msr(ctxt, MSR_STAR, &msr_data);
@@ -2575,9 +2565,7 @@ static bool emulator_io_port_access_allowed(struct x86_emulate_ctxt *ctxt,
 	if (desc_limit_scaled(&tr_seg) < 103)
 		return false;
 	base = get_desc_base(&tr_seg);
-#ifdef CONFIG_X86_64
 	base |= ((u64)base3) << 32;
-#endif
 	r = ops->read_std(ctxt, base + 102, &io_bitmap_ptr, 2, NULL, true);
 	if (r != X86EMUL_CONTINUE)
 		return false;
@@ -2612,7 +2600,6 @@ static void string_registers_quirk(struct x86_emulate_ctxt *ctxt)
 	 * Intel CPUs mask the counter and pointers in quite strange
 	 * manner when ECX is zero due to REP-string optimizations.
 	 */
-#ifdef CONFIG_X86_64
 	u32 eax, ebx, ecx, edx;
 
 	if (ctxt->ad_bytes != 4)
@@ -2634,7 +2621,6 @@ static void string_registers_quirk(struct x86_emulate_ctxt *ctxt)
 	case 0xab:	/* stosd/w */
 		*reg_rmw(ctxt, VCPU_REGS_RDI) &= (u32)-1;
 	}
-#endif
 }
 
 static void save_state_to_tss16(struct x86_emulate_ctxt *ctxt,
@@ -3641,11 +3627,9 @@ static int em_lahf(struct x86_emulate_ctxt *ctxt)
 static int em_bswap(struct x86_emulate_ctxt *ctxt)
 {
 	switch (ctxt->op_bytes) {
-#ifdef CONFIG_X86_64
 	case 8:
 		asm("bswap %0" : "+r"(ctxt->dst.val));
 		break;
-#endif
 	default:
 		asm("bswap %0" : "+r"(*(u32 *)&ctxt->dst.val));
 		break;
@@ -4767,12 +4751,10 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int
 	case X86EMUL_MODE_PROT32:
 		def_op_bytes = def_ad_bytes = 4;
 		break;
-#ifdef CONFIG_X86_64
 	case X86EMUL_MODE_PROT64:
 		def_op_bytes = 4;
 		def_ad_bytes = 8;
 		break;
-#endif
 	default:
 		return EMULATION_FAILED;
 	}
diff --git a/arch/x86/kvm/fpu.h b/arch/x86/kvm/fpu.h
index 3ba12888bf66..56a402dbf24a 100644
--- a/arch/x86/kvm/fpu.h
+++ b/arch/x86/kvm/fpu.h
@@ -26,7 +26,6 @@ static inline void _kvm_read_sse_reg(int reg, sse128_t *data)
 	case 5: asm("movdqa %%xmm5, %0" : "=m"(*data)); break;
 	case 6: asm("movdqa %%xmm6, %0" : "=m"(*data)); break;
 	case 7: asm("movdqa %%xmm7, %0" : "=m"(*data)); break;
-#ifdef CONFIG_X86_64
 	case 8: asm("movdqa %%xmm8, %0" : "=m"(*data)); break;
 	case 9: asm("movdqa %%xmm9, %0" : "=m"(*data)); break;
 	case 10: asm("movdqa %%xmm10, %0" : "=m"(*data)); break;
@@ -35,7 +34,6 @@ static inline void _kvm_read_sse_reg(int reg, sse128_t *data)
 	case 13: asm("movdqa %%xmm13, %0" : "=m"(*data)); break;
 	case 14: asm("movdqa %%xmm14, %0" : "=m"(*data)); break;
 	case 15: asm("movdqa %%xmm15, %0" : "=m"(*data)); break;
-#endif
 	default: BUG();
 	}
 }
@@ -51,7 +49,6 @@ static inline void _kvm_write_sse_reg(int reg, const sse128_t *data)
 	case 5: asm("movdqa %0, %%xmm5" : : "m"(*data)); break;
 	case 6: asm("movdqa %0, %%xmm6" : : "m"(*data)); break;
 	case 7: asm("movdqa %0, %%xmm7" : : "m"(*data)); break;
-#ifdef CONFIG_X86_64
 	case 8: asm("movdqa %0, %%xmm8" : : "m"(*data)); break;
 	case 9: asm("movdqa %0, %%xmm9" : : "m"(*data)); break;
 	case 10: asm("movdqa %0, %%xmm10" : : "m"(*data)); break;
@@ -60,7 +57,6 @@ static inline void _kvm_write_sse_reg(int reg, const sse128_t *data)
 	case 13: asm("movdqa %0, %%xmm13" : : "m"(*data)); break;
 	case 14: asm("movdqa %0, %%xmm14" : : "m"(*data)); break;
 	case 15: asm("movdqa %0, %%xmm15" : : "m"(*data)); break;
-#endif
 	default: BUG();
 	}
 }
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 4f0a94346d00..8fb9f45c7465 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2532,14 +2532,11 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 		return 1;
 	}
 
-#ifdef CONFIG_X86_64
 	if (is_64_bit_hypercall(vcpu)) {
 		hc.param = kvm_rcx_read(vcpu);
 		hc.ingpa = kvm_rdx_read(vcpu);
 		hc.outgpa = kvm_r8_read(vcpu);
-	} else
-#endif
-	{
+	} else {
 		hc.param = ((u64)kvm_rdx_read(vcpu) << 32) |
 			    (kvm_rax_read(vcpu) & 0xffffffff);
 		hc.ingpa = ((u64)kvm_rbx_read(vcpu) << 32) |
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index cd57a517d04a..b2b63a835797 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -40,11 +40,7 @@
 #include "i8254.h"
 #include "x86.h"
 
-#ifndef CONFIG_X86_64
-#define mod_64(x, y) ((x) - (y) * div64_u64(x, y))
-#else
 #define mod_64(x, y) ((x) % (y))
-#endif
 
 #define RW_STATE_LSB 1
 #define RW_STATE_MSB 2
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 36a8786db291..66d12dc5b243 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -32,7 +32,6 @@ BUILD_KVM_GPR_ACCESSORS(rdx, RDX)
 BUILD_KVM_GPR_ACCESSORS(rbp, RBP)
 BUILD_KVM_GPR_ACCESSORS(rsi, RSI)
 BUILD_KVM_GPR_ACCESSORS(rdi, RDI)
-#ifdef CONFIG_X86_64
 BUILD_KVM_GPR_ACCESSORS(r8,  R8)
 BUILD_KVM_GPR_ACCESSORS(r9,  R9)
 BUILD_KVM_GPR_ACCESSORS(r10, R10)
@@ -41,7 +40,6 @@ BUILD_KVM_GPR_ACCESSORS(r12, R12)
 BUILD_KVM_GPR_ACCESSORS(r13, R13)
 BUILD_KVM_GPR_ACCESSORS(r14, R14)
 BUILD_KVM_GPR_ACCESSORS(r15, R15)
-#endif
 
 /*
  * Using the register cache from interrupt context is generally not allowed, as
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 10495fffb890..970761100e26 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -305,11 +305,7 @@ typedef void (*fastop_t)(struct fastop *);
  * also uses _eip, RIP cannot be a register operand nor can it be an operand in
  * a ModRM or SIB byte.
  */
-#ifdef CONFIG_X86_64
 #define NR_EMULATOR_GPRS	16
-#else
-#define NR_EMULATOR_GPRS	8
-#endif
 
 struct x86_emulate_ctxt {
 	void *vcpu;
@@ -501,11 +497,7 @@ enum x86_intercept {
 };
 
 /* Host execution mode. */
-#if defined(CONFIG_X86_32)
-#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT32
-#elif defined(CONFIG_X86_64)
 #define X86EMUL_MODE_HOST X86EMUL_MODE_PROT64
-#endif
 
 int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int emulation_type);
 bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3c83951c619e..53a10a0ca03b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -46,11 +46,7 @@
 #include "hyperv.h"
 #include "smm.h"
 
-#ifndef CONFIG_X86_64
-#define mod_64(x, y) ((x) - (y) * div64_u64(x, y))
-#else
 #define mod_64(x, y) ((x) % (y))
-#endif
 
 /* 14 is the version for Xeon and Pentium 8.4.8*/
 #define APIC_VERSION			0x14UL
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e9322358678b..91e0969d23d7 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -238,11 +238,7 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
 	return smp_load_acquire(&kvm->arch.shadow_root_allocated);
 }
 
-#ifdef CONFIG_X86_64
 extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22e7ad235123..23d5074edbc5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -107,10 +107,8 @@ bool tdp_enabled = false;
 
 static bool __ro_after_init tdp_mmu_allowed;
 
-#ifdef CONFIG_X86_64
 bool __read_mostly tdp_mmu_enabled = true;
 module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
-#endif
 
 static int max_huge_page_level __read_mostly;
 static int tdp_root_level __read_mostly;
@@ -332,7 +330,6 @@ static int is_cpuid_PSE36(void)
 	return 1;
 }
 
-#ifdef CONFIG_X86_64
 static void __set_spte(u64 *sptep, u64 spte)
 {
 	KVM_MMU_WARN_ON(is_ept_ve_possible(spte));
@@ -355,122 +352,6 @@ static u64 __get_spte_lockless(u64 *sptep)
 {
 	return READ_ONCE(*sptep);
 }
-#else
-union split_spte {
-	struct {
-		u32 spte_low;
-		u32 spte_high;
-	};
-	u64 spte;
-};
-
-static void count_spte_clear(u64 *sptep, u64 spte)
-{
-	struct kvm_mmu_page *sp =  sptep_to_sp(sptep);
-
-	if (is_shadow_present_pte(spte))
-		return;
-
-	/* Ensure the spte is completely set before we increase the count */
-	smp_wmb();
-	sp->clear_spte_count++;
-}
-
-static void __set_spte(u64 *sptep, u64 spte)
-{
-	union split_spte *ssptep, sspte;
-
-	ssptep = (union split_spte *)sptep;
-	sspte = (union split_spte)spte;
-
-	ssptep->spte_high = sspte.spte_high;
-
-	/*
-	 * If we map the spte from nonpresent to present, We should store
-	 * the high bits firstly, then set present bit, so cpu can not
-	 * fetch this spte while we are setting the spte.
-	 */
-	smp_wmb();
-
-	WRITE_ONCE(ssptep->spte_low, sspte.spte_low);
-}
-
-static void __update_clear_spte_fast(u64 *sptep, u64 spte)
-{
-	union split_spte *ssptep, sspte;
-
-	ssptep = (union split_spte *)sptep;
-	sspte = (union split_spte)spte;
-
-	WRITE_ONCE(ssptep->spte_low, sspte.spte_low);
-
-	/*
-	 * If we map the spte from present to nonpresent, we should clear
-	 * present bit firstly to avoid vcpu fetch the old high bits.
-	 */
-	smp_wmb();
-
-	ssptep->spte_high = sspte.spte_high;
-	count_spte_clear(sptep, spte);
-}
-
-static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
-{
-	union split_spte *ssptep, sspte, orig;
-
-	ssptep = (union split_spte *)sptep;
-	sspte = (union split_spte)spte;
-
-	/* xchg acts as a barrier before the setting of the high bits */
-	orig.spte_low = xchg(&ssptep->spte_low, sspte.spte_low);
-	orig.spte_high = ssptep->spte_high;
-	ssptep->spte_high = sspte.spte_high;
-	count_spte_clear(sptep, spte);
-
-	return orig.spte;
-}
-
-/*
- * The idea using the light way get the spte on x86_32 guest is from
- * gup_get_pte (mm/gup.c).
- *
- * An spte tlb flush may be pending, because they are coalesced and
- * we are running out of the MMU lock.  Therefore
- * we need to protect against in-progress updates of the spte.
- *
- * Reading the spte while an update is in progress may get the old value
- * for the high part of the spte.  The race is fine for a present->non-present
- * change (because the high part of the spte is ignored for non-present spte),
- * but for a present->present change we must reread the spte.
- *
- * All such changes are done in two steps (present->non-present and
- * non-present->present), hence it is enough to count the number of
- * present->non-present updates: if it changed while reading the spte,
- * we might have hit the race.  This is done using clear_spte_count.
- */
-static u64 __get_spte_lockless(u64 *sptep)
-{
-	struct kvm_mmu_page *sp =  sptep_to_sp(sptep);
-	union split_spte spte, *orig = (union split_spte *)sptep;
-	int count;
-
-retry:
-	count = sp->clear_spte_count;
-	smp_rmb();
-
-	spte.spte_low = orig->spte_low;
-	smp_rmb();
-
-	spte.spte_high = orig->spte_high;
-	smp_rmb();
-
-	if (unlikely(spte.spte_low != orig->spte_low ||
-	      count != sp->clear_spte_count))
-		goto retry;
-
-	return spte.spte;
-}
-#endif
 
 /* Rules for using mmu_spte_set:
  * Set the sptep from nonpresent to present.
@@ -3931,7 +3812,6 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
 	if (!pae_root)
 		return -ENOMEM;
 
-#ifdef CONFIG_X86_64
 	pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
 	if (!pml4_root)
 		goto err_pml4;
@@ -3941,7 +3821,6 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
 		if (!pml5_root)
 			goto err_pml5;
 	}
-#endif
 
 	mmu->pae_root = pae_root;
 	mmu->pml4_root = pml4_root;
@@ -3949,13 +3828,11 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
 
 	return 0;
 
-#ifdef CONFIG_X86_64
 err_pml5:
 	free_page((unsigned long)pml4_root);
 err_pml4:
 	free_page((unsigned long)pae_root);
 	return -ENOMEM;
-#endif
 }
 
 static bool is_unsync_root(hpa_t root)
@@ -4584,11 +4461,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 	int r = 1;
 	u32 flags = vcpu->arch.apf.host_apf_flags;
 
-#ifndef CONFIG_X86_64
-	/* A 64-bit CR2 should be impossible on 32-bit KVM. */
-	if (WARN_ON_ONCE(fault_address >> 32))
-		return -EFAULT;
-#endif
 	/*
 	 * Legacy #PF exception only have a 32-bit error code.  Simply drop the
 	 * upper bits as KVM doesn't use them for #PF (because they are never
@@ -4622,7 +4494,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 }
 EXPORT_SYMBOL_GPL(kvm_handle_page_fault);
 
-#ifdef CONFIG_X86_64
 static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 				  struct kvm_page_fault *fault)
 {
@@ -4656,7 +4527,6 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 	read_unlock(&vcpu->kvm->mmu_lock);
 	return r;
 }
-#endif
 
 bool kvm_mmu_may_ignore_guest_pat(void)
 {
@@ -4673,10 +4543,8 @@ bool kvm_mmu_may_ignore_guest_pat(void)
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
-#ifdef CONFIG_X86_64
 	if (tdp_mmu_enabled)
 		return kvm_tdp_mmu_page_fault(vcpu, fault);
-#endif
 
 	return direct_page_fault(vcpu, fault);
 }
@@ -6249,9 +6117,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 	tdp_root_level = tdp_forced_root_level;
 	max_tdp_level = tdp_max_root_level;
 
-#ifdef CONFIG_X86_64
 	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
-#endif
 	/*
 	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
 	 * of kernel support, e.g. KVM may be capable of using 1GB pages when
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index b00abbe3f6cf..34cfffc32476 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -116,21 +116,12 @@ struct kvm_mmu_page {
 	 * isn't properly aligned, etc...
 	 */
 	struct list_head possible_nx_huge_page_link;
-#ifdef CONFIG_X86_32
-	/*
-	 * Used out of the mmu-lock to avoid reading spte values while an
-	 * update is in progress; see the comments in __get_spte_lockless().
-	 */
-	int clear_spte_count;
-#endif
 
 	/* Number of writes since the last time traversal visited this page.  */
 	atomic_t write_flooding_count;
 
-#ifdef CONFIG_X86_64
 	/* Used for freeing the page asynchronously if it is a TDP MMU page. */
 	struct rcu_head rcu_head;
-#endif
 };
 
 extern struct kmem_cache *mmu_page_header_cache;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f4711674c47b..fa6493641429 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -29,11 +29,7 @@
 	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
 	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#define PT_HAVE_ACCESSED_DIRTY(mmu) true
-	#ifdef CONFIG_X86_64
 	#define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL
-	#else
-	#define PT_MAX_FULL_LEVELS 2
-	#endif
 #elif PTTYPE == 32
 	#define pt_element_t u32
 	#define guest_walker guest_walker32
@@ -862,11 +858,6 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	gpa_t gpa = INVALID_GPA;
 	int r;
 
-#ifndef CONFIG_X86_64
-	/* A 64-bit GVA should be impossible on 32-bit KVM. */
-	WARN_ON_ONCE((addr >> 32) && mmu == vcpu->arch.walk_mmu);
-#endif
-
 	r = FNAME(walk_addr_generic)(&walker, vcpu, mmu, addr, access);
 
 	if (r) {
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index f332b33bc817..fd776ab25cc3 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -160,13 +160,8 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
  * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
  *              bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
  */
-#ifdef CONFIG_X86_64
 #define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)
 static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
-#else
-#define SHADOW_NONPRESENT_VALUE	0ULL
-#endif
-
 
 /*
  * True if A/D bits are supported in hardware and are enabled by KVM.  When
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index f03ca0dd13d9..c137fdd6b347 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -67,10 +67,6 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
 					u64 *spte);
 
-#ifdef CONFIG_X86_64
 static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
-#else
-static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
-#endif
 
 #endif /* __KVM_X86_MMU_TDP_MMU_H */
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index 85241c0c7f56..ad764adac3de 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -165,7 +165,6 @@ static void enter_smm_save_seg_32(struct kvm_vcpu *vcpu,
 	state->flags = enter_smm_get_segment_flags(&seg);
 }
 
-#ifdef CONFIG_X86_64
 static void enter_smm_save_seg_64(struct kvm_vcpu *vcpu,
 				  struct kvm_smm_seg_state_64 *state,
 				  int n)
@@ -178,7 +177,6 @@ static void enter_smm_save_seg_64(struct kvm_vcpu *vcpu,
 	state->limit = seg.limit;
 	state->base = seg.base;
 }
-#endif
 
 static void enter_smm_save_state_32(struct kvm_vcpu *vcpu,
 				    struct kvm_smram_state_32 *smram)
@@ -223,7 +221,6 @@ static void enter_smm_save_state_32(struct kvm_vcpu *vcpu,
 	smram->int_shadow = kvm_x86_call(get_interrupt_shadow)(vcpu);
 }
 
-#ifdef CONFIG_X86_64
 static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
 				    struct kvm_smram_state_64 *smram)
 {
@@ -269,7 +266,6 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
 
 	smram->int_shadow = kvm_x86_call(get_interrupt_shadow)(vcpu);
 }
-#endif
 
 void enter_smm(struct kvm_vcpu *vcpu)
 {
@@ -282,11 +278,9 @@ void enter_smm(struct kvm_vcpu *vcpu)
 
 	memset(smram.bytes, 0, sizeof(smram.bytes));
 
-#ifdef CONFIG_X86_64
 	if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
 		enter_smm_save_state_64(vcpu, &smram.smram64);
 	else
-#endif
 		enter_smm_save_state_32(vcpu, &smram.smram32);
 
 	/*
@@ -352,11 +346,9 @@ void enter_smm(struct kvm_vcpu *vcpu)
 	kvm_set_segment(vcpu, &ds, VCPU_SREG_GS);
 	kvm_set_segment(vcpu, &ds, VCPU_SREG_SS);
 
-#ifdef CONFIG_X86_64
 	if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
 		if (kvm_x86_call(set_efer)(vcpu, 0))
 			goto error;
-#endif
 
 	kvm_update_cpuid_runtime(vcpu);
 	kvm_mmu_reset_context(vcpu);
@@ -394,8 +386,6 @@ static int rsm_load_seg_32(struct kvm_vcpu *vcpu,
 	return X86EMUL_CONTINUE;
 }
 
-#ifdef CONFIG_X86_64
-
 static int rsm_load_seg_64(struct kvm_vcpu *vcpu,
 			   const struct kvm_smm_seg_state_64 *state,
 			   int n)
@@ -409,7 +399,6 @@ static int rsm_load_seg_64(struct kvm_vcpu *vcpu,
 	kvm_set_segment(vcpu, &desc, n);
 	return X86EMUL_CONTINUE;
 }
-#endif
 
 static int rsm_enter_protected_mode(struct kvm_vcpu *vcpu,
 				    u64 cr0, u64 cr3, u64 cr4)
@@ -507,7 +496,6 @@ static int rsm_load_state_32(struct x86_emulate_ctxt *ctxt,
 	return r;
 }
 
-#ifdef CONFIG_X86_64
 static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
 			     const struct kvm_smram_state_64 *smstate)
 {
@@ -559,7 +547,6 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
 
 	return X86EMUL_CONTINUE;
 }
-#endif
 
 int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)
 {
@@ -585,7 +572,6 @@ int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)
 	 * CR0/CR3/CR4/EFER.  It's all a bit more complicated if the vCPU
 	 * supports long mode.
 	 */
-#ifdef CONFIG_X86_64
 	if (guest_cpuid_has(vcpu, X86_FEATURE_LM)) {
 		struct kvm_segment cs_desc;
 		unsigned long cr4;
@@ -601,14 +587,12 @@ int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)
 		cs_desc.s = cs_desc.g = cs_desc.present = 1;
 		kvm_set_segment(vcpu, &cs_desc, VCPU_SREG_CS);
 	}
-#endif
 
 	/* For the 64-bit case, this will clear EFER.LMA.  */
 	cr0 = kvm_read_cr0(vcpu);
 	if (cr0 & X86_CR0_PE)
 		kvm_set_cr0(vcpu, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
 
-#ifdef CONFIG_X86_64
 	if (guest_cpuid_has(vcpu, X86_FEATURE_LM)) {
 		unsigned long cr4, efer;
 
@@ -621,7 +605,6 @@ int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)
 		efer = 0;
 		kvm_set_msr(vcpu, MSR_EFER, efer);
 	}
-#endif
 
 	/*
 	 * FIXME: When resuming L2 (a.k.a. guest mode), the transition to guest
@@ -633,11 +616,9 @@ int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)
 	if (kvm_x86_call(leave_smm)(vcpu, &smram))
 		return X86EMUL_UNHANDLEABLE;
 
-#ifdef CONFIG_X86_64
 	if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
 		ret = rsm_load_state_64(ctxt, &smram.smram64);
 	else
-#endif
 		ret = rsm_load_state_32(ctxt, &smram.smram32);
 
 	/*
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 943bd074a5d3..a78cdb1a9314 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -830,7 +830,6 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
 	save->rbp = svm->vcpu.arch.regs[VCPU_REGS_RBP];
 	save->rsi = svm->vcpu.arch.regs[VCPU_REGS_RSI];
 	save->rdi = svm->vcpu.arch.regs[VCPU_REGS_RDI];
-#ifdef CONFIG_X86_64
 	save->r8  = svm->vcpu.arch.regs[VCPU_REGS_R8];
 	save->r9  = svm->vcpu.arch.regs[VCPU_REGS_R9];
 	save->r10 = svm->vcpu.arch.regs[VCPU_REGS_R10];
@@ -839,7 +838,6 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
 	save->r13 = svm->vcpu.arch.regs[VCPU_REGS_R13];
 	save->r14 = svm->vcpu.arch.regs[VCPU_REGS_R14];
 	save->r15 = svm->vcpu.arch.regs[VCPU_REGS_R15];
-#endif
 	save->rip = svm->vcpu.arch.regs[VCPU_REGS_RIP];
 
 	/* Sync some non-GPR registers before encrypting */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dd15cc635655..aeb24495cf64 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -89,14 +89,12 @@ static const struct svm_direct_access_msrs {
 	{ .index = MSR_IA32_SYSENTER_CS,		.always = true  },
 	{ .index = MSR_IA32_SYSENTER_EIP,		.always = false },
 	{ .index = MSR_IA32_SYSENTER_ESP,		.always = false },
-#ifdef CONFIG_X86_64
 	{ .index = MSR_GS_BASE,				.always = true  },
 	{ .index = MSR_FS_BASE,				.always = true  },
 	{ .index = MSR_KERNEL_GS_BASE,			.always = true  },
 	{ .index = MSR_LSTAR,				.always = true  },
 	{ .index = MSR_CSTAR,				.always = true  },
 	{ .index = MSR_SYSCALL_MASK,			.always = true  },
-#endif
 	{ .index = MSR_IA32_SPEC_CTRL,			.always = false },
 	{ .index = MSR_IA32_PRED_CMD,			.always = false },
 	{ .index = MSR_IA32_FLUSH_CMD,			.always = false },
@@ -288,11 +286,7 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu);
 
 static int get_npt_level(void)
 {
-#ifdef CONFIG_X86_64
 	return pgtable_l5_enabled() ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
-#else
-	return PT32E_ROOT_LEVEL;
-#endif
 }
 
 int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
@@ -1860,7 +1854,6 @@ void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	u64 hcr0 = cr0;
 	bool old_paging = is_paging(vcpu);
 
-#ifdef CONFIG_X86_64
 	if (vcpu->arch.efer & EFER_LME) {
 		if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
 			vcpu->arch.efer |= EFER_LMA;
@@ -1874,7 +1867,6 @@ void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 				svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME);
 		}
 	}
-#endif
 	vcpu->arch.cr0 = cr0;
 
 	if (!npt_enabled) {
@@ -2871,7 +2863,6 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_STAR:
 		msr_info->data = svm->vmcb01.ptr->save.star;
 		break;
-#ifdef CONFIG_X86_64
 	case MSR_LSTAR:
 		msr_info->data = svm->vmcb01.ptr->save.lstar;
 		break;
@@ -2890,7 +2881,6 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_SYSCALL_MASK:
 		msr_info->data = svm->vmcb01.ptr->save.sfmask;
 		break;
-#endif
 	case MSR_IA32_SYSENTER_CS:
 		msr_info->data = svm->vmcb01.ptr->save.sysenter_cs;
 		break;
@@ -3102,7 +3092,6 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_STAR:
 		svm->vmcb01.ptr->save.star = data;
 		break;
-#ifdef CONFIG_X86_64
 	case MSR_LSTAR:
 		svm->vmcb01.ptr->save.lstar = data;
 		break;
@@ -3121,7 +3110,6 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_SYSCALL_MASK:
 		svm->vmcb01.ptr->save.sfmask = data;
 		break;
-#endif
 	case MSR_IA32_SYSENTER_CS:
 		svm->vmcb01.ptr->save.sysenter_cs = data;
 		break;
@@ -5323,14 +5311,6 @@ static __init int svm_hardware_setup(void)
 		kvm_enable_efer_bits(EFER_SVME | EFER_LMSLE);
 	}
 
-	/*
-	 * KVM's MMU doesn't support using 2-level paging for itself, and thus
-	 * NPT isn't supported if the host is using 2-level paging since host
-	 * CR4 is unchanged on VMRUN.
-	 */
-	if (!IS_ENABLED(CONFIG_X86_64) && !IS_ENABLED(CONFIG_X86_PAE))
-		npt_enabled = false;
-
 	if (!boot_cpu_has(X86_FEATURE_NPT))
 		npt_enabled = false;
 
@@ -5378,8 +5358,7 @@ static __init int svm_hardware_setup(void)
 
 	if (vls) {
 		if (!npt_enabled ||
-		    !boot_cpu_has(X86_FEATURE_V_VMSAVE_VMLOAD) ||
-		    !IS_ENABLED(CONFIG_X86_64)) {
+		    !boot_cpu_has(X86_FEATURE_V_VMSAVE_VMLOAD)) {
 			vls = false;
 		} else {
 			pr_info("Virtual VMLOAD VMSAVE supported\n");
diff --git a/arch/x86/kvm/svm/vmenter.S b/arch/x86/kvm/svm/vmenter.S
index 2ed80aea3bb1..2e8c0f5a238a 100644
--- a/arch/x86/kvm/svm/vmenter.S
+++ b/arch/x86/kvm/svm/vmenter.S
@@ -19,7 +19,6 @@
 #define VCPU_RSI	(SVM_vcpu_arch_regs + __VCPU_REGS_RSI * WORD_SIZE)
 #define VCPU_RDI	(SVM_vcpu_arch_regs + __VCPU_REGS_RDI * WORD_SIZE)
 
-#ifdef CONFIG_X86_64
 #define VCPU_R8		(SVM_vcpu_arch_regs + __VCPU_REGS_R8  * WORD_SIZE)
 #define VCPU_R9		(SVM_vcpu_arch_regs + __VCPU_REGS_R9  * WORD_SIZE)
 #define VCPU_R10	(SVM_vcpu_arch_regs + __VCPU_REGS_R10 * WORD_SIZE)
@@ -28,7 +27,6 @@
 #define VCPU_R13	(SVM_vcpu_arch_regs + __VCPU_REGS_R13 * WORD_SIZE)
 #define VCPU_R14	(SVM_vcpu_arch_regs + __VCPU_REGS_R14 * WORD_SIZE)
 #define VCPU_R15	(SVM_vcpu_arch_regs + __VCPU_REGS_R15 * WORD_SIZE)
-#endif
 
 #define SVM_vmcb01_pa	(SVM_vmcb01 + KVM_VMCB_pa)
 
@@ -101,15 +99,10 @@
 SYM_FUNC_START(__svm_vcpu_run)
 	push %_ASM_BP
 	mov  %_ASM_SP, %_ASM_BP
-#ifdef CONFIG_X86_64
 	push %r15
 	push %r14
 	push %r13
 	push %r12
-#else
-	push %edi
-	push %esi
-#endif
 	push %_ASM_BX
 
 	/*
@@ -157,7 +150,6 @@ SYM_FUNC_START(__svm_vcpu_run)
 	mov VCPU_RBX(%_ASM_DI), %_ASM_BX
 	mov VCPU_RBP(%_ASM_DI), %_ASM_BP
 	mov VCPU_RSI(%_ASM_DI), %_ASM_SI
-#ifdef CONFIG_X86_64
 	mov VCPU_R8 (%_ASM_DI),  %r8
 	mov VCPU_R9 (%_ASM_DI),  %r9
 	mov VCPU_R10(%_ASM_DI), %r10
@@ -166,7 +158,6 @@ SYM_FUNC_START(__svm_vcpu_run)
 	mov VCPU_R13(%_ASM_DI), %r13
 	mov VCPU_R14(%_ASM_DI), %r14
 	mov VCPU_R15(%_ASM_DI), %r15
-#endif
 	mov VCPU_RDI(%_ASM_DI), %_ASM_DI
 
 	/* Enter guest mode */
@@ -186,7 +177,6 @@ SYM_FUNC_START(__svm_vcpu_run)
 	mov %_ASM_BP,   VCPU_RBP(%_ASM_AX)
 	mov %_ASM_SI,   VCPU_RSI(%_ASM_AX)
 	mov %_ASM_DI,   VCPU_RDI(%_ASM_AX)
-#ifdef CONFIG_X86_64
 	mov %r8,  VCPU_R8 (%_ASM_AX)
 	mov %r9,  VCPU_R9 (%_ASM_AX)
 	mov %r10, VCPU_R10(%_ASM_AX)
@@ -195,7 +185,6 @@ SYM_FUNC_START(__svm_vcpu_run)
 	mov %r13, VCPU_R13(%_ASM_AX)
 	mov %r14, VCPU_R14(%_ASM_AX)
 	mov %r15, VCPU_R15(%_ASM_AX)
-#endif
 
 	/* @svm can stay in RDI from now on.  */
 	mov %_ASM_AX, %_ASM_DI
@@ -239,7 +228,6 @@ SYM_FUNC_START(__svm_vcpu_run)
 	xor %ebp, %ebp
 	xor %esi, %esi
 	xor %edi, %edi
-#ifdef CONFIG_X86_64
 	xor %r8d,  %r8d
 	xor %r9d,  %r9d
 	xor %r10d, %r10d
@@ -248,22 +236,16 @@ SYM_FUNC_START(__svm_vcpu_run)
 	xor %r13d, %r13d
 	xor %r14d, %r14d
 	xor %r15d, %r15d
-#endif
 
 	/* "Pop" @spec_ctrl_intercepted.  */
 	pop %_ASM_BX
 
 	pop %_ASM_BX
 
-#ifdef CONFIG_X86_64
 	pop %r12
 	pop %r13
 	pop %r14
 	pop %r15
-#else
-	pop %esi
-	pop %edi
-#endif
 	pop %_ASM_BP
 	RET
 
@@ -293,7 +275,6 @@ SYM_FUNC_END(__svm_vcpu_run)
 #ifdef CONFIG_KVM_AMD_SEV
 
 
-#ifdef CONFIG_X86_64
 #define SEV_ES_GPRS_BASE 0x300
 #define SEV_ES_RBX	(SEV_ES_GPRS_BASE + __VCPU_REGS_RBX * WORD_SIZE)
 #define SEV_ES_RBP	(SEV_ES_GPRS_BASE + __VCPU_REGS_RBP * WORD_SIZE)
@@ -303,7 +284,6 @@ SYM_FUNC_END(__svm_vcpu_run)
 #define SEV_ES_R13	(SEV_ES_GPRS_BASE + __VCPU_REGS_R13 * WORD_SIZE)
 #define SEV_ES_R14	(SEV_ES_GPRS_BASE + __VCPU_REGS_R14 * WORD_SIZE)
 #define SEV_ES_R15	(SEV_ES_GPRS_BASE + __VCPU_REGS_R15 * WORD_SIZE)
-#endif
 
 /**
  * __svm_sev_es_vcpu_run - Run a SEV-ES vCPU via a transition to SVM guest mode
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index d3aeffd6ae75..0bceb33e1d2c 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -897,8 +897,6 @@ TRACE_EVENT(kvm_write_tsc_offset,
 		  __entry->previous_tsc_offset, __entry->next_tsc_offset)
 );
 
-#ifdef CONFIG_X86_64
-
 #define host_clocks					\
 	{VDSO_CLOCKMODE_NONE, "none"},			\
 	{VDSO_CLOCKMODE_TSC,  "tsc"}			\
@@ -955,8 +953,6 @@ TRACE_EVENT(kvm_track_tsc,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 
-#endif /* CONFIG_X86_64 */
-
 /*
  * Tracepoint for PML full VMEXIT.
  */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 92d35cc6cd15..32a8dc508cd7 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -134,10 +134,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.pi_update_irte = vmx_pi_update_irte,
 	.pi_start_assignment = vmx_pi_start_assignment,
 
-#ifdef CONFIG_X86_64
 	.set_hv_timer = vmx_set_hv_timer,
 	.cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
 
 	.setup_mce = vmx_setup_mce,
 
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index aa78b6f38dfe..3e7f004d1788 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -86,11 +86,7 @@ static void init_vmcs_shadow_fields(void)
 
 		clear_bit(field, vmx_vmread_bitmap);
 		if (field & 1)
-#ifdef CONFIG_X86_64
 			continue;
-#else
-			entry.offset += sizeof(u32);
-#endif
 		shadow_read_only_fields[j++] = entry;
 	}
 	max_shadow_read_only_fields = j;
@@ -134,11 +130,7 @@ static void init_vmcs_shadow_fields(void)
 		clear_bit(field, vmx_vmwrite_bitmap);
 		clear_bit(field, vmx_vmread_bitmap);
 		if (field & 1)
-#ifdef CONFIG_X86_64
 			continue;
-#else
-			entry.offset += sizeof(u32);
-#endif
 		shadow_read_write_fields[j++] = entry;
 	}
 	max_shadow_read_write_fields = j;
@@ -283,10 +275,8 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
 
 	vmx_set_host_fs_gs(dest, src->fs_sel, src->gs_sel, src->fs_base, src->gs_base);
 	dest->ldt_sel = src->ldt_sel;
-#ifdef CONFIG_X86_64
 	dest->ds_sel = src->ds_sel;
 	dest->es_sel = src->es_sel;
-#endif
 }
 
 static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
@@ -695,7 +685,6 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	 * Always check vmcs01's bitmap to honor userspace MSR filters and any
 	 * other runtime changes to vmcs01's bitmap, e.g. dynamic pass-through.
 	 */
-#ifdef CONFIG_X86_64
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_FS_BASE, MSR_TYPE_RW);
 
@@ -704,7 +693,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
-#endif
+
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
 
@@ -2375,11 +2364,9 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
 	vmx->nested.l1_tpr_threshold = -1;
 	if (exec_control & CPU_BASED_TPR_SHADOW)
 		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
-#ifdef CONFIG_X86_64
 	else
 		exec_control |= CPU_BASED_CR8_LOAD_EXITING |
 				CPU_BASED_CR8_STORE_EXITING;
-#endif
 
 	/*
 	 * A vmexit (to either L1 hypervisor or L0 userspace) is always needed
@@ -3002,11 +2989,10 @@ static int nested_vmx_check_controls(struct kvm_vcpu *vcpu,
 static int nested_vmx_check_address_space_size(struct kvm_vcpu *vcpu,
 				       struct vmcs12 *vmcs12)
 {
-#ifdef CONFIG_X86_64
 	if (CC(!!(vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE) !=
 		!!(vcpu->arch.efer & EFER_LMA)))
 		return -EINVAL;
-#endif
+
 	return 0;
 }
 
@@ -6979,9 +6965,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
 
 	msrs->exit_ctls_high = vmcs_conf->vmexit_ctrl;
 	msrs->exit_ctls_high &=
-#ifdef CONFIG_X86_64
 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
-#endif
 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
 		VM_EXIT_CLEAR_BNDCFGS;
 	msrs->exit_ctls_high |=
@@ -7002,9 +6986,7 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
 
 	msrs->entry_ctls_high = vmcs_conf->vmentry_ctrl;
 	msrs->entry_ctls_high &=
-#ifdef CONFIG_X86_64
 		VM_ENTRY_IA32E_MODE |
-#endif
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
 	msrs->entry_ctls_high |=
 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
@@ -7027,9 +7009,7 @@ static void nested_vmx_setup_cpubased_ctls(struct vmcs_config *vmcs_conf,
 		CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
 		CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
 		CPU_BASED_CR3_STORE_EXITING |
-#ifdef CONFIG_X86_64
 		CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
-#endif
 		CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
 		CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_TRAP_FLAG |
 		CPU_BASED_MONITOR_EXITING | CPU_BASED_RDPMC_EXITING |
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index b25625314658..487137da7860 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -39,9 +39,7 @@ struct vmcs_host_state {
 	unsigned long rsp;
 
 	u16           fs_sel, gs_sel, ldt_sel;
-#ifdef CONFIG_X86_64
 	u16           ds_sel, es_sel;
-#endif
 };
 
 struct vmcs_controls_shadow {
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index f6986dee6f8c..5a548724ca1f 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -20,7 +20,6 @@
 #define VCPU_RSI	__VCPU_REGS_RSI * WORD_SIZE
 #define VCPU_RDI	__VCPU_REGS_RDI * WORD_SIZE
 
-#ifdef CONFIG_X86_64
 #define VCPU_R8		__VCPU_REGS_R8  * WORD_SIZE
 #define VCPU_R9		__VCPU_REGS_R9  * WORD_SIZE
 #define VCPU_R10	__VCPU_REGS_R10 * WORD_SIZE
@@ -29,7 +28,6 @@
 #define VCPU_R13	__VCPU_REGS_R13 * WORD_SIZE
 #define VCPU_R14	__VCPU_REGS_R14 * WORD_SIZE
 #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
-#endif
 
 .macro VMX_DO_EVENT_IRQOFF call_insn call_target
 	/*
@@ -40,7 +38,6 @@
 	push %_ASM_BP
 	mov %_ASM_SP, %_ASM_BP
 
-#ifdef CONFIG_X86_64
 	/*
 	 * Align RSP to a 16-byte boundary (to emulate CPU behavior) before
 	 * creating the synthetic interrupt stack frame for the IRQ/NMI.
@@ -48,7 +45,6 @@
 	and  $-16, %rsp
 	push $__KERNEL_DS
 	push %rbp
-#endif
 	pushf
 	push $__KERNEL_CS
 	\call_insn \call_target
@@ -79,15 +75,10 @@
 SYM_FUNC_START(__vmx_vcpu_run)
 	push %_ASM_BP
 	mov  %_ASM_SP, %_ASM_BP
-#ifdef CONFIG_X86_64
 	push %r15
 	push %r14
 	push %r13
 	push %r12
-#else
-	push %edi
-	push %esi
-#endif
 	push %_ASM_BX
 
 	/* Save @vmx for SPEC_CTRL handling */
@@ -148,7 +139,6 @@ SYM_FUNC_START(__vmx_vcpu_run)
 	mov VCPU_RBP(%_ASM_AX), %_ASM_BP
 	mov VCPU_RSI(%_ASM_AX), %_ASM_SI
 	mov VCPU_RDI(%_ASM_AX), %_ASM_DI
-#ifdef CONFIG_X86_64
 	mov VCPU_R8 (%_ASM_AX),  %r8
 	mov VCPU_R9 (%_ASM_AX),  %r9
 	mov VCPU_R10(%_ASM_AX), %r10
@@ -157,7 +147,7 @@ SYM_FUNC_START(__vmx_vcpu_run)
 	mov VCPU_R13(%_ASM_AX), %r13
 	mov VCPU_R14(%_ASM_AX), %r14
 	mov VCPU_R15(%_ASM_AX), %r15
-#endif
+
 	/* Load guest RAX.  This kills the @regs pointer! */
 	mov VCPU_RAX(%_ASM_AX), %_ASM_AX
 
@@ -210,7 +200,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
 	mov %_ASM_BP, VCPU_RBP(%_ASM_AX)
 	mov %_ASM_SI, VCPU_RSI(%_ASM_AX)
 	mov %_ASM_DI, VCPU_RDI(%_ASM_AX)
-#ifdef CONFIG_X86_64
 	mov %r8,  VCPU_R8 (%_ASM_AX)
 	mov %r9,  VCPU_R9 (%_ASM_AX)
 	mov %r10, VCPU_R10(%_ASM_AX)
@@ -219,7 +208,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
 	mov %r13, VCPU_R13(%_ASM_AX)
 	mov %r14, VCPU_R14(%_ASM_AX)
 	mov %r15, VCPU_R15(%_ASM_AX)
-#endif
 
 	/* Clear return value to indicate VM-Exit (as opposed to VM-Fail). */
 	xor %ebx, %ebx
@@ -244,7 +232,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
 	xor %ebp, %ebp
 	xor %esi, %esi
 	xor %edi, %edi
-#ifdef CONFIG_X86_64
 	xor %r8d,  %r8d
 	xor %r9d,  %r9d
 	xor %r10d, %r10d
@@ -253,7 +240,6 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
 	xor %r13d, %r13d
 	xor %r14d, %r14d
 	xor %r15d, %r15d
-#endif
 
 	/*
 	 * IMPORTANT: RSB filling and SPEC_CTRL handling must be done before
@@ -281,15 +267,10 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
 	mov %_ASM_BX, %_ASM_AX
 
 	pop %_ASM_BX
-#ifdef CONFIG_X86_64
 	pop %r12
 	pop %r13
 	pop %r14
 	pop %r15
-#else
-	pop %esi
-	pop %edi
-#endif
 	pop %_ASM_BP
 	RET
 
@@ -325,14 +306,12 @@ SYM_FUNC_START(vmread_error_trampoline)
 	push %_ASM_AX
 	push %_ASM_CX
 	push %_ASM_DX
-#ifdef CONFIG_X86_64
 	push %rdi
 	push %rsi
 	push %r8
 	push %r9
 	push %r10
 	push %r11
-#endif
 
 	/* Load @field and @fault to arg1 and arg2 respectively. */
 	mov 3*WORD_SIZE(%_ASM_BP), %_ASM_ARG2
@@ -343,14 +322,12 @@ SYM_FUNC_START(vmread_error_trampoline)
 	/* Zero out @fault, which will be popped into the result register. */
 	_ASM_MOV $0, 3*WORD_SIZE(%_ASM_BP)
 
-#ifdef CONFIG_X86_64
 	pop %r11
 	pop %r10
 	pop %r9
 	pop %r8
 	pop %rsi
 	pop %rdi
-#endif
 	pop %_ASM_DX
 	pop %_ASM_CX
 	pop %_ASM_AX
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 893366e53732..de47bc57afe4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -140,9 +140,7 @@ module_param(dump_invalid_vmcs, bool, 0644);
 /* Guest_tsc -> host_tsc conversion requires 64-bit division.  */
 static int __read_mostly cpu_preemption_timer_multi;
 static bool __read_mostly enable_preemption_timer = 1;
-#ifdef CONFIG_X86_64
 module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
-#endif
 
 extern bool __read_mostly allow_smaller_maxphyaddr;
 module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
@@ -172,13 +170,11 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_IA32_PRED_CMD,
 	MSR_IA32_FLUSH_CMD,
 	MSR_IA32_TSC,
-#ifdef CONFIG_X86_64
 	MSR_FS_BASE,
 	MSR_GS_BASE,
 	MSR_KERNEL_GS_BASE,
 	MSR_IA32_XFD,
 	MSR_IA32_XFD_ERR,
-#endif
 	MSR_IA32_SYSENTER_CS,
 	MSR_IA32_SYSENTER_ESP,
 	MSR_IA32_SYSENTER_EIP,
@@ -1108,12 +1104,10 @@ static bool update_transition_efer(struct vcpu_vmx *vmx)
 	 * LMA and LME handled by hardware; SCE meaningless outside long mode.
 	 */
 	ignore_bits |= EFER_SCE;
-#ifdef CONFIG_X86_64
 	ignore_bits |= EFER_LMA | EFER_LME;
 	/* SCE is meaningful only in long mode on Intel */
 	if (guest_efer & EFER_LMA)
 		ignore_bits &= ~(u64)EFER_SCE;
-#endif
 
 	/*
 	 * On EPT, we can't emulate NX, so we must switch EFER atomically.
@@ -1147,35 +1141,6 @@ static bool update_transition_efer(struct vcpu_vmx *vmx)
 	return true;
 }
 
-#ifdef CONFIG_X86_32
-/*
- * On 32-bit kernels, VM exits still load the FS and GS bases from the
- * VMCS rather than the segment table.  KVM uses this helper to figure
- * out the current bases to poke them into the VMCS before entry.
- */
-static unsigned long segment_base(u16 selector)
-{
-	struct desc_struct *table;
-	unsigned long v;
-
-	if (!(selector & ~SEGMENT_RPL_MASK))
-		return 0;
-
-	table = get_current_gdt_ro();
-
-	if ((selector & SEGMENT_TI_MASK) == SEGMENT_LDT) {
-		u16 ldt_selector = kvm_read_ldt();
-
-		if (!(ldt_selector & ~SEGMENT_RPL_MASK))
-			return 0;
-
-		table = (struct desc_struct *)segment_base(ldt_selector);
-	}
-	v = get_desc_base(&table[selector >> 3]);
-	return v;
-}
-#endif
-
 static inline bool pt_can_write_msr(struct vcpu_vmx *vmx)
 {
 	return vmx_pt_mode_is_host_guest() &&
@@ -1282,9 +1247,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmcs_host_state *host_state;
-#ifdef CONFIG_X86_64
 	int cpu = raw_smp_processor_id();
-#endif
 	unsigned long fs_base, gs_base;
 	u16 fs_sel, gs_sel;
 	int i;
@@ -1320,7 +1283,6 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	 */
 	host_state->ldt_sel = kvm_read_ldt();
 
-#ifdef CONFIG_X86_64
 	savesegment(ds, host_state->ds_sel);
 	savesegment(es, host_state->es_sel);
 
@@ -1339,12 +1301,6 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	}
 
 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
-#else
-	savesegment(fs, fs_sel);
-	savesegment(gs, gs_sel);
-	fs_base = segment_base(fs_sel);
-	gs_base = segment_base(gs_sel);
-#endif
 
 	vmx_set_host_fs_gs(host_state, fs_sel, gs_sel, fs_base, gs_base);
 	vmx->guest_state_loaded = true;
@@ -1361,35 +1317,24 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
 
 	++vmx->vcpu.stat.host_state_reload;
 
-#ifdef CONFIG_X86_64
 	rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
-#endif
 	if (host_state->ldt_sel || (host_state->gs_sel & 7)) {
 		kvm_load_ldt(host_state->ldt_sel);
-#ifdef CONFIG_X86_64
 		load_gs_index(host_state->gs_sel);
-#else
-		loadsegment(gs, host_state->gs_sel);
-#endif
 	}
 	if (host_state->fs_sel & 7)
 		loadsegment(fs, host_state->fs_sel);
-#ifdef CONFIG_X86_64
 	if (unlikely(host_state->ds_sel | host_state->es_sel)) {
 		loadsegment(ds, host_state->ds_sel);
 		loadsegment(es, host_state->es_sel);
 	}
-#endif
 	invalidate_tss_limit();
-#ifdef CONFIG_X86_64
 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
-#endif
 	load_fixmap_gdt(raw_smp_processor_id());
 	vmx->guest_state_loaded = false;
 	vmx->guest_uret_msrs_loaded = false;
 }
 
-#ifdef CONFIG_X86_64
 static u64 vmx_read_guest_kernel_gs_base(struct vcpu_vmx *vmx)
 {
 	preempt_disable();
@@ -1407,7 +1352,6 @@ static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
 	preempt_enable();
 	vmx->msr_guest_kernel_gs_base = data;
 }
-#endif
 
 static void grow_ple_window(struct kvm_vcpu *vcpu)
 {
@@ -1498,7 +1442,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
 			    (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);
 		vmcs_writel(HOST_GDTR_BASE, (unsigned long)gdt);   /* 22.2.4 */
 
-		if (IS_ENABLED(CONFIG_IA32_EMULATION) || IS_ENABLED(CONFIG_X86_32)) {
+		if (IS_ENABLED(CONFIG_IA32_EMULATION)) {
 			/* 22.2.3 */
 			vmcs_writel(HOST_IA32_SYSENTER_ESP,
 				    (unsigned long)(cpu_entry_stack(cpu) + 1));
@@ -1750,7 +1694,6 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
 
 		orig_rip = kvm_rip_read(vcpu);
 		rip = orig_rip + instr_len;
-#ifdef CONFIG_X86_64
 		/*
 		 * We need to mask out the high 32 bits of RIP if not in 64-bit
 		 * mode, but just finding out that we are in 64-bit mode is
@@ -1758,7 +1701,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
 		 */
 		if (unlikely(((rip ^ orig_rip) >> 31) == 3) && !is_64_bit_mode(vcpu))
 			rip = (u32)rip;
-#endif
+
 		kvm_rip_write(vcpu, rip);
 	} else {
 		if (!kvm_emulate_instruction(vcpu, EMULTYPE_SKIP))
@@ -1891,7 +1834,6 @@ static void vmx_setup_uret_msr(struct vcpu_vmx *vmx, unsigned int msr,
  */
 static void vmx_setup_uret_msrs(struct vcpu_vmx *vmx)
 {
-#ifdef CONFIG_X86_64
 	bool load_syscall_msrs;
 
 	/*
@@ -1904,7 +1846,6 @@ static void vmx_setup_uret_msrs(struct vcpu_vmx *vmx)
 	vmx_setup_uret_msr(vmx, MSR_STAR, load_syscall_msrs);
 	vmx_setup_uret_msr(vmx, MSR_LSTAR, load_syscall_msrs);
 	vmx_setup_uret_msr(vmx, MSR_SYSCALL_MASK, load_syscall_msrs);
-#endif
 	vmx_setup_uret_msr(vmx, MSR_EFER, update_transition_efer(vmx));
 
 	vmx_setup_uret_msr(vmx, MSR_TSC_AUX,
@@ -2019,7 +1960,6 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	u32 index;
 
 	switch (msr_info->index) {
-#ifdef CONFIG_X86_64
 	case MSR_FS_BASE:
 		msr_info->data = vmcs_readl(GUEST_FS_BASE);
 		break;
@@ -2029,7 +1969,6 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KERNEL_GS_BASE:
 		msr_info->data = vmx_read_guest_kernel_gs_base(vmx);
 		break;
-#endif
 	case MSR_EFER:
 		return kvm_get_msr_common(vcpu, msr_info);
 	case MSR_IA32_TSX_CTRL:
@@ -2166,10 +2105,8 @@ int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 static u64 nested_vmx_truncate_sysenter_addr(struct kvm_vcpu *vcpu,
 						    u64 data)
 {
-#ifdef CONFIG_X86_64
 	if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
 		return (u32)data;
-#endif
 	return (unsigned long)data;
 }
 
@@ -2206,7 +2143,6 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_EFER:
 		ret = kvm_set_msr_common(vcpu, msr_info);
 		break;
-#ifdef CONFIG_X86_64
 	case MSR_FS_BASE:
 		vmx_segment_cache_clear(vmx);
 		vmcs_writel(GUEST_FS_BASE, data);
@@ -2236,7 +2172,6 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			vmx_update_exception_bitmap(vcpu);
 		}
 		break;
-#endif
 	case MSR_IA32_SYSENTER_CS:
 		if (is_guest_mode(vcpu))
 			get_vmcs12(vcpu)->guest_sysenter_cs = data;
@@ -2621,12 +2556,6 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	if (!IS_ENABLED(CONFIG_KVM_INTEL_PROVE_VE))
 		_cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
 
-#ifndef CONFIG_X86_64
-	if (!(_cpu_based_2nd_exec_control &
-				SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
-		_cpu_based_exec_control &= ~CPU_BASED_TPR_SHADOW;
-#endif
-
 	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
 		_cpu_based_2nd_exec_control &= ~(
 				SECONDARY_EXEC_APIC_REGISTER_VIRT |
@@ -2734,7 +2663,6 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	if (vmx_basic_vmcs_size(basic_msr) > PAGE_SIZE)
 		return -EIO;
 
-#ifdef CONFIG_X86_64
 	/*
 	 * KVM expects to be able to shove all legal physical addresses into
 	 * VMCS fields for 64-bit kernels, and per the SDM, "This bit is always
@@ -2742,7 +2670,6 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	 */
 	if (basic_msr & VMX_BASIC_32BIT_PHYS_ADDR_ONLY)
 		return -EIO;
-#endif
 
 	/* Require Write-Back (WB) memory type for VMCS accesses. */
 	if (vmx_basic_vmcs_mem_type(basic_msr) != X86_MEMTYPE_WB)
@@ -3149,22 +3076,15 @@ int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
 		return 0;
 
 	vcpu->arch.efer = efer;
-#ifdef CONFIG_X86_64
 	if (efer & EFER_LMA)
 		vm_entry_controls_setbit(vmx, VM_ENTRY_IA32E_MODE);
 	else
 		vm_entry_controls_clearbit(vmx, VM_ENTRY_IA32E_MODE);
-#else
-	if (KVM_BUG_ON(efer & EFER_LMA, vcpu->kvm))
-		return 1;
-#endif
 
 	vmx_setup_uret_msrs(vmx);
 	return 0;
 }
 
-#ifdef CONFIG_X86_64
-
 static void enter_lmode(struct kvm_vcpu *vcpu)
 {
 	u32 guest_tr_ar;
@@ -3187,8 +3107,6 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
 	vmx_set_efer(vcpu, vcpu->arch.efer & ~EFER_LMA);
 }
 
-#endif
-
 void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -3328,14 +3246,12 @@ void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	vcpu->arch.cr0 = cr0;
 	kvm_register_mark_available(vcpu, VCPU_EXREG_CR0);
 
-#ifdef CONFIG_X86_64
 	if (vcpu->arch.efer & EFER_LME) {
 		if (!old_cr0_pg && (cr0 & X86_CR0_PG))
 			enter_lmode(vcpu);
 		else if (old_cr0_pg && !(cr0 & X86_CR0_PG))
 			exit_lmode(vcpu);
 	}
-#endif
 
 	if (enable_ept && !enable_unrestricted_guest) {
 		/*
@@ -4342,7 +4258,6 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	vmx->loaded_vmcs->host_state.cr4 = cr4;
 
 	vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);  /* 22.2.4 */
-#ifdef CONFIG_X86_64
 	/*
 	 * Load null selectors, so we can avoid reloading them in
 	 * vmx_prepare_switch_to_host(), in case userspace uses
@@ -4350,10 +4265,6 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	 */
 	vmcs_write16(HOST_DS_SELECTOR, 0);
 	vmcs_write16(HOST_ES_SELECTOR, 0);
-#else
-	vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
-	vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
-#endif
 	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
 	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
 
@@ -4370,7 +4281,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	 * vmx_vcpu_load_vmcs loads it with the per-CPU entry stack (and may
 	 * have already done so!).
 	 */
-	if (!IS_ENABLED(CONFIG_IA32_EMULATION) && !IS_ENABLED(CONFIG_X86_32))
+	if (!IS_ENABLED(CONFIG_IA32_EMULATION))
 		vmcs_writel(HOST_IA32_SYSENTER_ESP, 0);
 
 	rdmsrl(MSR_IA32_SYSENTER_EIP, tmpl);
@@ -4504,14 +4415,13 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
 	if (!cpu_need_tpr_shadow(&vmx->vcpu))
 		exec_control &= ~CPU_BASED_TPR_SHADOW;
 
-#ifdef CONFIG_X86_64
 	if (exec_control & CPU_BASED_TPR_SHADOW)
 		exec_control &= ~(CPU_BASED_CR8_LOAD_EXITING |
 				  CPU_BASED_CR8_STORE_EXITING);
 	else
 		exec_control |= CPU_BASED_CR8_STORE_EXITING |
 				CPU_BASED_CR8_LOAD_EXITING;
-#endif
+
 	/* No need to intercept CR3 access or INVPLG when using EPT. */
 	if (enable_ept)
 		exec_control &= ~(CPU_BASED_CR3_LOAD_EXITING |
@@ -7449,19 +7359,6 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	if (vmx->host_debugctlmsr)
 		update_debugctlmsr(vmx->host_debugctlmsr);
 
-#ifndef CONFIG_X86_64
-	/*
-	 * The sysexit path does not restore ds/es, so we must set them to
-	 * a reasonable value ourselves.
-	 *
-	 * We can't defer this to vmx_prepare_switch_to_host() since that
-	 * function may be executed in interrupt context, which saves and
-	 * restore segments around it, nullifying its effect.
-	 */
-	loadsegment(ds, __USER_DS);
-	loadsegment(es, __USER_DS);
-#endif
-
 	pt_guest_exit(vmx);
 
 	kvm_load_host_xsave_state(vcpu);
@@ -7571,11 +7468,9 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 	bitmap_fill(vmx->shadow_msr_intercept.write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 
 	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_TSC, MSR_TYPE_R);
-#ifdef CONFIG_X86_64
 	vmx_disable_intercept_for_msr(vcpu, MSR_FS_BASE, MSR_TYPE_RW);
 	vmx_disable_intercept_for_msr(vcpu, MSR_GS_BASE, MSR_TYPE_RW);
 	vmx_disable_intercept_for_msr(vcpu, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
-#endif
 	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW);
 	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW);
 	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW);
@@ -8099,7 +7994,6 @@ int vmx_check_intercept(struct kvm_vcpu *vcpu,
 	return X86EMUL_UNHANDLEABLE;
 }
 
-#ifdef CONFIG_X86_64
 /* (a << shift) / divisor, return 1 if overflow otherwise 0 */
 static inline int u64_shl_div_u64(u64 a, unsigned int shift,
 				  u64 divisor, u64 *result)
@@ -8162,7 +8056,6 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
 {
 	to_vmx(vcpu)->hv_deadline_tsc = -1;
 }
-#endif
 
 void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 {
@@ -8356,9 +8249,7 @@ static __init void vmx_setup_user_return_msrs(void)
 	 * into hardware and is here purely for emulation purposes.
 	 */
 	const u32 vmx_uret_msrs_list[] = {
-	#ifdef CONFIG_X86_64
 		MSR_SYSCALL_MASK, MSR_LSTAR, MSR_CSTAR,
-	#endif
 		MSR_EFER, MSR_TSC_AUX, MSR_STAR,
 		MSR_IA32_TSX_CTRL,
 	};
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 43f573f6ca46..ba9428728f99 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -19,11 +19,7 @@
 
 #define X2APIC_MSR(r) (APIC_BASE_MSR + ((r) >> 4))
 
-#ifdef CONFIG_X86_64
 #define MAX_NR_USER_RETURN_MSRS	7
-#else
-#define MAX_NR_USER_RETURN_MSRS	4
-#endif
 
 #define MAX_NR_LOADSTORE_MSRS	8
 
@@ -272,10 +268,8 @@ struct vcpu_vmx {
 	 */
 	struct vmx_uret_msr   guest_uret_msrs[MAX_NR_USER_RETURN_MSRS];
 	bool                  guest_uret_msrs_loaded;
-#ifdef CONFIG_X86_64
 	u64		      msr_host_kernel_gs_base;
 	u64		      msr_guest_kernel_gs_base;
-#endif
 
 	u64		      spec_ctrl;
 	u32		      msr_ia32_umwait_control;
@@ -470,14 +464,10 @@ static inline u8 vmx_get_rvi(void)
 
 #define __KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS				\
 	(VM_ENTRY_LOAD_DEBUG_CONTROLS)
-#ifdef CONFIG_X86_64
 	#define KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS			\
 		(__KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS |			\
 		 VM_ENTRY_IA32E_MODE)
-#else
-	#define KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS			\
-		__KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS
-#endif
+
 #define KVM_OPTIONAL_VMX_VM_ENTRY_CONTROLS				\
 	(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |				\
 	 VM_ENTRY_LOAD_IA32_PAT |					\
@@ -489,14 +479,10 @@ static inline u8 vmx_get_rvi(void)
 #define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS				\
 	(VM_EXIT_SAVE_DEBUG_CONTROLS |					\
 	 VM_EXIT_ACK_INTR_ON_EXIT)
-#ifdef CONFIG_X86_64
 	#define KVM_REQUIRED_VMX_VM_EXIT_CONTROLS			\
 		(__KVM_REQUIRED_VMX_VM_EXIT_CONTROLS |			\
 		 VM_EXIT_HOST_ADDR_SPACE_SIZE)
-#else
-	#define KVM_REQUIRED_VMX_VM_EXIT_CONTROLS			\
-		__KVM_REQUIRED_VMX_VM_EXIT_CONTROLS
-#endif
+
 #define KVM_OPTIONAL_VMX_VM_EXIT_CONTROLS				\
 	      (VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |			\
 	       VM_EXIT_SAVE_IA32_PAT |					\
@@ -529,15 +515,10 @@ static inline u8 vmx_get_rvi(void)
 	 CPU_BASED_RDPMC_EXITING |					\
 	 CPU_BASED_INTR_WINDOW_EXITING)
 
-#ifdef CONFIG_X86_64
 	#define KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL		\
 		(__KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL |		\
 		 CPU_BASED_CR8_LOAD_EXITING |				\
 		 CPU_BASED_CR8_STORE_EXITING)
-#else
-	#define KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL		\
-		__KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL
-#endif
 
 #define KVM_OPTIONAL_VMX_CPU_BASED_VM_EXEC_CONTROL			\
 	(CPU_BASED_RDTSC_EXITING |					\
diff --git a/arch/x86/kvm/vmx/vmx_ops.h b/arch/x86/kvm/vmx/vmx_ops.h
index 633c87e2fd92..72031b669925 100644
--- a/arch/x86/kvm/vmx/vmx_ops.h
+++ b/arch/x86/kvm/vmx/vmx_ops.h
@@ -171,11 +171,7 @@ static __always_inline u64 vmcs_read64(unsigned long field)
 	vmcs_check64(field);
 	if (kvm_is_using_evmcs())
 		return evmcs_read64(field);
-#ifdef CONFIG_X86_64
 	return __vmcs_readl(field);
-#else
-	return __vmcs_readl(field) | ((u64)__vmcs_readl(field+1) << 32);
-#endif
 }
 
 static __always_inline unsigned long vmcs_readl(unsigned long field)
@@ -250,9 +246,6 @@ static __always_inline void vmcs_write64(unsigned long field, u64 value)
 		return evmcs_write64(field, value);
 
 	__vmcs_writel(field, value);
-#ifndef CONFIG_X86_64
-	__vmcs_writel(field+1, value >> 32);
-#endif
 }
 
 static __always_inline void vmcs_writel(unsigned long field, unsigned long value)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index a55981c5216e..8573f1e401be 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -111,11 +111,9 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
 void vmx_write_tsc_offset(struct kvm_vcpu *vcpu);
 void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu);
 void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
-#ifdef CONFIG_X86_64
 int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 		     bool *expired);
 void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
-#endif
 void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2e713480933a..b776e697c0d9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -112,12 +112,8 @@ EXPORT_SYMBOL_GPL(kvm_host);
  * - enable syscall per default because its emulated by KVM
  * - enable LME and LMA per default on 64 bit KVM
  */
-#ifdef CONFIG_X86_64
 static
 u64 __read_mostly efer_reserved_bits = ~((u64)(EFER_SCE | EFER_LME | EFER_LMA));
-#else
-static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
-#endif
 
 static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
 
@@ -318,9 +314,7 @@ static struct kmem_cache *x86_emulator_cache;
 static const u32 msrs_to_save_base[] = {
 	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 	MSR_STAR,
-#ifdef CONFIG_X86_64
 	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
-#endif
 	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
 	MSR_IA32_FEAT_CTL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
 	MSR_IA32_SPEC_CTRL, MSR_IA32_TSX_CTRL,
@@ -1071,10 +1065,8 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
 
 static bool kvm_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 {
-#ifdef CONFIG_X86_64
 	if (cr0 & 0xffffffff00000000UL)
 		return false;
-#endif
 
 	if ((cr0 & X86_CR0_NW) && !(cr0 & X86_CR0_CD))
 		return false;
@@ -1134,7 +1126,6 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	/* Write to CR0 reserved bits are ignored, even on Intel. */
 	cr0 &= ~CR0_RESERVED_BITS;
 
-#ifdef CONFIG_X86_64
 	if ((vcpu->arch.efer & EFER_LME) && !is_paging(vcpu) &&
 	    (cr0 & X86_CR0_PG)) {
 		int cs_db, cs_l;
@@ -1145,7 +1136,7 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 		if (cs_l)
 			return 1;
 	}
-#endif
+
 	if (!(vcpu->arch.efer & EFER_LME) && (cr0 & X86_CR0_PG) &&
 	    is_pae(vcpu) && ((cr0 ^ old_cr0) & X86_CR0_PDPTR_BITS) &&
 	    !load_pdptrs(vcpu, kvm_read_cr3(vcpu)))
@@ -1218,12 +1209,10 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_load_host_xsave_state);
 
-#ifdef CONFIG_X86_64
 static inline u64 kvm_guest_supported_xfd(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.guest_supported_xcr0 & XFEATURE_MASK_USER_DYNAMIC;
 }
-#endif
 
 static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
@@ -1421,13 +1410,12 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
 	bool skip_tlb_flush = false;
 	unsigned long pcid = 0;
-#ifdef CONFIG_X86_64
+
 	if (kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)) {
 		skip_tlb_flush = cr3 & X86_CR3_PCID_NOFLUSH;
 		cr3 &= ~X86_CR3_PCID_NOFLUSH;
 		pcid = cr3 & X86_CR3_PCID_MASK;
 	}
-#endif
 
 	/* PDPTRs are always reloaded for PAE paging. */
 	if (cr3 == kvm_read_cr3(vcpu) && !is_pae_paging(vcpu))
@@ -2216,7 +2204,6 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
 	return kvm_set_msr_ignored_check(vcpu, index, *data, true);
 }
 
-#ifdef CONFIG_X86_64
 struct pvclock_clock {
 	int vclock_mode;
 	u64 cycle_last;
@@ -2274,13 +2261,6 @@ static s64 get_kvmclock_base_ns(void)
 	/* Count up from boot time, but with the frequency of the raw clock.  */
 	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
 }
-#else
-static s64 get_kvmclock_base_ns(void)
-{
-	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
-	return ktime_get_boottime_ns();
-}
-#endif
 
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_ofs)
 {
@@ -2382,9 +2362,7 @@ static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
 	*pmultiplier = div_frac(scaled64, tps32);
 }
 
-#ifdef CONFIG_X86_64
 static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
-#endif
 
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 static unsigned long max_tsc_khz;
@@ -2477,16 +2455,13 @@ static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 	return tsc;
 }
 
-#ifdef CONFIG_X86_64
 static inline bool gtod_is_based_on_tsc(int mode)
 {
 	return mode == VDSO_CLOCKMODE_TSC || mode == VDSO_CLOCKMODE_HVCLOCK;
 }
-#endif
 
 static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 {
-#ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &vcpu->kvm->arch;
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 
@@ -2512,7 +2487,6 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
 			    atomic_read(&vcpu->kvm->online_vcpus),
 		            ka->use_master_clock, gtod->clock.vclock_mode);
-#endif
 }
 
 /*
@@ -2623,14 +2597,13 @@ static void kvm_vcpu_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 l1_multipli
 
 static inline bool kvm_check_tsc_unstable(void)
 {
-#ifdef CONFIG_X86_64
 	/*
 	 * TSC is marked unstable when we're running on Hyper-V,
 	 * 'TSC page' clocksource is good.
 	 */
 	if (pvclock_gtod_data.clock.vclock_mode == VDSO_CLOCKMODE_HVCLOCK)
 		return false;
-#endif
+
 	return check_tsc_unstable();
 }
 
@@ -2772,8 +2745,6 @@ static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
 	adjust_tsc_offset_guest(vcpu, adjustment);
 }
 
-#ifdef CONFIG_X86_64
-
 static u64 read_tsc(void)
 {
 	u64 ret = (u64)rdtsc_ordered();
@@ -2941,7 +2912,6 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 
 	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }
-#endif
 
 /*
  *
@@ -2986,7 +2956,6 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 
 static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 {
-#ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &kvm->arch;
 	int vclock_mode;
 	bool host_tsc_clocksource, vcpus_matched;
@@ -3013,7 +2982,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
 	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
 					vcpus_matched);
-#endif
 }
 
 static void kvm_make_mclock_inprogress_request(struct kvm *kvm)
@@ -3087,15 +3055,13 @@ static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 	data->flags = 0;
 	if (ka->use_master_clock &&
 	    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
-#ifdef CONFIG_X86_64
 		struct timespec64 ts;
 
 		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
 			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
 			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
 		} else
-#endif
-		data->host_tsc = rdtsc();
+			data->host_tsc = rdtsc();
 
 		data->flags |= KVM_CLOCK_TSC_STABLE;
 		hv_clock.tsc_timestamp = ka->master_cycle_now;
@@ -3317,7 +3283,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
  */
 uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 {
-#ifdef CONFIG_X86_64
 	struct pvclock_vcpu_time_info hv_clock;
 	struct kvm_arch *ka = &kvm->arch;
 	unsigned long seq, local_tsc_khz;
@@ -3368,7 +3333,6 @@ uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 		return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
 			__pvclock_read_cycles(&hv_clock, host_tsc);
 	}
-#endif
 	return ktime_get_real_ns() - get_kvmclock_ns(kvm);
 }
 
@@ -4098,7 +4062,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;
 		vcpu->arch.msr_misc_features_enables = data;
 		break;
-#ifdef CONFIG_X86_64
 	case MSR_IA32_XFD:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
@@ -4119,7 +4082,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 		vcpu->arch.guest_fpu.xfd_err = data;
 		break;
-#endif
 	default:
 		if (kvm_pmu_is_valid_msr(vcpu, msr))
 			return kvm_pmu_set_msr(vcpu, msr_info);
@@ -4453,7 +4415,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_K7_HWCR:
 		msr_info->data = vcpu->arch.msr_hwcr;
 		break;
-#ifdef CONFIG_X86_64
 	case MSR_IA32_XFD:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
@@ -4468,7 +4429,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 		msr_info->data = vcpu->arch.guest_fpu.xfd_err;
 		break;
-#endif
 	default:
 		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
 			return kvm_pmu_get_msr(vcpu, msr_info);
@@ -8380,10 +8340,8 @@ static bool emulator_get_segment(struct x86_emulate_ctxt *ctxt, u16 *selector,
 		var.limit >>= 12;
 	set_desc_limit(desc, var.limit);
 	set_desc_base(desc, (unsigned long)var.base);
-#ifdef CONFIG_X86_64
 	if (base3)
 		*base3 = var.base >> 32;
-#endif
 	desc->type = var.type;
 	desc->s = var.s;
 	desc->dpl = var.dpl;
@@ -8405,9 +8363,7 @@ static void emulator_set_segment(struct x86_emulate_ctxt *ctxt, u16 selector,
 
 	var.selector = selector;
 	var.base = get_desc_base(desc);
-#ifdef CONFIG_X86_64
 	var.base |= ((u64)base3) << 32;
-#endif
 	var.limit = get_desc_limit(desc);
 	if (desc->g)
 		var.limit = (var.limit << 12) | 0xfff;
@@ -9400,7 +9356,6 @@ static void tsc_khz_changed(void *data)
 	__this_cpu_write(cpu_tsc_khz, khz);
 }
 
-#ifdef CONFIG_X86_64
 static void kvm_hyperv_tsc_notifier(void)
 {
 	struct kvm *kvm;
@@ -9428,7 +9383,6 @@ static void kvm_hyperv_tsc_notifier(void)
 
 	mutex_unlock(&kvm_lock);
 }
-#endif
 
 static void __kvmclock_cpufreq_notifier(struct cpufreq_freqs *freq, int cpu)
 {
@@ -9560,7 +9514,6 @@ static void kvm_timer_init(void)
 	}
 }
 
-#ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
 	struct kvm *kvm;
@@ -9614,7 +9567,6 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 static struct notifier_block pvclock_gtod_notifier = {
 	.notifier_call = pvclock_gtod_notify,
 };
-#endif
 
 static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
 {
@@ -9758,12 +9710,10 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 
 	if (pi_inject_timer == -1)
 		pi_inject_timer = housekeeping_enabled(HK_TYPE_TIMER);
-#ifdef CONFIG_X86_64
 	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
 
 	if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
 		set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
-#endif
 
 	kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
 
@@ -9809,10 +9759,9 @@ void kvm_x86_vendor_exit(void)
 {
 	kvm_unregister_perf_callbacks();
 
-#ifdef CONFIG_X86_64
 	if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
 		clear_hv_tscchange_cb();
-#endif
+
 	kvm_lapic_exit();
 
 	if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
@@ -9820,11 +9769,10 @@ void kvm_x86_vendor_exit(void)
 					    CPUFREQ_TRANSITION_NOTIFIER);
 		cpuhp_remove_state_nocalls(CPUHP_AP_X86_KVM_CLK_ONLINE);
 	}
-#ifdef CONFIG_X86_64
+
 	pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
 	irq_work_sync(&pvclock_irq_work);
 	cancel_work_sync(&pvclock_gtod_work);
-#endif
 	kvm_x86_call(hardware_unsetup)();
 	kvm_mmu_vendor_module_exit();
 	free_percpu(user_return_msrs);
@@ -9839,7 +9787,6 @@ void kvm_x86_vendor_exit(void)
 }
 EXPORT_SYMBOL_GPL(kvm_x86_vendor_exit);
 
-#ifdef CONFIG_X86_64
 static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
 			        unsigned long clock_type)
 {
@@ -9874,7 +9821,6 @@ static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
 
 	return ret;
 }
-#endif
 
 /*
  * kvm_pv_kick_cpu_op:  Kick a vcpu.
@@ -10019,11 +9965,9 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 		kvm_sched_yield(vcpu, a1);
 		ret = 0;
 		break;
-#ifdef CONFIG_X86_64
 	case KVM_HC_CLOCK_PAIRING:
 		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
 		break;
-#endif
 	case KVM_HC_SEND_IPI:
 		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SEND_IPI))
 			break;
@@ -11592,7 +11536,6 @@ static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	regs->rdi = kvm_rdi_read(vcpu);
 	regs->rsp = kvm_rsp_read(vcpu);
 	regs->rbp = kvm_rbp_read(vcpu);
-#ifdef CONFIG_X86_64
 	regs->r8 = kvm_r8_read(vcpu);
 	regs->r9 = kvm_r9_read(vcpu);
 	regs->r10 = kvm_r10_read(vcpu);
@@ -11601,8 +11544,6 @@ static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	regs->r13 = kvm_r13_read(vcpu);
 	regs->r14 = kvm_r14_read(vcpu);
 	regs->r15 = kvm_r15_read(vcpu);
-#endif
-
 	regs->rip = kvm_rip_read(vcpu);
 	regs->rflags = kvm_get_rflags(vcpu);
 }
@@ -11632,7 +11573,6 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	kvm_rdi_write(vcpu, regs->rdi);
 	kvm_rsp_write(vcpu, regs->rsp);
 	kvm_rbp_write(vcpu, regs->rbp);
-#ifdef CONFIG_X86_64
 	kvm_r8_write(vcpu, regs->r8);
 	kvm_r9_write(vcpu, regs->r9);
 	kvm_r10_write(vcpu, regs->r10);
@@ -11641,8 +11581,6 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	kvm_r13_write(vcpu, regs->r13);
 	kvm_r14_write(vcpu, regs->r14);
 	kvm_r15_write(vcpu, regs->r15);
-#endif
-
 	kvm_rip_write(vcpu, regs->rip);
 	kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index ec623d23d13d..0b2e03f083a7 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -166,11 +166,7 @@ static inline bool is_protmode(struct kvm_vcpu *vcpu)
 
 static inline bool is_long_mode(struct kvm_vcpu *vcpu)
 {
-#ifdef CONFIG_X86_64
 	return !!(vcpu->arch.efer & EFER_LMA);
-#else
-	return false;
-#endif
 }
 
 static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index a909b817b9c0..9f6115da02ea 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -67,19 +67,16 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	BUILD_BUG_ON(offsetof(struct compat_shared_info, arch.wc_sec_hi) != 0x924);
 	BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);
 
-#ifdef CONFIG_X86_64
 	/* Paranoia checks on the 64-bit struct layout */
 	BUILD_BUG_ON(offsetof(struct shared_info, wc) != 0xc00);
 	BUILD_BUG_ON(offsetof(struct shared_info, wc_sec_hi) != 0xc0c);
 
-	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
+	if (kvm->arch.xen.long_mode) {
 		struct shared_info *shinfo = gpc->khva;
 
 		wc_sec_hi = &shinfo->wc_sec_hi;
 		wc = &shinfo->wc;
-	} else
-#endif
-	{
+	} else {
 		struct compat_shared_info *shinfo = gpc->khva;
 
 		wc_sec_hi = &shinfo->arch.wc_sec_hi;
@@ -177,8 +174,7 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
 	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
 		uint64_t host_tsc, guest_tsc;
 
-		if (!IS_ENABLED(CONFIG_64BIT) ||
-		    !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
+		if (!kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
 			/*
 			 * Don't fall back to get_kvmclock_ns() because it's
 			 * broken; it has a systemic error in its results
@@ -288,7 +284,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
 	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0);
 	BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
-#ifdef CONFIG_X86_64
 	/*
 	 * The 64-bit structure has 4 bytes of padding before 'state_entry_time'
 	 * so each subsequent field is shifted by 4, and it's 4 bytes longer.
@@ -298,7 +293,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
 		     offsetof(struct compat_vcpu_runstate_info, time) + 4);
 	BUILD_BUG_ON(sizeof(struct vcpu_runstate_info) != 0x2c + 4);
-#endif
 	/*
 	 * The state field is in the same place at the start of both structs,
 	 * and is the same size (int) as vx->current_runstate.
@@ -335,7 +329,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
 		     sizeof(vx->runstate_times));
 
-	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
+	if (v->kvm->arch.xen.long_mode) {
 		user_len = sizeof(struct vcpu_runstate_info);
 		times_ofs = offsetof(struct vcpu_runstate_info,
 				     state_entry_time);
@@ -472,13 +466,11 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 					sizeof(uint64_t) - 1 - user_len1;
 		}
 
-#ifdef CONFIG_X86_64
 		/*
 		 * Don't leak kernel memory through the padding in the 64-bit
 		 * version of the struct.
 		 */
 		memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
-#endif
 	}
 
 	/*
@@ -606,7 +598,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 	}
 
 	/* Now gpc->khva is a valid kernel address for the vcpu_info */
-	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
+	if (v->kvm->arch.xen.long_mode) {
 		struct vcpu_info *vi = gpc->khva;
 
 		asm volatile(LOCK_PREFIX "orq %0, %1\n"
@@ -695,22 +687,18 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 
 	switch (data->type) {
 	case KVM_XEN_ATTR_TYPE_LONG_MODE:
-		if (!IS_ENABLED(CONFIG_64BIT) && data->u.long_mode) {
-			r = -EINVAL;
-		} else {
-			mutex_lock(&kvm->arch.xen.xen_lock);
-			kvm->arch.xen.long_mode = !!data->u.long_mode;
+		mutex_lock(&kvm->arch.xen.xen_lock);
+		kvm->arch.xen.long_mode = !!data->u.long_mode;
 
-			/*
-			 * Re-initialize shared_info to put the wallclock in the
-			 * correct place. Whilst it's not necessary to do this
-			 * unless the mode is actually changed, it does no harm
-			 * to make the call anyway.
-			 */
-			r = kvm->arch.xen.shinfo_cache.active ?
-				kvm_xen_shared_info_init(kvm) : 0;
-			mutex_unlock(&kvm->arch.xen.xen_lock);
-		}
+		/*
+		 * Re-initialize shared_info to put the wallclock in the
+		 * correct place. Whilst it's not necessary to do this
+		 * unless the mode is actually changed, it does no harm
+		 * to make the call anyway.
+		 */
+		r = kvm->arch.xen.shinfo_cache.active ?
+			kvm_xen_shared_info_init(kvm) : 0;
+		mutex_unlock(&kvm->arch.xen.xen_lock);
 		break;
 
 	case KVM_XEN_ATTR_TYPE_SHARED_INFO:
@@ -923,7 +911,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		 * address, that's actually OK. kvm_xen_update_runstate_guest()
 		 * will cope.
 		 */
-		if (IS_ENABLED(CONFIG_64BIT) && vcpu->kvm->arch.xen.long_mode)
+		if (vcpu->kvm->arch.xen.long_mode)
 			sz = sizeof(struct vcpu_runstate_info);
 		else
 			sz = sizeof(struct compat_vcpu_runstate_info);
@@ -1360,7 +1348,7 @@ static int kvm_xen_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
 
 static inline int max_evtchn_port(struct kvm *kvm)
 {
-	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode)
+	if (kvm->arch.xen.long_mode)
 		return EVTCHN_2L_NR_CHANNELS;
 	else
 		return COMPAT_EVTCHN_2L_NR_CHANNELS;
@@ -1382,7 +1370,7 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
 		goto out_rcu;
 
 	ret = false;
-	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
+	if (kvm->arch.xen.long_mode) {
 		struct shared_info *shinfo = gpc->khva;
 		pending_bits = (unsigned long *)&shinfo->evtchn_pending;
 	} else {
@@ -1416,7 +1404,7 @@ static bool kvm_xen_schedop_poll(struct kvm_vcpu *vcpu, bool longmode,
 	    !(vcpu->kvm->arch.xen_hvm_config.flags & KVM_XEN_HVM_CONFIG_EVTCHN_SEND))
 		return false;
 
-	if (IS_ENABLED(CONFIG_64BIT) && !longmode) {
+	if (!longmode) {
 		struct compat_sched_poll sp32;
 
 		/* Sanity check that the compat struct definition is correct */
@@ -1629,9 +1617,7 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
 		params[3] = (u32)kvm_rsi_read(vcpu);
 		params[4] = (u32)kvm_rdi_read(vcpu);
 		params[5] = (u32)kvm_rbp_read(vcpu);
-	}
-#ifdef CONFIG_X86_64
-	else {
+	} else {
 		params[0] = (u64)kvm_rdi_read(vcpu);
 		params[1] = (u64)kvm_rsi_read(vcpu);
 		params[2] = (u64)kvm_rdx_read(vcpu);
@@ -1639,7 +1625,6 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
 		params[4] = (u64)kvm_r8_read(vcpu);
 		params[5] = (u64)kvm_r9_read(vcpu);
 	}
-#endif
 	cpl = kvm_x86_call(get_cpl)(vcpu);
 	trace_kvm_xen_hypercall(cpl, input, params[0], params[1], params[2],
 				params[3], params[4], params[5]);
@@ -1756,7 +1741,7 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 	if (!kvm_gpc_check(gpc, PAGE_SIZE))
 		goto out_rcu;
 
-	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
+	if (kvm->arch.xen.long_mode) {
 		struct shared_info *shinfo = gpc->khva;
 		pending_bits = (unsigned long *)&shinfo->evtchn_pending;
 		mask_bits = (unsigned long *)&shinfo->evtchn_mask;
@@ -1797,7 +1782,7 @@ int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 			goto out_rcu;
 		}
 
-		if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
+		if (kvm->arch.xen.long_mode) {
 			struct vcpu_info *vcpu_info = gpc->khva;
 			if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) {
 				WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/11] x86: Kconfig.cpu: split out 64-bit atom
  2024-12-04 10:30 ` [PATCH 03/11] x86: Kconfig.cpu: split out 64-bit atom Arnd Bergmann
@ 2024-12-04 13:16   ` Thomas Gleixner
  2024-12-04 15:55     ` H. Peter Anvin
  0 siblings, 1 reply; 78+ messages in thread
From: Thomas Gleixner @ 2024-12-04 13:16 UTC (permalink / raw)
  To: Arnd Bergmann, linux-kernel, x86
  Cc: Arnd Bergmann, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 04 2024 at 11:30, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
>
> Both 32-bit and 64-bit builds allow optimizing using "-march=atom", but
> this is somewhat suboptimal, as gcc and clang use this option to refer
> to the original in-order "Bonnell" microarchitecture used in the early
> "Diamondville" and "Silverthorne" processors that were mostly 32-bit only.
>
> The later 22nm "Silvermont" architecture saw a significant redesign to
> an out-of-order architecture that is reflected in the -mtune=silvermont
> flag in the compilers, and all of these are 64-bit capable.

In theory. There are quite a few crippled variants of Silvermont which
are 32-bit only (either fused off or at least officially not supported
to run 64-bit)...


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 10:30 ` [PATCH 05/11] x86: remove HIGHMEM64G support Arnd Bergmann
@ 2024-12-04 13:29   ` Brian Gerst
  2024-12-04 13:43     ` Arnd Bergmann
  2024-12-04 16:37     ` H. Peter Anvin
  2025-04-11 23:44   ` Dave Hansen
  1 sibling, 2 replies; 78+ messages in thread
From: Brian Gerst @ 2024-12-04 13:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024 at 5:34 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> From: Arnd Bergmann <arnd@arndb.de>
>
> The HIGHMEM64G support was added in linux-2.3.25 to support (then)
> high-end Pentium Pro and Pentium III Xeon servers with more than 4GB of
> addressing, as NUMA and PCI-X slots started appearing.
>
> I have found no evidence of this ever being used in regular dual-socket
> servers or consumer devices, all the users seem obsolete these days,
> even by i386 standards:
>
>  - Support for NUMA servers (NUMA-Q, IBM x440, Unisys) was already
>    removed ten years ago.
>
>  - 4+ socket non-NUMA servers based on Intel 450GX/450NX, HP F8 and
>    ServerWorks ServerSet/GrandChampion could theoretically still work
>    with 8GB, but these were exceptionally rare even 20 years ago and
>    would usually have been equipped with less than the maximum amount of
>    RAM.
>
>  - Some SKUs of the Celeron D from 2004 had 64-bit mode fused off but
>    could still work in a Socket 775 mainboard designed for the later
>    Core 2 Duo and 8GB. Apparently most BIOSes at the time only allowed
>    64-bit CPUs.
>
>  - In the early days of x86-64 hardware, there was sometimes the need
>    to run a 32-bit kernel to work around bugs in the hardware drivers,
>    or in the syscall emulation for 32-bit userspace. This likely still
>    works but there should never be a need for this any more.
>
> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
> PAE mode is still required to get access to the 'NX' bit on Atom,
> 'Pentium M' and 'Core Duo' CPUs.

8GB of memory is still useful for 32-bit guest VMs.


Brian Gerst

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 13:29   ` Brian Gerst
@ 2024-12-04 13:43     ` Arnd Bergmann
  2024-12-04 14:02       ` Brian Gerst
  2024-12-04 15:53       ` H. Peter Anvin
  2024-12-04 16:37     ` H. Peter Anvin
  1 sibling, 2 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 13:43 UTC (permalink / raw)
  To: Brian Gerst, Arnd Bergmann
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Wed, Dec 4, 2024, at 14:29, Brian Gerst wrote:
> On Wed, Dec 4, 2024 at 5:34 AM Arnd Bergmann <arnd@kernel.org> wrote:
>>
>>  - In the early days of x86-64 hardware, there was sometimes the need
>>    to run a 32-bit kernel to work around bugs in the hardware drivers,
>>    or in the syscall emulation for 32-bit userspace. This likely still
>>    works but there should never be a need for this any more.
>>
>> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
>> PAE mode is still required to get access to the 'NX' bit on Atom,
>> 'Pentium M' and 'Core Duo' CPUs.
>
> 8GB of memory is still useful for 32-bit guest VMs.

Can you give some more background on this?

It's clear that one can run a virtual machine this way and it
currently works, but are you able to construct a case where this
is a good idea, compared to running the same userspace with a
64-bit kernel?

From what I can tell, any practical workload that requires
8GB of total RAM will likely run into either the lowmem
limits or into virtual addressing limits, in addition to the
problems of 32-bit kernels being generally worse than 64-bit
ones in terms of performance, features and testing.

      Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 13:43     ` Arnd Bergmann
@ 2024-12-04 14:02       ` Brian Gerst
  2024-12-04 15:00         ` Brian Gerst
  2024-12-04 15:58         ` H. Peter Anvin
  2024-12-04 15:53       ` H. Peter Anvin
  1 sibling, 2 replies; 78+ messages in thread
From: Brian Gerst @ 2024-12-04 14:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Arnd Bergmann, linux-kernel, x86, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024 at 8:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Wed, Dec 4, 2024, at 14:29, Brian Gerst wrote:
> > On Wed, Dec 4, 2024 at 5:34 AM Arnd Bergmann <arnd@kernel.org> wrote:
> >>
> >>  - In the early days of x86-64 hardware, there was sometimes the need
> >>    to run a 32-bit kernel to work around bugs in the hardware drivers,
> >>    or in the syscall emulation for 32-bit userspace. This likely still
> >>    works but there should never be a need for this any more.
> >>
> >> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
> >> PAE mode is still required to get access to the 'NX' bit on Atom,
> >> 'Pentium M' and 'Core Duo' CPUs.
> >
> > 8GB of memory is still useful for 32-bit guest VMs.
>
> Can you give some more background on this?
>
> It's clear that one can run a virtual machine this way and it
> currently works, but are you able to construct a case where this
> is a good idea, compared to running the same userspace with a
> 64-bit kernel?
>
> From what I can tell, any practical workload that requires
> 8GB of total RAM will likely run into either the lowmem
> limits or into virtual addressing limits, in addition to the
> problems of 32-bit kernels being generally worse than 64-bit
> ones in terms of performance, features and testing.

I use a 32-bit VM to test 32-bit kernel builds.  I haven't benchmarked
kernel builds with 4GB/8GB yet, but logically more memory would be
better for caching files.


Brian Gerst

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 14:02       ` Brian Gerst
@ 2024-12-04 15:00         ` Brian Gerst
  2024-12-04 15:58         ` H. Peter Anvin
  1 sibling, 0 replies; 78+ messages in thread
From: Brian Gerst @ 2024-12-04 15:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Arnd Bergmann, linux-kernel, x86, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024 at 9:02 AM Brian Gerst <brgerst@gmail.com> wrote:
>
> On Wed, Dec 4, 2024 at 8:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Wed, Dec 4, 2024, at 14:29, Brian Gerst wrote:
> > > On Wed, Dec 4, 2024 at 5:34 AM Arnd Bergmann <arnd@kernel.org> wrote:
> > >>
> > >>  - In the early days of x86-64 hardware, there was sometimes the need
> > >>    to run a 32-bit kernel to work around bugs in the hardware drivers,
> > >>    or in the syscall emulation for 32-bit userspace. This likely still
> > >>    works but there should never be a need for this any more.
> > >>
> > >> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
> > >> PAE mode is still required to get access to the 'NX' bit on Atom,
> > >> 'Pentium M' and 'Core Duo' CPUs.
> > >
> > > 8GB of memory is still useful for 32-bit guest VMs.
> >
> > Can you give some more background on this?
> >
> > It's clear that one can run a virtual machine this way and it
> > currently works, but are you able to construct a case where this
> > is a good idea, compared to running the same userspace with a
> > 64-bit kernel?
> >
> > From what I can tell, any practical workload that requires
> > 8GB of total RAM will likely run into either the lowmem
> > limits or into virtual addressing limits, in addition to the
> > problems of 32-bit kernels being generally worse than 64-bit
> > ones in terms of performance, features and testing.
>
> I use a 32-bit VM to test 32-bit kernel builds.  I haven't benchmarked
> kernel builds with 4GB/8GB yet, but logically more memory would be
> better for caching files.
>
>
> Brian Gerst

After verifying, I only had the VM set to 4GB and CONFIG_HIGHMEM64G
was not set.  So I have no issue with this.


Brian Gerst

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/11] x86: drop 32-bit KVM host support
  2024-12-04 10:30 ` [PATCH 11/11] x86: drop 32-bit KVM host support Arnd Bergmann
@ 2024-12-04 15:30   ` Sean Christopherson
  2024-12-04 16:33     ` Arnd Bergmann
  0 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2024-12-04 15:30 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Davide Ciminaghi, Paolo Bonzini,
	kvm

On Wed, Dec 04, 2024, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> There are very few 32-bit machines that support KVM, the main exceptions
> are the "Yonah" Generation Xeon-LV and Core Duo from 2006 and the Atom
> Z5xx "Silverthorne" from 2008 that were all released just before their
> 64-bit counterparts.
> 
> Using KVM as a host on a 64-bit CPU using a 32-bit kernel generally
> works fine, but is rather pointless since 64-bit kernels are much better
> supported and deal better with the memory requirements of VM guests.
> 
> Drop all the 32-bit-only portions and the "#ifdef CONFIG_X86_64" checks
> of the x86 KVM code and add a Kconfig dependency to only allow building
> this on 64-bit kernels.

While 32-bit KVM doesn't need to be a thing for x86 usage, Paolo expressed concerns
that dropping 32-bit support on x86 would cause general 32-bit KVM support to
bitrot horribly.  32-bit x86 doesn't get much testing, but I do at least boot VMs
with it on a semi-regular basis.  I don't think we can say the same for other
architectures with 32-bit variants.

PPC apparently still has 32-bit users[1][2], and 32-bit RISC-V is a thing, so I
think we're stuck with 32-bit x86 for the time being. :-(

[1] https://lore.kernel.org/all/87zg4aveow.fsf@mail.lhotse
[2] https://lore.kernel.org/all/fc43f9eb-a60f-5c4a-a694-83029234a9c4@xenosoft.de

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 10:30 ` [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags Arnd Bergmann
@ 2024-12-04 15:36   ` Tor Vic
  2024-12-04 17:51     ` Arnd Bergmann
  2024-12-04 17:09   ` Nathan Chancellor
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 78+ messages in thread
From: Tor Vic @ 2024-12-04 15:36 UTC (permalink / raw)
  To: Arnd Bergmann, linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm



On 12/4/24 11:30, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> Building an x86-64 kernel with CONFIG_GENERIC_CPU is documented to
> run on all CPUs, but the Makefile does not actually pass an -march=
> argument, instead relying on the default that was used to configure
> the toolchain.
> 
> In many cases, gcc will be configured to -march=x86-64 or -march=k8
> for maximum compatibility, but in other cases a distribution default
> may be either raised to a more recent ISA, or set to -march=native
> to build for the CPU used for compilation. This still works in the
> case of building a custom kernel for the local machine.
> 
> The point where it breaks down is building a kernel for another
> machine that is older than the default target. Changing the default
> to -march=x86-64 would make it work reliably, but possibly produce
> worse code on distros that intentionally default to a newer ISA.
> 
> To allow reliably building a kernel for either the oldest x86-64
> CPUs or a more recent level, add three separate options for
> v1, v2 and v3 of the architecture as defined by gcc and clang
> and make them all turn on CONFIG_GENERIC_CPU. Based on this it
> should be possible to change runtime feature detection into
> build-time detection for things like cmpxchg16b, or possibly
> gate features that are only available on older architectures.
> 

Hi Arnd,

Similar but not identical changes have been proposed in the past several 
times like e.g. in 1, 2 and likely even more often.

Your solution seems to be much cleaner, I like it.

That said, on my Skylake platform, there is no difference between 
-march=x86-64 and -march=x86-64-v3 in terms of kernel binary size or 
performance.
I think Boris also said that these settings make no real difference on 
code generation.

Other settings might make a small difference (numbers are from 2023):
   -generic:       85.089.784 bytes
   -core2:         85.139.932 bytes
   -march=skylake: 85.017.808 bytes

----
[1] 
https://lore.kernel.org/all/4_u6ZNYPbaK36xkLt8ApRhiRTyWp_-NExHCH_tTFO_fanDglEmcbfowmiB505heI4md2AuR9hS-VSkf4s90sXb5--AnNTOwvPaTmcgzRYSY=@proton.me/

[2] 
https://lore.kernel.org/all/20230707105601.133221-1-dimitri.ledkov@canonical.com/



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 13:43     ` Arnd Bergmann
  2024-12-04 14:02       ` Brian Gerst
@ 2024-12-04 15:53       ` H. Peter Anvin
  1 sibling, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2024-12-04 15:53 UTC (permalink / raw)
  To: Arnd Bergmann, Brian Gerst, Arnd Bergmann
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm

On December 4, 2024 5:43:28 AM PST, Arnd Bergmann <arnd@arndb.de> wrote:
>On Wed, Dec 4, 2024, at 14:29, Brian Gerst wrote:
>> On Wed, Dec 4, 2024 at 5:34 AM Arnd Bergmann <arnd@kernel.org> wrote:
>>>
>>>  - In the early days of x86-64 hardware, there was sometimes the need
>>>    to run a 32-bit kernel to work around bugs in the hardware drivers,
>>>    or in the syscall emulation for 32-bit userspace. This likely still
>>>    works but there should never be a need for this any more.
>>>
>>> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
>>> PAE mode is still required to get access to the 'NX' bit on Atom,
>>> 'Pentium M' and 'Core Duo' CPUs.
>>
>> 8GB of memory is still useful for 32-bit guest VMs.
>
>Can you give some more background on this?
>
>It's clear that one can run a virtual machine this way and it
>currently works, but are you able to construct a case where this
>is a good idea, compared to running the same userspace with a
>64-bit kernel?
>
>From what I can tell, any practical workload that requires
>8GB of total RAM will likely run into either the lowmem
>limits or into virtual addressing limits, in addition to the
>problems of 32-bit kernels being generally worse than 64-bit
>ones in terms of performance, features and testing.
>
>      Arnd
>

The biggest problem is that without HIGHMEM you put a limit of just under 1 GB (892 MiB if I recall correctly), *not* 4 GB, on 32-bit kernels. That is *well* below the amount of RAM present in late-era 32-bit legacy systems, which were put into production as "recently" as 20 years ago and may still be in niche production use. Embedded systems may be significantly more recent; I know for a fact that 32-bit systems were put into production in the 2010s.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/11] x86: Kconfig.cpu: split out 64-bit atom
  2024-12-04 13:16   ` Thomas Gleixner
@ 2024-12-04 15:55     ` H. Peter Anvin
  2024-12-04 18:21       ` Andy Shevchenko
  0 siblings, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2024-12-04 15:55 UTC (permalink / raw)
  To: Thomas Gleixner, Arnd Bergmann, linux-kernel, x86
  Cc: Arnd Bergmann, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm

On December 4, 2024 5:16:50 AM PST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Dec 04 2024 at 11:30, Arnd Bergmann wrote:
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> Both 32-bit and 64-bit builds allow optimizing using "-march=atom", but
>> this is somewhat suboptimal, as gcc and clang use this option to refer
>> to the original in-order "Bonnell" microarchitecture used in the early
>> "Diamondville" and "Silverthorne" processors that were mostly 32-bit only.
>>
>> The later 22nm "Silvermont" architecture saw a significant redesign to
>> an out-of-order architecture that is reflected in the -mtune=silvermont
>> flag in the compilers, and all of these are 64-bit capable.
>
>In theory. There are quite some crippled variants of silvermont which
>are 32-bit only (either fused or at least officially not-supported to
>run 64-bit)...
>

Yeah. That was a sad story, which I unfortunately am not at liberty to share.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 14:02       ` Brian Gerst
  2024-12-04 15:00         ` Brian Gerst
@ 2024-12-04 15:58         ` H. Peter Anvin
  1 sibling, 0 replies; 78+ messages in thread
From: H. Peter Anvin @ 2024-12-04 15:58 UTC (permalink / raw)
  To: Brian Gerst, Arnd Bergmann
  Cc: Arnd Bergmann, linux-kernel, x86, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On December 4, 2024 6:02:48 AM PST, Brian Gerst <brgerst@gmail.com> wrote:
>On Wed, Dec 4, 2024 at 8:43 AM Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> On Wed, Dec 4, 2024, at 14:29, Brian Gerst wrote:
>> > On Wed, Dec 4, 2024 at 5:34 AM Arnd Bergmann <arnd@kernel.org> wrote:
>> >>
>> >>  - In the early days of x86-64 hardware, there was sometimes the need
>> >>    to run a 32-bit kernel to work around bugs in the hardware drivers,
>> >>    or in the syscall emulation for 32-bit userspace. This likely still
>> >>    works but there should never be a need for this any more.
>> >>
>> >> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
>> >> PAE mode is still required to get access to the 'NX' bit on Atom,
>> >> 'Pentium M' and 'Core Duo' CPUs.
>> >
>> > 8GB of memory is still useful for 32-bit guest VMs.
>>
>> Can you give some more background on this?
>>
>> It's clear that one can run a virtual machine this way and it
>> currently works, but are you able to construct a case where this
>> is a good idea, compared to running the same userspace with a
>> 64-bit kernel?
>>
>> From what I can tell, any practical workload that requires
>> 8GB of total RAM will likely run into either the lowmem
>> limits or into virtual addressing limits, in addition to the
>> problems of 32-bit kernels being generally worse than 64-bit
>> ones in terms of performance, features and testing.
>
>I use a 32-bit VM to test 32-bit kernel builds.  I haven't benchmarked
>kernel builds with 4GB/8GB yet, but logically more memory would be
>better for caching files.
>
>
>Brian Gerst
>

For the record, back when kernel.org was still a 32-bit machine which, one would have thought, would have been ideal for caching files, it rarely achieved more than 50% usage of what I believe was 8 GB of RAM. The next generation was 16 GB x86-64.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 11/11] x86: drop 32-bit KVM host support
  2024-12-04 15:30   ` Sean Christopherson
@ 2024-12-04 16:33     ` Arnd Bergmann
  0 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 16:33 UTC (permalink / raw)
  To: Sean Christopherson, Arnd Bergmann
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024, at 16:30, Sean Christopherson wrote:
> On Wed, Dec 04, 2024, Arnd Bergmann wrote:
>> From: Arnd Bergmann <arnd@arndb.de>
>> 
>> There are very few 32-bit machines that support KVM, the main exceptions
>> are the "Yonah" Generation Xeon-LV and Core Duo from 2006 and the Atom
>> Z5xx "Silverthorne" from 2008 that were all released just before their
>> 64-bit counterparts.
>> 
>> Using KVM as a host on a 64-bit CPU using a 32-bit kernel generally
>> works fine, but is rather pointless since 64-bit kernels are much better
>> supported and deal better with the memory requirements of VM guests.
>> 
>> Drop all the 32-bit-only portions and the "#ifdef CONFIG_X86_64" checks
>> of the x86 KVM code and add a Kconfig dependency to only allow building
>> this on 64-bit kernels.
>
> While 32-bit KVM doesn't need to be a thing for x86 usage, Paolo
> expressed concerns that dropping 32-bit support on x86 would cause
> general 32-bit KVM support to bitrot horribly.  32-bit x86 doesn't
> get much testing, but I do at least boot VMs with it on a
> semi-regular basis.  I don't think we can say the same for other
> architectures with 32-bit variants.

I see.

> PPC apparently still has 32-bit users[1][2],

I looked at the links but only see 64-bit users there.

There is KVM support for 32-bit BookE (e500v2, e500mc)
in the PPC85xx and QorIQ P1/P2/P3/P4, and Crystal mentioned
that there might be users, but did not point to anyone
in particular.

The A-EON AmigaOne X5000 and Powerboard Tyche that were
mentioned in the thread as being actively used are both
64-bit QorIQ P5/T2 (e5500, e6500) based. These are the
same platform ("85xx" in Linux, "e500" in qemu), so it's
easy to confuse the two. We can probably ask again if anyone
cares about removing the 32-bit side of this.

> and 32-bit RISC-V is a thing,

There are many 32-bit RISC-V chips, but all RISC-V
SoCs supported by Linux today are in fact 64-bit.

While there is still talk of adding support for 32-bit
SoCs, the only usecase for those is really to allow
machines with smaller amounts of physical RAM, which
tends to rule out virtualization.

There is one more platform that supports virtualization
on 32-bit CPUs, which is the MIPS P5600 core in the
Baikal T1.

I still think it makes sense to just drop KVM support
for all 32-bit hosts, but I agree that it also
makes sense to keep x86-32 as the last one of those
for testing purposes.

    Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 13:29   ` Brian Gerst
  2024-12-04 13:43     ` Arnd Bergmann
@ 2024-12-04 16:37     ` H. Peter Anvin
  2024-12-04 16:55       ` Arnd Bergmann
  1 sibling, 1 reply; 78+ messages in thread
From: H. Peter Anvin @ 2024-12-04 16:37 UTC (permalink / raw)
  To: Brian Gerst, Arnd Bergmann
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On December 4, 2024 5:29:17 AM PST, Brian Gerst <brgerst@gmail.com> wrote:
>On Wed, Dec 4, 2024 at 5:34 AM Arnd Bergmann <arnd@kernel.org> wrote:
>>
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> The HIGHMEM64G support was added in linux-2.3.25 to support (then)
>> high-end Pentium Pro and Pentium III Xeon servers with more than 4GB of
>> addressing, as NUMA and PCI-X slots started appearing.
>>
>> I have found no evidence of this ever being used in regular dual-socket
>> servers or consumer devices, all the users seem obsolete these days,
>> even by i386 standards:
>>
>>  - Support for NUMA servers (NUMA-Q, IBM x440, unisys) was already
>>    removed ten years ago.
>>
>>  - 4+ socket non-NUMA servers based on Intel 450GX/450NX, HP F8 and
>>    ServerWorks ServerSet/GrandChampion could theoretically still work
>>    with 8GB, but these were exceptionally rare even 20 years ago and
>>    would have usually been equipped with less than the maximum amount of
>>    RAM.
>>
>>  - Some SKUs of the Celeron D from 2004 had 64-bit mode fused off but
>>    could still work in a Socket 775 mainboard designed for the later
>>    Core 2 Duo and 8GB. Apparently most BIOSes at the time only allowed
>>    64-bit CPUs.
>>
>>  - In the early days of x86-64 hardware, there was sometimes the need
>>    to run a 32-bit kernel to work around bugs in the hardware drivers,
>>    or in the syscall emulation for 32-bit userspace. This likely still
>>    works but there should never be a need for this any more.
>>
>> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
>> PAE mode is still required to get access to the 'NX' bit on Atom,
>> 'Pentium M' and 'Core Duo' CPUs.
>
>8GB of memory is still useful for 32-bit guest VMs.
>
>
>Brian Gerst
>

By the way, there are 64-bit machines which require swiotlb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 16:37     ` H. Peter Anvin
@ 2024-12-04 16:55       ` Arnd Bergmann
  2024-12-04 18:37         ` Andy Shevchenko
  0 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 16:55 UTC (permalink / raw)
  To: H. Peter Anvin, Brian Gerst, Arnd Bergmann
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024, at 17:37, H. Peter Anvin wrote:
> On December 4, 2024 5:29:17 AM PST, Brian Gerst <brgerst@gmail.com> wrote:
>>>
>>> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
>>> PAE mode is still required to get access to the 'NX' bit on Atom,
>>> 'Pentium M' and 'Core Duo' CPUs.
>
> By the way, there are 64-bit machines which require swiotlb.

What I meant to write here was that CONFIG_X86_PAE no longer
needs to select PHYS_ADDR_T_64BIT and SWIOTLB. I ended up
splitting that change out to patch 06/11 with a better explanation,
so the sentence above is just wrong now and I've removed it
in my local copy now.

Obviously 64-bit kernels still generally need swiotlb.

       Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 10:30 ` [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags Arnd Bergmann
  2024-12-04 15:36   ` Tor Vic
@ 2024-12-04 17:09   ` Nathan Chancellor
  2024-12-04 17:52     ` Arnd Bergmann
  2024-12-04 18:10   ` Linus Torvalds
  2024-12-06 13:56   ` David Laight
  3 siblings, 1 reply; 78+ messages in thread
From: Nathan Chancellor @ 2024-12-04 17:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

Hi Arnd,

On Wed, Dec 04, 2024 at 11:30:40AM +0100, Arnd Bergmann wrote:
...
> +++ b/arch/x86/Kconfig.cpu
> +config X86_64_V1
> +config X86_64_V2
> +config X86_64_V3
...
> +++ b/arch/x86/Makefile
> +        cflags-$(CONFIG_MX86_64_V1)	+= -march=x86-64
> +        cflags-$(CONFIG_MX86_64_V2)	+= $(call cc-option,-march=x86-64-v2,-march=x86-64)
> +        cflags-$(CONFIG_MX86_64_V3)	+= $(call cc-option,-march=x86-64-v3,-march=x86-64)
...
> +        rustflags-$(CONFIG_MX86_64_V1)	+= -Ctarget-cpu=x86-64
> +        rustflags-$(CONFIG_MX86_64_V2)	+= -Ctarget-cpu=x86-64-v2
> +        rustflags-$(CONFIG_MX86_64_V3)	+= -Ctarget-cpu=x86-64-v3

There appears to be an extra 'M' when using these CONFIGs in Makefile,
so I don't think this works as is?

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 15:36   ` Tor Vic
@ 2024-12-04 17:51     ` Arnd Bergmann
  0 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 17:51 UTC (permalink / raw)
  To: Tor Vic, Arnd Bergmann, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm,
	Nathan Chancellor

On Wed, Dec 4, 2024, at 16:36, Tor Vic wrote:
> On 12/4/24 11:30, Arnd Bergmann wrote:
> Similar but not identical changes have been proposed in the past several 
> times like e.g. in 1, 2 and likely even more often.
>
> Your solution seems to be much cleaner, I like it.

Thanks. It looks like the other two did not actually
address the bug I'm fixing in my version.

> That said, on my Skylake platform, there is no difference between 
> -march=x86-64 and -march=x86-64-v3 in terms of kernel binary size or 
> performance.
> I think Boris also said that these settings make no real difference on 
> code generation.

As Nathan pointed out, I had a typo in my patch, so the
options didn't actually do anything at all. I fixed it now
and did a 'defconfig' test build with all three:

> Other settings might make a small difference (numbers are from 2023):
>    -generic:       85.089.784 bytes
>    -core2:         85.139.932 bytes
>    -march=skylake: 85.017.808 bytes


   text	   data	    bss	    dec	    hex	filename
26664466	10806622	1490948	38962036	2528374	obj-x86/vmlinux-v1
26664466	10806622	1490948	38962036	2528374	obj-x86/vmlinux-v2
26662504	10806654	1490948	38960106	2527bea	obj-x86/vmlinux-v3

which is a tiny 2KB saved between v2 and v3. I looked at
the object code and found that the v3 version takes advantage
of the BMI extension, which makes perfect sense. Not sure
if it has any real performance benefits.

Between v1 and v2, there is a chance to turn things like
system_has_cmpxchg128() into a constant on v2 and higher.

The v4 level is meaningless in practice since it only adds
AVX-512 instructions that are present in very few CPUs and
not that useful inside the kernel, aside from specialized
crypto and RAID helpers.

      Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 17:09   ` Nathan Chancellor
@ 2024-12-04 17:52     ` Arnd Bergmann
  0 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 17:52 UTC (permalink / raw)
  To: Nathan Chancellor, Arnd Bergmann
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Wed, Dec 4, 2024, at 18:09, Nathan Chancellor wrote:
> Hi Arnd,
>
> On Wed, Dec 04, 2024 at 11:30:40AM +0100, Arnd Bergmann wrote:
> ...
>> +++ b/arch/x86/Kconfig.cpu
>> +config X86_64_V1
>> +config X86_64_V2
>> +config X86_64_V3
> ...
>> +++ b/arch/x86/Makefile
>> +        cflags-$(CONFIG_MX86_64_V1)	+= -march=x86-64
>> +        cflags-$(CONFIG_MX86_64_V2)	+= $(call cc-option,-march=x86-64-v2,-march=x86-64)
>> +        cflags-$(CONFIG_MX86_64_V3)	+= $(call cc-option,-march=x86-64-v3,-march=x86-64)
> ...
>> +        rustflags-$(CONFIG_MX86_64_V1)	+= -Ctarget-cpu=x86-64
>> +        rustflags-$(CONFIG_MX86_64_V2)	+= -Ctarget-cpu=x86-64-v2
>> +        rustflags-$(CONFIG_MX86_64_V3)	+= -Ctarget-cpu=x86-64-v3
>
> There appears to be an extra 'M' when using these CONFIGs in Makefile,
> so I don't think this works as is?

Fixed now by adding the 'M' in the Kconfig file, thanks for
noticing it.

      Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 10:30 ` [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags Arnd Bergmann
  2024-12-04 15:36   ` Tor Vic
  2024-12-04 17:09   ` Nathan Chancellor
@ 2024-12-04 18:10   ` Linus Torvalds
  2024-12-04 19:43     ` Arnd Bergmann
  2024-12-06 13:56   ` David Laight
  3 siblings, 1 reply; 78+ messages in thread
From: Linus Torvalds @ 2024-12-04 18:10 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

  "On second thought , let’s not go to x86-64 microarchitectural
levels. ‘Tis a silly place"

On Wed, 4 Dec 2024 at 02:31, Arnd Bergmann <arnd@kernel.org> wrote:
>
> To allow reliably building a kernel for either the oldest x86-64
> CPUs or a more recent level, add three separate options for
> v1, v2 and v3 of the architecture as defined by gcc and clang
> and make them all turn on CONFIG_GENERIC_CPU.

The whole "v2", "v3", "v4" etc naming seems to be some crazy glibc
artifact and is stupid and needs to die.

It has no relevance to anything. Please do *not* introduce that
mind-fart into the kernel sources.

I have no idea who came up with the "microarchitecture levels"
garbage, but as far as I can tell, it's entirely unofficial, and it's
a completely broken model.

There is a very real model for microarchitectural features, and it's
the CPUID bits. Trying to linearize those bits is technically wrong,
since these things simply aren't some kind of linear progression.

And worse, it's a "simplification" that literally adds complexity. Now
instead of asking "does this CPU support the cmpxchg16b instruction?",
the question instead becomes one of "what the hell does 'v3' mean
again?"

So no. We are *NOT* introducing that idiocy in the kernel.

                Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 03/11] x86: Kconfig.cpu: split out 64-bit atom
  2024-12-04 15:55     ` H. Peter Anvin
@ 2024-12-04 18:21       ` Andy Shevchenko
  0 siblings, 0 replies; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-04 18:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Thomas Gleixner, Arnd Bergmann, linux-kernel, x86, Arnd Bergmann,
	Ingo Molnar, Borislav Petkov, Dave Hansen, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024 at 5:55 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On December 4, 2024 5:16:50 AM PST, Thomas Gleixner <tglx@linutronix.de> wrote:
> >On Wed, Dec 04 2024 at 11:30, Arnd Bergmann wrote:
> >> From: Arnd Bergmann <arnd@arndb.de>
> >>
> >> Both 32-bit and 64-bit builds allow optimizing using "-march=atom", but
> >> this is somewhat suboptimal, as gcc and clang use this option to refer
> >> to the original in-order "Bonnell" microarchitecture used in the early
> >> "Diamondville" and "Silverthorne" processors that were mostly 32-bit only.
> >>
> >> The later 22nm "Silvermont" architecture saw a significant redesign to
> >> an out-of-order architecture that is reflected in the -mtune=silvermont
> >> flag in the compilers, and all of these are 64-bit capable.
> >
> >In theory. There are quite some crippled variants of silvermont which
> >are 32-bit only (either fused or at least officially not-supported to
> >run 64-bit)...

> Yeah. That was a sad story, which I unfortunately am not at liberty to share.

Are they available in the wild? The cores I know of are Merrifield,
Moorefield, and Bay Trail, which were distributed in the millions and
are perfectly available, but I have never heard of ones that are
32-bit only. The Avoton and Rangeley parts I have read about on
https://en.wikipedia.org/wiki/Silvermont seem specific to servers and
routers and are most likely gone from use.


-- 
With Best Regards,
Andy Shevchenko


* Re: [PATCH 04/11] x86: split CPU selection into 32-bit and 64-bit
  2024-12-04 10:30 ` [PATCH 04/11] x86: split CPU selection into 32-bit and 64-bit Arnd Bergmann
@ 2024-12-04 18:31   ` Andy Shevchenko
  2024-12-04 21:18     ` Arnd Bergmann
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-04 18:31 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024 at 12:31 PM Arnd Bergmann <arnd@kernel.org> wrote:
>
> From: Arnd Bergmann <arnd@arndb.de>
>
> The x86 CPU selection menu is confusing for a number of reasons.
> One of them is how it's possible to build a 32-bit kernel for
> a small number of early 64-bit microarchitectures (K8, Core2)

Core 2

It is spelled with a space and a capital letter; there are a few more
instances of this spelling below as well.

> but not the regular generic 64-bit target that is the normal
> default.
>
> There is no longer a reason to run 32-bit kernels on production
> 64-bit systems, so simplify the configuration menu by completely
> splitting the two into 32-bit-only and 64-bit-only machines.
>
> Testing generic 32-bit kernels on 64-bit hardware remains
> possible, just not building a 32-bit kernel that requires
> a 64-bit CPU.

> +choice
> +       prompt "x86-64 Processor family"
> +       depends on X86_64
> +       default GENERIC_CPU
> +       help
> +         This is the processor type of your CPU. This information is
> +         used for optimizing purposes. In order to compile a kernel
> +         that can run on all supported x86 CPU types (albeit not
> +         optimally fast), you can specify "Generic-x86-64" here.
> +
> +         Here are the settings recommended for greatest speed:
> +         - "Opteron/Athlon64/Hammer/K8" for all K8 and newer AMD CPUs.
> +         - "Intel P4" for the Pentium 4/Netburst microarchitecture.
> +         - "Core 2/newer Xeon" for all core2 and newer Intel CPUs.

Core 2

> +         - "Intel Atom" for the Atom-microarchitecture CPUs.
> +         - "Generic-x86-64" for a kernel which runs on any x86-64 CPU.
> +
> +         See each option's help text for additional details. If you don't know
> +         what to do, choose "Generic-x86-64".

-- 
With Best Regards,
Andy Shevchenko


* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 16:55       ` Arnd Bergmann
@ 2024-12-04 18:37         ` Andy Shevchenko
  2024-12-04 21:14           ` Arnd Bergmann
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-04 18:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: H. Peter Anvin, Brian Gerst, Arnd Bergmann, linux-kernel, x86,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024 at 6:57 PM Arnd Bergmann <arnd@arndb.de> wrote:
> On Wed, Dec 4, 2024, at 17:37, H. Peter Anvin wrote:
> > On December 4, 2024 5:29:17 AM PST, Brian Gerst <brgerst@gmail.com> wrote:
> >>>
> >>> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
> >>> PAE mode is still required to get access to the 'NX' bit on Atom
> >>> 'Pentium M' and 'Core Duo' CPUs.
> >
> > By the way, there are 64-bit machines which require swiotlb.
>
> What I meant to write here was that CONFIG_X86_PAE no longer
> needs to select PHYS_ADDR_T_64BIT and SWIOTLB. I ended up
> splitting that change out to patch 06/11 with a better explanation,
> so the sentence above is just wrong now and I've removed it
> in my local copy now.
>
> Obviously 64-bit kernels still generally need swiotlb.

Theoretically swiotlb can be useful on 32-bit machines as well for the
DMA controllers that have < 32-bit mask. Dunno if swiotlb was designed
to run on 32-bit machines at all.


-- 
With Best Regards,
Andy Shevchenko


* Re: [PATCH 06/11] x86: drop SWIOTLB and PHYS_ADDR_T_64BIT for PAE
  2024-12-04 10:30 ` [PATCH 06/11] x86: drop SWIOTLB and PHYS_ADDR_T_64BIT for PAE Arnd Bergmann
@ 2024-12-04 18:41   ` Andy Shevchenko
  2024-12-04 20:52     ` Arnd Bergmann
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-04 18:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024 at 12:31 PM Arnd Bergmann <arnd@kernel.org> wrote:
>
> From: Arnd Bergmann <arnd@arndb.de>
>
> Since kernels with and without CONFIG_X86_PAE are now limited
> to the low 4GB of physical address space, there is no need to
> use either swiotlb or 64-bit phys_addr_t any more, so stop
> selecting these and fix up the build warnings from that.

...

>         mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
>         if (!uniform) {
>                 pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
> -                            __func__, addr, addr + PMD_SIZE);
> +                            __func__, (u64)addr, (u64)addr + PMD_SIZE);

Instead of castings I would rather:
1) have addr and size (? does above have off-by-one error?) or end;
2) use struct resource / range with the respective %p[Rr][a] specifier
or use %pa.



-- 
With Best Regards,
Andy Shevchenko


* Re: [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only
  2024-12-04 10:30 ` [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only Arnd Bergmann
@ 2024-12-04 18:55   ` Andy Shevchenko
  2024-12-04 20:38     ` Arnd Bergmann
  2024-12-06 11:23     ` Ferry Toth
  0 siblings, 2 replies; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-04 18:55 UTC (permalink / raw)
  To: Arnd Bergmann, Ferry Toth
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

+Cc: Ferry

On Wed, Dec 4, 2024 at 12:31 PM Arnd Bergmann <arnd@kernel.org> wrote:
>
> From: Arnd Bergmann <arnd@arndb.de>
>
> The X86_INTEL_MID code was originally introduced for the
> 32-bit Moorestown/Medfield/Clovertrail platform, later the 64-bit
> Merrifield/Moorefield variant got added, but the final

variant got --> variants were

> Morganfield/Broxton 14nm chips were canceled before they hit
> the market.

Inaccurate. "Broxton for Mobile", and not "Broxton" in general.


> To help users understand what the option actually refers to,
> update the help text, and make it a hard dependency on 64-bit
> kernels. While they could theoretically run a 32-bit kernel,
> the devices originally shipped with 64-bit one in 2015, so that
> was proabably never tested.

probably

It's the other way around (from the SW point of view). For unknown
reasons Intel decided to release only 32-bit SW, and at _that_ time it
became the only configuration that was heavily tested (despite a
misunderstanding by some developers who pointed the finger at the HW
without researching issues that turned out to be purely software in a
few cases). Starting ca. 2017 I enabled 64-bit for Merrifield, and
since then it has been used by both 32- and 64-bit builds.

I'm totally fine to drop 32-bit defaults for Merrifield/Moorefield,
but let's hear Ferry who might/may still have a use case for that.

...

> -               Moorestown MID devices

FTR, a year or so ago there was (weak) interest in reviving Medfield,
but I think it would require too much work even for a person who is
quite familiar with the HW, U-Boot, and the Linux kernel, so it is
most unlikely to happen.

...

>           Select to build a kernel capable of supporting Intel MID (Mobile
>           Internet Device) platform systems which do not have the PCI legacy
> -         interfaces. If you are building for a PC class system say N here.
> +         interfaces.
> +
> +         The only supported devices are the 22nm Merrifield (Z34xx) and
> +         Moorefield (Z35xx) SoC used in Android devices such as the
> +         Asus Zenfone 2, Asus FonePad 8 and Dell Venue 7.

The list is missing the Intel Edison DIY platform which is probably
the main user of Intel MID kernels nowadays.

...

> -         Intel MID platforms are based on an Intel processor and chipset which
> -         consume less power than most of the x86 derivatives.

Why remove this? AFAIK it states the truth.


-- 
With Best Regards,
Andy Shevchenko


* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 18:10   ` Linus Torvalds
@ 2024-12-04 19:43     ` Arnd Bergmann
  2024-12-04 23:33       ` Linus Torvalds
  2024-12-05  8:07       ` Andy Shevchenko
  0 siblings, 2 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 19:43 UTC (permalink / raw)
  To: Linus Torvalds, Arnd Bergmann
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024, at 19:10, Linus Torvalds wrote:
> "On second thought , let’s not go to x86-64 microarchitectural
> levels. ‘Tis a silly place"

Fair enough. I'll just make it use -march=x86_64 to override
the compiler default then.

> On Wed, 4 Dec 2024 at 02:31, Arnd Bergmann <arnd@kernel.org> wrote:
>>
>> To allow reliably building a kernel for either the oldest x86-64
>> CPUs or a more recent level, add three separate options for
>> v1, v2 and v3 of the architecture as defined by gcc and clang
>> and make them all turn on CONFIG_GENERIC_CPU.
>
> The whole "v2", "v3", "v4" etc naming seems to be some crazy glibc
> artifact and is stupid and needs to die.
>
> It has no relevance to anything. Please do *not* introduce that
> mind-fart into the kernel sources.
>
> I have no idea who came up with the "microarchitecture levels"
> garbage, but as far as I can tell, it's entirely unofficial, and it's
> a completely broken model.

I agree that both the name and the concept are broken.
My idea was based on how distros (Red Hat Enterprise Linux
at least) already use the same levels for making userspace
require newer CPUs, so using the same flag for the kernel
makes some sense.

Making a point about the levels being stupid is a useful
goal as well.

> There is a very real model for microarchitectural features, and it's
> the CPUID bits. Trying to linearize those bits is technically wrong,
> since these things simply aren't some kind of linear progression.
>
> And worse, it's a "simplification" that literally adds complexity. Now
> instead of asking "does this CPU support the cmpxchg16b instruction?",
> the question instead becomes one of "what the hell does 'v3' mean
> again?"

I guess the other side of it is that the current selection
between pentium4/core2/k8/bonnell/generic is not much better,
given that in practice nobody has any of the
pentium4/core2/k8/bonnell variants any more.

A more radical solution would be to just drop the entire
menu for 64-bit kernels and always default to "-march=x86_64
-mtune=generic" and 64 byte L1 cachelines.
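
As a concrete sketch of that radical option (hypothetical, not the actual patch; assuming arch/x86/Makefile is where the flags would land, and noting that gcc and clang spell the baseline ISA target "-march=x86-64", with a hyphen):

```make
# Hypothetical sketch (not the actual patch): what "always default to
# the generic 64-bit target" might look like in arch/x86/Makefile.
# gcc and clang spell the baseline ISA target with a hyphen: "x86-64".
ifdef CONFIG_X86_64
        KBUILD_CFLAGS += -march=x86-64 -mtune=generic
endif
```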

      Arnd


* Re: [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only
  2024-12-04 18:55   ` Andy Shevchenko
@ 2024-12-04 20:38     ` Arnd Bergmann
  2024-12-05  8:03       ` Andy Shevchenko
  2024-12-06 11:23     ` Ferry Toth
  1 sibling, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 20:38 UTC (permalink / raw)
  To: Andy Shevchenko, Arnd Bergmann, Ferry Toth
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Wed, Dec 4, 2024, at 19:55, Andy Shevchenko wrote:
> +Cc: Ferry
>
> On Wed, Dec 4, 2024 at 12:31 PM Arnd Bergmann <arnd@kernel.org> wrote:
>>
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> The X86_INTEL_MID code was originally introduced for the
>> 32-bit Moorestown/Medfield/Clovertrail platform, later the 64-bit
>> Merrifield/Moorefield variant got added, but the final
>
> variant got --> variants were

Fixed

>> Morganfield/Broxton 14nm chips were canceled before they hit
>> the market.
>
> Inaccurate. "Broxton for Mobile", and not "Broxton" in general.

Changed to "but the final Morganfield 14nm platform was canceled
before it hit the market" 

>> To help users understand what the option actually refers to,
>> update the help text, and make it a hard dependency on 64-bit
>> kernels. While they could theoretically run a 32-bit kernel,
>> the devices originally shipped with 64-bit one in 2015, so that
>> was proabably never tested.
>
> probably

Fixed.

> It's all other way around (from SW point of view). For unknown reasons
> Intel decided to release only 32-bit SW and it became the only thing
> that was heavily tested (despite misunderstanding by some developers
> that pointed finger to the HW without researching the issue that
> appears to be purely software in a few cases) _that_ time.  Starting
> ca. 2017 I enabled 64-bit for Merrifield and from then it's being used
> by both 32- and 64-bit builds.
>
> I'm totally fine to drop 32-bit defaults for Merrifield/Moorefield,
> but let's hear Ferry who might/may still have a use case for that.

Ok. I tried to find the oldest Android image and saw it used a 64-bit
kernel, but that must have been after your work then.

>
>> -               Moorestown MID devices
>
> FTR, a year or so ago it was a (weak) interest to revive Medfield, but
> I think it would require too much work even for the person who is
> quite familiar with HW, U-Boot, and Linux kernel, so it is most
> unlikely to happen.

Ok.

>> +
>> +         The only supported devices are the 22nm Merrified (Z34xx) and
>> +         Moorefield (Z35xx) SoC used in Android devices such as the
>> +         Asus Zenfone 2, Asus FonePad 8 and Dell Venue 7.
>
> The list is missing the Intel Edison DIY platform which is probably
> the main user of Intel MID kernels nowadays.

Ah, that explains a lot ;-)

Changed now to

          The only supported devices are the 22nm Merrifield (Z34xx) and
          Moorefield (Z35xx) SoC used in the Intel Edison board and
          a small number of Android devices such as the Asus Zenfone 2,
          Asus FonePad 8 and Dell Venue 7.

> ...
>
>> -         Intel MID platforms are based on an Intel processor and chipset which
>> -         consume less power than most of the x86 derivatives.
>
> Why remove this? AFAIK it states the truth.

It seemed irrelevant for users that configure the kernel. I've
put it back now.

Thanks for the review!

     Arnd


* Re: [PATCH 06/11] x86: drop SWIOTLB and PHYS_ADDR_T_64BIT for PAE
  2024-12-04 18:41   ` Andy Shevchenko
@ 2024-12-04 20:52     ` Arnd Bergmann
  2024-12-05  7:59       ` Andy Shevchenko
  0 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 20:52 UTC (permalink / raw)
  To: Andy Shevchenko, Arnd Bergmann
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Wed, Dec 4, 2024, at 19:41, Andy Shevchenko wrote:
> On Wed, Dec 4, 2024 at 12:31 PM Arnd Bergmann <arnd@kernel.org> wrote:
>>
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> Since kernels with and without CONFIG_X86_PAE are now limited
>> to the low 4GB of physical address space, there is no need to
>> use either swiotlb or 64-bit phys_addr_t any more, so stop
>> selecting these and fix up the build warnings from that.
>
> ...
>
>>         mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
>>         if (!uniform) {
>>                 pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
>> -                            __func__, addr, addr + PMD_SIZE);
>> +                            __func__, (u64)addr, (u64)addr + PMD_SIZE);
>
> Instead of castings I would rather:
> 1) have addr and size (? does above have off-by-one error?) or end;
> 2) use struct resource / range with the respective %p[Rr][a] specifier
> or use %pa.

Changed as below now. I'm still not sure whether the mtrr_type_lookup
end argument is meant to be inclusive or exclusive, so I've left
that alone, but the printed range should be correct now.

Thanks,

     Arnd

@@ -740,11 +740,12 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
 {
        u8 uniform;
+       struct resource res = DEFINE_RES_MEM(addr, PMD_SIZE);
 
        mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
        if (!uniform) {
-               pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
-                            __func__, (u64)addr, (u64)addr + PMD_SIZE);
+               pr_warn_once("%s: Cannot satisfy %pR with a huge-page mapping due to MTRR override.\n",
+                            __func__, &res);
                return 0;
        }
 



* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 18:37         ` Andy Shevchenko
@ 2024-12-04 21:14           ` Arnd Bergmann
  0 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 21:14 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: H. Peter Anvin, Brian Gerst, Arnd Bergmann, linux-kernel, x86,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 4, 2024, at 19:37, Andy Shevchenko wrote:
> On Wed, Dec 4, 2024 at 6:57 PM Arnd Bergmann <arnd@arndb.de> wrote:
>> On Wed, Dec 4, 2024, at 17:37, H. Peter Anvin wrote:
>> > On December 4, 2024 5:29:17 AM PST, Brian Gerst <brgerst@gmail.com> wrote:
>> >>>
>> >>> Removing this also drops the need for PHYS_ADDR_T_64BIT and SWIOTLB.
>> >>> PAE mode is still required to get access to the 'NX' bit on Atom
>> >>> 'Pentium M' and 'Core Duo' CPUs.
>> >
>> > By the way, there are 64-bit machines which require swiotlb.
>>
>> What I meant to write here was that CONFIG_X86_PAE no longer
>> needs to select PHYS_ADDR_T_64BIT and SWIOTLB. I ended up
>> splitting that change out to patch 06/11 with a better explanation,
>> so the sentence above is just wrong now and I've removed it
>> in my local copy now.
>>
>> Obviously 64-bit kernels still generally need swiotlb.
>
> Theoretically swiotlb can be useful on 32-bit machines as well for the
> DMA controllers that have < 32-bit mask. Dunno if swiotlb was designed
> to run on 32-bit machines at all.

Right, that is a possibility. However those machines would
currently be broken on kernels without X86_PAE, since they
don't select swiotlb.

If anyone does rely on the current behavior of X86_PAE to support
broken DMA devices, it's probably best to fix it in a different
way.

      Arnd


* Re: [PATCH 04/11] x86: split CPU selection into 32-bit and 64-bit
  2024-12-04 18:31   ` Andy Shevchenko
@ 2024-12-04 21:18     ` Arnd Bergmann
  0 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-04 21:18 UTC (permalink / raw)
  To: Andy Shevchenko, Arnd Bergmann
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Wed, Dec 4, 2024, at 19:31, Andy Shevchenko wrote:
> On Wed, Dec 4, 2024 at 12:31 PM Arnd Bergmann <arnd@kernel.org> wrote:
>>
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> The x86 CPU selection menu is confusing for a number of reasons.
>> One of them is how it's possible to build a 32-bit kernel for
>> a small number of early 64-bit microarchitectures (K8, Core2)
>
> Core 2

Fixed

>> +choice
>> +       prompt "x86-64 Processor family"
>> +       depends on X86_64
>> +       default GENERIC_CPU
>> +       help
>> +         This is the processor type of your CPU. This information is
>> +         used for optimizing purposes. In order to compile a kernel
>> +         that can run on all supported x86 CPU types (albeit not
>> +         optimally fast), you can specify "Generic-x86-64" here.
>> +
>> +         Here are the settings recommended for greatest speed:
>> +         - "Opteron/Athlon64/Hammer/K8" for all K8 and newer AMD CPUs.
>> +         - "Intel P4" for the Pentium 4/Netburst microarchitecture.
>> +         - "Core 2/newer Xeon" for all core2 and newer Intel CPUs.
>
> Core 2

Fixed, though this is the preexisting help text that I just
moved around.

      Arnd


* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 19:43     ` Arnd Bergmann
@ 2024-12-04 23:33       ` Linus Torvalds
  2024-12-05  8:13         ` Andy Shevchenko
  2024-12-05  9:46         ` Arnd Bergmann
  2024-12-05  8:07       ` Andy Shevchenko
  1 sibling, 2 replies; 78+ messages in thread
From: Linus Torvalds @ 2024-12-04 23:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Arnd Bergmann, linux-kernel, x86, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Wed, 4 Dec 2024 at 11:44, Arnd Bergmann <arnd@arndb.de> wrote:
>
> I guess the other side of it is that the current selection
> between pentium4/core2/k8/bonnell/generic is not much better,
> given that in practice nobody has any of the
> pentium4/core2/k8/bonnell variants any more.

Yeah, I think that whole part of the x86 Kconfig is almost entirely historical.

It's historical also in the sense that a lot of those decisions matter
a whole lot less these days.

The whole CPU tuning issue is happily mostly a thing of the past,
since all modern CPU's do fairly well, and you don't have the crazy
glass jaws of yesteryear with in-order cores and the insane
instruction choice sensitivity of the P4 uarch.

And on our side, we've just also basically turned to much more dynamic
models, with either instruction rewriting or static branches or both.

So I suspect:

> A more radical solution would be to just drop the entire
> menu for 64-bit kernels and always default to "-march=x86_64
> -mtune=generic" and 64 byte L1 cachelines.

would actually be perfectly acceptable. The non-generic choices are
all entirely historical and not really very interesting.

Absolutely nobody sane cares about instruction scheduling for the old P4 cores.

In the bad old 32-bit days, we had real code generation issues with
basic instruction set, ie the whole "some CPU's are P6-class, but
don't actually support the CMOVxx instruction". Those days are gone.

And yes, on x86-64, we still have the whole cmpxchg16b issue, which
really is a slight annoyance. But the emphasis is on "slight" - we
basically have one hack for this in the SLAB code, and a couple of
dynamic tests for one particular driver (iommu 128-bit IRTE mode).

So yeah, the cmpxchg16b thing is annoying, but _realistically_ I don't
think we care.

And some day we will forget about it, notice that those (few) AMD
early 64-bit CPU's can't possibly have been working for the last year
or two, and we'll finally just kill that code, but in the meantime the
cost of maintaining it is so slight that it's not worth actively going
out to kill it.

I do think that the *one* option we might have is "optimize for the
current CPU" for people who just want to build their own kernel for
their own machine. That's a nice easy choice to give people, and
'-march=native' is kind of simple to use.

Will that work when you cross-compile? No. Do we care? Also no. It's
basically a simple "you want to optimize for your own local machine"
switch.

Maybe that could replace some of the 32-bit choices too?

             Linus


* Re: [PATCH 10/11] x86: remove old STA2x11 support
  2024-12-04 10:30 ` [PATCH 10/11] x86: remove old STA2x11 support Arnd Bergmann
@ 2024-12-05  7:35   ` Davide Ciminaghi
  0 siblings, 0 replies; 78+ messages in thread
From: Davide Ciminaghi @ 2024-12-05  7:35 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Paolo Bonzini, kvm

On Wed, Dec 04, 2024 at 11:30:41AM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> ST ConneXt STA2x11 was an interface chip for Atom E6xx processors,
> using a number of components usually found on Arm SoCs. Most of this
> was merged upstream, but it was never complete enough to actually work
> and has been abandoned for many years.
> 
> We already had an agreement on removing it in 2022, but nobody ever
> submitted the patch to do it.
>
Yes, sorry for that; I never found the time to do it.

Thanks a lot
Davide Ciminaghi


* Re: [PATCH 06/11] x86: drop SWIOTLB and PHYS_ADDR_T_64BIT for PAE
  2024-12-04 20:52     ` Arnd Bergmann
@ 2024-12-05  7:59       ` Andy Shevchenko
  0 siblings, 0 replies; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-05  7:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Arnd Bergmann, linux-kernel, x86, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Wed, Dec 04, 2024 at 09:52:01PM +0100, Arnd Bergmann wrote:
> On Wed, Dec 4, 2024, at 19:41, Andy Shevchenko wrote:
> > On Wed, Dec 4, 2024 at 12:31 PM Arnd Bergmann <arnd@kernel.org> wrote:

...

> >>                 pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
> >> -                            __func__, addr, addr + PMD_SIZE);
> >> +                            __func__, (u64)addr, (u64)addr + PMD_SIZE);
> >
> > Instead of castings I would rather:
> > 1) have addr and size (? does above have off-by-one error?) or end;
> > 2) use struct resource / range with the respective %p[Rr][a] specifier
> > or use %pa.
> 
> Changed as below now. I'm still not sure whether the mtrr_type_lookup
> end argument is meant to be inclusive or exclusive, so I've left
> that alone, but the printed range should be correct now.

Yep, thanks!

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only
  2024-12-04 20:38     ` Arnd Bergmann
@ 2024-12-05  8:03       ` Andy Shevchenko
  0 siblings, 0 replies; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-05  8:03 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Arnd Bergmann, Ferry Toth, linux-kernel, x86, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Linus Torvalds, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

On Wed, Dec 04, 2024 at 09:38:05PM +0100, Arnd Bergmann wrote:
> On Wed, Dec 4, 2024, at 19:55, Andy Shevchenko wrote:
> > On Wed, Dec 4, 2024 at 12:31 PM Arnd Bergmann <arnd@kernel.org> wrote:

...

> > It's all other way around (from SW point of view). For unknown reasons
> > Intel decided to release only 32-bit SW and it became the only thing
> > that was heavily tested (despite misunderstanding by some developers
> > that pointed finger to the HW without researching the issue that
> > appears to be purely software in a few cases) _that_ time.  Starting
> > ca. 2017 I enabled 64-bit for Merrifield and from then it's being used
> > by both 32- and 64-bit builds.
> >
> > I'm totally fine to drop 32-bit defaults for Merrifield/Moorefield,
> > but let's hear Ferry who might/may still have a use case for that.
> 
> Ok. I tried to find the oldest Android image and saw it used a 64-bit
> kernel, but that must have been after your work then.

I stand corrected; what I said relates to Merrifield. Moorefield may
have had 64-bit users on the phones from day 1, though.

...

> Changed now to
> 
>           The only supported devices are the 22nm Merrifield (Z34xx) and
>           Moorefield (Z35xx) SoC used in the Intel Edison board and
>           a small number of Android devices such as the Asus Zenfone 2,
>           Asus FonePad 8 and Dell Venue 7.

LGTM, thanks!

...

> >> -         Intel MID platforms are based on an Intel processor and chipset which
> >> -         consume less power than most of the x86 derivatives.
> >
> > Why remove this? AFAIK it states the truth.
> 
> It seemed irrelevant for users that configure the kernel. I've
> put it back now.

It might be, but it was already there. Thanks for leaving it untouched.

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 19:43     ` Arnd Bergmann
  2024-12-04 23:33       ` Linus Torvalds
@ 2024-12-05  8:07       ` Andy Shevchenko
  1 sibling, 0 replies; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-05  8:07 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, Arnd Bergmann, linux-kernel, x86, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Wed, Dec 04, 2024 at 08:43:35PM +0100, Arnd Bergmann wrote:
> On Wed, Dec 4, 2024, at 19:10, Linus Torvalds wrote:

...

> I guess the other side of it is that the current selection
> between pentium4/core2/k8/bonnell/generic is not much better,
> given that in practice nobody has any of the
> pentium4/core2/k8/bonnell variants any more.

Just booted a Bonnell device a day ago (WeTab); a pity that it has an
old kernel and I have no time to try anything recent on it...

(Just saying :-)

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 23:33       ` Linus Torvalds
@ 2024-12-05  8:13         ` Andy Shevchenko
  2024-12-05 10:09           ` Arnd Bergmann
  2024-12-05  9:46         ` Arnd Bergmann
  1 sibling, 1 reply; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-05  8:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnd Bergmann, Arnd Bergmann, linux-kernel, x86, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Wed, Dec 04, 2024 at 03:33:19PM -0800, Linus Torvalds wrote:
> On Wed, 4 Dec 2024 at 11:44, Arnd Bergmann <arnd@arndb.de> wrote:

...

> Will that work when you cross-compile? No. Do we care? Also no. It's
> basically a simple "you want to optimize for your own local machine"
> switch.

Maybe it's okay for 64-bit machines, but what about cross-compiling for 32-bit
on 64-bit? I dunno what '-march=native -m32' (or equivalent) will give in such
cases.

> Maybe that could replace some of the 32-bit choices too?

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 23:33       ` Linus Torvalds
  2024-12-05  8:13         ` Andy Shevchenko
@ 2024-12-05  9:46         ` Arnd Bergmann
  2024-12-05 10:01           ` Andy Shevchenko
  1 sibling, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-05  9:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnd Bergmann, linux-kernel, x86, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Thu, Dec 5, 2024, at 00:33, Linus Torvalds wrote:
> On Wed, 4 Dec 2024 at 11:44, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> I guess the other side of it is that the current selection
>> between pentium4/core2/k8/bonnell/generic is not much better,
>> given that in practice nobody has any of the
>> pentium4/core2/k8/bonnell variants any more.
>
> So I suspect:
>
>> A more radical solution would be to just drop the entire
>> menu for 64-bit kernels and always default to "-march=x86_64
>> -mtune=generic" and 64 byte L1 cachelines.
>
> would actually be perfectly acceptable. The non-generic choices are
> all entirely historical and not really very interesting.
>
> Absolutely nobody sane cares about instruction scheduling for the old P4 cores.

Ok, I'll do that instead then. This also means I can drop
the patch for CONFIG_MATOM.

> In the bad old 32-bit days, we had real code generation issues with
> basic instruction set, ie the whole "some CPU's are P6-class, but
> don't actually support the CMOVxx instruction". Those days are gone.

I did come across a remaining odd problem with this, as Crusoe and
Geode LX both identify as Family 5 but have CMOV.  Trying to use
a CONFIG_M686+CONFIG_X86_GENERIC kernel on these fails with a boot
error "This kernel requires a 686 CPU but only detected a 586 CPU".

As a result, the Debian 686 kernel binary gets built with
CONFIG_MGEODE_LX, which seems mildly wrong but harmless enough
not to require a change in how we handle the levels.

> And yes, on x86-64, we still have the whole cmpxchg16b issue, which
> really is a slight annoyance. But the emphasis is on "slight" - we
> basically have one back for this in the SLAB code, and a couple of
> dynamic tests for one particular driver (iommu 128-bit IRTE mode).
>
> So yeah, the cmpxchg16b thing is annoying, but _realistically_ I don't
> think we care.
>
> And some day we will forget about it, notice that those (few) AMD
> early 64-bit CPU's can't possibly have been working for the last year
> or two, and we'll finally just kill that code, but in the meantime the
> cost of maintaining it is so slight that it's not worth actively going
> out to kill it.

Right, in particular my hope of turning the runtime detection for
cmpxchg16b into a pure compile-time configuration no longer works,
as I noticed that risc-v has also gained a runtime detection for
system_has_cmpxchg128().

Besides cmpxchg16b, I can also see compile-time configuration
for some instructions (popcnt, tzcnt, movbe) and for 5-level
paging being useful, but not enough so to make up for the
configuration complexity.

I still think we will end up needing more compile time
configurability like this on arm64 to deal with small-memory
embedded systems, e.g. with a specialized cortex-a55 kernel
that leaves out support for other CPUs, but this is quite
different from the situation on x86-64.

> I do think that the *one* option we might have is "optimize for the
> current CPU" for people who just want to build their own kernel for
> their own machine. That's a nice easy choice to give people, and
> '-march=native' is kind of simple to use.
>
> Will that work when you cross-compile? No. Do we care? Also no. It's
> basically a simple "you want to optimize for your own local machine"
> switch.

Sure, I'll add that as a separate patch. Should it be -march=native
or -mtune=native though? Using -march= can be faster if it picks
up newer instructions, but it will eventually lead to users
running into a boot panic if it is accidentally turned on for
a kernel that runs on an older machine than it was built on.

> Maybe that could replace some of the 32-bit choices too?

Probably not. I spent hours looking through the 32-bit choices
in the hope of finding a way that is less of a mess. The current
menu mixes up instruction set level (486/586/686), optimization
(atom/k7/m3/pentiumm) and platform (elan/geode/pc) options.
This is needlessly confusing, but any change to the status quo
is going to cause more problems for existing users than it
solves. All the "interesting" embedded ones are likely to be
cross-compiled anyway, so mtune=native or -march=native wouldn't
help them either.

     Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-05  9:46         ` Arnd Bergmann
@ 2024-12-05 10:01           ` Andy Shevchenko
  2024-12-05 10:47             ` Arnd Bergmann
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-05 10:01 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, Arnd Bergmann, linux-kernel, x86, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Thu, Dec 05, 2024 at 10:46:25AM +0100, Arnd Bergmann wrote:
> On Thu, Dec 5, 2024, at 00:33, Linus Torvalds wrote:
> > On Wed, 4 Dec 2024 at 11:44, Arnd Bergmann <arnd@arndb.de> wrote:

...

> I did come across a remaining odd problem with this, as Crusoe and
> GeodeLX both identify as Family 5 but have CMOV.  Trying to use
> a CONFIG_M686+CONFIG_X86_GENERIC on these runs fails with a boot
> error "This kernel requires a 686 CPU but only detected a 586 CPU".

It might be also that Intel Quark is affected same way.

> As a result, the Debian 686 kernel binary gets built with
> CONFIG_MGEODE_LX , which seems mildly wrong but harmful enough
> to require a change in how we handle the levels.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-05  8:13         ` Andy Shevchenko
@ 2024-12-05 10:09           ` Arnd Bergmann
  2024-12-05 11:17             ` Andy Shevchenko
  0 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-05 10:09 UTC (permalink / raw)
  To: Andy Shevchenko, Linus Torvalds
  Cc: Arnd Bergmann, linux-kernel, x86, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm

On Thu, Dec 5, 2024, at 09:13, Andy Shevchenko wrote:
> On Wed, Dec 04, 2024 at 03:33:19PM -0800, Linus Torvalds wrote:
>> On Wed, 4 Dec 2024 at 11:44, Arnd Bergmann <arnd@arndb.de> wrote:
>
> ...
>
>> Will that work when you cross-compile? No. Do we care? Also no. It's
>> basically a simple "you want to optimize for your own local machine"
>> switch.
>
> Maybe it's okay for 64-bit machines, but for cross-compiling for 32-bit on
> 64-bit. I dunno what '-march=native -m32' (or equivalent) will give in such
> cases.

From the compiler's perspective this is nothing special, it just
builds a 32-bit binary that can use any instruction supported in
32-bit mode of that 64-bit CPU, the same as the 32-bit CONFIG_MCORE2
option that I disallow in patch 04/11.

     Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-05 10:01           ` Andy Shevchenko
@ 2024-12-05 10:47             ` Arnd Bergmann
  0 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-05 10:47 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Linus Torvalds, Arnd Bergmann, linux-kernel, x86, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Thu, Dec 5, 2024, at 11:01, Andy Shevchenko wrote:
> On Thu, Dec 05, 2024 at 10:46:25AM +0100, Arnd Bergmann wrote:
>> On Thu, Dec 5, 2024, at 00:33, Linus Torvalds wrote:
>> > On Wed, 4 Dec 2024 at 11:44, Arnd Bergmann <arnd@arndb.de> wrote:
>
> ...
>
>> I did come across a remaining odd problem with this, as Crusoe and
>> GeodeLX both identify as Family 5 but have CMOV.  Trying to use
>> a CONFIG_M686+CONFIG_X86_GENERIC on these runs fails with a boot
>> error "This kernel requires a 686 CPU but only detected a 586 CPU".
>
> It might be also that Intel Quark is affected same way.

No, as far as I can tell, Quark correctly identifies as Family 5
and is lacking CMOV. It does seem though that it's currently
impossible to configure a kernel for Quark that uses PAE/NX,
because there is no CONFIG_MQUARK and it relies on building
with CONFIG_M586TSC. If anyone still cared enough about it,
they could probably add an MQUARK option that lets you
build the kernel with -march=i586 -mtune=i486 and
optional PAE.
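
A hypothetical MQUARK option along these lines might look like the
sketch below; the symbol name, prompt and help wording are assumptions
for illustration, only the -march=i586 -mtune=i486 flags and the
optional PAE come from the description above:

```
# arch/x86/Kconfig.cpu (sketch, hypothetical symbol)
config MQUARK
	bool "Intel Quark"
	depends on X86_32
	help
	  Optimize for the Intel Quark X1000, which executes i586-class
	  code but lacks CMOV, while still allowing PAE/NX to be enabled.

# arch/x86/Makefile (sketch)
cflags-$(CONFIG_MQUARK) += -march=i586 -mtune=i486
```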

The only other one that perhaps gets misidentified is the IDT
Winchip, which is claimed to support cmpxchg8b but only
identifies as Family 4. It's even less likely that anyone
cares about this one than the Quark.

     Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-05 10:09           ` Arnd Bergmann
@ 2024-12-05 11:17             ` Andy Shevchenko
  2024-12-05 11:58               ` Arnd Bergmann
  0 siblings, 1 reply; 78+ messages in thread
From: Andy Shevchenko @ 2024-12-05 11:17 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, Arnd Bergmann, linux-kernel, x86, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Thu, Dec 05, 2024 at 11:09:41AM +0100, Arnd Bergmann wrote:
> On Thu, Dec 5, 2024, at 09:13, Andy Shevchenko wrote:
> > On Wed, Dec 04, 2024 at 03:33:19PM -0800, Linus Torvalds wrote:
> >> On Wed, 4 Dec 2024 at 11:44, Arnd Bergmann <arnd@arndb.de> wrote:

...

> >> Will that work when you cross-compile? No. Do we care? Also no. It's
> >> basically a simple "you want to optimize for your own local machine"
> >> switch.
> >
> > Maybe it's okay for 64-bit machines, but for cross-compiling for 32-bit on
> > 64-bit. I dunno what '-march=native -m32' (or equivalent) will give in such
> > cases.
> 
> From the compiler's perspective this is nothing special, it just
> builds a 32-bit binary that can use any instruction supported in
> 32-bit mode of that 64-bit CPU,

But does this affect building, e.g., for Quark on my Skylake desktop?

> the same as the 32-bit CONFIG_MCORE2 option that I disallow in patch 04/11.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-05 11:17             ` Andy Shevchenko
@ 2024-12-05 11:58               ` Arnd Bergmann
  2024-12-05 12:35                 ` Jason A. Donenfeld
  0 siblings, 1 reply; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-05 11:58 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Linus Torvalds, Arnd Bergmann, linux-kernel, x86, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Thu, Dec 5, 2024, at 12:17, Andy Shevchenko wrote:
> On Thu, Dec 05, 2024 at 11:09:41AM +0100, Arnd Bergmann wrote:
>> On Thu, Dec 5, 2024, at 09:13, Andy Shevchenko wrote:
>> > On Wed, Dec 04, 2024 at 03:33:19PM -0800, Linus Torvalds wrote:
>> >> On Wed, 4 Dec 2024 at 11:44, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> >> Will that work when you cross-compile? No. Do we care? Also no. It's
>> >> basically a simple "you want to optimize for your own local machine"
>> >> switch.
>> >
>> > Maybe it's okay for 64-bit machines, but for cross-compiling for 32-bit on
>> > 64-bit. I dunno what '-march=native -m32' (or equivalent) will give in such
>> > cases.
>> 
>> From the compiler's perspective this is nothing special, it just
>> builds a 32-bit binary that can use any instruction supported in
>> 32-bit mode of that 64-bit CPU,
>
> But does this affect building, e.g., for Quark on my Skylake desktop?

Not at the moment:

- the bug I'm fixing in the patch at hand is currently only present
  when building 64-bit kernels

- For a 64-bit target such as a Pineview Atom, it's only a problem
  if the toolchain default is -march=native and you build with
  CONFIG_GENERIC_CPU

- If we add support for configuring -march=native and you build
  using that option on a Skylake host, that would be equally
  broken for 32-bit Quark or 64-bit Pineview targets that are
  lacking some of the instructions present in Skylake.

As I said earlier, I don't think we should offer the 'native'
option for 32-bit targets at all. For 64-bit, we either decide
it's a user error to enable -march=native, or change it to
-mtune=native to avoid the problem.

     Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-05 11:58               ` Arnd Bergmann
@ 2024-12-05 12:35                 ` Jason A. Donenfeld
  0 siblings, 0 replies; 78+ messages in thread
From: Jason A. Donenfeld @ 2024-12-05 12:35 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andy Shevchenko, Linus Torvalds, Arnd Bergmann, linux-kernel, x86,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm, sultan

On Thu, Dec 05, 2024 at 12:58:22PM +0100, Arnd Bergmann wrote:
> As I said earlier, I don't think we should offer the 'native'
> option for 32-bit targets at all. For 64-bit, we either decide
> it's a user error to enable -march=native, or change it to
> -mtune=native to avoid the problem.

I've been building my laptop's kernel with -march=native for years, and
I'd be happy if this capability were upstream.

Jason

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only
  2024-12-04 18:55   ` Andy Shevchenko
  2024-12-04 20:38     ` Arnd Bergmann
@ 2024-12-06 11:23     ` Ferry Toth
  2024-12-06 14:27       ` Arnd Bergmann
  1 sibling, 1 reply; 78+ messages in thread
From: Ferry Toth @ 2024-12-06 11:23 UTC (permalink / raw)
  To: Andy Shevchenko, Arnd Bergmann
  Cc: linux-kernel, x86, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

Hi,

Op 04-12-2024 om 19:55 schreef Andy Shevchenko:
> +Cc: Ferry
>
> On Wed, Dec 4, 2024 at 12:31 PM Arnd Bergmann <arnd@kernel.org> wrote:
>> From: Arnd Bergmann <arnd@arndb.de>
>>
>> The X86_INTEL_MID code was originally introduced for the
>> 32-bit Moorestown/Medfield/Clovertrail platform, later the 64-bit
>> Merrifield/Moorefield variant got added, but the final
> variant got --> variants were
>
>> Morganfield/Broxton 14nm chips were canceled before they hit
>> the market.
> Inaccurate. "Broxton for Mobile", and not "Broxton" in general.
>
>
>> To help users understand what the option actually refers to,
>> update the help text, and make it a hard dependency on 64-bit
>> kernels. While they could theoretically run a 32-bit kernel,
>> the devices originally shipped with 64-bit one in 2015, so that
>> was proabably never tested.
> probably
>
> It's the other way around (from SW point of view). For unknown reasons
> Intel decided to release only 32-bit SW and it became the only thing
> that was heavily tested (despite misunderstanding by some developers
> that pointed finger to the HW without researching the issue that
> appears to be purely software in a few cases) _that_ time.  Starting
> ca. 2017 I enabled 64-bit for Merrifield and from then it's being used
> by both 32- and 64-bit builds.
>
> I'm totally fine to drop 32-bit defaults for Merrifield/Moorefield,
> but let's hear Ferry who might/may still have a use case for that.

Due to the design of SLM I found (and it is also documented in Intel's
HW documentation) that there is a penalty introduced when executing
certain instructions in 64b mode. The one I found is crc32di, running
slower than 2 crc32si in series. Then there are other instructions that
seem to run faster in 64b mode.

And there is of course the usual limited memory space that could benefit
from 32b mode. I never tried the mixed (x86_32?) mode. But I am building
and testing both i686 and x86_64 for each Edison image.

I think that should at minimum be useful to catch 32b errors in the
kernel in certain areas (shared with other 32b archs). So, I would
prefer 32b support for this platform to continue.


> ...
>
>> -               Moorestown MID devices
> FTR, a year or so ago it was a (weak) interest to revive Medfield, but
> I think it would require too much work even for the person who is
> quite familiar with HW, U-Boot, and Linux kernel, so it is most
> unlikely to happen.
>
> ...
>
>>            Select to build a kernel capable of supporting Intel MID (Mobile
>>            Internet Device) platform systems which do not have the PCI legacy
>> -         interfaces. If you are building for a PC class system say N here.
>> +         interfaces.
>> +
>> +         The only supported devices are the 22nm Merrifield (Z34xx) and
>> +         Moorefield (Z35xx) SoC used in Android devices such as the
>> +         Asus Zenfone 2, Asus FonePad 8 and Dell Venue 7.
> The list is missing the Intel Edison DIY platform which is probably
> the main user of Intel MID kernels nowadays.
Despite the Dell Venue 7 originally running a 32b Android kernel (I 
think), I got it run linux/Yocto in 64 bits.
> ...
>
>> -         Intel MID platforms are based on an Intel processor and chipset which
>> -         consume less power than most of the x86 derivatives.
> Why remove this? AFAIK it states the truth.
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* RE: [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags
  2024-12-04 10:30 ` [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags Arnd Bergmann
                     ` (2 preceding siblings ...)
  2024-12-04 18:10   ` Linus Torvalds
@ 2024-12-06 13:56   ` David Laight
  3 siblings, 0 replies; 78+ messages in thread
From: David Laight @ 2024-12-06 13:56 UTC (permalink / raw)
  To: 'Arnd Bergmann', linux-kernel@vger.kernel.org,
	x86@kernel.org
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm@vger.kernel.org

From: Arnd Bergmann
> Sent: 04 December 2024 10:31
> Building an x86-64 kernel with CONFIG_GENERIC_CPU is documented to
> run on all CPUs, but the Makefile does not actually pass an -march=
> argument, instead relying on the default that was used to configure
> the toolchain.
> 
> In many cases, gcc will be configured to -march=x86-64 or -march=k8
> for maximum compatibility, but in other cases a distribution default
> may be either raised to a more recent ISA, or set to -march=native
> to build for the CPU used for compilation. This still works in the
> case of building a custom kernel for the local machine.
> 
> The point where it breaks down is building a kernel for another
> machine that is older than the default target. Changing the default
> to -march=x86-64 would make it work reliably, but possibly produce
> worse code on distros that intentionally default to a newer ISA.
> 
> To allow reliably building a kernel for either the oldest x86-64
> CPUs or a more recent level, add three separate options for
> v1, v2 and v3 of the architecture as defined by gcc and clang
> and make them all turn on CONFIG_GENERIC_CPU. Based on this it
> should be possible to change runtime feature detection into
> build-time detection for things like cmpxchg16b, or possibly
> gate features that are only available on older architectures.
> 
> Link: https://lists.llvm.org/pipermail/llvm-dev/2020-July/143289.html
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  arch/x86/Kconfig.cpu | 39 ++++++++++++++++++++++++++++++++++-----
>  arch/x86/Makefile    |  6 ++++++
>  2 files changed, 40 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
> index 139db904e564..1461a739237b 100644
> --- a/arch/x86/Kconfig.cpu
> +++ b/arch/x86/Kconfig.cpu
> @@ -260,7 +260,7 @@ endchoice
>  choice
>  	prompt "x86-64 Processor family"
>  	depends on X86_64
> -	default GENERIC_CPU
> +	default X86_64_V2
>  	help
>  	  This is the processor type of your CPU. This information is
>  	  used for optimizing purposes. In order to compile a kernel
> @@ -314,15 +314,44 @@ config MSILVERMONT
>  	  early Atom CPUs based on the Bonnell microarchitecture,
>  	  such as Atom 230/330, D4xx/D5xx, D2xxx, N2xxx or Z2xxx.
> 
> -config GENERIC_CPU
> -	bool "Generic-x86-64"
> +config X86_64_V1
> +	bool "Generic x86-64"
>  	depends on X86_64
>  	help
> -	  Generic x86-64 CPU.
> -	  Run equally well on all x86-64 CPUs.
> +	  Generic x86-64-v1 CPU.
> +	  Run equally well on all x86-64 CPUs, including early Pentium-4
> +	  variants lacking the sahf and cmpxchg16b instructions as well
> +	  as the AMD K8 and Intel Core 2 lacking popcnt.

The 'equally well' text was clearly always wrong (equally badly?)
but is now just 'plain wrong'.
Perhaps:
	Runs on all x86-64 CPUs including early CPUs that lack the sahf,
	cmpxchg16b and popcnt instructions.

Then for V2 (or whatever it gets called)
	Requires support for the sahf, cmpxchg16b and popcnt instructions.
	This will not run on AMD K8 or Intel before Sandy bridge.

I think someone suggested that run-time detection of AVX/AVX2/AVX512
is fine?

	David

> +
> +config X86_64_V2
> +	bool "Generic x86-64 v2"
> +	depends on X86_64
> +	help
> +	  Generic x86-64-v2 CPU.
> +	  Run equally well on all x86-64 CPUs that meet the x86-64-v2
> +	  definition as well as those that only miss the optional
> +	  SSE3/SSSE3/SSE4.1 portions.
> +	  Examples of this include Intel Nehalem and Silvermont,
> +	  AMD Bulldozer (K10) and Jaguar as well as VIA Nano that
> +	  include popcnt, cmpxchg16b and sahf.
> +
> +config X86_64_V3
> +	bool "Generic x86-64 v3"
> +	depends on X86_64
> +	help
> +	  Generic x86-64-v3 CPU.
> +	  Run equally well on all x86-64 CPUs that meet the x86-64-v3
> +	  definition as well as those that only miss the optional
> +	  AVX/AVX2 portions.
> +	  Examples of this include the Intel Haswell and AMD Excavator
> +	  microarchitectures that include the bmi1/bmi2, lzcnt, movbe
> +	  and xsave instruction set extensions.
> 
>  endchoice
> 
> +config GENERIC_CPU
> +	def_bool X86_64_V1 || X86_64_V2 || X86_64_V3
> +
>  config X86_GENERIC
>  	bool "Generic x86 support"
>  	depends on X86_32
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 05887ae282f5..1fdc3fc6a54e 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -183,6 +183,9 @@ else
>          cflags-$(CONFIG_MPSC)		+= -march=nocona
>          cflags-$(CONFIG_MCORE2)		+= -march=core2
>          cflags-$(CONFIG_MSILVERMONT)	+= -march=silvermont
> +        cflags-$(CONFIG_MX86_64_V1)	+= -march=x86-64
> +        cflags-$(CONFIG_MX86_64_V2)	+= $(call cc-option,-march=x86-64-v2,-march=x86-64)
> +        cflags-$(CONFIG_MX86_64_V3)	+= $(call cc-option,-march=x86-64-v3,-march=x86-64)
>          cflags-$(CONFIG_GENERIC_CPU)	+= -mtune=generic
>          KBUILD_CFLAGS += $(cflags-y)
> 
> @@ -190,6 +193,9 @@ else
>          rustflags-$(CONFIG_MPSC)	+= -Ctarget-cpu=nocona
>          rustflags-$(CONFIG_MCORE2)	+= -Ctarget-cpu=core2
>          rustflags-$(CONFIG_MSILVERMONT)	+= -Ctarget-cpu=silvermont
> +        rustflags-$(CONFIG_MX86_64_V1)	+= -Ctarget-cpu=x86-64
> +        rustflags-$(CONFIG_MX86_64_V2)	+= -Ctarget-cpu=x86-64-v2
> +        rustflags-$(CONFIG_MX86_64_V3)	+= -Ctarget-cpu=x86-64-v3
>          rustflags-$(CONFIG_GENERIC_CPU)	+= -Ztune-cpu=generic
>          KBUILD_RUSTFLAGS += $(rustflags-y)
> 
> --
> 2.39.5
> 

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only
  2024-12-06 11:23     ` Ferry Toth
@ 2024-12-06 14:27       ` Arnd Bergmann
  0 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2024-12-06 14:27 UTC (permalink / raw)
  To: Ferry Toth, Andy Shevchenko, Arnd Bergmann
  Cc: linux-kernel, x86, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm

On Fri, Dec 6, 2024, at 12:23, Ferry Toth wrote:
> Op 04-12-2024 om 19:55 schreef Andy Shevchenko:
>>
>> It's the other way around (from SW point of view). For unknown reasons
>> Intel decided to release only 32-bit SW and it became the only thing
>> that was heavily tested (despite misunderstanding by some developers
>> that pointed finger to the HW without researching the issue that
>> appears to be purely software in a few cases) _that_ time.  Starting
>> ca. 2017 I enabled 64-bit for Merrifield and from then it's being used
>> by both 32- and 64-bit builds.
>>
>> I'm totally fine to drop 32-bit defaults for Merrifield/Moorefield,
>> but let's hear Ferry who might/may still have a use case for that.
>
> Do to the design of SLM if found (and it is also documented in Intel's 
> HW documentation)
>
> that there is a penalty introduced when executing certain instructions 
> in 64b mode. The one I found
>
> is crc32di, running slower than 2 crc32si in series. Then there are 
> other instructions seem to runs faster in 64b mode.
>
> And there is of course the usual limited memory space than could benefit 
> for 32b mode. I never tried the mixed (x86_32?)
>
> mode. But I am building and testing both i686 and x86_64 for each Edison 
> image.

Hi Ferry,

Thanks a lot for the detailed reply, this is exactly the kind of
information I was hoping to get out of my series, in particular
since we have a lot of the same tradeoffs on low-end 64-bit
Arm platforms, and I've been trying to push users toward running
64-bit kernels on those.

I generally think that it makes a lot of sense to run 32-bit
userspace on memory limited devices, in particular with less
than 512MB, but it's often still useful on devices with 1GB.

Running a 32-bit kernel is usually not worth it if you can
avoid it, and with 1GB of RAM you definitely run into limits
either from using HIGHMEM (with CONFIG_VMSPLIT_3G) or in
user addressing (with any other VMSPLIT_*), in addition to the
32-bit kernels just being less well maintained and missing
security features.

Using a 64-bit kernel with CONFIG_COMPAT for 32-bit userspace
tends to be the best combination for a large number of
embedded workloads. As a rough estimate on Arm hardware,
I found that a 64-bit kernel tends to use close to twice
the amount of RAM for itself (vmlinux, slab caches, page
tables, mem_map[]) compared to a 32-bit kernel, but this
should be no more than 10-20% of the total RAM for sensible
workloads as all the interesting bits happen in userland.
I expect the numbers to be similar for x86, but have not
looked in detail.

In userspace there is more variation depending on the type
of application: the base system has a similar 2x ratio, but
once you get into data intensive tasks (file server,
networking, image/video processing, ...) the overhead of
64-bit userspace is lower because the size of the actual
data is the same on both.

For the specific case of the crc32di instruction, I
suspect the in-kernel version of this can be trivially
changed like

diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c
index 52c5d47ef5a1..60b9b3cab679 100644
--- a/arch/x86/crypto/crc32c-intel_glue.c
+++ b/arch/x86/crypto/crc32c-intel_glue.c
@@ -60,10 +60,10 @@ static u32 __pure crc32c_intel_le_hw(u32 crc, unsigned char const *p, size_t len
 {
        unsigned int iquotient = len / SCALE_F;
        unsigned int iremainder = len % SCALE_F;
-       unsigned long *ptmp = (unsigned long *)p;
+       unsigned int *ptmp = (unsigned int *)p;
 
        while (iquotient--) {
-               asm(CRC32_INST
+               asm("crc32l %1, %0"
                    : "+r" (crc) : "rm" (*ptmp));
                ptmp++;
        }

to get you the faster version, plus some form of
configurability to make sure other CPUs still get the
crc32q version by default.

> I think that should at minimum be useful to catch 32b errors in the 
> kernel in certain areas (shared with other 32b
> archs. So, I would prefer 32b support for this platform to continue.

I can certainly see this both ways, on the one hand I do
care a lot about 32-bit Arm platforms and appreciate the help
in finding issues on 32-bit kernels. On the other hand I
really don't want anyone to waste time testing something that
should never be used in practice and keeping a feature in
the kernel only for the purpose of regression testing that
feature.

The platform is also special enough that I don't see
testing it in 32-bit mode as particularly helpful to
others, and it's unlikely to catch bugs that testing in
KVM won't.

Testing your 32-bit userland with a 64-bit kernel would be
helpful of course to ensure it keeps working for anyone
that had been using 32-bit kernel+userspace if we drop
32-bit kernel support for it.

One related idea that I've discussed before is to have
32-bit kernels refuse to boot on 64-bit hardware and
instead print the URL of a wiki page to explain all of
the above. There would probably have to be a whitelist
of platforms that are buggy in 64-bit mode, and a command
line option to revert back to the previous behavior
to allow testing.

       Arnd

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2024-12-04 10:30 ` [PATCH 05/11] x86: remove HIGHMEM64G support Arnd Bergmann
  2024-12-04 13:29   ` Brian Gerst
@ 2025-04-11 23:44   ` Dave Hansen
  2025-04-12  8:39     ` Ingo Molnar
                       ` (2 more replies)
  1 sibling, 3 replies; 78+ messages in thread
From: Dave Hansen @ 2025-04-11 23:44 UTC (permalink / raw)
  To: Arnd Bergmann, linux-kernel, x86
  Cc: Arnd Bergmann, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Linus Torvalds, Andy Shevchenko,
	Matthew Wilcox, Sean Christopherson, Davide Ciminaghi,
	Paolo Bonzini, kvm, Mike Rapoport

Has anyone run into any problems on 6.15-rc1 with this stuff?

0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
wasn't allocated, thus the oops. Looks like the memblock for the >4GB
memory didn't get removed although the pgdats seem correct.

I'll dig into it some more. Just wanted to make sure there wasn't a fix
out there already.

The way I'm triggering this is booting qemu with a 32-bit PAE kernel,
and "-m 4096" (or more).

> [    0.003806] Warning: only 4GB will be used. Support for for CONFIG_HIGHMEM64G was removed!
...
> [    0.561310] BUG: unable to handle page fault for address: f75fe000
> [    0.562226] #PF: supervisor write access in kernel mode
> [    0.562947] #PF: error_code(0x0002) - not-present page
> [    0.563653] *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000 
> [    0.564728] Oops: Oops: 0002 [#1] SMP NOPTI
> [    0.565315] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef) 
> [    0.567428] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> [    0.568777] EIP: __free_pages_core+0x3c/0x74
> [    0.569378] Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b 03 c1
> [    0.571943] EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
> [    0.572806] ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
> [    0.573776] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
> [    0.574606] CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
> [    0.575464] Call Trace:
> [    0.575816]  memblock_free_pages+0x11/0x2c
> [    0.576392]  memblock_free_all+0x2ce/0x3a0
> [    0.576955]  mm_core_init+0xf5/0x320
> [    0.577423]  start_kernel+0x296/0x79c
> [    0.577950]  ? set_init_arg+0x70/0x70
> [    0.578478]  ? load_ucode_bsp+0x13c/0x1a8
> [    0.579059]  i386_start_kernel+0xad/0xb0
> [    0.579614]  startup_32_smp+0x151/0x154
> [    0.580100] Modules linked in:
> [    0.580358] CR2: 00000000f75fe000
> [    0.580630] ---[ end trace 0000000000000000 ]---
> [    0.581111] EIP: __free_pages_core+0x3c/0x74
> [    0.581455] Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b 03 c1
> [    0.584767] EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
> [    0.585651] ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
> [    0.586530] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
> [    0.587480] CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
> [    0.588344] Kernel panic - not syncing: Attempted to kill the idle task!
> [    0.589435] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2025-04-11 23:44   ` Dave Hansen
@ 2025-04-12  8:39     ` Ingo Molnar
  2025-04-12 10:05     ` Mike Rapoport
  2025-04-12 10:40     ` [PATCH 05/11] x86: remove HIGHMEM64G support Arnd Bergmann
  2 siblings, 0 replies; 78+ messages in thread
From: Ingo Molnar @ 2025-04-12  8:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Arnd Bergmann, linux-kernel, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm,
	Mike Rapoport


* Dave Hansen <dave.hansen@intel.com> wrote:

> Has anyone run into any problems on 6.15-rc1 with this stuff?
> 
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It 
> obviously wasn't allocated, thus the oops. Looks like the memblock 
> for the >4GB memory didn't get removed although the pgdats seem 
> correct.
> 
> I'll dig into it some more. Just wanted to make sure there wasn't a 
> fix out there already.

Not that I'm aware of.

> The way I'm triggering this is booting qemu with a 32-bit PAE kernel, 
> and "-m 4096" (or more).

That's a new regression indeed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2025-04-11 23:44   ` Dave Hansen
  2025-04-12  8:39     ` Ingo Molnar
@ 2025-04-12 10:05     ` Mike Rapoport
  2025-04-12 10:44       ` Arnd Bergmann
  2025-04-12 19:48       ` Ingo Molnar
  2025-04-12 10:40     ` [PATCH 05/11] x86: remove HIGHMEM64G support Arnd Bergmann
  2 siblings, 2 replies; 78+ messages in thread
From: Mike Rapoport @ 2025-04-12 10:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Arnd Bergmann, linux-kernel, x86, Arnd Bergmann, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm

On Fri, Apr 11, 2025 at 04:44:13PM -0700, Dave Hansen wrote:
> Has anyone run into any problems on 6.15-rc1 with this stuff?
> 
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
> wasn't allocated, thus the oops. Looks like the memblock for the >4GB
> memory didn't get removed although the pgdats seem correct.

That's apparently because of 6faea3422e3b ("arch, mm: streamline HIGHMEM
freeing"). 
Freeing of high memory was clamped to the end of ZONE_HIGHMEM which is 4G
and after 6faea3422e3b there's no more clamping, so memblock_free_all()
tries to free memory >4G as well.
 
> I'll dig into it some more. Just wanted to make sure there wasn't a fix
> out there already.

This should fix it.

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..4b24c0ccade4 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1300,6 +1300,8 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	memblock_remove(PFN_PHYS(max_pfn), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 
> The way I'm triggering this is booting qemu with a 32-bit PAE kernel,
> and "-m 4096" (or more).
> 
> > [    0.003806] Warning: only 4GB will be used. Support for for CONFIG_HIGHMEM64G was removed!
> ...
> > [    0.561310] BUG: unable to handle page fault for address: f75fe000
> > [    0.562226] #PF: supervisor write access in kernel mode
> > [    0.562947] #PF: error_code(0x0002) - not-present page
> > [    0.563653] *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000 
> > [    0.564728] Oops: Oops: 0002 [#1] SMP NOPTI
> > [    0.565315] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef) 
> > [    0.567428] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> > [    0.568777] EIP: __free_pages_core+0x3c/0x74
> > [    0.569378] Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b 03 c1
> > [    0.571943] EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
> > [    0.572806] ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
> > [    0.573776] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
> > [    0.574606] CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
> > [    0.575464] Call Trace:
> > [    0.575816]  memblock_free_pages+0x11/0x2c
> > [    0.576392]  memblock_free_all+0x2ce/0x3a0
> > [    0.576955]  mm_core_init+0xf5/0x320
> > [    0.577423]  start_kernel+0x296/0x79c
> > [    0.577950]  ? set_init_arg+0x70/0x70
> > [    0.578478]  ? load_ucode_bsp+0x13c/0x1a8
> > [    0.579059]  i386_start_kernel+0xad/0xb0
> > [    0.579614]  startup_32_smp+0x151/0x154
> > [    0.580100] Modules linked in:
> > [    0.580358] CR2: 00000000f75fe000
> > [    0.580630] ---[ end trace 0000000000000000 ]---
> > [    0.581111] EIP: __free_pages_core+0x3c/0x74
> > [    0.581455] Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b 03 c1
> > [    0.584767] EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
> > [    0.585651] ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
> > [    0.586530] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
> > [    0.587480] CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
> > [    0.588344] Kernel panic - not syncing: Attempted to kill the idle task!
> > [    0.589435] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2025-04-11 23:44   ` Dave Hansen
  2025-04-12  8:39     ` Ingo Molnar
  2025-04-12 10:05     ` Mike Rapoport
@ 2025-04-12 10:40     ` Arnd Bergmann
  2 siblings, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2025-04-12 10:40 UTC (permalink / raw)
  To: Dave Hansen, Arnd Bergmann, linux-kernel, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm,
	Mike Rapoport, Mike Rapoport

On Sat, Apr 12, 2025, at 01:44, Dave Hansen wrote:
> Has anyone run into any problems on 6.15-rc1 with this stuff?
>
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
> wasn't allocated, thus the oops. Looks like the memblock for the >4GB
> memory didn't get removed although the pgdats seem correct.
>
> I'll dig into it some more. Just wanted to make sure there wasn't a fix
> out there already.
>
> The way I'm triggering this is booting qemu with a 32-bit PAE kernel,
> and "-m 4096" (or more).

I have reproduced the bug now and found that it did not happen in
my series. Bisection points to Mike Rapoport's highmem series,
specifically  6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")

There was a related bug that was caused by an earlier version
of my series when I also removed CONFIG_PHYS_ADDR_T_64BIT
https://lore.kernel.org/all/202412201005.77fb063-lkp@intel.com/

    Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2025-04-12 10:05     ` Mike Rapoport
@ 2025-04-12 10:44       ` Arnd Bergmann
  2025-04-12 19:48       ` Ingo Molnar
  1 sibling, 0 replies; 78+ messages in thread
From: Arnd Bergmann @ 2025-04-12 10:44 UTC (permalink / raw)
  To: Mike Rapoport, Dave Hansen
  Cc: Arnd Bergmann, linux-kernel, x86, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Linus Torvalds,
	Andy Shevchenko, Matthew Wilcox, Sean Christopherson,
	Davide Ciminaghi, Paolo Bonzini, kvm

On Sat, Apr 12, 2025, at 12:05, Mike Rapoport wrote:
> On Fri, Apr 11, 2025 at 04:44:13PM -0700, Dave Hansen wrote:
>> Has anyone run into any problems on 6.15-rc1 with this stuff?
>> 
>> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
>> wasn't allocated, thus the oops. Looks like the memblock for the >4GB
>> memory didn't get removed although the pgdats seem correct.
>
> That's apparently because of 6faea3422e3b ("arch, mm: streamline HIGHMEM
> freeing"). 
> Freeing of high memory was clamped to the end of ZONE_HIGHMEM which is 4G
> and after 6faea3422e3b there's no more clamping, so memblock_free_all()
> tries to free memory >4G as well.

Ah, I should have waited with my bisection; you found it first...

>> I'll dig into it some more. Just wanted to make sure there wasn't a fix
>> out there already.
>
> This should fix it.

Confirmed on 6.15-rc1.

     Arnd

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 05/11] x86: remove HIGHMEM64G support
  2025-04-12 10:05     ` Mike Rapoport
  2025-04-12 10:44       ` Arnd Bergmann
@ 2025-04-12 19:48       ` Ingo Molnar
  2025-04-13  8:08         ` [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems Mike Rapoport
  1 sibling, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2025-04-12 19:48 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Dave Hansen, Arnd Bergmann, linux-kernel, x86, Arnd Bergmann,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Linus Torvalds, Andy Shevchenko, Matthew Wilcox,
	Sean Christopherson, Davide Ciminaghi, Paolo Bonzini, kvm


* Mike Rapoport <rppt@kernel.org> wrote:

> On Fri, Apr 11, 2025 at 04:44:13PM -0700, Dave Hansen wrote:
> > Has anyone run into any problems on 6.15-rc1 with this stuff?
> > 
> > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It obviously
> > wasn't allocated, thus the oops. Looks like the memblock for the >4GB
> > memory didn't get removed although the pgdats seem correct.
> 
> That's apparently because of 6faea3422e3b ("arch, mm: streamline HIGHMEM
> freeing"). 
> Freeing of high memory was clamped to the end of ZONE_HIGHMEM which is 4G
> and after 6faea3422e3b there's no more clamping, so memblock_free_all()
> tries to free memory >4G as well.
>  
> > I'll dig into it some more. Just wanted to make sure there wasn't a fix
> > out there already.
> 
> This should fix it.
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 57120f0749cc..4b24c0ccade4 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1300,6 +1300,8 @@ void __init e820__memblock_setup(void)
>  		memblock_add(entry->addr, entry->size);
>  	}
>  
> +	memblock_remove(PFN_PHYS(max_pfn), -1);
> +
>  	/* Throw away partial pages: */
>  	memblock_trim_memory(PAGE_SIZE);

Mind sending a full patch with changelog, SOB, Arnd's Tested-by, Dave's 
Reported-by, etc.?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
  2025-04-12 19:48       ` Ingo Molnar
@ 2025-04-13  8:08         ` Mike Rapoport
  2025-04-13  9:23           ` [tip: x86/urgent] x86/e820: Discard " tip-bot2 for Mike Rapoport (Microsoft)
                             ` (4 more replies)
  0 siblings, 5 replies; 78+ messages in thread
From: Mike Rapoport @ 2025-04-13  8:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Shevchenko, Arnd Bergmann, Arnd Bergmann, Borislav Petkov,
	Dave Hansen, Dave Hansen, Davide Ciminaghi, Ingo Molnar,
	Linus Torvalds, Matthew Wilcox, H. Peter Anvin, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, kvm, linux-kernel, x86,
	Mike Rapoport (Microsoft)

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b

  EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
  ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
  CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   ? set_init_arg+0x70/0x70
   ? load_ucode_bsp+0x13c/0x1a8
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154
  Modules linked in:
  CR2: 00000000f75fe000

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

Before 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") freeing of
high memory was also clamped to the end of ZONE_HIGHMEM but after
6faea3422e3b memblock_free_all() tries to free memory above the end of
ZONE_HIGHMEM as well and that causes access to mem_map[] entries beyond
the end of the memory map.

Discard the memory after max_pfn from memblock on 32-bit systems so that
core MM would be aware only of actually usable memory.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/kernel/e820.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..5f673bd6c7d7 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	/*
+	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
+	 * to even less without it.
+	 * Discard memory after max_pfn - the actual limit detected at runtime.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32))
+		memblock_remove(PFN_PHYS(max_pfn), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 

base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
  2025-04-13  8:08         ` [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems Mike Rapoport
@ 2025-04-13  9:23           ` tip-bot2 for Mike Rapoport (Microsoft)
  2025-04-14 14:19             ` Dave Hansen
  2025-04-16  7:24           ` tip-bot2 for Mike Rapoport (Microsoft)
                             ` (3 subsequent siblings)
  4 siblings, 1 reply; 78+ messages in thread
From: tip-bot2 for Mike Rapoport (Microsoft) @ 2025-04-13  9:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dave Hansen, Arnd Bergmann, Mike Rapoport (Microsoft),
	Ingo Molnar, Andy Shevchenko, Arnd Bergmann, Davide Ciminaghi,
	H. Peter Anvin, Linus Torvalds, Matthew Wilcox, Paolo Bonzini,
	Sean Christopherson, kvm, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     3f0036c0b5f850d1200dbfa7365ed24197a0f157
Gitweb:        https://git.kernel.org/tip/3f0036c0b5f850d1200dbfa7365ed24197a0f157
Author:        Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate:    Sun, 13 Apr 2025 11:08:58 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 13 Apr 2025 11:09:39 +02:00

x86/e820: Discard high memory that can't be addressed by 32-bit systems

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  ...
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

The bug was introduced by this recent commit:

  6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")

Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.

To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.

Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org
---
 arch/x86/kernel/e820.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..9920122 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	/*
+	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
+	 * to even less without it.
+	 * Discard memory after max_pfn - the actual limit detected at runtime.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32))
+		memblock_remove(PFN_PHYS(max_pfn), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
  2025-04-13  9:23           ` [tip: x86/urgent] x86/e820: Discard " tip-bot2 for Mike Rapoport (Microsoft)
@ 2025-04-14 14:19             ` Dave Hansen
  2025-04-15  7:18               ` Mike Rapoport
  0 siblings, 1 reply; 78+ messages in thread
From: Dave Hansen @ 2025-04-14 14:19 UTC (permalink / raw)
  To: linux-kernel, linux-tip-commits
  Cc: Arnd Bergmann, Mike Rapoport (Microsoft), Ingo Molnar,
	Andy Shevchenko, Arnd Bergmann, Davide Ciminaghi, H. Peter Anvin,
	Linus Torvalds, Matthew Wilcox, Paolo Bonzini,
	Sean Christopherson, kvm, x86

On 4/13/25 02:23, tip-bot2 for Mike Rapoport (Microsoft) wrote:
> +	/*
> +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> +	 * to even less without it.
> +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_32))
> +		memblock_remove(PFN_PHYS(max_pfn), -1);

Mike, thanks for the quick fix! I did verify that this gets my silly
test VM booting again.

The patch obviously _works_. But in the case I was hitting, max_pfn was
set to MAX_NONPAE_PFN. The unfortunate part about this hunk is that it's
far away from the related warning:

>         if (max_pfn > MAX_NONPAE_PFN) {
>                 max_pfn = MAX_NONPAE_PFN;
>                 printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
>         }

and it's logically doing the same thing: truncating memory at
MAX_NONPAE_PFN.

How about we reuse 'MAX_NONPAE_PFN' like this:

	if (IS_ENABLED(CONFIG_X86_32))
		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);

Would that make the connection more obvious?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
  2025-04-14 14:19             ` Dave Hansen
@ 2025-04-15  7:18               ` Mike Rapoport
  2025-04-15 13:43                 ` Dave Hansen
  0 siblings, 1 reply; 78+ messages in thread
From: Mike Rapoport @ 2025-04-15  7:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-tip-commits, Arnd Bergmann, Ingo Molnar,
	Andy Shevchenko, Arnd Bergmann, Davide Ciminaghi, H. Peter Anvin,
	Linus Torvalds, Matthew Wilcox, Paolo Bonzini,
	Sean Christopherson, kvm, x86

On Mon, Apr 14, 2025 at 07:19:02AM -0700, Dave Hansen wrote:
> On 4/13/25 02:23, tip-bot2 for Mike Rapoport (Microsoft) wrote:
> > +	/*
> > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > +	 * to even less without it.
> > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_X86_32))
> > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> 
> Mike, thanks for the quick fix! I did verify that this gets my silly
> test VM booting again.
> 
> The patch obviously _works_. But in the case I was hitting, max_pfn was
> set to MAX_NONPAE_PFN. The unfortunate part about this hunk is that it's
> far away from the related warning:

Yeah, my first instinct was to put memblock_remove() in the same 'if',
but there's no memblock there yet :)
 
> >         if (max_pfn > MAX_NONPAE_PFN) {
> >                 max_pfn = MAX_NONPAE_PFN;
> >                 printk(KERN_WARNING MSG_HIGHMEM_TRIMMED);
> >         }
> 
> and it's logically doing the same thing: truncating memory at
> MAX_NONPAE_PFN.
> 
> How about we reuse 'MAX_NONPAE_PFN' like this:
> 
> 	if (IS_ENABLED(CONFIG_X86_32))
> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
> 
> Would that make the connection more obvious?

Yes, that's better. Here's the updated patch:

From a235764221e4a849fa274a546ff2a3d9f15da2a9 Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Date: Sun, 13 Apr 2025 10:36:17 +0300
Subject: [PATCH v2] x86/e820: discard high memory that can't be addressed by
 32-bit systems

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b

  EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
  ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
  CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   ? set_init_arg+0x70/0x70
   ? load_ucode_bsp+0x13c/0x1a8
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154
  Modules linked in:
  CR2: 00000000f75fe000

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

Before 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") freeing of
high memory was also clamped to the end of ZONE_HIGHMEM but after
6faea3422e3b memblock_free_all() tries to free memory above the end of
ZONE_HIGHMEM as well and that causes access to mem_map[] entries beyond
the end of the memory map.

Discard the memory after MAX_NONPAE_PFN from memblock on 32-bit systems
so that core MM would be aware only of actually usable memory.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 arch/x86/kernel/e820.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..5e6b1034e6f1 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1300,6 +1300,13 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	/*
+	 * Discard memory above 4GB because 32-bit systems are limited to 4GB
+	 * of memory even with HIGHMEM.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32))
+		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 
-- 
2.47.2

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
  2025-04-15  7:18               ` Mike Rapoport
@ 2025-04-15 13:43                 ` Dave Hansen
  2025-04-16  7:17                   ` Ingo Molnar
  0 siblings, 1 reply; 78+ messages in thread
From: Dave Hansen @ 2025-04-15 13:43 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, linux-tip-commits, Arnd Bergmann, Ingo Molnar,
	Andy Shevchenko, Arnd Bergmann, Davide Ciminaghi, H. Peter Anvin,
	Linus Torvalds, Matthew Wilcox, Paolo Bonzini,
	Sean Christopherson, kvm, x86

On 4/15/25 00:18, Mike Rapoport wrote:
>> How about we reuse 'MAX_NONPAE_PFN' like this:
>>
>> 	if (IS_ENABLED(CONFIG_X86_32))
>> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
>>
>> Would that make the connection more obvious?
> Yes, that's better. Here's the updated patch:

Looks great. Thanks for the update and the quick turnaround on the
first one after the bug report!

Tested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
  2025-04-15 13:43                 ` Dave Hansen
@ 2025-04-16  7:17                   ` Ingo Molnar
  2025-04-16  7:51                     ` Ingo Molnar
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2025-04-16  7:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mike Rapoport, linux-kernel, linux-tip-commits, Arnd Bergmann,
	Andy Shevchenko, Arnd Bergmann, Davide Ciminaghi, H. Peter Anvin,
	Linus Torvalds, Matthew Wilcox, Paolo Bonzini,
	Sean Christopherson, kvm, x86


* Dave Hansen <dave.hansen@intel.com> wrote:

> On 4/15/25 00:18, Mike Rapoport wrote:
> >> How about we reuse 'MAX_NONPAE_PFN' like this:
> >>
> >> 	if (IS_ENABLED(CONFIG_X86_32))
> >> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
> >>
> >> Would that make the connection more obvious?
> > Yes, that's better. Here's the updated patch:
> 
> Looks great. Thanks for the update and the quick turnaround on the
> first one after the bug report!
> 
> Tested-by: Dave Hansen <dave.hansen@intel.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>

I've amended the fix in tip:x86/urgent accordingly and added your tags, 
thanks!

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
  2025-04-13  8:08         ` [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems Mike Rapoport
  2025-04-13  9:23           ` [tip: x86/urgent] x86/e820: Discard " tip-bot2 for Mike Rapoport (Microsoft)
@ 2025-04-16  7:24           ` tip-bot2 for Mike Rapoport (Microsoft)
  2025-04-16  8:16           ` tip-bot2 for Mike Rapoport (Microsoft)
                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Mike Rapoport (Microsoft) @ 2025-04-16  7:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dave Hansen, Arnd Bergmann, Mike Rapoport (Microsoft),
	Ingo Molnar, Andy Shevchenko, Arnd Bergmann, Davide Ciminaghi,
	H. Peter Anvin, Linus Torvalds, Matthew Wilcox, Paolo Bonzini,
	Sean Christopherson, kvm, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     e71b6094c20f5dc9c43dc89af8a569ffa511d676
Gitweb:        https://git.kernel.org/tip/e71b6094c20f5dc9c43dc89af8a569ffa511d676
Author:        Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate:    Sun, 13 Apr 2025 11:08:58 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 16 Apr 2025 09:16:02 +02:00

x86/e820: Discard high memory that can't be addressed by 32-bit systems

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  ...
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

The bug was introduced by this recent commit:

  6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")

Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.

To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.

Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org
---
 arch/x86/kernel/e820.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..de62388 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,13 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	/*
+	 * Discard memory above 4GB because 32-bit systems are limited to 4GB
+	 * of memory even with HIGHMEM.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32))
+		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
  2025-04-16  7:17                   ` Ingo Molnar
@ 2025-04-16  7:51                     ` Ingo Molnar
  0 siblings, 0 replies; 78+ messages in thread
From: Ingo Molnar @ 2025-04-16  7:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mike Rapoport, linux-kernel, linux-tip-commits, Arnd Bergmann,
	Andy Shevchenko, Arnd Bergmann, Davide Ciminaghi, H. Peter Anvin,
	Linus Torvalds, Matthew Wilcox, Paolo Bonzini,
	Sean Christopherson, kvm, x86


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Dave Hansen <dave.hansen@intel.com> wrote:
> 
> > On 4/15/25 00:18, Mike Rapoport wrote:
> > >> How about we reuse 'MAX_NONPAE_PFN' like this:
> > >>
> > >> 	if (IS_ENABLED(CONFIG_X86_32))
> > >> 		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
> > >>
> > >> Would that make the connection more obvious?
> > > Yes, that's better. Here's the updated patch:
> > 
> > Looks great. Thanks for the update and the quick turnaround on the
> > first one after the bug report!
> > 
> > Tested-by: Dave Hansen <dave.hansen@intel.com>
> > Acked-by: Dave Hansen <dave.hansen@intel.com>
> 
> I've amended the fix in tip:x86/urgent accordingly and added your tags, 
> thanks!

So I had to apply the fix below as well, due to this build failure on 
x86-defconfig:

  arch/x86/kernel/e820.c:1307:42: error: ‘MAX_NONPAE_PFN’ undeclared (first use in this function); did you mean ‘MAX_DMA_PFN’?

IS_ENABLED(CONFIG_X86_32) can only be used when the code is 
syntactically correct on !CONFIG_X86_32 kernels too - which it wasn't.

So I went for the straightforward #ifdef block instead.

Thanks,

	Ingo

===========>
 arch/x86/kernel/e820.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index de6238886cb2..c984be8ee060 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,13 +1299,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+#ifdef CONFIG_X86_32
 	/*
 	 * Discard memory above 4GB because 32-bit systems are limited to 4GB
 	 * of memory even with HIGHMEM.
 	 */
-	if (IS_ENABLED(CONFIG_X86_32))
-		memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+	memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+#endif
 
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
  2025-04-13  8:08         ` [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems Mike Rapoport
  2025-04-13  9:23           ` [tip: x86/urgent] x86/e820: Discard " tip-bot2 for Mike Rapoport (Microsoft)
  2025-04-16  7:24           ` tip-bot2 for Mike Rapoport (Microsoft)
@ 2025-04-16  8:16           ` tip-bot2 for Mike Rapoport (Microsoft)
  2025-04-17 16:22           ` [PATCH] x86/e820: discard " Nathan Chancellor
  2025-04-18 19:49           ` Guenter Roeck
  4 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Mike Rapoport (Microsoft) @ 2025-04-16  8:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dave Hansen, Arnd Bergmann, Mike Rapoport (Microsoft),
	Ingo Molnar, Andy Shevchenko, Arnd Bergmann, Davide Ciminaghi,
	H. Peter Anvin, Linus Torvalds, Matthew Wilcox, Paolo Bonzini,
	Sean Christopherson, kvm, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     1e07b9fad022e0e02215150ca1e20912e78e8ec1
Gitweb:        https://git.kernel.org/tip/1e07b9fad022e0e02215150ca1e20912e78e8ec1
Author:        Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate:    Sun, 13 Apr 2025 11:08:58 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 16 Apr 2025 09:51:02 +02:00

x86/e820: Discard high memory that can't be addressed by 32-bit systems

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  ...
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

The bug was introduced by this recent commit:

  6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")

Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.

To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.

[ mingo: Fixed build failure. ]

Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org
---
 arch/x86/kernel/e820.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..2f38175 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+#ifdef CONFIG_X86_32
+	/*
+	 * Discard memory above 4GB because 32-bit systems are limited to 4GB
+	 * of memory even with HIGHMEM.
+	 */
+	memblock_remove(PFN_PHYS(MAX_NONPAE_PFN), -1);
+#endif
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
  2025-04-13  8:08         ` [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems Mike Rapoport
                             ` (2 preceding siblings ...)
  2025-04-16  8:16           ` tip-bot2 for Mike Rapoport (Microsoft)
@ 2025-04-17 16:22           ` Nathan Chancellor
  2025-04-18  6:33             ` Ingo Molnar
  2025-04-18 19:49           ` Guenter Roeck
  4 siblings, 1 reply; 78+ messages in thread
From: Nathan Chancellor @ 2025-04-17 16:22 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Ingo Molnar, Andy Shevchenko, Arnd Bergmann, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, Dave Hansen, Davide Ciminaghi,
	Ingo Molnar, Linus Torvalds, Matthew Wilcox, H. Peter Anvin,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, kvm,
	linux-kernel, x86

Hi Mike,

On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
...
>  arch/x86/kernel/e820.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 57120f0749cc..5f673bd6c7d7 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
>  		memblock_add(entry->addr, entry->size);
>  	}
>  
> +	/*
> +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> +	 * to even less without it.
> +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> +	 */
> +	if (IS_ENABLED(CONFIG_X86_32))
> +		memblock_remove(PFN_PHYS(max_pfn), -1);
> +
>  	/* Throw away partial pages: */
>  	memblock_trim_memory(PAGE_SIZE);

Our CI noticed a boot failure after this change as commit 1e07b9fad022
("x86/e820: Discard high memory that can't be addressed by 32-bit
systems") in -tip when booting i386_defconfig with a simple buildroot
initrd.

  $ make -skj"$(nproc)" ARCH=i386 CROSS_COMPILE=i386-linux- mrproper defconfig bzImage

  $ curl -LSs https://github.com/ClangBuiltLinux/boot-utils/releases/download/20241120-044434/x86-rootfs.cpio.zst | zstd -d >rootfs.cpio

  $ qemu-system-i386 \
      -display none \
      -nodefaults \
      -M q35 \
      -d unimp,guest_errors \
      -append 'console=ttyS0 earlycon=uart8250,io,0x3f8' \
      -kernel arch/x86/boot/bzImage \
      -initrd rootfs.cpio \
      -cpu host \
      -enable-kvm \
      -m 512m \
      -smp 8 \
      -serial mon:stdio
  [    0.000000] Linux version 6.15.0-rc1-00177-g1e07b9fad022 (nathan@ax162) (i386-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP PREEMPT_DYNAMIC Thu Apr 17 09:02:19 MST 2025
  [    0.000000] BIOS-provided physical RAM map:
  [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
  [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
  [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000001ffdffff] usable
  [    0.000000] BIOS-e820: [mem 0x000000001ffe0000-0x000000001fffffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
  [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
  [    0.000000] earlycon: uart8250 at I/O port 0x3f8 (options '')
  [    0.000000] printk: legacy bootconsole [uart8250] enabled
  [    0.000000] Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
  [    0.000000] APIC: Static calls initialized
  [    0.000000] SMBIOS 2.8 present.
  [    0.000000] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
  [    0.000000] DMI: Memory slots populated: 1/1
  [    0.000000] Hypervisor detected: KVM
  [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
  [    0.000000] kvm-clock: using sched offset of 196444860 cycles
  [    0.000589] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
  [    0.002401] tsc: Detected 2750.000 MHz processor
  [    0.003126] last_pfn = 0x1ffe0 max_arch_pfn = 0x100000
  [    0.003728] MTRR map: 4 entries (3 fixed + 1 variable; max 19), built from 8 variable MTRRs
  [    0.004664] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
  [    0.007149] found SMP MP-table at [mem 0x000f5480-0x000f548f]
  [    0.007802] No sub-1M memory is available for the trampoline
  [    0.008435] Failed to release memory for alloc_low_pages()
  [    0.008438] RAMDISK: [mem 0x1fa5f000-0x1ffdffff]
  [    0.009571] Kernel panic - not syncing: Cannot find place for new RAMDISK of size 5771264
  [    0.010486] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00177-g1e07b9fad022 #1 PREEMPT(undef)
  [    0.011601] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
  [    0.012857] Call Trace:
  [    0.013135]  dump_stack_lvl+0x43/0x58
  [    0.013555]  dump_stack+0xd/0x10
  [    0.013919]  panic+0xa5/0x221
  [    0.014252]  setup_arch+0x86f/0x9f0
  [    0.014650]  ? vprintk_default+0x29/0x30
  [    0.015089]  start_kernel+0x4b/0x570
  [    0.015487]  i386_start_kernel+0x65/0x68
  [    0.015919]  startup_32_smp+0x151/0x154
  [    0.016344] ---[ end Kernel panic - not syncing: Cannot find place for new RAMDISK of size 5771264 ]---

At the parent change with the same command, the boot completes fine.

  [    0.000000] Linux version 6.15.0-rc1-00176-gd466304c4322 (nathan@ax162) (i386-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP PREEMPT_DYNAMIC Thu Apr 17 09:00:12 MST 2025
  [    0.000000] BIOS-provided physical RAM map:
  ...
  [    0.000000] earlycon: uart8250 at I/O port 0x3f8 (options '')
  [    0.000000] printk: legacy bootconsole [uart8250] enabled
  [    0.000000] Notice: NX (Execute Disable) protection cannot be enabled: non-PAE kernel!
  [    0.000000] APIC: Static calls initialized
  [    0.000000] SMBIOS 2.8 present.
  [    0.000000] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
  [    0.000000] DMI: Memory slots populated: 1/1
  [    0.000000] Hypervisor detected: KVM
  [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
  [    0.000001] kvm-clock: using sched offset of 429786443 cycles
  [    0.000806] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
  [    0.003278] tsc: Detected 2750.000 MHz processor
  [    0.004730] last_pfn = 0x1ffe0 max_arch_pfn = 0x100000
  [    0.006220] MTRR map: 4 entries (3 fixed + 1 variable; max 19), built from 8 variable MTRRs
  [    0.009169] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
  [    0.012840] found SMP MP-table at [mem 0x000f5480-0x000f548f]
  [    0.014310] RAMDISK: [mem 0x1fa5f000-0x1ffdffff]
  [    0.015141] ACPI: Early table checksum verification disabled
  ...
  [    0.046564] 511MB LOWMEM available.
  [    0.047421]   mapped low ram: 0 - 1ffe0000
  [    0.048431]   low ram: 0 - 1ffe0000
  [    0.049289] Zone ranges:
  [    0.049934]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
  [    0.051184]   Normal   [mem 0x0000000001000000-0x000000001ffdffff]
  [    0.053087] Movable zone start for each node
  [    0.054409] Early memory node ranges
  [    0.055513]   node   0: [mem 0x0000000000001000-0x000000000009efff]
  [    0.057411]   node   0: [mem 0x0000000000100000-0x000000001ffdffff]
  [    0.059176] Initmem setup node 0 [mem 0x0000000000001000-0x000000001ffdffff]
  ...

Is this an invalid configuration or virtual setup that is being tested
here or is there something else problematic with this change?

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
  2025-04-17 16:22           ` [PATCH] x86/e820: discard " Nathan Chancellor
@ 2025-04-18  6:33             ` Ingo Molnar
  2025-04-18  9:01               ` Mike Rapoport
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2025-04-18  6:33 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Mike Rapoport, Andy Shevchenko, Arnd Bergmann, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, Dave Hansen, Davide Ciminaghi,
	Ingo Molnar, Linus Torvalds, Matthew Wilcox, H. Peter Anvin,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, kvm,
	linux-kernel, x86


* Nathan Chancellor <nathan@kernel.org> wrote:

> Hi Mike,
> 
> On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> ...
> >  arch/x86/kernel/e820.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index 57120f0749cc..5f673bd6c7d7 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> >  		memblock_add(entry->addr, entry->size);
> >  	}
> >  
> > +	/*
> > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > +	 * to even less without it.
> > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > +	 */
> > +	if (IS_ENABLED(CONFIG_X86_32))
> > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> > +
> >  	/* Throw away partial pages: */
> >  	memblock_trim_memory(PAGE_SIZE);
> 
> Our CI noticed a boot failure after this change as commit 1e07b9fad022
> ("x86/e820: Discard high memory that can't be addressed by 32-bit
> systems") in -tip when booting i386_defconfig with a simple buildroot
> initrd.

I've zapped this commit from tip:x86/urgent for the time being:

  1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")

until these bugs are better understood.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
  2025-04-18  6:33             ` Ingo Molnar
@ 2025-04-18  9:01               ` Mike Rapoport
  2025-04-18 12:59                 ` Ingo Molnar
  0 siblings, 1 reply; 78+ messages in thread
From: Mike Rapoport @ 2025-04-18  9:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nathan Chancellor, Andy Shevchenko, Arnd Bergmann, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, Dave Hansen, Davide Ciminaghi,
	Ingo Molnar, Linus Torvalds, Matthew Wilcox, H. Peter Anvin,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, kvm,
	linux-kernel, x86

On Fri, Apr 18, 2025 at 08:33:02AM +0200, Ingo Molnar wrote:
> 
> * Nathan Chancellor <nathan@kernel.org> wrote:
> 
> > Hi Mike,
> > 
> > On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> > ...
> > >  arch/x86/kernel/e820.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > index 57120f0749cc..5f673bd6c7d7 100644
> > > --- a/arch/x86/kernel/e820.c
> > > +++ b/arch/x86/kernel/e820.c
> > > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > >  		memblock_add(entry->addr, entry->size);
> > >  	}
> > >  
> > > +	/*
> > > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > > +	 * to even less without it.
> > > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > > +	 */
> > > +	if (IS_ENABLED(CONFIG_X86_32))
> > > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> > > +
> > >  	/* Throw away partial pages: */
> > >  	memblock_trim_memory(PAGE_SIZE);
> > 
> > Our CI noticed a boot failure after this change as commit 1e07b9fad022
> > ("x86/e820: Discard high memory that can't be addressed by 32-bit
> > systems") in -tip when booting i386_defconfig with a simple buildroot
> > initrd.
> 
> I've zapped this commit from tip:x86/urgent for the time being:
> 
>   1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
> 
> until these bugs are better understood.

With X86_PAE disabled phys_addr_t is 32 bit, PFN_PHYS(MAX_NONPAE_PFN)
overflows and we get memblock_remove(0, -1) :(

Using max_pfn instead of MAX_NONPAE_PFN would work because there's a hole
under 4G and max_pfn should never overflow.

Another option is to skip e820 entries above 4G and not add them to
memblock in the first place, e.g.

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 57120f0749cc..2b617f36f11a 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1297,6 +1297,17 @@ void __init e820__memblock_setup(void)
 		if (entry->type != E820_TYPE_RAM)
 			continue;
 
+#ifdef CONFIG_X86_32
+		/*
+		 * Discard memory above 4GB because 32-bit systems are limited
+		 * to 4GB of memory even with HIGHMEM.
+		 */
+		if (entry->addr > SZ_4G)
+			continue;
+		if (entry->addr + entry->size > SZ_4G)
+			entry->size = SZ_4G - entry->addr;
+#endif
+
 		memblock_add(entry->addr, entry->size);
 	}
 
 
> Thanks,
> 
> 	Ingo

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
  2025-04-18  9:01               ` Mike Rapoport
@ 2025-04-18 12:59                 ` Ingo Molnar
  2025-04-18 19:25                   ` Mike Rapoport
  0 siblings, 1 reply; 78+ messages in thread
From: Ingo Molnar @ 2025-04-18 12:59 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Nathan Chancellor, Andy Shevchenko, Arnd Bergmann, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, Dave Hansen, Davide Ciminaghi,
	Ingo Molnar, Linus Torvalds, Matthew Wilcox, H. Peter Anvin,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, kvm,
	linux-kernel, x86


* Mike Rapoport <rppt@kernel.org> wrote:

> On Fri, Apr 18, 2025 at 08:33:02AM +0200, Ingo Molnar wrote:
> > 
> > * Nathan Chancellor <nathan@kernel.org> wrote:
> > 
> > > Hi Mike,
> > > 
> > > On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> > > ...
> > > >  arch/x86/kernel/e820.c | 8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > index 57120f0749cc..5f673bd6c7d7 100644
> > > > --- a/arch/x86/kernel/e820.c
> > > > +++ b/arch/x86/kernel/e820.c
> > > > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > > >  		memblock_add(entry->addr, entry->size);
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > > > +	 * to even less without it.
> > > > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > > > +	 */
> > > > +	if (IS_ENABLED(CONFIG_X86_32))
> > > > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> > > > +
> > > >  	/* Throw away partial pages: */
> > > >  	memblock_trim_memory(PAGE_SIZE);
> > > 
> > > Our CI noticed a boot failure after this change as commit 1e07b9fad022
> > > ("x86/e820: Discard high memory that can't be addressed by 32-bit
> > > systems") in -tip when booting i386_defconfig with a simple buildroot
> > > initrd.
> > 
> > I've zapped this commit from tip:x86/urgent for the time being:
> > 
> >   1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
> > 
> > until these bugs are better understood.
> 
> With X86_PAE disabled phys_addr_t is 32 bit, PFN_PHYS(MAX_NONPAE_PFN)
> overflows and we get memblock_remove(0, -1) :(
> 
> Using max_pfn instead of MAX_NONPAE_PFN would work because there's a hole
> under 4G and max_pfn should never overflow.

So why don't we use max_pfn like your -v1 fix did IIRC?

	Ingo

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
  2025-04-18 12:59                 ` Ingo Molnar
@ 2025-04-18 19:25                   ` Mike Rapoport
  2025-04-18 19:29                     ` Dave Hansen
  0 siblings, 1 reply; 78+ messages in thread
From: Mike Rapoport @ 2025-04-18 19:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nathan Chancellor, Andy Shevchenko, Arnd Bergmann, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, Dave Hansen, Davide Ciminaghi,
	Ingo Molnar, Linus Torvalds, Matthew Wilcox, H. Peter Anvin,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, kvm,
	linux-kernel, x86

On Fri, Apr 18, 2025 at 02:59:05PM +0200, Ingo Molnar wrote:
> 
> * Mike Rapoport <rppt@kernel.org> wrote:
> 
> > On Fri, Apr 18, 2025 at 08:33:02AM +0200, Ingo Molnar wrote:
> > > 
> > > * Nathan Chancellor <nathan@kernel.org> wrote:
> > > 
> > > > Hi Mike,
> > > > 
> > > > On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> > > > ...
> > > > >  arch/x86/kernel/e820.c | 8 ++++++++
> > > > >  1 file changed, 8 insertions(+)
> > > > > 
> > > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > > > > index 57120f0749cc..5f673bd6c7d7 100644
> > > > > --- a/arch/x86/kernel/e820.c
> > > > > +++ b/arch/x86/kernel/e820.c
> > > > > @@ -1300,6 +1300,14 @@ void __init e820__memblock_setup(void)
> > > > >  		memblock_add(entry->addr, entry->size);
> > > > >  	}
> > > > >  
> > > > > +	/*
> > > > > +	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
> > > > > +	 * to even less without it.
> > > > > +	 * Discard memory after max_pfn - the actual limit detected at runtime.
> > > > > +	 */
> > > > > +	if (IS_ENABLED(CONFIG_X86_32))
> > > > > +		memblock_remove(PFN_PHYS(max_pfn), -1);
> > > > > +
> > > > >  	/* Throw away partial pages: */
> > > > >  	memblock_trim_memory(PAGE_SIZE);
> > > > 
> > > > Our CI noticed a boot failure after this change as commit 1e07b9fad022
> > > > ("x86/e820: Discard high memory that can't be addressed by 32-bit
> > > > systems") in -tip when booting i386_defconfig with a simple buildroot
> > > > initrd.
> > > 
> > > I've zapped this commit from tip:x86/urgent for the time being:
> > > 
> > >   1e07b9fad022 ("x86/e820: Discard high memory that can't be addressed by 32-bit systems")
> > > 
> > > until these bugs are better understood.
> > 
> > With X86_PAE disabled phys_addr_t is 32 bit, PFN_PHYS(MAX_NONPAE_PFN)
> > overflows and we get memblock_remove(0, -1) :(
> > 
> > Using max_pfn instead of MAX_NONPAE_PFN would work because there's a hole
> > under 4G and max_pfn should never overflow.
> 
> So why don't we use max_pfn like your -v1 fix did IIRC?

Dave didn't like max_pfn. I don't feel strongly about using max_pfn or
skipping e820 ranges above 4G and not adding them to memblock.
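For reference, the wraparound described above can be sketched outside the
kernel. This is a minimal emulation (not kernel code) of PFN_PHYS(), which
shifts a page frame number left by PAGE_SHIFT, truncated to the width of
phys_addr_t:

```python
# Sketch of the overflow described above (not kernel code): PFN_PHYS()
# shifts a PFN left by PAGE_SHIFT, truncated to the width of phys_addr_t.
PAGE_SHIFT = 12
MAX_NONPAE_PFN = 1 << 20          # first PFN at the 4GB boundary

def pfn_phys(pfn, phys_addr_bits):
    return (pfn << PAGE_SHIFT) & ((1 << phys_addr_bits) - 1)

# With PAE (64-bit phys_addr_t) the address is representable:
print(hex(pfn_phys(MAX_NONPAE_PFN, 64)))   # 0x100000000
# Without PAE (32-bit phys_addr_t) it wraps to 0, so the kernel
# would end up calling memblock_remove(0, -1) and drop all memory:
print(hex(pfn_phys(MAX_NONPAE_PFN, 32)))   # 0x0
```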
 
> 	Ingo

-- 
Sincerely yours,
Mike.


* Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
  2025-04-18 19:25                   ` Mike Rapoport
@ 2025-04-18 19:29                     ` Dave Hansen
  0 siblings, 0 replies; 78+ messages in thread
From: Dave Hansen @ 2025-04-18 19:29 UTC (permalink / raw)
  To: Mike Rapoport, Ingo Molnar
  Cc: Nathan Chancellor, Andy Shevchenko, Arnd Bergmann, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, Davide Ciminaghi, Ingo Molnar,
	Linus Torvalds, Matthew Wilcox, H. Peter Anvin, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, kvm, linux-kernel, x86

On 4/18/25 12:25, Mike Rapoport wrote:
>> So why don't we use max_pfn like your -v1 fix did IIRC?
> Dave didn't like max_pfn. I don't feel strongly about using max_pfn or
> skipping e820 ranges above 4G and not adding them to memblock.

I feel more strongly about fixing the bug than avoiding max_pfn. ;)

Going back to v1 is fine with me.


* Re: [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems
  2025-04-13  8:08         ` [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems Mike Rapoport
                             ` (3 preceding siblings ...)
  2025-04-17 16:22           ` [PATCH] x86/e820: discard " Nathan Chancellor
@ 2025-04-18 19:49           ` Guenter Roeck
  4 siblings, 0 replies; 78+ messages in thread
From: Guenter Roeck @ 2025-04-18 19:49 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Ingo Molnar, Andy Shevchenko, Arnd Bergmann, Arnd Bergmann,
	Borislav Petkov, Dave Hansen, Dave Hansen, Davide Ciminaghi,
	Ingo Molnar, Linus Torvalds, Matthew Wilcox, H. Peter Anvin,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, kvm,
	linux-kernel, x86

Hi,

On Sun, Apr 13, 2025 at 11:08:58AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Dave Hansen reports the following crash on a 32-bit system with
> CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
> 
>   > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
>   > obviously wasn't allocated, thus the oops.
> 
>   BUG: unable to handle page fault for address: f75fe000
>   #PF: supervisor write access in kernel mode
>   #PF: error_code(0x0002) - not-present page
>   *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
>   Oops: Oops: 0002 [#1] SMP NOPTI
>   CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
>   EIP: __free_pages_core+0x3c/0x74
>   Code: c3 d3 e6 83 ec 10 89 44 24 08 89 74 24 04 c7 04 24 c6 32 3a c2 89 55 f4 e8 a9 11 45 fe 85 f6 8b 55 f4 74 19 89 d8 31 c9 66 90 <0f> ba 30 0d c7 40 1c 00 00 00 00 41 83 c0 28 39 ce 75 ed 8b
> 
>   EAX: f75fe000 EBX: f75fe000 ECX: 00000000 EDX: 0000000a
>   ESI: 00000400 EDI: 00500000 EBP: c247becc ESP: c247beb4
>   DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046
>   CR0: 80050033 CR2: f75fe000 CR3: 02da6000 CR4: 000000b0
>   Call Trace:
>    memblock_free_pages+0x11/0x2c
>    memblock_free_all+0x2ce/0x3a0
>    mm_core_init+0xf5/0x320
>    start_kernel+0x296/0x79c
>    ? set_init_arg+0x70/0x70
>    ? load_ucode_bsp+0x13c/0x1a8
>    i386_start_kernel+0xad/0xb0
>    startup_32_smp+0x151/0x154
>   Modules linked in:
>   CR2: 00000000f75fe000
> 
> The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
> by max_pfn.
> 
> Before 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") freeing of
> high memory was also clamped to the end of ZONE_HIGHMEM but after
> 6faea3422e3b memblock_free_all() tries to free memory above the end of
> ZONE_HIGHMEM as well and that causes access to mem_map[] entries beyond
> the end of the memory map.
> 
> Discard the memory after max_pfn from memblock on 32-bit systems so that
> core MM would be aware only of actually usable memory.
> 
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Tested-by: Arnd Bergmann <arnd@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

With this patch in pending-fixes (v6.15-rc2-434-g93ced5296772),
all my i386 test runs crash.

[    0.020893] Kernel panic - not syncing: ioapic_setup_resources: Failed to allocate 0x0000002b bytes
[    0.021248] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc2-00434-g93ced5296772 #1 PREEMPT(undef)
[    0.021373] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    0.021549] Call Trace:
[    0.021711]  dump_stack_lvl+0x20/0x104
[    0.022023]  dump_stack+0x12/0x18
[    0.022064]  panic+0x2c1/0x2d8
[    0.022116]  ? vprintk_default+0x29/0x30
[    0.022163]  __memblock_alloc_or_panic+0x57/0x58
[    0.022221]  io_apic_init_mappings+0x2e/0x1a8
[    0.022284]  setup_arch+0x909/0xdac
[    0.022338]  ? vprintk_default+0x29/0x30
[    0.022410]  start_kernel+0x63/0x760
[    0.022457]  ? load_ucode_bsp+0x12c/0x198
[    0.022507]  i386_start_kernel+0x74/0x74
[    0.022548]  startup_32_smp+0x151/0x154
[    0.023089] ---[ end Kernel panic - not syncing: ioapic_setup_resources: Failed to allocate 0x0000002b bytes ]---

Reverting this patch fixes the problem. Bisect log is attached for reference.

Guenter

---
# bad: [93ced5296772b7b704f48e4bad9fcfdf0633c780] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
# good: [8ffd015db85fea3e15a77027fda6c02ced4d2444] Linux 6.15-rc2
git bisect start 'HEAD' 'v6.15-rc2'
# good: [5d6f363fc974e32dd9930fecaae63958b68a1df4] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap.git
git bisect good 5d6f363fc974e32dd9930fecaae63958b68a1df4
# good: [1790b4a242fe119fead08fccc5bf923423c7449a] Merge branch 'dma-mapping-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux.git
git bisect good 1790b4a242fe119fead08fccc5bf923423c7449a
# good: [5d37ee8a1d6455968ea3134d78223090d487c7f4] Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git
git bisect good 5d37ee8a1d6455968ea3134d78223090d487c7f4
# good: [9d4de5ae5208548eb9c6a490ac454601f4fbf00b] Merge branch 'i2c/i2c-host-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux.git
git bisect good 9d4de5ae5208548eb9c6a490ac454601f4fbf00b
# bad: [f737ab93945fb8f0213e1cccc39d028eb5d880e0] Merge branch into tip/master: 'x86/urgent'
git bisect bad f737ab93945fb8f0213e1cccc39d028eb5d880e0
# good: [2e7a2843d0de7677b7bb908ca006dc435e52c416] Merge branch into tip/master: 'irq/urgent'
git bisect good 2e7a2843d0de7677b7bb908ca006dc435e52c416
# good: [d466304c4322ad391797437cd84cca7ce1660de0] x86/cpu: Add CPU model number for Bartlett Lake CPUs with Raptor Cove cores
git bisect good d466304c4322ad391797437cd84cca7ce1660de0
# good: [39893b1e4ad7c4380abe4cfddaa58b34c4363bf4] Merge branch into tip/master: 'timers/urgent'
git bisect good 39893b1e4ad7c4380abe4cfddaa58b34c4363bf4
# bad: [1e07b9fad022e0e02215150ca1e20912e78e8ec1] x86/e820: Discard high memory that can't be addressed by 32-bit systems
git bisect bad 1e07b9fad022e0e02215150ca1e20912e78e8ec1
# first bad commit: [1e07b9fad022e0e02215150ca1e20912e78e8ec1] x86/e820: Discard high memory that can't be addressed by 32-bit systems


* [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems
       [not found] <20250413080858.743221-1-rppt@kernel.org # discussion and submission>
@ 2025-04-19 15:00 ` tip-bot2 for Mike Rapoport (Microsoft)
  0 siblings, 0 replies; 78+ messages in thread
From: tip-bot2 for Mike Rapoport (Microsoft) @ 2025-04-19 15:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dave Hansen, Arnd Bergmann, Mike Rapoport (Microsoft),
	Ingo Molnar, Andy Shevchenko, Arnd Bergmann, Davide Ciminaghi,
	H. Peter Anvin, Linus Torvalds, Matthew Wilcox, Paolo Bonzini,
	Sean Christopherson, kvm, x86, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     83b2d345e1786fdab96fc2b52942eebde125e7cd
Gitweb:        https://git.kernel.org/tip/83b2d345e1786fdab96fc2b52942eebde125e7cd
Author:        Mike Rapoport (Microsoft) <rppt@kernel.org>
AuthorDate:    Sun, 13 Apr 2025 11:08:58 +03:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 19 Apr 2025 16:48:18 +02:00

x86/e820: Discard high memory that can't be addressed by 32-bit systems

Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:

  > 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
  > obviously wasn't allocated, thus the oops.

  BUG: unable to handle page fault for address: f75fe000
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  *pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
  Oops: Oops: 0002 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  EIP: __free_pages_core+0x3c/0x74
  ...
  Call Trace:
   memblock_free_pages+0x11/0x2c
   memblock_free_all+0x2ce/0x3a0
   mm_core_init+0xf5/0x320
   start_kernel+0x296/0x79c
   i386_start_kernel+0xad/0xb0
   startup_32_smp+0x151/0x154

The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.

The bug was introduced by this recent commit:

  6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")

Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM but after this change, memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well and that causes
access to mem_map[] entries beyond the end of the memory map.

To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that core MM would be aware only of actually
usable memory.

Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org # discussion and submission
---
 arch/x86/kernel/e820.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 9d8dd8d..9920122 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1299,6 +1299,14 @@ void __init e820__memblock_setup(void)
 		memblock_add(entry->addr, entry->size);
 	}
 
+	/*
+	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
+	 * to even less without it.
+	 * Discard memory after max_pfn - the actual limit detected at runtime.
+	 */
+	if (IS_ENABLED(CONFIG_X86_32))
+		memblock_remove(PFN_PHYS(max_pfn), -1);
+
 	/* Throw away partial pages: */
 	memblock_trim_memory(PAGE_SIZE);
 


Thread overview: 78+ messages
2024-12-04 10:30 [PATCH 00/11] x86: 32-bit cleanups Arnd Bergmann
2024-12-04 10:30 ` [PATCH 01/11] x86/Kconfig: Geode CPU has cmpxchg8b Arnd Bergmann
2024-12-04 10:30 ` [PATCH 02/11] x86: drop 32-bit "bigsmp" machine support Arnd Bergmann
2024-12-04 10:30 ` [PATCH 03/11] x86: Kconfig.cpu: split out 64-bit atom Arnd Bergmann
2024-12-04 13:16   ` Thomas Gleixner
2024-12-04 15:55     ` H. Peter Anvin
2024-12-04 18:21       ` Andy Shevchenko
2024-12-04 10:30 ` [PATCH 04/11] x86: split CPU selection into 32-bit and 64-bit Arnd Bergmann
2024-12-04 18:31   ` Andy Shevchenko
2024-12-04 21:18     ` Arnd Bergmann
2024-12-04 10:30 ` [PATCH 05/11] x86: remove HIGHMEM64G support Arnd Bergmann
2024-12-04 13:29   ` Brian Gerst
2024-12-04 13:43     ` Arnd Bergmann
2024-12-04 14:02       ` Brian Gerst
2024-12-04 15:00         ` Brian Gerst
2024-12-04 15:58         ` H. Peter Anvin
2024-12-04 15:53       ` H. Peter Anvin
2024-12-04 16:37     ` H. Peter Anvin
2024-12-04 16:55       ` Arnd Bergmann
2024-12-04 18:37         ` Andy Shevchenko
2024-12-04 21:14           ` Arnd Bergmann
2025-04-11 23:44   ` Dave Hansen
2025-04-12  8:39     ` Ingo Molnar
2025-04-12 10:05     ` Mike Rapoport
2025-04-12 10:44       ` Arnd Bergmann
2025-04-12 19:48       ` Ingo Molnar
2025-04-13  8:08         ` [PATCH] x86/e820: discard high memory that can't be addressed by 32-bit systems Mike Rapoport
2025-04-13  9:23           ` [tip: x86/urgent] x86/e820: Discard " tip-bot2 for Mike Rapoport (Microsoft)
2025-04-14 14:19             ` Dave Hansen
2025-04-15  7:18               ` Mike Rapoport
2025-04-15 13:43                 ` Dave Hansen
2025-04-16  7:17                   ` Ingo Molnar
2025-04-16  7:51                     ` Ingo Molnar
2025-04-16  7:24           ` tip-bot2 for Mike Rapoport (Microsoft)
2025-04-16  8:16           ` tip-bot2 for Mike Rapoport (Microsoft)
2025-04-17 16:22           ` [PATCH] x86/e820: discard " Nathan Chancellor
2025-04-18  6:33             ` Ingo Molnar
2025-04-18  9:01               ` Mike Rapoport
2025-04-18 12:59                 ` Ingo Molnar
2025-04-18 19:25                   ` Mike Rapoport
2025-04-18 19:29                     ` Dave Hansen
2025-04-18 19:49           ` Guenter Roeck
2025-04-12 10:40     ` [PATCH 05/11] x86: remove HIGHMEM64G support Arnd Bergmann
2024-12-04 10:30 ` [PATCH 06/11] x86: drop SWIOTLB and PHYS_ADDR_T_64BIT for PAE Arnd Bergmann
2024-12-04 18:41   ` Andy Shevchenko
2024-12-04 20:52     ` Arnd Bergmann
2024-12-05  7:59       ` Andy Shevchenko
2024-12-04 10:30 ` [PATCH 07/11] x86: drop support for CONFIG_HIGHPTE Arnd Bergmann
2024-12-04 10:30 ` [PATCH 08/11] x86: document X86_INTEL_MID as 64-bit-only Arnd Bergmann
2024-12-04 18:55   ` Andy Shevchenko
2024-12-04 20:38     ` Arnd Bergmann
2024-12-05  8:03       ` Andy Shevchenko
2024-12-06 11:23     ` Ferry Toth
2024-12-06 14:27       ` Arnd Bergmann
2024-12-04 10:30 ` [PATCH 09/11] x86: rework CONFIG_GENERIC_CPU compiler flags Arnd Bergmann
2024-12-04 15:36   ` Tor Vic
2024-12-04 17:51     ` Arnd Bergmann
2024-12-04 17:09   ` Nathan Chancellor
2024-12-04 17:52     ` Arnd Bergmann
2024-12-04 18:10   ` Linus Torvalds
2024-12-04 19:43     ` Arnd Bergmann
2024-12-04 23:33       ` Linus Torvalds
2024-12-05  8:13         ` Andy Shevchenko
2024-12-05 10:09           ` Arnd Bergmann
2024-12-05 11:17             ` Andy Shevchenko
2024-12-05 11:58               ` Arnd Bergmann
2024-12-05 12:35                 ` Jason A. Donenfeld
2024-12-05  9:46         ` Arnd Bergmann
2024-12-05 10:01           ` Andy Shevchenko
2024-12-05 10:47             ` Arnd Bergmann
2024-12-05  8:07       ` Andy Shevchenko
2024-12-06 13:56   ` David Laight
2024-12-04 10:30 ` [PATCH 10/11] x86: remove old STA2x11 support Arnd Bergmann
2024-12-05  7:35   ` Davide Ciminaghi
2024-12-04 10:30 ` [PATCH 11/11] x86: drop 32-bit KVM host support Arnd Bergmann
2024-12-04 15:30   ` Sean Christopherson
2024-12-04 16:33     ` Arnd Bergmann
     [not found] <20250413080858.743221-1-rppt@kernel.org # discussion and submission>
2025-04-19 15:00 ` [tip: x86/urgent] x86/e820: Discard high memory that can't be addressed by 32-bit systems tip-bot2 for Mike Rapoport (Microsoft)
