linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: "Michael Kelley (LINUX)" <mikelley@microsoft.com>,
	LKML <linux-kernel@vger.kernel.org>
Cc: "x86@kernel.org" <x86@kernel.org>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Arjan van de Ven <arjan@linux.intel.com>,
	"James E.J. Bottomley" <jejb@linux.ibm.com>,
	Dick Kennedy <dick.kennedy@broadcom.com>,
	James Smart <james.smart@broadcom.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	Guenter Roeck <linux@roeck-us.net>,
	"linux-hwmon@vger.kernel.org" <linux-hwmon@vger.kernel.org>,
	Jean Delvare <jdelvare@suse.com>, Huang Rui <ray.huang@amd.com>,
	Juergen Gross <jgross@suse.com>, Steve Wahl <steve.wahl@hpe.com>,
	Mike Travis <mike.travis@hpe.com>,
	Dimitri Sivanich <dimitri.sivanich@hpe.com>,
	Russ Anderson <russ.anderson@hpe.com>
Subject: RE: [patch v2 21/38] x86/cpu: Provide cpu_init/parse_topology()
Date: Mon, 31 Jul 2023 14:34:39 +0200	[thread overview]
Message-ID: <871qgop8dc.ffs@tglx> (raw)
In-Reply-To: <BYAPR21MB16889FD224344B1B28BE22A1D705A@BYAPR21MB1688.namprd21.prod.outlook.com>

On Mon, Jul 31 2023 at 04:05, Michael Kelley wrote:
>> +	/*
>> +	 * The initial invocation from early_identify_cpu() happens before
>> +	 * the APIC is mapped or X2APIC enabled. For establishing the
>> +	 * topology, that's not required. Use the initial APIC ID.
>> +	 */
>> +	if (early)
>> +		c->topo.apicid = c->topo.initial_apicid;
>> +	else
>> +		c->topo.apicid = read_apic_id();
>
> Using the value from the local APIC ID reg turns out to cause a problem in
> some Hyper-V VM configurations.  If a VM has multiple L3 caches (probably
> due to multiple NUMA nodes) and the # of CPUs in the span of the L3 cache
> is not a power of 2, the APIC IDs for the CPUs in the span of the 1st L3 cache
> are sequential starting with 0.  But then there is a gap before starting the
> APIC IDs for the CPUs in the span of the 2nd L3 cache.  The gap is
> repeated if there are additional L3 caches.
>
> The CPUID instruction executed on a guest vCPU correctly reports the APIC
> IDs.  However, the ACPI MADT assigns the APIC IDs sequentially with no
> gaps, and the guest firmware sets the APIC_ID register for each local APIC
> to match the MADT.  When parse_topology() sets the apicid field based on
> reading the local APIC ID register, the value it sets is different from the
> initial_apicid value for CPUs in the span of the 2nd and subsequent L3
> caches, because there's no gap in the APIC IDs read from the local APIC.
> Linux boots and runs, but the topology is set up with the wrong span for
> the L3 cache and for the associated scheduling domains.

TBH. That's an insanity. MADT and the actual APIC ID determine the
topology. So the gaps should be reflected in MADT and the actual APIC
IDs should be set correctly if the intent is to provide topology
information.

Just for the record. This hack works only on Intel today, because AMD
init sets topo.apicid = read_apic_id() unconditionally. So this is
inconsistent already, no?

> The old code derives the apicid from the initial_apicid via the
> phys_pkg_id() callback, so these bad Hyper-V VM configs skate by.  The
> wrong value in the local APIC ID register and MADT does not affect
> anything, except that the check in validate_apic_and_package_id() fails
> during boot, and a set of "Firmware bug:" messages is correctly
> output.

So instead of fixing the firmware bugs, hyper-v just moves on and
pretends that everything works fine, right?

> Three thoughts:
>
> 1)  Are Hyper-V VMs the only place where the local APIC ID register might
> have a bogus value?  Probably so, but you never know what might crawl
> out.

Define bogus. MADT is the primary source of information because that's
how we know how many CPUs (APICs) are there and what their APIC ID is
which we can use to wake them up. So there is a reasonable expectation
that this information is consistent with the rest of the system.

The Intel SDM clearly says in Vol 3A section 9.4.5 Identifying Logical
Processors in an MP System:

  "After the BIOS has completed the MP initialization protocol, each
   logical processor can be uniquely identified by its local APIC
   ID. Software can access these APIC IDs in either of the following
   ways:"

These ways include read from APIC, read MADT, read CPUID and implies
that this must be consistent. For X2APIC it's actually written out:

  "If the local APIC unit supports x2APIC and is operating in x2APIC
   mode, 32-bit APIC ID can be read by executing a RDMSR instruction to
   read the processor’s x2APIC ID register. This method is equivalent to
   executing CPUID leaf 0BH described below."

AMD has not been following that in the early 64bit systems as they moved
the APIC ID space to start at 32 for the first CPU in the first socket
for whatever reasons. But since then the kernel reads back the APIC ID
on AMD systems into topo.apicid. But that was long ago and can easily be
dealt with because at least the real APIC ID and the MADT/MPTABLE
entries are consistent.

Hypervisors have their own CPUID space to override functionality with
their own magic stuff, but imposing their nutbolt ideas on the
architectural part of the system is not only wrong, it's disrespectful
against the OS developers who try to keep their system sane.

> 2) The natural response is "Well, fix Hyper-V!"  I first had this conversation
> with the Hyper-V team about 5 years ago.  Some cases of the problem were
> fixed, but some cases remain unfixed.  It's a long story.
>
> 3)  Since Hyper-V code in Linux already has an override for the apic->read()
> function, it's possible to do a hack in that override so that apicid gets set to
> the same value as initial_apicid, which matches the old code.  Here's the diff:

This collides massively with the other work I'm doing, which uses the
MADT provided information to actually evaluate various topology related
things upfront and later during bringup. Thats badly needed because lots
of todays infrastructure is based on heuristics and guesswork.

But it seems I wasted a month on reworking all of this just to be
stopped cold in the tracks by completely undocumented and unnecessary
hyper-v abuse.

So if Hyper-V insists on abusing the initial APIC ID as read from CPUID
for topology information related to L3, then hyper-v should override the
cache topology mechanism and not impose this insanity on the basic
topology evaluation infrastructure.

Yours seriously grumpy

      tglx

  reply	other threads:[~2023-07-31 12:34 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-28 12:12 [patch v2 00/38] x86/cpu: Rework the topology evaluation Thomas Gleixner
2023-07-28 12:12 ` [patch v2 01/38] x86/cpu: Encapsulate topology information in cpuinfo_x86 Thomas Gleixner
2023-07-28 12:12 ` [patch v2 02/38] x86/cpu: Move phys_proc_id into topology info Thomas Gleixner
2023-07-28 12:12 ` [patch v2 03/38] x86/cpu: Move cpu_die_id " Thomas Gleixner
2023-07-28 12:12 ` [patch v2 04/38] scsi: lpfc: Use topology_core_id() Thomas Gleixner
2023-07-28 12:12 ` [patch v2 05/38] hwmon: (fam15h_power) " Thomas Gleixner
2023-07-28 12:12 ` [patch v2 06/38] x86/cpu: Move cpu_core_id into topology info Thomas Gleixner
2023-07-28 12:12 ` [patch v2 07/38] x86/cpu: Move cu_id " Thomas Gleixner
2023-07-28 12:12 ` [patch v2 08/38] x86/cpu: Remove pointless evaluation of x86_coreid_bits Thomas Gleixner
2023-07-28 12:12 ` [patch v2 09/38] x86/cpu: Move logical package and die IDs into topology info Thomas Gleixner
2023-07-28 12:12 ` [patch v2 10/38] x86/cpu: Move cpu_l[l2]c_id " Thomas Gleixner
2023-07-28 12:12 ` [patch v2 11/38] x86/apic: Use BAD_APICID consistently; Thomas Gleixner
2023-07-28 22:36   ` Sohil Mehta
2023-07-28 22:50     ` Thomas Gleixner
2023-07-28 12:12 ` [patch v2 12/38] x86/apic: Use u32 for APIC IDs in global data Thomas Gleixner
2023-07-28 12:12 ` [patch v2 13/38] x86/apic: Use u32 for check_apicid_used() Thomas Gleixner
2023-07-28 12:13 ` [patch v2 14/38] x86/apic: Use u32 for cpu_present_to_apicid() Thomas Gleixner
2023-07-28 12:13 ` [patch v2 15/38] x86/apic: Use u32 for phys_pkg_id() Thomas Gleixner
2023-08-08 20:14   ` Steve Wahl
2023-07-28 12:13 ` [patch v2 16/38] x86/apic: Use u32 for [gs]et_apic_id() Thomas Gleixner
2023-08-08 20:15   ` Steve Wahl
2023-07-28 12:13 ` [patch v2 17/38] x86/apic: Use u32 for wakeup_secondary_cpu[_64]() Thomas Gleixner
2023-08-08 20:15   ` Steve Wahl
2023-07-28 12:13 ` [patch v2 18/38] x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids Thomas Gleixner
2023-07-28 12:13 ` [patch v2 19/38] x86/cpu: Provide debug interface Thomas Gleixner
2023-07-28 23:57   ` Sohil Mehta
2023-07-28 12:13 ` [patch v2 20/38] x86/cpu: Provide cpuid_read() et al Thomas Gleixner
2023-07-28 12:13 ` [patch v2 21/38] x86/cpu: Provide cpu_init/parse_topology() Thomas Gleixner
2023-07-29  0:02   ` Sohil Mehta
2023-07-31  4:05   ` Michael Kelley (LINUX)
2023-07-31 12:34     ` Thomas Gleixner [this message]
2023-07-31 13:27       ` Peter Zijlstra
2023-07-31 15:38         ` Thomas Gleixner
2023-07-31 16:10           ` Michael Kelley (LINUX)
2023-07-31 20:48             ` Thomas Gleixner
2023-07-31 21:27               ` Michael Kelley (LINUX)
2023-07-31 22:12                 ` Thomas Gleixner
2023-08-01 22:25                   ` Thomas Gleixner
2023-08-01 22:35                     ` Andrew Cooper
2023-08-02 14:43                     ` Michael Kelley (LINUX)
2023-07-31 16:25       ` Michael Kelley (LINUX)
2023-07-31 20:41         ` Thomas Gleixner
2023-07-31 13:47     ` Arjan van de Ven
2023-07-31 14:08       ` Andrew Cooper
2023-08-01  7:05   ` Gautham R. Shenoy
2023-08-01  7:34     ` Thomas Gleixner
2023-07-28 12:13 ` [patch v2 22/38] x86/cpu: Add legacy topology parser Thomas Gleixner
2023-07-28 21:33   ` [patch v2A " Thomas Gleixner
2023-07-28 12:13 ` [patch v2 23/38] x86/cpu: Use common topology code for Centaur and Zhaoxin Thomas Gleixner
2023-07-28 12:13 ` [patch v2 24/38] x86/cpu: Move __max_die_per_package to common.c Thomas Gleixner
2023-07-28 12:13 ` [patch v2 25/38] x86/cpu: Provide a sane leaf 0xb/0x1f parser Thomas Gleixner
2023-07-28 12:13 ` [patch v2 26/38] x86/cpu: Use common topology code for Intel Thomas Gleixner
2023-07-28 12:13 ` [patch v2 27/38] x86/cpu/amd: Provide a separate acessor for Node ID Thomas Gleixner
2023-07-29  0:07   ` Sohil Mehta
2023-07-28 12:13 ` [patch v2 28/38] x86/cpu: Provide an AMD/HYGON specific topology parser Thomas Gleixner
2023-07-29  0:05   ` Sohil Mehta
2023-07-30  5:20   ` Michael Kelley (LINUX)
2023-07-30  7:41     ` Thomas Gleixner
2023-07-28 12:13 ` [patch v2 29/38] x86/smpboot: Teach it about topo.amd_node_id Thomas Gleixner
2023-07-28 12:13 ` [patch v2 30/38] x86/cpu: Use common topology code for AMD Thomas Gleixner
2023-07-28 12:13 ` [patch v2 31/38] x86/cpu: Use common topology code for HYGON Thomas Gleixner
2023-07-28 12:13 ` [patch v2 32/38] x86/mm/numa: Use core domain size on AMD Thomas Gleixner
2023-07-28 12:13 ` [patch v2 33/38] x86/cpu: Make topology_amd_node_id() use the actual node info Thomas Gleixner
2023-07-28 12:13 ` [patch v2 34/38] x86/cpu: Remove topology.c Thomas Gleixner
2023-07-28 12:13 ` [patch v2 35/38] x86/cpu: Remove x86_coreid_bits Thomas Gleixner
2023-07-28 12:13 ` [patch v2 36/38] x86/apic: Remove unused phys_pkg_id() callback Thomas Gleixner
2023-08-08 20:15   ` Steve Wahl
2023-07-28 12:13 ` [patch v2 37/38] x86/xen/smp_pv: Remove cpudata fiddling Thomas Gleixner
2023-07-28 12:13 ` [patch v2 38/38] x86/apic/uv: Remove the private leaf 0xb parser Thomas Gleixner
2023-08-08 20:16   ` Steve Wahl
2023-07-28 15:41 ` [patch v2 00/38] x86/cpu: Rework the topology evaluation Dimitri Sivanich
2023-07-28 19:01 ` Sohil Mehta

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871qgop8dc.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=andrew.cooper3@citrix.com \
    --cc=arjan@linux.intel.com \
    --cc=dick.kennedy@broadcom.com \
    --cc=dimitri.sivanich@hpe.com \
    --cc=james.smart@broadcom.com \
    --cc=jdelvare@suse.com \
    --cc=jejb@linux.ibm.com \
    --cc=jgross@suse.com \
    --cc=linux-hwmon@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=linux@roeck-us.net \
    --cc=martin.petersen@oracle.com \
    --cc=mike.travis@hpe.com \
    --cc=mikelley@microsoft.com \
    --cc=ray.huang@amd.com \
    --cc=russ.anderson@hpe.com \
    --cc=steve.wahl@hpe.com \
    --cc=thomas.lendacky@amd.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).