public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: "Brown, Len" <len.brown@intel.com>,
	"Zhang, Rui" <rui.zhang@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: "Gross, Jurgen" <jgross@suse.com>,
	"mikelley@microsoft.com" <mikelley@microsoft.com>,
	"arjan@linux.intel.com" <arjan@linux.intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"thomas.lendacky@amd.com" <thomas.lendacky@amd.com>,
	"ray.huang@amd.com" <ray.huang@amd.com>,
	"andrew.cooper3@citrix.com" <andrew.cooper3@citrix.com>,
	"Sivanich, Dimitri" <dimitri.sivanich@hpe.com>,
	"wei.liu@kernel.org" <wei.liu@kernel.org>
Subject: RE: [patch V3 27/40] x86/cpu: Provide a sane leaf 0xb/0x1f parser
Date: Thu, 17 Aug 2023 11:09:56 +0200	[thread overview]
Message-ID: <87fs4ighln.ffs@tglx> (raw)
In-Reply-To: <MN0PR11MB601007E7DAE4C5FC86E58DB0E014A@MN0PR11MB6010.namprd11.prod.outlook.com>

Len!

On Tue, Aug 15 2023 at 19:30, Len Brown wrote:
> In retrospect, we under-specified what it means to enumerate a
> CPUID.1F die, because it has been a constant battle to get the HW
> people to *not* enumerate hidden die that software does not see.
>
> Indeed, we were equally guilty in not codifying an architectural
> definition of "module" and "tile", which were placed into the CPUID.1F
> definition mostly as place-holders with awareness of hardware
> structures that were already in common use.  For example, there were
> already module-scoped counters that were hard-coded, and enumerating
> modules seems to be an to give architectural (re-usable) enumeration
> to model-specific code.

Sure 0x1f is underspecified in terms of meaning of the intermediate
levels, but it's perfectly parseable. Once there is information how to
actually utilize MODULE or TILE for software decisions, then it can be
implemented. Until then we still have to can parse it and otherwise just
ignore it. Where is the problem?

> Second, failings of the Linux topology code...
>
> I agree with you that "thread_siblings" and "core_cpus" are the
> different words for the same thing.  This will always be true because
> the hardware architecture guarantees that SMT siblings are the next
> level down from core.

Right. So you "fix" was pure cosmetic.

> But no such definition exists for "core_siblings".  It is impossible
> to write correct software that reads "core_siblings" and takes any
> action on it.  Those could be the CPUs inside a module, or inside a
> die, or inside some other level that today's software can't possibly
> know by name.
>
> On the other hand, die_cpus is clear -- the CPUs within a die.
> Package_cpus -- the CPUs within a package.
> Core_cpus -- the cpus within a core....
> Words matter.

Of course does terminology matter, but that's not the real problem.

> Re: globally unique core_id
>
> I have 100% confidence that you can make the Linux kernel handle a
> sparce globally unique core_id name space.  My concern is unknown
> exposure to joe-random-user-space program that consumes the sysfs
> representation.

You simply can keep the current representation for proc and sysfs with
the relative IDs and just make the kernel use unique core IDs. The
kernel needs to be sane before you can even think about user space
exposure.

>>> Secondly, with the obsolescence of CPUID.0b and its replacement with 
>>> CPUID.1F, the contract between The hardware and the software is that a 
>>> level can appear and can in between any existing levels.  (the only 
>>> exception is that SMT is married to core).
>
>> In theory, yes. But what's the practical relevance that there might
>> be a new level between CORE and MODULE or MODULE and TILE etc...?
>
>>> It is not possible For an old kernel to know the name or position of a 
>>> new level in the hierarchy, going forward.
>
>> Again, where is the practical problem? These new levels are not going
>> to be declared nilly willy and every other week, right?
>
> It is irrelevant if a new level is of any practical use to Linux.
>
> What is important is that Linux be able to parse and use the levels it
> finds useful, while gracefully ignoring any that it doesn't care about
> (or doesn't yet know about).
>
> Yes, hardware folks can drop something into the ucode and the SDM w/o
> us knowing ahead of time (see DieGrp in the June 2023 SDM).  Certainly
> they can do it in well under the 4-year's notice we'd need if we were
> to simply track the named levels in the SDM.

That's a matter of education of the hardware people. Sure they can do
whatever they want, but if they want us to provide primary support for
the stuff they dream up, then they better go and tell us early enough.

>>> Today, this manifests with a (currently) latent bug that I caused.
>>> For I implemented die_id In the style of package_id, and I shouldn't 
>>> have followed that example.
>
>> You did NOT. You implemented die_id relative to the package, which
>> does not make it unique in the same way as core_id is relative to the
>> package and therefore not unique.
>
> The point is that like package_id=0 on a single package system, I put
> a die_id=0 attribute in sysfs even when NO "die" level is enumerated
> in CPUID.1F.
>
> That was a mistake.

No. It's not a mistake. Conceptually the DIE level exists even if not
enumerated. It consumes zero bits and therefore has size 1.

>>> Today, if CPUID.1F doesn't know anything about multiple DIE, Linux 
>>> conjurs up A die_id 0 in sysfs.  It should not.  The reason is that 
>>> when CPUID.1F enumerates A level that legacy code doesn't know about, 
>>> we can't possibly tell if it is above DIE, or below DIE.  If it is 
>>> above DIE, then our default die_id 0 is becomes bogus.
>
>>That's an implementation problem and the code I posted fixes this by
>>making die_id unique and taking the documented domain levels into
>>account.
>
> Your code change does not fix the problem above.

Why are all of you so fixated on domain levels which are not documented
today? Either you know something which I don't know or you are just
debating an academic problem to death.

>> So if 0x1f does not enumerate dies, then each package has one die and
>> the die ID is the same as the package ID. It's that simple.
>
> Unfortunately, no.
>
> Your code will be written and ship before level-X is defined.
>
> A couple of years later, level-X is defined above die.  Your code runs
> on new hardware that defines no packages, level-X, and no die.  How
> many die-id's does this system have?
>
> If you could see into the future, you'd answer that there are 2-die,
> because There is one inside each level-X.
>
> But since die isn't enumerated, and you don't know if a level-X is
> defined to be above or below die, then you can't tell if level-X is
> something containing die, or something contained-by die...
>
> The proper solution is to not expose a die_id attribute in sysfs if
> there is no die level enumerated in CPUID.1F.  When it is enumerated,
> we get it right.  When it is not enumerated, we don't guess.

The main problem is the kernel side itself. /proc/ /sys/ are things
which can do conditinal exposure, but you can't do so in the kernel.

Fact is that the APIC ID space is segmented to reflect the today
decribed topology domains:

     [PKG] [DIE] [TILE] [MODULE] [CORE] [THREAD]

Each of them can occupy 0 or more bits. So using an internal
representation for them which treats them as size one if not specified
is the obvious and right thing to do.

You cannot create a software monstrosity which makes everything
conditional. It's neither workable nor maintainable. You need a
consistent view independent of the enumerated levels in 0x1F and as a
consequence you have to assume that the non-enumerated levels consume
zero bits in the APIC ID space and have size one.

If the hardware people fail to understand that software needs a
consistent representation of these things, then we can give up and just
refuse to parse 0x1f at all.

Now lets look at your imaginary future systems enumeration:

     [LEVELX] [CORE] [THREAD]

You have to take LEVELX - even if unknown to the kernel - into account
to evaluate the APID ID range which defines the package space.

Otherwise you obviously just have to ignore it because yes, it's unknown
between which domain levels this is going to sit.

But this assumes that LEVELX is going to appear out of the blue just
because the hardware people took the wrong pills. So it's our internal
problem to educate them so this won't happen.

>> What do you win by removing them from the SDM?
>
> When you give HW people enough rope to hang themselves, they will.

You are not preventing this by removing the MODULE/TILE domains
from the SDM.

> Give them something vague in the SDM, and you've created a monster
> that is interpreted differently by different hardware teams and no
> validation team on the planet can figure out if the hardware is
> correct or not.  Then the definition becomes how the OS (possibly not
> Linux) happened to use that interface on some past chip -- and that
> use is not documented in the SDM -- and down the rabbit hole you go...
>
> When the SDM precisely documents the software/hardware interface, then
> proper tests can be written, independent hardware teams are forced to
> follow the same definition, and correct software can be written once
> and never break.

I agree with that sentiment in principle, but you lost this battle
already because there is hardware which enumerates MODULE.

[    0.212535] CPU topo: Thread    :     8
[    0.212537] CPU topo: Core      :     8
[    0.212539] CPU topo: Module    :     2

So what are we going to do about that? Just pretend that it does not
exist?

Sure, we can do that. But I'm also 100% sure, that there is a meaning
which goes beyond the pure physical description of the CPU.

Yes, this description is not there today, but we still have to utilize
the enumeration level for evaluating the APIC ID bit range which defines
the package ID space.

That does not mean that we have to put a meaning on that level by doing
wild guesses. That'd be completely wrong and would cause the problems
you are so worried about.

Why has this even still to be debated? 0x1F support was merged more than
four years ago and we are now indulging in debating academic problems of
unknown levels appearing out of the blue in systems which ship today.

There is a meaning to the existing levels, so time is better spent to
get that documented.

Thanks,

        tglx

  reply	other threads:[~2023-08-17  9:10 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-02 10:20 [patch V3 00/40] x86/cpu: Rework the topology evaluation Thomas Gleixner
2023-08-02 10:20 ` [patch V3 01/40] cpu/SMT: Make SMT control more robust against enumeration failures Thomas Gleixner
2023-08-04 17:50   ` Borislav Petkov
2023-08-04 20:01     ` Thomas Gleixner
2023-08-02 10:21 ` [patch V3 02/40] x86/apic: Fake primary thread mask for XEN/PV Thomas Gleixner
2023-08-04 18:12   ` Borislav Petkov
2023-08-04 20:02     ` Thomas Gleixner
2023-08-02 10:21 ` [patch V3 03/40] x86/cpu: Encapsulate topology information in cpuinfo_x86 Thomas Gleixner
2023-08-02 10:21 ` [patch V3 04/40] x86/cpu: Move phys_proc_id into topology info Thomas Gleixner
2023-08-02 10:21 ` [patch V3 05/40] x86/cpu: Move cpu_die_id " Thomas Gleixner
2023-08-09 14:32   ` Zhang, Rui
2023-08-09 15:14     ` Thomas Gleixner
2023-08-02 10:21 ` [patch V3 06/40] scsi: lpfc: Use topology_core_id() Thomas Gleixner
2023-08-02 10:21 ` [patch V3 07/40] hwmon: (fam15h_power) " Thomas Gleixner
2023-08-02 10:21 ` [patch V3 08/40] x86/cpu: Move cpu_core_id into topology info Thomas Gleixner
2023-08-02 10:21 ` [patch V3 09/40] x86/cpu: Move cu_id " Thomas Gleixner
2023-08-02 10:21 ` [patch V3 10/40] x86/cpu: Remove pointless evaluation of x86_coreid_bits Thomas Gleixner
2023-08-02 10:21 ` [patch V3 11/40] x86/cpu: Move logical package and die IDs into topology info Thomas Gleixner
2023-08-02 10:21 ` [patch V3 12/40] x86/cpu: Move cpu_l[l2]c_id " Thomas Gleixner
2023-08-02 10:21 ` [patch V3 13/40] x86/apic: Use BAD_APICID consistently Thomas Gleixner
2023-08-02 10:21 ` [patch V3 14/40] x86/apic: Use u32 for APIC IDs in global data Thomas Gleixner
2023-08-02 10:21 ` [patch V3 15/40] x86/apic: Use u32 for check_apicid_used() Thomas Gleixner
2023-08-02 10:21 ` [patch V3 16/40] x86/apic: Use u32 for cpu_present_to_apicid() Thomas Gleixner
2023-08-02 10:21 ` [patch V3 17/40] x86/apic: Use u32 for phys_pkg_id() Thomas Gleixner
2023-08-02 10:21 ` [patch V3 18/40] x86/apic: Use u32 for [gs]et_apic_id() Thomas Gleixner
2023-08-02 10:21 ` [patch V3 19/40] x86/apic: Use u32 for wakeup_secondary_cpu[_64]() Thomas Gleixner
2023-08-10  7:58   ` Qiuxu Zhuo
2023-08-02 10:21 ` [patch V3 20/40] x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids Thomas Gleixner
2023-08-02 10:21 ` [patch V3 21/40] x86/cpu: Provide debug interface Thomas Gleixner
2023-08-02 10:21 ` [patch V3 22/40] x86/cpu: Provide cpuid_read() et al Thomas Gleixner
2023-08-02 10:21 ` [patch V3 23/40] x86/cpu: Provide cpu_init/parse_topology() Thomas Gleixner
2023-08-04  8:14   ` K Prateek Nayak
2023-08-04  8:28     ` Thomas Gleixner
2023-08-04  8:34       ` K Prateek Nayak
2023-08-12  6:41   ` Zhang, Rui
2023-08-12  8:00     ` Zhang, Rui
2023-08-14  6:26       ` Thomas Gleixner
2023-08-14  7:11         ` Zhang, Rui
2023-08-13 13:30     ` Zhang, Rui
2023-08-13 14:36       ` Thomas Gleixner
2023-08-14  6:20         ` Thomas Gleixner
2023-08-14  6:42           ` Zhang, Rui
2023-08-02 10:21 ` [patch V3 24/40] x86/cpu: Add legacy topology parser Thomas Gleixner
2023-08-02 10:21 ` [patch V3 25/40] x86/cpu: Use common topology code for Centaur and Zhaoxin Thomas Gleixner
2023-08-02 10:21 ` [patch V3 26/40] x86/cpu: Move __max_die_per_package to common.c Thomas Gleixner
2023-08-02 10:21 ` [patch V3 27/40] x86/cpu: Provide a sane leaf 0xb/0x1f parser Thomas Gleixner
2023-08-12  8:21   ` Zhang, Rui
2023-08-12 20:04     ` Thomas Gleixner
2023-08-13 15:04       ` Thomas Gleixner
2023-08-14  8:25         ` Zhang, Rui
2023-08-14 12:26           ` Thomas Gleixner
2023-08-14 14:48             ` Brown, Len
2023-08-14 16:13               ` Thomas Gleixner
2023-08-15 19:30                 ` Brown, Len
2023-08-17  9:09                   ` Thomas Gleixner [this message]
2023-08-18  5:01                     ` Brown, Len
2023-08-21 10:27                       ` Thomas Gleixner
2023-08-30  2:46                         ` Brown, Len
2023-08-30 12:39                           ` Thomas Gleixner
2023-09-01  3:09                             ` Brown, Len
2023-09-01  7:45                               ` Thomas Gleixner
2023-08-14 15:28             ` Zhang, Rui
2023-08-14 17:12               ` Thomas Gleixner
2023-08-02 10:21 ` [patch V3 28/40] x86/cpu: Use common topology code for Intel Thomas Gleixner
2023-08-02 10:21 ` [patch V3 29/40] x86/cpu/amd: Provide a separate accessor for Node ID Thomas Gleixner
2023-08-02 10:21 ` [patch V3 30/40] x86/cpu: Provide an AMD/HYGON specific topology parser Thomas Gleixner
2023-08-02 19:28   ` Michael Kelley (LINUX)
2023-08-02 19:46     ` Thomas Gleixner
2023-08-02 19:51   ` [patch V3a " Thomas Gleixner
2023-08-11 12:58     ` Pu Wen
2023-08-11 17:11       ` Thomas Gleixner
2023-08-12  3:58         ` Pu Wen
2023-10-13  9:38           ` [tip: x86/core] x86/cpu/hygon: Fix the CPU topology evaluation for real tip-bot2 for Pu Wen
2023-08-02 10:21 ` [patch V3 31/40] x86/smpboot: Teach it about topo.amd_node_id Thomas Gleixner
2023-08-02 10:21 ` [patch V3 32/40] x86/cpu: Use common topology code for AMD Thomas Gleixner
2023-08-02 10:21 ` [patch V3 33/40] x86/cpu: Use common topology code for HYGON Thomas Gleixner
2023-08-11 13:00   ` Pu Wen
2023-08-11 17:11     ` Thomas Gleixner
2023-08-02 10:21 ` [patch V3 34/40] x86/mm/numa: Use core domain size on AMD Thomas Gleixner
2023-08-02 10:21 ` [patch V3 35/40] x86/cpu: Make topology_amd_node_id() use the actual node info Thomas Gleixner
2023-08-02 10:21 ` [patch V3 36/40] x86/cpu: Remove topology.c Thomas Gleixner
2023-08-02 10:21 ` [patch V3 37/40] x86/cpu: Remove x86_coreid_bits Thomas Gleixner
2023-08-02 10:21 ` [patch V3 38/40] x86/apic: Remove unused phys_pkg_id() callback Thomas Gleixner
2023-08-02 10:21 ` [patch V3 39/40] x86/xen/smp_pv: Remove cpudata fiddling Thomas Gleixner
2023-08-02 10:22 ` [patch V3 40/40] x86/apic/uv: Remove the private leaf 0xb parser Thomas Gleixner
2023-08-02 11:56 ` [patch V3 00/40] x86/cpu: Rework the topology evaluation Juergen Gross
2023-08-03  0:07 ` Sohil Mehta
2023-08-03  3:44 ` Michael Kelley (LINUX)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87fs4ighln.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=andrew.cooper3@citrix.com \
    --cc=arjan@linux.intel.com \
    --cc=dimitri.sivanich@hpe.com \
    --cc=jgross@suse.com \
    --cc=len.brown@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mikelley@microsoft.com \
    --cc=ray.huang@amd.com \
    --cc=rui.zhang@intel.com \
    --cc=thomas.lendacky@amd.com \
    --cc=wei.liu@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox