From: Igor Mammedov <imammedo@redhat.com>
To: Eduardo Habkost <ehabkost@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>,
	mst@redhat.com, "Michal Privoznik" <mprivozn@redhat.com>,
	qemu-devel@nongnu.org, "Babu Moger" <babu.moger@amd.com>,
	pbonzini@redhat.com, rth@twiddle.net
Subject: Re: [PATCH v5 0/8] Remove EPYC mode apicid decode and use generic decode
Date: Thu, 27 Aug 2020 22:55:26 +0200	[thread overview]
Message-ID: <20200827225526.0b1f6d32@imammedo-mac>
In-Reply-To: <20200827190752.GK642093@habkost.net>

On Thu, 27 Aug 2020 15:07:52 -0400
Eduardo Habkost <ehabkost@redhat.com> wrote:

> On Thu, Aug 27, 2020 at 07:03:14PM +0200, Igor Mammedov wrote:
> > On Wed, 26 Aug 2020 16:03:40 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > 
> > > On Wed, Aug 26, 2020 at 04:02:58PM +0200, Igor Mammedov wrote:
> > > > On Wed, 26 Aug 2020 14:36:38 +0100
> > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > 
> > > > > On Wed, Aug 26, 2020 at 03:30:34PM +0200, Igor Mammedov wrote:
> > > > > > On Wed, 26 Aug 2020 13:50:59 +0100
> > > > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > >   
> > > > > > > On Wed, Aug 26, 2020 at 02:38:49PM +0200, Igor Mammedov wrote:  
> > > > > > > > On Fri, 21 Aug 2020 17:12:19 -0500
> > > > > > > > Babu Moger <babu.moger@amd.com> wrote:
> > > > > > > >     
> > > > > > > > > To support some of the complex topologies, we introduced EPYC mode apicid decode.
> > > > > > > > > But EPYC mode decode is running into problems, and it can become quite a
> > > > > > > > > maintenance burden in the future. So it was decided to remove that code and
> > > > > > > > > use the generic decode, which works for the majority of topologies. Most
> > > > > > > > > spec-compliant configurations will work just fine; some non-spec-compliant
> > > > > > > > > user inputs will produce a sub-optimal configuration.
> > > > > > > > > Here is the discussion thread.
> > > > > > > > > https://lore.kernel.org/qemu-devel/c0bcc1a6-1d84-a6e7-e468-d5b437c1b254@amd.com/
> > > > > > > > > 
> > > > > > > > > This series removes all the EPYC-mode-specific apicid changes and uses the
> > > > > > > > > generic apicid decode.
> > > > > > > > 
> > > > > > > > the main difference between EPYC and all other CPUs is that
> > > > > > > > it requires NUMA configuration (it's not optional),
> > > > > > > > so we need an extra patch on top of this series to enforce that, i.e.:
> > > > > > > > 
> > > > > > > >  if (epyc && !numa) {
> > > > > > > >      error("EPYC CPU requires NUMA to be configured");
> > > > > > > >  }
> > > > > > > 
> > > > > > > Please no. This will break 90% of current usage of the EPYC CPU in
> > > > > > > real-world QEMU deployments. That is way too user-hostile to introduce
> > > > > > > as a requirement.
> > > > > > > 
> > > > > > > Why do we need to force this? People have been successfully using
> > > > > > > EPYC CPUs without NUMA in QEMU for years now.
> > > > > > > 
> > > > > > > It might not match the behaviour of bare-metal silicon, but that hasn't
> > > > > > > obviously caused the world to come crashing down.
> > > > > > So far it produces a warning in the Linux kernel (RHBZ1728166),
> > > > > > and the resulting performance might be suboptimal, but I haven't
> > > > > > seen anyone reporting crashes yet.
> > > > > > 
> > > > > > 
> > > > > > What other options do we have?
> > > > > > Perhaps we can turn on the strict check for new machine types only,
> > > > > > so old configs can keep the broken topology (CPUID),
> > > > > > while new ones would require -numa and produce a correct topology.
> > > > > 
> > > > > No, tying this to machine types is not viable either. That is still
> > > > > going to break essentially every single management application that
> > > > > exists today using QEMU.
> > > > for that we have the deprecation process, so users could switch to the
> > > > new CLI that would be required.
> > > 
> > > We could, but I don't find the cost/benefit tradeoff compelling.
> > > 
> > > There are so many places where we diverge from what bare metal would
> > > do that I don't see a good reason to introduce this breakage, even
> > > if we notify users via a deprecation message.
> > I find (3) and (4) good enough reasons to use deprecation.
> > 
> > > If QEMU wants to require NUMA for EPYC, then QEMU could internally
> > > create a single NUMA node if none was specified for new machine
> > > types, such that there is no visible change or breakage to any
> > > mgmt apps.  
> > 
> > (1) for configs started without -numa and/or without -smp dies>1,
> >     QEMU can do just that (enable auto_enable_numa).
> 
> Why exactly do we need auto_enable_numa with dies=1?
> 
> If I understand correctly, Babu said earlier in this thread[1]
> that we don't need auto_enable_numa.
> 
> [1] https://lore.kernel.org/qemu-devel/11489e5f-2285-ddb4-9c35-c9f522d603a0@amd.com/

In the case of 1 die, -numa is not a must-have as it's one NUMA node only.
Though having auto_enable_numa will allow us to reuse the CPU.node-id property
to compose CPUID_Fn8000001E_ECX, i.e. to code only one path instead of
separate numa and non-numa variants.
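
For illustration, a minimal sketch of that single path (a hypothetical
helper, not actual QEMU code), assuming the AMD-documented layout of
Fn8000001E_ECX (node id in bits 7:0, nodes-per-processor minus 1 in
bits 10:8):

  /* compose CPUID_Fn8000001E_ECX from a system-wide node id and the
   * number of nodes per package */
  static uint32_t encode_fn8000001e_ecx(unsigned node_id,
                                        unsigned nodes_per_pkg)
  {
      return (node_id & 0xff) | (((nodes_per_pkg - 1) & 0x7) << 8);
  }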

 
> > (2) As for configs that are out of spec, I do not care much (junk in, junk out),
> > though not having to spend time on bug reports and debugging issues, only to
> > say it's not supported in the end, makes deprecation sound like a reasonable
> > choice.
> > 
> > (3) However, if a config matches bare metal, i.e. the CPU has more than 1 die
> > and stays within die limits (spec-wise), QEMU has to produce valid CPUs.
> > In this case QEMU can't make up multiple NUMA nodes and mappings of RAM/CPUs
> > on the user's behalf. That's where we have to error out and ask for an
> > explicit numa configuration.
> > 
> > For such configs, the current code (since 5.0) will in the best case produce
> > performance issues due to mismatching data in the APIC ID, CPUID and ACPI
> > tables; in the worst case, issues might be related to an invalid APIC ID if
> > running on an EPYC host where HW takes into account subfields of the APIC ID
> > (according to Babu, the real CPU uses die_id (aka node_id) internally).
> > I'd rather error out on nonsense configs earlier than debug such issues
> > and then error out anyway later (upsetting more users).
> > 
> 
> The requirements are not clear to me.  Is this just about making
> CPU die_id match the NUMA node ID, or are there additional
> constraints?
die_id is a per-socket NUMA node index, so it's not a NUMA node ID in
the sense we use it in QEMU
(that's where all the confusion started and led to the current code).

I understood that each die in an EPYC chip is a NUMA node, which encodes
the system-wide NUMA node ID in CPUID_Fn8000001E_ECX; that's why I
wrote earlier that EPYC makes -numa non-optional.
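
To illustrate the distinction with made-up numbers (assuming one NUMA
node per die): on a 2-socket part with 4 dies per socket, die_id runs
0..3 within each socket, while the id in CPUID_Fn8000001E_ECX is
system-wide, 0..7:

  /* illustration only, not QEMU code */
  static unsigned system_wide_node_id(unsigned socket_id, unsigned die_id,
                                      unsigned dies_per_socket)
  {
      /* e.g. socket 1, die 2 (of 4 per socket) -> node 6 */
      return socket_id * dies_per_socket + die_id;
  }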

In the case of only one die, we can either use auto_enable_numa to ensure
that we have consistent code, or special-case it and just hardcode the
CPUID_Fn8000001E_ECX value, which is hackish but will let us avoid
enabling numa (explicitly or implicitly).

In the case of multiple dies, CPUID_Fn8000001E_ECX (which encodes the number
of nodes plus the system-wide NUMA node id, judging by the CPUID of a real
EPYC machine) shall match the -numa mapping (otherwise it's a bug where
CPUID and ACPI mismatch).
Here we can go two ways:
  1) ask the user to provide a sane config with -numa (I'd prefer that)
     and use that info to fill in CPUID_Fn8000001E_ECX, as sketched below
  2) pretend that it's a non-numa machine, skip the ACPI SRAT table
     but make up CPUID_Fn8000001E (i.e. another special case,
     requiring another code path in addition to the -numa one)
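
For option 1, a hypothetical command line (made-up topology, just to show
the shape of a sane config): one socket, two dies, with CPUs explicitly
mapped to two NUMA nodes so that the -numa mapping and
CPUID_Fn8000001E_ECX can stay consistent:

  qemu-system-x86_64 -machine q35 -m 4G \
    -cpu EPYC \
    -smp 8,sockets=1,dies=2,cores=4,threads=1 \
    -numa node,nodeid=0,cpus=0-3 \
    -numa node,nodeid=1,cpus=4-7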



> 
> 
> > (4)
> > If I were a non-hobby user, I'd hate it if QEMU allowed me to start an
> > invalid config and then had to spend time debugging issues (including
> > performance ones), instead of QEMU clearly telling me what's wrong and how
> > the config should be corrected.
> > I'd probably jump to another hypervisor that does the job right,
> > instead of digging into the QEMU codebase and CPU specs to figure out how
> > to hack and configure it.
> > 
> 


