All of lore.kernel.org
 help / color / mirror / Atom feed
From: Igor Mammedov <imammedo@redhat.com>
To: "Huang, FangSheng (Jerry)" <FangSheng.Huang@amd.com>
Cc: <qemu-devel@nongnu.org>, <david@redhat.com>, <gourry@gourry.net>,
	<jonathan.cameron@huawei.com>, <apopple@nvidia.com>,
	<dan.j.williams@intel.com>, <Zhigang.Luo@amd.com>,
	<Lianjie.Shi@amd.com>, David Hildenbrand <david@kernel.org>
Subject: Re: [PATCH v7 1/1] numa: add 'memmap-type' option for memory type configuration
Date: Fri, 15 May 2026 15:04:53 +0200	[thread overview]
Message-ID: <20260515150453.409e0e3e@imammedo> (raw)
In-Reply-To: <d26cfb4b-d70e-415c-8747-904d94f8ccb2@amd.com>

On Fri, 15 May 2026 15:53:07 +0800
"Huang, FangSheng (Jerry)" <FangSheng.Huang@amd.com> wrote:

> On 5/14/2026 9:05 PM, Igor Mammedov wrote:
> > [You don't often get email from imammedo@redhat.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> > 
> > On Fri, 6 Mar 2026 16:27:35 +0800
> > fanhuang <FangSheng.Huang@amd.com> wrote:
> >   
> >> Add a 'memmap-type' option to NUMA node configuration that allows
> >> specifying the memory type for a NUMA node.
> >>
> >> Supported values:
> >>    - normal:   Regular system RAM (E820 type 1, default)
> >>    - spm:      Specific Purpose Memory (E820 type 0xEFFFFFFF)
> >>    - reserved: Reserved memory (E820 type 2)
> >>
> >> The 'spm' type indicates Specific Purpose Memory - a hint to the guest
> >> that this memory might be managed by device drivers based on guest policy.
> >> The 'reserved' type marks memory as not usable as RAM.
> >>
> >> Note: This option is only supported on x86 platforms.
> >>
> >> Usage:
> >>    -numa node,nodeid=1,memdev=m1,memmap-type=spm  
> > 
> > in short:
> >    don't do it this way
> >    I'm against merging it as is, till you convince me otherwise.
> > 
> > more detailed answer:
> > 
> > * mandatory bashing chapter:
> > 
> > the more i look at it, the hackier this approach looks to me,
> > and what even worse that nonsense propagates to firmware.
> > 
> > Judging by commit message, the goal is to expose some RAM as
> > E820 SPM, to guest (that's it).
> > 
> > You however picked -numa node as a way to achieve that,
> > and then hack the numa code not to generate numa data for it (SRAT)
> > and massage e820 to exclude SPM from  RAM entries.
> > 
> > But at this stage I don't really see a good justification for hack(s)
> > this patch introduces (it's definitely is not in commit message not cover letter).
> > 
> > And until alternative approach is not explored and proved to be worse,
> > I'm against merging this patch.
> > 
> > * suggestion chapter:
> > 
> > I don't recall but I likely asked before
> > why not use device memory instead for it (aka DIMM device or some device derived
> > from device memory object and then add e820 entry for it).
> > 
> > It would be a way more simpler approach and impl. without need to resplit
> > anything in e820.
> > And no need for messing with firmware (SeaBIOS: RamSizeOver4G patch) nor EDK2.
> > 
> >  
> 
> Hi Igor,
> 
> Thanks for taking the time to review this -- and for the candor in
> the bashing chapter.  Before going into the bigger picture, let me
> re-establish one factual point that v7 didn't carry forward from
> the v6 cover letter.

feel free to bash my review as well, I hope that we end up with
clear picture what and why we are doing.

> 
> On SRAT generation:
> 
> v7 only suppresses SRAT for memmap-type=reserved.  memmap-type=spm
> nodes get a normal SRAT Memory Affinity entry.  This was shown
> explicitly in the v6 cover letter, which v7 didn't carry forward
> since v7 is a single-patch series.  For the spm case:
> 
>      [    0.042582] ACPI: SRAT: Node 1 PXM 1 [mem 0x280000000-0x47fffffff]
> 
> Full transcript with all three memmap-type variants side by side:
> https://lore.kernel.org/qemu-devel/20260226105023.256568-1-FangSheng.Huang@amd.com/
> 
> The bigger picture -- real-world context that drove the design:

bigger picture should be somewhere in commit message so later on
a reader could understand why we are doing it at all/this way.
 
lets continue with questions wrt impl.

> The use case is GPU/accelerator HBM exposed to the OS as SPM.  On
> bare metal, the platform firmware:
> 
>    - emits E820 type 0xEFFFFFFF (SOFT_RESERVED) for the HBM region;
>    - emits ACPI SRAT memory affinity entries that bind HBM to a
>      dedicated proximity domain (NUMA node);
>    - tags the accelerator's PCI device with _PXM matching that node.
> 
> That gives the device driver a stable lookup chain at runtime:
> 
>      dev -> pci_dev_to_node(dev) -> SRAT walk -> HBM GPA range

it looks kind of convoluted, isn't it.
PCI devices were supposed to be self describing/discoverable.
Preferably without above mentioned firmware 'hooks'.
Above example could be just early impl. issues, rather than by
design issue.

> NUMA node here is not incidental -- it is the OS-exposed
> intermediary ID that the device driver uses to find its own HBM.
> This is the in-tree path used by accelerator drivers today.

I'm assuming GPU is exposed as some composite PCI/CXL device.
and use-case is its pass-through to guest.

Perhaps we can't do anything about it now.
But shouldn't device driver discover its own memory (HBM and what not)
without external parties that magically gain knowledge about parts of
device that driver supposedly driving the device has not a clue about? 
How doesn bios know about SPM when device's driver with knowledge of
device internals knows nothing about?
 
> The "-numa node + memmap-type=spm + E820 SOFT_RESERVED" combo in
> v7 is a direct 1:1 model of this BM topology.  The E820 retyping
> in the patch is exactly what makes the guest-visible E820 match
> what BM firmware emits for the same kind of region.
> 
> On the DIMM / device-memory alternative:

wrt modeling GPU pass-through, my 1st attempt would be
to make -device gpu-foo take everything need to compose the device
(like in real hw) and be done with it (and PCI/CXL machinery would
take care of mapping/exposing memory to guest).
Why we aren't doing it?

barring that, and assuming we have to pass SPM as a separate memory
(why and why it should be exposed in E820 and at boot time only?)
I'd try -device foo-memory approach.

> David pointed this out in the v6 thread, and Gregory's reply in
> this thread reinforces the same point -- DIMM / NVDIMM ranges are
> described in E820 only as the hotplug area.  SPM needs to be in
> the boot E820 from the start so the OS classifies it as SP and
> treats it accordingly.  Going via DIMM would also detach the
> memory from the NUMA topology (no SRAT entry tied to the device's
> _PXM), which breaks the dev -> node -> SRAT -> HBM lookup the
> driver relies on.

Where we should bend modeling to driver behavior is questionable.
But I don't know nearly enough about subj, it could be parallel discussion.
But we need capture 'why' somewhere in commit message, to give
a justification for going pass-through as a separate memory approach.
 
For now lets leave it alone.

wrt my suggestion using memory-device.
It's true that the device memory region has started as hotpluggable memory.
But that's impl. detail, nothing fundamentally prevents us from
describing mix of present at boot time memory devices within it in e820/SRAT.
 
Answer to why DIMMs aren't in e820 was for us to avoid dealing with
linux kernel putting that memory into zone_normal instead of zone_movable.
On real hardware, one is likely to see all present at boot dimms, in e820
and SRAT.
For already existing memory devices, I'd like us continue dodging e820,
so we wouldn't break existing deployments. however for a new memory device
we don't have such limitations.

What I'd try is:
 1: inherit spm-memory device from memory-device
       (all memory mapping and APCI memory device descriptors, can be made
        to pick it along with DIMM devices) 
 2: figure out why device driver has to fetch memory map
   and proximity from static tables as opposed to getting it dynamically
   from _PXM -> maped-memory range. (at the time PCI devices enum runs, all ACPI
   info incl. run time one is fully accessible to in-kernel users)
   i.e. try to make driver work with runtime proximity
 3. if #2 is impossible, we can try to expose SPM memory devices in e820,
     and partition SRAT to match actual device_memory region layout. 

 
> Happy to dig into any of this further, or to reshape parts you
> still see as too hacky.
> 
> Best regards,
> FangSheng Huang (Jerry)
> >   
> 



  reply	other threads:[~2026-05-15 13:05 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-06  8:27 [PATCH v7 0/1] numa: add 'memmap-type' option for memory type configuration fanhuang
2026-03-06  8:27 ` [PATCH v7 1/1] " fanhuang
2026-05-14 13:05   ` Igor Mammedov
2026-05-14 13:38     ` Gregory Price
2026-05-15  7:53     ` Huang, FangSheng (Jerry)
2026-05-15 13:04       ` Igor Mammedov [this message]
2026-03-13  8:30 ` [PATCH v7 0/1] " Huang, FangSheng (Jerry)
2026-03-13 15:18   ` Gregory Price
2026-03-13 16:14     ` Jonathan Cameron via qemu development
2026-03-16  7:17       ` Huang, FangSheng (Jerry)
2026-04-27  8:47         ` Huang, FangSheng (Jerry)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260515150453.409e0e3e@imammedo \
    --to=imammedo@redhat.com \
    --cc=FangSheng.Huang@amd.com \
    --cc=Lianjie.Shi@amd.com \
    --cc=Zhigang.Luo@amd.com \
    --cc=apopple@nvidia.com \
    --cc=dan.j.williams@intel.com \
    --cc=david@kernel.org \
    --cc=david@redhat.com \
    --cc=gourry@gourry.net \
    --cc=jonathan.cameron@huawei.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.