qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Gavin Shan <gshan@redhat.com>
To: Igor Mammedov <imammedo@redhat.com>, Andrew Jones <drjones@redhat.com>
Cc: robh@kernel.org, ehabkost@redhat.com, peter.maydell@linaro.org,
	qemu-devel@nongnu.org, qemu-arm@nongnu.org, shan.gavin@gmail.com
Subject: Re: [PATCH 1/2] numa: Set default distance map if needed
Date: Wed, 13 Oct 2021 10:32:18 +1100	[thread overview]
Message-ID: <eed6bdaf-e52e-0bc6-7669-7a65e75a4b08@redhat.com> (raw)
In-Reply-To: <20211012155321.256e8867@redhat.com>

Hi Igor,

On 10/13/21 12:53 AM, Igor Mammedov wrote:
> On Tue, 12 Oct 2021 15:13:08 +0200
> Andrew Jones <drjones@redhat.com> wrote: 
>> On Tue, Oct 12, 2021 at 02:27:54PM +0200, Igor Mammedov wrote:
>>> On Tue, 12 Oct 2021 12:37:54 +0200
>>> Andrew Jones <drjones@redhat.com> wrote:   
>>>> On Tue, Oct 12, 2021 at 11:40:16AM +0200, Igor Mammedov wrote:
>>>>> On Wed,  6 Oct 2021 18:22:08 +0800
>>>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>>      
>>>>>> The following option is used to specify the distance map. It's
>>>>>> possible the option isn't provided by user. In this case, the
>>>>>> distance map isn't populated and exposed to platform. On the
>>>>>> other hand, the empty NUMA node, where no memory resides, is
>>>>>> allowed on ARM64 virt platform. For these empty NUMA nodes,
>>>>>> their corresponding device-tree nodes aren't populated, but
>>>>>> their NUMA IDs should be included in the "/distance-map"
>>>>>> device-tree node, so that kernel can probe them properly if
>>>>>> device-tree is used.
>>>>>>
>>>>>>    -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance>
>>>>>>
>>>>>> So when user doesn't specify distance map, we need to generate
>>>>>> the default distance map, where the local and remote distances
>>>>>> are 10 and 20 separately. This adds an extra parameter to the
>>>>>> exiting complete_init_numa_distance() to generate the default
>>>>>> distance map for this case.
>>>>>>
>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>>
>>>>>
>>>>> how about error-ing out if distance map is required but
>>>>> not provided by user explicitly and asking user to fix
>>>>> command line?
>>>>>
>>>>> Reasoning behind this that defaults are hard to maintain
>>>>> and will require compat hacks and being raod blocks down
>>>>> the road.
>>>>> Approach I was taking with generic NUMA code, is deprecating
>>>>> defaults and replacing them with sanity checks, which bail
>>>>> out on incorrect configuration and ask user to correct command line.
>>>>> Hence I dislike approach taken in this patch.
>>>>>
>>>>> If you really wish to provide default, push it out of
>>>>> generic code into ARM specific one
>>>>> (then I won't oppose it that much (I think PPC does
>>>>> some magic like this))
>>>>> Also behavior seems to be ARM specific so generic
>>>>> NUMA code isn't a place for it anyways
>>>>
>>>> The distance-map DT node and the default 10/20 distance-map values
>>>> aren't arch-specific. RISCV is using it too.
>>>>
>>>> I'm on the fence with this. I see erroring-out to require users
>>>> to provide explicit command lines as a good thing, but I also
>>>> see it as potentially an unnecessary burden for those that want
>>>> the default map anyway. The optional nature of the distance-map
>>>> node and the specification of the default map is here [1]
>>>>
>>>> [1] Linux source: Documentation/devicetree/bindings/numa.txt
>>>
>>> Looking at proposed linux patches [ https://lkml.org/lkml/2021/9/27/31 ],
>>> using optional distance table as source for numa-node-ids,
>>> looks like a hack around kernel's inability to fish them out
>>> from CPU &| PCI nodes (using those nodes as source should
>>> cover memory-less node use-case).
>>>
>>> I consider including optional node as a policy decision.
>>> So user shall include it explicitly on QEMU command line
>>> if necessary (that works just fine for x86), or guest OS
>>> can make up defaults on its own in absence of data.
>>
>> OK, so erroring-out on configs that must provide distance-maps, rather
>> than automatically generating them for all configs is better.
>>
>>>    
>>>> So, my r-b stands for this patch, but I also wouldn't complain
>>>> about respinning it to error out instead.
>>>    
>>>> I would complain about
>>>> moving the logic to Arm specific code, though, since RISCV would
>>>> then need to duplicate it.
>>>
>>> Instead of putting workaround in QEMU and then making them generic,
>>> I'd prefer to:
>>>   1. make QEMU to be able generate DT with memory-less nodes
>>
>> How? DT syntax doesn't allow this, because each node needs a unique
>> name which is derived from its base address, which an empty numa
> you are talking about memory@foo nodes, aren't you?
> 
>> node doesn't have.
> 
> Looking at Documentation/devicetree/bindings/numa.txt
> 
> mem/cpu/pci nodes also contain numa-node-id attribute,
> so idea is to collect IDs from all present sources
> instead of abusing distance map.
>   
> That would allow QEMU to skip memory@foo elements for
> memory-less nodes because they obviously do not exist
> and there is no way to describe them using 'memory' nodes.
> 

I don't think it's feasible because it's hard to elaborate NUMA node IDs
from this sort of sources. Apart from mem/cpu/pci, the NUMA node IDs
can be included into platform devices, which could be vendor specific
sometimes. Other type of devices, which I don't know, could include
NUMA node IDs either.

Besides, things become more complicated when hotplug is considered.
For example, the hot-added CPU is associated with a non-existing
NUMA node. The CPU hot-add fails until the associated NUMA node
is initialized. This means CPU/mem hotplug have to be twisted.

So the point is to elaborate the NUMA node IDs from the limited
source: mem/cpu/distance-map. The distance-map is optional in
current Linux implementation.

>>>   2. fix guest to get numa-node-id from CPU/PCI nodes if
>>>      memory node isn't present,
>>
>> I'm not sure that's possible with DT. If it is, then proposing it
>> upstream to Linux DT maintainers would be the next step.
> Added Rob to CC.
> 

As explained above.

>>
>>> or use ACPI tables which can
>>>      describe memory-less NUMA nodes if fixing how DT is
>>>      parsed unfeasible.
>>
>> We use ACPI already for our guests, but we also generate a DT (which
>> edk2 consumes). We can't generate a valid DT when empty numa nodes
> does edk2 actually uses numa info from QEMU?
> 
>> are put on the command line unless we follow a DT spec saying how
>> to do that. The current spec says we should have a distance-map
>> that contains those nodes.
> 
> can you point out to the spec and place within it, pls?
>

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20211012&id=58ae0b51506802713aa0e9956d1853ba4c722c98
("Documentation, dt, numa: Add note to empty NUMA node")

Thanks,
Gavin



  reply	other threads:[~2021-10-12 23:33 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-06 10:22 [PATCH 0/2] hw/arm/virt: Fix qemu booting failure on device-tree Gavin Shan
2021-10-06 10:22 ` [PATCH 1/2] numa: Set default distance map if needed Gavin Shan
2021-10-06 10:35   ` Andrew Jones
2021-10-06 11:03     ` Gavin Shan
2021-10-06 11:56       ` Andrew Jones
2021-10-07 23:51         ` Gavin Shan
2021-10-08  6:07           ` Andrew Jones
2021-10-12  6:13             ` Gavin Shan
2021-10-12  9:40   ` Igor Mammedov
2021-10-12 10:31     ` Gavin Shan
2021-10-12 11:18       ` Igor Mammedov
2021-10-12 11:48       ` Andrew Jones
2021-10-12 12:34         ` Igor Mammedov
2021-10-12 13:05           ` Andrew Jones
2021-10-12 22:59             ` Gavin Shan
2021-10-12 10:37     ` Andrew Jones
2021-10-12 12:27       ` Igor Mammedov
2021-10-12 13:13         ` Andrew Jones
2021-10-12 13:53           ` Igor Mammedov
2021-10-12 23:32             ` Gavin Shan [this message]
2021-10-13  9:32               ` Igor Mammedov
2021-10-13  6:29             ` Andrew Jones
2021-10-06 10:22 ` [PATCH 2/2] hw/arm/virt: Don't create device-tree node for empty NUMA node Gavin Shan
2021-10-06 10:36   ` Andrew Jones

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=eed6bdaf-e52e-0bc6-7669-7a65e75a4b08@redhat.com \
    --to=gshan@redhat.com \
    --cc=drjones@redhat.com \
    --cc=ehabkost@redhat.com \
    --cc=imammedo@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-arm@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=robh@kernel.org \
    --cc=shan.gavin@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).