From: David Hildenbrand <david@redhat.com>
To: Hannes Reinecke <hare@suse.de>, Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Hannes Reinecke <hare@kernel.org>
Subject: Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
Date: Mon, 28 Jul 2025 17:15:02 +0200	[thread overview]
Message-ID: <09794c70-06a2-44dc-8e54-bc6e6a7d6c74@redhat.com> (raw)
In-Reply-To: <f859e5c3-7c96-4d97-a447-75070813450c@suse.de>

On 28.07.25 11:37, Hannes Reinecke wrote:
> On 7/28/25 11:10, David Hildenbrand wrote:
>> On 28.07.25 11:04, Michal Hocko wrote:
>>> On Mon 28-07-25 10:53:08, David Hildenbrand wrote:
>>>> On 28.07.25 10:48, Michal Hocko wrote:
>>>>> On Mon 28-07-25 10:15:47, Oscar Salvador wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Currently, we have several mechanisms to pick a zone for the new
>>>>>> memory we are onlining. Eventually, we land in zone_for_pfn_range(),
>>>>>> which picks the zone.
>>>>>>
>>>>>> Two of these mechanisms are 'movable_node' and the 'auto-movable'
>>>>>> policy. The former puts every single piece of hotplugged memory in
>>>>>> ZONE_MOVABLE (unless we can keep zones contiguous by not doing so),
>>>>>> while the latter puts it in ZONE_MOVABLE only if we stay within the
>>>>>> established MOVABLE:KERNEL ratio.
>>>>>>
>>>>>> It seems the latter doesn't play well with CXL memory: CXL cards hold
>>>>>> really large amounts of memory, making the ratio check fail, and since
>>>>>> a CXL card must be removed as a unit, that can't be done if any of its
>>>>>> memory blocks fell into a !ZONE_MOVABLE zone.
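(For illustration, the ratio check boils down to something like the
sketch below. This is simplified and not the kernel's actual code; the
real logic lives around auto_movable_can_online_movable() in
mm/memory_hotplug.c, and auto_movable_ratio is a percentage -- 301 by
default, i.e. roughly 3:1 MOVABLE:KERNEL.)

  /*
   * Simplified sketch of the "auto-movable" ratio check; details and
   * naming are approximate, not copied from the kernel.
   */
  #include <stdbool.h>

  static unsigned long auto_movable_ratio = 301;      /* percent */

  bool can_online_movable(unsigned long kernel_pages,
                          unsigned long movable_pages,
                          unsigned long nr_pages)
  {
          /* Allow at most kernel_pages * ratio% in ZONE_MOVABLE. */
          unsigned long max_movable = kernel_pages * auto_movable_ratio / 100;

          return movable_pages + nr_pages <= max_movable;
  }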
>>>>>
>>>>> I suspect this is just an example of how our existing memory hotplug
>>>>> interface based on memory blocks is suboptimal and doesn't fit new
>>>>> use cases. We should start thinking about what a new v2 API should
>>>>> look like. I am not sure of the details yet, but I believe we should
>>>>> be able to express a "device" as a whole rather than having very
>>>>> loosely bound generic memory blocks. Anyway, this is likely a longer
>>>>> discussion and a long-term plan rather than addressing this particular
>>>>> issue.
>>>>
>>>> We have that concept with memory groups in the kernel already.
>>>
>>> I must have missed that. I will have a look, thanks! Do we have any
>>> documentation for that? Memory group is an overloaded term in the
>>> kernel.
>>
>> It's an internal concept so far; the grouping is not exposed to user space.
>>
>> We have kerneldoc for, e.g., "struct memory_group". From there:
>>
>> "A memory group logically groups memory blocks; each memory block
>> belongs to at most one memory group. A memory group corresponds to a
>> memory device, such as a DIMM or a NUMA node, which spans multiple
>> memory blocks and might even span multiple non-contiguous physical
>> memory ranges."
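(Abridged, the structure looks roughly like the following; the field
names below are quoted from memory, so check include/linux/memory.h for
the authoritative definition.)

  /* Abridged sketch of struct memory_group; see include/linux/memory.h. */
  struct memory_group {
          int nid;                             /* node all blocks belong to */
          struct list_head memory_blocks;      /* blocks in this group */
          unsigned long present_kernel_pages;  /* onlined to !ZONE_MOVABLE */
          unsigned long present_movable_pages; /* onlined to ZONE_MOVABLE */
          bool is_dynamic;                     /* static (DIMM, dax/kmem) vs.
                                                  dynamic (virtio-mem) */
          union {
                  struct {
                          unsigned long max_pages;  /* static: max group size */
                  } s;
                  struct {
                          unsigned long unit_pages; /* dynamic: unit size */
                  } d;
          };
  };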
>>
>>>
>>>> In dax/kmem we register a static memory group. It will be considered one
>>>> unit.
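(Roughly -- names and signatures here are from memory and should be
double-checked against include/linux/memory.h and drivers/dax/kmem.c --
the registration looks like the sketch below, with error handling
trimmed.)

  /*
   * Sketch of how a driver like dax/kmem ties hotplugged ranges to a
   * single static memory group; simplified, kernel-internal API.
   */
  #include <linux/memory.h>
  #include <linux/memory_hotplug.h>
  #include <linux/pfn.h>

  static int kmem_add_range(int numa_node, u64 start, u64 len,
                            const char *res_name)
  {
          int mgid, rc;

          /* One static group spanning the whole device ("one unit"). */
          mgid = memory_group_register_static(numa_node, PHYS_PFN(len));
          if (mgid < 0)
                  return mgid;

          /*
           * With MHP_NID_IS_MGID, the nid argument carries the group id.
           * dax/kmem can additionally pass MHP_MEMMAP_ON_MEMORY for a
           * self-hosted memmap.
           */
          rc = add_memory_driver_managed(mgid, start, len, res_name,
                                         MHP_NID_IS_MGID);
          if (rc)
                  memory_group_unregister(mgid);
          return rc;
  }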
>>>
>>> But we still do export those memory blocks and let udev or whoever act
>>> on those right? If that is the case then ....
>>
>> Yes.
>>
>>>
>>> [...]
>>>
>>>> daxctl wants to online memory itself. We want to keep that memory
>>>> offline from a kernel perspective and let daxctl handle it in this case.
>>>>
>>>> We have that problem in RHEL where we currently require user space to
>>>> disable udev rules so daxctl "can win".
>>>
>>> ... this is the result. Those shouldn't really race. If udev is supposed
>>> to see the device at all, then only in its entirety, so the regular
>>> memory-block-based onlining rules shouldn't even see that memory. Or am
>>> I completely missing the picture?
>>
>> We can't break user space, which relies on individual memory blocks.
>>
>> So udev or $whatever will right now see individual memory blocks. We
>> could export the group id to user space if that is of any help, but at
>> least for daxctl purposes, it will be sufficient to identify "oh, this
>> was added by dax/kmem" (which we can obtain from /proc/iomem) and say
>> "okay, I'll let user-space deal with it."
>>
>> Having the whole thing exposed as a unit is not really solving any
>> problems unless I am missing something important.
>>
> Basically it boils down to:
> Who should be responsible for onlining the memory?
> 
> As it stands we have two methods:
> - user-space as per sysfs attributes
> - kernel policy
> 
> And to make matters worse, we have two competing user-space programs:
> - udev
> - daxctl
> neither of which is (or can be made) aware of the other.
> This leads to races and/or inconsistencies.
> 
> As we've seen, the current kernel policy (cf. the 'ratio' discussion)
> doesn't really fit how users expect CXL to work, so one is tempted to
> not have the kernel do the onlining at all. But then the user is caught
> in the udev vs. daxctl race, requiring awkward kludges on either side.
>
> Can't we make daxctl aware of udev? I.e., update daxctl to call out to
> udev and just wait for udev to complete its thing?
> At worst we're running into a timeout if some udev rules are garbage,
> but daxctl will be able to see the final state and we would avoid
> the need for modifying and/or moving udev rules.
> (Which, incidentally, is required on SLES, too :-)
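(Whoever ends up responsible, the actual onlining that both the udev
rule and daxctl perform is just a write to the block's sysfs "state"
file, roughly as below; the block id 32 is only an example.)

  /*
   * Minimal sketch: online one memory block via sysfs, the same
   * interface a udev rule or daxctl ultimately uses (see
   * Documentation/admin-guide/mm/memory-hotplug.rst).
   */
  #include <stdio.h>

  static int set_block_state(unsigned int block_id, const char *state)
  {
          char path[128];
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/devices/system/memory/memory%u/state", block_id);
          f = fopen(path, "w");
          if (!f)
                  return -1;
          /* "online", "online_kernel", "online_movable" or "offline" */
          fprintf(f, "%s\n", state);
          return fclose(f);
  }

  int main(void)
  {
          return set_block_state(32, "online_movable") ? 1 : 0;
  }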

I will try moving away from udev for memory onlining completely in RHEL
-- let's see if I succeed ;)

We really want to make use of auto-onlining in the kernel where
possible, and only online manually from user space in a handful of cases
(e.g., CXL, standby memory on s390x). Configuring that auto-onlining is
still something the admin has to do by hand, and that's really the nasty
bit.
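(The knobs involved, as far as I can tell, are the "memhp_default_state="
boot parameter and the global /sys/devices/system/memory/auto_online_blocks
file; e.g., switching the default at runtime:)

  /*
   * Sketch: set the kernel's auto-onlining default at runtime. The
   * equivalent boot-time setting is "memhp_default_state=online_movable"
   * (see Documentation/admin-guide/kernel-parameters.txt).
   */
  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/sys/devices/system/memory/auto_online_blocks", "w");

          if (!f)
                  return 1;
          /* "offline", "online", "online_kernel" or "online_movable" */
          fprintf(f, "online_movable\n");
          return fclose(f) ? 1 : 0;
  }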


> 
> Discussion point for LPC?

Yes, probably.

-- 
Cheers,

David / dhildenb


