linux-mm.kvack.org archive mirror
* [RFC] Disable auto_movable_ratio for selfhosted memmap
@ 2025-07-28  8:15 Oscar Salvador
  2025-07-28  8:44 ` David Hildenbrand
  2025-07-28  8:48 ` Michal Hocko
  0 siblings, 2 replies; 24+ messages in thread
From: Oscar Salvador @ 2025-07-28  8:15 UTC (permalink / raw)
  To: david; +Cc: linux-mm, linux-kernel, Michal Hocko, Hannes Reinecke

Hi,

Currently, we have several mechanisms to pick a zone for the new memory we are
onlining.
Eventually, we will land on zone_for_pfn_range() which will pick the zone.

Two of these mechanisms are 'movable_node' and 'auto-movable' policy.
The former will put all hotplugged memory in ZONE_MOVABLE
(unless we can keep zones contiguous by not doing so), while the latter
will put it in ZONE_MOVABLE IFF we are within the established ratio
MOVABLE:KERNEL.

It seems the latter doesn't play well with CXL memory, where CXL cards hold really
large amounts of memory, making the ratio check fail. Since CXL cards must be removed
as a unit, removal can't be done if any memory block fell within a
!ZONE_MOVABLE zone.
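
To make the failure mode concrete, here is a minimal userspace sketch of the
kind of check the 'auto-movable' policy applies. It is a simplified model, not
the kernel implementation: the function name and parameters are made up for
illustration, and memory groups and per-node accounting are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model (assumption): new memory may go to ZONE_MOVABLE only
 * while MOVABLE pages, including the new range, stay within
 * ratio_percent of the KERNEL pages.
 */
static bool can_online_movable(unsigned long long kernel_pages,
                               unsigned long long movable_pages,
                               unsigned long long new_pages,
                               unsigned int ratio_percent)
{
    assert(new_pages > 0); /* onlining an empty range makes no sense */

    return movable_pages + new_pages <= kernel_pages * ratio_percent / 100;
}
```

With 16 GiB of kernel memory (4 KiB pages, so 16 << 18 pages) and a ratio of
301 (a value assumed here for illustration), a 32 GiB device stays under the
limit, while a 64 GiB device exceeds it and this model would refuse
ZONE_MOVABLE for it.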

One way to tackle this would be to update the ratio every time a new CXL
card gets inserted, but this seems suboptimal.
Another way is that, since CXL memory works with selfhosted memmap, we could relax
the check under the 'auto-movable' policy and only look at the ratio if we aren't
working with selfhosted memmap.

Something like the following (achtung: it's just a PoC).
Comments? Ideas?

 diff --git a/drivers/base/memory.c b/drivers/base/memory.c
 index 5c6c1d6bb59f..ff87cfb3881a 100644
 --- a/drivers/base/memory.c
 +++ b/drivers/base/memory.c
 @@ -234,7 +234,7 @@ static int memory_block_online(struct memory_block *mem)
  		return -EHWPOISON;
 
  	zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
 -				  start_pfn, nr_pages);
 +				  start_pfn, nr_pages, mem->altmap);
 
  	/*
  	 * Although vmemmap pages have a different lifecycle than the pages
 @@ -473,11 +473,11 @@ static ssize_t phys_device_show(struct device *dev,
  static int print_allowed_zone(char *buf, int len, int nid,
  			      struct memory_group *group,
  			      unsigned long start_pfn, unsigned long nr_pages,
 -			      int online_type, struct zone *default_zone)
 +			      int online_type, struct zone *default_zone, struct vmem_altmap *altmap)
  {
  	struct zone *zone;
 
 -	zone = zone_for_pfn_range(online_type, nid, group, start_pfn, nr_pages);
 +	zone = zone_for_pfn_range(online_type, nid, group, start_pfn, nr_pages, altmap);
  	if (zone == default_zone)
  		return 0;
 
 @@ -509,13 +509,13 @@ static ssize_t valid_zones_show(struct device *dev,
  	}
 
  	default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, group,
 -					  start_pfn, nr_pages);
 +					  start_pfn, nr_pages, mem->altmap);
 
  	len = sysfs_emit(buf, "%s", default_zone->name);
  	len += print_allowed_zone(buf, len, nid, group, start_pfn, nr_pages,
 -				  MMOP_ONLINE_KERNEL, default_zone);
 +				  MMOP_ONLINE_KERNEL, default_zone, mem->altmap);
  	len += print_allowed_zone(buf, len, nid, group, start_pfn, nr_pages,
 -				  MMOP_ONLINE_MOVABLE, default_zone);
 +				  MMOP_ONLINE_MOVABLE, default_zone, mem->altmap);
  	len += sysfs_emit_at(buf, len, "\n");
  	return len;
  }
 diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
 index 23f038a16231..89f7b9c5d995 100644
 --- a/include/linux/memory_hotplug.h
 +++ b/include/linux/memory_hotplug.h
 @@ -328,7 +328,7 @@ extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
  					  unsigned long pnum);
  extern struct zone *zone_for_pfn_range(int online_type, int nid,
  		struct memory_group *group, unsigned long start_pfn,
 -		unsigned long nr_pages);
 +		unsigned long nr_pages, struct vmem_altmap *altmap);
  extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
  				      struct mhp_params *params);
  void arch_remove_linear_mapping(u64 start, u64 size);
 diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
 index 69a636e20f7b..6c6600a9c839 100644
 --- a/mm/memory_hotplug.c
 +++ b/mm/memory_hotplug.c
 @@ -1048,7 +1048,7 @@ static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn
 
  struct zone *zone_for_pfn_range(int online_type, int nid,
  		struct memory_group *group, unsigned long start_pfn,
 -		unsigned long nr_pages)
 +		unsigned long nr_pages, struct vmem_altmap *altmap)
  {
  	if (online_type == MMOP_ONLINE_KERNEL)
  		return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
 @@ -1056,6 +1056,10 @@ struct zone *zone_for_pfn_range(int online_type, int nid,
  	if (online_type == MMOP_ONLINE_MOVABLE)
  		return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
 
 +	/* Selfhosted memmap, skip ratio check */
 +	if (online_policy == ONLINE_POLICY_AUTO_MOVABLE && altmap)
 +		return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
 +
  	if (online_policy == ONLINE_POLICY_AUTO_MOVABLE)
  		return auto_movable_zone_for_pfn(nid, group, start_pfn, nr_pages);

-- 
Oscar Salvador
SUSE Labs



* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  8:15 [RFC] Disable auto_movable_ratio for selfhosted memmap Oscar Salvador
@ 2025-07-28  8:44 ` David Hildenbrand
  2025-07-28  9:28   ` Hannes Reinecke
  2025-07-28  8:48 ` Michal Hocko
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-07-28  8:44 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: linux-mm, linux-kernel, Michal Hocko, Hannes Reinecke

On 28.07.25 10:15, Oscar Salvador wrote:
> Hi,

Hi,

> 
> Currently, we have several mechanisms to pick a zone for the new memory we are
> onlining.
> Eventually, we will land on zone_for_pfn_range() which will pick the zone.
> 
> Two of these mechanisms are 'movable_node' and 'auto-movable' policy.
> The former will put every single hotplugged memory in ZONE_MOVABLE
> (unless we can keep zones contiguous by not doing so), while the latter
> will put it in ZONE_MOVABLE IFF we are within the established ratio
> MOVABLE:KERNEL.

It's more complicated, because we have the concept of memory groups.

Dynamic memory groups allow for a mixture of MOVABLE vs. NORMAL within 
the group, static memory groups want a single type.

Hotplugging a large DIMM would online it either as MOVABLE or NORMAL. 
Similarly with CXL.

> 
> It seems, the latter doesn't play well with CXL memory where CXL cards hold really
> large amounts of memory, making the ratio fail, and since CXL cards must be removed
> as a unit, it can't be done if any memory block fell within
> !ZONE_MOVABLE zone.

So, user space configured a ratio and the kernel does exactly that: obey 
the ratio.

> 
> One way to tackle this would be update the ratio every time a new CXL
> card gets inserted, but this seems suboptimal.
> Another way is that since CXL memory works with selfhosted memmap, we could relax
> the check when 'auto-movable' and only look at the ratio if we aren't
> working with selfhosted memmap.

The memmap is only a small piece of unmovable data we require late at 
runtime (a bigger factor is user space page tables actually mapping that 
memory). The zone ratio we have configured in the kernel dates back to 
the highmem times, where such ratios were considered safe. Maybe there 
are better defaults for the ratios today, but it really depends on the 
workload.

One could find ways of subtracting the selfhosted part, to account it 
differently in the kernel, but the memmap is not the only consumer that 
affects the ratio.

I mean, the memmap is roughly 1.6%, I don't think that really makes a 
difference for you, does it? Can you share some real-life examples?
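
For reference, the 1.6% figure can be reproduced with a small sketch, assuming
a 64-byte struct page and 4 KiB base pages (common on x86-64, but both values
are configuration dependent). The helper name is hypothetical.

```c
#include <assert.h>

/*
 * Bytes of memmap needed to describe mem_bytes of memory, given the
 * base page size and the size of one struct page (both assumptions
 * supplied by the caller).
 */
static unsigned long long memmap_bytes(unsigned long long mem_bytes,
                                       unsigned long long page_size,
                                       unsigned long long struct_page_size)
{
    assert(page_size > 0);

    return mem_bytes / page_size * struct_page_size;
}
```

For 1 GiB of memory this gives 262144 pages * 64 bytes = 16 MiB of memmap,
i.e. 64 / 4096 = 1.5625% of the range.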


I have a colleague working on one of my old prototypes (memoryhotplugd) 
for replacing udev rules.

The idea there is, to detect that CXL memory is getting hotplugged and 
keep it offline. Because user space hotplugging that memory (daxctl) 
will explicitly online it to the proper zone.

Things like virtio-mem, DIMMs etc can happily use the auto-movable 
behavior. But the auto-movable behavior doesn't quite make sense if (a) 
you want everything movable and (b) daxctl already expects to online the 
memory itself, usually to ZONE_MOVABLE.

So I think this is mostly a user-space problem to solve.

-- 
Cheers,

David / dhildenb




* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  8:15 [RFC] Disable auto_movable_ratio for selfhosted memmap Oscar Salvador
  2025-07-28  8:44 ` David Hildenbrand
@ 2025-07-28  8:48 ` Michal Hocko
  2025-07-28  8:53   ` David Hildenbrand
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2025-07-28  8:48 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: david, linux-mm, linux-kernel, Hannes Reinecke

On Mon 28-07-25 10:15:47, Oscar Salvador wrote:
> Hi,
> 
> Currently, we have several mechanisms to pick a zone for the new memory we are
> onlining.
> Eventually, we will land on zone_for_pfn_range() which will pick the zone.
> 
> Two of these mechanisms are 'movable_node' and 'auto-movable' policy.
> The former will put every single hotplugged memory in ZONE_MOVABLE
> (unless we can keep zones contiguous by not doing so), while the latter
> will put it in ZONE_MOVABLE IFF we are within the established ratio
> MOVABLE:KERNEL.
> 
> It seems, the latter doesn't play well with CXL memory where CXL cards hold really
> large amounts of memory, making the ratio fail, and since CXL cards must be removed
> as a unit, it can't be done if any memory block fell within
> !ZONE_MOVABLE zone.

I suspect this is just an example of how our existing memory hotplug
interface based on memory blocks is suboptimal and doesn't fit
new use cases. We should start thinking about what a new v2 API should
look like. I am not sure what that should look like, but I believe we
should be able to express a "device" as a whole rather than having very
loosely bound generic memblocks. Anyway, this is likely for a longer
discussion and a long-term plan rather than addressing this particular
issue.
 
> One way to tackle this would be update the ratio every time a new CXL
> card gets inserted, but this seems suboptimal.

I do not think this is a usable interface.

> Another way is that since CXL memory works with selfhosted memmap, we could relax
> the check when 'auto-movable' and only look at the ratio if we aren't
> working with selfhosted memmap.

This is likely the only choice we have with the current interface. We
either need a way to disable the ratio altogether or make it more
automagic and treat self hosted memory differently because that memory
doesn't eat up ZONE_NORMAL memory and therefore cannot deplete it for
ZONE_MOVABLE.

Lowmem (ZONE_NORMAL) oom problems are still possible but kinda
unavoidable no matter what the hotplug interface is as the CXL usecase
really needs its memory to be movable to operate it as desired AFAIU.

> Something like the following (achtung: it's just a PoC)
> Comments? Ideas? 
> 
>  diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>  index 5c6c1d6bb59f..ff87cfb3881a 100644
>  --- a/drivers/base/memory.c
>  +++ b/drivers/base/memory.c
>  @@ -234,7 +234,7 @@ static int memory_block_online(struct memory_block *mem)
>   		return -EHWPOISON;
>  
>   	zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
>  -				  start_pfn, nr_pages);
>  +				  start_pfn, nr_pages, mem->altmap);

Shouldn't this be a more descriptive (enum like) argument?
ONLINE_MOVABLE, ONLINE_AUTO etc..
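
One possible reading of that suggestion, as a self-contained sketch.
Everything here (enum online_hint, enum zone_id, pick_zone()) is hypothetical
and only models the decision; it is not proposed kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the zones involved. */
enum zone_id { ZONE_ID_NORMAL, ZONE_ID_MOVABLE };

/* Hypothetical hint passed by the caller instead of an altmap pointer. */
enum online_hint {
    ONLINE_HINT_AUTO,    /* apply the auto-movable ratio check */
    ONLINE_HINT_MOVABLE, /* e.g. selfhosted memmap: skip the ratio check */
};

/* within_ratio is the outcome of the ratio check, computed elsewhere. */
static enum zone_id pick_zone(enum online_hint hint, bool within_ratio)
{
    assert(hint == ONLINE_HINT_AUTO || hint == ONLINE_HINT_MOVABLE);

    if (hint == ONLINE_HINT_MOVABLE)
        return ZONE_ID_MOVABLE;

    return within_ratio ? ZONE_ID_MOVABLE : ZONE_ID_NORMAL;
}
```

The point of the enum is that the call site states its intent, rather than the
zone selection inferring it from whether an altmap happens to be set.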

-- 
Michal Hocko
SUSE Labs



* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  8:48 ` Michal Hocko
@ 2025-07-28  8:53   ` David Hildenbrand
  2025-07-28  9:04     ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-07-28  8:53 UTC (permalink / raw)
  To: Michal Hocko, Oscar Salvador; +Cc: linux-mm, linux-kernel, Hannes Reinecke

On 28.07.25 10:48, Michal Hocko wrote:
> On Mon 28-07-25 10:15:47, Oscar Salvador wrote:
>> Hi,
>>
>> Currently, we have several mechanisms to pick a zone for the new memory we are
>> onlining.
>> Eventually, we will land on zone_for_pfn_range() which will pick the zone.
>>
>> Two of these mechanisms are 'movable_node' and 'auto-movable' policy.
>> The former will put every single hotplugged memory in ZONE_MOVABLE
>> (unless we can keep zones contiguous by not doing so), while the latter
>> will put it in ZONE_MOVABLE IFF we are within the established ratio
>> MOVABLE:KERNEL.
>>
>> It seems, the latter doesn't play well with CXL memory where CXL cards hold really
>> large amounts of memory, making the ratio fail, and since CXL cards must be removed
>> as a unit, it can't be done if any memory block fell within
>> !ZONE_MOVABLE zone.
> 
> I suspect this is just an example of how our existing memory hotplug
> interface based on memory blocks is just suboptimal and it doesn't fit
> new usecases. We should start thinking about how a new v2 api should
> look like. I am not sure how that should look like but I believe we
> should be able to express a "device" as whole rather than having a very
> loosely bound generic memblocks. Anyway this is likely for a longer
> discussion and a long term plan rather than addressing this particular
> issue.

We have that concept with memory groups in the kernel already.

In dax/kmem we register a static memory group. It will be considered one 
union.

>   
>> One way to tackle this would be update the ratio every time a new CXL
>> card gets inserted, but this seems suboptimal.
> 
> I do not think this is a usable interface.
> 
>> Another way is that since CXL memory works with selfhosted memmap, we could relax
>> the check when 'auto-movable' and only look at the ratio if we aren't
>> working with selfhosted memmap.
> 
> This is likely the only choice we have with the current interface. We
> either need a way to disable the ratio altogether or make it more
> automagic and treat self hosted memory differently because that memory
> doesn't eat up ZONE_NORMAL memory and therefore cannot deplete it for
> ZONE_MOVABLE.

daxctl wants to online memory itself. We want to keep that memory 
offline from a kernel perspective and let daxctl handle it in this case.

We have that problem in RHEL where we currently require user space to 
disable udev rules so daxctl "can win".

-- 
Cheers,

David / dhildenb




* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  8:53   ` David Hildenbrand
@ 2025-07-28  9:04     ` Michal Hocko
  2025-07-28  9:10       ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2025-07-28  9:04 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On Mon 28-07-25 10:53:08, David Hildenbrand wrote:
> On 28.07.25 10:48, Michal Hocko wrote:
> > On Mon 28-07-25 10:15:47, Oscar Salvador wrote:
> > > Hi,
> > > 
> > > Currently, we have several mechanisms to pick a zone for the new memory we are
> > > onlining.
> > > Eventually, we will land on zone_for_pfn_range() which will pick the zone.
> > > 
> > > Two of these mechanisms are 'movable_node' and 'auto-movable' policy.
> > > The former will put every single hotplugged memory in ZONE_MOVABLE
> > > (unless we can keep zones contiguous by not doing so), while the latter
> > > will put it in ZONE_MOVABLE IFF we are within the established ratio
> > > MOVABLE:KERNEL.
> > > 
> > > It seems, the latter doesn't play well with CXL memory where CXL cards hold really
> > > large amounts of memory, making the ratio fail, and since CXL cards must be removed
> > > as a unit, it can't be done if any memory block fell within
> > > !ZONE_MOVABLE zone.
> > 
> > I suspect this is just an example of how our existing memory hotplug
> > interface based on memory blocks is just suboptimal and it doesn't fit
> > new usecases. We should start thinking about how a new v2 api should
> > look like. I am not sure how that should look like but I believe we
> > should be able to express a "device" as whole rather than having a very
> > loosely bound generic memblocks. Anyway this is likely for a longer
> > discussion and a long term plan rather than addressing this particular
> > issue.
> 
> We have that concept with memory groups in the kernel already.

I must have missed that. I will have a look, thanks! Do we have any
documentation for that? Memory group is an overloaded term in the
kernel.

> In dax/kmem we register a static memory group. It will be considered one
> union.

But we still do export those memory blocks and let udev or whoever act
on those right? If that is the case then ....

[...] 

> daxctl wants to online memory itself. We want to keep that memory offline
> from a kernel perspective and let daxctl handle it in this case.
> 
> We have that problem in RHEL where we currently require user space to
> disable udev rules so daxctl "can win".

... this is the result. Those shouldn't really race. If udev is supposed
to see the device, then only in its entirety, so regular memory block
based onlining rules shouldn't even see that memory. Or am I completely
missing the picture?
-- 
Michal Hocko
SUSE Labs



* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  9:04     ` Michal Hocko
@ 2025-07-28  9:10       ` David Hildenbrand
  2025-07-28  9:37         ` Hannes Reinecke
  2025-07-28 12:17         ` Michal Hocko
  0 siblings, 2 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-07-28  9:10 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On 28.07.25 11:04, Michal Hocko wrote:
> On Mon 28-07-25 10:53:08, David Hildenbrand wrote:
>> On 28.07.25 10:48, Michal Hocko wrote:
>>> On Mon 28-07-25 10:15:47, Oscar Salvador wrote:
>>>> Hi,
>>>>
>>>> Currently, we have several mechanisms to pick a zone for the new memory we are
>>>> onlining.
>>>> Eventually, we will land on zone_for_pfn_range() which will pick the zone.
>>>>
>>>> Two of these mechanisms are 'movable_node' and 'auto-movable' policy.
>>>> The former will put every single hotplugged memory in ZONE_MOVABLE
>>>> (unless we can keep zones contiguous by not doing so), while the latter
>>>> will put it in ZONE_MOVABLE IFF we are within the established ratio
>>>> MOVABLE:KERNEL.
>>>>
>>>> It seems, the latter doesn't play well with CXL memory where CXL cards hold really
>>>> large amounts of memory, making the ratio fail, and since CXL cards must be removed
>>>> as a unit, it can't be done if any memory block fell within
>>>> !ZONE_MOVABLE zone.
>>>
>>> I suspect this is just an example of how our existing memory hotplug
>>> interface based on memory blocks is just suboptimal and it doesn't fit
>>> new usecases. We should start thinking about how a new v2 api should
>>> look like. I am not sure how that should look like but I believe we
>>> should be able to express a "device" as whole rather than having a very
>>> loosely bound generic memblocks. Anyway this is likely for a longer
>>> discussion and a long term plan rather than addressing this particular
>>> issue.
>>
>> We have that concept with memory groups in the kernel already.
> 
> I must have missed that. I will have a look, thanks! Do we have any
> documentation for that? Memory group is an overloaded term in the
> kernel.

It's an internal concept so far, the grouping is not exposed to user space.

We have kerneldoc for e.g., "struct memory_group". E.g., from there

"A memory group logically groups memory blocks; each memory block 
belongs to at most one memory group. A memory group corresponds to a 
memory device, such as a DIMM or a NUMA node, which spans multiple 
memory blocks and might even span multiple non-contiguous physical 
memory ranges."

> 
>> In dax/kmem we register a static memory group. It will be considered one
>> union.
> 
> But we still do export those memory blocks and let udev or whoever act
> on those right? If that is the case then ....

Yes.

> 
> [...]
> 
>> daxctl wants to online memory itself. We want to keep that memory offline
>> from a kernel perspective and let daxctl handle it in this case.
>>
>> We have that problem in RHEL where we currently require user space to
>> disable udev rules so daxctl "can win".
> 
> ... this is the result. Those shouldn't really race. If udev is supposed
> to see the device then only in its entirety so regular memory block
> based onlining rules shouldn't even see that memory. Or am I completely
> missing the picture?

We can't break user space, which relies on individual memory blocks.

So udev or $whatever will right now see individual memory blocks. We 
could export the group id to user space if that is of any help, but at 
least for daxctl purposes, it will be sufficient to identify "oh, this 
was added by dax/kmem" (which we can obtain from /proc/iomem) and say 
"okay, I'll let user-space deal with it."
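
A minimal sketch of that identification step, assuming dax/kmem ranges show up
in /proc/iomem under the resource name "System RAM (kmem)". The helper name is
hypothetical; a real tool would read the file line by line (with sufficient
privileges, since addresses are masked otherwise) and feed each line to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns true and fills in the range if the given /proc/iomem line
 * describes memory added by dax/kmem. The resource name matched here
 * is an assumption about how such ranges are labelled.
 */
static bool parse_kmem_range(const char *line,
                             unsigned long long *start,
                             unsigned long long *end)
{
    assert(line && start && end);

    if (!strstr(line, "System RAM (kmem)"))
        return false;

    /* Lines look like "  150000000-33fffffff : System RAM (kmem)". */
    return sscanf(line, " %llx-%llx", start, end) == 2;
}
```

A daemon or udev helper could compare a memory block's physical range against
the ranges found this way and leave matching blocks offline for daxctl.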

Having the whole thing exposed as a unit is not really solving any 
problems unless I am missing something important.

-- 
Cheers,

David / dhildenb




* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  8:44 ` David Hildenbrand
@ 2025-07-28  9:28   ` Hannes Reinecke
  2025-07-28  9:42     ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2025-07-28  9:28 UTC (permalink / raw)
  To: David Hildenbrand, Oscar Salvador
  Cc: linux-mm, linux-kernel, Michal Hocko, Hannes Reinecke

On 7/28/25 10:44, David Hildenbrand wrote:
> On 28.07.25 10:15, Oscar Salvador wrote:
>> Hi,
[ .. ]
>>
>> One way to tackle this would be update the ratio every time a new CXL
>> card gets inserted, but this seems suboptimal.
>> Another way is that since CXL memory works with selfhosted memmap, we 
>> could relax
>> the check when 'auto-movable' and only look at the ratio if we aren't
>> working with selfhosted memmap.
> 
> The memmap is only a small piece of unmovable data we require late at 
> runtime (a bigger factor is user space page tables actually mapping that 
> memory). The zone ratio we have configured in the kernel dates back to 
> the highmem times, where such ratios were considered safe. Maybe there 
> are better defaults for the ratios today, but it really depends on the 
> workload.
> 
Point is, the ratio is accounted for the _entire_ memory.
Which means that you have to _know_ how much memory you are going to
plug in prior to plugging that in.
So to make that correct one would need to update the ratio prior to
plugging in one module, check if that succeeded, update the ratio, plug
in the next module, check that, etc.

Really?

> One could find ways of subtracting the selfhosted part, to account it 
> differently in the kernel, but the memmap is not the only consumer that 
> affects the ratio.
> 
> I mean, the memmap is roughly 1.6%, I don't think that really makes a 
> difference for you, does it? Can you share some real-life examples?
> 
> 
> I have a colleague working on one of my old prototypes (memoryhotplugd) 
> for replacing udev rules.
> 
> The idea there is, to detect that CXL memory is getting hotplugged and 
> keep it offline. Because user space hotplugging that memory (daxctl) 
> will explicitly online it to the proper zone.
> 
> Things like virtio-mem, DIMMs etc can happily use the auto-movable 
> behavior. But the auto-movable behavior doesn't quite make sense if (a) 
> you want everything movable and (b) daxctl already expects to online the 
> memory itself, usually to ZONE_MOVABLE.
> 
> So I think this is mostly a user-space problem to solve.
> 
Hmm.
Yes, and no.

While CXL memory is hotpluggable (it's a PCI device, after all),
it won't be hotplugged on a regular basis.
So the current use-case I'm aware of is that the system will be
configured once, and then it will be expected to come up in the
very same state after reboot.
As such a daemon is a bit of an overkill, as the number of events
it would need to listen to is in the very low single-digit range.

Udev rules would work fine, though (in fact, we already have one ...)
so I'd be happy to keep CXL memory offline after boot / hotplug
and let udev / daxctl handle things.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  9:10       ` David Hildenbrand
@ 2025-07-28  9:37         ` Hannes Reinecke
  2025-07-28 13:06           ` Michal Hocko
  2025-07-28 15:15           ` David Hildenbrand
  2025-07-28 12:17         ` Michal Hocko
  1 sibling, 2 replies; 24+ messages in thread
From: Hannes Reinecke @ 2025-07-28  9:37 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On 7/28/25 11:10, David Hildenbrand wrote:
> On 28.07.25 11:04, Michal Hocko wrote:
>> On Mon 28-07-25 10:53:08, David Hildenbrand wrote:
>>> On 28.07.25 10:48, Michal Hocko wrote:
>>>> On Mon 28-07-25 10:15:47, Oscar Salvador wrote:
>>>>> Hi,
>>>>>
>>>>> Currently, we have several mechanisms to pick a zone for the new 
>>>>> memory we are
>>>>> onlining.
>>>>> Eventually, we will land on zone_for_pfn_range() which will pick 
>>>>> the zone.
>>>>>
>>>>> Two of these mechanisms are 'movable_node' and 'auto-movable' policy.
>>>>> The former will put every single hotplugged memory in ZONE_MOVABLE
>>>>> (unless we can keep zones contiguous by not doing so), while the 
>>>>> latter
>>>>> will put it in ZONE_MOVABLE IFF we are within the established ratio
>>>>> MOVABLE:KERNEL.
>>>>>
>>>>> It seems, the latter doesn't play well with CXL memory where CXL 
>>>>> cards hold really
>>>>> large amounts of memory, making the ratio fail, and since CXL cards 
>>>>> must be removed
>>>>> as a unit, it can't be done if any memory block fell within
>>>>> !ZONE_MOVABLE zone.
>>>>
>>>> I suspect this is just an example of how our existing memory hotplug
>>>> interface based on memory blocks is just suboptimal and it doesn't fit
>>>> new usecases. We should start thinking about how a new v2 api should
>>>> look like. I am not sure how that should look like but I believe we
>>>> should be able to express a "device" as whole rather than having a very
>>>> loosely bound generic memblocks. Anyway this is likely for a longer
>>>> discussion and a long term plan rather than addressing this particular
>>>> issue.
>>>
>>> We have that concept with memory groups in the kernel already.
>>
>> I must have missed that. I will have a look, thanks! Do we have any
>> documentation for that? Memory group is an overloaded term in the
>> kernel.
> 
> It's an internal concept so far, the grouping is not exposed to user space.
> 
> We have kerneldoc for e.g., "struct memory_group". E.g., from there
> 
> "A memory group logically groups memory blocks; each memory block 
> belongs to at most one memory group. A memory group corresponds to a 
> memory device, such as a DIMM or a NUMA node, which spans multiple 
> memory blocks and might even span multiple non-contiguous physical 
> memory ranges."
> 
>>
>>> In dax/kmem we register a static memory group. It will be considered one
>>> union.
>>
>> But we still do export those memory blocks and let udev or whoever act
>> on those right? If that is the case then ....
> 
> Yes.
> 
>>
>> [...]
>>
>>> daxctl wants to online memory itself. We want to keep that memory 
>>> offline
>>> from a kernel perspective and let daxctl handle it in this case.
>>>
>>> We have that problem in RHEL where we currently require user space to
>>> disable udev rules so daxctl "can win".
>>
>> ... this is the result. Those shouldn't really race. If udev is supposed
>> to see the device then only in its entirety so regular memory block
>> based onlining rules shouldn't even see that memory. Or am I completely
>> missing the picture?
> 
> We can't break user space, which relies on individual memory blocks.
> 
> So udev or $whatever will right now see individual memory blocks. We 
> could export the group id to user space if that is of any help, but at 
> least for daxctl purposes, it will be sufficient to identify "oh, this 
> was added by dax/kmem" (which we can obtain from /proc/iomem) and say 
> "okay, I'll let user-space deal with it."
> 
> Having the whole thing exposed as a unit is not really solving any 
> problems unless I am missing something important.
> 
Basically it boils down to:
Who should be responsible for onlining the memory?

As it stands we have two methods:
- user-space as per sysfs attributes
- kernel policy

And to make matters worse, we have two competing user-space programs:
- udev
- daxctl
neither of which is (or can be made) aware of each other.
This leads to races and/or inconsistencies.

As we've seen, the current kernel policy (cf. the 'ratio' discussion)
doesn't really fit how users expect CXL to work, so one is tempted not
to have the kernel do the onlining. But then the user is caught
in the udev vs. daxctl race, requiring awkward kludges on either side.

Can't we make daxctl aware of udev? I.e., update daxctl to call out to
udev and just wait for udev to complete its thing?
At worst we're running into a timeout if some udev rules are garbage,
but daxctl will be able to see the final state and we would avoid
the need for modifying and/or moving udev rules.
(Which, incidentally, is required on SLES, too :-)

Discussion point for LPC?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  9:28   ` Hannes Reinecke
@ 2025-07-28  9:42     ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-07-28  9:42 UTC (permalink / raw)
  To: Hannes Reinecke, Oscar Salvador
  Cc: linux-mm, linux-kernel, Michal Hocko, Hannes Reinecke

On 28.07.25 11:28, Hannes Reinecke wrote:
> On 7/28/25 10:44, David Hildenbrand wrote:
>> On 28.07.25 10:15, Oscar Salvador wrote:
>>> Hi,
> [ .. ]
>>>
>>> One way to tackle this would be update the ratio every time a new CXL
>>> card gets inserted, but this seems suboptimal.
>>> Another way is that since CXL memory works with selfhosted memmap, we
>>> could relax
>>> the check when 'auto-movable' and only look at the ratio if we aren't
>>> working with selfhosted memmap.
>>
>> The memmap is only a small piece of unmovable data we require late at
>> runtime (a bigger factor is user space page tables actually mapping that
>> memory). The zone ratio we have configured in the kernel dates back to
>> the highmem times, where such ratios were considered safe. Maybe there
>> are better defaults for the ratios today, but it really depends on the
>> workload.
>>
> Point is, the ratio is accounted for the _entire_ memory.
> Which means that you have to _know_ how much memory you are going to
> plug in prior to plugging that in.
> So to make that correct one would need to update the ratio prior to
> plug in one module, check if that succeeded, update the ratio, plug
> in the next module, check that, etc.

I am confused. We know how big a DIMM is at the time we plug it. I 
assume you talk about CXL?

Can you describe what that workflow would look like with tools like daxctl?

(what is a "module"? A DIMM?)

> 
>> One could find ways of subtracting the selfhosted part, to account it
>> differently in the kernel, but the memmap is not the only consumer that
>> affects the ratio.
>>
>> I mean, the memmap is roughly 1.6%, I don't think that really makes a
>> difference for you, does it? Can you share some real-life examples?
>>
>>
>> I have a colleague working on one of my old prototypes (memoryhotplugd)
>> for replacing udev rules.
>>
>> The idea there is, to detect that CXL memory is getting hotplugged and
>> keep it offline. Because user space hotplugging that memory (daxctl)
>> will explicitly online it to the proper zone.
>>
>> Things like virtio-mem, DIMMs etc can happily use the auto-movable
>> behavior. But the auto-movable behavior doesn't quite make sense if (a)
>> you want everything movable and (b) daxctl already expects to online the
>> memory itself, usually to ZONE_MOVABLE.
>>
>> So I think this is mostly a user-space problem to solve.
>>
> Hmm.
> Yes, and no.
> 
> While CXL memory is hotpluggable (it's a PCI device, after all),
> it won't be hotplugged on a regular basis.

I've been told that with dynamic memory pooling it is supposed to get 
much more dynamic.

> So the current use-case I'm aware of is that the system will be
> configured once, and then it will be expected to come up in the
> very same state after reboot.
> As such a daemon is a bit of an overkill, as the number of events
> it would need to listen to is in the very low single-digit range.

I am mostly concerned with all the use cases that existed before CXL (in 
particular, virtio-mem, standby memory on s390x, DIMMs) where you see 
memory hotplug way more frequently and also would want to deal with 
things such as memory onlining failing in some environments more 
gracefully (e.g., retry).

What I realized is that
(1) udev rules are not a good fit for all use cases
(2) auto-onlining in the kernel is not a good fit for all use cases

The goal of the daemon will be to configure auto-onlining in the kernel 
where possible (e.g., only virtio-mem, only CXL), but fall back to manual 
onlining in case mixtures might be possible (CXL and virtio-mem etc). I 
expect the latter to be rare, but sometimes we can't make a fully 
reliable decision of what might get hotplugged in the future ...

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  9:10       ` David Hildenbrand
  2025-07-28  9:37         ` Hannes Reinecke
@ 2025-07-28 12:17         ` Michal Hocko
  2025-07-28 12:27           ` David Hildenbrand
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2025-07-28 12:17 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On Mon 28-07-25 11:10:44, David Hildenbrand wrote:
> On 28.07.25 11:04, Michal Hocko wrote:
> > On Mon 28-07-25 10:53:08, David Hildenbrand wrote:
[...]
> > > daxctl wants to online memory itself. We want to keep that memory offline
> > > from a kernel perspective and let daxctl handle it in this case.
> > > 
> > > We have that problem in RHEL where we currently require user space to
> > > disable udev rules so daxctl "can win".
> > 
> > ... this is the result. Those shouldn't really race. If udev is supposed
> > to see the device then only in its entirety so regular memory block
> > based onlining rules shouldn't even see that memory. Or am I completely
> > missing the picture?
> 
> We can't break user space, which relies on individual memory blocks.

We do have userspace which onlines specific memory blocks and we cannot
break that. But do we have any userspace that wants to online CXL like
memory (or in general dax like memory) that would need to operate on
those memory blocks with that kind of granularity?

In other words what would break if we didn't expose CXL memory through
memory blocks in sysfs?

> So udev or $whatever will right now see individual memory blocks. We could
> export the group id to user space if that is of any help, but at least for
> daxctl purposes, it will be sufficient to identify "oh, this was added by
> dax/kmem" (which we can obtain from /proc/iomem) and say "okay, I'll let
> user-space deal with it."
> 
> Having the whole thing exposed as a unit is not really solving any problems
> unless I am missing something important.

If we need to handle that thing as a whole we should have an interface
that allows for that. A per-block breakdown doesn't really help anything.
It just makes the whole problem much more complex.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28 12:17         ` Michal Hocko
@ 2025-07-28 12:27           ` David Hildenbrand
  2025-07-28 12:27             ` David Hildenbrand
  2025-07-28 12:54             ` Michal Hocko
  0 siblings, 2 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-07-28 12:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On 28.07.25 14:17, Michal Hocko wrote:
> On Mon 28-07-25 11:10:44, David Hildenbrand wrote:
>> On 28.07.25 11:04, Michal Hocko wrote:
>>> On Mon 28-07-25 10:53:08, David Hildenbrand wrote:
> [...]
>>>> daxctl wants to online memory itself. We want to keep that memory offline
>>>> from a kernel perspective and let daxctl handle it in this case.
>>>>
>>>> We have that problem in RHEL where we currently require user space to
>>>> disable udev rules so daxctl "can win".
>>>
>>> ... this is the result. Those shouldn't really race. If udev is supposed
>>> to see the device then only in its entirety so regular memory block
>>> based onlining rules shouldn't even see that memory. Or am I completely
>>> missing the picture?
>>
>> We can't break user space, which relies on individual memory blocks.
> 
> We do have userspace which onlines specific memory blocks and we cannot
> break that. But do we have any userspace that wants to online CXL like
> memory (or in general dax like memory) that would need to operate on
> those memory blocks with that kind of granularity?

I'm afraid that ship has sailed.

> 
> In other words what would break if we didn't expose CXL memory through
> memory blocks in sysfs?

I think the whole libdaxctl handling for onlining memory is based on that.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28 12:27           ` David Hildenbrand
@ 2025-07-28 12:27             ` David Hildenbrand
  2025-07-28 13:00               ` Michal Hocko
  2025-07-28 12:54             ` Michal Hocko
  1 sibling, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-07-28 12:27 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On 28.07.25 14:27, David Hildenbrand wrote:
> On 28.07.25 14:17, Michal Hocko wrote:
>> On Mon 28-07-25 11:10:44, David Hildenbrand wrote:
>>> On 28.07.25 11:04, Michal Hocko wrote:
>>>> On Mon 28-07-25 10:53:08, David Hildenbrand wrote:
>> [...]
>>>>> daxctl wants to online memory itself. We want to keep that memory offline
>>>>> from a kernel perspective and let daxctl handle it in this case.
>>>>>
>>>>> We have that problem in RHEL where we currently require user space to
>>>>> disable udev rules so daxctl "can win".
>>>>
>>>> ... this is the result. Those shouldn't really race. If udev is supposed
>>>> to see the device then only in its entirety so regular memory block
>>>> based onlining rules shouldn't even see that memory. Or am I completely
>>>> missing the picture?
>>>
>>> We can't break user space, which relies on individual memory blocks.
>>
>> We do have userspace which onlines specific memory blocks and we cannot
>> break that. But do we have any userspace that wants to online CXL like
>> memory (or in general dax like memory) that would need to operate on
>> those memory blocks with that kind of granularity?
> 
> I'm afraid that ship has sailed.
> 
>>
>> In other words what would break if we didn't expose CXL memory through
>> memory blocks in sysfs?
> 
> I think the whole libdaxctl handling for onlining memory is based on that.
> 

Sorry, forgot to add a pointer:

https://github.com/pmem/ndctl/blob/main/daxctl/lib/libdaxctl.c

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28 12:27           ` David Hildenbrand
  2025-07-28 12:27             ` David Hildenbrand
@ 2025-07-28 12:54             ` Michal Hocko
  1 sibling, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2025-07-28 12:54 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On Mon 28-07-25 14:27:01, David Hildenbrand wrote:
> On 28.07.25 14:17, Michal Hocko wrote:
[...]
> > In other words what would break if we didn't expose CXL memory through
> > memory blocks in sysfs?
> 
> I think the whole libdaxctl handling for onlining memory is based on that.

I am not familiar with libdaxctl so bear with me. What exactly from the
existing sysfs interface does it need?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28 12:27             ` David Hildenbrand
@ 2025-07-28 13:00               ` Michal Hocko
  2025-07-28 13:03                 ` David Hildenbrand
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2025-07-28 13:00 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On Mon 28-07-25 14:27:29, David Hildenbrand wrote:
[...]
> > I think the whole libdaxctl handling for onlining memory is based on that.
> > 
> 
> Sorry, forgot to add a pointer:
> 
> https://github.com/pmem/ndctl/blob/main/daxctl/lib/libdaxctl.c

Thanks for the pointer! I will have a look.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28 13:00               ` Michal Hocko
@ 2025-07-28 13:03                 ` David Hildenbrand
  0 siblings, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-07-28 13:03 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On 28.07.25 15:00, Michal Hocko wrote:
> On Mon 28-07-25 14:27:29, David Hildenbrand wrote:
> [...]
>>> I think the whole libdaxctl handling for onlining memory is based on that.
>>>
>>
>> Sorry, forgot to add a pointer:
>>
>> https://github.com/pmem/ndctl/blob/main/daxctl/lib/libdaxctl.c
> 
> Thanks for the pointer! I will have a look.

In particular daxctl_memory_op(), used to implement stuff like

daxctl online-memory
daxctl offline-memory
daxctl reconfigure-device (--no-online, --no-movable)


I am not sure if they also offline memory automatically before disabling 
a device.

But in essence, they want to do everything as automatically as possible.
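
To illustrate what operating on individual memory blocks through sysfs
means, here is a rough sketch. It is not libdaxctl's code and makes no
claim about what daxctl_memory_op() does internally; it only assumes the
documented sysfs layout (block_size_bytes reported in hex, one memoryN
directory per block with a "state" file). blocks_for_range() and
online_range() are hypothetical helper names.

```python
from pathlib import Path

MEMORY_SYSFS = Path("/sys/devices/system/memory")


def read_block_size(sysfs=MEMORY_SYSFS):
    # block_size_bytes is a hexadecimal number without a "0x" prefix.
    return int((Path(sysfs) / "block_size_bytes").read_text().strip(), 16)


def blocks_for_range(start, end, block_size):
    """Memory block ids covering the inclusive physical range [start, end]."""
    return list(range(start // block_size, end // block_size + 1))


def online_range(start, end, state="online_movable", sysfs=MEMORY_SYSFS):
    """Online every still-offline block in [start, end] with the given state."""
    block_size = read_block_size(sysfs)
    for block in blocks_for_range(start, end, block_size):
        state_file = Path(sysfs) / f"memory{block}" / "state"
        if state_file.read_text().strip() == "offline":
            state_file.write_text(state)
```

Writing to the real state files requires root, and onlining can fail
per block, so a real tool would have to check each write and report
partial results.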

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  9:37         ` Hannes Reinecke
@ 2025-07-28 13:06           ` Michal Hocko
  2025-07-28 13:08             ` David Hildenbrand
  2025-07-28 15:15           ` David Hildenbrand
  1 sibling, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2025-07-28 13:06 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: David Hildenbrand, Oscar Salvador, linux-mm, linux-kernel,
	Hannes Reinecke

On Mon 28-07-25 11:37:46, Hannes Reinecke wrote:
> On 7/28/25 11:10, David Hildenbrand wrote:
> And to make matters worse, we have two competing user-space programs:
> - udev
> - daxctl
> neither of which is (or can be made) aware of each other.
> This leads to races and/or inconsistencies.

Would it help if the generic udev memory hotplug rule excluded anything that
is dax backed? Is there a way to check for that? Sorry if this is a
stupid question.

To me it sounds like daxctl should be the one to online the memory
exclusively and udev should just care about regular memory.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28 13:06           ` Michal Hocko
@ 2025-07-28 13:08             ` David Hildenbrand
  2025-07-29  7:24               ` Hannes Reinecke
  0 siblings, 1 reply; 24+ messages in thread
From: David Hildenbrand @ 2025-07-28 13:08 UTC (permalink / raw)
  To: Michal Hocko, Hannes Reinecke
  Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On 28.07.25 15:06, Michal Hocko wrote:
> On Mon 28-07-25 11:37:46, Hannes Reinecke wrote:
>> On 7/28/25 11:10, David Hildenbrand wrote:
>> And to make matters worse, we have two competing user-space programs:
>> - udev
>> - daxctl
>> neither of which is (or can be made) aware of each other.
>> This leads to races and/or inconsistencies.
> 
> Would it help if generic udev memory hotplug rule exclude anything that
> is dax backed? Is there a way to check for that? Sorry if this is a
> stupid question.
Parsing /proc/iomem, it's indicated as "System RAM (kmem)".
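
A minimal sketch of such a check, assuming the usual
"start-end : name" line format of /proc/iomem with nested entries
indented. kmem_ranges() and read_kmem_ranges() are made-up helper names,
not part of any existing tool. Note that unprivileged readers see the
addresses zeroed out, so this would have to run as root to get usable
ranges.

```python
import re

KMEM_NAME = "System RAM (kmem)"
# Example line: "  140000000-1bfffffff : System RAM (kmem)"
_IOMEM_LINE = re.compile(r"^\s*([0-9a-f]+)-([0-9a-f]+) : (.*)$")


def kmem_ranges(iomem_text):
    """Return (start, end) tuples for every dax/kmem System RAM entry."""
    ranges = []
    for line in iomem_text.splitlines():
        match = _IOMEM_LINE.match(line)
        if match and match.group(3).strip() == KMEM_NAME:
            ranges.append((int(match.group(1), 16), int(match.group(2), 16)))
    return ranges


def read_kmem_ranges(path="/proc/iomem"):
    with open(path) as iomem:
        return kmem_ranges(iomem.read())
```

A udev helper or daemon could compare a memory block's physical range
against these ranges and leave matching blocks alone.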

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28  9:37         ` Hannes Reinecke
  2025-07-28 13:06           ` Michal Hocko
@ 2025-07-28 15:15           ` David Hildenbrand
  1 sibling, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-07-28 15:15 UTC (permalink / raw)
  To: Hannes Reinecke, Michal Hocko
  Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On 28.07.25 11:37, Hannes Reinecke wrote:
> On 7/28/25 11:10, David Hildenbrand wrote:
>> On 28.07.25 11:04, Michal Hocko wrote:
>>> On Mon 28-07-25 10:53:08, David Hildenbrand wrote:
>>>> On 28.07.25 10:48, Michal Hocko wrote:
>>>>> On Mon 28-07-25 10:15:47, Oscar Salvador wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Currently, we have several mechanisms to pick a zone for the new
>>>>>> memory we are
>>>>>> onlining.
>>>>>> Eventually, we will land on zone_for_pfn_range() which will pick
>>>>>> the zone.
>>>>>>
>>>>>> Two of these mechanisms are 'movable_node' and 'auto-movable' policy.
>>>>>> The former will put every single hotplugged memory in ZONE_MOVABLE
>>>>>> (unless we can keep zones contiguous by not doing so), while the
>>>>>> latter
>>>>>> will put it in ZONE_MOVABLE IFF we are within the established ratio
>>>>>> MOVABLE:KERNEL.
>>>>>>
>>>>>> It seems, the latter doesn't play well with CXL memory where CXL
>>>>>> cards hold really
>>>>>> large amounts of memory, making the ratio fail, and since CXL cards
>>>>>> must be removed
>>>>>> as a unit, it can't be done if any memory block fell within
>>>>>> !ZONE_MOVABLE zone.
>>>>>
>>>>> I suspect this is just an example of how our existing memory hotplug
>>>>> interface based on memory blocks is just suboptimal and it doesn't fit
>>>>> new usecases. We should start thinking about how a new v2 api should
>>>>> look like. I am not sure how that should look like but I believe we
>>>>> should be able to express a "device" as whole rather than having a very
>>>>> loosely bound generic memblocks. Anyway this is likely for a longer
>>>>> discussion and a long term plan rather than addressing this particular
>>>>> issue.
>>>>
>>>> We have that concept with memory groups in the kernel already.
>>>
>>> I must have missed that. I will have a look, thanks! Do we have any
>>> documentation for that? Memory group is an overloaded term in the
>>> kernel.
>>
>> It's an internal concept so far, the grouping is not exposed to user space.
>>
>> We have kerneldoc for e.g., "struct memory_group". E.g., from there
>>
>> "A memory group logically groups memory blocks; each memory block
>> belongs to at most one memory group. A memory group corresponds to a
>> memory device, such as a DIMM or a NUMA node, which spans multiple
>> memory blocks and might even span multiple non-contiguous physical
>> memory ranges."
>>
>>>
>>>> In dax/kmem we register a static memory group. It will be considered one
>>>> union.
>>>
>>> But we still do export those memory blocks and let udev or whoever act
>>> on those right? If that is the case then ....
>>
>> Yes.
>>
>>>
>>> [...]
>>>
>>>> daxctl wants to online memory itself. We want to keep that memory
>>>> offline
>>>> from a kernel perspective and let daxctl handle it in this case.
>>>>
>>>> We have that problem in RHEL where we currently require user space to
>>>> disable udev rules so daxctl "can win".
>>>
>>> ... this is the result. Those shouldn't really race. If udev is supposed
>>> to see the device then only in its entirety so regular memory block
>>> based onlining rules shouldn't even see that memory. Or am I completely
>>> missing the picture?
>>
>> We can't break user space, which relies on individual memory blocks.
>>
>> So udev or $whatever will right now see individual memory blocks. We
>> could export the group id to user space if that is of any help, but at
>> least for daxctl purposes, it will be sufficient to identify "oh, this
>> was added by dax/kmem" (which we can obtain from /proc/iomem) and say
>> "okay, I'll let user-space deal with it."
>>
>> Having the whole thing exposed as a unit is not really solving any
>> problems unless I am missing something important.
>>
> Basically it boils down to:
> Who should be responsible for onlining the memory?
> 
> As it stands we have two methods:
> - user-space as per sysfs attributes
> - kernel policy
> 
> And to make matters worse, we have two competing user-space programs:
> - udev
> - daxctl
> neither of which is (or can be made) aware of each other.
> This leads to races and/or inconsistencies.
> 
> As we've seen the current kernel policy (cf the 'ratio' discussion)
> doesn't really fit how users expect CXL to work, so one is tempted to
> not having the kernel to do the onlining. But then the user is caught
> in the udev vs daxctl race, requiring awkward cludges on either side.
> 
> Can't we make daxctl aware of udev? IE updating daxctl call out to
> udev and just wait for udev to complete its thing?
> At worst we're running into a timeout if some udev rules are garbage,
> but daxctl will be able to see the final state and we would avoid
> the need for modifying and/or moving udev rules.
> (Which, incidentally, is required on SLES, too :-)

I will try moving away from udev for memory onlining completely in RHEL 
-- let's see if I will succeed ;) .

We really want to make use of auto-onlining in the kernel where 
possible, and do it manually in user space only in a handful of cases 
(e.g., CXL, standby memory on s390x). Configuring auto-onlining still 
needs to be done manually by the admin, and that's really the nasty bit.


> 
> Discussion point for LPC?

Yes, probably.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-28 13:08             ` David Hildenbrand
@ 2025-07-29  7:24               ` Hannes Reinecke
  2025-07-29  9:19                 ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2025-07-29  7:24 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On 7/28/25 15:08, David Hildenbrand wrote:
> On 28.07.25 15:06, Michal Hocko wrote:
>> On Mon 28-07-25 11:37:46, Hannes Reinecke wrote:
>>> On 7/28/25 11:10, David Hildenbrand wrote:
>>> And to make matters worse, we have two competing user-space programs:
>>> - udev
>>> - daxctl
>>> neither of which is (or can be made) aware of each other.
>>> This leads to races and/or inconsistencies.
>>
>> Would it help if generic udev memory hotplug rule exclude anything that
>> is dax backed? Is there a way to check for that? Sorry if this is a
>> stupid question.
> Parsing /proc/iomem, it's indicated as "System RAM (kmem)".
> 
I would rather do it the other way round, and make daxctl aware of
udev. In the end, even 'daxctl' uses the sysfs interface to online
memory, which really is the territory of udev and can easily be
done via udev rules (for static configuration).

Note, we do a similar thing on s/390; the configuration tool there
just spits out udev rules.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-29  7:24               ` Hannes Reinecke
@ 2025-07-29  9:19                 ` Michal Hocko
  2025-07-29  9:29                   ` David Hildenbrand
  2025-07-29  9:33                   ` Hannes Reinecke
  0 siblings, 2 replies; 24+ messages in thread
From: Michal Hocko @ 2025-07-29  9:19 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: David Hildenbrand, Oscar Salvador, linux-mm, linux-kernel,
	Hannes Reinecke

On Tue 29-07-25 09:24:37, Hannes Reinecke wrote:
> On 7/28/25 15:08, David Hildenbrand wrote:
> > On 28.07.25 15:06, Michal Hocko wrote:
> > > On Mon 28-07-25 11:37:46, Hannes Reinecke wrote:
> > > > On 7/28/25 11:10, David Hildenbrand wrote:
> > > > And to make matters worse, we have two competing user-space programs:
> > > > - udev
> > > > - daxctl
> > > > neither of which is (or can be made) aware of each other.
> > > > This leads to races and/or inconsistencies.
> > > 
> > > Would it help if generic udev memory hotplug rule exclude anything that
> > > is dax backed? Is there a way to check for that? Sorry if this is a
> > > stupid question.
> > Parsing /proc/iomem, it's indicated as "System RAM (kmem)".
> > 
> I would rather do it the other way round, and make daxctl aware of
> udev. In the end, even 'daxctl' uses the sysfs interface to online
> memory, which really is the territory of udev and can easily be
> done via udev rules (for static configuration).

udev doesn't really have any context about what user space wants to do with
the memory and therefore how to online it. Therefore we have (arguably)
ugly hacks like auto onlining and movable_ratio etc. daxctl can take
information from the admin directly and therefore it can do what is
needed without further hacks.

> Note, we do a similar thing on s/390; the configuration tool there
> just spits out udev rules.

Those were easy times when you just needed to online memory without any
further requirements on where it should land.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-29  9:19                 ` Michal Hocko
@ 2025-07-29  9:29                   ` David Hildenbrand
  2025-07-29  9:33                   ` Hannes Reinecke
  1 sibling, 0 replies; 24+ messages in thread
From: David Hildenbrand @ 2025-07-29  9:29 UTC (permalink / raw)
  To: Michal Hocko, Hannes Reinecke
  Cc: Oscar Salvador, linux-mm, linux-kernel, Hannes Reinecke

On 29.07.25 11:19, Michal Hocko wrote:
> On Tue 29-07-25 09:24:37, Hannes Reinecke wrote:
>> On 7/28/25 15:08, David Hildenbrand wrote:
>>> On 28.07.25 15:06, Michal Hocko wrote:
>>>> On Mon 28-07-25 11:37:46, Hannes Reinecke wrote:
>>>>> On 7/28/25 11:10, David Hildenbrand wrote:
>>>>> And to make matters worse, we have two competing user-space programs:
>>>>> - udev
>>>>> - daxctl
>>>>> neither of which is (or can be made) aware of each other.
>>>>> This leads to races and/or inconsistencies.
>>>>
>>>> Would it help if generic udev memory hotplug rule exclude anything that
>>>> is dax backed? Is there a way to check for that? Sorry if this is a
>>>> stupid question.
>>> Parsing /proc/iomem, it's indicated as "System RAM (kmem)".
>>>
>> I would rather do it the other way round, and make daxctl aware of
>> udev. In the end, even 'daxctl' uses the sysfs interface to online
>> memory, which really is the territory of udev and can easily be
>> done via udev rules (for static configuration).
> 
> udev doesn't really have any context what user space wants to do with
> the memory and therefore how to online it. Therefore we have (arguably)
> ugly hacks like auto onlining and movable_ratio etc. daxctl can take
> information from the admin directly and therefore it can do what is
> needed without further hacks.

Really the only difference between daxctl and everything else is the way 
the memory is added.

daxctl triggers hotplug of memory synchronously, everything else is 
asynchronous.

On most systems, the admin (the same one that triggers onlining) could 
just set the auto-onlining policy accordingly instead of manually 
onlining memory blocks from user space.

> 
>> Note, we do a similar thing on s/390; the configuration tool there
>> just spits out udev rules.
> 
> Those were easy times when you just need to online memory without any
> more requirements where it should land.

Again, I don't think udev is the future for that.

What I think we (Red Hat) want is a better and easier way to configure 
the kernel policy.

If you want to control onlining manually, then disable the auto-online 
policy.
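
As a sketch of what that looks like from user space, assuming the
global auto_online_blocks knob under /sys/devices/system/memory and its
documented values. set_auto_online() and get_auto_online() are
hypothetical helpers, not an existing interface.

```python
from pathlib import Path

AUTO_ONLINE = Path("/sys/devices/system/memory/auto_online_blocks")
VALID_POLICIES = ("offline", "online", "online_kernel", "online_movable")


def get_auto_online(path=AUTO_ONLINE):
    """Return the current kernel auto-online policy as a string."""
    return Path(path).read_text().strip()


def set_auto_online(policy, path=AUTO_ONLINE):
    """Set the kernel auto-online policy; "offline" leaves onlining to
    user space."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown auto-online policy: {policy!r}")
    Path(path).write_text(policy)
```

Calling set_auto_online("offline") as root would keep newly added
memory blocks offline until something in user space onlines them
explicitly.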

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-29  9:19                 ` Michal Hocko
  2025-07-29  9:29                   ` David Hildenbrand
@ 2025-07-29  9:33                   ` Hannes Reinecke
  2025-07-29 11:58                     ` Michal Hocko
  1 sibling, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2025-07-29  9:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Oscar Salvador, linux-mm, linux-kernel,
	Hannes Reinecke

On 7/29/25 11:19, Michal Hocko wrote:
> On Tue 29-07-25 09:24:37, Hannes Reinecke wrote:
>> On 7/28/25 15:08, David Hildenbrand wrote:
>>> On 28.07.25 15:06, Michal Hocko wrote:
>>>> On Mon 28-07-25 11:37:46, Hannes Reinecke wrote:
>>>>> On 7/28/25 11:10, David Hildenbrand wrote:
>>>>> And to make matters worse, we have two competing user-space programs:
>>>>> - udev
>>>>> - daxctl
>>>>> neither of which is (or can be made) aware of each other.
>>>>> This leads to races and/or inconsistencies.
>>>>
>>>> Would it help if generic udev memory hotplug rule exclude anything that
>>>> is dax backed? Is there a way to check for that? Sorry if this is a
>>>> stupid question.
>>> Parsing /proc/iomem, it's indicated as "System RAM (kmem)".
>>>
>> I would rather do it the other way round, and make daxctl aware of
>> udev. In the end, even 'daxctl' uses the sysfs interface to online
>> memory, which really is the territory of udev and can easily be
>> done via udev rules (for static configuration).
> 
> udev doesn't really have any context what user space wants to do with
> the memory and therefore how to online it. Therefore we have (arguably)
> ugly hacks like auto onlining and movable_ratio etc. daxctl can take
> information from the admin directly and therefore it can do what is
> needed without further hacks.
> 
Huh?
I thought udev was _all_ about userspace preferences...
We can easily have udev rules onlining memory with whatever policy
the user wants; the whole point of udev rules is that they are dynamic
and can include policy decisions.

>> Note, we do a similar thing on s/390; the configuration tool there
>> just spits out udev rules.
> 
> Those were easy times when you just need to online memory without any
> more requirements where it should land.

Sorry, I don't get that.
udev rules can easily parse any user-space policy, and you can have a
policy as detailed as you want.
And each installation can have its own udev rules.
Why wouldn't that work?

(Excluding main memory, obviously. We need memory to execute userspace
processes after all).

I do think we're talking past each other...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-29  9:33                   ` Hannes Reinecke
@ 2025-07-29 11:58                     ` Michal Hocko
  2025-07-29 13:52                       ` Hannes Reinecke
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2025-07-29 11:58 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: David Hildenbrand, Oscar Salvador, linux-mm, linux-kernel,
	Hannes Reinecke

On Tue 29-07-25 11:33:58, Hannes Reinecke wrote:
> On 7/29/25 11:19, Michal Hocko wrote:
> > On Tue 29-07-25 09:24:37, Hannes Reinecke wrote:
> > > On 7/28/25 15:08, David Hildenbrand wrote:
> > > > On 28.07.25 15:06, Michal Hocko wrote:
> > > > > On Mon 28-07-25 11:37:46, Hannes Reinecke wrote:
> > > > > > On 7/28/25 11:10, David Hildenbrand wrote:
> > > > > > And to make matters worse, we have two competing user-space programs:
> > > > > > - udev
> > > > > > - daxctl
> > > > > > neither of which is (or can be made) aware of each other.
> > > > > > This leads to races and/or inconsistencies.
> > > > > 
> > > > > Would it help if generic udev memory hotplug rule exclude anything that
> > > > > is dax backed? Is there a way to check for that? Sorry if this is a
> > > > > stupid question.
> > > > Parsing /proc/iomem, it's indicated as "System RAM (kmem)".
> > > > 
> > > I would rather do it the other way round, and make daxctl aware of
> > > udev. In the end, even 'daxctl' uses the sysfs interface to online
> > > memory, which really is the territory of udev and can easily be
> > > done via udev rules (for static configuration).
> > 
> > udev doesn't really have any context what user space wants to do with
> > the memory and therefore how to online it. Therefore we have (arguably)
> > ugly hacks like auto onlining and movable_ratio etc. daxctl can take
> > information from the admin directly and therefore it can do what is
> > needed without further hacks.
> > 
> Huh?
> I thought udev was _all_ about userspace preferences...
> We can easily have udev rules onlining memory with whatever policy
> the user want; the whole point of udev rules is that they are dynamic
> and can include policy decisions.

My experience with memory hotplug and udev doesn't match that. Udev
sees memory blocks showing up rather than understanding any concept of
what is the memory behind that. So any actual policy is rather hard to
define. You would need to backtrack what kind of memory blocks you are
seeing and what the initiator could have intended with them. 

While this could work reasonably for regular RAM appearing in your
system asynchronously (e.g. physical memory plugged in or a virtual system
getting more memory) when you always want to online it in a certain way,
I suspect this falls short for a synchronous daxctl-like use case where you
know what to do with that memory and you can operate on sysfs directly.
Udev just makes life much more complicated for the latter IMO.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] Disable auto_movable_ratio for selfhosted memmap
  2025-07-29 11:58                     ` Michal Hocko
@ 2025-07-29 13:52                       ` Hannes Reinecke
  0 siblings, 0 replies; 24+ messages in thread
From: Hannes Reinecke @ 2025-07-29 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Oscar Salvador, linux-mm, linux-kernel,
	Hannes Reinecke

On 7/29/25 13:58, Michal Hocko wrote:
> On Tue 29-07-25 11:33:58, Hannes Reinecke wrote:
>> On 7/29/25 11:19, Michal Hocko wrote:
>>> On Tue 29-07-25 09:24:37, Hannes Reinecke wrote:
>>>> On 7/28/25 15:08, David Hildenbrand wrote:
>>>>> On 28.07.25 15:06, Michal Hocko wrote:
>>>>>> On Mon 28-07-25 11:37:46, Hannes Reinecke wrote:
>>>>>>> On 7/28/25 11:10, David Hildenbrand wrote:
>>>>>>> And to make matters worse, we have two competing user-space programs:
>>>>>>> - udev
>>>>>>> - daxctl
>>>>>>> neither of which is (or can be made) aware of each other.
>>>>>>> This leads to races and/or inconsistencies.
>>>>>>
>>>>>> Would it help if the generic udev memory hotplug rule excluded
>>>>>> anything that is dax backed? Is there a way to check for that?
>>>>>> Sorry if this is a stupid question.
>>>>> Parsing /proc/iomem, it's indicated as "System RAM (kmem)".
>>>>>
>>>> I would rather do it the other way round, and make daxctl aware of
>>>> udev. In the end, even 'daxctl' uses the sysfs interface to online
>>>> memory, which really is the territory of udev and can easily be
>>>> done via udev rules (for static configuration).
>>>
>>> udev doesn't really have any context about what user space wants to
>>> do with the memory and therefore how to online it. Therefore we have
>>> (arguably) ugly hacks like auto onlining and movable_ratio etc. daxctl
>>> can take information from the admin directly and therefore it can do
>>> what is needed without further hacks.
>>>
>> Huh?
>> I thought udev was _all_ about userspace preferences...
>> We can easily have udev rules onlining memory with whatever policy
>> the user wants; the whole point of udev rules is that they are dynamic
>> and can include policy decisions.
> 
> My experience with memory hotplug and udev doesn't match that. Udev
> sees memory blocks showing up without any concept of what memory is
> behind them. So any actual policy is rather hard to define. You would
> need to backtrack what kind of memory blocks you are seeing and what
> the initiator could have intended with them.
> 
> While this could work reasonably for regular RAM appearing in your
> system asynchronously (e.g. physical memory plugged in or a virtual
> system getting more memory), when you always want to online it in a
> certain way, I suspect it falls short for a synchronous, daxctl-like
> use case where you know what to do with that memory and can operate on
> sysfs directly. Udev just makes life much more complicated for the
> latter IMO.

Not disagreeing with that.
The problem is that memory hotplug is tied to sysfs (and, with that, to
udev) such that currently there is no way of _not_ sending uevents
(and, consequently, of keeping udev from interfering).

We could (for any of the 'auto' modes) disable uevent generation
via the dev_set_uevent_suppress() thingie.
Or we could teach daxctl to wait for uevents after an action has
been triggered.

But the current situation is really daft, requiring the user to move
udev rules out of the way in specific situations.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
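
The second option above (having the tool wait after triggering an
action) can be approximated without listening for uevents at all. The
sketch below is a simpler stand-in, not the uevent-based approach
itself: it polls a memory block's sysfs state file until it settles.
The function name, timeout values and the temporary file standing in
for /sys/devices/system/memory/memoryN/state are assumptions made for
illustration.

```python
import os
import tempfile
import time


def wait_for_state(state_path, wanted="online", timeout=5.0, interval=0.05):
    """Poll a memory block's sysfs `state` file until it reads `wanted`.

    Returns True once the state matches, False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        with open(state_path) as f:
            if f.read().strip() == wanted:
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo against a temporary file standing in for
# /sys/devices/system/memory/memoryN/state (N is a placeholder).
with tempfile.TemporaryDirectory() as d:
    fake_state = os.path.join(d, "state")
    with open(fake_state, "w") as f:
        f.write("offline\n")
    timed_out = wait_for_state(fake_state, timeout=0.2)
    with open(fake_state, "w") as f:
        f.write("online\n")
    settled = wait_for_state(fake_state, timeout=0.2)
print(timed_out, settled)
```

Polling only confirms that the block ended up online; it does not tell
the caller whether a udev rule got there first and onlined it with a
different policy, which is the race described earlier in the thread.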


Thread overview: 24+ messages
2025-07-28  8:15 [RFC] Disable auto_movable_ratio for selfhosted memmap Oscar Salvador
2025-07-28  8:44 ` David Hildenbrand
2025-07-28  9:28   ` Hannes Reinecke
2025-07-28  9:42     ` David Hildenbrand
2025-07-28  8:48 ` Michal Hocko
2025-07-28  8:53   ` David Hildenbrand
2025-07-28  9:04     ` Michal Hocko
2025-07-28  9:10       ` David Hildenbrand
2025-07-28  9:37         ` Hannes Reinecke
2025-07-28 13:06           ` Michal Hocko
2025-07-28 13:08             ` David Hildenbrand
2025-07-29  7:24               ` Hannes Reinecke
2025-07-29  9:19                 ` Michal Hocko
2025-07-29  9:29                   ` David Hildenbrand
2025-07-29  9:33                   ` Hannes Reinecke
2025-07-29 11:58                     ` Michal Hocko
2025-07-29 13:52                       ` Hannes Reinecke
2025-07-28 15:15           ` David Hildenbrand
2025-07-28 12:17         ` Michal Hocko
2025-07-28 12:27           ` David Hildenbrand
2025-07-28 12:27             ` David Hildenbrand
2025-07-28 13:00               ` Michal Hocko
2025-07-28 13:03                 ` David Hildenbrand
2025-07-28 12:54             ` Michal Hocko
