On Wed, Jul 23, 2025 at 03:41:47AM +0900, YoungJun Park wrote:
> This leaves us with a few design options:
>
> 1. Treat negative values as valid priorities. Once any device is
>    assigned via `memory.swap.priority`, the NUMA autobind logic is
>    entirely disabled.
>    - Pros: Simplifies implementation; avoids exposing NUMA autobind
>      via cgroup interface.
>    - Cons: Overrides autobind for all devices even if only one is set.
>
> 2. Continue to treat negative values as NUMA autobind weights, without
>    implicit shifting. If a user assigns `-3`, it is stored and used
>    exactly as `-3`, and does not affect other devices.
>    - Pros: Simple and intuitive; matches current implementation
>      semantics.
>    - Cons: Autobind semantics still need to be reasoned about when
>      using the interface.
>
> 3. Disallow setting negative values via `memory.swap.priority`.
>    Existing NUMA autobind config is preserved, but no new autobind
>    configuration is possible from the cgroup interface.
>    - Pros: Keeps cgroup interface simple; no autobind manipulation.
>    - Cons: Autobind infra remains partially active, increasing code
>      complexity.
>
> 4. Keep the current design: allow setting negative values to express
>    NUMA autobind weights explicitly. Devices without overridden values
>    continue to follow NUMA-based dynamic selection.
>    - Pros: Preserves current flexibility; gives users control per
>      device.
>    - Cons: Slightly more complex semantics; NUMA autobind remains a
>      visible part of the interface.
>
> After thinking through these tradeoffs, I'm inclined to think that
> preserving the NUMA autobind option might be the better path forward.
> What are your thoughts on this?
>
> Thank you again for your helpful feedback.

Let me share my mental model to help with forming the design.

I find these per-cgroup swap priorities similar to cpuset -- instead of
having a configured cpumask (bitmask) for each cgroup, you have a
weight-mask over the individual swap devices (or a distribution over
the devices, I hope that's not too big a deviation from priority
ranking).

Then you have the hierarchy, so you need a method to combine
child+parent masks (or global/root) to obtain an effective weight-mask
(and effective ranking) for each cgroup.

Furthermore, there's the NUMA autobinding, which adds another
weight-mask to the game, but this time it's not configured -- it
depends on "who is asking". (Tasks running on node N would have
autobind shifted towards devices associated with node N. Is that how
autobinding works?)

From the hierarchy point of view, you have to compound weight-masks
with top-down preference (so that higher cgroups can override lower
ones), while the autobind weight-mask is only conceivable at the very
bottom (not a cgroup, but depending on the task's NUMA placement).
There I see a bit of a conflict between the two ends.

I think the attempted reconciliation was to allow a single slot of the
weight-mask to be empty, but that may not be practical for the
compounding (which is why you came up with the four variants). So
another option would be to allow the whole weight-mask to be empty (or
uniform) so that it'd be the identity in the compounding operation.

The conflict also exists in the current non-percg priorities -- there
are the global priorities and the autobind priorities. IIUC, the global
level either defines a weight (user prio) or it is empty (defer to NUMA
autobinding).

[I treated rankings and weight-masks of devices as equivalent, but I
left a loophole in how the empty slots in the latter would be converted
to (and from) rankings. This e-mail is already too long.]
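FWIW, to make the compounding concrete, here is a minimal sketch of
what I have in mind -- all names, types and the fixed-size array are
made up for illustration only, this is not actual kernel code:

#define MAX_SWAP_DEVS	8
#define WEIGHT_EMPTY	(-1)	/* slot not configured */

struct swap_weight_mask {
	int weight[MAX_SWAP_DEVS];
};

/*
 * Top-down preference: a slot set by the ancestor wins, a descendant's
 * value only fills slots the ancestor left empty. A fully empty mask
 * is thus the identity of the compounding.
 */
static void compound_mask(const struct swap_weight_mask *ancestor_eff,
			  const struct swap_weight_mask *descendant_cfg,
			  struct swap_weight_mask *out_eff)
{
	for (int i = 0; i < MAX_SWAP_DEVS; i++)
		out_eff->weight[i] =
			ancestor_eff->weight[i] != WEIGHT_EMPTY ?
			ancestor_eff->weight[i] :
			descendant_cfg->weight[i];
}

/*
 * At the very bottom, NUMA autobind fills whatever is still empty,
 * based on the asking task's node.
 */
static void apply_autobind(struct swap_weight_mask *eff,
			   const struct swap_weight_mask *node_autobind)
{
	for (int i = 0; i < MAX_SWAP_DEVS; i++)
		if (eff->weight[i] == WEIGHT_EMPTY)
			eff->weight[i] = node_autobind->weight[i];
}

The point being that an empty mask (or slot) passes the other operand
through unchanged, which is what would let autobind coexist with
explicit per-cgroup weights.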
A very different alternative that comes to my mind, keeping autobinding
and leveraging it for your use case:

- define virtual NUMA nodes [1],
- associate separate swap devices to those nodes,
- utilize task (or actual (mem)cpuset) affinity to those virtual NUMA
  nodes based on each process's swap requirements,
- NUMA autobinding would then yield the device constraints you sought.

HTH,
Michal

[1] Not sure how close this is to the linked series [2], which is AFAIU
    a different kind of virtualization that isn't supposed to be
    exposed to userspace(?).

[2] https://lore.kernel.org/linux-mm/20250429233848.3093350-1-nphamcs@gmail.com/