Linux cgroups development
From: "yanjun.zhu" <yanjun.zhu@linux.dev>
To: Tao Cui <cui.tao@linux.dev>,
	tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
	leon@kernel.org, jgg@ziepe.ca
Cc: linux-rdma@vger.kernel.org, cgroups@vger.kernel.org,
	Tao Cui <cuitao@kylinos.cn>
Subject: Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
Date: Fri, 29 May 2026 14:14:11 -0700
Message-ID: <ea3c6ed3-5d15-436e-9fa7-2e2d8ce26147@linux.dev>
In-Reply-To: <20260529090733.2242822-1-cui.tao@linux.dev>

On 5/29/26 2:07 AM, Tao Cui wrote:
> From: Tao Cui <cuitao@kylinos.cn>
> 
> Currently the RDMA cgroup only tracks two aggregate counters:
> hca_handle and hca_object.  The real scarce resource in multi-tenant
> deployments is pinned memory: how much physical memory gets registered
> through MRs.  The existing hca_object counter is too coarse to capture
> this.
> 
> This series adds a single new resource type:
> 
>    - mr_mem  - Cumulative MR memory size in bytes
> 
> The per-object-type counters (qp, mr) from RFC v1 have been removed
> per review feedback [1]: modern NICs pool objects from the same memory
> pool so the distinction between QP count and MR count is not
> meaningful for resource limiting.  hca_object remains sufficient for
> coarse object accounting.
> 
> After this series, an administrator can set limits like:
> 
>      echo "mlx5_0 mr_mem=1073741824" > rdma.max
> 

Hi,

Thanks for the patchset! Introducing `mr_mem` to track and limit pinned
memory size is a very practical enhancement for multi-tenant deployments.

I have a question regarding how this new resource type interacts with
Fast Registration (FRWR / FRMR), which is widely used in production
environments (e.g., NVMe-oF, iSER) to achieve high performance.

FRWR decouples MR object allocation (`ib_alloc_mr`) from the actual
memory page mapping (`ib_map_mr_sg`), and FRWR MRs are often drawn
from a pre-allocated MR pool.

Could you clarify how `mr_mem` accounts for FRWR in the following scenarios?

1. Accounting Granularity: Does `mr_mem` charge the maximum capacity of
    the FRWR object at its allocation time (`ib_alloc_mr`), or does it
    dynamically track the actual mapped bytes during the fast-reg data
    path? If it's the former, it represents a "static maximum budget"
    per pool, which seems more practical for performance.

2. Kernel-space vs Userspace: FRWR pools are frequently allocated by
    kernel-space drivers (like NVMe-oF target/host). If these kernel
    threads are not bound to a specific user cgroup, will their FRWR
    allocations end up in the root cgroup, potentially bypassing the
    per-tenant limits?
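
To make the two options in question 1 concrete, here is a minimal
userspace sketch of the arithmetic only. Every name in it (toy_cgroup,
toy_charge, toy_alloc_mr_static, toy_map_dynamic, toy_unmap_dynamic) is
hypothetical and exists neither in this series nor in the kernel; it
does not reflect how the patches actually charge mr_mem.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model of one cgroup's mr_mem counter for one device. */
struct toy_cgroup {
	int64_t mr_mem_usage;	/* bytes currently charged */
	int64_t mr_mem_max;	/* limit in bytes */
};

/* Charge 'bytes'; return 0 on success, -1 if the limit would be exceeded. */
static int toy_charge(struct toy_cgroup *cg, int64_t bytes)
{
	if (bytes < 0 || bytes > cg->mr_mem_max - cg->mr_mem_usage)
		return -1;
	cg->mr_mem_usage += bytes;
	return 0;
}

/* Return 'bytes' to the budget; the caller must have charged them before. */
static void toy_uncharge(struct toy_cgroup *cg, int64_t bytes)
{
	assert(bytes >= 0 && bytes <= cg->mr_mem_usage);
	cg->mr_mem_usage -= bytes;
}

/*
 * Option A, "static maximum budget": charge the full capacity once,
 * at MR allocation time. Nothing happens in the fast-reg data path.
 */
static int toy_alloc_mr_static(struct toy_cgroup *cg, int64_t max_pages,
			       int64_t page_size)
{
	return toy_charge(cg, max_pages * page_size);
}

/*
 * Option B, dynamic tracking: charge the bytes actually mapped on each
 * fast registration and uncharge them on invalidation. This puts the
 * counter update on the data path.
 */
static int toy_map_dynamic(struct toy_cgroup *cg, int64_t mapped_bytes)
{
	return toy_charge(cg, mapped_bytes);
}

static void toy_unmap_dynamic(struct toy_cgroup *cg, int64_t mapped_bytes)
{
	toy_uncharge(cg, mapped_bytes);
}
```

With a 1 MiB limit, option A charges a 256-page MR (4 KiB pages) in
full at allocation and refuses any further MR, even if only a few pages
are ever mapped. Option B charges only what is mapped at a given
moment, at the cost of a counter update per registration.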

Would it be worth explicitly documenting, or at least considering, the
FRWR pattern in the design section, given its prevalence in real-world
storage and networking workloads?

Thanks,
Zhu Yanjun

> Design
> ~~~~~~
> 
> mr_mem is not page-level ownership tracking; it is object-based
> accounting tied to the MR lifetime:
> 
>    - charged at MR registration time
>    - uncharged at MR destruction time
>    - the charge is pinned to the cgroup that created the MR for the
>      entire lifetime of the MR object
> 
> This model intentionally defines accounting semantics around MR
> object lifetime rather than page ownership:
> 
> 1. fork(): fork() does not duplicate MR objects.  Even though the
>     child inherits the uverbs fd and can access the parent's ucontext,
>     the MR remains a single kernel object.  The charge is tied to the
>     MR object, not to the number of processes that can reach it, so
>     no splitting or re-accounting is needed.
> 
> 2. Cgroup migration: mr_mem follows the same semantics as the existing
>     hca_object; charge at creation time against the invoking task's
>     cgroup, uncharge at destruction time.  The RDMA cgroup does not
>     implement can_attach/attach callbacks today, so charges do not
>     migrate with the task.  This is a known limitation that applies
>     equally to hca_handle and hca_object.  mr_mem does not introduce
>     any new complication here.
> 
> 3. Overlap with memory cgroup: mr_mem does not count process memory
>     usage; it represents a per-device DMA registration budget: the
>     amount of memory this cgroup may register through a given HCA.
>     This is a different dimension from what memory cgroup tracks.  An
>     administrator might set mr_mem limits differently per device, which
>     memory cgroup cannot express.
> 
>     In particular, mr_mem tracks the registered memory range associated
>     with the MR rather than exact dynamically pinned pages (e.g. for
>     ODP MRs).  This is a stable, policy-oriented approximation of
>     registration footprint, not an attempt at precise physical page
>     accounting.
> 
> Tao Cui (3):
>    cgroup/rdma: extend charge/uncharge API with s64 amount parameter
>    cgroup/rdma: add MR memory size resource tracking
>    cgroup/rdma: update cgroup resource list for MR_MEM
> 
>   Documentation/admin-guide/cgroup-v2.rst       |  21 ++--
>   drivers/infiniband/core/cgroup.c              |  10 +-
>   drivers/infiniband/core/core_priv.h           |  12 +-
>   drivers/infiniband/core/rdma_core.c           |  20 +++-
>   drivers/infiniband/core/uverbs_cmd.c          |  61 +++++++++-
>   drivers/infiniband/core/uverbs_std_types_mr.c |  37 ++++++
>   include/linux/cgroup_rdma.h                   |   8 +-
>   include/rdma/ib_verbs.h                       |   1 +
>   kernel/cgroup/rdma.c                          | 108 +++++++++++++-----
>   9 files changed, 219 insertions(+), 59 deletions(-)
> 
> ---
> Changes from RFC v1:
> 
>    - Removed RDMACG_RESOURCE_QP and RDMACG_RESOURCE_MR per-type
>      counters following review feedback from Jason Gunthorpe [1].
>    - Retained only RDMACG_RESOURCE_MR_MEM as the sole new resource.
>    - Added detailed semantic notes to the commit messages addressing
>      fork(), cgroup migration, and overlap with memory cgroup [2].
>    - Renamed patches to reflect the narrower scope.
> 
> [1] https://lore.kernel.org/all/20260525134314.GI7702@ziepe.ca/
> [2] https://lore.kernel.org/all/20260528075537.2170697-1-cuitao@kylinos.cn/


Thread overview (7+ messages):
2026-05-29  9:07 [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
2026-05-29  9:07 ` [PATCH rdma-next v2 1/3] cgroup/rdma: extend charge/uncharge API with s64 amount parameter Tao Cui
2026-05-29  9:07 ` [PATCH rdma-next v2 2/3] cgroup/rdma: add MR memory size resource tracking Tao Cui
2026-05-29  9:07 ` [PATCH rdma-next v2 3/3] cgroup/rdma: update cgroup resource list for MR_MEM Tao Cui
2026-05-29 16:18   ` kernel test robot
2026-05-29 12:46 ` [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking Michal Koutný
2026-05-29 21:14 ` yanjun.zhu [this message]
