From: "yanjun.zhu" <yanjun.zhu@linux.dev>
To: Tao Cui <cui.tao@linux.dev>,
tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
leon@kernel.org, jgg@ziepe.ca
Cc: linux-rdma@vger.kernel.org, cgroups@vger.kernel.org,
Tao Cui <cuitao@kylinos.cn>
Subject: Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
Date: Fri, 29 May 2026 14:14:11 -0700
Message-ID: <ea3c6ed3-5d15-436e-9fa7-2e2d8ce26147@linux.dev>
In-Reply-To: <20260529090733.2242822-1-cui.tao@linux.dev>

On 5/29/26 2:07 AM, Tao Cui wrote:
> From: Tao Cui <cuitao@kylinos.cn>
>
> Currently the RDMA cgroup only tracks two aggregate counters:
> hca_handle and hca_object. The real scarce resource in multi-tenant
> deployments is pinned memory: how much physical memory gets registered
> through MRs. The existing hca_object counter is too coarse to capture
> this.
>
> This series adds a single new resource type:
>
> - mr_mem - Cumulative MR memory size in bytes
>
> The per-object-type counters (qp, mr) from RFC v1 have been removed
> per review feedback [1]: modern NICs pool objects from the same memory
> pool so the distinction between QP count and MR count is not
> meaningful for resource limiting. hca_object remains sufficient for
> coarse object accounting.
>
> After this series, an administrator can set limits like:
>
> echo "mlx5_0 mr_mem=1073741824" > rdma.max
>

Hi,

Thanks for the patchset! Introducing `mr_mem` to track and limit pinned
memory size is a very practical enhancement for multi-tenant deployments.

I have a question regarding how this new resource type interacts with
Fast Registration (FRWR / FRMR), which is widely used in production
environments (e.g., NVMe-oF, iSER) to achieve high performance.

As we know, FRWR decouples the MR object allocation (`ib_alloc_mr`) from
the actual memory page mapping (`ib_map_mr_sg`). The creation of FRWR
Memory Regions is often managed via a pre-allocated page pool.

Could you clarify how `mr_mem` accounts for FRWR in the following scenarios?

1. Accounting Granularity: Does `mr_mem` charge the maximum capacity of
   the FRWR object at its allocation time (`ib_alloc_mr`), or does it
   dynamically track the actual mapped bytes during the fast-reg data
   path? If it's the former, it represents a "static maximum budget" per
   pool, which seems more practical for performance.

2. Kernel-space vs Userspace: FRWR pools are frequently allocated by
   kernel-space drivers (like NVMe-oF target/host). If these kernel
   threads are not bound to a specific user cgroup, will their FRWR
   allocations end up in the root cgroup, potentially bypassing the
   per-tenant limits?

Don't you think it would be beneficial to explicitly document or
consider the FRWR pattern in the design section, given its prevalence in
real-world storage and networking workloads?
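
To make the two options in question 1 concrete, here is a small user-space
model in C. It is only a sketch and not code from the series: `struct
mr_mem_budget`, `budget_try_charge()`, `frwr_alloc_static()` and
`frwr_map_dynamic()` are hypothetical names, and treating the worst-case
capacity of an FRWR MR as `max_num_sg` pages of 4096 bytes is an assumption
made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for the model; real systems may differ. */
#define MODEL_PAGE_SIZE 4096ULL

/* Hypothetical stand-in for one cgroup's mr_mem counter on one device. */
struct mr_mem_budget {
	uint64_t max;   /* limit in bytes, as an admin would write to rdma.max */
	uint64_t usage; /* bytes currently charged; invariant: usage <= max */
};

/* Charge bytes against the budget; fail with no side effects if over limit. */
static bool budget_try_charge(struct mr_mem_budget *b, uint64_t bytes)
{
	if (bytes > b->max - b->usage)
		return false;
	b->usage += bytes;
	return true;
}

/* Return previously charged bytes to the budget. */
static void budget_uncharge(struct mr_mem_budget *b, uint64_t bytes)
{
	assert(bytes <= b->usage);
	b->usage -= bytes;
}

/*
 * Option A ("static maximum budget"): charge the worst-case capacity once,
 * at MR allocation time. Nothing is charged later in the data path.
 */
static bool frwr_alloc_static(struct mr_mem_budget *b, uint32_t max_num_sg)
{
	return budget_try_charge(b, (uint64_t)max_num_sg * MODEL_PAGE_SIZE);
}

/*
 * Option B (dynamic): charge the bytes actually mapped on every fast-reg
 * mapping. The caller would uncharge the same amount on invalidation.
 */
static bool frwr_map_dynamic(struct mr_mem_budget *b, uint64_t mapped_bytes)
{
	return budget_try_charge(b, mapped_bytes);
}
```

In this model, option A touches the counter only on allocation and
destruction, while option B adds a charge and an uncharge to every mapping
and invalidation, which is the data-path cost I am worried about.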

Thanks,
Zhu Yanjun

> Design
> ~~~~~~
>
> mr_mem is not page-level ownership tracking; it is object-based
> accounting tied to the MR lifetime:
>
> - charged at MR registration time
> - uncharged at MR destruction time
> - the charge is pinned to the cgroup that created the MR for the
> entire lifetime of the MR object
>
> This model intentionally defines accounting semantics around MR
> object lifetime rather than page ownership:
>
> 1. fork(): fork() does not duplicate MR objects. Even though the
> child inherits the uverbs fd and can access the parent's ucontext,
> the MR remains a single kernel object. The charge is tied to the
> MR object, not to the number of processes that can reach it, so
> no splitting or re-accounting is needed.
>
> 2. Cgroup migration: mr_mem follows the same semantics as the existing
> hca_object; charge at creation time against the invoking task's
> cgroup, uncharge at destruction time. The RDMA cgroup does not
> implement can_attach/attach callbacks today, so charges do not
> migrate with the task. This is a known limitation that applies
> equally to hca_handle and hca_object. mr_mem does not introduce
> any new complication here.
>
> 3. Overlap with memory cgroup: mr_mem does not count process memory
> usage; it represents a per-device DMA registration budget: the
> amount of memory this cgroup may register through a given HCA.
> This is a different dimension from what memory cgroup tracks. An
> administrator might set mr_mem limits differently per device, which
> memory cgroup cannot express.
>
> In particular, mr_mem tracks the registered memory range associated
> with the MR rather than exact dynamically pinned pages (e.g. for
> ODP MRs). This is a stable, policy-oriented approximation of
> registration footprint, not an attempt at precise physical page
> accounting.
>
> Tao Cui (3):
> cgroup/rdma: extend charge/uncharge API with s64 amount parameter
> cgroup/rdma: add MR memory size resource tracking
> cgroup/rdma: update cgroup resource list for MR_MEM
>
> Documentation/admin-guide/cgroup-v2.rst | 21 ++--
> drivers/infiniband/core/cgroup.c | 10 +-
> drivers/infiniband/core/core_priv.h | 12 +-
> drivers/infiniband/core/rdma_core.c | 20 +++-
> drivers/infiniband/core/uverbs_cmd.c | 61 +++++++++-
> drivers/infiniband/core/uverbs_std_types_mr.c | 37 ++++++
> include/linux/cgroup_rdma.h | 8 +-
> include/rdma/ib_verbs.h | 1 +
> kernel/cgroup/rdma.c | 108 +++++++++++++-----
> 9 files changed, 219 insertions(+), 59 deletions(-)
>
> ---
> Changes from RFC v1:
>
> - Removed RDMACG_RESOURCE_QP and RDMACG_RESOURCE_MR per-type
> counters following review feedback from Jason Gunthorpe [1].
> - Retained only RDMACG_RESOURCE_MR_MEM as the sole new resource.
> - Added detailed semantic notes to the commit messages addressing
> fork(), cgroup migration, and overlap with memory cgroup [2].
> - Renamed patches to reflect the narrower scope.
>
> [1] https://lore.kernel.org/all/20260525134314.GI7702@ziepe.ca/
> [2] https://lore.kernel.org/all/20260528075537.2170697-1-cuitao@kylinos.cn/