From: "yanjun.zhu" <yanjun.zhu@linux.dev>
To: Tao Cui <cui.tao@linux.dev>,
tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com,
leon@kernel.org, jgg@ziepe.ca
Cc: linux-rdma@vger.kernel.org, cgroups@vger.kernel.org,
Tao Cui <cuitao@kylinos.cn>
Subject: Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
Date: Fri, 29 May 2026 14:14:11 -0700
Message-ID: <ea3c6ed3-5d15-436e-9fa7-2e2d8ce26147@linux.dev>
In-Reply-To: <20260529090733.2242822-1-cui.tao@linux.dev>
On 5/29/26 2:07 AM, Tao Cui wrote:
> From: Tao Cui <cuitao@kylinos.cn>
>
> Currently the RDMA cgroup only tracks two aggregate counters:
> hca_handle and hca_object. The real scarce resource in multi-tenant
> deployments is pinned memory: how much physical memory gets registered
> through MRs. The existing hca_object counter is too coarse to capture
> this.
>
> This series adds a single new resource type:
>
> - mr_mem - Cumulative MR memory size in bytes
>
> The per-object-type counters (qp, mr) from RFC v1 have been removed
> per review feedback [1]: modern NICs pool objects from the same memory
> pool so the distinction between QP count and MR count is not
> meaningful for resource limiting. hca_object remains sufficient for
> coarse object accounting.
>
> After this series, an administrator can set limits like:
>
> echo "mlx5_0 mr_mem=1073741824" > rdma.max
>
Hi,

Thanks for the patchset! Introducing `mr_mem` to track and limit pinned
memory size is a very practical enhancement for multi-tenant deployments.

I have a question about how this new resource type interacts with
Fast Registration (FRWR / FRMR), which is widely used in production
environments (e.g., NVMe-oF, iSER) to achieve high performance.

FRWR decouples MR object allocation (`ib_alloc_mr`) from the actual
memory page mapping (`ib_map_mr_sg`), and FRWR MRs are often
pre-allocated into a pool and reused.

Could you clarify how `mr_mem` accounts for FRWR in the following
scenarios?

1. Accounting granularity: Does `mr_mem` charge the maximum capacity of
   the FRWR object at allocation time (`ib_alloc_mr`), or does it
   dynamically track the bytes actually mapped in the fast-reg data
   path? The former would amount to a "static maximum budget" per
   pool, which seems more practical because it keeps accounting out of
   the data path.

2. Kernel space vs. userspace: FRWR pools are frequently allocated by
   kernel-space drivers (such as the NVMe-oF target/host). If these
   kernel threads are not bound to a specific user cgroup, will their
   FRWR allocations end up in the root cgroup, potentially bypassing
   the per-tenant limits?
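
To make the two options in point 1 concrete, here is a rough userspace
sketch of what I have in mind. It is only a toy model: none of these
names (`mr_mem_budget`, `frwr_mr`, `alloc_mr_static`, `map_mr_dynamic`,
and so on) come from the series or from the kernel, and the capacity
formula (max_num_sg * page_size) is my own assumption about what a
"maximum capacity" charge would look like.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one cgroup's mr_mem counter on one device. */
struct mr_mem_budget {
	int64_t usage;	/* bytes currently charged */
	int64_t max;	/* limit, as it would be written to rdma.max */
};

/* Charge 'bytes'; refuse, leaving usage unchanged, if the limit would be exceeded. */
static bool budget_try_charge(struct mr_mem_budget *b, int64_t bytes)
{
	if (bytes < 0 || bytes > b->max - b->usage)
		return false;
	b->usage += bytes;
	return true;
}

static void budget_uncharge(struct mr_mem_budget *b, int64_t bytes)
{
	b->usage -= bytes;
}

/* Hypothetical model of a fast-reg MR. */
struct frwr_mr {
	int64_t capacity;	/* assumed: max_num_sg * page_size */
	int64_t mapped;		/* bytes currently mapped, 0 when idle */
};

/* Option A: charge the full capacity once, at allocation time. */
static bool alloc_mr_static(struct mr_mem_budget *b, struct frwr_mr *mr,
			    uint32_t max_num_sg, uint32_t page_size)
{
	int64_t cap = (int64_t)max_num_sg * page_size;

	if (!budget_try_charge(b, cap))
		return false;
	mr->capacity = cap;
	mr->mapped = 0;
	return true;
}

/* Option A: the charge is dropped only when the MR is destroyed. */
static void dereg_mr_static(struct mr_mem_budget *b, struct frwr_mr *mr)
{
	budget_uncharge(b, mr->capacity);
	mr->capacity = 0;
}

/* Option B: allocation itself is free; only the capacity is recorded. */
static void alloc_mr_dynamic(struct frwr_mr *mr, uint32_t max_num_sg,
			     uint32_t page_size)
{
	mr->capacity = (int64_t)max_num_sg * page_size;
	mr->mapped = 0;
}

/* Option B: charge the mapped bytes on every map in the data path. */
static bool map_mr_dynamic(struct mr_mem_budget *b, struct frwr_mr *mr,
			   int64_t bytes)
{
	if (mr->mapped != 0 || bytes > mr->capacity)
		return false;
	if (!budget_try_charge(b, bytes))
		return false;
	mr->mapped = bytes;
	return true;
}

/* Option B: uncharge when the mapping is invalidated. */
static void invalidate_mr_dynamic(struct mr_mem_budget *b, struct frwr_mr *mr)
{
	budget_uncharge(b, mr->mapped);
	mr->mapped = 0;
}
```

With option A the counter moves only at pool setup and teardown, so a
limit can reject a pool up front. With option B every map and
invalidate touches the counter, and a map can fail in the middle of I/O
once the limit is reached.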

Given how prevalent FRWR is in real-world storage and networking
workloads, would it be worth explicitly documenting, or at least
considering, the FRWR pattern in the design section?

Thanks,
Zhu Yanjun

> Design
> ~~~~~~
>
> mr_mem is not page-level ownership tracking; it is object-based
> accounting tied to the MR lifetime:
>
> - charged at MR registration time
> - uncharged at MR destruction time
> - the charge is pinned to the cgroup that created the MR for the
> entire lifetime of the MR object
>
> This model intentionally defines accounting semantics around MR
> object lifetime rather than page ownership:
>
> 1. fork(): fork() does not duplicate MR objects. Even though the
> child inherits the uverbs fd and can access the parent's ucontext,
> the MR remains a single kernel object. The charge is tied to the
> MR object, not to the number of processes that can reach it, so
> no splitting or re-accounting is needed.
>
> 2. Cgroup migration: mr_mem follows the same semantics as the existing
> hca_object; charge at creation time against the invoking task's
> cgroup, uncharge at destruction time. The RDMA cgroup does not
> implement can_attach/attach callbacks today, so charges do not
> migrate with the task. This is a known limitation that applies
> equally to hca_handle and hca_object. mr_mem does not introduce
> any new complication here.
>
> 3. Overlap with memory cgroup: mr_mem does not count process memory
> usage; it represents a per-device DMA registration budget: the
> amount of memory this cgroup may register through a given HCA.
> This is a different dimension from what memory cgroup tracks. An
> administrator might set mr_mem limits differently per device, which
> memory cgroup cannot express.
>
> In particular, mr_mem tracks the registered memory range associated
> with the MR rather than exact dynamically pinned pages (e.g. for
> ODP MRs). This is a stable, policy-oriented approximation of
> registration footprint, not an attempt at precise physical page
> accounting.
>
> Tao Cui (3):
> cgroup/rdma: extend charge/uncharge API with s64 amount parameter
> cgroup/rdma: add MR memory size resource tracking
> cgroup/rdma: update cgroup resource list for MR_MEM
>
> Documentation/admin-guide/cgroup-v2.rst | 21 ++--
> drivers/infiniband/core/cgroup.c | 10 +-
> drivers/infiniband/core/core_priv.h | 12 +-
> drivers/infiniband/core/rdma_core.c | 20 +++-
> drivers/infiniband/core/uverbs_cmd.c | 61 +++++++++-
> drivers/infiniband/core/uverbs_std_types_mr.c | 37 ++++++
> include/linux/cgroup_rdma.h | 8 +-
> include/rdma/ib_verbs.h | 1 +
> kernel/cgroup/rdma.c | 108 +++++++++++++-----
> 9 files changed, 219 insertions(+), 59 deletions(-)
>
> ---
> Changes from RFC v1:
>
> - Removed RDMACG_RESOURCE_QP and RDMACG_RESOURCE_MR per-type
> counters following review feedback from Jason Gunthorpe [1].
> - Retained only RDMACG_RESOURCE_MR_MEM as the sole new resource.
> - Added detailed semantic notes to the commit messages addressing
> fork(), cgroup migration, and overlap with memory cgroup [2].
> - Renamed patches to reflect the narrower scope.
>
> [1] https://lore.kernel.org/all/20260525134314.GI7702@ziepe.ca/
> [2] https://lore.kernel.org/all/20260528075537.2170697-1-cuitao@kylinos.cn/