From: Tejun Heo <tj@kernel.org>
To: "Christian König" <christian.koenig@amd.com>
Cc: Dave Airlie <airlied@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
dri-devel@lists.freedesktop.org, Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
cgroups@vger.kernel.org, Waiman Long <longman@redhat.com>,
simona@ffwll.ch
Subject: Re: [rfc] drm/ttm/memcg: simplest initial memcg/ttm integration (v2)
Date: Fri, 23 May 2025 07:06:53 -1000
Message-ID: <aDCrLTNoWC8oSS7Z@slm.duckdns.org>
In-Reply-To: <de476962-194f-4c77-aabb-559a74caf5ac@amd.com>
Hello, Christian.
On Fri, May 23, 2025 at 09:58:58AM +0200, Christian König wrote:
...
> > - There's a GPU workload which uses a sizable amount of system memory for
> > the pool being discussed in this thread. This GPU workload is very
> > important, so we want to make sure that other activities in the system
> > don't bother it. We give it plenty of isolated CPUs and protect its memory
> > with high enough memory.low.
>
> That situation simply doesn't happen. See isolation is *not* a requirement
> for the pool.
...
> See the submission model of GPUs is best effort. E.g. you don't guarantee
> any performance isolation between processes whatsoever. If we would start
> to do this we would need to start re-designing the HW.
This is a radical claim. Let's table the rest of the discussion for now. I
don't know enough to tell whether this claim is true or not, but for this to
be true, the following should be true:
Whether the GPU memory pool is reclaimed or not doesn't have a noticeable
impact on GPU performance.
Is this true?
As for the scenario that I described above, I didn't just come up with it.
I'm only supporting it from the system side, but it's based on what our ML
folks are doing right now. We have a bunch of large machines with multiple GPUs
running ML workloads. The workloads can run for a long time spread across
many machines and they synchronize frequently, so any performance drop on
one GPU lowers utilization on all involved GPUs, whose number can go up to
three digits. For example, any scheduling disturbance on the submitting thread
propagates through the whole cluster and slows down all involved GPUs.
Also, because these machines are large on the CPU and memory sides too and
aren't doing a whole lot other than managing the GPUs, people want to put a
significant amount of CPU work on them, which can easily create at least
moderate memory pressure. Is the claim that the write-combined memory pool
doesn't have any meaningful impact on the GPU workload performance?
Thanks.
--
tejun