From: David Airlie <airlied@redhat.com>
Date: Thu, 3 Jul 2025 15:53:37 +1000
Subject: Re: [PATCH 12/17] ttm: add objcg pointer to bo and tt
To: Christian König
Cc: Dave Airlie, dri-devel@lists.freedesktop.org, linux-mm@kvack.org, Johannes Weiner, Dave Chinner, Kairui Song

On Wed, Jul 2, 2025 at 6:24 PM Christian König wrote:
>
> On 02.07.25 09:57, David Airlie wrote:
> >>>
> >>> It makes it easier now, but when we have to solve swapping, step one
> >>> will be moving all this code around to what I have now, and starting
> >>> from there.
> >>>
> >>> This just raises the bar to solving the next problem.
> >>>
> >>> We need to find incremental approaches to getting all the pieces of
> >>> the puzzle solved, or else we will still be here in 10 years.
> >>>
> >>> The steps I've formulated (none of them are perfect, but they all seem
> >>> better than the status quo):
> >>>
> >>> 1. add global counters for pages - now we can at least see things in
> >>>    vmstat and per-node
> >>> 2. add numa to the pool lru - we can remove our own numa code and
> >>>    align with the core kernel - probably doesn't help anything
> >>
> >> So far no objections from my side to that.
> >>
> >>> 3. add memcg awareness to the pool and pool shrinker.
> >>>    if you are on an APU with no swap configured - you have a lot better time.
> >>>    if you are on a dGPU or an APU with swap - you have a moderately
> >>>    better time, but I can't see you having a worse time.
> >>
> >> Well that's what I'm strongly disagreeing on.
> >>
> >> Adding memcg to the pool has no value at all and complicates things massively when moving forward.
> >>
> >> What exactly should be the benefit of that?
> >
> > I'm already showing the benefit of the pool moving to memcg; we've
> > even talked about it multiple times on the list. It's not an
> > OMG-change-the-world benefit, but it definitely provides better
> > alignment between the pool and memcg allocations.
> >
> > We expose a userspace API to allocate write-combined memory, and we do
> > this for all currently supported CPUs/GPUs. We might think in the future
> > we don't want to continue to do this, but we do it now. My Fedora 42
> > desktop uses it, even if you say there is no need.
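
To be concrete about the userspace API I mean: something like the below,
which any amdgpu userspace can issue today, allocates a GTT BO whose CPU
mapping is write-combined, i.e. its pages come from the WC pool we're
arguing about. Illustration only, nothing from this series - it assumes a
render node at /dev/dri/renderD128 and a build against libdrm.

/*
 * Illustration only, not code from this series.  Assumes an amdgpu render
 * node at /dev/dri/renderD128 and a build against libdrm
 * (pkg-config --cflags --libs libdrm).
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

int main(void)
{
        int fd = open("/dev/dri/renderD128", O_RDWR);
        if (fd < 0)
                return 1;

        union drm_amdgpu_gem_create args = {
                .in = {
                        .bo_size      = 1ull << 20,            /* 1 MiB BO */
                        .alignment    = 4096,
                        .domains      = AMDGPU_GEM_DOMAIN_GTT, /* system memory */
                        /* CPU mapping is write-combined, so TTM hands these
                         * pages out of (and back to) its WC page pool */
                        .domain_flags = AMDGPU_GEM_CREATE_CPU_GTT_USWC,
                },
        };

        if (drmIoctl(fd, DRM_IOCTL_AMDGPU_GEM_CREATE, &args) == 0)
                printf("WC GTT BO handle %u\n", args.out.handle);

        close(fd);
        return 0;
}
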
> >
> > If I allocate 100% of my memcg budget to WC memory, free it, then
> > allocate 100% of my budget to non-WC memory, we break container
> > containment, as we can force other cgroups to run out of memory budget
> > and have to call the global shrinker.
>
> Yeah and that is perfectly intentional.

But it's wrong. We've been told by the mm/cgroup people that this isn't
correct behaviour and we should fix it, and in order to move forward with
fixing our other problems, we should start with this one. We are violating
cgroup containment and we should endeavour to stop doing so.

> > With this in place, the
> > container that allocates the WC memory also pays the price to switch
> > it back. Again, this is just correctness; it's not going to fix any
> > major workloads, but I also don't think it should cause any
> > regressions, since it won't be worse than the current worst-case
> > expectation for most workloads.
>
> No, this is not correct behavior any more.
>
> Memory which is used by your cgroup is not used for allocations by another cgroup any more, nor given back to the core memory management for the page pool. E.g. one cgroup can't steal the memory from another cgroup any more.
>
> In other words, that is reserving the memory for the cgroup and not giving it back to the global pool as soon as you free it.

But what is the big advantage of giving it back to the global pool here?
I'm pretty sure neither the worst-case nor the steady-state behaviour will
change here, but the ability for one cgroup to help or hinder another
cgroup will be curtailed, which as far as I can see is what the cgroup
behaviour is meant to be. Each piece operates in its own container, and
can cause minimal disruption, either good or bad, to other containers.

> That would only be acceptable if we have a per-cgroup limit on the pool size which is *much* lower than the current global limit we have.

That is up to whoever configures the cgroup limits. If they say this
process should only have access to 1GB of RAM, then between normal RAM and
uncached/WC RAM they get 1GB, and if they need to move RAM between those
via the ttm shrinker then it's all contained in that cgroup. This isn't
taking swapping into account, but we don't do that now anyway.

> Maybe we could register a memcg aware shrinker, but not make the LRU memcg aware or something like that.
>
> As far as I can see that would give us the benefit of both approaches, the only problem is that we would have to do per cgroup counter tracking on our own.
>
> That's why I asked if we could have TTM pool specific variables in the cgroup.
>
> Another alternative would be to change the LRU so that we track per memcg, but allow stealing of pages between cgroups.

I just don't get why we'd want to steal pages; if you want to do that,
just put all the processes in the same cgroup. We leave it up to the
cgroup administration to decide what they want to share between processes.
That policy shouldn't be in the driver/ttm layers; it should be entirely
configurable by the admin, and default to reasonably sane behaviours.

If there is a system out there already using cgroups for containment, but
relying on this cgroup bypass to share uncached/WC pages, then clearly
it's not a great system, and we should figure out how to fix that. If we
need a backwards-compat flag to turn this off, then I'm fine with that,
but we've been told by the cgroup folks that it's not really correct
cgroup usage, and we should discourage it.
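
Coming back to the memcg-aware pool shrinker in step 3: what I'm picturing
is roughly the below - the pool pages sit on a memcg-aware list_lru
feeding a memcg-aware (and numa-aware) shrinker, so count/scan get invoked
per (memcg, node) pair via sc->memcg and sc->nid. Sketch only, not the
actual patches: the ttm_pool_* names are made up for illustration, and it
assumes the shrinker_alloc()/list_lru_init_memcg() APIs in current
kernels.

/*
 * Sketch only, not from the patch series: hypothetical ttm_pool_* names,
 * assuming the shrinker_alloc()/list_lru_init_memcg() kernel APIs.
 */
#include <linux/list_lru.h>
#include <linux/shrinker.h>

static struct list_lru ttm_pool_lru;    /* hypothetical per-memcg pool page LRU */

static unsigned long ttm_pool_shrinker_count(struct shrinker *shrink,
                                             struct shrink_control *sc)
{
        /* Only counts pool pages charged to sc->memcg on node sc->nid. */
        return list_lru_shrink_count(&ttm_pool_lru, sc);
}

static unsigned long ttm_pool_shrinker_scan(struct shrinker *shrink,
                                            struct shrink_control *sc)
{
        /*
         * The real version would list_lru_shrink_walk() the per-memcg list
         * and release pool pages back to the core MM; stubbed out here
         * since that part is what the series actually implements.
         */
        return SHRINK_STOP;
}

static int ttm_pool_shrinker_init(void)
{
        struct shrinker *shrinker;

        shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE | SHRINKER_NUMA_AWARE,
                                  "ttm-pool");
        if (!shrinker)
                return -ENOMEM;

        if (list_lru_init_memcg(&ttm_pool_lru, shrinker)) {
                shrinker_free(shrinker);
                return -ENOMEM;
        }

        shrinker->count_objects = ttm_pool_shrinker_count;
        shrinker->scan_objects = ttm_pool_shrinker_scan;
        shrinker_register(shrinker);
        return 0;
}

The point being that once the LRU itself is per-memcg, the per-cgroup
counters you mention come along for free rather than us having to do our
own tracking.
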
> > I understand we have to add more code to the tt level and that's fine,
> > I just don't see why you think starting at the bottom level is wrong?
> > It clearly has a use, and it's just cleaning up and preparing the
> > levels, so we can move up and solve the next problem.
>
> Because we don't have the necessary functionality to implement a memcg aware shrinker which moves BOs into swap there.

We need to have two levels of shrinker here. I'm not disputing that the
tt-level shrinker, like the one xe has, needs more work, but right now we
have two shrinkers that aren't aware of numa or memcg, and I'd like to
start by reducing that to one for the corner case that nobody really cares
about but that would be good to get correct.

Then we can work on the swap/shrinker problem, which isn't this shrinker,
and if after we do that we find a single shrinker could take care of it
all, then we move towards that.

Dave.