References: <20250429024139.34365-1-laoar.shao@gmail.com>
 <42ECBC51-E695-4480-A055-36D08FE61C12@nvidia.com>
 <8F000270-A724-4536-B69E-C22701522B89@nvidia.com>
 <20250430174521.GC2020@cmpxchg.org>
 <84DE7C0C-DA49-4E4F-9F66-E07567665A53@nvidia.com>
 <6850ac3f-af96-4cc6-9dd0-926dd3a022c9@huawei-partners.com>
In-Reply-To: <6850ac3f-af96-4cc6-9dd0-926dd3a022c9@huawei-partners.com>
From: Yafang Shao <laoar.shao@gmail.com>
Date: Fri, 2 May 2025 13:48:15 +0800
Subject: Re: [RFC PATCH 0/4] mm, bpf: BPF based THP adjustment
To: Gutierrez Asier
Howlett" , akpm@linux-foundation.org, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, David Hildenbrand , Baolin Wang , Lorenzo Stoakes , Nico Pache , Ryan Roberts , Dev Jain , bpf@vger.kernel.org, linux-mm@kvack.org, Michal Hocko Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: B13C2140003 X-Rspam-User: X-Stat-Signature: b3qwumdwscbgsi5nt3tpxtaprsowfj76 X-HE-Tag: 1746164935-592482 X-HE-Meta: U2FsdGVkX18rLvKh6CzVva5r0KWLRe0AbiJFhhJUZF4SpWNvk0HlNI6KKfntQUqUJvfr3md6WfUkoCJpJZxmrUilLl5hC8jDToP+vRt04x0AmmWp6kVd+J5d32sk+Jt0F2g3BPlFWDnEU3on2mqPx3s9T9TdULLKcqjEomothVl1uB4MQhSes23crOsPoECzx8/EZ+Jk6NxgRjFgvTGVZHqfKfSR6lU7f+gk30AXBgFR5hlWAB/IuW1lP+ifgMTKMkz3G24DdWPW4DOOhZVR7YeDX0tQ/XcE2EAhoUzOnLKdbzR9qnSEx+pHi8NxUj7wMs7t4oJ+YqIlCsfJXiDD8bY6vpchYML47oVsEUFiOe+d70rJG0ytQHmMFTg9x7t8cc6PfjVO8Xp9D02sq7GG9eA33bmAOxGdkNI5WPLG7k4UdVsI006PnuPj1vljYwCRdVhZJBC1kKTFt/U4FxYn5oCN7dTpvBBjcb3phcShqtW1GPcA6T96cHZsHndQhRwnfXWT48ASIxl0leO+zCWSZLpVFlFAIqdOUvXreqd2U47Il0PluA9p0JHWd9aClEDgFZUkAfOQLUoKfkclMz5azfYdSlKtnxcBx11F2Yrt1uPSvdhrczftSJd4Xk5eI3V0I6pHq2dU6o4CZ9VxtdO7EhdkyRvXwPUUTj6yUbGRbzkgTeVZ7EJgkDxUToeJE8Z/IZWuaOeGABQH5zXEOBH21GKazu9rL9U5B8fQQv3ZVyY7JjWmDN+TkKmvdjjs6rAM1Yhw9sxIPQ5gwjVQNdR6DqTxgFqaT/VlXjmC+z9kuq+2AVXuslGO0UYgLRxw+SN4el8D2eAFmaTEjDmz9JsCb+/o6U3yN0eO3+YvtYKHdP2shmolpdTBtGBz3ysjTxGhepG3Y0GnZt1grMJotLRx9vlccaYxDAkEpkG2zaQidHrf86FQ3DyLYQcGufUG6w2l1X7GJkejiyZhrlhXfW6 boUSsr15 qkowoe9AH0NiJF5Kn9FjNOmD8uQ966Lt3kcCQnTqpHr0rDSeQ5YLCdxbdDdCOy66FuENu90ybB7voj7/U4A1kZR/mgwPXxA8dkdyP8GmJ28lqQt1xEJU5C3kK/VS6/FERq28VmwxIZ2Eoz71XQAq4RFeAv5ouGeooFEcjQoE51jICQGODGgW9jR+lflD5eHPOKiCFeMssCaDYKuZq1wxKxKD+2Fq5Ul8hkFnt2nXAm+c6RZ5xGG0TreTuY/kQ1w9NF+9O942W3dJ8QKDkU9Q7nUDVQqlmUqaP/GoqrJnGvNung0IS2WBT5+QGJIWrMV4hkhZxt5739XvRlXvOVjRZtfOOdeFFJbsQFDn2QG+oJf+VDARe7W4ztY5yd0qeD6xtzPnaGhBb+d9cLHPVl2iiPQlowIzuMDpoGxTt2oWg3drdx6Q= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, May 2, 2025 at 3:36=E2=80=AFAM Gutierrez Asier wrote: > > > On 4/30/2025 8:53 PM, Zi Yan wrote: > > On 30 Apr 2025, at 13:45, Johannes Weiner wrote: > > > >> On Thu, May 01, 2025 at 12:06:31AM +0800, Yafang Shao wrote: > >>>>>> If it isn't, can you state why? > >>>>>> > >>>>>> The main difference is that you are saying it's in a container tha= t you > >>>>>> don't control. Your plan is to violate the control the internal > >>>>>> applications have over THP because you know better. I'm not sure = how > >>>>>> people might feel about you messing with workloads, > >>>>> > >>>>> It=E2=80=99s not a mess. They have the option to deploy their servi= ces on > >>>>> dedicated servers, but they would need to pay more for that choice. > >>>>> This is a two-way decision. > >>>> > >>>> This implies you want a container-level way of controlling the setti= ng > >>>> and not a system service-level? > >>> > >>> Right. We want to control the THP per container. > >> > >> This does strike me as a reasonable usecase. > >> > >> I think there is consensus that in the long-term we want this stuff to > >> just work and truly be transparent to userspace. > >> > >> In the short-to-medium term, however, there are still quite a few > >> caveats. thp=3Dalways can significantly increase the memory footprint = of > >> sparse virtual regions. 
> >> sparse virtual regions. Huge allocations are not as cheap and
> >> reliable as we would like them to be, which for real production
> >> systems means having to make workload-specific choices and tradeoffs.
> >>
> >> There is ongoing work in these areas, but we do have a bit of a
> >> chicken-and-egg problem: on the one hand, huge page adoption is slow
> >> due to limitations in how they can be deployed. For example, we can't
> >> do thp=always on a DC node that runs arbitrary combinations of jobs
> >> from a wide array of services. Some might benefit, some might hurt.
> >>
> >> Yet, it's much easier to improve the kernel based on exactly such
> >> production experience and data from real-world usecases. We can't
> >> improve the THP shrinker if we can't run THP.
> >>
> >> So I don't see it as overriding whoever wrote the software running
> >> inside the container. They don't know, and they shouldn't have to care
> >> about page sizes. It's about letting admins and kernel teams get
> >> started on using and experimenting with this stuff, given the very
> >> real constraints right now, so we can get the feedback necessary to
> >> improve the situation.
> >
> > Since you think it is reasonable to control THP at container level,
> > namely per-cgroup, should we reconsider cgroup-based THP control[1]?
> > (Asier cc'd)
> >
> > In this patchset, Yafang uses BPF to adjust THP global configs based
> > on VMA, which does not look like a good approach to me. WDYT?
> >
> >
> > [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
> >
> > --
> > Best Regards,
> > Yan, Zi

> Hi,
>
> I believe cgroup is a better approach for containers, since this
> approach can be easily integrated with the user-space stack, like
> containerd and Kubernetes, which use cgroups to control system
> resources.

The integration of BPF with containerd and Kubernetes is emerging as a
clear trend.

> However, as I pointed out earlier, the approach I suggested has some
> flaws:
> 1. Potential pollution of cgroup with a big number of knobs

Right, the memcg maintainers once told me that introducing a new cgroup
file means committing to maintaining it indefinitely, as these interface
files are treated as part of the ABI. In contrast, BPF kfuncs are
considered an unstable API, giving you the flexibility to modify them
later if needed.
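(As a point of comparison, the stable uAPI today offers THP control only
per process or per mapping, which is what makes the per-container gap
visible. Below is a minimal sketch using just the standard prctl(2) and
madvise(2) interfaces; nothing in it is from this patchset, and the two
calls are combined purely for illustration:)

/*
 * Today's stable THP controls below the global sysfs knob:
 * PR_SET_THP_DISABLE disables THP for the whole process (inherited
 * across fork() and preserved across execve()), while madvise()
 * steers an individual mapping.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
	/* Per-process: opt this process (and its children) out of THP. */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		perror("PR_SET_THP_DISABLE");

	/* Per-VMA: request (or refuse) THP for one mapping only. */
	size_t len = 4UL << 20;	/* 4 MiB */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	if (madvise(buf, len, MADV_HUGEPAGE))	/* or MADV_NOHUGEPAGE */
		perror("MADV_HUGEPAGE");
	return 0;
}

Both paths require cooperation from the code running inside the
container, which is exactly the gap the per-cgroup and BPF approaches
discussed here aim to close.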
> 2. Requires configuration by the admin
>
> Ideally, as Matthew W. mentioned, there should be an automatic system.

Take Matthew's XFS large folio feature as an example: it was enabled
automatically. A few years ago, when we upgraded to the 6.1.y stable
kernel, we noticed this new feature. Since it was enabled by default,
we assumed the author was confident in its stability. Unfortunately, it
led to severe issues in our production environment: servers crashed
randomly, and in some cases, we experienced data loss without
understanding the root cause. We began disabling various kernel
configurations in an attempt to isolate the issue, and eventually the
problem disappeared after disabling CONFIG_TRANSPARENT_HUGEPAGE. As a
result, we released a new kernel version with THP disabled and had to
restart hundreds of thousands of production servers. It was a nightmare
for both us and our sysadmins.

Last year, we discovered that the initial issue had been resolved by this patch:
https://lore.kernel.org/stable/20241001210625.95825-1-ryncsn@gmail.com/.
We backported the fix and re-enabled XFS large folios, only to face a
new nightmare. One of our services began crashing sporadically with
core dumps. It took us several months to trace the issue back to the
re-enabled XFS large folio feature. Fortunately, we were able to
disable it using livepatch, avoiding another round of mass server
restarts. To this day, the root cause remains unknown. The good news is
that the issue appears to be resolved in the 6.12.y stable kernel.
We're still trying to bisect which commit fixed it, though progress is
slow because the issue is not reliably reproducible.

In theory, new features should be enabled automatically. But in
practice, every new feature should come with a tunable knob. That's a
lesson we learned the hard way from this experience, and perhaps
Matthew did too.

> Anyway, regarding containers, I believe cgroup is a good approach
> given that the admin or the container management system uses cgroups
> to set up the containers.
>
> --
> Asier Gutierrez
> Huawei

--
Regards
Yafang