From: Usama Arif
To: Rik van Riel
Cc: Usama Arif, linux-kernel@vger.kernel.org, kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org, willy@infradead.org, surenb@google.com, hannes@cmpxchg.org, ljs@kernel.org, ziy@nvidia.com, fvdl@google.com
Subject: Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation
Date: Fri, 22 May 2026 04:02:55 -0700
Message-ID: <20260522110257.1640781-1-usama.arif@linux.dev>
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>

On Wed, 20 May 2026 10:59:06 -0400 Rik van Riel wrote:

>
> Some workloads see real performance benefits from using 1GB pages,
> but allocating 1GB pages has often been limited to hugetlb pages
> that were set aside at boot time, or using CMA to keep a fixed
> amount of system memory off limits to the kernel.
>
> Neither of those are great solutions, given that modern servers
> tend to be large, often run multiple workloads simultaneously,
> and each workload wants something else.
>
> To address that issue, this patch series divides memory not just
> into 2MB page blocks, but into PUD sized superpageblocks, and
> aggressively tries to steer unmovable, reclaimable, and highatomic
> allocations into those superpageblocks that have already been
> "tainted" by such allocations.
>
> The goal is to leave as many 1GB superpageblocks as possible
> used by only movable allocations, so they can be easily
> defragmented for either regular PMD sized huge pages, or
> for PUD sized huge pages.
>
> Various strategies are used to accomplish this goal:
> - unmovable and reclaimable allocations are preferentially
>   done from 1GB blocks that have already been "tainted" by
>   these allocations
> - kernel allocations that can be done as one higher order
>   allocation, or a number of smaller allocations (eg. kvmalloc)
>   will fall back to small pages, rather than taint a new
>   1GB block

Hi Rik!

These comments are based only on the cover letter. I hope to get to
reviewing all the patches.

The point above about kernel allocations falling back to small pages
is interesting.
- Will it have a performance impact, since kernel allocations won't
  benefit from higher-order allocations?
- Will it hurt 2M THP allocation efficiency due to more fragmentation
  of kernel memory?

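To make the question concrete, here is a toy user-space model in Python of how I read that fallback rule. It is only a sketch of my understanding, not code from the series: `Block`, `alloc_kernel`, and the three-step order are all hypothetical names and assumptions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One 1GB block (hypothetical bookkeeping, not from the series)."""
    tainted: bool        # already holds unmovable/reclaimable pages
    free_pages: int      # free 4KB base pages in this block
    max_free_order: int  # largest contiguous free run, as a buddy order


def alloc_kernel(blocks, order):
    """Decide how a kvmalloc-style request for 2**order pages is served.

    Returns (how, block_indexes). This only models the decision; it does
    not update the blocks.
    """
    need = 1 << order

    # 1. Prefer one contiguous run inside an already tainted block.
    for i, b in enumerate(blocks):
        if b.tainted and b.max_free_order >= order:
            return ("contiguous", [i])

    # 2. Otherwise fall back to order-0 pages, still from tainted blocks.
    used, remaining = [], need
    for i, b in enumerate(blocks):
        if b.tainted and b.free_pages > 0:
            used.append(i)
            remaining -= min(b.free_pages, remaining)
            if remaining == 0:
                return ("small-pages", used)

    # 3. Only as a last resort, taint a clean block.
    for i, b in enumerate(blocks):
        if not b.tainted and b.max_free_order >= order:
            return ("taint-new-block", [i])

    return ("fail", [])
```

Step 2 is the path both questions are about: the caller gets its memory, but as scattered base pages instead of one higher-order allocation.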
> - movable allocations are preferentially done from clean 1GB
>   blocks, which have only free and movable memory inside,
>   starting with the fullest of these 1GB blocks
> - 2MB allocations follow the same strategy
> - 1GB allocations start with the emptiest clean 1GB block
> - if a 1GB block is mixed, with some movable pageblocks,
>   some free pageblocks, and some unmovable/reclaimable pageblocks,
>   the system has a free threshold below which only unmovable and
>   reclaimable allocations can be done from that 1GB block
> - below that threshold, no new movable allocations are allowed
>   in that 1GB block, while new unmovable/reclaimable allocations
>   are still allowed

By "allowed", do you mean that if movable allocations fail, it will
result in an OOM?

> - when a 1GB block is below that threshold, use the migration
>   code to evacuate enough movable memory from the 1GB block
>   to bring free memory in that 1GB block back to the threshold
>
> These strategies together serve to concentrate unmovable and
> reclaimable allocations in as few 1GB blocks as possible,
> leaving as many 1GB blocks as possible available for movable
> allocations.
>
> That enables both more extensive use of 2MB THPs and mTHPs,
> as well as reliable allocation of 1GB pages.
>
> The above strategies also make the core page allocator
> more complicated, and slower. In order to avoid that issue,
> the series is built on top of Johannes's PCPBuddy series,
> which has the goal of reducing how often CPUs need to get
> pages from the zone free lists, instead relying on CPUs
> giving back pages to each other, based on page block ownership.
>
> TODO:
> - compaction "always" succeeds, with a success rate of 99.96% seen
>   in traces; this sounds great, but it also results in compaction
>   never being throttled, and compaction blowing out everybody's
>   PCP through lru_add_drain() calls. This needs some sort of solution.
> - replace the superpageblock name with something Matthew and David
>   both like
> - find more corner cases, and fix them
>
> Based on e1914add2799
>
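
To check my reading of the placement rules in the cover letter, here is a toy user-space model in Python. It is a sketch under assumptions, not the series' implementation: `Superblock`, `pick_block`, `pageblocks_to_evacuate`, counting in 2MB pageblocks, and the threshold value used in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Superblock:
    """One 1GB block, counted in 2MB pageblocks (512 per block)."""
    free: int       # free pageblocks
    movable: int    # pageblocks holding movable memory
    unmovable: int  # pageblocks holding unmovable/reclaimable memory

    @property
    def clean(self):
        # "Clean" means only free and movable memory inside.
        return self.unmovable == 0


def pick_block(blocks, kind, threshold):
    """Return the index of the 1GB block to allocate from, or None."""
    with_free = [i for i, b in enumerate(blocks) if b.free > 0]
    clean = [i for i in with_free if blocks[i].clean]
    tainted = [i for i in with_free if not blocks[i].clean]

    if kind == "unmovable":
        # Prefer already tainted blocks; taint a clean one only if needed.
        candidates = tainted or clean
        return candidates[0] if candidates else None

    if kind == "1gb":
        # 1GB allocations start with the emptiest clean block.
        return max(clean, key=lambda i: blocks[i].free, default=None)

    # Movable and 2MB allocations: fullest clean block first.
    if clean:
        return min(clean, key=lambda i: blocks[i].free)
    # Mixed blocks accept movable allocations only at or above the threshold.
    allowed = [i for i in tainted if blocks[i].free >= threshold]
    return allowed[0] if allowed else None


def pageblocks_to_evacuate(block, threshold):
    """Movable pageblocks to migrate out to restore the free threshold."""
    if block.clean or block.free >= threshold:
        return 0
    return min(threshold - block.free, block.movable)
```

With a made-up threshold of 32 pageblocks, a mixed block with 10 free pageblocks returns `None` for a movable request when no clean block has free space. That `None` is the case my OOM question is about: I could not tell from the cover letter whether such a request waits for evacuation, goes into reclaim, or fails.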