From: Ritesh Harjani (IBM)
To: "David Hildenbrand (Arm)", Usama Arif, Andrew Morton,
 lorenzo.stoakes@oracle.com, willy@infradead.org, linux-mm@kvack.org
Cc: fvdl@google.com, hannes@cmpxchg.org, riel@surriel.com,
 shakeel.butt@linux.dev, kas@kernel.org, baohua@kernel.org,
 dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com,
 Liam.Howlett@oracle.com, ryan.roberts@arm.com, vbabka@suse.cz,
 lance.yang@linux.dev, linux-kernel@vger.kernel.org, kernel-team@meta.com,
 Madhavan Srinivasan, Michael Ellerman, linuxppc-dev@lists.ozlabs.org
Subject: Re: [RFC 1/2] mm: thp: allocate PTE page tables lazily at split time
In-Reply-To: <13ab56cb-7fdb-4ee4-9170-f9f4fa4b6e37@kernel.org>
Date: Thu, 12 Feb 2026 17:43:33 +0530
Message-ID: <875x82ma6q.ritesh.list@gmail.com>
References: <20260211125507.4175026-1-usama.arif@linux.dev>
 <20260211125507.4175026-2-usama.arif@linux.dev>
 <13ab56cb-7fdb-4ee4-9170-f9f4fa4b6e37@kernel.org>

"David Hildenbrand (Arm)" writes:

> CCing ppc folks
>

Thanks David!

> On 2/11/26 13:49, Usama Arif wrote:
>> When the kernel creates a PMD-level THP mapping for anonymous pages,
>> it pre-allocates a PTE page table and deposits it via
>> pgtable_trans_huge_deposit(). This deposited table is withdrawn during
>> PMD split or zap. The rationale was that split must not fail -- if the
>> kernel decides to split a THP, it needs a PTE table to populate.
>>
>> However, every anon THP wastes 4KB (one page table page) that sits
>> unused in the deposit list for the lifetime of the mapping. On systems
>> with many THPs, this adds up to significant memory waste. The original
>> rationale is also not an issue. It is ok for split to fail, and if the
>> kernel can't find an order 0 allocation for split, there are much bigger
>> problems. On large servers where you can easily have 100s of GBs of THPs,
>> the memory usage for these tables is 200M per 100G. This memory could be
>> used for any other usecase, which include allocating the pagetables
>> required during split.
>>
>> This patch removes the pre-deposit for anonymous pages on architectures
>> where arch_needs_pgtable_deposit() returns false (every arch apart from
>> powerpc, and only when radix hash tables are not enabled) and allocates
>> the PTE table lazily -- only when a split actually occurs. The split path
>> is modified to accept a caller-provided page table.
>>
>> PowerPC exception:
>>
>> It would have been great if we can completely remove the pagetable
>> deposit code and this commit would mostly have been a code cleanup patch,
>> unfortunately PowerPC has hash MMU, it stores hash slot information in
>> the deposited page table and pre-deposit is necessary. All deposit/
>> withdraw paths are guarded by arch_needs_pgtable_deposit(), so PowerPC
>> behavior is unchanged with this patch. On a better note,
>> arch_needs_pgtable_deposit will always evaluate to false at compile time
>> on non PowerPC architectures and the pre-deposit code will not be
>> compiled in.
>
> Is there a way to remove this? It's always been a confusing hack, now
> it's unpleasant to have around :)
>

Hash MMU on PowerPC works fundamentally differently from other MMUs
(unlike Radix MMU on PowerPC). So yes, it requires a few tricks to fit
into Linux's multi-level SW page table model. ;)

> In particular, seeing that radix__pgtable_trans_huge_deposit() just 1:1
> copied generic pgtable_trans_huge_deposit() hurts my belly.
>

On PowerPC, pgtable_t can be a pte fragment.

typedef pte_t *pgtable_t;

That means a single page can be shared among several PTE page tables.
So we cannot use page->lru, which the generic implementation relies on.
I guess that is why the implementation of
radix__pgtable_trans_huge_deposit() differs slightly.

Doing a grep search, I think that's the same for sparc and s390 as well.

>
> IIUC, hash is mostly used on legacy power systems, radix on newer ones.
>
> So one obvious solution: remove PMD THP support for hash MMUs along with
> all this hacky deposit code.
>

Unfortunately, please no. There are real customers using Hash MMU on
Power9 and even on older generations, and this would mean breaking Hash
PMD THP support for them.

>
> the "vma_is_anonymous(vma) && !arch_needs_pgtable_deposit()" and similar
> checks need to be wrapped in a reasonable helper and likely this all
> needs to get cleaned up further.
>
> The implementation if the generic pgtable_trans_huge_deposit and the
> radix handlers etc must be removed. If any code would trigger them it
> would be a bug.
>

Sure, I think after this patch series,
radix__pgtable_trans_huge_deposit() will mostly be dead code anyway.

I will spend some time going through this series and will also test it
on powerpc HW (with both Hash and Radix MMU).

I guess we should also look at removing the
pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw()
implementations from s390 and sparc, since those too will be dead code
after this.

> If we have to keep this around, pgtable_trans_huge_deposit() should
> likely get renamed to arch_pgtable_trans_huge_deposit() etc, as there
> will not be generic support for it.
>

Sure. That makes sense, since PowerPC Hash MMU will still need this.

-ritesh