From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <98c0694d-626c-498d-898b-f65ec4549d71@linux.dev>
Date: Thu, 30 Apr 2026 11:38:45 +0100
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Subject: Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs
To: Kairui Song
Cc: Andrew Morton, david@kernel.org, chrisl@kernel.org, ljs@kernel.org,
 ziy@nvidia.com, bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
 hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr,
 kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
 baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com,
 ryan.roberts@arm.com, Vlastimil Babka, lance.yang@linux.dev,
 linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com,
 kernel-team@meta.com
References: <20260427100553.2754667-1-usama.arif@linux.dev>
From: Usama Arif
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 29/04/2026 11:44, Kairui Song wrote:
> On Mon, Apr 27, 2026 at 6:09 PM Usama Arif wrote:
>>
>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>> unmap.
>>
>> This series introduces a PMD-level swap entry. The huge mapping is
>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>> resolves the entire 2 MB region in a single fault on swap-in,
>
> Hi Usama,
>
> Thanks for the work!
>
>> no khugepaged involvement is needed.
>> swap_map metadata is identical
>
> swap_map is gone, metadata is still per slot but with PMD sized
> swapout, I think soon we can store a swp_tb entry directly in
> ci->table (make it a union maybe) so the metadata is significantly
> reduced from there too. Better do that later with cluster compaction.
>
>> Core patches:
>>   5. PMD swap entry detection (pmd_is_swap_entry,
>>      softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
>>      helpers (x86/arm64/s390/riscv/loongarch).
>>   6. __split_huge_pmd_locked() learns to split a PMD swap entry
>>      into 512 PTE swap entries, used as the fallback when a
>>      PMD-order resource is unavailable.
>>   7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
>>      in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
>>      copy_pte_range().
>>   8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
>>      the PMD; falls back to PTE-split + unuse_pte_range() on error.
>
> There is a slight conflict with the swap folio allocation unification,
> which should be easy to solve. Just a little head up, check the
> swap_cache_alloc_folio helper here:
> https://lore.kernel.org/linux-mm/20260421-swap-table-p4-v3-4-2f23759a76bc@tencent.com/
>
> We will be able to directly allocate 2M folios using
> swap_cache_alloc_folio(orders = BIT(PMD_ORDER)) in the patch link
> above. Might even help to avoid issues with splitting or raced swapin?

Oh yeah, I like your swap_cache_alloc_folio a lot more than
swapin_alloc_pmd_folio.

> The conflict can be solved from either side, I'll update that series to
> disable the forced order 0 fallback and let caller pass in (orders =
> | BIT(0)) instead.

Yes, that would be great. We don't want an order-0 fallback in the two
cases where we fail in this series.

Thanks!
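
To make the order-mask point concrete, here is a rough userspace sketch.
It is only an illustration and not kernel code: pick_order() and its
"avail" argument are hypothetical, and PMD_ORDER = 9 assumes 4 KiB base
pages with a 2 MiB PMD. Only the names BIT() and PMD_ORDER mirror what is
quoted above. It shows why a caller that passes BIT(PMD_ORDER) alone gets
a failure rather than a silent order-0 folio, while a caller that also
sets BIT(0) opts into the fallback.

```c
#include <assert.h>

/* Illustrative values only: 4 KiB base pages and a 2 MiB PMD. */
#define BIT(n)     (1UL << (n))
#define PMD_ORDER  9

/*
 * Hypothetical stand-in for an allocator that takes an order bitmask.
 * 'orders' is what the caller is willing to accept; 'avail' is the set
 * of orders that could be allocated right now.
 * Returns the highest order present in both masks, or -1 if none match.
 */
static int pick_order(unsigned long orders, unsigned long avail)
{
	unsigned long usable = orders & avail;
	int order;

	assert(orders != 0);	/* caller must request at least one order */

	/* Walk down from PMD order so the largest usable order wins. */
	for (order = PMD_ORDER; order >= 0; order--)
		if (usable & BIT(order))
			return order;

	return -1;		/* nothing acceptable: caller handles the failure */
}
```

With only order 0 available, pick_order(BIT(PMD_ORDER), BIT(0)) returns
-1, so the caller sees the failure and can split or retry, whereas
pick_order(BIT(PMD_ORDER) | BIT(0), BIT(0)) returns 0.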