From: Lance Yang
To: usama.arif@linux.dev
Cc: akpm@linux-foundation.org, david@kernel.org, chrisl@kernel.org,
	kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com,
	ying.huang@linux.alibaba.com, baoquan.he@linux.dev,
	willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org,
	riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr,
	kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	liam@infradead.org, ryan.roberts@arm.com, vbabka@kernel.org,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org,
	nphamcs@gmail.com, shikemeng@huaweicloud.com,
	kernel-team@meta.com, linux-mm@kvack.org
Subject: Re: [v2 15/16] mm: install PMD swap entries on swap-out
Date: Fri, 12 Jun 2026 22:21:24 +0800
Message-Id: <20260612142124.73367-1-lance.yang@linux.dev>
In-Reply-To: <20260602142537.198755-16-usama.arif@linux.dev>
References: <20260602142537.198755-16-usama.arif@linux.dev>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

+Cc linux-mm

On Tue, Jun 02, 2026 at 07:24:23AM -0700, Usama Arif wrote:
[...]
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index e8a90911bf88..0f376fbf9bb3 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -64,6 +64,7 @@
>
> #include
> #include
>+#include
>
> #include "internal.h"
> #include "swap.h"
>@@ -1332,7 +1333,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> 			enum ttu_flags flags = TTU_BATCH_FLUSH;
> 			bool was_swapbacked = folio_test_swapbacked(folio);
>
>-			if (folio_test_pmd_mappable(folio))
>+			/*
>+			 * With THP_SWAP, PMD-mappable folios already in the
>+			 * swap cache can be unmapped with a PMD-level swap
>+			 * entry, avoiding the cost of splitting the PMD.
>+			 * Skip this when zswap has been enabled because
>+			 * zswap stores pages individually and cannot
>+			 * reconstruct a large folio on swap-in.
>+			 */
>+			if (folio_test_pmd_mappable(folio) &&
>+			    !(IS_ENABLED(CONFIG_THP_SWAP) &&
>+			      folio_test_swapcache(folio) &&
>+			      zswap_never_enabled()))

There may be a race here ...

1) zswap_never_enabled() passes,
2) try_to_unmap() installs the PMD swap entry, and
3) zswap can still be enabled before the later pageout() ->
   swap_writeout() -> zswap_store().

zswap_store() loops over each page of the folio:

	for (index = 0; index < nr_pages; ++index) {
		struct page *page = folio_page(folio, index);

		if (!zswap_store_page(page, objcg, pool))
			goto put_pool;
	}

So there is still only one PMD swap entry, while zswap holds 512
entries, one for each page of the folio ...

If the swapcache is reclaimed later, a PMD fault will try PMD-order
swapin again:

do_huge_pmd_swap_page()
  swap_cache_get_folio()
  swapin_sync(..., BIT(HPAGE_PMD_ORDER))
    swap_read_folio()
      zswap_load()

zswap_load() rejects large folios with -EINVAL and leaves the folio not
uptodate:

	/*
	 * Large folios should not be swapped in while zswap is being used, as
	 * they are not properly handled. Zswap does not properly load large
	 * folios, and a large folio may only be partially in zswap.
	 */
	if (WARN_ON_ONCE(folio_test_large(folio))) {
		folio_unlock(folio);
		return -EINVAL;
	}

swap_read_folio() jumps to finish and does not try a normal swap read:

	if (zswap_load(folio) != -ENOENT)
		goto finish;

And the awkward part is that no error really gets propagated ...

swap_read_folio() is void, and swapin_sync() just hands the same folio
back to do_huge_pmd_swap_page(). At that point the folio is still
!uptodate, so the fault would just end up in:

	if (unlikely(!folio_test_uptodate(folio))) {
		ret = VM_FAULT_SIGBUS;
		goto out_page;
	}

It looks racy, but is it possible?

Cheers,
Lance

> 				flags |= TTU_SPLIT_HUGE_PMD;
> 			/*
> 			 * Without TTU_SYNC, try_to_unmap will only begin to
>diff --git a/mm/vmstat.c b/mm/vmstat.c
>index f534972f517d..9b4963a7eb04 100644
>--- a/mm/vmstat.c
>+++ b/mm/vmstat.c
>@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
> 	[I(THP_ZERO_PAGE_ALLOC_FAILED)] = "thp_zero_page_alloc_failed",
> 	[I(THP_SWPOUT)] = "thp_swpout",
> 	[I(THP_SWPOUT_FALLBACK)] = "thp_swpout_fallback",
>+	[I(THP_SWPOUT_PMD)] = "thp_swpout_pmd",
> #endif
> #ifdef CONFIG_BALLOON
> 	[I(BALLOON_INFLATE)] = "balloon_inflate",
>--
>2.52.0
>
>
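
The interleaving described in the review can be modelled in userspace to make the ordering explicit. This is only a minimal sketch, not kernel code: every name below (model_state, model_unmap(), model_writeout(), model_pmd_fault(), model_run()) is hypothetical, the 512 sub-page count assumes 4 KiB base pages under a 2 MiB PMD, and the fault path assumes that zswap rejects the large-folio load once it has been enabled, as the quoted zswap_load() snippet suggests.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define MODEL_NR_SUBPAGES 512	/* assumes 4 KiB base pages under a 2 MiB PMD */

enum model_fault { MODEL_FAULT_OK, MODEL_FAULT_SIGBUS };

struct model_state {
	bool zswap_ever_enabled;	/* stands in for !zswap_never_enabled() */
	bool pmd_swap_entry;		/* a single PMD-level swap entry is installed */
	int zswap_entries;		/* per-page entries stored by zswap */
	bool in_swapcache;		/* folio still present in the swap cache */
};

/* Steps 1 and 2: check the zswap state, then unmap without splitting. */
static void model_unmap(struct model_state *s)
{
	if (s->in_swapcache && !s->zswap_ever_enabled)
		s->pmd_swap_entry = true;
}

/* Step 3: writeout stores each page separately once zswap is enabled. */
static void model_writeout(struct model_state *s)
{
	if (s->zswap_ever_enabled)
		s->zswap_entries = MODEL_NR_SUBPAGES;
}

/* PMD fault taken after the swap cache copy has been reclaimed. */
static enum model_fault model_pmd_fault(const struct model_state *s)
{
	if (!s->pmd_swap_entry || s->in_swapcache)
		return MODEL_FAULT_OK;
	/* Assumption: the large-folio load is rejected, folio stays !uptodate. */
	if (s->zswap_ever_enabled)
		return MODEL_FAULT_SIGBUS;
	return MODEL_FAULT_OK;
}

/* Run the sequence, optionally enabling zswap between unmap and writeout. */
enum model_fault model_run(bool enable_zswap_in_window)
{
	struct model_state s = { .in_swapcache = true };

	model_unmap(&s);
	if (enable_zswap_in_window)
		s.zswap_ever_enabled = true;
	model_writeout(&s);
	s.in_swapcache = false;	/* swap cache reclaimed later */
	return model_pmd_fault(&s);
}
```

In this model, model_run(true) returns MODEL_FAULT_SIGBUS and model_run(false) returns MODEL_FAULT_OK. It says nothing about how wide the window is in practice, only that the check and the writeout are not ordered against zswap being enabled.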