Date: Fri, 22 Mar 2024 12:45:47 -0700
To:
mm-commits@vger.kernel.org,yuzhao@google.com,ying.huang@intel.com,xiang@kernel.org,willy@infradead.org,wangkefeng.wang@huawei.com,shy828301@gmail.com,ryan.roberts@arm.com,mhocko@suse.com,hughd@google.com,hanchuanhua@oppo.com,david@redhat.com,chrisl@kernel.org,v-songbaohua@oppo.com,akpm@linux-foundation.org
From: Andrew Morton
Subject: + mm-hold-ptl-from-the-first-pte-while-reclaiming-a-large-folio.patch added to mm-unstable branch
Message-Id: <20240322194548.0ACFAC43390@smtp.kernel.org>
Precedence: bulk
X-Mailing-List: mm-commits@vger.kernel.org

The patch titled
     Subject: mm: hold PTL from the first PTE while reclaiming a large folio
has been added to the -mm mm-unstable branch.  Its filename is
     mm-hold-ptl-from-the-first-pte-while-reclaiming-a-large-folio.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-hold-ptl-from-the-first-pte-while-reclaiming-a-large-folio.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Barry Song
Subject: mm: hold PTL from the first PTE while reclaiming a large folio
Date: Wed, 6 Mar 2024 22:52:19 +1300

Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE
modifications preceded by pte clear.
While iterating over the PTEs of a large folio, page_vma_mapped_walk()
only starts acquiring the PTL at the first valid (present) PTE.  PTE
modifications can temporarily set PTEs to pte_none.  Consequently, the
initial PTEs of a large folio might be skipped in try_to_unmap_one().

For example, for an anon folio, if we skip PTE0, PTE0 may still be
present while PTE1 ~ PTE(nr_pages - 1) are swap entries after
try_to_unmap_one().  The folio is then still mapped, so it fails to be
reclaimed and is put back on the LRU in this round.  This also breaks
PTE optimizations such as CONT-PTE on this large folio and may lead to an
accidental folio_split() afterwards.  And since some of the PTEs are now
swap entries, accessing those parts introduces the overhead of
do_swap_page().

Although the kernel can withstand all of the above issues, the situation
is still awkward and worth improving.

The same race also occurs with small folios, but they have only one PTE,
so they cannot be partially unmapped.

This patch holds the PTL from PTE0, allowing us to avoid reading PTE
values that are in the process of being transformed.  With stable PTE
values, we can ensure that this large folio is either completely
reclaimed or that all PTEs remain untouched in this round.

A corner case is that if we hold the PTL from PTE0 and most of the
initial PTEs were already unmapped before that, we may increase the
duration of holding the PTL.  Thus we only apply this optimization to
folios which are still entirely mapped (not on the deferred_split list).

Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com
Signed-off-by: Barry Song
Cc: Hugh Dickins
Cc: Chris Li
Cc: Chuanhua Han
Cc: David Hildenbrand
Cc: Gao Xiang
Cc: Huang, Ying
Cc: Hugh Dickins
Cc: Kefeng Wang
Cc: Matthew Wilcox (Oracle)
Cc: Michal Hocko
Cc: Ryan Roberts
Cc: Yang Shi
Cc: Yu Zhao
Signed-off-by: Andrew Morton
---

 mm/vmscan.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

--- a/mm/vmscan.c~mm-hold-ptl-from-the-first-pte-while-reclaiming-a-large-folio
+++ a/mm/vmscan.c
@@ -1257,6 +1257,18 @@ retry:
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
+			/*
+			 * Without TTU_SYNC, try_to_unmap will only begin to hold PTL
+			 * from the first present PTE within a large folio. Some initial
+			 * PTEs might be skipped due to races with parallel PTE writes
+			 * in which PTEs can be cleared temporarily before being written
+			 * new present values. This will lead to a large folio is still
+			 * mapped while some subpages have been partially unmapped after
+			 * try_to_unmap; TTU_SYNC helps try_to_unmap acquire PTL from the
+			 * first PTE, eliminating the influence of temporary PTE values.
+			 */
+			if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
+				flags |= TTU_SYNC;
 			try_to_unmap(folio, flags);
 			if (folio_mapped(folio)) {
_

Patches currently in -mm which might be from v-songbaohua@oppo.com are

mm-zswap-fix-kernel-bug-in-sg_init_one.patch
arm64-mm-swap-support-thp_swap-on-hardware-with-mte.patch
mm-hold-ptl-from-the-first-pte-while-reclaiming-a-large-folio.patch
documentation-coding-style-ask-function-like-macros-to-evaluate-parameters.patch
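
For readers who want to see the partial-unmap outcome from the changelog
in isolation, the following is a minimal userspace sketch, not kernel
code.  Everything in it is hypothetical and exists only for illustration:
the pte_state enum, NR_PTES, the toy_unmap() function and its
transient_none parameter.  It assumes a single concurrent writer that has
momentarily cleared one PTE and restores it to present afterwards, and it
models the PTL as a simple flag rather than a real lock.

```c
#include <stdbool.h>

#define NR_PTES 4	/* hypothetical folio size in PTEs */

/* Hypothetical stand-ins for the states a PTE can be in. */
enum pte_state { PTE_NONE, PTE_PRESENT, PTE_SWAP };

/*
 * Toy unmap pass over the PTEs of one large folio.
 *
 * transient_none: index of a PTE that a concurrent writer has cleared for
 *                 a moment and will rewrite as present, or -1 for none.
 * sync:           true models the TTU_SYNC behaviour (lock taken before
 *                 PTE0 is examined); false models taking the lock only at
 *                 the first PTE that looks present.
 *
 * Returns the number of PTEs still present after the pass.
 */
int toy_unmap(enum pte_state pte[NR_PTES], int transient_none, bool sync)
{
	bool locked = sync;
	int still_present = 0;

	for (int i = 0; i < NR_PTES; i++) {
		/*
		 * An unlocked reader can observe the transient pte_none
		 * value and skip the entry.  Once the lock is held, the
		 * writer cannot be mid-update, so the stable value in
		 * pte[i] is what the walker sees.
		 */
		if (!locked && i == transient_none)
			continue;

		locked = true;
		if (pte[i] == PTE_PRESENT)
			pte[i] = PTE_SWAP;	/* unmapped, now a swap entry */
	}

	for (int i = 0; i < NR_PTES; i++)
		if (pte[i] == PTE_PRESENT)
			still_present++;

	return still_present;
}
```

With all four entries initially PTE_PRESENT and transient_none set to 0,
calling toy_unmap() with sync false leaves PTE0 present and the rest as
swap entries, which is the partially unmapped state described above.
With sync true, the same input ends with no present entries.  The model
is deterministic by design and says nothing about how often the real
race is hit.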