From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: 21cnbao@gmail.com, baolin.wang@linux.alibaba.com, chrisl@kernel.org,
    david@redhat.com, ioworker0@gmail.com, kasong@tencent.com,
    linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
    linux-riscv@lists.infradead.org, lorenzo.stoakes@oracle.com,
    ryan.roberts@arm.com, v-songbaohua@oppo.com, x86@kernel.org,
    ying.huang@intel.com, zhengtangquan@oppo.com
Subject: [PATCH v3 0/4] mm: batched unmap lazyfree large folios during reclamation
Date: Wed, 15 Jan 2025 16:38:04 +1300
Message-Id: <20250115033808.40641-1-21cnbao@gmail.com>

From: Barry Song

Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfree THP during
shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in
madvise.c. However, those folios are still added to the deferred_split
list in try_to_unmap_one() because we are unmapping PTEs and removing
rmap entries one by one.

Firstly, this has rendered the following counter somewhat confusing:

/sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats/split_deferred

The split_deferred counter was originally designed to track operations
such as partial unmap or madvise of large folios. However, in practice,
most split_deferred cases arise from memory reclamation of aligned
lazyfree mTHPs, as observed by Tangquan. This discrepancy has made the
split_deferred counter highly misleading.

Secondly, this approach is slow because it requires iterating through
each PTE and removing the rmap one by one for a large folio. In fact,
all PTEs of a PTE-mapped large folio should be unmapped at once, and
the entire folio should be removed from the rmap as a whole.

Thirdly, it also increases the risk of a race condition in which
lazyfree folios are incorrectly set back to swapbacked, since a
speculative folio_get may occur in the shrinker's callback.
deferred_split_scan() might call folio_try_get(folio) because we have
added the folio to the deferred_split list while removing the rmap for
the 1st subpage; while we are scanning the 2nd to nr_pages PTEs of this
folio in try_to_unmap_one(), the entire mTHP could be transitioned back
to swap-backed because the reference count is elevated, which makes the
"ref_count == 1 + map_count" check in try_to_unmap_one() false:

	/*
	 * The only page refs must be one from isolation
	 * plus the rmap(s) (dropped by discard:).
	 */
	if (ref_count == 1 + map_count &&
	    (!folio_test_dirty(folio) ||
	     ...
	     (vma->vm_flags & VM_DROPPABLE))) {
		dec_mm_counter(mm, MM_ANONPAGES);
		goto discard;
	}

This patchset resolves the issue by marking only genuinely dirty folios
as swap-backed, as suggested by David, and by transitioning to batched
unmapping of entire folios in try_to_unmap_one(). Consequently, the
deferred_split count drops to zero, and memory reclamation performance
improves significantly: reclaiming 64KiB lazyfree large folios is now
2.5x faster (the specific data is in the changelog of patch 3/4).

By the way, while the patchset is primarily aimed at PTE-mapped large
folios, Baolin and Lance also found that try_to_unmap_one() handles
lazyfree redirtied PMD-mapped large folios inefficiently: it splits the
PMD into PTEs and iterates over them. This patchset removes the
unnecessary splitting, enabling us to skip redirtied PMD-mapped large
folios 3.5x faster during memory reclamation (the specific data can be
found in the changelog of patch 4/4).
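To make the first change concrete, here is a rough sketch of the
patch 1/4 idea. This is deliberately simplified and is not the literal
diff; the surrounding control flow in try_to_unmap_one() is more
involved:

	/* sketch only: simplified lazyfree handling, not the exact code */
	if (ref_count == 1 + map_count && !folio_test_dirty(folio)) {
		/* no re-dirty and no extra reference: safe to drop it */
		dec_mm_counter(mm, MM_ANONPAGES);
		goto discard;
	}
	/*
	 * Only a genuinely dirty folio is flipped back to swap-backed.
	 * A transient folio_try_get() from deferred_split_scan() can
	 * still make the ref_count check above fail, but it can no
	 * longer turn a clean lazyfree folio into a swap-backed one.
	 */
	if (folio_test_dirty(folio))
		folio_set_swapbacked(folio);
	goto walk_abort;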
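Likewise, a rough sketch of the batched unmap in patch 3/4, assuming
the large folio is fully and contiguously PTE-mapped and that
set_tlb_ubc_flush_pending() takes the start/end range added by
patch 2/4; treat this as an illustration rather than the exact code:

	unsigned int nr = folio_nr_pages(folio);

	/* clear all nr PTEs of the folio in one call instead of looping */
	pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr, 0);

	/* queue a single deferred TLB flush covering the whole range */
	set_tlb_ubc_flush_pending(mm, pteval, address,
				  address + nr * PAGE_SIZE);

	/* drop the rmap entries for all nr subpages in one call */
	folio_remove_rmap_ptes(folio, &folio->page, nr, vma);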
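And for patch 4/4, the PMD-mapped case boils down to checking the
dirty bit on the PMD itself before doing anything expensive; again a
simplified sketch, with the real logic living in the huge-PMD unmap
path:

	/*
	 * If the lazyfree THP was redirtied through its PMD mapping,
	 * it must be kept: mark it swap-backed and bail out without
	 * splitting the PMD into nr_pages PTEs.
	 */
	if (pmd_dirty(pmdval))
		folio_set_swapbacked(folio);

	if (folio_test_swapbacked(folio))
		return false;	/* caller skips the folio; no PMD split */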
-v3:
 * collect the Reviewed-by and Acked-by tags of Baolin, David, Lance
   and Will, thanks!
 * refine the pmd-mapped THP lazyfree code per Baolin and Lance;
 * refine the tlbbatch deferred flushing range support code per David.
-v2: https://lore.kernel.org/linux-mm/20250113033901.68951-1-21cnbao@gmail.com/
 * describe the background and problems more clearly in the cover
   letter, per Lorenzo Stoakes;
 * also handle redirtied pmd-mapped large folios, per Baolin and Lance;
 * handle some corner cases such as HWPOISON and pte_unused;
 * fix riscv and x86 build issues.
-v1: https://lore.kernel.org/linux-mm/20250106031711.82855-1-21cnbao@gmail.com/

Barry Song (4):
  mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  mm: Support tlbbatch flush for a range of PTEs
  mm: Support batched unmap for lazyfree large folios during reclamation
  mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap

 arch/arm64/include/asm/tlbflush.h |  25 +++----
 arch/arm64/mm/contpte.c           |   2 +-
 arch/riscv/include/asm/tlbflush.h |   5 +-
 arch/riscv/mm/tlbflush.c          |   5 +-
 arch/x86/include/asm/tlbflush.h   |   5 +-
 mm/huge_memory.c                  |  24 +++++--
 mm/rmap.c                         | 115 ++++++++++++++++++++----------
 7 files changed, 117 insertions(+), 64 deletions(-)

-- 
2.39.3 (Apple Git-146)