From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: baolin.wang@linux.alibaba.com, chrisl@kernel.org, david@redhat.com,
	ioworker0@gmail.com, kasong@tencent.com,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	ryan.roberts@arm.com, v-songbaohua@oppo.com, x86@kernel.org,
	linux-riscv@lists.infradead.org, ying.huang@intel.com,
	zhengtangquan@oppo.com, lorenzo.stoakes@oracle.com
Subject: [PATCH v2 0/4] mm: batched unmap lazyfree large folios during reclamation
Date: Mon, 13 Jan 2025 16:38:57 +1300
Message-Id: <20250113033901.68951-1-21cnbao@gmail.com>

From: Barry Song

Commit 735ecdfaf4e8 ("mm/vmscan: avoid split lazyfree THP during
shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in
madvise.c. However, those folios are still added to the deferred_split
list in try_to_unmap_one(), because we are unmapping PTEs and removing
rmap entries one by one.

Firstly, this has rendered the following counter somewhat confusing:

  /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats/split_deferred

The split_deferred counter was originally designed to track operations
such as partial unmap or madvise of large folios. In practice, however,
most split_deferred events arise from memory reclamation of aligned
lazyfree mTHPs, as observed by Tangquan. This discrepancy has made the
split_deferred counter highly misleading.
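For reference, the per-size counters mentioned above can be inspected
from sysfs with something like the sketch below (the glob matches
nothing on kernels without per-size mTHP stats, in which case the loop
simply prints nothing):

```shell
# Dump every per-size split_deferred counter exposed by the kernel.
for f in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/split_deferred; do
    if [ -r "$f" ]; then
        printf '%s: %s\n' "$f" "$(cat "$f")"
    fi
done
```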
Secondly, this approach is slow because it requires iterating through
each PTE of a large folio and removing the rmap one by one. In fact,
all PTEs of a PTE-mapped large folio should be unmapped at once, and
the entire folio should be removed from the rmap as a whole.

Thirdly, it also increases the risk of a race in which lazyfree folios
are incorrectly set back to swapbacked, as a speculative folio_get may
occur in the shrinker's callback. deferred_split_scan() might call
folio_try_get(folio), since we have already added the folio to the
deferred_split list while removing the rmap for the 1st subpage. While
we are scanning the 2nd to nr_pages PTEs of this folio in
try_to_unmap_one(), the entire mTHP could be transitioned back to
swap-backed because the reference count is elevated, which can make
the "ref_count == 1 + map_count" check in try_to_unmap_one() false:

		/*
		 * The only page refs must be one from isolation
		 * plus the rmap(s) (dropped by discard:).
		 */
		if (ref_count == 1 + map_count &&
		    (!folio_test_dirty(folio) ||
		     ...
		     (vma->vm_flags & VM_DROPPABLE))) {
			dec_mm_counter(mm, MM_ANONPAGES);
			goto discard;
		}

This patchset resolves the issue by marking only genuinely dirty
folios as swap-backed, as suggested by David, and by switching to
batched unmapping of entire folios in try_to_unmap_one(). As a result,
the deferred_split count drops to zero, and memory reclamation
performance improves significantly: reclaiming 64KiB lazyfree large
folios is now 2.5x faster (the specific data is embedded in the
changelog of patch 3/4).

By the way, while the patchset is primarily aimed at PTE-mapped large
folios, Baolin and Lance also found that try_to_unmap_one() handles
lazyfree redirtied PMD-mapped large folios inefficiently: it splits
the PMD into PTEs and iterates over them. This patchset removes that
unnecessary splitting, enabling us to skip redirtied PMD-mapped large
folios 3.5x faster during memory reclamation (the specific data can be
found in the changelog of patch 4/4).
-v2:
 * describe the background and problems more clearly in the
   cover letter, per Lorenzo Stoakes;
 * also handle redirtied PMD-mapped large folios, per Baolin and Lance;
 * handle some corner cases such as HWPOISON and pte_unused;
 * fix riscv and x86 build issues.
-v1:
 https://lore.kernel.org/linux-mm/20250106031711.82855-1-21cnbao@gmail.com/

Barry Song (4):
  mm: Set folio swapbacked iff folios are dirty in try_to_unmap_one
  mm: Support tlbbatch flush for a range of PTEs
  mm: Support batched unmap for lazyfree large folios during reclamation
  mm: Avoid splitting pmd for lazyfree pmd-mapped THP in try_to_unmap

 arch/arm64/include/asm/tlbflush.h |  26 +++----
 arch/arm64/mm/contpte.c           |   2 +-
 arch/riscv/include/asm/tlbflush.h |   3 +-
 arch/riscv/mm/tlbflush.c          |   3 +-
 arch/x86/include/asm/tlbflush.h   |   3 +-
 mm/huge_memory.c                  |  17 ++++-
 mm/rmap.c                         | 112 ++++++++++++++++++++----------
 7 files changed, 111 insertions(+), 55 deletions(-)

-- 
2.39.3 (Apple Git-146)