From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
    David Hildenbrand, Yosry Ahmed, "Huang, Ying", Nhat Pham,
    Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
    Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts,
    linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 00/28] mm, swap: introduce swap table
Date: Thu, 15 May 2025 04:17:00 +0800
Message-ID: <20250514201729.48420-1-ryncsn@gmail.com>
From: Kairui Song <ryncsn@gmail.com>

This is the series that implements the Swap Table idea proposed in the
LSF/MM/BPF topic "Integrate swap cache, swap maps with swap allocator"
about one month ago [1]. With this series, the swap subsystem gains a
~20-30% performance improvement, from basic sequential swap to heavy
workloads, for both 4K and mTHP folios. Idle memory usage is already
much lower, average memory consumption stays the same or will drop
even further (with follow-up work), and this enables many more future
optimizations with better-defined swap operations.

This series is stable and mergeable on both mm-unstable and mm-stable.
It's a long series, so it might be challenging to review, but it has
been working well under many stress tests. You can also find the
latest branch here:

https://github.com/ryncsn/linux/tree/kasong/devel/swap-table

With the swap table, a table entry becomes the fundamental and only
needed data structure for the swap cache, swap map, and swap cgroup
map. This reduces memory usage, improves performance, and provides
more flexibility and a better abstraction.

/*
 * Swap table entry type and bit layouts:
 *
 * NULL:    | ------------    0   -------------|
 * Shadow:  | SWAP_COUNT |---- SHADOW_VAL ---|1|
 * Folio:   | SWAP_COUNT |------ PFN -------|10|
 * Pointer: |----------- Pointer ----------|100|
 */
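To make the layout above concrete, below is a minimal user-space
sketch of how a single machine word can carry all four entry types by
tagging its low bits. This is an illustration only: the real bit
widths, type names, and helpers live in mm/swap_table.h in this
series, and everything here (swp_te_t, make_shadow(), the 8-bit count
in the top byte, the assumption of a 64-bit word) is an assumed
stand-in, not the series' actual code.

```c
/*
 * Illustrative user-space model of a tagged swap table entry word.
 * All names and bit positions are assumptions, not the series' code.
 */
#include <stdio.h>
#include <stdint.h>

typedef uintptr_t swp_te_t;	/* one swap table entry = one word */

#define SWP_TE_NULL ((swp_te_t)0)

/* Shadow: count in top byte, shadow value shifted up, low bit = 1 */
static swp_te_t make_shadow(swp_te_t shadow_val, unsigned count)
{
	return ((swp_te_t)count << 56) | (shadow_val << 1) | 0x1;
}

/* Folio: count in top byte, PFN shifted up, low bits = 10 */
static swp_te_t make_folio(swp_te_t pfn, unsigned count)
{
	return ((swp_te_t)count << 56) | (pfn << 2) | 0x2;
}

/* Pointer: assumes p is at least 8-byte aligned, low bits = 100 */
static swp_te_t make_pointer(void *p)
{
	return (swp_te_t)p | 0x4;
}

/* Decode the type from the low bits alone */
static const char *te_type(swp_te_t te)
{
	if (te == SWP_TE_NULL)
		return "null";
	if (te & 0x1)
		return "shadow";
	if (te & 0x2)
		return "folio";
	return "pointer";
}

int main(void)
{
	static long obj;	/* 8-byte aligned target for the pointer case */
	swp_te_t tes[] = {
		SWP_TE_NULL,
		make_shadow(0xabcd, 3),	/* swapped out, count 3 */
		make_folio(0x1000, 1),	/* in swap cache, count 1 */
		make_pointer(&obj),
	};
	for (unsigned i = 0; i < sizeof(tes) / sizeof(tes[0]); i++)
		printf("%#lx -> %s\n", (unsigned long)tes[i], te_type(tes[i]));
	return 0;
}
```

The one-word representation is what lets a single table replace the
swap map, the swap cache, and the workingset shadow storage: a lookup
reads one word, and the low bits say how to interpret the rest.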
This series contains many cleanups and refactors due to long-standing
historical issues in the SWAP subsystem, e.g. the fuzzy workflow and
definition of how swap entries are handled, and the many corner cases
mentioned in the LSF/MM/BPF talk. There may be a temporary increase in
complexity or memory consumption in the middle of the series, but the
end result is much simplified and sanitized. These patches depend on
each other due to the current complex swap design, which is why this
is such a long series.

This series cleans up most of the issues and improves the situation in
the following order:

- Simplification and optimizations (Patch 1 - 3)
- Tidy up swap info and cache lookup (Patch 4 - 6)
- Introduce basic swap table infrastructure (Patch 7 - 8)
- Remove swap cache bypassing with SWP_SYNCHRONOUS_IO, enabling mTHP
  for more workloads (Patch 9 - 14)
- Simplify swap-in synchronization with the swap cache, eliminating
  long-tail latency issues and improving performance; swap can be
  synced with the folio lock now (Patch 15 - 16)
- Make most swap operations folio based. We can now use folio-based
  helpers that ensure the swap entries are stable under the folio
  lock, which also makes more optimizations and sanity checks doable
  (Patch 17 - 18)
- Remove SWAP_HAS_CACHE (Patch 19 - 22)
- Completely rework swap counting using the swap table, and remove
  COUNT_CONTINUED (Patch 23 - 27)
- Dynamic reclaim and allocation for the swap table (Patch 28)

And the performance is looking great too.

vm-scalability usemem shows a great improvement. Test using:
usemem --init-time -O -y -x -n 31 1G (1G memcg, pmem as swap)

              Before:         After:
System time:  217.39s         161.59s       (-25.67%)
Throughput:   3933.58 MB/s    4975.55 MB/s  (+26.48%)
(Similar results with random access, usemem -R)

Build kernel with defconfig on tmpfs with ZRAM. The results below show
a test matrix using different memory cgroup limits and job counts:

make -j    | Total Sys Time (seconds) | Total Time (seconds)
(NR / Mem) | (Before / After / Delta) | (Before / After / Delta)
With 4k pages only:
 6 / 192M  |  5327 / 3915 / -26.5%    |  1427 / 1141 / -20.0%
12 / 256M  |  5373 / 4009 / -25.3%    |   743 /  606 / -18.4%
24 / 384M  |  6149 / 4523 / -26.4%    |   438 /  353 / -19.4%
48 / 768M  |  7285 / 4521 / -37.9%    |   251 /  190 / -24.3%
With 64k mTHP:
24 / 512M  |  4399 / 3328 / -24.3%    |   345 /  289 / -16.2%
48 / 1G    |  5072 / 3406 / -32.8%    |   187 /  150 / -19.7%

Memory usage is also reduced. Although this series doesn't remove the
swap cgroup array yet, the peak usage per swap entry is already
reduced from 12 bytes to 10 bytes, and the swap table is dynamically
allocated, which means idle memory usage will be reduced by a lot.

Some other highlights and notes:

1. This series introduces a set of helpers (folio_alloc_swap,
   folio_dup_swap, folio_put_swap, folio_free_swap*) to make most swap
   operations folio based; this should establish a clean border
   between swap and the rest of mm (a sketch of the intended call
   pattern follows this list). It also splits hibernation swap entry
   allocation out of the ordinary swap operations.

2. This series enables mTHP swap-in and readahead skipping for more
   workloads, as it removes the swap cache bypassing path. We
   currently do mTHP swap-in and readahead bypass only for
   SWP_SYNCHRONOUS_IO devices, and only when the swap count of all
   related entries equals one. This makes no sense: readahead and mTHP
   behaviour should have nothing to do with the swap count; it is only
   a defect of the current design, where they are coupled with swap
   cache bypassing. This series removes that limitation while showing
   a major performance improvement, and it should also reduce mTHP
   fragmentation.

3. By removing the old swap cache design, all swap cache is now
   protected by fine-grained cluster locks. This also removes the
   cluster shuffle algorithm, which should improve performance for
   SWAP on HDD too (fixing [4]), and gets rid of the design with many
   swap address_space instances.

4. I dropped some doable future optimizations for now. E.g. the
   folio-based helpers will be an essential part of dropping the swap
   cgroup control map, which will improve performance and reduce
   memory usage even more; that can be done later, and more
   folio-batched operations can be built on top of this. So this
   series is not in its best possible shape yet, but it already looks
   good enough.
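To illustrate the border described in highlight 1, here is a small,
runnable user-space model of the intended call pattern. The helper
names folio_alloc_swap, folio_dup_swap, and folio_put_swap come from
this series, but their real signatures and semantics are not spelled
out in this cover letter, so the types, return values, and counting
behaviour below are assumptions for illustration, not the kernel
implementation.

```c
/*
 * Hypothetical user-space model of the folio-based swap helper border.
 * struct folio and all semantics here are assumed stand-ins.
 */
#include <stdio.h>
#include <stdbool.h>

struct folio {			/* stand-in for the kernel's struct folio */
	unsigned long swap;	/* 0 means no swap entry assigned */
	int swap_count;		/* references held by swap PTEs */
	bool locked;
};

/* Assign a swap entry to a locked folio (entry value is made up). */
static int folio_alloc_swap(struct folio *folio)
{
	if (!folio->locked || folio->swap)
		return -1;
	folio->swap = 0x1234;	/* pretend the allocator returned this */
	folio->swap_count = 0;
	return 0;
}

/* One more PTE now refers to this folio's swap entry. */
static void folio_dup_swap(struct folio *folio)
{
	folio->swap_count++;
}

/* One PTE dropped its reference; release the entry on the last put. */
static void folio_put_swap(struct folio *folio)
{
	if (--folio->swap_count == 0)
		folio->swap = 0;	/* entry can be freed */
}

int main(void)
{
	struct folio f = { .locked = true };

	if (folio_alloc_swap(&f))	/* swap-out path reserves an entry */
		return 1;
	folio_dup_swap(&f);	/* first PTE converted to a swap PTE */
	folio_dup_swap(&f);	/* a fork adds a second reference */
	folio_put_swap(&f);	/* one mapping swapped back in */
	folio_put_swap(&f);	/* last reference gone, entry released */
	printf("entry=%#lx count=%d\n", f.swap, f.swap_count);
	return 0;
}
```

The design point the model tries to capture is that every operation
goes through the (locked) folio rather than a bare swap entry, which
is what keeps the entries stable under the folio lock, as described in
the Patch 17 - 18 item above.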
Future work items:

1. More tests; some of the patches may need to be split into smaller
   ones or may need a few preparation series.

2. Integrate with Nhat Pham's Virtual swap space [2]. While this
   series improves performance and sanitizes the workflow for SWAP,
   nothing changes feature-wise. The swap table idea is supposed to be
   able to handle things like a virtual device in a cleaner way, with
   both lower overhead and better flexibility; more work is needed to
   figure out how to implement it.

3. Some helpers from this series could be very helpful for future
   work, e.g. the folio-based swap helpers: locking a folio now
   stabilizes its swap entries, which could also be used to stabilize
   the underlying swap device's entries if a virtual device design is
   implemented, hence simplifying the locking design. More entry types
   could also be added for things like the zero map or shmem.

4. The unified swap-in path now enables mTHP swap-in for entries with
   swap count > 1. This also makes unifying the readahead of shmem and
   anon doable (as demonstrated a year ago [3]; that attempt
   conflicted with the standalone mTHP swap-in path, but the paths are
   unified now). We could also implement a readahead-based mTHP
   swap-in on top of this. This needs more discussion.

Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
Link: https://lore.kernel.org/lkml/20250407234223.1059191-1-nphamcs@gmail.com/ [2]
Link: https://lore.kernel.org/all/20240129175423.1987-1-ryncsn@gmail.com/ [3]
Link: https://lore.kernel.org/linux-mm/202504241621.f27743ec-lkp@intel.com/ [4]

Kairui Song (27):
  mm, swap: don't scan every fragment cluster
  mm, swap: consolidate the helper for mincore
  mm, swap: split readahead update out of swap cache lookup
  mm, swap: sanitize swap cache lookup convention
  mm, swap: rearrange swap cluster definition and helpers
  mm, swap: tidy up swap device and cluster info helpers
  mm, swap: use swap table for the swap cache and switch API
  mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc
  mm, swap: add a swap helper for bypassing only read ahead
  mm, swap: clean up and consolidate helper for mTHP swapin check
  mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO
  mm/shmem, swap: avoid redundant Xarray lookup during swapin
  mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
  mm, swap: split locked entry freeing into a standalone helper
  mm, swap: use swap cache as the swap in synchronize layer
  mm, swap: sanitize swap entry management workflow
  mm, swap: rename and introduce folio_free_swap_cache
  mm, swap: clean up and improve swap entries batch freeing
  mm, swap: check swap table directly for checking cache
  mm, swap: add folio to swap cache directly on allocation
  mm, swap: drop the SWAP_HAS_CACHE flag
  mm, swap: remove no longer needed _swap_info_get
  mm, swap: implement helpers for reserving data in swap table
  mm/workingset: leave highest 8 bits empty for anon shadow
  mm, swap: minor clean up for swapon
  mm, swap: use swap table to track swap count
  mm, swap: implement dynamic allocation of swap table

Nhat Pham (1):
  mm/shmem, swap: remove SWAP_MAP_SHMEM

 arch/s390/mm/pgtable.c |    2 +-
 include/linux/swap.h   |  119 +--
 kernel/power/swap.c    |    8 +-
 mm/filemap.c           |   20 +-
 mm/huge_memory.c       |   20 +-
 mm/madvise.c           |    2 +-
 mm/memory-failure.c    |    2 +-
 mm/memory.c            |  384 ++++-----
 mm/migrate.c           |   28 +-
 mm/mincore.c           |   49 +-
 mm/page_io.c           |   12 +-
 mm/rmap.c              |    7 +-
 mm/shmem.c             |  204 ++---
 mm/swap.h              |  316 ++++++--
 mm/swap_state.c        |  646 ++++++++-------
 mm/swap_table.h        |  231 ++++++
 mm/swapfile.c          | 1708 +++++++++++++++++-----------------
 mm/userfaultfd.c       |    9 +-
 mm/vmscan.c            |   22 +-
 mm/workingset.c        |   39 +-
 mm/zswap.c             |   13 +-
 21 files changed, 1981 insertions(+), 1860 deletions(-)
 create mode 100644 mm/swap_table.h

--
2.49.0