Date: Tue, 22 Apr 2025 07:56:03 -0700
From: Yosry Ahmed
To: Nhat Pham
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
	hughd@google.com, mhocko@kernel.org, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com,
	chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
	huang.ying.caritas@gmail.com, ryan.roberts@arm.com,
	viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de,
	lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu,
	pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: Re: [RFC PATCH 00/14] Virtual Swap Space
Message-ID:
References: <20250407234223.1059191-1-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>

On Mon, Apr 07, 2025 at 04:42:01PM -0700, Nhat Pham wrote:
> This RFC implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at
> least 2011 (see [8]).
>
> The code attached to this RFC is purely a prototype. It is not 100%
> merge-ready (see section VI for future work). I do, however, want to
> show people this prototype/RFC, including all the bells and whistles
> and a couple of actual use cases, so that folks can see what the end
> results will look like, and give me early feedback :)
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer
> to the original page.
> This slot is also used as the "key" to find the swapped out content,
> as well as the index to swap data structures, such as the swap cache,
> or the swap cgroup mapping. Tying a swap entry to its backing slot in
> this way is performant and efficient when swap is purely disk space,
> and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a
> physical slot in the swap space, even for pages that are NEVER
> expected to hit the disk: pages compressed and stored in the zswap
> pool, zero-filled pages, or pages rejected by both of these
> optimizations when zswap writeback is disabled. This is arguably the
> central shortcoming of zswap:
> * In deployments where no disk space can be afforded for swap (such
>   as mobile and embedded devices), users cannot adopt zswap, and are
>   forced to use zram. This is confusing for users, and creates extra
>   burdens for developers, having to develop and maintain similar
>   features for two separate swap backends (writeback, cgroup
>   charging, THP support, etc.). For instance, see the discussion in
>   [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and
>   limits the memory saving potential of these optimizations by the
>   static size of the swapfile, especially in high memory systems that
>   can have up to terabytes worth of memory. It also creates
>   significant challenges for users who rely on swap utilization as an
>   early OOM signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. Tight coupling
> between a swap entry and its backing storage means that swapoff
> requires a whole page table walk to update all the page table entries
> that refer to this swap entry, as well as updating all the associated
> swap data structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that
> separates a swap entry from its physical backing storage. IOW, we
> need to “virtualize” the swap space: swap clients will work with a
> dynamically allocated virtual swap slot, storing it in page table
> entries, and using it to index into various swap-related data
> structures. The backing storage is decoupled from the virtual swap
> slot, and the newly introduced layer will “resolve” the virtual swap
> slot to the actual storage. This layer also manages other metadata of
> the swap entry, such as its lifetime information (swap count), via a
> dynamically allocated per-swap-entry descriptor:
>
> struct swp_desc {
> 	swp_entry_t vswap;
> 	union {
> 		swp_slot_t slot;
> 		struct folio *folio;
> 		struct zswap_entry *zswap_entry;
> 	};
> 	struct rcu_head rcu;
>
> 	rwlock_t lock;
> 	enum swap_type type;
>
> 	atomic_t memcgid;
>
> 	atomic_t in_swapcache;
> 	struct kref refcnt;
> 	atomic_t swap_count;
> };

It's exciting to see this proposal materializing :) I didn't get a
chance to look too closely at the code, but I have a few high-level
comments.

Do we need separate refcnt and swap_count? I am aware that there are
cases where we need to hold a reference to prevent the descriptor from
going away, without an extra page table entry referencing the swap
descriptor -- but I am wondering if we can get away with just
incrementing the swap count in these cases too? Would this mess things
up?
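
To make the question concrete, here is a minimal sketch of what "just
incrementing the swap count" could look like (the helper names are made
up and not from the patches, and it glosses over racing with the count
hitting zero, which may be exactly where things get messy):

/* Hypothetical: a single counter covering both PTE references and
 * transient kernel references to the descriptor. */
static void vswap_desc_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct swp_desc, rcu));
}

/* Transient holders pin the descriptor the same way a PTE does. */
static inline void vswap_desc_get(struct swp_desc *desc)
{
	atomic_inc(&desc->swap_count);
}

static inline void vswap_desc_put(struct swp_desc *desc)
{
	/* The last reference frees the descriptor after a grace period. */
	if (atomic_dec_and_test(&desc->swap_count))
		call_rcu(&desc->rcu, vswap_desc_free_rcu);
}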

>
> This design allows us to:
> * Decouple zswap (and zeromapped swap entries) from the backing
>   swapfile: simply associate the virtual swap slot with one of the
>   supported backends: a zswap entry, a zero-filled swap page, a slot
>   on the swapfile, or an in-memory page.
> * Simplify and optimize swapoff: we only have to fault the page in
>   and have the virtual swap slot point to the page instead of the
>   on-disk physical swap slot. No need to perform any page table
>   walking.
>
> Please see the attached patches for implementation details.
>
> Note that I do not remove the old implementation for now. Users can
> select between the old and the new implementation via the
> CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
> new design, and iteratively optimize upon it (without having to
> include everything in an even more massive patch series).

I know this is easier, but honestly I'd prefer if we do an incremental
replacement (if possible) rather than introducing a new implementation
and slowly deprecating the old one, which historically doesn't seem to
go well :P

Once the series is organized as Johannes suggested, and we have better
insights into how this will be integrated with Kairui's work, it should
be clearer whether it's possible to incrementally update the current
implementation rather than add a parallel implementation.

>
> III. Future Use Cases
>
> Other than decoupling swap backends and optimizing swapoff, this new
> design allows us to implement the following more easily and
> efficiently:
>
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8]
>   and [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): once you have pinned down the
>   backing store of a THP, you can dispatch each range of subpages to
>   the appropriate swapin handler.
> * Swapping a folio out with discontiguous physical swap slots (see
>   [10]).
>
>
> IV. Potential Issues
>
> Here are a couple of issues I can think of, along with some potential
> solutions:
>
> 1. Space overhead: we need one swap descriptor per swap entry.
> * Note that this overhead is dynamic, i.e. only incurred when we
>   actually need to swap a page out.
> * It can be further offset by the reduction of the swap map and the
>   elimination of the zeromapped bitmap.
>
> 2. Lock contention: since the virtual swap space is dynamic/unbounded,
> we cannot naively range partition it anymore. This can increase lock
> contention on swap-related data structures (swap cache, zswap’s
> xarray, etc.).
> * The problem is slightly alleviated by the lockless nature of the
>   new reference counting scheme, as well as the per-entry locking for
>   backing store information.
> * Johannes suggested that I can implement a dynamic partition scheme,
>   in which new partitions (along with associated data structures) are
>   allocated on demand. It is one extra layer of indirection, but
>   global locking is only done on partition allocation, rather than on
>   each access. All other accesses only take local (per-partition)
>   locks, or are completely lockless (such as partition lookup).
>
>
> V. Benchmarking
>
> As a proof of concept, I ran the prototype through some simple
> benchmarks:
>
> 1. usemem: 16 threads, 2G each, memory.max = 16G
>
> I benchmarked the following usemem command:
>
> time usemem --init-time -w -O -s 10 -n 16 2g
>
> Baseline:
> real: 33.96s
> user: 25.31s
> sys: 341.09s
> average throughput: 111295.45 KB/s
> average free time: 2079258.68 usecs
>
> New Design:
> real: 35.87s
> user: 25.15s
> sys: 373.01s
> average throughput: 106965.46 KB/s
> average free time: 3192465.62 usecs
>
> To root cause this regression, I ran perf on the usemem program, as
> well as on the following stress-ng program:
>
> perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng --pageswap $(nproc) --pageswap-ops 100000
>
> and observed the (predicted) increase in lock contention on swap
> cache accesses. This regression is alleviated if I put together the
> following hack: limit the virtual swap space to a sufficient size for
> the benchmark, range partition the swap-related data structures (swap
> cache, zswap tree, etc.) based on the limit, and distribute the
> allocation of virtual swap slots among these partitions (on a per-CPU
> basis):
>
> real: 34.94s
> user: 25.28s
> sys: 360.25s
> average throughput: 108181.15 KB/s
> average free time: 2680890.24 usecs
>
> As mentioned above, I will implement proper dynamic swap range
> partitioning in a follow-up work.

I thought there would be some improvements with the new design once the
lock contention is gone, due to the colocation of all swap metadata. Do
we know why this isn't the case?

Also, one key metric missing from this cover letter is disk space
savings. It would be useful if you could give a realistic example of
how much disk space is being provisioned and wasted today to
effectively use zswap, and how much this can decrease with this design.
I believe the disk space savings are one of the main motivations, so
let's showcase that :)
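
One more thought on the dynamic partitioning idea from section IV, just
to make sure I understand the direction. Below is roughly what I am
imagining (a sketch only; every name here is hypothetical and not from
the patches, and partition teardown is ignored): the global map is only
written when a new partition is created, while lookups and per-slot
operations only touch the partition itself.

struct vswap_partition {
	struct xarray slots;	/* virtual slot offset -> swp_desc */
	spinlock_t alloc_lock;	/* slot allocation within this partition */
};

/* Global partition id -> partition map; xa_load() lookups are lockless. */
static DEFINE_XARRAY(vswap_partitions);

static struct vswap_partition *vswap_partition_get(unsigned long id)
{
	struct vswap_partition *part, *old;

	/* Fast path: the partition already exists. */
	part = xa_load(&vswap_partitions, id);
	if (part)
		return part;

	/* Slow path: allocate and publish a new partition. */
	part = kzalloc(sizeof(*part), GFP_KERNEL);
	if (!part)
		return NULL;
	xa_init(&part->slots);
	spin_lock_init(&part->alloc_lock);

	old = xa_cmpxchg(&vswap_partitions, id, NULL, part, GFP_KERNEL);
	if (old) {
		/* Lost the race (or hit an error); use the existing one. */
		kfree(part);
		return xa_is_err(old) ? NULL : old;
	}
	return part;
}

If that matches what you and Johannes have in mind, the swap cache,
zswap tree, and slot allocator state would presumably all hang off the
per-partition structure, which is how the contention seen in the
benchmark above would be avoided.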