From: Matt Fleming
To: Andrew Morton
Cc: Jens Axboe, Minchan Kim, Sergey Senozhatsky, Chris Li, Kairui Song,
 Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Vlastimil Babka,
 Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner,
 Zi Yan, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, kernel-team@cloudflare.com, Matt Fleming
Subject: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
Date: Tue, 3 Mar 2026 11:53:57 +0000
Message-ID: <20260303115358.1323188-1-matt@readmodwrite.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

Systems with zram-only swap can spin in direct reclaim for 20-30
minutes without ever invoking the OOM killer. We've hit this repeatedly
in production on machines with 377 GiB RAM and a 377 GiB zram device.

The problem
-----------

should_reclaim_retry() calls zone_reclaimable_pages() to estimate how
much memory is still reclaimable. That estimate includes anonymous
pages, on the assumption that swapping them out frees physical pages.

With disk-backed swap, that's true -- writing a page to disk frees a
page of RAM, and SwapFree accurately reflects how many more pages can
be written. With zram, the free slot count is inaccurate.
A 377 GiB zram device with 10% used reports ~340 GiB of free swap
slots, but filling those slots requires physical RAM that the system
doesn't have -- that's why it's in direct reclaim in the first place.
The reclaimable estimate is off by orders of magnitude.

The fix
-------

This patch introduces two new flags: BLK_FEAT_RAM_BACKED at the block
layer (set by zram and brd) and SWP_RAM_BACKED at the swap layer. When
all active swap devices are RAM-backed, should_reclaim_retry() excludes
anonymous pages from the reclaimable estimate and counts only
file-backed pages. Once file pages are exhausted the watermark check
fails and the kernel falls through to OOM.

Opting to OOM kill something over spinning in direct reclaim optimises
for Mean Time To Recovery (MTTR) and prevents "brownout" situations
where performance is degraded for prolonged periods (we've seen 20-30
minutes of degraded system performance).

Design choices and known limitations
------------------------------------

Why not fix zone_reclaimable_pages() globally?

Other callers (e.g. balance_pgdat() in kswapd) use the anon-inclusive
count for different purposes. Changing it globally risks breaking
kswapd's reclaim decisions in ways that are hard to test. Limiting the
change to should_reclaim_retry() keeps the blast radius small and
squarely in the direct reclaim path.

What about mixed swap configurations (zram + disk)?

When at least one disk-backed swap device is active,
swap_all_ram_backed is false and the current behaviour is preserved.
Per-device reclaimable accounting is possible but it's a much larger
change, and mixed zram+disk configurations are uncommon in practice
AFAIK.

Can we make zram free space accounting more accurate?

This is possible but probably the most complicated solution.
Swap device drivers could provide a callback which RAM-backed drivers
would use to estimate how many more pages they could store, given some
average compression ratio (either historic or projected given a list of
anon pages to swap) and the amount of free physical memory. Plus, the
estimate wouldn't be constant: it would change on every invocation of
the callback, in line with the current compression ratio and the amount
of free memory.

Build-testing
-------------

Built with defconfig, allnoconfig, allmodconfig, and multiple
randconfig iterations on x86_64 / 7.0-rc2.

Matt Fleming (1):
  mm: Reduce direct reclaim stalls with RAM-backed swap

 drivers/block/brd.c           |  3 ++-
 drivers/block/zram/zram_drv.c |  3 ++-
 include/linux/blkdev.h        |  8 ++++++
 include/linux/swap.h          |  9 +++++++
 mm/page_alloc.c               | 23 ++++++++++++++++-
 mm/swapfile.c                 | 47 ++++++++++++++++++++++++++++++++++-
 6 files changed, 89 insertions(+), 4 deletions(-)

-- 
2.43.0
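
For anyone who wants to play with the numbers, the retry decision
described under "The fix" can be approximated with a small userspace
model. This is only a sketch and not kernel code: the function names,
the 4 KiB page size, the watermark comparison and the example inputs
are all assumptions made for illustration. Passing
all_swap_ram_backed=False stands in for the current behaviour, where
anon pages are counted regardless of what backs the swap device.

```python
"""Toy userspace model of the retry decision; not kernel code."""

PAGES_PER_GIB = (1 << 30) // 4096  # assumes 4 KiB pages


def free_swap_pages(device_pages: int, used_percent: int) -> int:
    """Free swap slots as reported by the slot count alone."""
    return device_pages * (100 - used_percent) // 100


def reclaimable_estimate(file_pages: int, anon_pages: int,
                         swap_free: int, all_swap_ram_backed: bool) -> int:
    """Pages assumed reclaimable when deciding whether to retry.

    Simplifying assumption: anon pages count in full whenever any swap
    slot is free, unless every active swap device is RAM-backed.
    """
    if swap_free > 0 and not all_swap_ram_backed:
        return file_pages + anon_pages
    return file_pages


def should_retry(free_pages: int, reclaimable: int,
                 min_watermark: int) -> bool:
    """Retry only if free + reclaimable pages would clear the watermark."""
    return free_pages + reclaimable > min_watermark


if __name__ == "__main__":
    # Hypothetical machine loosely based on the figures in this letter.
    swap_free = free_swap_pages(377 * PAGES_PER_GIB, used_percent=10)
    anon = 300 * PAGES_PER_GIB   # made-up anon footprint
    free = 1 * PAGES_PER_GIB     # made-up free memory
    wmark = 2 * PAGES_PER_GIB    # made-up min watermark
    for ram_backed in (False, True):
        # File pages are taken to be exhausted already (file_pages=0).
        est = reclaimable_estimate(0, anon, swap_free, ram_backed)
        print(f"all_swap_ram_backed={ram_backed}: "
              f"retry={should_retry(free, est, wmark)}")
```

With these made-up inputs the model keeps retrying while anon pages are
counted, and stops retrying once they are excluded, which is the point
at which the real kernel would fall through to OOM.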