Date: Wed, 25 Sep 2024 09:40:08 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yosry Ahmed
Cc: Kanchana P Sridhar, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com,
	shakeel.butt@linux.dev, ryan.roberts@arm.com, ying.huang@intel.com,
	21cnbao@gmail.com, akpm@linux-foundation.org, nanhai.zou@intel.com,
	wajdi.k.feghali@intel.com, vinodh.gopal@intel.com
Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
Message-ID: <20240925134008.GA875661@cmpxchg.org>
References: <20240924011709.7037-1-kanchana.p.sridhar@intel.com> <20240924011709.7037-7-kanchana.p.sridhar@intel.com>
On Tue, Sep 24, 2024 at 12:38:32PM -0700, Yosry Ahmed wrote:
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> wrote:
> >
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> > them page by page.
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP. The corresponding tunable
> > zswap module parameter is "mthp_enabled".
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar
> > ---
> >  mm/Kconfig |   8 ++++
> >  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
> >  2 files changed, 66 insertions(+), 64 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 09aebca1cae3..c659fb732ec4 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >  	  reducing the chance that cold pages will reside in the zswap pool
> >  	  and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +	bool "Store mTHP and THP folios in zswap"
> > +	depends on ZSWAP
> > +	default n
> > +	help
> > +	  If selected, zswap will process mTHP and THP folios by
> > +	  compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >  	prompt "Default compressor"
> >  	depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8f2e0ab34c84..16ab770546d6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> >  		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +		CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> > +
> >  bool zswap_is_enabled(void)
> >  {
> >  	return zswap_enabled;
> > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
> >   * @objcg: The folio's objcg.
> >   * @pool: The zswap_pool to store the compressed data for the page.
> >   */
> > -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> > -					    struct obj_cgroup *objcg,
> > -					    struct zswap_pool *pool)
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +			     struct obj_cgroup *objcg,
> > +			     struct zswap_pool *pool)
>
> As I mentioned earlier, the patch that introduced zswap_store_page()
> should have directly used it in zswap_store(). This would make this
> patch much clearer.
>
> >  {
> >  	swp_entry_t swp = folio->swap;
> >  	int type = swp_type(swp);
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> >  	return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> > + * and stored sequentially.
> > + */
> >  bool zswap_store(struct folio *folio)
> >  {
> >  	long nr_pages = folio_nr_pages(folio);
> >  	swp_entry_t swp = folio->swap;
> >  	pgoff_t offset = swp_offset(swp);
> >  	struct xarray *tree = swap_zswap_tree(swp);
> > -	struct zswap_entry *entry;
> >  	struct obj_cgroup *objcg = NULL;
> >  	struct mem_cgroup *memcg = NULL;
> > +	struct zswap_pool *pool;
> > +	bool ret = false;
> > +	long index;
> >
> >  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -	/* Large folios aren't supported */
> > -	if (folio_test_large(folio))
> > +	/* Storing large folios isn't enabled */
>
> The comment is now stating the obvious, please remove it.
>
> > +	if (!zswap_mthp_enabled && folio_test_large(folio))
> >  		return false;
> >
> >  	if (!zswap_enabled)
> > -		goto check_old;
> > +		goto reject;
> >
> > -	/* Check cgroup limits */
> > +	/*
> > +	 * Check cgroup limits:
> > +	 *
> > +	 * The cgroup zswap limit check is done once at the beginning of an
> > +	 * mTHP store, and not within zswap_store_page() for each page
> > +	 * in the mTHP. We do however check the zswap pool limits at the
> > +	 * start of zswap_store_page(). What this means is, the cgroup
> > +	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +	 * However, the per-store-page zswap pool limits check should
> > +	 * hopefully trigger the cgroup aware and zswap LRU aware global
> > +	 * reclaim implemented in the shrinker. If this assumption holds,
> > +	 * the cgroup exceeding the zswap limits could potentially be
> > +	 * resolved before the next zswap_store, and if it is not, the next
> > +	 * zswap_store would fail the cgroup zswap limit check at the start.
> > +	 */
>
> I do not really like this. Allowing going one page above the limit is
> one thing, but one THP above the limit seems too much. I also don't
> like relying on the repeated limit checking in zswap_store_page(), if
> anything I think that should be batched too.
>
> Is it too unreasonable to maintain the average compression ratio and
> use that to estimate limit checking for both memcg and global limits?
> Johannes, Nhat, any thoughts on this?

I honestly don't think it's much of an issue. The global limit is huge,
and the cgroup limit is to the best of my knowledge only used as a
binary switch. Setting a non-binary limit - global or cgroup - seems
like a bit of an obscure usecase to me, because in the vast majority of
cases it's preferable to keep compressing over declaring OOM.

And even if you do have some granular limit, the workload size scales
with it. It's not like you have a thousand THPs in a 10M cgroup.
If this ever becomes an issue, we can handle it in a fastpath-slowpath
scheme: check the limit up front for fast-path failure if we're already
maxed out, just like now; then make obj_cgroup_charge_zswap() atomically
charge against zswap.max and unwind the store if we raced.

For now, I would just keep the simple version we currently have: check
once in zswap_store() and then just go ahead for the whole folio.
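The fastpath-slowpath idea above can be sketched as a small userspace
model (an illustration only, not the kernel implementation: zswap_usage
and zswap_max stand in for the cgroup's zswap charge counter and
zswap.max, and the values are made up):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the cgroup zswap counter and zswap.max. */
static atomic_long zswap_usage;      /* bytes currently charged */
static long zswap_max = 4096;        /* made-up limit for illustration */

/* Fast path: cheap racy read; refuse the store if already maxed out. */
static bool zswap_limit_fastpath_ok(void)
{
	return atomic_load_explicit(&zswap_usage,
				    memory_order_relaxed) < zswap_max;
}

/*
 * Slow path: charge atomically. If a concurrent store raced us past
 * the limit, unwind the charge and fail, so the overshoot is bounded
 * by the in-flight stores rather than persisting.
 */
static bool zswap_charge(long nbytes)
{
	long old = atomic_fetch_add_explicit(&zswap_usage, nbytes,
					     memory_order_relaxed);
	if (old + nbytes > zswap_max) {
		atomic_fetch_sub_explicit(&zswap_usage, nbytes,
					  memory_order_relaxed);
		return false;
	}
	return true;
}
```

A store would call zswap_limit_fastpath_ok() once up front (as the code
does today) and zswap_charge() after compression, when the real size is
known; a failed charge triggers the same invalidate-previous-entries
unwind the patch already does for per-page errors.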