Date: Thu, 7 Mar 2024 09:07:05 +0000
References: <20240304-zswap-xarray-v4-1-c4b45670cc30@kernel.org>
Subject: Re: [PATCH v4] zswap: replace RB tree with xarray
From: Yosry Ahmed
To: Chris Li
Cc: Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Nhat Pham, Johannes Weiner, "Matthew Wilcox (Oracle)", Chengming Zhou, Barry Song

[..]
> > > -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> > > -{
> > > -	rb_erase(&entry->rbnode, root);
> > > -	RB_CLEAR_NODE(&entry->rbnode);
> > > +	e = xa_store(tree, offset, entry, GFP_KERNEL);
> > > +	err = xa_err(e);
> > > +
> > > +	if (err) {
> > > +		e = xa_erase(tree, offset);
> > > +		if (err == -ENOMEM)
> > > +			zswap_reject_alloc_fail++;
> > > +		else
> > > +			zswap_reject_xarray_fail++;
> >
> > I think this is too complicated, and as Chengming pointed out, I believe
> > we can use xa_store() directly in zswap_store().
>
> Sure.
>
> > I am also not sure what the need for zswap_reject_xarray_fail is. Are
> > there any reasons why the store here can fail other than -ENOMEM? The
> > docs say the only other option is -EINVAL, and looking at __xa_store(),
> > it seems like this is only possible if xa_is_internal() is true (which
> > means we are not passing in a properly aligned pointer IIUC).
>
> Because the xa_store document said it can return two error codes. I
> see zswap try to classify the error count it hit, that is why I add
> the zswap_reject_xarray_fail.

Right, but I think we should not get -EINVAL in this case. I think it
would be more appropriate to have WARN_ON() or VM_WARN_ON() in this
case?

[..]
> > > @@ -1113,7 +1068,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> > >  static int zswap_writeback_entry(struct zswap_entry *entry,
> > >  				 swp_entry_t swpentry)
> > >  {
> > > -	struct zswap_tree *tree;
> > > +	struct xarray *tree;
> > > +	pgoff_t offset = swp_offset(swpentry);
> > > +	struct zswap_entry *e;
> > >  	struct folio *folio;
> > >  	struct mempolicy *mpol;
> > >  	bool folio_was_allocated;
> > > @@ -1150,19 +1107,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> > >  	 * be dereferenced.
> > >  	 */
> > >  	tree = swap_zswap_tree(swpentry);
> > > -	spin_lock(&tree->lock);
> > > -	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> > > -		spin_unlock(&tree->lock);
> > > +	e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > > +	if (e != entry) {
> >
> > I think we can avoid adding 'e' and 'offset' local variables here and
> > just do everything in the if condition. If you want to avoid the line
> > break, then introducing 'offset' is fine, but I don't see any value from
> > 'e'.
>
> As I said in my other email. I don't think having this type of local
> variable affects the compiler negatively. The compiler generally uses
> their own local variable to track the expression anyway. So I am not
> sure about the motivation to remove local variables alone, if it helps
> the reading. I feel the line "if (xa_cmpxchg(tree, offset, entry,
> NULL, GFP_KERNEL) != entry)" is too long and complicated inside the if
> condition. That is just me. Not a big deal.

I just think 'e' is not providing any readability improvements. If
anything, people need to pay closer attention to figure out 'e' is only
a temp variable and 'entry' is the real deal.

I vote for:
	if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL))

[..]
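To make the spelling I am voting for concrete, here is a minimal userspace sketch. Everything in it is an assumption for illustration: slots[], toy_cmpxchg() and toy_writeback() are hypothetical names that only mimic the "return the previous value" behaviour of xa_cmpxchg(). It is not kernel code, takes no locks, and is not the actual zswap_writeback_entry().

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 4

struct zswap_entry { int length; };

/* Hypothetical stand-in for the xarray: one pointer per swap offset. */
static struct zswap_entry *slots[NSLOTS];

/*
 * Toy model of xa_cmpxchg(): replace slots[offset] with 'new' only if it
 * currently holds 'old', and return whatever was stored before the call.
 */
static struct zswap_entry *toy_cmpxchg(size_t offset,
				       struct zswap_entry *old,
				       struct zswap_entry *new)
{
	struct zswap_entry *cur;

	assert(offset < NSLOTS);
	cur = slots[offset];
	if (cur == old)
		slots[offset] = new;
	return cur;
}

/*
 * Returns 0 if 'entry' was still stored at 'offset' and has been removed,
 * -1 if the slot held something else (or nothing) and was left untouched.
 * No temporary is needed: the comparison happens in the if condition.
 */
static int toy_writeback(size_t offset, struct zswap_entry *entry)
{
	if (entry != toy_cmpxchg(offset, entry, NULL))
		return -1;
	return 0;
}
```

Calling toy_writeback() twice on the same offset with the same entry succeeds the first time and fails the second, because the first call already cleared the slot. The only point is that the reader sees 'entry' on both sides of the comparison and no second name to keep track of.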
> > > @@ -1471,10 +1423,12 @@ bool zswap_store(struct folio *folio)
> > >  {
> > >  	swp_entry_t swp = folio->swap;
> > >  	pgoff_t offset = swp_offset(swp);
> > > -	struct zswap_tree *tree = swap_zswap_tree(swp);
> > > -	struct zswap_entry *entry, *dupentry;
> > > +	struct xarray *tree = swap_zswap_tree(swp);
> > > +	struct zswap_entry *entry, *old;
> > >  	struct obj_cgroup *objcg = NULL;
> > >  	struct mem_cgroup *memcg = NULL;
> > > +	int err;
> > > +	bool old_erased = false;
> > >
> > >  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > >  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > @@ -1526,6 +1480,7 @@ bool zswap_store(struct folio *folio)
> > >  		kunmap_local(src);
> > >  		entry->length = 0;
> > >  		entry->value = value;
> > > +		entry->pool = NULL;
> >
> > Why do we need to initialize the pool here? Is this a bug fix for an
> > existing problem or just keeping things clean? Either way I think it
> > should be done separately, unless it is related to a change in this
> > patch.
>
> I notice the entry->pool will leave uninitialized. I think it should
> be cleaned up. It is a clean up, it does not need to happen in this
> patch. I can do that as a separate patch.

Yes please.

[..]
> >
> > >  	/*
> > >  	 * The folio may have been dirtied again, invalidate the
> > >  	 * possibly stale entry before inserting the new entry.
> > >  	 */
> > > -	if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > > -		zswap_invalidate_entry(tree, dupentry);
> > > -		WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > > +	err = zswap_xa_insert(tree, entry, &old);
> > > +	if (old)
> > > +		zswap_entry_free(old);
> > > +	if (err) {
> > > +		old_erased = true;
> >
> > I think this can be made simpler if we open code xa_store() here,
> > especially that we already have cleanup code below under 'check_old'
> > that removes the existing old entry. So zswap_xa_insert() replicates
> > this cleanup, then we add this 'old_erased' boolean to avoid doing the
> > cleanup below.
> > It seems like it would be much more straightforward with
> > open-coding xa_store() here and relying on the existing cleanup for the
> > old entry. Also, if we initialize 'old' to NULL, we can use its value
> > to figure out whether any cleanup is needed under 'check_old' or not.
>
> I think that is very similar to what Chengming was suggesting.
>
> >
> > Taking a step back, I think we can further simplify this. What if we
> > move the tree insertion to right after we allocate the zswap entry? In
> > this case, if the tree insertion fails, we don't need to decrement the
> > same filled counter. If the tree insertion succeeds and then something
> > else fails, the existing cleanup code under 'check_old' will already
> > clean up the tree insertion for us.
>
> That will create complications that, if the zswap compression fails
> the compression ratio, you will have to remove the entry from the tree
> as clean up. You have both xa_store() and xa_erase() where the current
> code just does one xa_erase() on compression failure.

Not really. If xa_store() fails because of -ENOMEM, then I think by
definition we do not need xa_erase() as there shouldn't be any stale
entries. I also think -ENOMEM should be the only valid errno from
xa_store() in this context. So we can avoid the check_old code if
xa_store() is called (whether it fails or succeeds) IIUC.

I prefer calling xa_store() early and avoiding the extra 'insert_failed'
cleanup code, especially since, unlike other cleanup code, it has its own
branching based on entry->length. I am also planning a cleanup for
zswap_store() to split the code better for the same_filled case and
avoid some unnecessary checks and failures, so it would be useful to
keep the common code path together.

>
> >
> > If this works, we don't need to add extra cleanup code or move any code
> > around. Something like:
>
> Due to the extra xa_insert() on compression failure, I think
> Chengming's or your earlier suggestion is better.
>
> BTW, while you are here, can you confirm this race discussed in
> earlier email can't happen? Chengming convinced me this shouldn't
> happen. Like to hear your thoughts.
>
> CPU1                        CPU2
>
> xa_store()
>                             entry = xa_erase()
>                             zswap_free_entry(entry)
>
> if (entry->length)
>    ...
> CPU1 is using entry after free.

IIUC, CPU1 is in zswap_store(), and CPU2 could be in either
zswap_invalidate() or zswap_load().

For zswap_load(), I think synchronization is done in the core swap code
to ensure we are not doing parallel swapin/swapout at the same entry,
right? In this specific case, I think the folio would be in the
swapcache while swapout (i.e. zswap_store()) is ongoing, so any swapins
will read the folio and not call zswap_load().

Actually, if we do not prevent parallel swapin/swapout at the same
entry, I suspect we may have problems even outside of zswap. For
example, we may read a partially written swap entry from disk, right? Or
does the block layer synchronize this somehow?

For zswap_invalidate(), the core swap code calls it when the swap entry
is no longer used and before we free it for reuse, so IIUC parallel
swapouts (i.e. zswap_store()) should not be possible here as well.
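
Going back to the xa_store() error handling discussed earlier in this reply, the following is a minimal userspace sketch of the classification I have in mind: count -ENOMEM as an allocation failure and treat anything else as a "should not happen" case. The names toy_store_ok(), reject_alloc_fail and unexpected_errors are hypothetical, and the fprintf() merely stands in for where the kernel code would use WARN_ON() or VM_WARN_ON(). It is not the actual zswap_store() code.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical counter mirroring zswap_reject_alloc_fail. */
static unsigned long reject_alloc_fail;
/* Hypothetical tally of errors that should never be seen in practice. */
static unsigned long unexpected_errors;

/*
 * Classify the result of a store into the tree.
 * 'err' is 0 on success or a negative errno, in the style of xa_err().
 * Returns 1 if the store succeeded, 0 if the store must be rejected.
 */
static int toy_store_ok(int err)
{
	if (!err)
		return 1;

	if (err == -ENOMEM) {
		/* The only failure expected in this context. */
		reject_alloc_fail++;
	} else {
		/* The kernel version would warn here instead of counting. */
		unexpected_errors++;
		fprintf(stderr, "unexpected store error %d\n", err);
	}
	return 0;
}
```

With this shape there is no need for a separate user-visible counter for the non-ENOMEM case; if it ever triggers, that is a bug to be reported rather than a statistic to be tracked.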