From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 7 Mar 2024 21:28:51 +0000
Subject: Re: [PATCH v4] zswap: replace RB tree with xarray
From: Yosry Ahmed
To: Chris Li
Cc: Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Nhat Pham, Johannes Weiner, "Matthew Wilcox (Oracle)", Chengming Zhou,
 Barry Song
References: <20240304-zswap-xarray-v4-1-c4b45670cc30@kernel.org>
Content-Type: text/plain; charset="us-ascii"
[..]
> > > > I am also not sure what the need for zswap_reject_xarray_fail is. Are
> > > > there any reasons why the store here can fail other than -ENOMEM? The
> > > > docs say the only other option is -EINVAL, and looking at __xa_store(),
> > > > it seems like this is only possible if xa_is_internal() is true (which
> > > > means we are not passing in a properly aligned pointer IIUC).
> > >
> > > Because the xa_store() documentation says it can return two error
> > > codes, and zswap tries to classify each kind of failure it hits with
> > > a counter. That is why I added zswap_reject_xarray_fail.
> >
> > Right, but I think we should not get -EINVAL in this case. I think it
> > would be more appropriate to have WARN_ON() or VM_WARN_ON() in this
> > case?
>
> In the context of the existing zswap code, zswap keeps track of
> different failure reason counters, so that we can read a failure
> counter and have a good clue about where in the source code the
> failure originated.

zswap keeps track of different *valid* failure reasons. I think in this
case this would be a bug, which warrants a warning rather than a counter
(the same reasoning for why we removed the duplicate entry counter).

> If the documentation says it can return -EINVAL, I wouldn't just look
> at the current implementation and assume it does not generate -EINVAL.

My assumption is that a properly-aligned pointer will always be valid to
store in the xarray, so the only valid return error is -ENOMEM, but
perhaps Matthew will correct me here. Maybe we can have an xa_is_valid()
helper to check this separately and guarantee no -EINVAL returns, but I
*think* the fact that it is a slab-allocated pointer may be enough.

> What if the zswap entry allocation has a bug that returns some strange
> alignment that makes the xarray unhappy? There is a lot of assumption
> about which error codes can or cannot show up. Just to be consistent
> with the rest of the zswap code, I gave it a counter to indicate that
> it is the xarray causing the failure. That is the safe thing to do. It
> is a good thing if this kind of error branch never hits, but if it
> does, it gives better information for debugging. It is just following
> the spirit of the rest of the zswap error handling. What do you say?

This is exactly my point, these counters are not for bugs. We should
have a warning if there is a bug.
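Something like this is what I have in mind -- an untested sketch, where
zswap_xa_insert() and the struct fields are placeholders standing in for
the v4 patch's code, not its literal implementation:

	static int zswap_xa_insert(struct zswap_tree *tree,
				   struct zswap_entry *entry,
				   struct zswap_entry **old)
	{
		void *e = xa_store(&tree->xarray, swp_offset(entry->swpentry),
				   entry, GFP_KERNEL);

		if (xa_is_err(e)) {
			int err = xa_err(e);

			/*
			 * -ENOMEM is the only failure we expect at runtime.
			 * -EINVAL would mean we handed the xarray an invalid
			 * pointer, i.e. a bug: warn, don't count.
			 */
			WARN_ON_ONCE(err != -ENOMEM);
			return err;
		}
		*old = e;
		return 0;
	}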
[..]
> > > > > +	e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
> > > > > +	if (e != entry) {
> > > >
> > > > I think we can avoid adding 'e' and 'offset' local variables here and
> > > > just do everything in the if condition. If you want to avoid the line
> > > > break, then introducing 'offset' is fine, but I don't see any value
> > > > from 'e'.
> > >
> > > As I said in my other email, I don't think having this type of local
> > > variable affects the compiler negatively. The compiler generally uses
> > > its own temporary to track the expression anyway. So I am not sure
> > > about the motivation to remove local variables alone, if it helps the
> > > reading. I feel the line "if (xa_cmpxchg(tree, offset, entry, NULL,
> > > GFP_KERNEL) != entry)" is too long and complicated inside the if
> > > condition. That is just me. Not a big deal.
> >
> > I just think 'e' is not providing any readability improvements. If
> > anything, people need to pay closer attention to figure out 'e' is
> > only a temp variable and 'entry' is the real deal.
>
> Well, 'e' is assigned in the previous line and used in the very next
> line (only once). I can hardly see this impacting anyone's
> understanding of what it is doing. I can also come up with a reason
> (for argument's sake) that it maps to the machine code better: in the
> generated machine code, you see the xa_cmpxchg() first, then the "if"
> branch. You can also ask the debugger for the current value of "e",
> and which register it is mapped to if it is still in the frame
> context. But that is not important.
>
> There are also C macro symbols that are defined and then used just
> once. Are we following a rule to remove all symbols that are defined
> and used only once?
>
> Anyway, the bottom line is that it generates the exact same machine
> code. Readability is subjective; there is some personal preference in
> it. IMHO the difference in readability is so trivial here that it does
> not justify flipping one way or the other in this case.

My argument was solely based on readability. I agree there is no
difference in machine code, and that this is personal preference, which
is why I stated my vote below (and I believe Chengming had the same
preference). This is not a blocker either way :)
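For reference, the two forms under discussion compile to the same thing;
'tree', 'offset', and 'entry' are as in the quoted hunk, and 'fail' is
just an illustrative label:

	/* Form 1 (the patch): assign to a temporary, then test. */
	e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
	if (e != entry)
		goto fail;

	/* Form 2 (suggested): do everything inside the condition. */
	if (xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL) != entry)
		goto fail;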
[..]
> > > > > 	/*
> > > > > 	 * The folio may have been dirtied again, invalidate the
> > > > > 	 * possibly stale entry before inserting the new entry.
> > > > > 	 */
> > > > > -	if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > > > > -		zswap_invalidate_entry(tree, dupentry);
> > > > > -		WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > > > > +	err = zswap_xa_insert(tree, entry, &old);
> > > > > +	if (old)
> > > > > +		zswap_entry_free(old);
> > > > > +	if (err) {
> > > > > +		old_erased = true;
> > > >
> > > > I think this can be made simpler if we open code xa_store() here,
> > > > especially since we already have cleanup code below under 'check_old'
> > > > that removes the existing old entry. So zswap_xa_insert() replicates
> > > > this cleanup, and then we add this 'old_erased' boolean to avoid
> > > > doing the cleanup below. It seems like it would be much more
> > > > straightforward to open-code xa_store() here and rely on the existing
> > > > cleanup for the old entry. Also, if we initialize 'old' to NULL, we
> > > > can use its value to figure out whether any cleanup is needed under
> > > > 'check_old' or not.
> > >
> > > I think that is very similar to what Chengming was suggesting.
> > >
> > > > Taking a step back, I think we can further simplify this. What if we
> > > > move the tree insertion to right after we allocate the zswap entry?
> > > > In this case, if the tree insertion fails, we don't need to decrement
> > > > the same_filled counter. If the tree insertion succeeds and then
> > > > something else fails, the existing cleanup code under 'check_old'
> > > > will already clean up the tree insertion for us.
> > >
> > > That will create complications: if zswap compression fails to meet
> > > the compression ratio, you will have to remove the entry from the
> > > tree as cleanup. You have both xa_store() and xa_erase(), where the
> > > current code just does one xa_erase() on compression failure.
> >
> > Not really. If xa_store() fails because of -ENOMEM, then I think by
> > definition we do not need xa_erase() as there shouldn't be any stale
> > entries. I also think -ENOMEM should be the only valid errno from
> > xa_store() in this context. So we can avoid the check_old code if
> > xa_store() is called (whether it fails or succeeds) IIUC.
> >
> > I prefer calling xa_store() early and avoiding the extra
> > 'insert_failed' cleanup code, especially since, unlike other cleanup
> > code, it has its own branching based on entry->length. I am also
> > planning a cleanup for zswap_store() to split the code better for the
> > same_filled case and avoid some unnecessary checks and failures, so it
> > would be useful to keep the common code path together.
>
> I would like to point out that, in real-world usage, we will likely
> see more pages fail due to compression than due to xa_store() failure.
> There is a non-zero percentage of pages that are not compressible, and
> -ENOMEM from xa_store() is much, much rarer than a non-compressible
> page. So in real-world running, inserting first means one more
> xa_store() call than compressing first. I would argue that compressing
> first performs slightly better, because xa_store() needs to take the
> spin lock; avoiding the extra xa_store() avoids taking the lock
> altogether. There are simply more compression failures than xa_store()
> failures.
>
> It also feels slightly strange to insert the entry into the xarray
> when the entry is not ready to use. I don't think that is triggerable
> as a bug right now, but it makes me feel a little bit strange.

Good point. I don't think the performance difference will be
significant, but you are right. Let's keep the insertion here and we
can revisit this later when we clean up the same_filled code.

nit: please rename the cleanup label from 'insert_failed' to
'store_failed' :)
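So, roughly this shape for zswap_store() -- my shorthand for the v4
code, not the literal hunk, with zswap_compress() and the 'put_pool'
label assumed from the current zswap.c:

	/*
	 * Compress first: compression failures are far more common than
	 * an xarray -ENOMEM, so don't take the tree lock for pages we
	 * are going to reject anyway.
	 */
	if (!zswap_compress(folio, entry))
		goto put_pool;

	/*
	 * The folio may have been dirtied again, invalidate the
	 * possibly stale entry before inserting the new entry.
	 */
	err = zswap_xa_insert(tree, entry, &old);
	if (old)
		zswap_entry_free(old);
	if (err)
		goto store_failed;	/* renamed from 'insert_failed' */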
[..]
> > > BTW, while you are here, can you confirm that the race discussed in
> > > an earlier email can't happen? Chengming convinced me it shouldn't
> > > happen. I'd like to hear your thoughts:
> > >
> > > CPU1                            CPU2
> > >
> > > xa_store()
> > >                                 entry = xa_erase()
> > >                                 zswap_free_entry(entry)
> > > if (entry->length)
> > >         ...
> > >
> > > CPU1 is using the entry after free.
> >
> > IIUC, CPU1 is in zswap_store(), and CPU2 could be in either
> > zswap_invalidate() or zswap_load().
> >
> > For zswap_load(), I think synchronization is done in the core swap
> > code to ensure we are not doing parallel swapin/swapout at the same
> > entry, right? In this specific case, I think the folio would be in
> > the swapcache while swapout (i.e. zswap_store()) is ongoing, so any
> > swapins will read the folio and not call zswap_load().
> >
> > Actually, if we do not prevent parallel swapin/swapout at the same
> > entry, I suspect we may have problems even outside of zswap. For
> > example, we may read a partially written swap entry from disk, right?
> > Or does the block layer synchronize this somehow?
>
> I don't think the block layer is doing the synchronization here. Zswap
> does not go to the block layer at all. I think after the lookup in the
> swap cache, the folio will be locked, and that blocks the other paths.
> I would like to hear others' thoughts on that. It needs more auditing.

I did not mean that the block layer is doing any synchronization here.
It comes from the swap cache as I outlined. I think it's safe.

I was just wondering what happens with parallel swapin/swapout if there
is no swap cache synchronization for normal swapfiles. I was wondering
if the block layer would handle the synchronization, in which case
zswap may have different requirements than normal swapfiles. Just
something that seemed good to know :)

> > For zswap_invalidate(), the core swap code calls it when the swap
> > entry is no longer used and before we free it for reuse, so IIUC
> > parallel swapouts (i.e. zswap_store()) should not be possible here
> > as well.
>
> There is also the writeback path, but I don't think it is triggerable,
> because the entry has not been inserted into the LRU list yet.

Ah yes, I looked at the callsites of xa_erase(), and there is an
xa_cmpxchg() there. Actually, I think the swap cache synchronizes here
too. zswap_writeback_entry() is like swapin, so this case is like two
parallel swapins. Both of them should call __read_swap_cache_async(),
and only one of them will actually do the swapin.

If zswap_writeback_entry() wins the race, zswap_load() will not be
called in the first place; the swap cache entry will be used. If swapin
(i.e. zswap_load()) wins the race, zswap_writeback_entry() should skip.
Right?
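To spell out the writeback side I mean -- heavily condensed, with
__read_swap_cache_async()'s parameter list abbreviated and the exact
error codes illustrative rather than taken from the patch:

	bool folio_was_new;

	/* Like swapin: only one caller gets to populate the swap cache. */
	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, ...,
					&folio_was_new);
	if (!folio_was_new) {
		/* Swapin won the race; the cached folio will be used. */
		folio_put(folio);
		return -EEXIST;
	}

	/*
	 * Only free the entry if it is still the one we looked up. A
	 * concurrent store or invalidation makes this xa_cmpxchg() fail
	 * instead of letting us use or free a stale entry.
	 */
	if (xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL) != entry)
		return -ENOMEM;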