Re: [RFC 1/4] mm/zswap: skip swapcache for swapping in zswap pages

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Usama Arif <usamaarif642@gmail.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, david@redhat.com, willy@infradead.org,
	kanchana.p.sridhar@intel.com, nphamcs@gmail.com,
	chengming.zhou@linux.dev, ryan.roberts@arm.com,
	ying.huang@intel.com, 21cnbao@gmail.com, riel@surriel.com,
	shakeel.butt@linux.dev, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	Kairui Song <kasong@tencent.com>, Kairui Song <ryncsn@gmail.com>
Subject: Re: [RFC 1/4] mm/zswap: skip swapcache for swapping in zswap pages
Date: Tue, 22 Oct 2024 20:49:55 +0100	[thread overview]
Message-ID: <07e6018b-bc6f-4e1e-9bc3-07a4b5a384fc@gmail.com> (raw)
In-Reply-To: <CAJD7tkYEe10Xw-8azM-80pHv6YjvosZDHTdZfYttAuD5u1+s8A@mail.gmail.com>



On 21/10/2024 22:09, Yosry Ahmed wrote:
> On Fri, Oct 18, 2024 at 3:50 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> As mentioned in [1], there is a significant improvement in no
>> readahead swapin performance for super fast devices when skipping
>> swapcache.
> 
> FYI, Kairui was working on removing the swapcache bypass completely,
> which I think may be a good thing:
> https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/
> 
> However, that series is old, since before the large folio swapin
> support, so I am not sure if/when he intends to refresh it.
> 
> In his approach there is still a swapin path for synchronous swapin
> though, which we can still utilize for zswap.
> 
>>
>> With large folio zswapin support added in later patches, this will also
>> mean this path will also act as "readahead" by swapping in multiple
>> pages into large folios. further improving performance.
>>
>> [1] https://lore.kernel.org/all/1505886205-9671-5-git-send-email-minchan@kernel.org/T/#m5a792a04dfea20eb7af4c355d00503efe1c86a93
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>  include/linux/zswap.h |  6 ++++++
>>  mm/memory.c           |  3 ++-
>>  mm/page_io.c          |  1 -
>>  mm/zswap.c            | 46 +++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 54 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
>> index d961ead91bf1..e418d75db738 100644
>> --- a/include/linux/zswap.h
>> +++ b/include/linux/zswap.h
>> @@ -27,6 +27,7 @@ struct zswap_lruvec_state {
>>  unsigned long zswap_total_pages(void);
>>  bool zswap_store(struct folio *folio);
>>  bool zswap_load(struct folio *folio);
>> +bool zswap_present_test(swp_entry_t swp, int nr_pages);
>>  void zswap_invalidate(swp_entry_t swp);
>>  int zswap_swapon(int type, unsigned long nr_pages);
>>  void zswap_swapoff(int type);
>> @@ -49,6 +50,11 @@ static inline bool zswap_load(struct folio *folio)
>>         return false;
>>  }
>>
>> +static inline bool zswap_present_test(swp_entry_t swp, int nr_pages)
>> +{
>> +       return false;
>> +}
>> +
>>  static inline void zswap_invalidate(swp_entry_t swp) {}
>>  static inline int zswap_swapon(int type, unsigned long nr_pages)
>>  {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 03e5452dd0c0..49d243131169 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4289,7 +4289,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>         swapcache = folio;
>>
>>         if (!folio) {
>> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>> +               if ((data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
>> +                   zswap_present_test(entry, 1)) &&
>>                     __swap_count(entry) == 1) {
>>                         /* skip swapcache */
>>                         folio = alloc_swap_folio(vmf);
>> diff --git a/mm/page_io.c b/mm/page_io.c
>> index 4aa34862676f..2a15b197968a 100644
>> --- a/mm/page_io.c
>> +++ b/mm/page_io.c
>> @@ -602,7 +602,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>>         unsigned long pflags;
>>         bool in_thrashing;
>>
>> -       VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio);
>>         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>         VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 7f00cc918e7c..f4b03071b2fb 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -1576,6 +1576,52 @@ bool zswap_store(struct folio *folio)
>>         return ret;
>>  }
>>
>> +static bool swp_offset_in_zswap(unsigned int type, pgoff_t offset)
>> +{
>> +       return (offset >> SWAP_ADDRESS_SPACE_SHIFT) <  nr_zswap_trees[type];
> 
> I am not sure I understand what we are looking for here. When does
> this return false? Aren't the zswap trees always allocated during
> swapon?
> 

Hi Yosry,

Thanks for the review!

It becomes useful in patch 3 when trying to determine if a large folio can be allocated.

For e.g. if the swap entry is the last entry of the last tree, and 1M folios are enabled
(nr_pages = 256), then the while loop in zswap_present_test will try to access a tree
that doesn't exist from the 2nd 4K page onwards if we dont have this check in
zswap_present_test.

>> +}
>> +
>> +/* Returns true if the entire folio is in zswap */
> 
> There isn't really a folio at this point, maybe "Returns true if the
> entire range is in zswap"?

Will change, Thanks!

> 
> Also, this is racy because an exclusive load, invalidation, or
> writeback can cause an entry to be removed from zswap. Under what
> conditions is this safe? The caller can probably guarantee we don't
> race against invalidation, but can we guarantee that concurrent
> exclusive loads or writebacks don't happen?
> 
> If the answer is yes, this needs to be properly documented.

swapcache_prepare should stop things from becoming racy.

lets say trying to swapin a mTHP of size 32 pages:
- T1 is doing do_swap_page, T2 is doing zswap_writeback.
- T1 - Check if the entire 32 pages is in zswap, swapcache_prepare(entry, nr_pages) in do_swap_page is not yet called. 
- T2 - zswap_writeback_entry starts and lets say writes page 2 to swap. it calls __read_swap_cache_async -> swapcache_prepare increments swap_map count, writes page to swap.
- T1 - swapcache_prepare is then called and fails and then there will be a pagefault again for it.

I will try and document this better.

> 
>> +bool zswap_present_test(swp_entry_t swp, int nr_pages)
>> +{
>> +       pgoff_t offset = swp_offset(swp), tree_max_idx;
>> +       int max_idx = 0, i = 0, tree_offset = 0;
>> +       unsigned int type = swp_type(swp);
>> +       struct zswap_entry *entry = NULL;
>> +       struct xarray *tree;
>> +
>> +       while (i < nr_pages) {
>> +               tree_offset = offset + i;
>> +               /* Check if the tree exists. */
>> +               if (!swp_offset_in_zswap(type, tree_offset))
>> +                       return false;
>> +
>> +               tree = swap_zswap_tree(swp_entry(type, tree_offset));
>> +               XA_STATE(xas, tree, tree_offset);
> 
> Please do not mix declarations with code.
> 
>> +
>> +               tree_max_idx = tree_offset % SWAP_ADDRESS_SPACE_PAGES ?
>> +                       ALIGN(tree_offset, SWAP_ADDRESS_SPACE_PAGES) :
>> +                       ALIGN(tree_offset + 1, SWAP_ADDRESS_SPACE_PAGES);
> 
> Does this work if we always use ALIGN(tree_offset + 1,
> SWAP_ADDRESS_SPACE_PAGES)?

Yes, I think max_idx = min(offset + nr_pages, ALIGN(tree_offset + 1, SWAP_ADDRESS_SPACE_PAGES)) - 1;
will work. I will test it out, Thanks!


> 
>> +               max_idx = min(offset + nr_pages, tree_max_idx) - 1;
>> +               rcu_read_lock();
>> +               xas_for_each(&xas, entry, max_idx) {
>> +                       if (xas_retry(&xas, entry))
>> +                               continue;
>> +                       i++;
>> +               }
>> +               rcu_read_unlock();
>> +               /*
>> +                * If xas_for_each exits because entry is NULL and
> 
> nit: add () to the end of function names (i.e. xas_for_each())
> 
>> +                * the number of entries checked are less then max idx,
> 
> s/then/than
> 
>> +                * then zswap does not contain the entire folio.
>> +                */
>> +               if (!entry && offset + i <= max_idx)
>> +                       return false;
>> +       }
>> +
>> +       return true;
>> +}
>> +
>>  bool zswap_load(struct folio *folio)
>>  {
>>         swp_entry_t swp = folio->swap;
>> --
>> 2.43.5
>>

next prev parent reply	other threads:[~2024-10-22 19:49 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-18 10:48 [RFC 0/4] mm: zswap: add support for zswapin of large folios Usama Arif
2024-10-18 10:48 ` [RFC 1/4] mm/zswap: skip swapcache for swapping in zswap pages Usama Arif
2024-10-21 21:09   ` Yosry Ahmed
2024-10-22 19:49     ` Usama Arif [this message]
2024-10-23  0:45       ` Yosry Ahmed
2024-10-25 18:19         ` Nhat Pham
2024-10-25 19:10           ` Yosry Ahmed
2024-10-21 21:11   ` Yosry Ahmed
2024-10-22 19:59     ` Usama Arif
2024-10-23  0:47       ` Yosry Ahmed
2024-10-18 10:48 ` [RFC 2/4] mm/zswap: modify zswap_decompress to accept page instead of folio Usama Arif
2024-10-18 10:48 ` [RFC 3/4] mm/zswap: add support for large folio zswapin Usama Arif
2024-10-21  5:49   ` Barry Song
2024-10-21 10:44     ` Usama Arif
2024-10-21 10:55       ` Barry Song
2024-10-21 12:21         ` Usama Arif
2024-10-21 20:28           ` Barry Song
2024-10-21 20:57             ` Usama Arif
2024-10-21 21:34               ` Yosry Ahmed
2024-10-18 10:48 ` [RFC 4/4] mm/zswap: count successful large folio zswap loads Usama Arif
2024-10-21  5:09 ` [RFC 0/4] mm: zswap: add support for zswapin of large folios Barry Song
2024-10-21 10:40   ` Usama Arif
2024-10-22 15:26     ` Usama Arif
2024-10-22 20:46       ` Barry Song
2024-10-22 21:17         ` Usama Arif
2024-10-22 22:07           ` Barry Song
2024-10-23 10:26             ` Barry Song
2024-10-23 10:48               ` Usama Arif
2024-10-23 13:08                 ` Usama Arif
2024-10-23 18:02                   ` Yosry Ahmed
2024-10-23 18:31                     ` Usama Arif
2024-10-23 18:52                       ` Barry Song
2024-10-23 19:47                         ` Usama Arif
2024-10-23 20:36                           ` Barry Song
2024-10-23 23:35                           ` Barry Song
2024-10-24 14:29                             ` Johannes Weiner
2024-10-24 17:48                               ` Barry Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=07e6018b-bc6f-4e1e-9bc3-07a4b5a384fc@gmail.com \
    --to=usamaarif642@gmail.com \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chengming.zhou@linux.dev \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=kanchana.p.sridhar@intel.com \
    --cc=kasong@tencent.com \
    --cc=kernel-team@meta.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=ryncsn@gmail.com \
    --cc=shakeel.butt@linux.dev \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.