* [syzbot] [mm?] WARNING in zswap_swapoff
@ 2024-08-16 19:52 syzbot
  2024-08-19 20:12 ` Yosry Ahmed
  0 siblings, 1 reply; 19+ messages in thread

From: syzbot @ 2024-08-16 19:52 UTC (permalink / raw)
To: akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs,
	syzkaller-bugs, yosryahmed

Hello,

syzbot found the following issue on:

HEAD commit:    367b5c3d53e5 Add linux-next specific files for 20240816
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467
dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz
kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com

------------[ cut here ]------------
WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
Modules linked in:
CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700
Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c
RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293
RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00
RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248
RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac
R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000
R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002
FS:  00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __do_sys_swapoff mm/swapfile.c:2837 [inline]
 __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f9e629feb37
Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37
RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20
RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94
R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188
R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80
 </TASK>

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-16 19:52 [syzbot] [mm?] WARNING in zswap_swapoff syzbot @ 2024-08-19 20:12 ` Yosry Ahmed 2024-08-20 8:47 ` Kairui Song 0 siblings, 1 reply; 19+ messages in thread From: Yosry Ahmed @ 2024-08-19 20:12 UTC (permalink / raw) To: syzbot Cc: akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs, syzkaller-bugs, Chris Li, Kairui Song, Ying, Barry Song, Ryan Roberts On Fri, Aug 16, 2024 at 12:52 PM syzbot <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > Hello, > > syzbot found the following issue on: > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 > git tree: linux-next > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > Unfortunately, I don't have any reproducer for this issue yet. 
> > Downloadable assets: > disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz > vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz > kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com > > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > Modules linked in: > CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024 > RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c > RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293 > RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00 > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248 > RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac > R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000 > R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002 > FS: 00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > __do_sys_swapoff mm/swapfile.c:2837 [inline] > __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > 
entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7f9e629feb37 > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48 > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8 > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37 > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20 > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94 > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188 > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80 > </TASK> I am hoping syzbot would find a reproducer and bisect this for us. Meanwhile, from a high-level it looks to me like we are missing a zswap_invalidate() call in some paths. If I have to guess, I would say it's related to the latest mTHP swap changes, but I am not following closely. Perhaps one of the following things happened: (1) We are not calling zswap_invalidate() in some invalidation paths. It used to not be called for the cluster freeing path, so maybe we end up with some order-0 swap entries in a cluster? or maybe there is an entirely new invalidation path that does not go through free_swap_slot() for order-0 entries? (2) Some higher order swap entries (i.e. a cluster) end up in zswap somehow. zswap_store() has a warning to cover that though. Maybe somehow some swap entries are allocated as a cluster, but then pages are swapped out one-by-one as order-0 (which can go to zswap), but then we still free the swap entries as a cluster? I am not closely following the latest changes so I am not sure. CCing folks who have done work in that area recently. I am starting to think maybe it would be more reliable to just call zswap_invalidate() for all freed swap entries anyway. Would that be too expensive? 
We used to do that before the zswap_invalidate() call was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry when swap entry free"), and that was before we started using the xarray (so it was arguably worse than it would be now). ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-19 20:12 ` Yosry Ahmed @ 2024-08-20 8:47 ` Kairui Song 2024-08-20 9:02 ` Kairui Song 2024-08-20 9:22 ` Barry Song 0 siblings, 2 replies; 19+ messages in thread From: Kairui Song @ 2024-08-20 8:47 UTC (permalink / raw) To: Yosry Ahmed, Barry Song Cc: syzbot, akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs, syzkaller-bugs, Chris Li, Ying, Ryan Roberts [-- Attachment #1: Type: text/plain, Size: 6339 bytes --] On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > On Fri, Aug 16, 2024 at 12:52 PM syzbot > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > > > Hello, > > > > syzbot found the following issue on: > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 I can't find this commit, seems this commit is not in linux-next any more? > > git tree: linux-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > Downloadable assets: > > disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz > > vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz > > kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com > > > > ------------[ cut here ]------------ > > WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > > Modules linked in: > > CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024 > > RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > > Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c > > RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293 > > RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00 > > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248 > > RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac > > R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000 > > R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002 > > FS: 00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > Call Trace: > > <TASK> > > __do_sys_swapoff mm/swapfile.c:2837 [inline] > > __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] 
> > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > RIP: 0033:0x7f9e629feb37 > > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48 > > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8 > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37 > > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20 > > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94 > > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188 > > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80 > > </TASK> > > I am hoping syzbot would find a reproducer and bisect this for us. > Meanwhile, from a high-level it looks to me like we are missing a > zswap_invalidate() call in some paths. > > If I have to guess, I would say it's related to the latest mTHP swap > changes, but I am not following closely. Perhaps one of the following > things happened: > > (1) We are not calling zswap_invalidate() in some invalidation paths. > It used to not be called for the cluster freeing path, so maybe we end > up with some order-0 swap entries in a cluster? or maybe there is an > entirely new invalidation path that does not go through > free_swap_slot() for order-0 entries? > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap > somehow. zswap_store() has a warning to cover that though. Maybe > somehow some swap entries are allocated as a cluster, but then pages > are swapped out one-by-one as order-0 (which can go to zswap), but > then we still free the swap entries as a cluster? Hi Yosry, thanks for the report. There are many mTHP related optimizations recently, for this problem I can reproduce this locally. 
Can confirm the problem is gone for me after reverting:

"mm: attempt to batch free swap entries for zap_pte_range()"

Hi Barry,

If a set of continuous slots are having the same value, they are
considered a mTHP and freed, bypassing the slot cache, and causing
zswap leak. This didn't happen in put_swap_folio because that function
is expecting an actual mTHP folio behind the slots, but
free_swap_and_cache_nr is simply walking the slots.

For the testing, I actually have to disable mTHP, because linux-next
will panic with mTHP due to the lack of the following fixes:
https://lore.kernel.org/linux-mm/a4b1b34f-0d8c-490d-ab00-eaedbf3fe780@gmail.com/
https://lore.kernel.org/linux-mm/403b7f3c-6e5b-4030-ab1c-3198f36e3f73@gmail.com/

>
> I am not closely following the latest changes so I am not sure. CCing
> folks who have done work in that area recently.
>
> I am starting to think maybe it would be more reliable to just call
> zswap_invalidate() for all freed swap entries anyway. Would that be
> too expensive? We used to do that before the zswap_invalidate() call
> was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry
> when swap entry free"), and that was before we started using the
> xarray (so it was arguably worse than it would be now).
>

That might be a good idea. I suggest moving zswap_invalidate to
swap_range_free and calling it for every freed slot.

The patch below can be squashed into, or put before, "mm: attempt to
batch free swap entries for zap_pte_range()".

[-- Attachment #2: 0001-mm-swap-sanitize-zswap-invalidating.patch --]
[-- Type: application/octet-stream, Size: 1840 bytes --]

From 7e07736deb955b6fe1390e3c67751da77796b660 Mon Sep 17 00:00:00 2001
From: Kairui Song <kasong@tencent.com>
Date: Tue, 20 Aug 2024 16:19:45 +0800
Subject: [PATCH] mm: swap: sanitize zswap invalidating

From: Kairui Song <ryncsn@gmail.com>

ZSWAP doesn't support mTHP/THP yet, so we are only invalidating order 0
entries. But things will change soon, so call the invalidation for
every slot that is freed.

Signed-off-by: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_slots.c | 3 ---
 mm/swapfile.c   | 4 +---
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 13ab3b771409..d7bb3caa9d4e 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -273,9 +273,6 @@ void free_swap_slot(swp_entry_t entry)
 {
 	struct swap_slots_cache *cache;
 
-	/* Large folio swap slot is not covered. */
-	zswap_invalidate(entry);
-
 	cache = raw_cpu_ptr(&swp_slots);
 	if (likely(use_swap_slot_cache && cache->slots_ret)) {
 		spin_lock_irq(&cache->free_lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f947f4dd31a9..31ca8b15a8da 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -242,9 +242,6 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 		folio_set_dirty(folio);
 
 	spin_lock(&si->lock);
-	/* Only sinple page folio can be backed by zswap */
-	if (nr_pages == 1)
-		zswap_invalidate(entry);
 	swap_entry_range_free(si, entry, nr_pages);
 	spin_unlock(&si->lock);
 	ret = nr_pages;
@@ -956,6 +953,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	else
 		swap_slot_free_notify = NULL;
 	while (offset <= end) {
+		zswap_invalidate(swp_entry(si->type, offset));
 		arch_swap_invalidate_page(si->type, offset);
 		if (swap_slot_free_notify)
 			swap_slot_free_notify(si->bdev, offset);
-- 
2.45.2

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-20 8:47 ` Kairui Song @ 2024-08-20 9:02 ` Kairui Song 2024-08-21 5:49 ` Barry Song 2024-08-20 9:22 ` Barry Song 1 sibling, 1 reply; 19+ messages in thread From: Kairui Song @ 2024-08-20 9:02 UTC (permalink / raw) To: Yosry Ahmed, Barry Song Cc: syzbot, akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs, syzkaller-bugs, Chris Li, Ying, Ryan Roberts On Tue, Aug 20, 2024 at 4:47 PM Kairui Song <ryncsn@gmail.com> wrote: > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Fri, Aug 16, 2024 at 12:52 PM syzbot > > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > > > > > Hello, > > > > > > syzbot found the following issue on: > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 > > I can't find this commit, seems this commit is not in linux-next any more? > > > > git tree: linux-next > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > > Downloadable assets: > > > disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz > > > vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz > > > kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > Reported-by: syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com > > > > > > ------------[ cut here ]------------ > > > WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > > > Modules linked in: > > > CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0 > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024 > > > RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > > > Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c > > > RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293 > > > RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00 > > > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248 > > > RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac > > > R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000 > > > R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002 > > > FS: 00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > Call Trace: > > > <TASK> > > > __do_sys_swapoff mm/swapfile.c:2837 [inline] > > > __se_sys_swapoff+0x4653/0x4cf0 
mm/swapfile.c:2706 > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > RIP: 0033:0x7f9e629feb37 > > > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48 > > > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8 > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37 > > > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20 > > > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94 > > > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188 > > > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80 > > > </TASK> > > > > I am hoping syzbot would find a reproducer and bisect this for us. > > Meanwhile, from a high-level it looks to me like we are missing a > > zswap_invalidate() call in some paths. > > > > If I have to guess, I would say it's related to the latest mTHP swap > > changes, but I am not following closely. Perhaps one of the following > > things happened: > > > > (1) We are not calling zswap_invalidate() in some invalidation paths. > > It used to not be called for the cluster freeing path, so maybe we end > > up with some order-0 swap entries in a cluster? or maybe there is an > > entirely new invalidation path that does not go through > > free_swap_slot() for order-0 entries? > > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap > > somehow. zswap_store() has a warning to cover that though. Maybe > > somehow some swap entries are allocated as a cluster, but then pages > > are swapped out one-by-one as order-0 (which can go to zswap), but > > then we still free the swap entries as a cluster? > > Hi Yosry, thanks for the report. 
> > There are many mTHP related optimizations recently, for this problem I > can reproduce this locally. Can confirm the problem is gone for me > after reverting: > > "mm: attempt to batch free swap entries for zap_pte_range()" > > Hi Barry, > > If a set of continuous slots are having the same value, they are > considered a mTHP and freed, bypassing the slot cache, and causing > zswap leak. > This didn't happen in put_swap_folio because that function is > expecting an actual mTHP folio behind the slots but > free_swap_and_cache_nr is simply walking the slots. > > For the testing, I actually have to disable mTHP, because linux-next > will panic with mTHP due to lack of following fixes: > https://lore.kernel.org/linux-mm/a4b1b34f-0d8c-490d-ab00-eaedbf3fe780@gmail.com/ > https://lore.kernel.org/linux-mm/403b7f3c-6e5b-4030-ab1c-3198f36e3f73@gmail.com/ > > > > > I am not closely following the latest changes so I am not sure. CCing > > folks who have done work in that area recently. > > > > I am starting to think maybe it would be more reliable to just call > > zswap_invalidate() for all freed swap entries anyway. Would that be > > too expensive? We used to do that before the zswap_invalidate() call > > was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry > > when swap entry free"), and that was before we started using the > > xarray (so it was arguably worse than it would be now). > > > > That might be a good idea, I suggest moving zswap_invalidate to > swap_range_free and call it for every freed slot. > > Below patch can be squash into or put before "mm: attempt to batch > free swap entries for zap_pte_range()". Hmm, on second thought, the commit message in the attachment commit might be not suitable, current zswap_invalidate is also designed to only work for order 0 ZSWAP, so things are not clean even after this. And for performance, it will cause unnecessary heavier contention for the mTHP page on ZSWAP Xarray. 
It does fix the leak, though. Please ignore this fix; let's try to find
a better one.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-20 9:02 ` Kairui Song @ 2024-08-21 5:49 ` Barry Song 2024-08-21 6:42 ` Kairui Song 0 siblings, 1 reply; 19+ messages in thread From: Barry Song @ 2024-08-21 5:49 UTC (permalink / raw) To: ryncsn Cc: 21cnbao, akpm, chengming.zhou, chrisl, hannes, linux-kernel, linux-mm, nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs, ying.huang, yosryahmed On Tue, Aug 20, 2024 at 9:02 PM Kairui Song <ryncsn@gmail.com> wrote: > > On Tue, Aug 20, 2024 at 4:47 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot > > > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > > > > > > > Hello, > > > > > > > > syzbot found the following issue on: > > > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 > > > > I can't find this commit, seems this commit is not in linux-next any more? > > > > > > git tree: linux-next > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > > > > Downloadable assets: > > > > disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz > > > > vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz > > > > kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > > Reported-by: syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com > > > > > > > > ------------[ cut here ]------------ > > > > WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > > > > Modules linked in: > > > > CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0 > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024 > > > > RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > > > > Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c > > > > RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293 > > > > RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00 > > > > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248 > > > > RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac > > > > R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000 > > > > R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002 > > > > FS: 00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000 > > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > > CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0 > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > > Call Trace: > > > > <TASK> > > > > __do_sys_swapoff 
mm/swapfile.c:2837 [inline] > > > > __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706 > > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > RIP: 0033:0x7f9e629feb37 > > > > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48 > > > > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8 > > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37 > > > > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20 > > > > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94 > > > > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188 > > > > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80 > > > > </TASK> > > > > > > I am hoping syzbot would find a reproducer and bisect this for us. > > > Meanwhile, from a high-level it looks to me like we are missing a > > > zswap_invalidate() call in some paths. > > > > > > If I have to guess, I would say it's related to the latest mTHP swap > > > changes, but I am not following closely. Perhaps one of the following > > > things happened: > > > > > > (1) We are not calling zswap_invalidate() in some invalidation paths. > > > It used to not be called for the cluster freeing path, so maybe we end > > > up with some order-0 swap entries in a cluster? or maybe there is an > > > entirely new invalidation path that does not go through > > > free_swap_slot() for order-0 entries? > > > > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap > > > somehow. zswap_store() has a warning to cover that though. 
Maybe > > > somehow some swap entries are allocated as a cluster, but then pages > > > are swapped out one-by-one as order-0 (which can go to zswap), but > > > then we still free the swap entries as a cluster? > > > > Hi Yosry, thanks for the report. > > > > There are many mTHP related optimizations recently, for this problem I > > can reproduce this locally. Can confirm the problem is gone for me > > after reverting: > > > > "mm: attempt to batch free swap entries for zap_pte_range()" > > > > Hi Barry, > > > > If a set of continuous slots are having the same value, they are > > considered a mTHP and freed, bypassing the slot cache, and causing > > zswap leak. > > This didn't happen in put_swap_folio because that function is > > expecting an actual mTHP folio behind the slots but > > free_swap_and_cache_nr is simply walking the slots. > > > > For the testing, I actually have to disable mTHP, because linux-next > > will panic with mTHP due to lack of following fixes: > > https://lore.kernel.org/linux-mm/a4b1b34f-0d8c-490d-ab00-eaedbf3fe780@gmail.com/ > > https://lore.kernel.org/linux-mm/403b7f3c-6e5b-4030-ab1c-3198f36e3f73@gmail.com/ > > > > > > > > I am not closely following the latest changes so I am not sure. CCing > > > folks who have done work in that area recently. > > > > > > I am starting to think maybe it would be more reliable to just call > > > zswap_invalidate() for all freed swap entries anyway. Would that be > > > too expensive? We used to do that before the zswap_invalidate() call > > > was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry > > > when swap entry free"), and that was before we started using the > > > xarray (so it was arguably worse than it would be now). > > > > > > > That might be a good idea, I suggest moving zswap_invalidate to > > swap_range_free and call it for every freed slot. > > > > Below patch can be squash into or put before "mm: attempt to batch > > free swap entries for zap_pte_range()". 
> Hmm, on second thought, the commit message in the attached commit
> might not be suitable; the current zswap_invalidate is also designed
> to only work for order-0 ZSWAP, so things are not clean even after this.

Kairui, what about the below? We don't touch the path of
__try_to_reclaim_swap(), where you have one folio backing the slots?

diff --git a/mm/swapfile.c b/mm/swapfile.c
index c1638a009113..8ff58be40544 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1514,6 +1514,8 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	unlock_cluster_or_swap_info(si, ci);
 
 	if (!has_cache) {
+		for (i = 0; i < nr; i++)
+			zswap_invalidate(swp_entry(si->type, offset + i));
 		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, nr);
 		spin_unlock(&si->lock);

> And for performance, it will cause unnecessarily heavier contention
> for the mTHP pages on the ZSWAP xarray. It does fix the leak though;
> please ignore this fix, let's try to find a better one.

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff
  2024-08-21  5:49 ` Barry Song
@ 2024-08-21  6:42 ` Kairui Song
  2024-08-21  7:38   ` Barry Song
  2024-08-22 18:12   ` Yosry Ahmed
  0 siblings, 2 replies; 19+ messages in thread
From: Kairui Song @ 2024-08-21 6:42 UTC (permalink / raw)
To: Barry Song
Cc: akpm, chengming.zhou, chrisl, hannes, linux-kernel, linux-mm,
    nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs,
    ying.huang, yosryahmed

On Wed, Aug 21, 2024 at 1:49 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Aug 20, 2024 at 9:02 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Tue, Aug 20, 2024 at 4:47 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot
> > > > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote:
> > > > >
> > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816
> > >
> > > I can't find this commit; it seems this commit is not in linux-next any more?
> > >
> > > > > [... rest of the syzbot report and earlier discussion quoted in full above, trimmed ...]
> > > Below patch can be squashed into or put before "mm: attempt to
> > > batch free swap entries for zap_pte_range()".
> >
> > Hmm, on second thought, the commit message in the attached commit
> > might not be suitable; the current zswap_invalidate is also designed
> > to only work for order-0 ZSWAP, so things are not clean even after this.
>
> Kairui, what about the below? We don't touch the path of
> __try_to_reclaim_swap(), where you have one folio backing the slots?
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index c1638a009113..8ff58be40544 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1514,6 +1514,8 @@ static bool __swap_entries_free(struct swap_info_struct *si,
>  	unlock_cluster_or_swap_info(si, ci);
> 
>  	if (!has_cache) {
> +		for (i = 0; i < nr; i++)
> +			zswap_invalidate(swp_entry(si->type, offset + i));
>  		spin_lock(&si->lock);
>  		swap_entry_range_free(si, entry, nr);
>  		spin_unlock(&si->lock);

Hi Barry,

Thanks for updating this thread. I'm thinking this might be better done
on the zswap side.

The concern with using zswap_invalidate is that it calls xa_erase, which
requires the xa spin lock. But if we call zswap_invalidate in
swap_entry_range_free, and ensure the slot is HAS_CACHE pinned, doing a
lockless read first with xa_load should be OK for checking whether the
slot needs a ZSWAP invalidation. The performance cost will be minimal,
and we only need to call zswap_invalidate in one place, something like
this (haven't tested; comments are welcome). Also, ZSWAP mTHP will still
store entries in order 0, so this should be OK for the future.

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 13ab3b771409..d7bb3caa9d4e 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -273,9 +273,6 @@ void free_swap_slot(swp_entry_t entry)
 {
 	struct swap_slots_cache *cache;
 
-	/* Large folio swap slot is not covered. */
-	zswap_invalidate(entry);
-
 	cache = raw_cpu_ptr(&swp_slots);
 	if (likely(use_swap_slot_cache && cache->slots_ret)) {
 		spin_lock_irq(&cache->free_lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f947f4dd31a9..fbc25d38a27e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -242,9 +242,6 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 		folio_set_dirty(folio);
 
 	spin_lock(&si->lock);
-	/* Only simple page folio can be backed by zswap */
-	if (nr_pages == 1)
-		zswap_invalidate(entry);
 	swap_entry_range_free(si, entry, nr_pages);
 	spin_unlock(&si->lock);
 	ret = nr_pages;
@@ -1545,6 +1542,10 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 	unsigned char *map_end = map + nr_pages;
 	struct swap_cluster_info *ci;
 
+	/* Slots are pinned with SWAP_HAS_CACHE, safe to invalidate */
+	for (int i = 0; i < nr_pages; ++i)
+		zswap_invalidate(swp_entry(si->type, offset + i));
+
 	ci = lock_cluster(si, offset);
 	do {
 		VM_BUG_ON(*map != SWAP_HAS_CACHE);
diff --git a/mm/zswap.c b/mm/zswap.c
index df66ab102d27..100ad04397fe 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1656,15 +1656,18 @@ bool zswap_load(struct folio *folio)
 	return true;
 }
 
+/* Callers need to pin the slot to prevent a parallel store */
 void zswap_invalidate(swp_entry_t swp)
 {
 	pgoff_t offset = swp_offset(swp);
 	struct xarray *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 
-	entry = xa_erase(tree, offset);
-	if (entry)
-		zswap_entry_free(entry);
+	if (xa_load(tree, offset)) {
+		entry = xa_erase(tree, offset);
+		if (entry)
+			zswap_entry_free(entry);
+	}
 }
 
 int zswap_swapon(int type, unsigned long nr_pages)
-- 
2.45.2

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff
  2024-08-21  6:42 ` Kairui Song
@ 2024-08-21  7:38 ` Barry Song
  2024-08-21 17:33   ` Kairui Song
  0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2024-08-21 7:38 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, chengming.zhou, chrisl, hannes, linux-kernel, linux-mm,
    nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs,
    ying.huang, yosryahmed

On Wed, Aug 21, 2024 at 6:42 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> > [... syzbot report and earlier discussion quoted in full above, trimmed ...]
> > [... earlier discussion and patches quoted in full above, trimmed ...]
>
> Hi Barry,
>
> Thanks for updating this thread. I'm thinking this might be better done
> on the zswap side.
>
> The concern with using zswap_invalidate is that it calls xa_erase, which
> requires the xa spin lock. But if we call zswap_invalidate in
> swap_entry_range_free, and ensure the slot is HAS_CACHE pinned, doing a
> lockless read first with xa_load should be OK for checking whether the
> slot needs a ZSWAP invalidation. The performance cost will be minimal,
> and we only need to call zswap_invalidate in one place, something like
> this (haven't tested; comments are welcome). Also, ZSWAP mTHP will still
> store entries in order 0, so this should be OK for the future.

Hi Kairui,

I fully welcome the callers of swap_entry_range_free not needing to worry
about the zswap mechanism; it frees them from concerning themselves with
zswap. However, zswap_invalidate is currently executed outside of the si
lock, and after the changes it will be inside the si lock, which may
increase the time spent holding the si lock. So, is it possible to move
the acquisition of the si lock into swap_entry_range_free and do it after
completing zswap_invalidate? Or am I overthinking it?

swap_entry_range_free(si, entry, size)
{
	for (nr)
		zswap_invalidate(....)

	spin_lock(si->lock)
	...
	spin_unlock(si->lock)
}

> [... Kairui's patch quoted in full above, trimmed ...]

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff
  2024-08-21  7:38 ` Barry Song
@ 2024-08-21 17:33 ` Kairui Song
  2024-08-21 20:59   ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: Kairui Song @ 2024-08-21 17:33 UTC (permalink / raw)
To: Barry Song
Cc: akpm, chengming.zhou, chrisl, hannes, linux-kernel, linux-mm,
    nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs,
    ying.huang, yosryahmed

On Wed, Aug 21, 2024 at 3:38 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > [... syzbot report and earlier discussion quoted in full above, trimmed ...]
The performance cost will be minimal > > and we only need to call zswap_invalidate in one place, something like > > this (I haven't tested it, comments are welcome). Also ZSWAP mTHP will > > still store entries in order 0, so this should be OK for the future. > > Hi Kairui, > > I fully welcome the callers of swap_entry_range_free not needing to worry > about the zswap mechanism at all. However, currently, zswap_invalidate is executed outside of > the si lock, and after the changes it will be inside the si lock, which may > potentially increase the time spent holding the si lock. So, is it possible to > move the acquisition of the si lock into swap_entry_range_free and take it > only after completing zswap_invalidate? Or am I > overthinking it? > > swap_entry_range_free(si, entry, size) > { > for(nr) > zswap_invalidate(....) > > spin_lock(si->lock) > ... > spin_unlock(si->lock) > } > Hi Barry, Thanks for raising the concern. zswap_invalidate was called inside si->lock before commit 0827a1fb143f ("mm/zswap: invalidate zswap entry when swap entry free"), and I didn't see any performance change reported about this, so it is probably trivial. And if we apply the patch above, when ZSWAP is not enabled, xa_load will return NULL very quickly, so there should be very low overhead. If ZSWAP is enabled, that may indeed increase contention on si->lock, but it is probably not too much of a concern either; it should not be worse than before 0827a1fb143f. And I have a pending series that will remove the si->lock dependency of swap_entry_range_free; I'm working on it and it should be sent out soon. So I think this should be acceptable? And I think we may optimize the operations of xarray in zswap_invalidate too to reduce the overhead of redundant xarray walk. ^ permalink raw reply [flat|nested] 19+ messages in thread
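The leak Kairui describes above (contiguous same-valued slots treated as an mTHP and batch-freed without per-entry zswap invalidation) can be sketched as a toy userspace model. All names below are hypothetical stand-ins for the kernel structures, not the kernel code itself:

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS 8

static bool zswap_tree[NSLOTS];        /* models the per-type zswap xarray */

static void zswap_invalidate(int slot) /* erase the compressed copy */
{
    zswap_tree[slot] = false;
}

/* Order-0 path via free_swap_slot(): invalidates zswap for the slot. */
static void free_slot(int slot)
{
    zswap_invalidate(slot);
    /* ... return the slot to the allocator ... */
}

/* Batched path modeled on the problematic commit: frees a contiguous
 * range of same-valued slots without touching zswap at all. */
static void free_slots_batched(int start, int nr)
{
    (void)start;
    (void)nr;
    /* ... slots freed, but zswap_invalidate() is never called ... */
}

/* Models the check behind the WARN_ON in zswap_swapoff(): the tree
 * must be empty once all swap entries have been freed. */
static int zswap_leaked_entries(void)
{
    int leaks = 0;

    for (int i = 0; i < NSLOTS; i++)
        if (zswap_tree[i])
            leaks++;
    return leaks;
}
```

In this model, any slot freed through the batched path while it still has a "compressed copy" shows up as a leak at swapoff time, which is exactly the condition the kernel warning fires on.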
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-21 17:33 ` Kairui Song @ 2024-08-21 20:59 ` Barry Song 0 siblings, 0 replies; 19+ messages in thread From: Barry Song @ 2024-08-21 20:59 UTC (permalink / raw) To: Kairui Song Cc: akpm, chengming.zhou, chrisl, hannes, linux-kernel, linux-mm, nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs, ying.huang, yosryahmed On Thu, Aug 22, 2024 at 1:33 AM Kairui Song <ryncsn@gmail.com> wrote: > > On Wed, Aug 21, 2024 at 3:38 PM Barry Song <21cnbao@gmail.com> wrote: > > > > On Wed, Aug 21, 2024 at 6:42 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > On Wed, Aug 21, 2024 at 1:49 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > > > On Tue, Aug 20, 2024 at 9:02 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > On Tue, Aug 20, 2024 at 4:47 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot > > > > > > > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > syzbot found the following issue on: > > > > > > > > > > > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 > > > > > > > > > > > > I can't find this commit, seems this commit is not in linux-next any more? > > > > > > > > > > > > > > git tree: linux-next > > > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > > > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > > > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > > > > > > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > > > > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > > > > [syzbot report and crash dump snipped; identical to the report at the top of the thread] > > > > > > > I am hoping syzbot would find a reproducer and bisect this for us. > > > > > > > Meanwhile, from a high-level it looks to me like we are missing a > > > > > > > zswap_invalidate() call in some paths. > > > > > > > > > > > > > > If I have to guess, I would say it's related to the latest mTHP swap > > > > > > > changes, but I am not following closely. Perhaps one of the following > > > > > > > things happened: > > > > > > > > > > > > > > (1) We are not calling zswap_invalidate() in some invalidation paths. > > > > > > > It used to not be called for the cluster freeing path, so maybe we end > > > > > > > up with some order-0 swap entries in a cluster?
or maybe there is an > > > > > > > entirely new invalidation path that does not go through > > > > > > > free_swap_slot() for order-0 entries? > > > > > > > > > > > > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap > > > > > > > somehow. zswap_store() has a warning to cover that though. Maybe > > > > > > > somehow some swap entries are allocated as a cluster, but then pages > > > > > > > are swapped out one-by-one as order-0 (which can go to zswap), but > > > > > > > then we still free the swap entries as a cluster? > > > > > > > > > > > > Hi Yosry, thanks for the report. > > > > > > > > > > > > There are many mTHP related optimizations recently, for this problem I > > > > > > can reproduce this locally. Can confirm the problem is gone for me > > > > > > after reverting: > > > > > > > > > > > > "mm: attempt to batch free swap entries for zap_pte_range()" > > > > > > > > > > > > Hi Barry, > > > > > > > > > > > > If a set of continuous slots are having the same value, they are > > > > > > considered a mTHP and freed, bypassing the slot cache, and causing > > > > > > zswap leak. > > > > > > This didn't happen in put_swap_folio because that function is > > > > > > expecting an actual mTHP folio behind the slots but > > > > > > free_swap_and_cache_nr is simply walking the slots. > > > > > > > > > > > > For the testing, I actually have to disable mTHP, because linux-next > > > > > > will panic with mTHP due to lack of following fixes: > > > > > > https://lore.kernel.org/linux-mm/a4b1b34f-0d8c-490d-ab00-eaedbf3fe780@gmail.com/ > > > > > > https://lore.kernel.org/linux-mm/403b7f3c-6e5b-4030-ab1c-3198f36e3f73@gmail.com/ > > > > > > > > > > > > > > > > > > > > I am not closely following the latest changes so I am not sure. CCing > > > > > > > folks who have done work in that area recently. > > > > > > > > > > > > > > I am starting to think maybe it would be more reliable to just call > > > > > > > zswap_invalidate() for all freed swap entries anyway. 
Would that be > > > > > > > too expensive? We used to do that before the zswap_invalidate() call > > > > > > > was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry > > > > > > > when swap entry free"), and that was before we started using the > > > > > > > xarray (so it was arguably worse than it would be now). > > > > > > > > > > > > > > > > > > > That might be a good idea, I suggest moving zswap_invalidate to > > > > > > swap_range_free and call it for every freed slot. > > > > > > > > > > > > Below patch can be squash into or put before "mm: attempt to batch > > > > > > free swap entries for zap_pte_range()". > > > > > > > > > > Hmm, on second thought, the commit message in the attachment commit > > > > > might be not suitable, current zswap_invalidate is also designed to > > > > > only work for order 0 ZSWAP, so things are not clean even after this. > > > > > > > > Kairui, what about the below? we don't touch the path of __try_to_reclaim_swap() where > > > > you have one folio backed? > > > > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > > > index c1638a009113..8ff58be40544 100644 > > > > --- a/mm/swapfile.c > > > > +++ b/mm/swapfile.c > > > > @@ -1514,6 +1514,8 @@ static bool __swap_entries_free(struct swap_info_struct *si, > > > > unlock_cluster_or_swap_info(si, ci); > > > > > > > > if (!has_cache) { > > > > + for (i = 0; i < nr; i++) > > > > + zswap_invalidate(swp_entry(si->type, offset + i)); > > > > spin_lock(&si->lock); > > > > swap_entry_range_free(si, entry, nr); > > > > spin_unlock(&si->lock); > > > > > > > > > > Hi Barry, > > > > > > Thanks for updating this thread, I'm thinking maybe something will > > > better be done at the zswap side? > > > > > > The concern of using zswap_invalidate is that it calls xa_erase which > > > requires the xa spin lock. 
But if we are calling zswap_invalidate in > > > swap_entry_range_free, and ensure the slot is HAS_CACHE pinned, doing > > > a lockless read first with xa_load should be OK for checking if the > > > slot needs a ZSWAP invalidation. The performance cost will be minimal > > > and we only need to call zswap_invalidate in one place, something like > > > this (haven't tested, comments are welcome). Also ZSWAP mthp will > > > still store entried in order 0 so this should be OK for future. > > > > Hi Kairui, > > > > I fully welcome the callers of swap_entry_range_free not needing to worry > > about the zswap mechanism—it's a freedom from concerning themselves > > with zswap. However, currently, zswap_invalidate is executed outside of > > the si lock, and after the changes, it will be inside the si lock, which may > > potentially increase the time spent holding the si lock. So, is it possible to > > move the action of acquiring the si lock for swap_entry_range_free to > > this function and do it after completing zswap_invalidate? Or am I > > overthinking it? > > > > swap_entry_range_free(si, entry, size) > > { > > for(nr) > > zswap_invalidate(....) > > > > spin_lock(si->lock) > > ... > > spin_unlock(si->lock) > > } > > > > Hi Barry, > > Thanks for raising the concern. > > zswap_invalidate was called inside si->lock before commit > 0827a1fb143f ("mm/zswap: invalidate zswap entry when > swap entry free"), and I didn't see any performance change update > about this, so this might be trivial. > > And if we apply the patch above, when ZSWAP is not enabled, > the xa_load will return NULL very fast so there should be very > low overhead. If ZSWAP is enabled, that may increase contention > on si->lock indeed, but maybe not too much of a concern either, > it should not be worse than before 0827a1fb143f. And I have a > pending series that will remove si->lock dependency of > swap_entry_range_free. I'm working on it and should be sent out > soon. So I think this should be acceptable? 
right. > > And I think we may optimize the operations of xarray in zswap_invalidate > too to reduce the overhead of redundant xarray walk. ack.
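Kairui's xa_load-before-xa_erase idea can be modeled in userspace. The array, counters, and function names below are hypothetical stand-ins; the sketch assumes, as in the thread, that the slot is pinned (SWAP_HAS_CACHE) so no store can race the lockless check:

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 16

static void *tree[NSLOTS];    /* stands in for the zswap xarray */
static int lock_taken;        /* counts how often the "xa lock" is taken */
static int freed;             /* counts freed zswap entries */

/* Lockless read, like xa_load() under RCU. */
static void *load_entry(int i)
{
    return tree[i];
}

/* Erase path, which in the real code takes the xarray spin lock. */
static void *erase_entry(int i)
{
    void *old;

    lock_taken++;             /* models spin_lock(&xa->xa_lock) */
    old = tree[i];
    tree[i] = NULL;
    return old;               /* models spin_unlock() on return */
}

/* The pattern from the patch: check locklessly first, and only take
 * the lock when an entry is actually present, so a disabled or empty
 * zswap pays almost nothing per freed slot. */
static void invalidate(int i)
{
    void *entry;

    if (load_entry(i)) {
        entry = erase_entry(i);
        if (entry)
            freed++;          /* zswap_entry_free(entry) would run here */
    }
}
```

The point of the design choice is visible in the counters: invalidating many empty slots never touches the lock, matching the claim in the thread that the xa_load fast path keeps the overhead minimal when zswap holds nothing.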
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-21 6:42 ` Kairui Song 2024-08-21 7:38 ` Barry Song @ 2024-08-22 18:12 ` Yosry Ahmed 2024-08-22 20:16 ` Chris Li 2024-09-03 18:21 ` Yosry Ahmed 1 sibling, 2 replies; 19+ messages in thread From: Yosry Ahmed @ 2024-08-22 18:12 UTC (permalink / raw) To: Kairui Song Cc: Barry Song, akpm, chengming.zhou, chrisl, hannes, linux-kernel, linux-mm, nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs, ying.huang On Tue, Aug 20, 2024 at 11:42 PM Kairui Song <ryncsn@gmail.com> wrote: > > On Wed, Aug 21, 2024 at 1:49 PM Barry Song <21cnbao@gmail.com> wrote: > > > > On Tue, Aug 20, 2024 at 9:02 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > On Tue, Aug 20, 2024 at 4:47 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot > > > > > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > > > > > > > > > > > Hello, > > > > > > > > > > > > syzbot found the following issue on: > > > > > > > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 > > > > > > > > I can't find this commit, seems this commit is not in linux-next any more? > > > > > > > > > > git tree: linux-next > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > > > > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > > [syzbot report and crash dump snipped; identical to the report at the top of the thread] > > > > > I am hoping syzbot would find a reproducer and bisect this for us. > > > > > Meanwhile, from a high-level it looks to me like we are missing a > > > > > zswap_invalidate() call in some paths. > > > > > > > > > > If I have to guess, I would say it's related to the latest mTHP swap > > > > > changes, but I am not following closely. Perhaps one of the following > > > > > things happened: > > > > > > > > > > (1) We are not calling zswap_invalidate() in some invalidation paths. > > > > > It used to not be called for the cluster freeing path, so maybe we end > > > > > up with some order-0 swap entries in a cluster? or maybe there is an > > > > > entirely new invalidation path that does not go through > > > > > free_swap_slot() for order-0 entries? > > > > > > > > > > (2) Some higher order swap entries (i.e.
a cluster) end up in zswap > > > > > somehow. zswap_store() has a warning to cover that though. Maybe > > > > > somehow some swap entries are allocated as a cluster, but then pages > > > > > are swapped out one-by-one as order-0 (which can go to zswap), but > > > > > then we still free the swap entries as a cluster? > > > > > > > > Hi Yosry, thanks for the report. > > > > > > > > There are many mTHP related optimizations recently, for this problem I > > > > can reproduce this locally. Can confirm the problem is gone for me > > > > after reverting: > > > > > > > > "mm: attempt to batch free swap entries for zap_pte_range()" > > > > > > > > Hi Barry, > > > > > > > > If a set of continuous slots are having the same value, they are > > > > considered a mTHP and freed, bypassing the slot cache, and causing > > > > zswap leak. > > > > This didn't happen in put_swap_folio because that function is > > > > expecting an actual mTHP folio behind the slots but > > > > free_swap_and_cache_nr is simply walking the slots. > > > > > > > > For the testing, I actually have to disable mTHP, because linux-next > > > > will panic with mTHP due to lack of following fixes: > > > > https://lore.kernel.org/linux-mm/a4b1b34f-0d8c-490d-ab00-eaedbf3fe780@gmail.com/ > > > > https://lore.kernel.org/linux-mm/403b7f3c-6e5b-4030-ab1c-3198f36e3f73@gmail.com/ > > > > > > > > > > > > > > I am not closely following the latest changes so I am not sure. CCing > > > > > folks who have done work in that area recently. > > > > > > > > > > I am starting to think maybe it would be more reliable to just call > > > > > zswap_invalidate() for all freed swap entries anyway. Would that be > > > > > too expensive? We used to do that before the zswap_invalidate() call > > > > > was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry > > > > > when swap entry free"), and that was before we started using the > > > > > xarray (so it was arguably worse than it would be now). 
> > > > > > > > > > > > > That might be a good idea, I suggest moving zswap_invalidate to > > > > swap_range_free and call it for every freed slot. > > > > > > > > Below patch can be squash into or put before "mm: attempt to batch > > > > free swap entries for zap_pte_range()". > > > > > > Hmm, on second thought, the commit message in the attachment commit > > > might be not suitable, current zswap_invalidate is also designed to > > > only work for order 0 ZSWAP, so things are not clean even after this. > > > > Kairui, what about the below? we don't touch the path of __try_to_reclaim_swap() where > > you have one folio backed? > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index c1638a009113..8ff58be40544 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -1514,6 +1514,8 @@ static bool __swap_entries_free(struct swap_info_struct *si, > > unlock_cluster_or_swap_info(si, ci); > > > > if (!has_cache) { > > + for (i = 0; i < nr; i++) > > + zswap_invalidate(swp_entry(si->type, offset + i)); > > spin_lock(&si->lock); > > swap_entry_range_free(si, entry, nr); > > spin_unlock(&si->lock); > > > > Hi Barry, > > Thanks for updating this thread, I'm thinking maybe something will > better be done at the zswap side? > > The concern of using zswap_invalidate is that it calls xa_erase which > requires the xa spin lock. But if we are calling zswap_invalidate in > swap_entry_range_free, and ensure the slot is HAS_CACHE pinned, doing > a lockless read first with xa_load should be OK for checking if the > slot needs a ZSWAP invalidation. The performance cost will be minimal > and we only need to call zswap_invalidate in one place, something like > this (haven't tested, comments are welcome). Also ZSWAP mthp will > still store entried in order 0 so this should be OK for future. 
While I do agree with this change on a high level, it's essentially reverting commit 0827a1fb143f ("mm/zswap: invalidate zswap entry when swap entry free") which fixed a small problem with zswap writeback. I'd prefer that we don't if possible. One thing that I always wanted to do is to pull some of the work done in swap_entry_range_free() and swap_range_free() before the slots caching layer. The memcg uncharging, clearing shadow entries from the swap cache, arch invalidation, zswap invalidation, etc. If we can have a hook for these pre-free callbacks we can call it for single entries before we add them to the slots cache, and call them for the clusters as we do today. This should also reduce the amount of work done under the lock, and move more work to where the freeing is actually happening vs. the cache draining. I remember discussing this briefly with Ying before. Anyone have any thoughts? > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > index 13ab3b771409..d7bb3caa9d4e 100644 > --- a/mm/swap_slots.c > +++ b/mm/swap_slots.c > @@ -273,9 +273,6 @@ void free_swap_slot(swp_entry_t entry) > { > struct swap_slots_cache *cache; > > - /* Large folio swap slot is not covered. 
*/ > - zswap_invalidate(entry); > - > cache = raw_cpu_ptr(&swp_slots); > if (likely(use_swap_slot_cache && cache->slots_ret)) { > spin_lock_irq(&cache->free_lock); > diff --git a/mm/swapfile.c b/mm/swapfile.c > index f947f4dd31a9..fbc25d38a27e 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -242,9 +242,6 @@ static int __try_to_reclaim_swap(struct > swap_info_struct *si, > folio_set_dirty(folio); > > spin_lock(&si->lock); > - /* Only sinple page folio can be backed by zswap */ > - if (nr_pages == 1) > - zswap_invalidate(entry); > swap_entry_range_free(si, entry, nr_pages); > spin_unlock(&si->lock); > ret = nr_pages; > @@ -1545,6 +1542,10 @@ static void swap_entry_range_free(struct > swap_info_struct *si, swp_entry_t entry > unsigned char *map_end = map + nr_pages; > struct swap_cluster_info *ci; > > + /* Slots are pinned with SWAP_HAS_CACHE, safe to invalidate */ > + for (int i = 0; i < nr_pages; ++i) > + zswap_invalidate(swp_entry(si->type, offset + i)); > + > ci = lock_cluster(si, offset); > do { > VM_BUG_ON(*map != SWAP_HAS_CACHE); > diff --git a/mm/zswap.c b/mm/zswap.c > index df66ab102d27..100ad04397fe 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1656,15 +1656,18 @@ bool zswap_load(struct folio *folio) > return true; > } > > +/* Caller need to pin the slot to prevent parallel store */ > void zswap_invalidate(swp_entry_t swp) > { > pgoff_t offset = swp_offset(swp); > struct xarray *tree = swap_zswap_tree(swp); > struct zswap_entry *entry; > > - entry = xa_erase(tree, offset); > - if (entry) > - zswap_entry_free(entry); > + if (xa_load(tree, offset)) { > + entry = xa_erase(tree, offset); > + if (entry) > + zswap_entry_free(entry); > + } > } > > int zswap_swapon(int type, unsigned long nr_pages) > -- > 2.45.2 ^ permalink raw reply [flat|nested] 19+ messages in thread
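Yosry's pre-free callback idea above might look roughly like the following userspace sketch. The names and the counter are hypothetical; the counter merely stands in for the real per-entry teardown (zswap invalidation, shadow clearing, memcg uncharge, arch invalidation):

```c
#include <assert.h>

static int pre_free_calls;     /* counts per-entry teardown invocations */

/* One hook gathering the teardown that today happens late, under the
 * swap-device lock at cache-drain time. */
static void swap_entry_pre_free(long entry)
{
    (void)entry;
    pre_free_calls++;          /* e.g. zswap_invalidate(entry), etc. */
}

/* Single order-0 entry: run the hook before the slot enters the
 * per-cpu slots cache. */
static void free_swap_slot_model(long entry)
{
    swap_entry_pre_free(entry);
    /* ... then stash the entry in the slots cache ... */
}

/* Cluster: run the hook for every entry in the batch, then free the
 * whole range under the lock, as is done today. */
static void free_cluster_model(long start, int nr)
{
    for (int i = 0; i < nr; i++)
        swap_entry_pre_free(start + i);
    /* ... spin_lock(&si->lock); swap_entry_range_free(...); ... */
}
```

Either path runs the same per-entry teardown exactly once, where the free actually happens, which is the property the proposal is after.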
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-22 18:12 ` Yosry Ahmed @ 2024-08-22 20:16 ` Chris Li 2024-08-22 20:20 ` Yosry Ahmed 2024-09-03 18:21 ` Yosry Ahmed 1 sibling, 1 reply; 19+ messages in thread From: Chris Li @ 2024-08-22 20:16 UTC (permalink / raw) To: Yosry Ahmed Cc: Kairui Song, Barry Song, akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs, ying.huang On Thu, Aug 22, 2024 at 11:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Aug 20, 2024 at 11:42 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > On Wed, Aug 21, 2024 at 1:49 PM Barry Song <21cnbao@gmail.com> wrote: > > > > > > On Tue, Aug 20, 2024 at 9:02 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > On Tue, Aug 20, 2024 at 4:47 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot > > > > > > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > syzbot found the following issue on: > > > > > > > > > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 > > > > > > > > > > I can't find this commit, seems this commit is not in linux-next any more? > > > > > > > > > > > > git tree: linux-next > > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > > > > > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > > > [syzbot report and crash dump snipped; identical to the report at the top of the thread] > > > > > > I am hoping syzbot would find a reproducer and bisect this for us. > > > > > > Meanwhile, from a high-level it looks to me like we are missing a > > > > > > zswap_invalidate() call in some paths. > > > > > > > > > > > > If I have to guess, I would say it's related to the latest mTHP swap > > > > > > changes, but I am not following closely. Perhaps one of the following > > > > > > things happened: > > > > > > > > > > > > (1) We are not calling zswap_invalidate() in some invalidation paths. > > > > > > It used to not be called for the cluster freeing path, so maybe we end > > > > > > up with some order-0 swap entries in a cluster?
or maybe there is an > > > > > > entirely new invalidation path that does not go through > > > > > > free_swap_slot() for order-0 entries? > > > > > > > > > > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap > > > > > > somehow. zswap_store() has a warning to cover that though. Maybe > > > > > > somehow some swap entries are allocated as a cluster, but then pages > > > > > > are swapped out one-by-one as order-0 (which can go to zswap), but > > > > > > then we still free the swap entries as a cluster? > > > > > > > > > > Hi Yosry, thanks for the report. > > > > > > > > > > There are many mTHP related optimizations recently, for this problem I > > > > > can reproduce this locally. Can confirm the problem is gone for me > > > > > after reverting: > > > > > > > > > > "mm: attempt to batch free swap entries for zap_pte_range()" > > > > > > > > > > Hi Barry, > > > > > > > > > > If a set of continuous slots are having the same value, they are > > > > > considered a mTHP and freed, bypassing the slot cache, and causing > > > > > zswap leak. > > > > > This didn't happen in put_swap_folio because that function is > > > > > expecting an actual mTHP folio behind the slots but > > > > > free_swap_and_cache_nr is simply walking the slots. > > > > > > > > > > For the testing, I actually have to disable mTHP, because linux-next > > > > > will panic with mTHP due to lack of following fixes: > > > > > https://lore.kernel.org/linux-mm/a4b1b34f-0d8c-490d-ab00-eaedbf3fe780@gmail.com/ > > > > > https://lore.kernel.org/linux-mm/403b7f3c-6e5b-4030-ab1c-3198f36e3f73@gmail.com/ > > > > > > > > > > > > > > > > > I am not closely following the latest changes so I am not sure. CCing > > > > > > folks who have done work in that area recently. > > > > > > > > > > > > I am starting to think maybe it would be more reliable to just call > > > > > > zswap_invalidate() for all freed swap entries anyway. Would that be > > > > > > too expensive? 
We used to do that before the zswap_invalidate() call
> > > > > > was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry
> > > > > > when swap entry free"), and that was before we started using the
> > > > > > xarray (so it was arguably worse than it would be now).
> > > > >
> > > > > That might be a good idea; I suggest moving zswap_invalidate to
> > > > > swap_range_free and calling it for every freed slot.
> > > > >
> > > > > The patch below can be squashed into or put before "mm: attempt to batch
> > > > > free swap entries for zap_pte_range()".
> > > >
> > > > Hmm, on second thought, the commit message in the attached commit
> > > > might not be suitable; the current zswap_invalidate is also designed to
> > > > only work for order-0 zswap, so things are not clean even after this.
> > >
> > > Kairui, what about the below? We don't touch the path of
> > > __try_to_reclaim_swap(), where you have one folio backing the slots.
> > >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index c1638a009113..8ff58be40544 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1514,6 +1514,8 @@ static bool __swap_entries_free(struct swap_info_struct *si,
> > >  	unlock_cluster_or_swap_info(si, ci);
> > >
> > >  	if (!has_cache) {
> > > +		for (i = 0; i < nr; i++)
> > > +			zswap_invalidate(swp_entry(si->type, offset + i));
> > >  		spin_lock(&si->lock);
> > >  		swap_entry_range_free(si, entry, nr);
> > >  		spin_unlock(&si->lock);
> >
> > Hi Barry,
> >
> > Thanks for updating this thread; I'm thinking maybe something would
> > better be done on the zswap side.
> >
> > The concern with using zswap_invalidate is that it calls xa_erase, which
> > requires the xa spin lock. But if we are calling zswap_invalidate in
> > swap_entry_range_free, and ensure the slot is HAS_CACHE-pinned, doing
> > a lockless read first with xa_load should be OK for checking if the
> > slot needs a ZSWAP invalidation.
The performance cost will be minimal,
> > and we only need to call zswap_invalidate in one place, something like
> > this (haven't tested; comments are welcome). Also, ZSWAP mTHP will
> > still store entries in order 0, so this should be OK for the future.
>
> While I do agree with this change on a high level, it's essentially
> reverting commit 0827a1fb143f ("mm/zswap: invalidate zswap entry when
> swap entry free"), which fixed a small problem with zswap writeback.
> I'd prefer that we don't if possible.
>
> One thing that I always wanted to do is to pull some of the work done
> in swap_entry_range_free() and swap_range_free() before the slots
> caching layer: the memcg uncharging, clearing shadow entries from the
> swap cache, arch invalidation, zswap invalidation, etc. If we can have
> a hook for these pre-free callbacks, we can call it for single entries
> before we add them to the slots cache, and call them for the clusters
> as we do today. This should also reduce the amount of work done under
> the lock, and move more work to where the freeing is actually
> happening vs. the cache draining.
>
> I remember discussing this briefly with Ying before. Anyone have any thoughts?

Hi Yosry,

If I understand correctly, the lock you are talking about is the
si->lock, right?

Kairui has some WIP patches removing the swap slot cache in the swap
entry freeing path. Basically, the si->lock is only used to protect the
cluster list. Most of the time, freeing a swap entry will only take the
ci->lock; there is no need to take the si->lock to change the cluster
lists. Only when the cluster moves to another list will it require the
si->lock, e.g.
Chris > > > > > diff --git a/mm/swap_slots.c b/mm/swap_slots.c > > index 13ab3b771409..d7bb3caa9d4e 100644 > > --- a/mm/swap_slots.c > > +++ b/mm/swap_slots.c > > @@ -273,9 +273,6 @@ void free_swap_slot(swp_entry_t entry) > > { > > struct swap_slots_cache *cache; > > > > - /* Large folio swap slot is not covered. */ > > - zswap_invalidate(entry); > > - > > cache = raw_cpu_ptr(&swp_slots); > > if (likely(use_swap_slot_cache && cache->slots_ret)) { > > spin_lock_irq(&cache->free_lock); > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index f947f4dd31a9..fbc25d38a27e 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -242,9 +242,6 @@ static int __try_to_reclaim_swap(struct > > swap_info_struct *si, > > folio_set_dirty(folio); > > > > spin_lock(&si->lock); > > - /* Only sinple page folio can be backed by zswap */ > > - if (nr_pages == 1) > > - zswap_invalidate(entry); > > swap_entry_range_free(si, entry, nr_pages); > > spin_unlock(&si->lock); > > ret = nr_pages; > > @@ -1545,6 +1542,10 @@ static void swap_entry_range_free(struct > > swap_info_struct *si, swp_entry_t entry > > unsigned char *map_end = map + nr_pages; > > struct swap_cluster_info *ci; > > > > + /* Slots are pinned with SWAP_HAS_CACHE, safe to invalidate */ > > + for (int i = 0; i < nr_pages; ++i) > > + zswap_invalidate(swp_entry(si->type, offset + i)); > > + > > ci = lock_cluster(si, offset); > > do { > > VM_BUG_ON(*map != SWAP_HAS_CACHE); > > diff --git a/mm/zswap.c b/mm/zswap.c > > index df66ab102d27..100ad04397fe 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -1656,15 +1656,18 @@ bool zswap_load(struct folio *folio) > > return true; > > } > > > > +/* Caller need to pin the slot to prevent parallel store */ > > void zswap_invalidate(swp_entry_t swp) > > { > > pgoff_t offset = swp_offset(swp); > > struct xarray *tree = swap_zswap_tree(swp); > > struct zswap_entry *entry; > > > > - entry = xa_erase(tree, offset); > > - if (entry) > > - zswap_entry_free(entry); > > + if 
(xa_load(tree, offset)) { > > + entry = xa_erase(tree, offset); > > + if (entry) > > + zswap_entry_free(entry); > > + } > > } > > > > int zswap_swapon(int type, unsigned long nr_pages) > > -- > > 2.45.2 ^ permalink raw reply [flat|nested] 19+ messages in thread
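The leak Kairui describes above can be sketched with a toy userspace model (illustrative only — none of these names are kernel APIs; a boolean array stands in for the zswap xarray): any freeing path that bypasses the per-slot invalidate hook leaves stale entries behind, which is exactly what the WARN in zswap_swapoff() trips over.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SLOTS 512

/* Toy stand-in for the zswap xarray: true = a compressed copy exists. */
static bool zswap_tree[NR_SLOTS];

static void toy_zswap_invalidate(int slot)
{
	zswap_tree[slot] = false;
}

/* Order-0 path: runs the invalidate hook, as free_swap_slot() does. */
static void toy_free_slot(int slot)
{
	toy_zswap_invalidate(slot);
	/* ...slot returned to the allocator... */
}

/* Batched path; with fixed == false it bypasses the hook, modeling the bug. */
static void toy_batch_free(int first, int nr, bool fixed)
{
	for (int i = first; i < first + nr; i++) {
		if (fixed)
			toy_zswap_invalidate(i);
		/* ...slot returned to the allocator... */
	}
}

/* Counts stale entries left behind, like the warning in zswap_swapoff(). */
static int toy_swapoff_leaks(void)
{
	int leaks = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		leaks += zswap_tree[i];
	return leaks;
}
```

With `fixed == false`, entries freed through the batched path survive as leaks; with `fixed == true` (the per-slot invalidation being discussed in the thread) the leftover count drops to zero.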
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-22 20:16 ` Chris Li @ 2024-08-22 20:20 ` Yosry Ahmed 2024-08-22 21:44 ` Chris Li 0 siblings, 1 reply; 19+ messages in thread From: Yosry Ahmed @ 2024-08-22 20:20 UTC (permalink / raw) To: Chris Li Cc: Kairui Song, Barry Song, akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs, ying.huang

On Thu, Aug 22, 2024 at 1:16 PM Chris Li <chrisl@kernel.org> wrote:
[..]
> > > > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816
> > > > > > >
> > > > > > > I can't find this commit, seems this commit is not in linux-next any more?
[..]
> Hi Yosry,
>
> If I understand correctly, the lock you are talking about is the
> si->lock, right?
>
> Kairui has some WIP patches removing the swap slot cache in the swap
> entry freeing path. Basically the si->lock is only used to protect the
> cluster list. Most of the time freeing a swap entry will only take the
> ci->lock. No need to take the si->lock to change the cluster lists.
> Only when the cluster moves to another list will it require the
> si->lock, e.g. the cluster moves to the free list when all 512 entries
> are freed. Because each cluster has 512 entries, the need to take
> si->lock is dramatically reduced. That patch is based on the new
> cluster swap allocator series. Kairui can share more details.
>
> I don't think ci->lock has heavy contentions.

Thanks for sharing these details. However, I think we should still
pull the irrelevant logic from swap_entry_range_free() and
swap_range_free() before the slots cache layer. There is no reason to
defer this work until we drain the cache, and whoever ends up draining
the cache ends up paying for all the work of previously freed slots.

In the zswap case, deferring the invalidation causes a problem, which
is why we moved it to free_swap_slot(). However, now with the mTHP
changes more freeing paths are being added and it's easy to miss
invalidating zswap in every case, so I think we need a more centralized
place to execute pre-free hooks, and call it before the swap slots cache.

> Chris
^ permalink raw reply [flat|nested] 19+ messages in thread
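Chris's description of the WIP locking scheme can be modeled with a small sketch (a toy, not the actual WIP code; counters stand in for lock acquisitions, and the struct names are simplified): per-entry frees take only the per-cluster lock, and the global si->lock is needed only on the list transition when the cluster empties.

```c
#include <assert.h>

#define ENTRIES_PER_CLUSTER 512

struct toy_cluster {
	int used;          /* entries still allocated in this cluster */
	int ci_lock_taken; /* how often the per-cluster lock was taken */
};

struct toy_si {
	int si_lock_taken; /* how often the global si->lock was taken */
};

/*
 * Free one entry: the common case touches only ci->lock; si->lock is
 * taken only when the now-empty cluster moves to the free list.
 */
static void toy_free_entry(struct toy_si *si, struct toy_cluster *ci)
{
	ci->ci_lock_taken++;		/* per-cluster lock */
	ci->used--;
	if (ci->used == 0)
		si->si_lock_taken++;	/* list transition: cluster -> free list */
}
```

Freeing all 512 entries of a cluster takes the cluster lock 512 times but the global lock exactly once, which is the reduction in si->lock traffic described above.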
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-22 20:20 ` Yosry Ahmed @ 2024-08-22 21:44 ` Chris Li 0 siblings, 0 replies; 19+ messages in thread From: Chris Li @ 2024-08-22 21:44 UTC (permalink / raw) To: Yosry Ahmed Cc: Kairui Song, Barry Song, akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs, ying.huang

On Thu, Aug 22, 2024 at 1:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
[..]
> Thanks for sharing these details. However, I think we should still
> pull the irrelevant logic from swap_entry_range_free() and
> swap_range_free() before the slots cache layer. There is no reason to
> defer this work until we drain the cache, and whoever ends up draining
> the cache ends up paying for all the work of previously freed slots.

Sounds like a good idea from a high-level point of view. It will likely
reduce the complexity of freeing the swap entries as well.

Again, the devil is in the details. We likely need to see the patch to
reason about it in more detail.

Chris
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-22 18:12 ` Yosry Ahmed 2024-08-22 20:16 ` Chris Li @ 2024-09-03 18:21 ` Yosry Ahmed 2024-09-03 18:50 ` Kairui Song 1 sibling, 1 reply; 19+ messages in thread From: Yosry Ahmed @ 2024-09-03 18:21 UTC (permalink / raw) To: Kairui Song Cc: Barry Song, akpm, chengming.zhou, chrisl, hannes, linux-kernel, linux-mm, nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs, ying.huang

[..]
> One thing that I always wanted to do is to pull some of the work done
> in swap_entry_range_free() and swap_range_free() before the slots
> caching layer. The memcg uncharging, clearing shadow entries from the
> swap cache, arch invalidation, zswap invalidation, etc. If we can have
> a hook for these pre-free callbacks we can call it for single entries
> before we add them to the slots cache, and call them for the clusters
> as we do today. This should also reduce the amount of work done under
> the lock, and move more work to where the freeing is actually
> happening vs. the cache draining.
>
> I remember discussing this briefly with Ying before. Anyone have any thoughts?

Kairui, Barry, any thoughts on this? Any preferences on how to make
sure zswap_invalidate() is being called in all possible swap freeing
paths?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-09-03 18:21 ` Yosry Ahmed @ 2024-09-03 18:50 ` Kairui Song 2024-09-03 19:41 ` Yosry Ahmed 0 siblings, 1 reply; 19+ messages in thread From: Kairui Song @ 2024-09-03 18:50 UTC (permalink / raw) To: Yosry Ahmed Cc: Barry Song, akpm, chengming.zhou, chrisl, hannes, linux-kernel, linux-mm, nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs, ying.huang On Wed, Sep 4, 2024 at 2:22 AM Yosry Ahmed <yosryahmed@google.com> wrote: > Hi Yosry, > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > > > index c1638a009113..8ff58be40544 100644 > > > > --- a/mm/swapfile.c > > > > +++ b/mm/swapfile.c > > > > @@ -1514,6 +1514,8 @@ static bool __swap_entries_free(struct swap_info_struct *si, > > > > unlock_cluster_or_swap_info(si, ci); > > > > > > > > if (!has_cache) { > > > > + for (i = 0; i < nr; i++) > > > > + zswap_invalidate(swp_entry(si->type, offset + i)); > > > > spin_lock(&si->lock); > > > > swap_entry_range_free(si, entry, nr); > > > > spin_unlock(&si->lock); > > > > This fix from Barry has been applied to mm-unstable and it's looking good so far. > Kairui, Barry, any thoughts on this? Any preferences on how to make > sure zswap_invalidate() is being called in all possible swap freeing > paths? I have a set of patches that removes the si->lock around swap_entry_range_free, and removes the slot return cache too. So we can move zswap_invalidate into it, as locking and caching are no longer an issue; I will post them ASAP. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-09-03 18:50 ` Kairui Song @ 2024-09-03 19:41 ` Yosry Ahmed 0 siblings, 0 replies; 19+ messages in thread From: Yosry Ahmed @ 2024-09-03 19:41 UTC (permalink / raw) To: Kairui Song Cc: Barry Song, akpm, chengming.zhou, chrisl, hannes, linux-kernel, linux-mm, nphamcs, ryan.roberts, syzbot+ce6029250d7fd4d0476d, syzkaller-bugs, ying.huang On Tue, Sep 3, 2024 at 11:51 AM Kairui Song <ryncsn@gmail.com> wrote: > > On Wed, Sep 4, 2024 at 2:22 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > Hi Yosry, > > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > > > > index c1638a009113..8ff58be40544 100644 > > > > > --- a/mm/swapfile.c > > > > > +++ b/mm/swapfile.c > > > > > @@ -1514,6 +1514,8 @@ static bool __swap_entries_free(struct swap_info_struct *si, > > > > > unlock_cluster_or_swap_info(si, ci); > > > > > > > > > > if (!has_cache) { > > > > > + for (i = 0; i < nr; i++) > > > > > + zswap_invalidate(swp_entry(si->type, offset + i)); > > > > > spin_lock(&si->lock); > > > > > swap_entry_range_free(si, entry, nr); > > > > > spin_unlock(&si->lock); > > > > > > > This fix from Barry have been applied for mm-unstable and it's looking > good so far. Are we sure it fixes the problem? Unfortunately we don't have a reproducer. The swap entry freeing paths are getting too tangled to know for sure if zswap_invalidate() is being called in all cases, which is why I prefer having a centralized callback for all swap entry freeing hooks. > > > Kairui, Barry, any thoughts on this? Any preferences on how to make > > sure zswap_invalidate() is being called in all possible swap freeing > > paths? > > I have a set of patched that removed the si->lock around > swap_entry_range_free, and the slot return cache is removed too. So we > can move zswap_invalidate into it as locking or caching is no longer > an issue, I will post it ASAP. That sounds good, but I am not sure I understand what you mean by removing the slot return cache. 
I will wait to see the patches. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-20 8:47 ` Kairui Song 2024-08-20 9:02 ` Kairui Song @ 2024-08-20 9:22 ` Barry Song 2024-08-20 9:29 ` Kairui Song 1 sibling, 1 reply; 19+ messages in thread From: Barry Song @ 2024-08-20 9:22 UTC (permalink / raw) To: Kairui Song Cc: Yosry Ahmed, syzbot, akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs, syzkaller-bugs, Chris Li, Ying, Ryan Roberts On Tue, Aug 20, 2024 at 8:47 PM Kairui Song <ryncsn@gmail.com> wrote: > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Fri, Aug 16, 2024 at 12:52 PM syzbot > > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > > > > > Hello, > > > > > > syzbot found the following issue on: > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 > > I can't find this commit, seems this commit is not in linux-next any more? > > > > git tree: linux-next > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > > Downloadable assets: > > > disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz > > > vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz > > > kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > Reported-by: syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com > > > > > > ------------[ cut here ]------------ > > > WARNING: CPU: 0 PID: 11298 at mm/zswap.c:1700 zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > > > Modules linked in: > > > CPU: 0 UID: 0 PID: 11298 Comm: swapoff Not tainted 6.11.0-rc3-next-20240816-syzkaller #0 > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024 > > > RIP: 0010:zswap_swapoff+0x11b/0x2b0 mm/zswap.c:1700 > > > Code: 74 05 e8 78 73 07 00 4b 83 7c 35 00 00 75 15 e8 1b bd 9e ff 48 ff c5 49 83 c6 50 83 7c 24 0c 17 76 9b eb 24 e8 06 bd 9e ff 90 <0f> 0b 90 eb e5 48 8b 0c 24 80 e1 07 80 c1 03 38 c1 7c 90 48 8b 3c > > > RSP: 0018:ffffc9000302fa38 EFLAGS: 00010293 > > > RAX: ffffffff81f4d66a RBX: dffffc0000000000 RCX: ffff88802c19bc00 > > > RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff888015986248 > > > RBP: 0000000000000000 R08: ffffffff81f4d620 R09: 1ffffffff1d476ac > > > R10: dffffc0000000000 R11: fffffbfff1d476ad R12: dffffc0000000000 > > > R13: ffff888015986200 R14: 0000000000000048 R15: 0000000000000002 > > > FS: 00007f9e628a5380(0000) GS:ffff8880b9000000(0000) knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: 0000001b30f15ff8 CR3: 000000006c5f0000 CR4: 00000000003506f0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > Call Trace: > > > <TASK> > > > __do_sys_swapoff mm/swapfile.c:2837 [inline] > > > __se_sys_swapoff+0x4653/0x4cf0 
mm/swapfile.c:2706 > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > RIP: 0033:0x7f9e629feb37 > > > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48 > > > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8 > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37 > > > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20 > > > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94 > > > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188 > > > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80 > > > </TASK> > > > > I am hoping syzbot would find a reproducer and bisect this for us. > > Meanwhile, from a high-level it looks to me like we are missing a > > zswap_invalidate() call in some paths. > > > > If I have to guess, I would say it's related to the latest mTHP swap > > changes, but I am not following closely. Perhaps one of the following > > things happened: > > > > (1) We are not calling zswap_invalidate() in some invalidation paths. > > It used to not be called for the cluster freeing path, so maybe we end > > up with some order-0 swap entries in a cluster? or maybe there is an > > entirely new invalidation path that does not go through > > free_swap_slot() for order-0 entries? > > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap > > somehow. zswap_store() has a warning to cover that though. Maybe > > somehow some swap entries are allocated as a cluster, but then pages > > are swapped out one-by-one as order-0 (which can go to zswap), but > > then we still free the swap entries as a cluster? > > Hi Yosry, thanks for the report. 
> > There are many mTHP related optimizations recently, for this problem I > can reproduce this locally. Can confirm the problem is gone for me > after reverting: > > "mm: attempt to batch free swap entries for zap_pte_range()" > > Hi Barry, > > If a set of continuous slots are having the same value, they are > considered a mTHP and freed, bypassing the slot cache, and causing > zswap leak. > This didn't happen in put_swap_folio because that function is > expecting an actual mTHP folio behind the slots but > free_swap_and_cache_nr is simply walking the slots. Hi Kairui, I don't understand: if anything has a folio backing it, the code will fall back to __try_to_reclaim_swap(); it won't call swap_entry_range_free(). ci = lock_cluster_or_swap_info(si, offset); if (!swap_is_last_map(si, offset, nr, &has_cache)) { unlock_cluster_or_swap_info(si, ci); goto fallback; } for (i = 0; i < nr; i++) WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); unlock_cluster_or_swap_info(si, ci); if (!has_cache) { spin_lock(&si->lock); swap_entry_range_free(si, entry, nr); spin_unlock(&si->lock); } return has_cache; Am I missing something? > > For the testing, I actually have to disable mTHP, because linux-next > will panic with mTHP due to lack of following fixes: > https://lore.kernel.org/linux-mm/a4b1b34f-0d8c-490d-ab00-eaedbf3fe780@gmail.com/ > https://lore.kernel.org/linux-mm/403b7f3c-6e5b-4030-ab1c-3198f36e3f73@gmail.com/ > > > > > I am not closely following the latest changes so I am not sure. CCing > > folks who have done work in that area recently. > > > > I am starting to think maybe it would be more reliable to just call > > zswap_invalidate() for all freed swap entries anyway. Would that be > > too expensive? We used to do that before the zswap_invalidate() call > > was moved by commit 0827a1fb143f ("mm/zswap: invalidate zswap entry > > when swap entry free"), and that was before we started using the > > xarray (so it was arguably worse than it would be now). 
> > > > That might be a good idea, I suggest moving zswap_invalidate to > swap_range_free and call it for every freed slot. > > Below patch can be squash into or put before "mm: attempt to batch > free swap entries for zap_pte_range()". ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-20 9:22 ` Barry Song @ 2024-08-20 9:29 ` Kairui Song 2024-08-20 9:53 ` Barry Song 0 siblings, 1 reply; 19+ messages in thread From: Kairui Song @ 2024-08-20 9:29 UTC (permalink / raw) To: Barry Song Cc: Yosry Ahmed, syzbot, akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs, syzkaller-bugs, Chris Li, Ying, Ryan Roberts On Tue, Aug 20, 2024 at 5:22 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, Aug 20, 2024 at 8:47 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot > > > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > > > > > > > Hello, > > > > > > > > syzbot found the following issue on: > > > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 > > > > I can't find this commit, seems this commit is not in linux-next any more? > > > > > > git tree: linux-next > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > [...]
mm/swapfile.c:2837 [inline] > > > > __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706 > > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > RIP: 0033:0x7f9e629feb37 > > > > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48 > > > > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8 > > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37 > > > > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20 > > > > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94 > > > > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188 > > > > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80 > > > > </TASK> > > > > > > I am hoping syzbot would find a reproducer and bisect this for us. > > > Meanwhile, from a high-level it looks to me like we are missing a > > > zswap_invalidate() call in some paths. > > > > > > If I have to guess, I would say it's related to the latest mTHP swap > > > changes, but I am not following closely. Perhaps one of the following > > > things happened: > > > > > > (1) We are not calling zswap_invalidate() in some invalidation paths. > > > It used to not be called for the cluster freeing path, so maybe we end > > > up with some order-0 swap entries in a cluster? or maybe there is an > > > entirely new invalidation path that does not go through > > > free_swap_slot() for order-0 entries? > > > > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap > > > somehow. zswap_store() has a warning to cover that though. 
Maybe > > > somehow some swap entries are allocated as a cluster, but then pages > > > are swapped out one-by-one as order-0 (which can go to zswap), but > > > then we still free the swap entries as a cluster? > > > > Hi Yosry, thanks for the report. > > > > There are many mTHP related optimizations recently, for this problem I > > can reproduce this locally. Can confirm the problem is gone for me > > after reverting: > > > > "mm: attempt to batch free swap entries for zap_pte_range()" > > > > Hi Barry, > > > > If a set of continuous slots are having the same value, they are > > considered a mTHP and freed, bypassing the slot cache, and causing > > zswap leak. > > This didn't happen in put_swap_folio because that function is > > expecting an actual mTHP folio behind the slots but > > free_swap_and_cache_nr is simply walking the slots. > > Hi Kairui, > > I don't understand, if anyone has a folio backend, the code will > > go fallback to __try_to_reclaim_swap(), it won't call > > swap_entry_range_free(). > > > > ci = lock_cluster_or_swap_info(si, offset); > > if (!swap_is_last_map(si, offset, nr, &has_cache)) { > > unlock_cluster_or_swap_info(si, ci); > > goto fallback; > > } > > for (i = 0; i < nr; i++) > > WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); > > unlock_cluster_or_swap_info(si, ci); > > > > if (!has_cache) { > > spin_lock(&si->lock); > > swap_entry_range_free(si, entry, nr); > > spin_unlock(&si->lock); > > } > > return has_cache; > > > > Am i missing something? Hi Barry, Per my understanding, ZSWAP invalidation can happen after the folio is gone from the swap cache; in particular, free_swap_and_cache_nr will iterate over and zap the swap slots without swapping them in. So the fact that a slot has no folio backing it doesn't mean it has no ZSWAP data. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [syzbot] [mm?] WARNING in zswap_swapoff 2024-08-20 9:29 ` Kairui Song @ 2024-08-20 9:53 ` Barry Song 0 siblings, 0 replies; 19+ messages in thread From: Barry Song @ 2024-08-20 9:53 UTC (permalink / raw) To: Kairui Song Cc: Yosry Ahmed, syzbot, akpm, chengming.zhou, hannes, linux-kernel, linux-mm, nphamcs, syzkaller-bugs, Chris Li, Ying, Ryan Roberts On Tue, Aug 20, 2024 at 9:29 PM Kairui Song <ryncsn@gmail.com> wrote: > > On Tue, Aug 20, 2024 at 5:22 PM Barry Song <21cnbao@gmail.com> wrote: > > > > On Tue, Aug 20, 2024 at 8:47 PM Kairui Song <ryncsn@gmail.com> wrote: > > > > > > On Tue, Aug 20, 2024 at 4:13 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Fri, Aug 16, 2024 at 12:52 PM syzbot > > > > <syzbot+ce6029250d7fd4d0476d@syzkaller.appspotmail.com> wrote: > > > > > > > > > > Hello, > > > > > > > > > > syzbot found the following issue on: > > > > > > > > > > HEAD commit: 367b5c3d53e5 Add linux-next specific files for 20240816 > > > > > > I can't find this commit, seems this commit is not in linux-next any more? > > > > > > > > git tree: linux-next > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=12489105980000 > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=ce6029250d7fd4d0476d > > > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > [...]
> > > > > <TASK> > > > > > __do_sys_swapoff mm/swapfile.c:2837 [inline] > > > > > __se_sys_swapoff+0x4653/0x4cf0 mm/swapfile.c:2706 > > > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > > > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > > > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > > RIP: 0033:0x7f9e629feb37 > > > > > Code: 73 01 c3 48 8b 0d f1 52 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c1 52 0d 00 f7 d8 64 89 01 48 > > > > > RSP: 002b:00007fff17734f68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8 > > > > > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9e629feb37 > > > > > RDX: 00007f9e62a9e7e8 RSI: 00007f9e62b9beed RDI: 0000563090942a20 > > > > > RBP: 0000563090942a20 R08: 0000000000000000 R09: 77872e07ed164f94 > > > > > R10: 000000000000001f R11: 0000000000000246 R12: 00007fff17735188 > > > > > R13: 00005630909422a0 R14: 0000563073724169 R15: 00007f9e62bdda80 > > > > > </TASK> > > > > > > > > I am hoping syzbot would find a reproducer and bisect this for us. > > > > Meanwhile, from a high-level it looks to me like we are missing a > > > > zswap_invalidate() call in some paths. > > > > > > > > If I have to guess, I would say it's related to the latest mTHP swap > > > > changes, but I am not following closely. Perhaps one of the following > > > > things happened: > > > > > > > > (1) We are not calling zswap_invalidate() in some invalidation paths. > > > > It used to not be called for the cluster freeing path, so maybe we end > > > > up with some order-0 swap entries in a cluster? or maybe there is an > > > > entirely new invalidation path that does not go through > > > > free_swap_slot() for order-0 entries? > > > > > > > > (2) Some higher order swap entries (i.e. a cluster) end up in zswap > > > > somehow. zswap_store() has a warning to cover that though. 
Maybe > > > > somehow some swap entries are allocated as a cluster, but then pages > > > > are swapped out one-by-one as order-0 (which can go to zswap), but > > > > then we still free the swap entries as a cluster? > > > > > > Hi Yosry, thanks for the report. > > > > > > There are many mTHP related optimizations recently, for this problem I > > > can reproduce this locally. Can confirm the problem is gone for me > > > after reverting: > > > > > > "mm: attempt to batch free swap entries for zap_pte_range()" > > > > > > Hi Barry, > > > > > > If a set of continuous slots are having the same value, they are > > > considered a mTHP and freed, bypassing the slot cache, and causing > > > zswap leak. > > > This didn't happen in put_swap_folio because that function is > > > expecting an actual mTHP folio behind the slots but > > > free_swap_and_cache_nr is simply walking the slots. > > > > Hi Kairui, > > > > I don't understand, if anyone has a folio backend, the code will > > go fallback to __try_to_reclaim_swap(), it won't call > > swap_entry_range_free(). > > > > ci = lock_cluster_or_swap_info(si, offset); > > if (!swap_is_last_map(si, offset, nr, &has_cache)) { > > unlock_cluster_or_swap_info(si, ci); > > goto fallback; > > } > > for (i = 0; i < nr; i++) > > WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); > > unlock_cluster_or_swap_info(si, ci); > > > > if (!has_cache) { > > spin_lock(&si->lock); > > swap_entry_range_free(si, entry, nr); > > spin_unlock(&si->lock); > > } > > return has_cache; > > > > Am i missing something? > > Hi Barry, > > Per my understanding, ZSWAP invalidation could happen after the folio > is gone from the swap cache, especially in free_swap_and_cache_nr, it > will iterate and zap the swap slots without swapping them in. > So a slot doesn't have a folio backed doesn't mean it doesn't have ZSWAP data. well. thanks! the original non-batched code always has a zswap_invalidate() in free_swap_slot(). 
void free_swap_slot(swp_entry_t entry) { struct swap_slots_cache *cache; /* Large folio swap slot is not covered. */ zswap_invalidate(entry); ... } But the benefits of batched free are huge (on phones, almost 100% of PTEs have no swapcache as they use sync io swap; in the SSD case, only a small portion of PTEs have a folio backing in the swapcache, so it still benefits significantly from batched free). So let's try to find a proper fix for this before we have to resort to the ugly workaround below (which would at least benefit phones, which never use zswap): diff --git a/mm/swapfile.c b/mm/swapfile.c index f947f4dd31a9..c8c70a9bf6d6 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1498,6 +1498,9 @@ static bool __swap_entries_free(struct swap_info_struct *si, unsigned char count; int i; + if (!zswap_never_enabled()) + goto fallback; + if (nr <= 1 || swap_count(data_race(si->swap_map[offset])) != 1) goto fallback; /* cross into another cluster */ Thanks Barry ^ permalink raw reply related [flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-09-03 19:41 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-08-16 19:52 [syzbot] [mm?] WARNING in zswap_swapoff syzbot 2024-08-19 20:12 ` Yosry Ahmed 2024-08-20 8:47 ` Kairui Song 2024-08-20 9:02 ` Kairui Song 2024-08-21 5:49 ` Barry Song 2024-08-21 6:42 ` Kairui Song 2024-08-21 7:38 ` Barry Song 2024-08-21 17:33 ` Kairui Song 2024-08-21 20:59 ` Barry Song 2024-08-22 18:12 ` Yosry Ahmed 2024-08-22 20:16 ` Chris Li 2024-08-22 20:20 ` Yosry Ahmed 2024-08-22 21:44 ` Chris Li 2024-09-03 18:21 ` Yosry Ahmed 2024-09-03 18:50 ` Kairui Song 2024-09-03 19:41 ` Yosry Ahmed 2024-08-20 9:22 ` Barry Song 2024-08-20 9:29 ` Kairui Song 2024-08-20 9:53 ` Barry Song