[BUG] filemap_get_read

All of lore.kernel.org
 help / color / mirror / Atom feed

* [BUG] filemap_get_read_batch()
@ 2022-06-17 15:12 Brian Foster
  2022-06-17 17:04 ` Matthew Wilcox
  0 siblings, 1 reply; 2+ messages in thread
From: Brian Foster @ 2022-06-17 15:12 UTC (permalink / raw)
  To: linux-fsdevel, Matthew Wilcox

Hi,

I've hit the filemap_get_read_batch() BUG [1] that I think I saw Dave
had also recently reported. It looks like the problem is essentially a
race between reads, pagecache removal and folio reinsertion that leads
to an invalid folio pointer. E.g., what I observe is the following
(ordered) sequence of events:

Task A:
- Lands in filemap_get_read_batch() looking for a couple folio indexes,
  currently both populated by single page folios.
- Grabs the folio at the first index and starts to process it.

Task B:
- Invalidates several folios from the mapping, including both the
  aforementioned folios task A is after.

Task C: 
- Instantiates a compound (order 2) folio that covers both indexes being
  processed by task A.

Task A:
- Iterates to the next xarray index based on the (now already removed)
  non-compound folio via xas_advance()/xas_next().
- BUG splat down in folio_try_get_rcu() on the folio pointer..

I'm not quite sure what is being returned from the xarray here. It
doesn't appear to be another page or anything (i.e. a tail page of a
different folio sort of like we saw with the iomap writeback completion
issue). I just get more splats if I try to access it purely as a page,
so I'm not sure it's a pointer at all. I don't have enough context on
the xarray bits to intuit on whether it might be internal data or just
garbage if the node happened to be reformatted, etc. If you have any
thoughts on extra things to check around that I can try to dig further
into it..

In any event, it sort of feels like somehow or another this folio order
change peturbs the xarray iteration since IIUC the non-compound page
variant has been in place for a while, but that could just be wrong or
circumstance. I'm not sure if it's possible to check the xarray node for
such changes or whatever before attempting to process the returned entry
(and to preserve the lockless algorithm). FWIW wrapping the whole lookup
around an xa_lock_irq(&mapping->i_pages) lock cycle does make the
problem disappear.

Brian

[1]

BUG: kernel NULL pointer dereference, address: 0000000000000106
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 72 PID: 297881 Comm: xfs_io Tainted: G          I       5.19.0-rc2+ #160
Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018
RIP: 0010:filemap_get_read_batch+0x8e/0x240
Code: 81 ff 06 04 00 00 0f 84 f7 00 00 00 48 81 ff 02 04 00 00 0f 84 c4 00 00 00 48 39 6c 24 08 0f 87 84 00 00 00 40 f6 c7 01 75 7e <8b> 47 34 85 c0 0f 84 a8 00 00 00 8d 50 01 48 8d 77 34 f0 0f b1 57
RSP: 0018:ffffacdf200d7c28 EFLAGS: 00010246
RAX: 0000000000000039 RBX: ffffacdf200d7d68 RCX: 0000000000000034
RDX: ffff9b2aa6805220 RSI: 0000000000000074 RDI: 00000000000000d2
RBP: 0000000000000075 R08: 0000000000000402 R09: ffff9b2a89ac4488
R10: 0000000000020000 R11: 0000000000000000 R12: ffff9b2a89ac4600
R13: 0000000000000075 R14: 0000000000000074 R15: ffffacdf200d7e88
FS:  00007fbba6ce7800(0000) GS:ffff9b29c1100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000106 CR3: 0000000150c7e001 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 filemap_get_pages+0x80/0x710
 ? current_time+0x1b/0xd0
 ? atime_needs_update+0xfc/0x170
 ? touch_atime+0x27/0x190
 filemap_read+0xa8/0x310
 ? __folio_start_writeback+0x91/0x2d0
 ? folio_add_lru+0x8d/0x100
 ? _raw_spin_unlock+0x15/0x30
 ? __handle_mm_fault+0xd13/0xf50
 xfs_file_buffered_read+0x50/0xd0 [xfs]
 xfs_file_read_iter+0x70/0xd0 [xfs]
 new_sync_read+0xf6/0x160
 vfs_read+0x138/0x190
 __x64_sys_pread64+0x6e/0xa0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fbba71fa1ef
Code: 08 89 3c 24 48 89 4c 24 18 e8 2d f4 ff ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 7d f4 ff ff 48 8b
RSP: 002b:00007ffcdc03d0d0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fbba71fa1ef
RDX: 0000000000001000 RSI: 000056021a601000 RDI: 0000000000000003
RBP: 0000000000074000 R08: 0000000000000000 R09: 00007fbba7140a60
R10: 0000000000074000 R11: 0000000000000293 R12: 0000000000074000
R13: 000000000002c000 R14: 00000000000a0000 R15: 0000000000001000


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [BUG] filemap_get_read_batch()
  2022-06-17 15:12 [BUG] filemap_get_read_batch() Brian Foster
@ 2022-06-17 17:04 ` Matthew Wilcox
  0 siblings, 0 replies; 2+ messages in thread
From: Matthew Wilcox @ 2022-06-17 17:04 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel

On Fri, Jun 17, 2022 at 11:12:18AM -0400, Brian Foster wrote:
> Hi,
> 
> I've hit the filemap_get_read_batch() BUG [1] that I think I saw Dave
> had also recently reported. It looks like the problem is essentially a
> race between reads, pagecache removal and folio reinsertion that leads
> to an invalid folio pointer. E.g., what I observe is the following
> (ordered) sequence of events:

Oh, this is fantastic.  Thank you for explaining my bug to me!  I've been
racking my brains for weeks trying to figure out what it is.

> Task A:
> - Lands in filemap_get_read_batch() looking for a couple folio indexes,
>   currently both populated by single page folios.
> - Grabs the folio at the first index and starts to process it.
> 
> Task B:
> - Invalidates several folios from the mapping, including both the
>   aforementioned folios task A is after.
> 
> Task C: 
> - Instantiates a compound (order 2) folio that covers both indexes being
>   processed by task A.
> 
> Task A:
> - Iterates to the next xarray index based on the (now already removed)
>   non-compound folio via xas_advance()/xas_next().
> - BUG splat down in folio_try_get_rcu() on the folio pointer..
> 
> I'm not quite sure what is being returned from the xarray here. It
> doesn't appear to be another page or anything (i.e. a tail page of a
> different folio sort of like we saw with the iomap writeback completion
> issue). I just get more splats if I try to access it purely as a page,
> so I'm not sure it's a pointer at all. I don't have enough context on
> the xarray bits to intuit on whether it might be internal data or just
> garbage if the node happened to be reformatted, etc. If you have any
> thoughts on extra things to check around that I can try to dig further
> into it..

It's a sibling entry (you can tell because bit 1 is set and it's less
than 256).  It tells you where the real entry actually is.  See below
for an explanation.

> In any event, it sort of feels like somehow or another this folio order
> change peturbs the xarray iteration since IIUC the non-compound page
> variant has been in place for a while, but that could just be wrong or
> circumstance. I'm not sure if it's possible to check the xarray node for
> such changes or whatever before attempting to process the returned entry
> (and to preserve the lockless algorithm). FWIW wrapping the whole lookup
> around an xa_lock_irq(&mapping->i_pages) lock cycle does make the
> problem disappear.

Heh, yes, it would because you wouldn't be able to change the array
while the batch lookup was running.

I think there are four possible ways to proceed.  Let's say we're
looking up the range from 0x35-0x36, we get the single-page-folio at 0x35,
they're both truncated and replaced with an order-2 folio at 0x34.

Option one is that we could return the newly-added folio, so the
folio_batch would contain the old folio at index 0x35, then the new folio
at index 0x34.  Looking at the loop in filemap_read(), I think that would
work; it doesn't make assumptions that the folios are actually contiguous,
just that there's no gaps between folios.  This feels a bit weird though.

Option two is that we could notice we got a folio which overlaps some
earlier folios, go back and delete them from the batch.  So we'd return
only the new folio at 0x34.  This code is going to be moderately complex
to write.

Option three is that we abandon all work we've done to this point and
start the lookup from the beginning of the batch.

Option four is that we just return the batch so far.  So we return the
(now-truncated) folio at 0x35, then on the next call we return the
new folio at 0x34.  This is functionally the same at option one, but
doesn't have the weird feeling.  It's also the simplest to implement,
and for this kind of race, maybe simple is better.

+++ b/mm/filemap.c
@@ -2357,6 +2357,8 @@ static void filemap_get_read_batch(struct address_space *mapping,
                        continue;
                if (xas.xa_index > max || xa_is_value(folio))
                        break;
+               if (xa_is_sibling(folio))
+                       break;
                if (!folio_try_get_rcu(folio))
                        goto retry;

> Brian
> 
> [1]
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000106
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 72 PID: 297881 Comm: xfs_io Tainted: G          I       5.19.0-rc2+ #160
> Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018
> RIP: 0010:filemap_get_read_batch+0x8e/0x240
> Code: 81 ff 06 04 00 00 0f 84 f7 00 00 00 48 81 ff 02 04 00 00 0f 84 c4 00 00 00 48 39 6c 24 08 0f 87 84 00 00 00 40 f6 c7 01 75 7e <8b> 47 34 85 c0 0f 84 a8 00 00 00 8d 50 01 48 8d 77 34 f0 0f b1 57

The faulting instruction is "mov    0x34(%rdi),%eax",

> RSP: 0018:ffffacdf200d7c28 EFLAGS: 00010246
> RAX: 0000000000000039 RBX: ffffacdf200d7d68 RCX: 0000000000000034
> RDX: ffff9b2aa6805220 RSI: 0000000000000074 RDI: 00000000000000d2

RDI is 0xd2.  Shift it by 2 and you get offset 52 (0x34).  So you're
proabbly looking up one of the indices 53-55 (0x35-0x37).

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2022-06-17 17:04 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-06-17 15:12 [BUG] filemap_get_read_batch() Brian Foster
2022-06-17 17:04 ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.