All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>, Jan Kara <jack@suse.cz>,
	Ritesh Harjani <riteshh@linux.ibm.com>,
	Dave Chinner <dchinner@redhat.com>
Subject: Re: [PATCH] iomap: Submit the BIO at the end of each extent
Date: Mon, 4 May 2020 17:37:10 -0700	[thread overview]
Message-ID: <20200505003710.GO5703@magnolia> (raw)
In-Reply-To: <20200320214654.GC6812@magnolia>

On Fri, Mar 20, 2020 at 02:46:54PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 20, 2020 at 07:40:14AM -0700, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > By definition, an extent covers a range of consecutive blocks, so
> > it would be quite rare to be able to just add pages to the BIO from
> > a previous range.  The only case we can think of is a mapped extent
> > followed by a hole extent, followed by another mapped extent which has
> > been allocated immediately after the first extent.  We believe this to
> > be an unlikely layout for a filesystem to choose and, since the queue
> > is plugged, those two BIOs would be merged by the block layer.
> > 
> > The reason we care is that ext2/ext4 choose to lay out blocks 0-11
> > consecutively, followed by the indirect block, and we want to merge those
> > two BIOs.  If we don't submit the data BIO before asking the filesystem
> > for the next extent, then the indirect BIO will be submitted first,
> > and waited for, leading to inefficient I/O patterns.  Buffer heads solve
> > this with the BH_boundary flag, but iomap doesn't need that as long as
> > we submit the bio here.
> 
> Hmm, I just received the following stack trace while running generic/418
> on a v5 filesystem with 1k blocks:

I hit this crash again, though I've figured out how to trigger it
reliably.  Boot a x64 VM with mem=512M to make memory constrained, and
then kick off fstests with 1k block size:

FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 alder-mtr00 5.7.0-rc4-djw #rc4 SMP PREEMPT Mon May 4 09:24:50 PDT 2020
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdf
MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdf /opt

and then you get the following:

xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff)
run fstests generic/418 at 2020-05-04 17:27:51
rm (3338) used greatest stack depth: 11728 bytes left
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 4900 Comm: dio-invalidate- Not tainted 5.7.0-rc4-djw #rc4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
RIP: 0010:iomap_set_range_uptodate+0x5d/0x170
Code: 07 00 60 00 00 75 13 f0 80 0f 04 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 47 18 44 8d 74 16 ff 41 89 f1 4c 8b 6f 28 <4c> 8b 38 41 0f b6 af ca 00 00 00 89 e9 41 d3 e9 40 80 fd 1f 0f 87
RSP: 0018:ffffc90004b0b8e8 EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffffea0000292440 RCX: 0000000000000000
RDX: 0000000000000400 RSI: 0000000000000c00 RDI: ffffea0000292440
RBP: 0000000000001000 R08: ffffc90004b0b958 R09: 0000000000000c00
R10: 0000000000000002 R11: ffff888017806720 R12: ffffc90004b0ba20
R13: ffff888017806720 R14: 0000000000000fff R15: ffff88801872c610
FS:  00007f2091593740(0000) GS:ffff88801e800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000017abb005 CR4: 00000000001606a0
Call Trace:
 ? __raw_spin_lock_init+0x39/0x60
 iomap_readpage_actor+0x113/0x3f0
 iomap_readpages_actor+0x1dc/0x240
 iomap_apply+0x12d/0x4e9
 ? iomap_readpage_actor+0x3f0/0x3f0
 ? mark_held_locks+0x45/0x70
 iomap_readpages+0xc2/0x290
 ? iomap_readpage_actor+0x3f0/0x3f0
 ? xa_clear_mark+0x30/0x30
 read_pages+0x75/0x1b0
 __do_page_cache_readahead+0x1bb/0x1d0
 ondemand_readahead+0x21a/0x540
 ? pagecache_get_page+0x26/0x320
 generic_file_read_iter+0x91a/0xd10
 ? xfs_file_buffered_aio_read+0x88/0x170 [xfs]
 xfs_file_buffered_aio_read+0x65/0x170 [xfs]
 xfs_file_read_iter+0xe9/0x2a0 [xfs]
 new_sync_read+0x12d/0x1d0
 vfs_read+0xc7/0x180
 ksys_pread64+0x64/0xa0
 do_syscall_64+0x50/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7f209179cbca

Digging into this with gcc, the RIP value is:

0xffffffff813047cd is in iomap_set_range_uptodate (/storage/home/djwong/cdev/work/linux-djw/fs/iomap/buffered-io.c:147).
142
143     static void
144     iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
145     {
146             struct iomap_page *iop = to_iomap_page(page);
147             struct inode *inode = page->mapping->host;
148             unsigned first = off >> inode->i_blkbits;
149             unsigned last = (off + len - 1) >> inode->i_blkbits;
150             bool uptodate = true;
151             unsigned long flags;

So now this makes me wonder, is it possible to be performing readahead
into a page that doesn't have page->mapping set yet?  I reran this a few
times, got crashes in different places, but the common factor is that
page->mapping is NULL, and we're doing readhead.

I also tried this with the patch *not* applied and had the same
problems, so it's not actually this patch.  But there's something going
wrong in the iomap code...

--D

> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 alder-mtr01 5.6.0-rc4-djw #rc4 SMP PREEMPT Fri Mar 13 14:48:13 PDT 2020
> MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdd
> MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdd /opt
> 
> (Note that it seems to do this even with MKFS_OPTIONS='-m crc=0' and
> empty MOUNT_OPTIONS.)
> 
> [   32.931667] XFS (sdd): Mounting V5 Filesystem
> [   32.940193] XFS (sdd): Ending clean mount
> [   32.941350] XFS (sdd): Quotacheck needed: Please wait.
> [   32.970724] XFS (sdd): Quotacheck: Done.
> [   32.972550] xfs filesystem being mounted at /opt supports timestamps until 2038 (0x7fffffff)
> [   32.985191] XFS (sdd): Unmounting Filesystem
> [   33.162426] XFS (sde): EXPERIMENTAL online scrub feature in use. Use at your own risk!
> [   33.273185] XFS (sde): Unmounting Filesystem
> [   33.517957] XFS (sde): Mounting V5 Filesystem
> [   33.526176] XFS (sde): Ending clean mount
> [   33.527439] XFS (sde): Quotacheck needed: Please wait.
> [   33.566117] XFS (sde): Quotacheck: Done.
> [   33.569427] xfs filesystem being mounted at /mnt supports timestamps until 2038 (0x7fffffff)
> [   33.656942] run fstests generic/418 at 2020-03-20 14:42:29
> [   36.332268] BUG: kernel NULL pointer dereference, address: 0000000000000060
> [   36.334254] #PF: supervisor read access in kernel mode
> [   36.334849] #PF: error_code(0x0000) - not-present page
> [   36.335461] PGD 0 P4D 0 
> [   36.335779] Oops: 0000 [#1] PREEMPT SMP
> [   36.336246] CPU: 2 PID: 5144 Comm: dio-invalidate- Not tainted 5.6.0-rc4-djw #rc4
> [   36.337078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
> [   36.338069] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0
> [   36.338671] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12
> [   36.340705] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246
> [   36.341309] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a
> [   36.342105] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000
> [   36.342909] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c
> [   36.343710] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50
> [   36.344505] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00
> [   36.345246] FS:  00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
> [   36.346087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   36.346696] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0
> [   36.347445] Call Trace:
> [   36.347734]  iomap_readpages_actor+0x1e3/0x250
> [   36.348699]  iomap_apply+0x12c/0x4e3
> [   36.349097]  ? iomap_readpage_actor+0x3c0/0x3c0
> [   36.349593]  ? prep_new_page+0x3f/0x100
> [   36.350022]  ? iomap_readpage_actor+0x3c0/0x3c0
> [   36.350519]  iomap_readpages+0xc7/0x2b0
> [   36.350938]  ? iomap_readpage_actor+0x3c0/0x3c0
> [   36.351438]  read_pages+0x6e/0x1a0
> [   36.351824]  __do_page_cache_readahead+0x1c3/0x1e0
> [   36.352343]  ondemand_readahead+0x210/0x4b0
> [   36.352797]  generic_file_read_iter+0x871/0xcd0
> [   36.353365]  ? xfs_file_buffered_aio_read+0x54/0x170 [xfs]
> [   36.353982]  xfs_file_buffered_aio_read+0x5f/0x170 [xfs]
> [   36.354591]  xfs_file_read_iter+0xea/0x2a0 [xfs]
> [   36.355139]  ? xfs_file_write_iter+0xf2/0x1d0 [xfs]
> [   36.355668]  new_sync_read+0x12d/0x1d0
> [   36.356085]  vfs_read+0xa6/0x180
> [   36.356454]  ksys_pread64+0x64/0xa0
> [   36.356841]  do_syscall_64+0x50/0x1a0
> [   36.357252]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   36.357792] RIP: 0033:0x7f8928524f64
> [   36.358189] Code: 15 61 80 20 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 8b 05 aa c4 20 00 49 89 ca 85 c0 75 13 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5c f3 c3 66 90 41 55 41 54 49 89 cd 55 53 49
> [   36.360053] RSP: 002b:00007fffc7893b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
> [   36.360841] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f8928524f64
> [   36.361594] RDX: 0000000000000400 RSI: 00005593e3c23000 RDI: 0000000000000003
> [   36.362344] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001
> [   36.363075] R10: 0000000000000000 R11: 0000000000000246 R12: 00005593e3c23000
> [   36.363817] R13: 0000000000000000 R14: 00005593e3c25000 R15: 0000000000000400
> [   36.364569] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables nfsv4 af_packet
> [   36.366966] Dumping ftrace buffer:
> [   36.367351]    (ftrace buffer empty)
> [   36.367742] CR2: 0000000000000060
> [   36.369050] ---[ end trace d599586d1259866c ]---
> [   36.369884] RIP: 0010:iomap_readpage_actor+0x2ea/0x3c0
> [   36.370694] Code: 43 10 8b 54 24 24 48 c7 40 38 80 bc 2f 81 48 8b 7b 10 e9 00 ff ff ff 31 c0 48 85 ed 0f 85 c9 fe ff ff 49 8b 46 18 48 8b 2c 24 <8b> 48 60 48 81 c5 ff 0f 00 00 48 c1 ed 0c 81 e1 c0 0c 00 00 e9 12
> [   36.373022] RSP: 0018:ffffc90004ebb968 EFLAGS: 00010246
> [   36.373615] RAX: 0000000000000000 RBX: ffffc90004ebbb30 RCX: 000000000000000a
> [   36.374362] RDX: 0000000000000400 RSI: 0000000000000003 RDI: 0000000000000000
> [   36.375100] RBP: 0000000000000400 R08: ffffc90004ebb988 R09: ffffc90004ebb98c
> [   36.375833] R10: 0000000000001000 R11: 0000000000000400 R12: ffffc90004ebba50
> [   36.376577] R13: 0000000000000086 R14: ffffea0001cd2400 R15: 0000000000000c00
> [   36.377322] FS:  00007f892894c740(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
> [   36.378190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   36.378795] CR2: 0000000000000060 CR3: 0000000078753005 CR4: 00000000001606a0
> 
> I'll email back if I find anything else but since this is the second
> Friday in a row of getting STOP SHIP bugs dropped in my lap at 2pm, I
> doubt I'm going to manage much.
> 
> --D
> 
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  fs/iomap/buffered-io.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index f080f542911b..417115bfaf6b 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -420,6 +420,16 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> >  				ctx, iomap, srcmap);
> >  	}
> >  
> > +	/*
> > +	 * Submitting the bio here leads to better I/O patterns for
> > +	 * filesystems which need to do metadata reads to find the
> > +	 * next extent.
> > +	 */
> > +	if (ctx->bio) {
> > +		submit_bio(ctx->bio);
> > +		ctx->bio = NULL;
> > +	}
> > +
> >  	return done;
> >  }
> >  
> > @@ -449,8 +459,6 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
> >  	}
> >  	ret = 0;
> >  done:
> > -	if (ctx.bio)
> > -		submit_bio(ctx.bio);
> >  	if (ctx.cur_page) {
> >  		if (!ctx.cur_page_in_bio)
> >  			unlock_page(ctx.cur_page);
> > -- 
> > 2.25.1
> > 

  parent reply	other threads:[~2020-05-05  0:39 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-20 14:40 [PATCH] iomap: Submit the BIO at the end of each extent Matthew Wilcox
2020-03-20 14:44 ` Christoph Hellwig
2020-03-20 15:50 ` Darrick J. Wong
2020-03-20 21:46 ` Darrick J. Wong
2020-03-21 14:20   ` Matthew Wilcox
2020-03-23 12:59     ` Christoph Hellwig
2020-05-05  0:37   ` Darrick J. Wong [this message]
2020-05-05  2:24     ` Matthew Wilcox
2020-05-05  2:59       ` Darrick J. Wong
2020-05-05 19:08         ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200505003710.GO5703@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=dchinner@redhat.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=riteshh@linux.ibm.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.