linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Liu Bo <bo.li.liu@oracle.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6
Date: Thu, 13 Jul 2017 15:38:59 -0700	[thread overview]
Message-ID: <20170713223859.GC6181@lim.localdomain> (raw)
In-Reply-To: <20170713163619.GB12913@lim.localdomain>

On Thu, Jul 13, 2017 at 09:36:19AM -0700, Liu Bo wrote:
> On Thu, Jul 13, 2017 at 03:09:54PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> > 
> > The recent changes to make bio cloning faster (added in the 4.13 merge
> > window) by using the bio_clone_fast() API introduced a regression on
> > raid5/6 modes, because cloned bios have an invalid bi_vcnt field
> > (therefore it can not be used) and the raid5/6 code uses the
> > bio_for_each_segment_all() API to iterate the segments of a bio, and this
> > API uses a bio's bi_vcnt field.
> > 
> > The issue is very simple to trigger by doing for example a direct IO write
> > against a raid5 or raid6 filesystem and then attempting to read what we
> > wrote before:
> > 
> >   $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
> >   $ mount /dev/sdc /mnt
> >   $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
> >   $ od -t x1 /mnt/foobar
> >   od: /mnt/foobar: read error: Input/output error
> > 
> > For that example, the following is also reported in dmesg/syslog:
> > 
> >   [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
> >   [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
> >   [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
> >   [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
> >   [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
> > 
> > A scrub will also fail correcting bad copies, mentioning the following in
> > dmesg/syslog:
> > 
> >   [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
> >   [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
> >   [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
> >   [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> >   [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
> >   [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> >   [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
> >   [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> >   [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
> >   [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> >   [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$
> >   [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> >   [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd
> >   [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$
> >   [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> >   [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd
> >   [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde
> >   [18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$
> >   [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> >   [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde
> >   [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
> >   [18276.638362] BTRFS warning (device sdf): checksum error at logical 2186354688 on dev /dev/sde, sector 2116624, root 5, inode 257, offset 73728, length 4096, links $
> >   [18276.640445] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
> >   [18276.645942] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186354688 on dev /dev/sde
> >   [18276.657204] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186412032 on dev /dev/sde
> >   [18276.660563] BTRFS warning (device sdf): checksum error at logical 2186416128 on dev /dev/sde, sector 2116744, root 5, inode 257, offset 135168, length 4096, links$
> >   [18276.664609] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
> >   [18276.664609] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186358784 on dev /dev/sde
> > 
> > So fix this by using the bio_for_each_segment() API and setting before
> > the bio's bi_iter field to the value of the corresponding btrfs bio
> > container's saved iterator if we are processing a cloned bio in the
> > raid5/6 code (the same code processes both cloned and non-cloned bios).
> > 
> > This incorrect iteration of cloned bios was also causing some occasional
> > BUG_ONs when running fstest btrfs/064, which have a trace like the
> > following:
> > 
> >   [ 6674.416156] ------------[ cut here ]------------
> >   [ 6674.416157] kernel BUG at fs/btrfs/raid56.c:1897!
> >   [ 6674.416159] invalid opcode: 0000 [#1] PREEMPT SMP
> >   [ 6674.416160] Modules linked in: dm_flakey dm_mod dax ppdev tpm_tis parport_pc tpm_tis_core evdev tpm psmouse sg i2c_piix4 pcspkr parport i2c_core serio_raw button s
> >   [ 6674.416184] CPU: 3 PID: 19236 Comm: kworker/u32:10 Not tainted 4.12.0-rc6-btrfs-next-44+ #1
> >   [ 6674.416185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> >   [ 6674.416210] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
> >   [ 6674.416211] task: ffff880147f6c740 task.stack: ffffc90001fb8000
> >   [ 6674.416229] RIP: 0010:__raid_recover_end_io+0x1ac/0x370 [btrfs]
> >   [ 6674.416230] RSP: 0018:ffffc90001fbbb90 EFLAGS: 00010217
> >   [ 6674.416231] RAX: ffff8801ff4b4f00 RBX: 0000000000000002 RCX: 0000000000000001
> >   [ 6674.416232] RDX: ffff880099b045d8 RSI: ffffffff81a5f6e0 RDI: 0000000000000004
> >   [ 6674.416232] RBP: ffffc90001fbbbc8 R08: 0000000000000001 R09: 0000000000000001
> >   [ 6674.416233] R10: ffffc90001fbbac8 R11: 0000000000001000 R12: 0000000000000002
> >   [ 6674.416234] R13: ffff880099b045c0 R14: 0000000000000004 R15: ffff88012bff2000
> >   [ 6674.416235] FS:  0000000000000000(0000) GS:ffff88023f2c0000(0000) knlGS:0000000000000000
> >   [ 6674.416235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >   [ 6674.416236] CR2: 00007f28cf282000 CR3: 00000001000c6000 CR4: 00000000000006e0
> >   [ 6674.416239] Call Trace:
> >   [ 6674.416259]  __raid56_parity_recover+0xfc/0x16e [btrfs]
> >   [ 6674.416276]  raid56_parity_recover+0x157/0x16b [btrfs]
> >   [ 6674.416293]  btrfs_map_bio+0xe0/0x259 [btrfs]
> >   [ 6674.416310]  btrfs_submit_bio_hook+0xbf/0x147 [btrfs]
> >   [ 6674.416327]  end_bio_extent_readpage+0x27b/0x4a0 [btrfs]
> >   [ 6674.416331]  bio_endio+0x17d/0x1b3
> >   [ 6674.416346]  end_workqueue_fn+0x3c/0x3f [btrfs]
> >   [ 6674.416362]  btrfs_scrubparity_helper+0x1aa/0x3b8 [btrfs]
> >   [ 6674.416379]  btrfs_endio_helper+0xe/0x10 [btrfs]
> >   [ 6674.416381]  process_one_work+0x276/0x4b6
> >   [ 6674.416384]  worker_thread+0x1ac/0x266
> >   [ 6674.416386]  ? rescuer_thread+0x278/0x278
> >   [ 6674.416387]  kthread+0x106/0x10e
> >   [ 6674.416389]  ? __list_del_entry+0x22/0x22
> >   [ 6674.416391]  ret_from_fork+0x27/0x40
> >   [ 6674.416395] Code: 44 89 e2 be 00 10 00 00 ff 15 b0 ab ef ff eb 72 4d 89 e8 89 d9 44 89 e2 be 00 10 00 00 ff 15 a3 ab ef ff eb 5d 41 83 fc ff 74 02 <0f> 0b 49 63 97
> >   [ 6674.416432] RIP: __raid_recover_end_io+0x1ac/0x370 [btrfs] RSP: ffffc90001fbbb90
> >   [ 6674.416434] ---[ end trace 74d56ebe7489dd6a ]---
> >
> 
> Thank you so much for the fix!
> 
> 
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/raid56.c | 26 ++++++++++++++++++--------
> >  1 file changed, 18 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> > index b9abb0b01021..b89d07003697 100644
> > --- a/fs/btrfs/raid56.c
> > +++ b/fs/btrfs/raid56.c
> > @@ -1136,20 +1136,27 @@ static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
> >  static void index_rbio_pages(struct btrfs_raid_bio *rbio)
> >  {
> >  	struct bio *bio;
> > -	struct bio_vec *bvec;
> >  	u64 start;
> >  	unsigned long stripe_offset;
> >  	unsigned long page_index;
> > -	int i;
> >  
> >  	spin_lock_irq(&rbio->bio_list_lock);
> >  	bio_list_for_each(bio, &rbio->bio_list) {
> > +		struct bio_vec bvec;
> > +		struct bvec_iter iter;
> > +		int i = 0;
> > +
> >  		start = (u64)bio->bi_iter.bi_sector << 9;
> >  		stripe_offset = start - rbio->bbio->raid_map[0];
> >  		page_index = stripe_offset >> PAGE_SHIFT;
> >  
> > -		bio_for_each_segment_all(bvec, bio, i)
> > -			rbio->bio_pages[page_index + i] = bvec->bv_page;
> > +		if (bio_flagged(bio, BIO_CLONED))
> > +			bio->bi_iter = btrfs_io_bio(bio)->iter;
> > +
> 
> I think we can use use bio->bi_iter directly as the bio is not
> submitted yet, i.e. bi_iter is not advanced yet.
> 
> > +		bio_for_each_segment(bvec, bio, iter) {
> > +			rbio->bio_pages[page_index + i] = bvec.bv_page;
> > +			i++;
> > +		}
> >  	}
> >  	spin_unlock_irq(&rbio->bio_list_lock);
> >  }
> > @@ -1423,11 +1430,14 @@ static int fail_bio_stripe(struct btrfs_raid_bio *rbio,
> >   */
> >  static void set_bio_pages_uptodate(struct bio *bio)
> >  {
> > -	struct bio_vec *bvec;
> > -	int i;
> > +	struct bio_vec bvec;
> > +	struct bvec_iter iter;
> > +
> > +	if (bio_flagged(bio, BIO_CLONED))
> > +		bio->bi_iter = btrfs_io_bio(bio)->iter;
> >

It is not necessary to update set_bio_pages_uptodate() because the bio
here is for reading pages in RMW and no way it could be a cloned one.

thanks,

-liubo

> 
> Ditto.
> 
> Others look good.
> 
> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
> 
> -liubo
> 
> > -	bio_for_each_segment_all(bvec, bio, i)
> > -		SetPageUptodate(bvec->bv_page);
> > +	bio_for_each_segment(bvec, bio, iter)
> > +		SetPageUptodate(bvec.bv_page);
> >  }
> >  
> >  /*
> > -- 
> > 2.11.0
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2017-07-13 22:41 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-13 14:09 [PATCH] Btrfs: fix write corruption due to bio cloning on raid5/6 fdmanana
2017-07-13 16:08 ` David Sterba
2017-07-13 16:36 ` Liu Bo
2017-07-13 16:46   ` Filipe Manana
2017-07-13 17:06     ` Liu Bo
2017-07-13 18:00       ` Liu Bo
2017-07-13 18:11         ` Filipe Manana
2017-07-13 19:09           ` Liu Bo
2017-07-14  8:51             ` Filipe Manana
2017-07-13 22:38   ` Liu Bo [this message]
2017-07-14  9:04     ` Filipe Manana
2017-07-14 12:45       ` David Sterba
2017-07-14 16:56       ` Liu Bo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170713223859.GC6181@lim.localdomain \
    --to=bo.li.liu@oracle.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).