* Re: 5.3-rc1 regression with XFS log recovery
       [not found] ` <20190820044135.GC1119@dread.disaster.area>
@ 2019-08-20  5:53 ` hch
  2019-08-20  7:44   ` Dave Chinner
  2019-08-20  8:13   ` Ming Lei
  0 siblings, 2 replies; 10+ messages in thread

From: hch @ 2019-08-20  5:53 UTC
To: Dave Chinner
Cc: hch@lst.de, Verma, Vishal L, linux-xfs@vger.kernel.org,
    Williams, Dan J, darrick.wong@oracle.com, Ming Lei, linux-block

On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > With the following debug patch.  Based on that I think I'll just
> > formally submit the vmalloc switch as we're at -rc5, and then we
> > can restart the unaligned slub allocation drama..
>
> This still doesn't make sense to me, because the pmem and brd code
> have no alignment limitations in their make_request code - they can
> handle byte addressing and should not have any problem at all with
> 8 byte aligned memory in bios.
>
> Digging a little further, I note that both brd and pmem use
> identical mechanisms to marshal data in and out of bios, so they
> are likely to have the same issue.
>
> So, brd_make_request() does:
>
>         bio_for_each_segment(bvec, bio, iter) {
>                 unsigned int len = bvec.bv_len;
>                 int err;
>
>                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
>                                   bio_op(bio), sector);
>                 if (err)
>                         goto io_error;
>                 sector += len >> SECTOR_SHIFT;
>         }
>
> So, the code behind bio_for_each_segment() splits multi-page bvecs
> into individual pages, which are passed to brd_do_bvec().  An
> unaligned 4kB io traces out as:
>
> [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
>
> i.e.  page              offset  len   sector
>       00000000a77f0146  768     3328  0x7d0048
>       000000006ceca91e  0       768   0x7d004e
>
> You should be able to guess what the problems are from this.
>
> Both pmem and brd are _sector_ based.  We've done a partial sector
> copy on the first bvec, then the second bvec has started the copy
> from the wrong offset into the sector we've done a partial copy
> from.
>
> IOWs, no error is reported when the bvec buffer isn't sector
> aligned, no error is reported when the length of data to copy was
> not a multiple of sector size, and no error was reported when we
> copied the same partial sector twice.

Yes.  I think bio_for_each_segment is buggy here, as it should not
blindly split by pages.  Ccing Ming as he wrote much of this code.
I'll also dig into fixing it, but I just arrived in Japan and might
be a little jetlagged.

> There's nothing quite like being repeatedly bitten by the same
> misalignment bug because there's no validation in the infrastructure
> that could catch it immediately and throw a useful warning/error
> message.

The xen block driver doesn't use bio_for_each_segment, so it isn't
exactly the same issue, but a closely related one.  Maybe until we
sort all this mess out we just need to depend on !SLUB_DEBUG for XFS?
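[Editor's note, spelling out the arithmetic in the trace above; an
illustration, not part of the thread.  The buffer starts at page
offset 768, which is 256 bytes past a 512-byte boundary.  The first
bvec copies 3328 bytes (6.5 sectors) to byte 0 of sector 0x7d0048,
then advances the sector cursor by 3328 >> 9 = 6, silently dropping
the half sector.  The second bvec then copies its 768 bytes to byte 0
of sector 0x7d004e = 0x7d0048 + 6, overwriting the 256 bytes the
first copy had already placed in that sector.  The data therefore
lands shifted by 256 bytes, with one partial sector written twice.]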
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  5:53 ` 5.3-rc1 regression with XFS log recovery hch
@ 2019-08-20  7:44 ` Dave Chinner
  2019-08-20  8:13 ` Ming Lei
  1 sibling, 0 replies; 10+ messages in thread

From: Dave Chinner @ 2019-08-20  7:44 UTC
To: hch@lst.de
Cc: Verma, Vishal L, linux-xfs@vger.kernel.org, Williams, Dan J,
    darrick.wong@oracle.com, Ming Lei, linux-block

On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > With the following debug patch.  Based on that I think I'll just
> > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > can restart the unaligned slub allocation drama..
> >
> > This still doesn't make sense to me, because the pmem and brd code
> > have no alignment limitations in their make_request code - they can
> > handle byte addressing and should not have any problem at all with
> > 8 byte aligned memory in bios.
> >
> > Digging a little further, I note that both brd and pmem use
> > identical mechanisms to marshal data in and out of bios, so they
> > are likely to have the same issue.
> >
> > So, brd_make_request() does:
> >
> >         bio_for_each_segment(bvec, bio, iter) {
> >                 unsigned int len = bvec.bv_len;
> >                 int err;
> >
> >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> >                                   bio_op(bio), sector);
> >                 if (err)
> >                         goto io_error;
> >                 sector += len >> SECTOR_SHIFT;
> >         }
> >
> > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > into individual pages, which are passed to brd_do_bvec().  An
> > unaligned 4kB io traces out as:
> >
> > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> >
> > i.e.  page              offset  len   sector
> >       00000000a77f0146  768     3328  0x7d0048
> >       000000006ceca91e  0       768   0x7d004e
> >
> > You should be able to guess what the problems are from this.
> >
> > Both pmem and brd are _sector_ based.  We've done a partial sector
> > copy on the first bvec, then the second bvec has started the copy
> > from the wrong offset into the sector we've done a partial copy
> > from.
> >
> > IOWs, no error is reported when the bvec buffer isn't sector
> > aligned, no error is reported when the length of data to copy was
> > not a multiple of sector size, and no error was reported when we
> > copied the same partial sector twice.
>
> Yes.  I think bio_for_each_segment is buggy here, as it should not
> blindly split by pages.  Ccing Ming as he wrote much of this code.
> I'll also dig into fixing it, but I just arrived in Japan and might
> be a little jetlagged.
>
> > There's nothing quite like being repeatedly bitten by the same
> > misalignment bug because there's no validation in the infrastructure
> > that could catch it immediately and throw a useful warning/error
> > message.
>
> The xen block driver doesn't use bio_for_each_segment, so it isn't
> exactly the same issue, but a closely related one.

Both stem from the fact that nothing in the block layer validates
memory alignment constraints.  Xenblk, pmem and brd all return 511
from queue_dma_alignment(), and all break when passed memory that
isn't aligned to 512 bytes.

There aren't even debug options we can turn on that would tell us
this is happening.  Instead, we start with data corruption and have
to work backwards to find the needle in the haystack from there.
EIO and a WARN_ONCE would be a massive improvement here....

> Maybe until we sort all this mess out we just need to depend on
> !SLUB_DEBUG for XFS?

SLUB_DEBUG=y by itself doesn't cause problems - I run that all the
time because otherwise there's no /proc/slabinfo with slub.  I used
KASAN to get the above alignment behaviour - it's SLUB_DEBUG_ON=y
that perturbs the alignment, and I think the same thing can happen
with SLAB_DEBUG=y, so there are several dependencies we'd have to
add here.

Is there any way we can pass kmalloc a "use aligned heap" GFP flag
so that it allocates from the -rcl slabs to guarantee alignment,
rather than the standard slabs that change alignment with config
options?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
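[Editor's illustration, not from the thread: the WARN_ONCE-style
validation Dave asks for could in principle live in the driver's
segment loop.  queue_dma_alignment() is the real helper; the exact
placement in brd_make_request() and the message text are assumptions:

        bio_for_each_segment(bvec, bio, iter) {
                unsigned int len = bvec.bv_len;
                int err;

                /*
                 * Hypothetical guard: fail fast on segments that
                 * violate the queue's advertised DMA alignment,
                 * instead of silently doing partial-sector copies.
                 */
                if (WARN_ONCE((bvec.bv_offset | len) &
                              queue_dma_alignment(brd->brd_queue),
                              "brd: misaligned bvec, off %u len %u\n",
                              bvec.bv_offset, len))
                        goto io_error;

                err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
                                  bio_op(bio), sector);
                if (err)
                        goto io_error;
                sector += len >> SECTOR_SHIFT;
        }

As for the "aligned heap" question: kmalloc has no such GFP flag, but
a dedicated slab cache with an explicit alignment argument is one
existing way to get the guarantee.  A minimal sketch, assuming a
hypothetical "xlog_buf" cache for the log buffers:

        /*
         * Objects from this cache are always 512-byte aligned,
         * independent of the generic kmalloc slab layout.
         */
        struct kmem_cache *xlog_buf_cache =
                kmem_cache_create("xlog_buf", size, 512, 0, NULL);

        buf = kmem_cache_alloc(xlog_buf_cache, GFP_KERNEL);

Whether the slab debug options preserve an explicit align argument in
every configuration would need verifying against the slab code.]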
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  5:53 ` 5.3-rc1 regression with XFS log recovery hch
  2019-08-20  7:44 ` Dave Chinner
@ 2019-08-20  8:13 ` Ming Lei
  2019-08-20  9:24   ` Ming Lei
  1 sibling, 1 reply; 10+ messages in thread

From: Ming Lei @ 2019-08-20  8:13 UTC
To: hch@lst.de
Cc: Dave Chinner, Verma, Vishal L, linux-xfs@vger.kernel.org,
    Williams, Dan J, darrick.wong@oracle.com, linux-block

On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > With the following debug patch.  Based on that I think I'll just
> > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > can restart the unaligned slub allocation drama..
> >
> > This still doesn't make sense to me, because the pmem and brd code
> > have no alignment limitations in their make_request code - they can
> > handle byte addressing and should not have any problem at all with
> > 8 byte aligned memory in bios.
> >
> > Digging a little further, I note that both brd and pmem use
> > identical mechanisms to marshal data in and out of bios, so they
> > are likely to have the same issue.
> >
> > So, brd_make_request() does:
> >
> >         bio_for_each_segment(bvec, bio, iter) {
> >                 unsigned int len = bvec.bv_len;
> >                 int err;
> >
> >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> >                                   bio_op(bio), sector);
> >                 if (err)
> >                         goto io_error;
> >                 sector += len >> SECTOR_SHIFT;
> >         }
> >
> > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > into individual pages, which are passed to brd_do_bvec().  An
> > unaligned 4kB io traces out as:
> >
> > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> >
> > i.e.  page              offset  len   sector
> >       00000000a77f0146  768     3328  0x7d0048
> >       000000006ceca91e  0       768   0x7d004e
> >
> > You should be able to guess what the problems are from this.

The problem should be that the offset of '768' is passed to
bio_add_page().  Presumably this is a slub buffer used for block IO;
it looks like an old, unsolved problem.

> >
> > Both pmem and brd are _sector_ based.  We've done a partial sector
> > copy on the first bvec, then the second bvec has started the copy
> > from the wrong offset into the sector we've done a partial copy
> > from.
> >
> > IOWs, no error is reported when the bvec buffer isn't sector
> > aligned, no error is reported when the length of data to copy was
> > not a multiple of sector size, and no error was reported when we
> > copied the same partial sector twice.
>
> Yes.  I think bio_for_each_segment is buggy here, as it should not
> blindly split by pages.

bio_for_each_segment() just keeps the original single-page interface
behaviour from before multi-page bvecs were introduced.

Thanks,
Ming
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  8:13 ` Ming Lei
@ 2019-08-20  9:24 ` Ming Lei
  2019-08-20 16:30   ` Verma, Vishal L
  2019-08-20 21:44   ` Dave Chinner
  0 siblings, 2 replies; 10+ messages in thread

From: Ming Lei @ 2019-08-20  9:24 UTC
To: hch@lst.de, Verma, Vishal L
Cc: Dave Chinner, linux-xfs@vger.kernel.org, Williams, Dan J,
    darrick.wong@oracle.com, linux-block

On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > > With the following debug patch.  Based on that I think I'll just
> > > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > > can restart the unaligned slub allocation drama..
> > >
> > > This still doesn't make sense to me, because the pmem and brd code
> > > have no alignment limitations in their make_request code - they can
> > > handle byte addressing and should not have any problem at all with
> > > 8 byte aligned memory in bios.
> > >
> > > Digging a little further, I note that both brd and pmem use
> > > identical mechanisms to marshal data in and out of bios, so they
> > > are likely to have the same issue.
> > >
> > > So, brd_make_request() does:
> > >
> > >         bio_for_each_segment(bvec, bio, iter) {
> > >                 unsigned int len = bvec.bv_len;
> > >                 int err;
> > >
> > >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> > >                                   bio_op(bio), sector);
> > >                 if (err)
> > >                         goto io_error;
> > >                 sector += len >> SECTOR_SHIFT;
> > >         }
> > >
> > > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > > into individual pages, which are passed to brd_do_bvec().  An
> > > unaligned 4kB io traces out as:
> > >
> > > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > >
> > > i.e.  page              offset  len   sector
> > >       00000000a77f0146  768     3328  0x7d0048
> > >       000000006ceca91e  0       768   0x7d004e
> > >
> > > You should be able to guess what the problems are from this.
>
> The problem should be that the offset of '768' is passed to
> bio_add_page().

It can be quite hard to deal with a non-512-aligned sector buffer,
since one sector buffer may cross two pages.  So far, the one
workaround I have thought of is to not merge such an IO buffer into
one bvec.

Verma, could you try the following patch?

diff --git a/block/bio.c b/block/bio.c
index 24a496f5d2e2..49deab2ac8c4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return false;
 
+	if (off & 511)
+		return false;
+
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];

Thanks,
Ming
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  9:24 ` Ming Lei
@ 2019-08-20 16:30 ` Verma, Vishal L
  0 siblings, 0 replies; 10+ messages in thread

From: Verma, Vishal L @ 2019-08-20 16:30 UTC
To: hch@lst.de, ming.lei@redhat.com
Cc: Williams, Dan J, linux-xfs@vger.kernel.org, david@fromorbit.com,
    darrick.wong@oracle.com, linux-block@vger.kernel.org

On Tue, 2019-08-20 at 17:24 +0800, Ming Lei wrote:
>
> It can be quite hard to deal with a non-512-aligned sector buffer,
> since one sector buffer may cross two pages.  So far, the one
> workaround I have thought of is to not merge such an IO buffer into
> one bvec.
>
> Verma, could you try the following patch?

Hi Ming,

I can hit the same failure with this patch.  Full thread, in case you
haven't already seen it:

https://lore.kernel.org/linux-xfs/e49a6a3a244db055995769eb844c281f93e50ab9.camel@intel.com/

>
> diff --git a/block/bio.c b/block/bio.c
> index 24a496f5d2e2..49deab2ac8c4 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
>  		return false;
>  
> +	if (off & 511)
> +		return false;
> +
>  	if (bio->bi_vcnt > 0) {
>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>
> Thanks,
> Ming
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  9:24 ` Ming Lei
  2019-08-20 16:30   ` Verma, Vishal L
@ 2019-08-20 21:44   ` Dave Chinner
  2019-08-20 22:08     ` Verma, Vishal L
  2019-08-21  1:56     ` Ming Lei
  1 sibling, 2 replies; 10+ messages in thread

From: Dave Chinner @ 2019-08-20 21:44 UTC
To: Ming Lei
Cc: hch@lst.de, Verma, Vishal L, linux-xfs@vger.kernel.org,
    Williams, Dan J, darrick.wong@oracle.com, linux-block

On Tue, Aug 20, 2019 at 05:24:25PM +0800, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote:
> > On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> > > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > > > With the following debug patch.  Based on that I think I'll just
> > > > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > > > can restart the unaligned slub allocation drama..
> > > >
> > > > This still doesn't make sense to me, because the pmem and brd code
> > > > have no alignment limitations in their make_request code - they can
> > > > handle byte addressing and should not have any problem at all with
> > > > 8 byte aligned memory in bios.
> > > >
> > > > Digging a little further, I note that both brd and pmem use
> > > > identical mechanisms to marshal data in and out of bios, so they
> > > > are likely to have the same issue.
> > > >
> > > > So, brd_make_request() does:
> > > >
> > > >         bio_for_each_segment(bvec, bio, iter) {
> > > >                 unsigned int len = bvec.bv_len;
> > > >                 int err;
> > > >
> > > >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> > > >                                   bio_op(bio), sector);
> > > >                 if (err)
> > > >                         goto io_error;
> > > >                 sector += len >> SECTOR_SHIFT;
> > > >         }
> > > >
> > > > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > > > into individual pages, which are passed to brd_do_bvec().  An
> > > > unaligned 4kB io traces out as:
> > > >
> > > > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > > > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > > >
> > > > i.e.  page              offset  len   sector
> > > >       00000000a77f0146  768     3328  0x7d0048
> > > >       000000006ceca91e  0       768   0x7d004e
> > > >
> > > > You should be able to guess what the problems are from this.
> >
> > The problem should be that the offset of '768' is passed to
> > bio_add_page().
>
> It can be quite hard to deal with a non-512-aligned sector buffer,
> since one sector buffer may cross two pages.  So far, the one
> workaround I have thought of is to not merge such an IO buffer into
> one bvec.
>
> Verma, could you try the following patch?
>
> diff --git a/block/bio.c b/block/bio.c
> index 24a496f5d2e2..49deab2ac8c4 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
>  		return false;
>  
> +	if (off & 511)
> +		return false;

What does this achieve?  It only prevents the unaligned segment from
being merged, it doesn't prevent it from being added to a new bvec.

However, the case here is that:

> > > > i.e.  page              offset  len   sector
> > > >       00000000a77f0146  768     3328  0x7d0048
> > > >       000000006ceca91e  0       768   0x7d004e

The second page added to the bvec is actually offset aligned.  Hence
the check would do nothing on the first page because the bvec array
is empty (so it goes into a new bvec anyway), and the check on the
second page would do nothing and it would merge with the first
because its offset is aligned correctly.  In both cases, the length
of the segment is not aligned, so that needs to be checked, too.

IOWs, I think the check needs to be in bio_add_page: it needs to
check both the offset and length for alignment, and it needs to grab
the alignment from queue_dma_alignment(), not use a hard-coded value
of 511.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 21:44 ` Dave Chinner
@ 2019-08-20 22:08 ` Verma, Vishal L
  2019-08-20 23:53   ` Dave Chinner
  2019-08-21  2:19   ` Ming Lei
  0 siblings, 2 replies; 10+ messages in thread

From: Verma, Vishal L @ 2019-08-20 22:08 UTC
To: david@fromorbit.com, ming.lei@redhat.com
Cc: hch@lst.de, linux-xfs@vger.kernel.org, Williams, Dan J,
    linux-block@vger.kernel.org, darrick.wong@oracle.com

On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote:
>
> However, the case here is that:
>
> > > > > i.e.  page              offset  len   sector
> > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > >       000000006ceca91e  0       768   0x7d004e
>
> The second page added to the bvec is actually offset aligned.  Hence
> the check would do nothing on the first page because the bvec array
> is empty (so it goes into a new bvec anyway), and the check on the
> second page would do nothing and it would merge with the first
> because its offset is aligned correctly.  In both cases, the length
> of the segment is not aligned, so that needs to be checked, too.
>
> IOWs, I think the check needs to be in bio_add_page: it needs to
> check both the offset and length for alignment, and it needs to grab
> the alignment from queue_dma_alignment(), not use a hard-coded value
> of 511.

So something like this?

diff --git a/block/bio.c b/block/bio.c
index 299a0e7651ec..80f449d23e5a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
 		 unsigned int len, unsigned int offset)
 {
+	struct request_queue *q = bio->bi_disk->queue;
 	bool same_page = false;
 
+	if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q))
+		return 0;
+
 	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
 		if (bio_full(bio, len))
 			return 0;

I tried this, but the 'mount' just hangs - which looks like it might
be due to xfs_rw_bdev() doing:

	while (bio_add_page(bio, page, len, off) != len) {
	...
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 22:08 ` Verma, Vishal L
@ 2019-08-20 23:53 ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread

From: Dave Chinner @ 2019-08-20 23:53 UTC
To: Verma, Vishal L
Cc: ming.lei@redhat.com, hch@lst.de, linux-xfs@vger.kernel.org,
    Williams, Dan J, linux-block@vger.kernel.org,
    darrick.wong@oracle.com

On Tue, Aug 20, 2019 at 10:08:38PM +0000, Verma, Vishal L wrote:
> On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote:
> >
> > However, the case here is that:
> >
> > > > > > i.e.  page              offset  len   sector
> > > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > > >       000000006ceca91e  0       768   0x7d004e
> >
> > The second page added to the bvec is actually offset aligned.  Hence
> > the check would do nothing on the first page because the bvec array
> > is empty (so it goes into a new bvec anyway), and the check on the
> > second page would do nothing and it would merge with the first
> > because its offset is aligned correctly.  In both cases, the length
> > of the segment is not aligned, so that needs to be checked, too.
> >
> > IOWs, I think the check needs to be in bio_add_page: it needs to
> > check both the offset and length for alignment, and it needs to grab
> > the alignment from queue_dma_alignment(), not use a hard-coded value
> > of 511.
>
> So something like this?
>
> diff --git a/block/bio.c b/block/bio.c
> index 299a0e7651ec..80f449d23e5a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
>  int bio_add_page(struct bio *bio, struct page *page,
>  		 unsigned int len, unsigned int offset)
>  {
> +	struct request_queue *q = bio->bi_disk->queue;
>  	bool same_page = false;
>  
> +	if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q))
> +		return 0;
> +
>  	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
>  		if (bio_full(bio, len))
>  			return 0;
>
> I tried this, but the 'mount' just hangs - which looks like it might
> be due to xfs_rw_bdev() doing:
>
> 	while (bio_add_page(bio, page, len, off) != len) {

That's the return of zero that causes the loop to make no progress.
i.e. a return of 0 means "won't fit in bio, allocate a new bio and
try again".  It's not an error return, so always returning zero will
eventually chew up all your memory allocating bios it doesn't use,
because submit_bio() doesn't return errors on chained bios until the
final bio in the chain is completed.

Add a bio_add_page_checked() function that does exactly the same
thing as bio_add_page(), but add the

	if (WARN_ON_ONCE((offset | len) & queue_dma_alignment(q)))
		return -EIO;

to it, and change the xfs code to:

	while ((ret = bio_add_page_checked(bio, page, len, off)) != len) {
		if (ret < 0) {
			/*
			 * Submit the bio to wait on the rest of the
			 * chain to complete, then return an error.
			 * This is a really shitty failure on write,
			 * as we will have just done a partial write
			 * and effectively corrupted something on
			 * disk.
			 */
			submit_bio_wait(bio);
			return ret;
		}
		....
	}

We probably should change all the XFS calls to bio_add_page() to
bio_add_page_checked() while we are at it, because we have the same
alignment problem through xfs_buf.c and, potentially, on iclogs via
xfs_log.c, as iclogs are allocated with kmem_alloc_large(), too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
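[Editor's sketch, not from the thread: a self-contained version of
the helper Dave describes might look like the following.
bio_add_page_checked() is hypothetical, not an existing kernel API,
and, as Ming points out in the next message, bio->bi_disk may not yet
be set when pages are added, so the queue lookup is an assumption:

	/*
	 * Checked variant of bio_add_page(): reject segments that
	 * violate the queue's DMA alignment instead of letting them
	 * silently corrupt sector-based copies later.
	 */
	int bio_add_page_checked(struct bio *bio, struct page *page,
				 unsigned int len, unsigned int offset)
	{
		struct request_queue *q = bio->bi_disk->queue;

		if (WARN_ON_ONCE((offset | len) & queue_dma_alignment(q)))
			return -EIO;

		return bio_add_page(bio, page, len, offset);
	}

A caller can then distinguish "doesn't fit, allocate a new bio"
(return 0) from "misaligned buffer" (negative errno), which is what
the loop Dave sketches above relies on.]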
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 22:08 ` Verma, Vishal L
  2019-08-20 23:53   ` Dave Chinner
@ 2019-08-21  2:19   ` Ming Lei
  1 sibling, 0 replies; 10+ messages in thread

From: Ming Lei @ 2019-08-21  2:19 UTC
To: Verma, Vishal L
Cc: david@fromorbit.com, hch@lst.de, linux-xfs@vger.kernel.org,
    Williams, Dan J, linux-block@vger.kernel.org,
    darrick.wong@oracle.com

On Tue, Aug 20, 2019 at 10:08:38PM +0000, Verma, Vishal L wrote:
> On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote:
> >
> > However, the case here is that:
> >
> > > > > > i.e.  page              offset  len   sector
> > > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > > >       000000006ceca91e  0       768   0x7d004e
> >
> > The second page added to the bvec is actually offset aligned.  Hence
> > the check would do nothing on the first page because the bvec array
> > is empty (so it goes into a new bvec anyway), and the check on the
> > second page would do nothing and it would merge with the first
> > because its offset is aligned correctly.  In both cases, the length
> > of the segment is not aligned, so that needs to be checked, too.
> >
> > IOWs, I think the check needs to be in bio_add_page: it needs to
> > check both the offset and length for alignment, and it needs to grab
> > the alignment from queue_dma_alignment(), not use a hard-coded value
> > of 511.
>
> So something like this?
>
> diff --git a/block/bio.c b/block/bio.c
> index 299a0e7651ec..80f449d23e5a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
>  int bio_add_page(struct bio *bio, struct page *page,
>  		 unsigned int len, unsigned int offset)
>  {
> +	struct request_queue *q = bio->bi_disk->queue;
>  	bool same_page = false;
>  
> +	if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q))
> +		return 0;
> +
>  	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
>  		if (bio_full(bio, len))
>  			return 0;

bio->bi_disk or bio->bi_disk->queue may not be available in
bio_add_page().

And 'len' has to be aligned to the logical block size of the disk the
fs is mounted on, and that information is obviously available to the
fs.  So the following code is really buggy:

xfs_rw_bdev():

+	struct page	*page = kmem_to_page(data);
+	unsigned int	off = offset_in_page(data);
+	unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
+
+	while (bio_add_page(bio, page, len, off) != len) {

Thanks,
Ming
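[Editor's sketch, not from the thread: the caller-side guard Ming
implies, using the logical block size the filesystem already knows.
bdev_logical_block_size() is a real helper; placing the check at the
top of xfs_rw_bdev() is an assumption:

	/*
	 * The fs knows the device's logical block size, so it can
	 * validate its own buffer before building any bios.
	 */
	unsigned int align = bdev_logical_block_size(bdev) - 1;

	if (WARN_ON_ONCE(((unsigned long)data | count) & align))
		return -EIO;
]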
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 21:44 ` Dave Chinner
  2019-08-20 22:08   ` Verma, Vishal L
@ 2019-08-21  1:56   ` Ming Lei
  1 sibling, 0 replies; 10+ messages in thread

From: Ming Lei @ 2019-08-21  1:56 UTC
To: Dave Chinner
Cc: hch@lst.de, Verma, Vishal L, linux-xfs@vger.kernel.org,
    Williams, Dan J, darrick.wong@oracle.com, linux-block

On Wed, Aug 21, 2019 at 07:44:09AM +1000, Dave Chinner wrote:
> On Tue, Aug 20, 2019 at 05:24:25PM +0800, Ming Lei wrote:
> > On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote:
> > > On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> > > > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > > > > With the following debug patch.  Based on that I think I'll just
> > > > > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > > > > can restart the unaligned slub allocation drama..
> > > > >
> > > > > This still doesn't make sense to me, because the pmem and brd code
> > > > > have no alignment limitations in their make_request code - they can
> > > > > handle byte addressing and should not have any problem at all with
> > > > > 8 byte aligned memory in bios.
> > > > >
> > > > > Digging a little further, I note that both brd and pmem use
> > > > > identical mechanisms to marshal data in and out of bios, so they
> > > > > are likely to have the same issue.
> > > > >
> > > > > So, brd_make_request() does:
> > > > >
> > > > >         bio_for_each_segment(bvec, bio, iter) {
> > > > >                 unsigned int len = bvec.bv_len;
> > > > >                 int err;
> > > > >
> > > > >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> > > > >                                   bio_op(bio), sector);
> > > > >                 if (err)
> > > > >                         goto io_error;
> > > > >                 sector += len >> SECTOR_SHIFT;
> > > > >         }
> > > > >
> > > > > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > > > > into individual pages, which are passed to brd_do_bvec().  An
> > > > > unaligned 4kB io traces out as:
> > > > >
> > > > > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > > > > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > > > >
> > > > > i.e.  page              offset  len   sector
> > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > >       000000006ceca91e  0       768   0x7d004e
> > > > >
> > > > > You should be able to guess what the problems are from this.
> > >
> > > The problem should be that the offset of '768' is passed to
> > > bio_add_page().
> >
> > It can be quite hard to deal with a non-512-aligned sector buffer,
> > since one sector buffer may cross two pages.  So far, the one
> > workaround I have thought of is to not merge such an IO buffer into
> > one bvec.
> >
> > Verma, could you try the following patch?
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index 24a496f5d2e2..49deab2ac8c4 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
> >  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
> >  		return false;
> >  
> > +	if (off & 511)
> > +		return false;
>
> What does this achieve?  It only prevents the unaligned segment from
> being merged, it doesn't prevent it from being added to a new bvec.

The current issue is that the block layer won't handle such a case.
As I mentioned, one sector buffer may cross two pages, so it can't be
split successfully; we simply put it into one new bvec, just as
before multi-page bvecs were enabled.

> However, the case here is that:
>
> > > > > i.e.  page              offset  len   sector
> > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > >       000000006ceca91e  0       768   0x7d004e
>
> The second page added to the bvec is actually offset aligned.  Hence
> the check would do nothing on the first page because the bvec array
> is empty (so it goes into a new bvec anyway), and the check on the
> second page would do nothing and it would merge with the first
> because its offset is aligned correctly.  In both cases, the length
> of the segment is not aligned, so that needs to be checked, too.

What the patch changes is the bvec stored in the bio; the bvec above
is actually built in flight.  So if the 1st page added to the bio is
(768, 512), then bio_for_each_segment() will finally generate the 1st
page as (768, 512), and everything will be fine.

> IOWs, I think the check needs to be in bio_add_page: it needs to
> check both the offset and length for alignment, and it needs to grab

The length has to be 512-byte aligned, otherwise it is simply a bug
in the fs.

> the alignment from queue_dma_alignment(), not use a hard-coded value
> of 511.

The current policy for bio_add_page() is to not check any queue
limit, given we don't know the actual limit for the final IO; the
queue may not even be set when bio_add_page() is called.

If the upper layer doesn't pass in a slub buffer that is larger than
PAGE_SIZE, the block layer may handle it well even without the
512-byte alignment check.

Thanks,
Ming
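[Editor's note, for context: the "vmalloc switch" mentioned at the
top of the thread sidesteps the whole problem, because vmalloc memory
is page-aligned by construction.  A minimal sketch of that style of
allocation (an illustration, not XFS's exact code):

	/*
	 * vmalloc memory starts on a page boundary, so bio segments
	 * built from it can never be sub-sector aligned.
	 */
	buf = vmalloc(size);
	if (!buf)
		return -ENOMEM;
	...
	vfree(buf);
]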
Thread overview: 10+ messages (newest: 2019-08-21  2:19 UTC)
[not found] <20190818071128.GA17286@lst.de>
[not found] ` <20190818074140.GA18648@lst.de>
  [not found]   ` <20190818173426.GA32311@lst.de>
    [not found]     ` <20190819000831.GX6129@dread.disaster.area>
      [not found]       ` <20190819034948.GA14261@lst.de>
        [not found]         ` <20190819041132.GA14492@lst.de>
          [not found]           ` <20190819042259.GZ6129@dread.disaster.area>
            [not found]             ` <20190819042905.GA15613@lst.de>
              [not found]               ` <20190819044012.GA15800@lst.de>
                [not found]                 ` <20190820044135.GC1119@dread.disaster.area>
2019-08-20  5:53                           ` 5.3-rc1 regression with XFS log recovery hch
2019-08-20  7:44                             ` Dave Chinner
2019-08-20  8:13                             ` Ming Lei
2019-08-20  9:24                               ` Ming Lei
2019-08-20 16:30                                 ` Verma, Vishal L
2019-08-20 21:44                                 ` Dave Chinner
2019-08-20 22:08                                   ` Verma, Vishal L
2019-08-20 23:53                                     ` Dave Chinner
2019-08-21  2:19                                     ` Ming Lei
2019-08-21  1:56                                   ` Ming Lei