* Re: 5.3-rc1 regression with XFS log recovery
       [not found] ` <20190820044135.GC1119@dread.disaster.area>
@ 2019-08-20  5:53 ` hch
  2019-08-20  7:44   ` Dave Chinner
  2019-08-20  8:13   ` Ming Lei
  0 siblings, 2 replies; 10+ messages in thread

From: hch @ 2019-08-20  5:53 UTC
To: Dave Chinner
Cc: hch@lst.de, Verma, Vishal L, linux-xfs@vger.kernel.org,
    Williams, Dan J, darrick.wong@oracle.com, Ming Lei, linux-block

On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > With the following debug patch.  Based on that I think I'll just
> > formally submit the vmalloc switch as we're at -rc5, and then we
> > can restart the unaligned slub allocation drama..
>
> This still doesn't make sense to me, because the pmem and brd code
> have no alignment limitations in their make_request code - they can
> handle byte addressing and should not have any problem at all with
> 8 byte aligned memory in bios.
>
> Digging a little further, I note that both brd and pmem use
> identical mechanisms to marshal data in and out of bios, so they
> are likely to have the same issue.
>
> So, brd_make_request() does:
>
>         bio_for_each_segment(bvec, bio, iter) {
>                 unsigned int len = bvec.bv_len;
>                 int err;
>
>                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
>                                   bio_op(bio), sector);
>                 if (err)
>                         goto io_error;
>                 sector += len >> SECTOR_SHIFT;
>         }
>
> So, the code behind bio_for_each_segment() splits multi-page bvecs
> into individual pages, which are passed to brd_do_bvec().  An
> unaligned 4kB io traces out as:
>
> [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
>
> i.e.  page              offset  len   sector
>       00000000a77f0146  768     3328  0x7d0048
>       000000006ceca91e  0       768   0x7d004e
>
> You should be able to guess what the problems are from this.
>
> Both pmem and brd are _sector_ based.  We've done a partial sector
> copy on the first bvec, then the second bvec has started the copy
> from the wrong offset into the sector we've done a partial copy
> from.
>
> IOWs, no error is reported when the bvec buffer isn't sector
> aligned, no error is reported when the length of data to copy was
> not a multiple of sector size, and no error was reported when we
> copied the same partial sector twice.

Yes.  I think bio_for_each_segment is buggy here, as it should not
blindly split by pages.  Ccing Ming as he wrote much of this code.
I'll also dig into fixing it, but I just arrived in Japan and might
be a little jetlagged.

> There's nothing quite like being repeatedly bitten by the same
> misalignment bug because there's no validation in the infrastructure
> that could catch it immediately and throw a useful warning/error
> message.

The xen block driver doesn't use bio_for_each_segment, so it isn't
exactly the same issue, but a closely related one.  Maybe until we
sort all this mess out we just need to depend on !SLUB_DEBUG for XFS?
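[Editor's note, spelling out the arithmetic in the trace above; an
illustration, not part of the thread.  The buffer starts at page
offset 768, which is 256 bytes past a 512-byte boundary.  The first
bvec copies 3328 bytes (6.5 sectors) to byte 0 of sector 0x7d0048,
then advances the sector cursor by 3328 >> 9 = 6, silently dropping
the half sector.  The second bvec then copies its 768 bytes to byte 0
of sector 0x7d004e = 0x7d0048 + 6, overwriting the 256 bytes the
first copy had already placed in that sector.  The data therefore
lands shifted by 256 bytes, with one partial sector written twice.]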
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  5:53 ` 5.3-rc1 regression with XFS log recovery hch
@ 2019-08-20  7:44 ` Dave Chinner
  2019-08-20  8:13 ` Ming Lei
  1 sibling, 0 replies; 10+ messages in thread

From: Dave Chinner @ 2019-08-20  7:44 UTC
To: hch@lst.de
Cc: Verma, Vishal L, linux-xfs@vger.kernel.org, Williams, Dan J,
    darrick.wong@oracle.com, Ming Lei, linux-block

On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > With the following debug patch.  Based on that I think I'll just
> > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > can restart the unaligned slub allocation drama..
> >
> > This still doesn't make sense to me, because the pmem and brd code
> > have no alignment limitations in their make_request code - they can
> > handle byte addressing and should not have any problem at all with
> > 8 byte aligned memory in bios.
> >
> > Digging a little further, I note that both brd and pmem use
> > identical mechanisms to marshal data in and out of bios, so they
> > are likely to have the same issue.
> >
> > So, brd_make_request() does:
> >
> >         bio_for_each_segment(bvec, bio, iter) {
> >                 unsigned int len = bvec.bv_len;
> >                 int err;
> >
> >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> >                                   bio_op(bio), sector);
> >                 if (err)
> >                         goto io_error;
> >                 sector += len >> SECTOR_SHIFT;
> >         }
> >
> > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > into individual pages, which are passed to brd_do_bvec().  An
> > unaligned 4kB io traces out as:
> >
> > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> >
> > i.e.  page              offset  len   sector
> >       00000000a77f0146  768     3328  0x7d0048
> >       000000006ceca91e  0       768   0x7d004e
> >
> > You should be able to guess what the problems are from this.
> >
> > Both pmem and brd are _sector_ based.  We've done a partial sector
> > copy on the first bvec, then the second bvec has started the copy
> > from the wrong offset into the sector we've done a partial copy
> > from.
> >
> > IOWs, no error is reported when the bvec buffer isn't sector
> > aligned, no error is reported when the length of data to copy was
> > not a multiple of sector size, and no error was reported when we
> > copied the same partial sector twice.
>
> Yes.  I think bio_for_each_segment is buggy here, as it should not
> blindly split by pages.  Ccing Ming as he wrote much of this code.
> I'll also dig into fixing it, but I just arrived in Japan and might
> be a little jetlagged.
>
> > There's nothing quite like being repeatedly bitten by the same
> > misalignment bug because there's no validation in the infrastructure
> > that could catch it immediately and throw a useful warning/error
> > message.
>
> The xen block driver doesn't use bio_for_each_segment, so it isn't
> exactly the same issue, but a closely related one.

Both stem from the fact that nothing in the block layer validates
memory alignment constraints.  Xenblk, pmem and brd all return 511
from queue_dma_alignment(), and all break when passed memory that
isn't aligned to 512 bytes.

There aren't even debug options we can turn on that would tell us
this is happening.  Instead, we start with data corruption and have
to work backwards to find the needle in the haystack from there.
EIO and a WARN_ONCE would be a massive improvement here....

> Maybe until we sort all this mess out we just need to depend on
> !SLUB_DEBUG for XFS?

SLUB_DEBUG=y by itself doesn't cause problems - I run that all the
time because otherwise there's no /proc/slabinfo with slub.  I used
KASAN to get the above alignment behaviour - it's SLUB_DEBUG_ON=y
that perturbs the alignment, and I think the same thing can happen
with SLAB_DEBUG=y, so there are several dependencies we'd have to
add here.

Is there any way we can pass kmalloc a "use aligned heap" GFP flag
so that it allocates from the -rcl slabs to guarantee alignment,
rather than the standard slabs that change alignment with config
options?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
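[Editor's illustration, not from the thread: the WARN_ONCE-style
validation Dave asks for could in principle live in the driver's
segment loop.  queue_dma_alignment() is the real helper; the exact
placement in brd_make_request() and the message text are assumptions:

        bio_for_each_segment(bvec, bio, iter) {
                unsigned int len = bvec.bv_len;
                int err;

                /*
                 * Hypothetical guard: fail fast on segments that
                 * violate the queue's advertised DMA alignment,
                 * instead of silently doing partial-sector copies.
                 */
                if (WARN_ONCE((bvec.bv_offset | len) &
                              queue_dma_alignment(brd->brd_queue),
                              "brd: misaligned bvec, off %u len %u\n",
                              bvec.bv_offset, len))
                        goto io_error;

                err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
                                  bio_op(bio), sector);
                if (err)
                        goto io_error;
                sector += len >> SECTOR_SHIFT;
        }

As for the "aligned heap" question: kmalloc has no such GFP flag, but
a dedicated slab cache with an explicit alignment argument is one
existing way to get the guarantee.  A minimal sketch, assuming a
hypothetical "xlog_buf" cache for the log buffers:

        /*
         * Objects from this cache are always 512-byte aligned,
         * independent of the generic kmalloc slab layout.
         */
        struct kmem_cache *xlog_buf_cache =
                kmem_cache_create("xlog_buf", size, 512, 0, NULL);

        buf = kmem_cache_alloc(xlog_buf_cache, GFP_KERNEL);

Whether the slab debug options preserve an explicit align argument in
every configuration would need verifying against the slab code.]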
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  5:53 ` 5.3-rc1 regression with XFS log recovery hch
  2019-08-20  7:44 ` Dave Chinner
@ 2019-08-20  8:13 ` Ming Lei
  2019-08-20  9:24   ` Ming Lei
  1 sibling, 1 reply; 10+ messages in thread

From: Ming Lei @ 2019-08-20  8:13 UTC
To: hch@lst.de
Cc: Dave Chinner, Verma, Vishal L, linux-xfs@vger.kernel.org,
    Williams, Dan J, darrick.wong@oracle.com, linux-block

On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > With the following debug patch.  Based on that I think I'll just
> > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > can restart the unaligned slub allocation drama..
> >
> > This still doesn't make sense to me, because the pmem and brd code
> > have no alignment limitations in their make_request code - they can
> > handle byte addressing and should not have any problem at all with
> > 8 byte aligned memory in bios.
> >
> > Digging a little further, I note that both brd and pmem use
> > identical mechanisms to marshal data in and out of bios, so they
> > are likely to have the same issue.
> >
> > So, brd_make_request() does:
> >
> >         bio_for_each_segment(bvec, bio, iter) {
> >                 unsigned int len = bvec.bv_len;
> >                 int err;
> >
> >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> >                                   bio_op(bio), sector);
> >                 if (err)
> >                         goto io_error;
> >                 sector += len >> SECTOR_SHIFT;
> >         }
> >
> > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > into individual pages, which are passed to brd_do_bvec().  An
> > unaligned 4kB io traces out as:
> >
> > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> >
> > i.e.  page              offset  len   sector
> >       00000000a77f0146  768     3328  0x7d0048
> >       000000006ceca91e  0       768   0x7d004e
> >
> > You should be able to guess what the problems are from this.

The problem should be that the offset of '768' is passed to
bio_add_page().  Presumably this is a slub buffer used for block IO;
it looks like an old, unsolved problem.

> >
> > Both pmem and brd are _sector_ based.  We've done a partial sector
> > copy on the first bvec, then the second bvec has started the copy
> > from the wrong offset into the sector we've done a partial copy
> > from.
> >
> > IOWs, no error is reported when the bvec buffer isn't sector
> > aligned, no error is reported when the length of data to copy was
> > not a multiple of sector size, and no error was reported when we
> > copied the same partial sector twice.
>
> Yes.  I think bio_for_each_segment is buggy here, as it should not
> blindly split by pages.

bio_for_each_segment() just keeps the original single-page interface
behaviour from before multi-page bvecs were introduced.

Thanks,
Ming
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  8:13 ` Ming Lei
@ 2019-08-20  9:24 ` Ming Lei
  2019-08-20 16:30   ` Verma, Vishal L
  2019-08-20 21:44   ` Dave Chinner
  0 siblings, 2 replies; 10+ messages in thread

From: Ming Lei @ 2019-08-20  9:24 UTC
To: hch@lst.de, Verma, Vishal L
Cc: Dave Chinner, linux-xfs@vger.kernel.org, Williams, Dan J,
    darrick.wong@oracle.com, linux-block

On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > > With the following debug patch.  Based on that I think I'll just
> > > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > > can restart the unaligned slub allocation drama..
> > >
> > > This still doesn't make sense to me, because the pmem and brd code
> > > have no alignment limitations in their make_request code - they can
> > > handle byte addressing and should not have any problem at all with
> > > 8 byte aligned memory in bios.
> > >
> > > Digging a little further, I note that both brd and pmem use
> > > identical mechanisms to marshal data in and out of bios, so they
> > > are likely to have the same issue.
> > >
> > > So, brd_make_request() does:
> > >
> > >         bio_for_each_segment(bvec, bio, iter) {
> > >                 unsigned int len = bvec.bv_len;
> > >                 int err;
> > >
> > >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> > >                                   bio_op(bio), sector);
> > >                 if (err)
> > >                         goto io_error;
> > >                 sector += len >> SECTOR_SHIFT;
> > >         }
> > >
> > > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > > into individual pages, which are passed to brd_do_bvec().  An
> > > unaligned 4kB io traces out as:
> > >
> > > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > >
> > > i.e.  page              offset  len   sector
> > >       00000000a77f0146  768     3328  0x7d0048
> > >       000000006ceca91e  0       768   0x7d004e
> > >
> > > You should be able to guess what the problems are from this.
>
> The problem should be that the offset of '768' is passed to
> bio_add_page().

It can be quite hard to deal with a non-512-aligned sector buffer,
since one sector buffer may cross two pages.  So far, the one
workaround I have thought of is to not merge such an IO buffer into
one bvec.

Verma, could you try the following patch?

diff --git a/block/bio.c b/block/bio.c
index 24a496f5d2e2..49deab2ac8c4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return false;
 
+	if (off & 511)
+		return false;
+
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];

Thanks,
Ming
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  9:24 ` Ming Lei
@ 2019-08-20 16:30 ` Verma, Vishal L
  0 siblings, 0 replies; 10+ messages in thread

From: Verma, Vishal L @ 2019-08-20 16:30 UTC
To: hch@lst.de, ming.lei@redhat.com
Cc: Williams, Dan J, linux-xfs@vger.kernel.org, david@fromorbit.com,
    darrick.wong@oracle.com, linux-block@vger.kernel.org

On Tue, 2019-08-20 at 17:24 +0800, Ming Lei wrote:
>
> It can be quite hard to deal with a non-512-aligned sector buffer,
> since one sector buffer may cross two pages.  So far, the one
> workaround I have thought of is to not merge such an IO buffer into
> one bvec.
>
> Verma, could you try the following patch?

Hi Ming,

I can hit the same failure with this patch.  Full thread, in case you
haven't already seen it:

https://lore.kernel.org/linux-xfs/e49a6a3a244db055995769eb844c281f93e50ab9.camel@intel.com/

>
> diff --git a/block/bio.c b/block/bio.c
> index 24a496f5d2e2..49deab2ac8c4 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
>  		return false;
>  
> +	if (off & 511)
> +		return false;
> +
>  	if (bio->bi_vcnt > 0) {
>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>
> Thanks,
> Ming
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  9:24 ` Ming Lei
  2019-08-20 16:30   ` Verma, Vishal L
@ 2019-08-20 21:44   ` Dave Chinner
  2019-08-20 22:08     ` Verma, Vishal L
  2019-08-21  1:56     ` Ming Lei
  1 sibling, 2 replies; 10+ messages in thread

From: Dave Chinner @ 2019-08-20 21:44 UTC
To: Ming Lei
Cc: hch@lst.de, Verma, Vishal L, linux-xfs@vger.kernel.org,
    Williams, Dan J, darrick.wong@oracle.com, linux-block

On Tue, Aug 20, 2019 at 05:24:25PM +0800, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote:
> > On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> > > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > > > With the following debug patch.  Based on that I think I'll just
> > > > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > > > can restart the unaligned slub allocation drama..
> > > >
> > > > This still doesn't make sense to me, because the pmem and brd code
> > > > have no alignment limitations in their make_request code - they can
> > > > handle byte addressing and should not have any problem at all with
> > > > 8 byte aligned memory in bios.
> > > >
> > > > Digging a little further, I note that both brd and pmem use
> > > > identical mechanisms to marshal data in and out of bios, so they
> > > > are likely to have the same issue.
> > > >
> > > > So, brd_make_request() does:
> > > >
> > > >         bio_for_each_segment(bvec, bio, iter) {
> > > >                 unsigned int len = bvec.bv_len;
> > > >                 int err;
> > > >
> > > >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> > > >                                   bio_op(bio), sector);
> > > >                 if (err)
> > > >                         goto io_error;
> > > >                 sector += len >> SECTOR_SHIFT;
> > > >         }
> > > >
> > > > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > > > into individual pages, which are passed to brd_do_bvec().  An
> > > > unaligned 4kB io traces out as:
> > > >
> > > > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > > > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > > >
> > > > i.e.  page              offset  len   sector
> > > >       00000000a77f0146  768     3328  0x7d0048
> > > >       000000006ceca91e  0       768   0x7d004e
> > > >
> > > > You should be able to guess what the problems are from this.
> >
> > The problem should be that the offset of '768' is passed to
> > bio_add_page().
>
> It can be quite hard to deal with a non-512-aligned sector buffer,
> since one sector buffer may cross two pages.  So far, the one
> workaround I have thought of is to not merge such an IO buffer into
> one bvec.
>
> Verma, could you try the following patch?
>
> diff --git a/block/bio.c b/block/bio.c
> index 24a496f5d2e2..49deab2ac8c4 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
>  		return false;
>  
> +	if (off & 511)
> +		return false;

What does this achieve?  It only prevents the unaligned segment from
being merged, it doesn't prevent it from being added to a new bvec.

However, the case here is that:

> > > > i.e.  page              offset  len   sector
> > > >       00000000a77f0146  768     3328  0x7d0048
> > > >       000000006ceca91e  0       768   0x7d004e

The second page added to the bvec is actually offset aligned.  Hence
the check would do nothing on the first page because the bvec array
is empty (so it goes into a new bvec anyway), and the check on the
second page would do nothing and it would merge with the first
because its offset is aligned correctly.  In both cases, the length
of the segment is not aligned, so that needs to be checked, too.

IOWs, I think the check needs to be in bio_add_page: it needs to
check both the offset and length for alignment, and it needs to grab
the alignment from queue_dma_alignment(), not use a hard-coded value
of 511.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 21:44 ` Dave Chinner
@ 2019-08-20 22:08 ` Verma, Vishal L
  2019-08-20 23:53   ` Dave Chinner
  2019-08-21  2:19   ` Ming Lei
  0 siblings, 2 replies; 10+ messages in thread

From: Verma, Vishal L @ 2019-08-20 22:08 UTC
To: david@fromorbit.com, ming.lei@redhat.com
Cc: hch@lst.de, linux-xfs@vger.kernel.org, Williams, Dan J,
    linux-block@vger.kernel.org, darrick.wong@oracle.com

On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote:
>
> However, the case here is that:
>
> > > > > i.e.  page              offset  len   sector
> > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > >       000000006ceca91e  0       768   0x7d004e
>
> The second page added to the bvec is actually offset aligned.  Hence
> the check would do nothing on the first page because the bvec array
> is empty (so it goes into a new bvec anyway), and the check on the
> second page would do nothing and it would merge with the first
> because its offset is aligned correctly.  In both cases, the length
> of the segment is not aligned, so that needs to be checked, too.
>
> IOWs, I think the check needs to be in bio_add_page: it needs to
> check both the offset and length for alignment, and it needs to grab
> the alignment from queue_dma_alignment(), not use a hard-coded value
> of 511.

So something like this?

diff --git a/block/bio.c b/block/bio.c
index 299a0e7651ec..80f449d23e5a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
 		 unsigned int len, unsigned int offset)
 {
+	struct request_queue *q = bio->bi_disk->queue;
 	bool same_page = false;
 
+	if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q))
+		return 0;
+
 	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
 		if (bio_full(bio, len))
 			return 0;

I tried this, but the 'mount' just hangs - which looks like it might
be due to xfs_rw_bdev() doing:

	while (bio_add_page(bio, page, len, off) != len) {
	...
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 22:08 ` Verma, Vishal L
@ 2019-08-20 23:53 ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread

From: Dave Chinner @ 2019-08-20 23:53 UTC
To: Verma, Vishal L
Cc: ming.lei@redhat.com, hch@lst.de, linux-xfs@vger.kernel.org,
    Williams, Dan J, linux-block@vger.kernel.org,
    darrick.wong@oracle.com

On Tue, Aug 20, 2019 at 10:08:38PM +0000, Verma, Vishal L wrote:
> On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote:
> >
> > However, the case here is that:
> >
> > > > > > i.e.  page              offset  len   sector
> > > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > > >       000000006ceca91e  0       768   0x7d004e
> >
> > The second page added to the bvec is actually offset aligned.  Hence
> > the check would do nothing on the first page because the bvec array
> > is empty (so it goes into a new bvec anyway), and the check on the
> > second page would do nothing and it would merge with the first
> > because its offset is aligned correctly.  In both cases, the length
> > of the segment is not aligned, so that needs to be checked, too.
> >
> > IOWs, I think the check needs to be in bio_add_page: it needs to
> > check both the offset and length for alignment, and it needs to grab
> > the alignment from queue_dma_alignment(), not use a hard-coded value
> > of 511.
>
> So something like this?
>
> diff --git a/block/bio.c b/block/bio.c
> index 299a0e7651ec..80f449d23e5a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
>  int bio_add_page(struct bio *bio, struct page *page,
>  		 unsigned int len, unsigned int offset)
>  {
> +	struct request_queue *q = bio->bi_disk->queue;
>  	bool same_page = false;
>  
> +	if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q))
> +		return 0;
> +
>  	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
>  		if (bio_full(bio, len))
>  			return 0;
>
> I tried this, but the 'mount' just hangs - which looks like it might
> be due to xfs_rw_bdev() doing:
>
> 	while (bio_add_page(bio, page, len, off) != len) {

That's the return of zero that causes the loop to make no progress.
i.e. a return of 0 means "won't fit in bio, allocate a new bio and
try again".  It's not an error return, so always returning zero will
eventually chew up all your memory allocating bios it doesn't use,
because submit_bio() doesn't return errors on chained bios until the
final bio in the chain is completed.

Add a bio_add_page_checked() function that does exactly the same
thing as bio_add_page(), but add the

	if (WARN_ON_ONCE((offset | len) & queue_dma_alignment(q)))
		return -EIO;

to it, and change the xfs code to:

	while ((ret = bio_add_page_checked(bio, page, len, off)) != len) {
		if (ret < 0) {
			/*
			 * Submit the bio to wait on the rest of the
			 * chain to complete, then return an error.
			 * This is a really shitty failure on write,
			 * as we will have just done a partial write
			 * and effectively corrupted something on
			 * disk.
			 */
			submit_bio_wait(bio);
			return ret;
		}
		....
	}

We probably should change all the XFS calls to bio_add_page() to
bio_add_page_checked() while we are at it, because we have the same
alignment problem through xfs_buf.c and, potentially, on iclogs via
xfs_log.c, as iclogs are allocated with kmem_alloc_large(), too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
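[Editor's sketch, not from the thread: a self-contained version of
the helper Dave describes might look like the following.
bio_add_page_checked() is hypothetical, not an existing kernel API,
and, as Ming points out in the next message, bio->bi_disk may not yet
be set when pages are added, so the queue lookup is an assumption:

	/*
	 * Checked variant of bio_add_page(): reject segments that
	 * violate the queue's DMA alignment instead of letting them
	 * silently corrupt sector-based copies later.
	 */
	int bio_add_page_checked(struct bio *bio, struct page *page,
				 unsigned int len, unsigned int offset)
	{
		struct request_queue *q = bio->bi_disk->queue;

		if (WARN_ON_ONCE((offset | len) & queue_dma_alignment(q)))
			return -EIO;

		return bio_add_page(bio, page, len, offset);
	}

A caller can then distinguish "doesn't fit, allocate a new bio"
(return 0) from "misaligned buffer" (negative errno), which is what
the loop Dave sketches above relies on.]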
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 22:08 ` Verma, Vishal L
  2019-08-20 23:53   ` Dave Chinner
@ 2019-08-21  2:19   ` Ming Lei
  1 sibling, 0 replies; 10+ messages in thread

From: Ming Lei @ 2019-08-21  2:19 UTC
To: Verma, Vishal L
Cc: david@fromorbit.com, hch@lst.de, linux-xfs@vger.kernel.org,
    Williams, Dan J, linux-block@vger.kernel.org,
    darrick.wong@oracle.com

On Tue, Aug 20, 2019 at 10:08:38PM +0000, Verma, Vishal L wrote:
> On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote:
> >
> > However, the case here is that:
> >
> > > > > > i.e.  page              offset  len   sector
> > > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > > >       000000006ceca91e  0       768   0x7d004e
> >
> > The second page added to the bvec is actually offset aligned.  Hence
> > the check would do nothing on the first page because the bvec array
> > is empty (so it goes into a new bvec anyway), and the check on the
> > second page would do nothing and it would merge with the first
> > because its offset is aligned correctly.  In both cases, the length
> > of the segment is not aligned, so that needs to be checked, too.
> >
> > IOWs, I think the check needs to be in bio_add_page: it needs to
> > check both the offset and length for alignment, and it needs to grab
> > the alignment from queue_dma_alignment(), not use a hard-coded value
> > of 511.
>
> So something like this?
>
> diff --git a/block/bio.c b/block/bio.c
> index 299a0e7651ec..80f449d23e5a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
>  int bio_add_page(struct bio *bio, struct page *page,
>  		 unsigned int len, unsigned int offset)
>  {
> +	struct request_queue *q = bio->bi_disk->queue;
>  	bool same_page = false;
>  
> +	if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q))
> +		return 0;
> +
>  	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
>  		if (bio_full(bio, len))
>  			return 0;

bio->bi_disk or bio->bi_disk->queue may not be available in
bio_add_page().

And 'len' has to be aligned to the logical block size of the disk the
fs is mounted on, and that information is obviously available to the
fs.  So the following code is really buggy:

xfs_rw_bdev():

+	struct page	*page = kmem_to_page(data);
+	unsigned int	off = offset_in_page(data);
+	unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
+
+	while (bio_add_page(bio, page, len, off) != len) {

Thanks,
Ming
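[Editor's sketch, not from the thread: the caller-side guard Ming
implies, using the logical block size the filesystem already knows.
bdev_logical_block_size() is a real helper; placing the check at the
top of xfs_rw_bdev() is an assumption:

	/*
	 * The fs knows the device's logical block size, so it can
	 * validate its own buffer before building any bios.
	 */
	unsigned int align = bdev_logical_block_size(bdev) - 1;

	if (WARN_ON_ONCE(((unsigned long)data | count) & align))
		return -EIO;
]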
* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 21:44 ` Dave Chinner
  2019-08-20 22:08   ` Verma, Vishal L
@ 2019-08-21  1:56   ` Ming Lei
  1 sibling, 0 replies; 10+ messages in thread

From: Ming Lei @ 2019-08-21  1:56 UTC
To: Dave Chinner
Cc: hch@lst.de, Verma, Vishal L, linux-xfs@vger.kernel.org,
    Williams, Dan J, darrick.wong@oracle.com, linux-block

On Wed, Aug 21, 2019 at 07:44:09AM +1000, Dave Chinner wrote:
> On Tue, Aug 20, 2019 at 05:24:25PM +0800, Ming Lei wrote:
> > On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote:
> > > On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> > > > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > > > > With the following debug patch.  Based on that I think I'll just
> > > > > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > > > > can restart the unaligned slub allocation drama..
> > > > >
> > > > > This still doesn't make sense to me, because the pmem and brd code
> > > > > have no alignment limitations in their make_request code - they can
> > > > > handle byte addressing and should not have any problem at all with
> > > > > 8 byte aligned memory in bios.
> > > > >
> > > > > Digging a little further, I note that both brd and pmem use
> > > > > identical mechanisms to marshal data in and out of bios, so they
> > > > > are likely to have the same issue.
> > > > >
> > > > > So, brd_make_request() does:
> > > > >
> > > > >         bio_for_each_segment(bvec, bio, iter) {
> > > > >                 unsigned int len = bvec.bv_len;
> > > > >                 int err;
> > > > >
> > > > >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> > > > >                                   bio_op(bio), sector);
> > > > >                 if (err)
> > > > >                         goto io_error;
> > > > >                 sector += len >> SECTOR_SHIFT;
> > > > >         }
> > > > >
> > > > > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > > > > into individual pages, which are passed to brd_do_bvec().  An
> > > > > unaligned 4kB io traces out as:
> > > > >
> > > > > [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > > > > [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > > > >
> > > > > i.e.  page              offset  len   sector
> > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > >       000000006ceca91e  0       768   0x7d004e
> > > > >
> > > > > You should be able to guess what the problems are from this.
> > >
> > > The problem should be that the offset of '768' is passed to
> > > bio_add_page().
> >
> > It can be quite hard to deal with a non-512-aligned sector buffer,
> > since one sector buffer may cross two pages.  So far, the one
> > workaround I have thought of is to not merge such an IO buffer into
> > one bvec.
> >
> > Verma, could you try the following patch?
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index 24a496f5d2e2..49deab2ac8c4 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
> >  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
> >  		return false;
> >  
> > +	if (off & 511)
> > +		return false;
>
> What does this achieve?  It only prevents the unaligned segment from
> being merged, it doesn't prevent it from being added to a new bvec.

The current issue is that the block layer won't handle such a case.
As I mentioned, one sector buffer may cross two pages, so it can't be
split successfully; we simply put it into one new bvec, just as
before multi-page bvecs were enabled.

> However, the case here is that:
>
> > > > > i.e.  page              offset  len   sector
> > > > >       00000000a77f0146  768     3328  0x7d0048
> > > > >       000000006ceca91e  0       768   0x7d004e
>
> The second page added to the bvec is actually offset aligned.  Hence
> the check would do nothing on the first page because the bvec array
> is empty (so it goes into a new bvec anyway), and the check on the
> second page would do nothing and it would merge with the first
> because its offset is aligned correctly.  In both cases, the length
> of the segment is not aligned, so that needs to be checked, too.

What the patch changes is the bvec stored in the bio; the bvec above
is actually built in flight.  So if the 1st page added to the bio is
(768, 512), then bio_for_each_segment() will finally generate the 1st
page as (768, 512), and everything will be fine.

> IOWs, I think the check needs to be in bio_add_page: it needs to
> check both the offset and length for alignment, and it needs to grab

The length has to be 512-byte aligned, otherwise it is simply a bug
in the fs.

> the alignment from queue_dma_alignment(), not use a hard-coded value
> of 511.

The current policy for bio_add_page() is to not check any queue
limit, given we don't know the actual limit for the final IO; the
queue may not even be set when bio_add_page() is called.

If the upper layer doesn't pass in a slub buffer that is larger than
PAGE_SIZE, the block layer may handle it well even without the
512-byte alignment check.

Thanks,
Ming
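[Editor's note, for context: the "vmalloc switch" mentioned at the
top of the thread sidesteps the whole problem, because vmalloc memory
is page-aligned by construction.  A minimal sketch of that style of
allocation (an illustration, not XFS's exact code):

	/*
	 * vmalloc memory starts on a page boundary, so bio segments
	 * built from it can never be sub-sector aligned.
	 */
	buf = vmalloc(size);
	if (!buf)
		return -ENOMEM;
	...
	vfree(buf);
]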
Thread overview: 10+ messages (newest: 2019-08-21  2:19 UTC)
[not found] <20190818071128.GA17286@lst.de>
[not found] ` <20190818074140.GA18648@lst.de>
  [not found]   ` <20190818173426.GA32311@lst.de>
    [not found]     ` <20190819000831.GX6129@dread.disaster.area>
      [not found]       ` <20190819034948.GA14261@lst.de>
        [not found]         ` <20190819041132.GA14492@lst.de>
          [not found]           ` <20190819042259.GZ6129@dread.disaster.area>
            [not found]             ` <20190819042905.GA15613@lst.de>
              [not found]               ` <20190819044012.GA15800@lst.de>
                [not found]                 ` <20190820044135.GC1119@dread.disaster.area>
2019-08-20  5:53                           ` 5.3-rc1 regression with XFS log recovery hch
2019-08-20  7:44                             ` Dave Chinner
2019-08-20  8:13                             ` Ming Lei
2019-08-20  9:24                               ` Ming Lei
2019-08-20 16:30                                 ` Verma, Vishal L
2019-08-20 21:44                                 ` Dave Chinner
2019-08-20 22:08                                   ` Verma, Vishal L
2019-08-20 23:53                                     ` Dave Chinner
2019-08-21  2:19                                     ` Ming Lei
2019-08-21  1:56                                   ` Ming Lei