[bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max

Linux block layer
 help / color / mirror / Atom feed

* [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
@ 2022-10-28 18:02 Al Viro
  2022-10-28 18:51 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Al Viro @ 2022-10-28 18:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block

	We have this:

static inline bool blk_queue_may_bounce(struct request_queue *q)
{
        return IS_ENABLED(CONFIG_BOUNCE) &&
                q->limits.bounce == BLK_BOUNCE_HIGH &&
                max_low_pfn >= max_pfn;
}

static inline struct bio *blk_queue_bounce(struct bio *bio,
                struct request_queue *q)
{
        if (unlikely(blk_queue_may_bounce(q) && bio_has_data(bio)))
                return __blk_queue_bounce(bio, q);
        return bio;
}

Now, the last term in expression in blk_queue_may_bounce() is
true only on the boxen where max_pfn is no greater than max_low_pfn.

Unless I'm very confused, that's the boxen where we don't *have*
any highmem pages.

What's more, consider this:

static __init int init_emergency_pool(void)
{
        int ret;

#ifndef CONFIG_MEMORY_HOTPLUG
        if (max_pfn <= max_low_pfn)
                return 0;
#endif

        ret = mempool_init_page_pool(&page_pool, POOL_SIZE, 0);
        BUG_ON(ret);
        pr_info("pool size: %d pages\n", POOL_SIZE);

        init_bounce_bioset();
        return 0;
}

On the same boxen (assuming we've not hotplug) page_pool won't be set up
at all, so we wouldn't be able to bounce any highmem page if we ever
ran into one.

AFAICS, this condition is backwards - it should be

static inline bool blk_queue_may_bounce(struct request_queue *q)
{
        return IS_ENABLED(CONFIG_BOUNCE) &&
                q->limits.bounce == BLK_BOUNCE_HIGH &&
                max_low_pfn < max_pfn;
}

What am I missing here?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
  2022-10-28 18:02 [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way Al Viro
@ 2022-10-28 18:51 ` Jens Axboe
  2022-10-28 19:21   ` Al Viro
  2022-10-28 19:14 ` Al Viro
  2022-10-30  7:45 ` Christoph Hellwig
  2 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2022-10-28 18:51 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig; +Cc: linux-block

On 10/28/22 12:02 PM, Al Viro wrote:
> 	We have this:
> 
> static inline bool blk_queue_may_bounce(struct request_queue *q)
> {
>         return IS_ENABLED(CONFIG_BOUNCE) &&
>                 q->limits.bounce == BLK_BOUNCE_HIGH &&
>                 max_low_pfn >= max_pfn;
> }
> 
> static inline struct bio *blk_queue_bounce(struct bio *bio,
>                 struct request_queue *q)
> {
>         if (unlikely(blk_queue_may_bounce(q) && bio_has_data(bio)))
>                 return __blk_queue_bounce(bio, q);
>         return bio;
> }
> 
> Now, the last term in expression in blk_queue_may_bounce() is
> true only on the boxen where max_pfn is no greater than max_low_pfn.
> 
> Unless I'm very confused, that's the boxen where we don't *have*
> any highmem pages.
> 
> What's more, consider this:
> 
> static __init int init_emergency_pool(void)
> {
>         int ret;
> 
> #ifndef CONFIG_MEMORY_HOTPLUG
>         if (max_pfn <= max_low_pfn)
>                 return 0;
> #endif
> 
>         ret = mempool_init_page_pool(&page_pool, POOL_SIZE, 0);
>         BUG_ON(ret);
>         pr_info("pool size: %d pages\n", POOL_SIZE);
> 
>         init_bounce_bioset();
>         return 0;
> }
> 
> On the same boxen (assuming we've not hotplug) page_pool won't be set up
> at all, so we wouldn't be able to bounce any highmem page if we ever
> ran into one.
> 
> AFAICS, this condition is backwards - it should be
> 
> static inline bool blk_queue_may_bounce(struct request_queue *q)
> {
>         return IS_ENABLED(CONFIG_BOUNCE) &&
>                 q->limits.bounce == BLK_BOUNCE_HIGH &&
>                 max_low_pfn < max_pfn;
> }
> 
> What am I missing here?

I don't think you're missing anything, the case we need it for is when
max_pfn > max_low_pfn. I wonder when this got botched? Because some
of those statements date back probably about 20 years when we started
allowing highmem pages to do IO.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
  2022-10-28 18:02 [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way Al Viro
  2022-10-28 18:51 ` Jens Axboe
@ 2022-10-28 19:14 ` Al Viro
  2022-10-30  7:45 ` Christoph Hellwig
  2 siblings, 0 replies; 10+ messages in thread
From: Al Viro @ 2022-10-28 19:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block

On Fri, Oct 28, 2022 at 07:02:54PM +0100, Al Viro wrote:

> AFAICS, this condition is backwards - it should be
> 
> static inline bool blk_queue_may_bounce(struct request_queue *q)
> {
>         return IS_ENABLED(CONFIG_BOUNCE) &&
>                 q->limits.bounce == BLK_BOUNCE_HIGH &&
>                 max_low_pfn < max_pfn;
> }
> 
> What am I missing here?

More fun in that area:
	/*
	 * Bvec table can't be updated by bio_for_each_segment_all(),
	 * so retrieve bvec from the table directly. This way is safe
	 * because the 'bio' is single-page bvec.
	 */
	for (i = 0, to = bio->bi_io_vec; i < bio->bi_vcnt; to++, i++) {
		struct page *bounce_page;

		if (!PageHighMem(to->bv_page))
			continue;

		bounce_page = mempool_alloc(&page_pool, GFP_NOIO);
		inc_zone_page_state(bounce_page, NR_BOUNCE);

		if (rw == WRITE) {
			flush_dcache_page(to->bv_page);
			memcpy_from_bvec(page_address(bounce_page), to);
					 ^^^^^^^^^^^^^^^^^^^^^^^^^
		}
		to->bv_page = bounce_page;
	}

Consider the case when highmem page comes in bio_vec that covers the
second half of it.  We
	* allocate a bounce page
	* copy the second half of old page into the first half of new one
	* point the bio_vec to the second half of the new page.
	* submit the mangled bio.

While we are at it, the logics above that re splitting the bio before
bothering with bounces also looks somewhat fishy; if it triggers (which
needs > 256 elements in the original vector) we get bio split and
parts chained, then, assuming we run into a highmem page in each
half, we end up with

	bounce bio 1:	has ->bi_private pointing to bio 1, ->bi_end_io
			bounce_end_io_{read,write}().  Queued.
	bounce bio 2:	has ->bi_private pointing to bio 2, ->bi_end_io
			bounce_end_io_{read,write}().  Queued.
	bio 1:		original, covers the tail of original range.
	bio 2:		covers the beginning of original range,
			->bi_private points to bio 1, ->bi_end_io is
			bio_chain_endio().

Suppose the IO on bounce bio 2 fails.  We get ->bi_status of that sucker
set to non-zero.  ->bi_end_io() is called, leading to bounce_end_io(),
which will copy that status to bio 2 and call bio_endio() on bio 2.
Which will check ->bi_status on bio 1, see it zero and propagate the
error to bio 1.  Now bounce bio 1 completes without an error.
We still have zero in its ->bi_status and the call of bounce_end_io()
hits
        bio_orig->bi_status = bio->bi_status;
        bio_endio(bio_orig);
copying that zero to bio 1.  Oops - we'd just lost the error reported
by the IO on the other half...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
  2022-10-28 18:51 ` Jens Axboe
@ 2022-10-28 19:21   ` Al Viro
  2022-10-29 16:45     ` Al Viro
  2022-10-30  7:48     ` Christoph Hellwig
  0 siblings, 2 replies; 10+ messages in thread
From: Al Viro @ 2022-10-28 19:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-block

On Fri, Oct 28, 2022 at 12:51:00PM -0600, Jens Axboe wrote:
> On 10/28/22 12:02 PM, Al Viro wrote:
> > 	We have this:
> > 
> > static inline bool blk_queue_may_bounce(struct request_queue *q)
> > {
> >         return IS_ENABLED(CONFIG_BOUNCE) &&
> >                 q->limits.bounce == BLK_BOUNCE_HIGH &&
> >                 max_low_pfn >= max_pfn;
> > }
> > 
> > static inline struct bio *blk_queue_bounce(struct bio *bio,
> >                 struct request_queue *q)
> > {
> >         if (unlikely(blk_queue_may_bounce(q) && bio_has_data(bio)))
> >                 return __blk_queue_bounce(bio, q);
> >         return bio;
> > }
> > 
> > Now, the last term in expression in blk_queue_may_bounce() is
> > true only on the boxen where max_pfn is no greater than max_low_pfn.
> > 
> > Unless I'm very confused, that's the boxen where we don't *have*
> > any highmem pages.
> > 
> > What's more, consider this:
> > 
> > static __init int init_emergency_pool(void)
> > {
> >         int ret;
> > 
> > #ifndef CONFIG_MEMORY_HOTPLUG
> >         if (max_pfn <= max_low_pfn)
> >                 return 0;
> > #endif
> > 
> >         ret = mempool_init_page_pool(&page_pool, POOL_SIZE, 0);
> >         BUG_ON(ret);
> >         pr_info("pool size: %d pages\n", POOL_SIZE);
> > 
> >         init_bounce_bioset();
> >         return 0;
> > }
> > 
> > On the same boxen (assuming we've not hotplug) page_pool won't be set up
> > at all, so we wouldn't be able to bounce any highmem page if we ever
> > ran into one.
> > 
> > AFAICS, this condition is backwards - it should be
> > 
> > static inline bool blk_queue_may_bounce(struct request_queue *q)
> > {
> >         return IS_ENABLED(CONFIG_BOUNCE) &&
> >                 q->limits.bounce == BLK_BOUNCE_HIGH &&
> >                 max_low_pfn < max_pfn;
> > }
> > 
> > What am I missing here?
> 
> I don't think you're missing anything, the case we need it for is when
> max_pfn > max_low_pfn. I wonder when this got botched? Because some
> of those statements date back probably about 20 years when we started
> allowing highmem pages to do IO.

9bb33f24abbd "block: refactor the bounce buffering code" about a year ago.

But fixing the test alone is not going to be enough - just a quick look
through __blk_queue_bounce() catches an obvious bug on write (we copy
the part of highmem page we covered by bio into the beginning of the
bounce page - and leave ->bv_offset as-is) as well as a possible bug in
->bi_status handling (if we really can run into bio_split() there).

I wonder if we ought to simply add "depends on BROKEN" for CONFIG_BOUNCE... ;-/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
  2022-10-28 19:21   ` Al Viro
@ 2022-10-29 16:45     ` Al Viro
  2022-10-30  7:50       ` Christoph Hellwig
  2022-10-30  7:48     ` Christoph Hellwig
  1 sibling, 1 reply; 10+ messages in thread
From: Al Viro @ 2022-10-29 16:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-block

On Fri, Oct 28, 2022 at 08:21:55PM +0100, Al Viro wrote:
> On Fri, Oct 28, 2022 at 12:51:00PM -0600, Jens Axboe wrote:
> > On 10/28/22 12:02 PM, Al Viro wrote:
> > > 	We have this:
> > > 
> > > static inline bool blk_queue_may_bounce(struct request_queue *q)
> > > {
> > >         return IS_ENABLED(CONFIG_BOUNCE) &&
> > >                 q->limits.bounce == BLK_BOUNCE_HIGH &&
> > >                 max_low_pfn >= max_pfn;
> > > }
> > > 
> > > static inline struct bio *blk_queue_bounce(struct bio *bio,
> > >                 struct request_queue *q)
> > > {
> > >         if (unlikely(blk_queue_may_bounce(q) && bio_has_data(bio)))
> > >                 return __blk_queue_bounce(bio, q);
> > >         return bio;
> > > }
> > > 
> > > Now, the last term in expression in blk_queue_may_bounce() is
> > > true only on the boxen where max_pfn is no greater than max_low_pfn.
> > > 
> > > Unless I'm very confused, that's the boxen where we don't *have*
> > > any highmem pages.
> > > 
> > > What's more, consider this:
> > > 
> > > static __init int init_emergency_pool(void)
> > > {
> > >         int ret;
> > > 
> > > #ifndef CONFIG_MEMORY_HOTPLUG
> > >         if (max_pfn <= max_low_pfn)
> > >                 return 0;
> > > #endif
> > > 
> > >         ret = mempool_init_page_pool(&page_pool, POOL_SIZE, 0);
> > >         BUG_ON(ret);
> > >         pr_info("pool size: %d pages\n", POOL_SIZE);
> > > 
> > >         init_bounce_bioset();
> > >         return 0;
> > > }
> > > 
> > > On the same boxen (assuming we've not hotplug) page_pool won't be set up
> > > at all, so we wouldn't be able to bounce any highmem page if we ever
> > > ran into one.
> > > 
> > > AFAICS, this condition is backwards - it should be
> > > 
> > > static inline bool blk_queue_may_bounce(struct request_queue *q)
> > > {
> > >         return IS_ENABLED(CONFIG_BOUNCE) &&
> > >                 q->limits.bounce == BLK_BOUNCE_HIGH &&
> > >                 max_low_pfn < max_pfn;
> > > }
> > > 
> > > What am I missing here?
> > 
> > I don't think you're missing anything, the case we need it for is when
> > max_pfn > max_low_pfn. I wonder when this got botched? Because some
> > of those statements date back probably about 20 years when we started
> > allowing highmem pages to do IO.
> 
> 9bb33f24abbd "block: refactor the bounce buffering code" about a year ago.
> 
> But fixing the test alone is not going to be enough - just a quick look
> through __blk_queue_bounce() catches an obvious bug on write (we copy
> the part of highmem page we covered by bio into the beginning of the
> bounce page - and leave ->bv_offset as-is) as well as a possible bug in
> ->bi_status handling (if we really can run into bio_split() there).
> 
> I wonder if we ought to simply add "depends on BROKEN" for CONFIG_BOUNCE... ;-/

BTW, ->bi_status propagation looks brittle in general.  Are there any
places other than bio_init() where we would want to store 0 in ->bi_status
of anything?

Look at e.g. null_blk; AFAICS, we store to ->bi_status on any request
completion (in NULL_Q_BIO case, at least).  What happens if request
gets split and split-off part finishes first with an error?  AFAICS,
its ->bi_status will be copied to parent (original bio, the one that
covers the tail).  Now the IO on the original bio is over as well
and we hit drivers/block/null_blk/main.c:end_cmd().  Suppose this
part succeeds; won't we end up overwriting ->bi_status with zero
and assuming that the entire thing had succeeded, despite the
(now lost) error on the split-off part?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
  2022-10-28 18:02 [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way Al Viro
  2022-10-28 18:51 ` Jens Axboe
  2022-10-28 19:14 ` Al Viro
@ 2022-10-30  7:45 ` Christoph Hellwig
  2 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2022-10-30  7:45 UTC (permalink / raw)
  To: Al Viro; +Cc: Christoph Hellwig, linux-block

On Fri, Oct 28, 2022 at 07:02:54PM +0100, Al Viro wrote:
> Now, the last term in expression in blk_queue_may_bounce() is
> true only on the boxen where max_pfn is no greater than max_low_pfn.
> 
> Unless I'm very confused, that's the boxen where we don't *have*
> any highmem pages.

Yes, this go mixed up.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
  2022-10-28 19:21   ` Al Viro
  2022-10-29 16:45     ` Al Viro
@ 2022-10-30  7:48     ` Christoph Hellwig
  1 sibling, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2022-10-30  7:48 UTC (permalink / raw)
  To: Al Viro; +Cc: Jens Axboe, Christoph Hellwig, linux-block

On Fri, Oct 28, 2022 at 08:21:54PM +0100, Al Viro wrote:
> I wonder if we ought to simply add "depends on BROKEN" for CONFIG_BOUNCE... ;-/

Depends on BROKEN does not help anyone, we might better remove it.

Users are:

  paride drivers
  drivers/mmc/ for a few drivers
  3 SCSI LLDDs (two of those parport attached, one ISA)
  usb-storage

The thing that most matters is usb-storage that can't be used on
highmem systems, and maybe one or two of the mmc drivers might exist
on arm highmem systems.

In general nothing should use these, for MMIO/PIO based drivers they
should kmap, and for DMA drivers nothing should use the kernel virtual
address, except that the USB interfaces are a mess that seems to
partially require one.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
  2022-10-29 16:45     ` Al Viro
@ 2022-10-30  7:50       ` Christoph Hellwig
  2022-10-30 18:47         ` Al Viro
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2022-10-30  7:50 UTC (permalink / raw)
  To: Al Viro; +Cc: Jens Axboe, Christoph Hellwig, linux-block

On Sat, Oct 29, 2022 at 05:45:45PM +0100, Al Viro wrote:
> completion (in NULL_Q_BIO case, at least).  What happens if request
> gets split and split-off part finishes first with an error?  AFAICS,
> its ->bi_status will be copied to parent (original bio, the one that
> covers the tail).  Now the IO on the original bio is over as well
> and we hit drivers/block/null_blk/main.c:end_cmd().  Suppose this
> part succeeds; won't we end up overwriting ->bi_status with zero
> and assuming that the entire thing had succeeded, despite the
> (now lost) error on the split-off part?

As a rule of thumb drives should never set bi_status to 0, so null_blk
here has a bug.  What is missing everywhere is proper memory barriers,
though.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
  2022-10-30  7:50       ` Christoph Hellwig
@ 2022-10-30 18:47         ` Al Viro
  2022-11-01 11:11           ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Al Viro @ 2022-10-30 18:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block

On Sun, Oct 30, 2022 at 08:50:32AM +0100, Christoph Hellwig wrote:
> On Sat, Oct 29, 2022 at 05:45:45PM +0100, Al Viro wrote:
> > completion (in NULL_Q_BIO case, at least).  What happens if request
> > gets split and split-off part finishes first with an error?  AFAICS,
> > its ->bi_status will be copied to parent (original bio, the one that
> > covers the tail).  Now the IO on the original bio is over as well
> > and we hit drivers/block/null_blk/main.c:end_cmd().  Suppose this
> > part succeeds; won't we end up overwriting ->bi_status with zero
> > and assuming that the entire thing had succeeded, despite the
> > (now lost) error on the split-off part?
> 
> As a rule of thumb drives should never set bi_status to 0, so null_blk
> here has a bug.  What is missing everywhere is proper memory barriers,
> though.

Something like

static inline void set_bio_status(struct bio *bio, blk_status_t status)
{
	if (unlikely(status))
		cmpxchg(&bio->bi_status, 0, status);
}

with e.g.
        if (bio->bi_status && !dio->bio.bi_status)
		dio->bio.bi_status = bio->bi_status;
in blkdev_bio_end_io() replaced with
	set_bio_status(&dio->bio, bio->bi_status);

perhaps?  That would probably do for almost all users, but...  what about
e.g. drivers/md/raid1.c:fix_sync_read_error()?  Looks like it really
intends non-zero -> zero change; I'm not familiar enough with the guts
of that sucker to tell if it is guaranteed to get no propagation from
another bio...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way
  2022-10-30 18:47         ` Al Viro
@ 2022-11-01 11:11           ` Christoph Hellwig
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2022-11-01 11:11 UTC (permalink / raw)
  To: Al Viro; +Cc: Christoph Hellwig, Jens Axboe, linux-block

On Sun, Oct 30, 2022 at 06:47:33PM +0000, Al Viro wrote:
> static inline void set_bio_status(struct bio *bio, blk_status_t status)
> {
> 	if (unlikely(status))
> 		cmpxchg(&bio->bi_status, 0, status);
> }
> 
> with e.g.
>         if (bio->bi_status && !dio->bio.bi_status)
> 		dio->bio.bi_status = bio->bi_status;
> in blkdev_bio_end_io() replaced with
> 	set_bio_status(&dio->bio, bio->bi_status);
> 
> perhaps?

I think so, yes.

> That would probably do for almost all users, but...  what about
> e.g. drivers/md/raid1.c:fix_sync_read_error()?  Looks like it really
> intends non-zero -> zero change; I'm not familiar enough with the guts
> of that sucker to tell if it is guaranteed to get no propagation from
> another bio...

I think the clearing to zero here is intentional as the bio failed,
but it manages to get the data from the other leg of the mirror.  But
the md code is really hard to follow, and any change to this code
needs careful review from the maintainer and the linux-raid list.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-11-01 11:11 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-28 18:02 [bug?] blk_queue_may_bounce() has the comparison max_low_pfn and max_pfn wrong way Al Viro
2022-10-28 18:51 ` Jens Axboe
2022-10-28 19:21   ` Al Viro
2022-10-29 16:45     ` Al Viro
2022-10-30  7:50       ` Christoph Hellwig
2022-10-30 18:47         ` Al Viro
2022-11-01 11:11           ` Christoph Hellwig
2022-10-30  7:48     ` Christoph Hellwig
2022-10-28 19:14 ` Al Viro
2022-10-30  7:45 ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox