* [BUG] Regression introduced with "block: split bios to max possible length"
@ 2016-01-21 14:57 Stefan Haberland
2016-01-21 21:34 ` Jens Axboe
0 siblings, 1 reply; 11+ messages in thread
From: Stefan Haberland @ 2016-01-21 14:57 UTC (permalink / raw)
To: keith.busch, axboe, torvalds; +Cc: linux-kernel, linux-s390, Sebastian Ott
Hi,
unfortunately commit e36f62042880 "block: split bios to maxpossible length"
breaks the DASD driver on s390. We expect the block requests to be multiple
of 4k in size. With the patch applied I see the requests split up in
multiple
of 512 byte and therefore the requests get rejected and lots of I/Os fail.
Could you please fix or revert the patch?
We have following values set for the block queue:
hw_sector_size 4096
logical_block_size 4096
minimum_io_size 4096
physical_block_size 4096
max_hw_sectors_kb 760
max_sectors_kb 760
max_segment_size 4096
max_segments 65535
Regards,
Stefan
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-21 14:57 [BUG] Regression introduced with "block: split bios to max possible length" Stefan Haberland
@ 2016-01-21 21:34 ` Jens Axboe
2016-01-21 22:51 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2016-01-21 21:34 UTC (permalink / raw)
To: Stefan Haberland, keith.busch, torvalds
Cc: linux-kernel, linux-s390, Sebastian Ott
On 01/21/2016 07:57 AM, Stefan Haberland wrote:
> Hi,
>
> unfortunately commit e36f62042880 "block: split bios to maxpossible length"
> breaks the DASD driver on s390. We expect the block requests to be
> multiple
> of 4k in size. With the patch applied I see the requests split up in
> multiple
> of 512 byte and therefore the requests get rejected and lots of I/Os fail.
Sigh, that's definitely a bug. I'll take a look at it and see if I can
get a tested fix in for 4.5-rc1. If not, we'll revert it, again.
--
Jens Axboe
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-21 21:34 ` Jens Axboe
@ 2016-01-21 22:51 ` Keith Busch
2016-01-22 1:12 ` Linus Torvalds
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2016-01-21 22:51 UTC (permalink / raw)
To: Jens Axboe
Cc: Stefan Haberland, torvalds, linux-kernel, linux-s390,
Sebastian Ott
On Thu, Jan 21, 2016 at 02:34:28PM -0700, Jens Axboe wrote:
> On 01/21/2016 07:57 AM, Stefan Haberland wrote:
> >Hi,
> >
> >unfortunately commit e36f62042880 "block: split bios to maxpossible length"
> >breaks the DASD driver on s390. We expect the block requests to be
> >multiple
> >of 4k in size. With the patch applied I see the requests split up in
> >multiple
> >of 512 byte and therefore the requests get rejected and lots of I/Os fail.
>
> Sigh, that's definitely a bug. I'll take a look at it and see if I
> can get a tested fix in for 4.5-rc1. If not, we'll revert it, again.
My apologies for the trouble. I trust it really is broken, but I don't
quite see how. The patch supposedly splits the transfer to the max size
the request queue says it allows. How does the max allowed size end up
an invalid multiple?
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-21 22:51 ` Keith Busch
@ 2016-01-22 1:12 ` Linus Torvalds
2016-01-22 3:21 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2016-01-22 1:12 UTC (permalink / raw)
To: Keith Busch
Cc: Jens Axboe, Stefan Haberland, Linux Kernel Mailing List,
linux-s390, Sebastian Ott
On Thu, Jan 21, 2016 at 2:51 PM, Keith Busch <keith.busch@intel.com> wrote:
>
> My apologies for the trouble. I trust it really is broken, but I don't
> quite see how. The patch supposedly splits the transfer to the max size
> the request queue says it allows. How does the max allowed size end up
> an invalid multiple?
I assume that in this case it's simply that
- max_sectors is some odd number in sectors (ie 65535)
- the block size is larger than a sector (ie 4k)
- the device probably doesn't even have any silly chunk size issue
(so chunk_sectors is zero).
and because the block size is 4k, no valid IO can ever generate
anything but 4k-aligned IO's, and everything is fine.
Except now the "split bios" patch will split blindly at the
max_sectors size, which is pure and utter garbage, since it doesn't
take the minimum block size into account.
Also, quite frankly, I think that whole "split bios" patch is garbage *anyway*.
The thing is, the whole "blk_max_size_offset()" use there is broken.
What I think it _should_ do is:
(a) check against max sectors like it used to do:
if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
goto split;
(b) completely separately, and _independently_ of that max sector
check, it should check against the "chunk_sectors" limit if it exists.
instead, it uses that nasty blk_max_size_offset() crap, which is
broken because it's confusing, but also because it doesn't honor
max_sectors AT ALL if there is a chunking size.
So I think chunking size should be independent of max_sectors. I could
see some device that has some absolute max sector size, but that
_also_ wants to split so that the bio never crosses a particular chunk
size (perhaps due to RAID, perhaps due to some internal device block
handling rules).
Trying to mix the two things with those "blk_max_size_offset()" games
is just wrong.
Linus
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-22 1:12 ` Linus Torvalds
@ 2016-01-22 3:21 ` Keith Busch
2016-01-22 4:15 ` Linus Torvalds
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2016-01-22 3:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jens Axboe, Stefan Haberland, Linux Kernel Mailing List,
linux-s390, Sebastian Ott
On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
> On Thu, Jan 21, 2016 at 2:51 PM, Keith Busch <keith.busch@intel.com> wrote:
> >
> > My apologies for the trouble. I trust it really is broken, but I don't
> > quite see how. The patch supposedly splits the transfer to the max size
> > the request queue says it allows. How does the max allowed size end up
> > an invalid multiple?
>
> I assume that in this case it's simply that
>
> - max_sectors is some odd number in sectors (ie 65535)
>
> - the block size is larger than a sector (ie 4k)
Wouldn't that make max sectors non-sensical? Or am I mistaken to think max
sectors is supposed to be a valid transfer in multiples of the physical
sector size?
I do think this is what's happening though. A recent commit (ca369d51b)
limits the max_sectors to 255 max by default, which isn't right for
4k. A driver has to override the queue's limits.max_dev_sectors first
to get the desired limit for their storage.
I'm not sure if that was the intention. There are lots of drivers
requesting more than 255 and probably unaware they're not getting it,
DASD included. I don't think we'd have seen this problem if the requested
setting wasn't overridden.
> What I think it _should_ do is:
>
> (a) check against max sectors like it used to do:
>
> if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
> goto split;
This can create less optimal splits for h/w that advertise chunk size. I
know it's a quirky feature (wasn't my idea), but the h/w is very slow
to not split at the necessary alignments, and we used to handle this
split correctly.
Point taken, though. The implementation needs some cleaning up.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-22 3:21 ` Keith Busch
@ 2016-01-22 4:15 ` Linus Torvalds
2016-01-22 14:56 ` Keith Busch
2016-01-22 15:06 ` Ming Lei
0 siblings, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2016-01-22 4:15 UTC (permalink / raw)
To: Keith Busch
Cc: Jens Axboe, Stefan Haberland, Linux Kernel Mailing List,
linux-s390, Sebastian Ott
On Thu, Jan 21, 2016 at 7:21 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
>>
>> I assume that in this case it's simply that
>>
>> - max_sectors is some odd number in sectors (ie 65535)
>>
>> - the block size is larger than a sector (ie 4k)
>
> Wouldn't that make max sectors non-sensical? Or am I mistaken to think max
> sectors is supposed to be a valid transfer in multiples of the physical
> sector size?
If the controller interface is some 16-bit register, then the maximum
number of sectors you can specify is 65535.
But if the disk then doesn't like 512-byte accesses, but wants 4kB or
whatever, then clearly you can't actually *feed* it that maximum
number. Not because it's a maximal, but because it's not aligned.
But that doesn't mean that it's non-sensical. It just means that you
have to take both things into account. There may be two totally
independent things that cause the two (very different) rules on what
the IO can look like.
Obviously there are probably games we could play, like always limiting
the maximum sector number to a multiple of the sector size. That would
presumably work for Stefan's case, by simply "artificially" making
max_sectors be 65528 instead.
But I do think it's better to consider them independent issues, and
just make sure that we always honor those things independently.
That "honor things independently" used to happen automatically before,
simply because we'd never split in the middle of a bio segment. And
since each bio segment was created with the limitations of the device
in mind, that all worked.
Now that it splits in the middle of a vector entry, that splitting
just needs to honor _all_ the rules. Not just the max sector one.
>> What I think it _should_ do is:
>>
>> (a) check against max sectors like it used to do:
>>
>> if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
>> goto split;
>
> This can create less optimal splits for h/w that advertise chunk size. I
> know it's a quirky feature (wasn't my idea), but the h/w is very slow
> to not split at the necessary alignments, and we used to handle this
> split correctly.
I suspect few high-performance controllers will really have big issues
with the max_sectors thing. If you have big enough IO that you could
hit the maximum sector number, you're already pretty well off, you
might as well split at that point.
So I think it's ok to split at the max sector case early.
For the case of nvme, for example, I think the max sector number is so
high that you'll never hit that anyway, and you'll only ever hit the
chunk limit. No?
So in practice it won't matter, I suspect.
Linus
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-22 4:15 ` Linus Torvalds
@ 2016-01-22 14:56 ` Keith Busch
2016-01-22 17:15 ` Jens Axboe
2016-01-22 15:06 ` Ming Lei
1 sibling, 1 reply; 11+ messages in thread
From: Keith Busch @ 2016-01-22 14:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jens Axboe, Stefan Haberland, Linux Kernel Mailing List,
linux-s390, Sebastian Ott
On Thu, Jan 21, 2016 at 08:15:37PM -0800, Linus Torvalds wrote:
> For the case of nvme, for example, I think the max sector number is so
> high that you'll never hit that anyway, and you'll only ever hit the
> chunk limit. No?
The device's max transfer and chunk size are not very large, both fixed
at 128KB. We can lose ~70% of potential throughput when IO isn't aligned,
and end users reported this when the block layer stopped splitting on
alignment for the NVMe drive.
So it's a big deal for this h/w, but now I feel awkward defending a
device specific feature for the generic block layer.
Anyway, the patch was developed with incorrect assumptions. I'd still
like to try again after reconciling the queue limit constraints, but
I defer to Jens for the near term.
Thanks!
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-22 4:15 ` Linus Torvalds
2016-01-22 14:56 ` Keith Busch
@ 2016-01-22 15:06 ` Ming Lei
2016-01-22 17:03 ` Linus Torvalds
1 sibling, 1 reply; 11+ messages in thread
From: Ming Lei @ 2016-01-22 15:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: Keith Busch, Jens Axboe, Stefan Haberland,
Linux Kernel Mailing List, linux-s390, Sebastian Ott, tom.leiming
On Thu, 21 Jan 2016 20:15:37 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Thu, Jan 21, 2016 at 7:21 PM, Keith Busch <keith.busch@intel.com> wrote:
> > On Thu, Jan 21, 2016 at 05:12:13PM -0800, Linus Torvalds wrote:
> >>
> >> I assume that in this case it's simply that
> >>
> >> - max_sectors is some odd number in sectors (ie 65535)
> >>
> >> - the block size is larger than a sector (ie 4k)
> >
> > Wouldn't that make max sectors non-sensical? Or am I mistaken to think max
> > sectors is supposed to be a valid transfer in multiples of the physical
> > sector size?
>
> If the controller interface is some 16-bit register, then the maximum
> number of sectors you can specify is 65535.
>
> But if the disk then doesn't like 512-byte accesses, but wants 4kB or
> whatever, then clearly you can't actually *feed* it that maximum
> number. Not because it's a maximal, but because it's not aligned.
>
> But that doesn't mean that it's non-sensical. It just means that you
> have to take both things into account. There may be two totally
> independent things that cause the two (very different) rules on what
> the IO can look like.
>
> Obviously there are probably games we could play, like always limiting
> the maximum sector number to a multiple of the sector size. That would
> presumably work for Stefan's case, by simply "artificially" making
> max_sectors be 65528 instead.
Yes, it is one problem, something like below does fix my test
with 4K block size.
--
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1699df5..49e0394 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -81,6 +81,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
unsigned front_seg_size = bio->bi_seg_front_size;
bool do_split = true;
struct bio *new = NULL;
+ unsigned max_sectors;
bio_for_each_segment(bv, bio, iter) {
/*
@@ -90,20 +91,23 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
goto split;
- if (sectors + (bv.bv_len >> 9) >
- blk_max_size_offset(q, bio->bi_iter.bi_sector)) {
+ /* I/O shoule be aligned to logical block size */
+ max_sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
+ max_sectors = ((max_sectors << 9) &
+ ~(queue_logical_block_size(q) - 1)) >> 9;
+ if (sectors + (bv.bv_len >> 9) > max_sectors) {
/*
* Consider this a new segment if we're splitting in
* the middle of this vector.
*/
if (nsegs < queue_max_segments(q) &&
- sectors < blk_max_size_offset(q,
- bio->bi_iter.bi_sector)) {
+ sectors < max_sectors) {
nsegs++;
- sectors = blk_max_size_offset(q,
- bio->bi_iter.bi_sector);
+ sectors = max_sectors;
}
- goto split;
+ if (sectors)
+ goto split;
+ /* It is OK to put single bvec into one segment */
}
if (bvprvp && blk_queue_cluster(q)) {
>
> But I do think it's better to consider them independent issues, and
> just make sure that we always honor those things independently.
>
> That "honor things independently" used to happen automatically before,
> simply because we'd never split in the middle of a bio segment. And
> since each bio segment was created with the limitations of the device
> in mind, that all worked.
>
> Now that it splits in the middle of a vector entry, that splitting
> just needs to honor _all_ the rules. Not just the max sector one.
>
> >> What I think it _should_ do is:
> >>
> >> (a) check against max sectors like it used to do:
> >>
> >> if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
> >> goto split;
> >
> > This can create less optimal splits for h/w that advertise chunk size. I
> > know it's a quirky feature (wasn't my idea), but the h/w is very slow
> > to not split at the necessary alignments, and we used to handle this
> > split correctly.
>
> I suspect few high-performance controllers will really have big issues
> with the max_sectors thing. If you have big enough IO that you could
> hit the maximum sector number, you're already pretty well off, you
> might as well split at that point.
>
> So I think it's ok to split at the max sector case early.
>
> For the case of nvme, for example, I think the max sector number is so
> high that you'll never hit that anyway, and you'll only ever hit the
> chunk limit. No?
>
> So in practice it won't matter, I suspect.
>
> Linus
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-22 15:06 ` Ming Lei
@ 2016-01-22 17:03 ` Linus Torvalds
2016-01-22 17:48 ` Ming Lei
0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2016-01-22 17:03 UTC (permalink / raw)
To: Ming Lei
Cc: Keith Busch, Jens Axboe, Stefan Haberland,
Linux Kernel Mailing List, linux-s390, Sebastian Ott
On Fri, Jan 22, 2016 at 7:06 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>
> Yes, it is one problem, something like below does fix my test
> with 4K block size.
It just doesn't look very legible.
Also, how could this
> - goto split;
> + if (sectors)
> + goto split;
ever matter? If sectors is 0, something is seriously wrong afaik.
Linus
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-22 14:56 ` Keith Busch
@ 2016-01-22 17:15 ` Jens Axboe
0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2016-01-22 17:15 UTC (permalink / raw)
To: Keith Busch, Linus Torvalds
Cc: Stefan Haberland, Linux Kernel Mailing List, linux-s390,
Sebastian Ott
On 01/22/2016 07:56 AM, Keith Busch wrote:
> On Thu, Jan 21, 2016 at 08:15:37PM -0800, Linus Torvalds wrote:
>> For the case of nvme, for example, I think the max sector number is so
>> high that you'll never hit that anyway, and you'll only ever hit the
>> chunk limit. No?
>
> The device's max transfer and chunk size are not very large, both fixed
> at 128KB. We can lose ~70% of potential throughput when IO isn't aligned,
> and end users reported this when the block layer stopped splitting on
> alignment for the NVMe drive.
>
> So it's a big deal for this h/w, but now I feel awkward defending a
> device specific feature for the generic block layer.
Honestly, the splitting code is what is a piece of crap, we never should
have gone down that route. Hopefully we can get rid of it soon. In the
mean time, this does need to work. It's an odd hw construct (basically
two devices bolted together), but it's not really an esoteric thing to
support.
> Anyway, the patch was developed with incorrect assumptions. I'd still
> like to try again after reconciling the queue limit constraints, but
> I defer to Jens for the near term.
Instead of scrambling for -rc1, I'd suggest we just revert again and
ensure what we merge for -rc2 is clean and passes the test cases.
--
Jens Axboe
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] Regression introduced with "block: split bios to max possible length"
2016-01-22 17:03 ` Linus Torvalds
@ 2016-01-22 17:48 ` Ming Lei
0 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2016-01-22 17:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Keith Busch, Jens Axboe, Stefan Haberland,
Linux Kernel Mailing List, linux-s390, Sebastian Ott
On Sat, Jan 23, 2016 at 1:03 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jan 22, 2016 at 7:06 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>>
>> Yes, it is one problem, something like below does fix my test
>> with 4K block size.
>
> It just doesn't look very legible.
OK, I will try to make it better.
>
> Also, how could this
>
>> - goto split;
>> + if (sectors)
>> + goto split;
>
> ever matter? If sectors is 0, something is seriously wrong afaik.
I guess that is possible because blk_max_size_offset() depends on
offset and may return a small size which is less than logical
block size.
--
Ming Lei
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2016-01-22 17:48 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-21 14:57 [BUG] Regression introduced with "block: split bios to max possible length" Stefan Haberland
2016-01-21 21:34 ` Jens Axboe
2016-01-21 22:51 ` Keith Busch
2016-01-22 1:12 ` Linus Torvalds
2016-01-22 3:21 ` Keith Busch
2016-01-22 4:15 ` Linus Torvalds
2016-01-22 14:56 ` Keith Busch
2016-01-22 17:15 ` Jens Axboe
2016-01-22 15:06 ` Ming Lei
2016-01-22 17:03 ` Linus Torvalds
2016-01-22 17:48 ` Ming Lei
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).