* [PATCH] block: bios with an offset are always gappy
@ 2017-04-13 8:06 Johannes Thumshirn
2017-04-13 9:48 ` Christoph Hellwig
2017-04-13 10:02 ` Ming Lei
0 siblings, 2 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2017-04-13 8:06 UTC (permalink / raw)
To: Jens Axboe, Omar Sandoval
Cc: Ming Lei, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
Linux Block Layer Mailinglist, Linux Kernel Mailinglist,
Johannes Thumshirn
Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic
in nvme_setup_prps() because the dma_len will drop below zero but the
length not.
A git bisect tracked the behaviour down to commit 729204ef49ec ("block:
relax check on sg gap"). Since commit 729204ef49ec a bio's offsets are not
taken into account in the decision if the bio will gap any more. Restore
the old behavior of checking bio offsets as well for the decision if a
bio will gap.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: 729204ef49ec ("block: relax check on sg gap")
Cc: Ming Lei <ming.lei@redhat.com>
---
include/linux/blkdev.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7548f332121a..a03b7196209e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1677,11 +1677,14 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
{
if (bio_has_data(prev) && queue_virt_boundary(q)) {
struct bio_vec pb, nb;
+ bool offset;
bio_get_last_bvec(prev, &pb);
bio_get_first_bvec(next, &nb);
- if (!bios_segs_mergeable(q, prev, &pb, &nb))
+ offset = pb.bv_offset || nb.bv_offset;
+
+ if (offset || !bios_segs_mergeable(q, prev, &pb, &nb))
return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
}
--
2.12.0
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 8:06 [PATCH] block: bios with an offset are always gappy Johannes Thumshirn @ 2017-04-13 9:48 ` Christoph Hellwig 2017-04-13 9:56 ` Johannes Thumshirn 2017-04-13 10:01 ` Johannes Thumshirn 2017-04-13 10:02 ` Ming Lei 1 sibling, 2 replies; 13+ messages in thread From: Christoph Hellwig @ 2017-04-13 9:48 UTC (permalink / raw) To: Johannes Thumshirn Cc: Jens Axboe, Omar Sandoval, Ming Lei, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote: > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic > in nvme_setup_prps() because the dma_len will drop below zero but the > length not. I think we should also turns this into a WARN_ON_ONCE + error return.. But do you have an exact btrfsprogs version and command line? I do a lot of testing that involves mkfs.btrfs on nvme and haven't seen it.. > A git bisect tracked the behaviour down to commit 729204ef49ec ("block: > relax check on sg gap"). Since commit 729204ef49ec a bio's offsets are not > taken into account in the decision if the bio will gap any more. Restore > the old behavior of checking bio offsets as well for the decision if a > bio will gap. > > Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> > Fixes: 729204ef49ec ("block: relax check on sg gap") > Cc: Ming Lei <ming.lei@redhat.com> > --- > include/linux/blkdev.h | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 7548f332121a..a03b7196209e 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1677,11 +1677,14 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev, > { > if (bio_has_data(prev) && queue_virt_boundary(q)) { > struct bio_vec pb, nb; > + bool offset; > > bio_get_last_bvec(prev, &pb); > bio_get_first_bvec(next, &nb); > > - if (!bios_segs_mergeable(q, prev, &pb, &nb)) > + offset = pb.bv_offset || nb.bv_offset; > + > + if (offset || !bios_segs_mergeable(q, prev, &pb, &nb)) > return __bvec_gap_to_prev(q, &pb, nb.bv_offset); > } I think the code in NVMe (and potentially the other drivers using virt_queue_boundary) is bogus. All of them are actually fine with gaps in the protocol, as long as the gaps are aligned to said boundary. So I suspect what we really need is to fix up NVMe, and after that we could even relax the above check, to not check for offset but offset & queue_virt_boundary(q). ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 9:48 ` Christoph Hellwig @ 2017-04-13 9:56 ` Johannes Thumshirn 2017-04-13 10:01 ` Johannes Thumshirn 1 sibling, 0 replies; 13+ messages in thread From: Johannes Thumshirn @ 2017-04-13 9:56 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Omar Sandoval, Ming Lei, Bart Van Assche, Hannes Reinecke, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 11:48:35AM +0200, Christoph Hellwig wrote: > I think we should also turns this into a WARN_ON_ONCE + error return.. > > But do you have an exact btrfsprogs version and command line? I do a lot > of testing that involves mkfs.btrfs on nvme and haven't seen it.. Sure, it's: mkfs.btrfs, part of btrfs-progs v4.5.3+20160729 Qemu is 2.6.2 [...] > > I think the code in NVMe (and potentially the other drivers using > virt_queue_boundary) is bogus. All of them are actually fine with > gaps in the protocol, as long as the gaps are aligned to said boundary. > > So I suspect what we really need is to fix up NVMe, and after that > we could even relax the above check, to not check for offset but > offset & queue_virt_boundary(q). That's what I tried doing the last two days but as we're rather late in the rc cycle and it is a regression that came in with -rc1 I'd rather like to have it fixed or at least have a band aid in place. Byte, Johannes -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 9:48 ` Christoph Hellwig 2017-04-13 9:56 ` Johannes Thumshirn @ 2017-04-13 10:01 ` Johannes Thumshirn 1 sibling, 0 replies; 13+ messages in thread From: Johannes Thumshirn @ 2017-04-13 10:01 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, Omar Sandoval, Ming Lei, Bart Van Assche, Hannes Reinecke, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 11:48:35AM +0200, Christoph Hellwig wrote: > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote: > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic > > in nvme_setup_prps() because the dma_len will drop below zero but the > > length not. > > I think we should also turns this into a WARN_ON_ONCE + error return.. > > But do you have an exact btrfsprogs version and command line? I do a lot > of testing that involves mkfs.btrfs on nvme and haven't seen it.. Ah one detail I forgot: mkfs.xfs _does_ work. Haven't checked ext4 though. -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 8:06 [PATCH] block: bios with an offset are always gappy Johannes Thumshirn 2017-04-13 9:48 ` Christoph Hellwig @ 2017-04-13 10:02 ` Ming Lei 2017-04-13 10:10 ` Johannes Thumshirn 2017-04-13 11:53 ` Johannes Thumshirn 1 sibling, 2 replies; 13+ messages in thread From: Ming Lei @ 2017-04-13 10:02 UTC (permalink / raw) To: Johannes Thumshirn Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote: > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic > in nvme_setup_prps() because the dma_len will drop below zero but the > length not. Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact mkfs command line and size of your emulated NVMe? > > A git bisect tracked the behaviour down to commit 729204ef49ec ("block: > relax check on sg gap"). Since commit 729204ef49ec a bio's offsets are not > taken into account in the decision if the bio will gap any more. Restore > the old behavior of checking bio offsets as well for the decision if a > bio will gap. > > Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> > Fixes: 729204ef49ec ("block: relax check on sg gap") > Cc: Ming Lei <ming.lei@redhat.com> > --- > include/linux/blkdev.h | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 7548f332121a..a03b7196209e 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1677,11 +1677,14 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev, > { > if (bio_has_data(prev) && queue_virt_boundary(q)) { > struct bio_vec pb, nb; > + bool offset; > > bio_get_last_bvec(prev, &pb); > bio_get_first_bvec(next, &nb); > > - if (!bios_segs_mergeable(q, prev, &pb, &nb)) > + offset = pb.bv_offset || nb.bv_offset; We don't consider pb's offset here, because if pb.bv_offset isn't zero, pb should be the only bvec in 'prev' and will be put in a standalone segement, and we still can make 'nb' into this segment if both are mergeable. But the following issue might be caused by commit 729204ef49ec ("block: relax check on sg gap"): - if the 'next' has more than one segment - the segment merged from 'pb' and the 1st segment of 'next' may end at un-aligned virt boundary Could you try the following patch to see if it fixes your issue? --- diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 7548f332121a..65d1510681c6 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1659,16 +1659,28 @@ static inline bool bvec_gap_to_prev(struct request_queue *q, * and the 1st bvec in the 2nd bio can be handled in one segment. */ static inline bool bios_segs_mergeable(struct request_queue *q, - struct bio *prev, struct bio_vec *prev_last_bv, + struct bio *prev, struct bio *next, + struct bio_vec *prev_last_bv, struct bio_vec *next_first_bv) { if (!BIOVEC_PHYS_MERGEABLE(prev_last_bv, next_first_bv)) return false; if (!BIOVEC_SEG_BOUNDARY(q, prev_last_bv, next_first_bv)) return false; - if (prev->bi_seg_back_size + next_first_bv->bv_len > + if (prev->bi_seg_back_size + next->bi_seg_front_size > queue_max_segment_size(q)) return false; + + /* + * if 'next' has multiple segments, we need to make + * sure the merged segment from 'pb' and the 1st segment + * of 'next' ends at aligned virt boundary. + */ + if ((next->bi_seg_front_size < next->bi_iter.bi_size) && + ((prev_last_bv->bv_offset + prev_last_bv->bv_len + + next->bi_seg_front_size) & queue_virt_boundary(q))) + return false; + return true; } @@ -1681,7 +1693,7 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev, bio_get_last_bvec(prev, &pb); bio_get_first_bvec(next, &nb); - if (!bios_segs_mergeable(q, prev, &pb, &nb)) + if (!bios_segs_mergeable(q, prev, next, &pb, &nb)) return __bvec_gap_to_prev(q, &pb, nb.bv_offset); } -- Ming ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 10:02 ` Ming Lei @ 2017-04-13 10:10 ` Johannes Thumshirn 2017-04-13 11:53 ` Johannes Thumshirn 1 sibling, 0 replies; 13+ messages in thread From: Johannes Thumshirn @ 2017-04-13 10:10 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote: > Could you try the following patch to see if it fixes your issue? Sure, jsut have a short lunch break and then I'll report back. -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 10:02 ` Ming Lei 2017-04-13 10:10 ` Johannes Thumshirn @ 2017-04-13 11:53 ` Johannes Thumshirn 2017-04-13 12:11 ` Ming Lei 2017-04-13 14:45 ` Ming Lei 1 sibling, 2 replies; 13+ messages in thread From: Johannes Thumshirn @ 2017-04-13 11:53 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote: > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote: > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic > > in nvme_setup_prps() because the dma_len will drop below zero but the > > length not. > > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact > mkfs command line and size of your emulated NVMe? the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes. [...] > Could you try the following patch to see if it fixes your issue? It's back to the old, erratic behaviour, see log below. > > --- > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 7548f332121a..65d1510681c6 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1659,16 +1659,28 @@ static inline bool bvec_gap_to_prev(struct request_queue *q, > * and the 1st bvec in the 2nd bio can be handled in one segment. > */ > static inline bool bios_segs_mergeable(struct request_queue *q, > - struct bio *prev, struct bio_vec *prev_last_bv, > + struct bio *prev, struct bio *next, > + struct bio_vec *prev_last_bv, > struct bio_vec *next_first_bv) > { > if (!BIOVEC_PHYS_MERGEABLE(prev_last_bv, next_first_bv)) > return false; > if (!BIOVEC_SEG_BOUNDARY(q, prev_last_bv, next_first_bv)) > return false; > - if (prev->bi_seg_back_size + next_first_bv->bv_len > > + if (prev->bi_seg_back_size + next->bi_seg_front_size > > queue_max_segment_size(q)) > return false; > + > + /* > + * if 'next' has multiple segments, we need to make > + * sure the merged segment from 'pb' and the 1st segment > + * of 'next' ends at aligned virt boundary. > + */ > + if ((next->bi_seg_front_size < next->bi_iter.bi_size) && > + ((prev_last_bv->bv_offset + prev_last_bv->bv_len + > + next->bi_seg_front_size) & queue_virt_boundary(q))) > + return false; > + > return true; > } > > @@ -1681,7 +1693,7 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev, > bio_get_last_bvec(prev, &pb); > bio_get_first_bvec(next, &nb); > > - if (!bios_segs_mergeable(q, prev, &pb, &nb)) > + if (!bios_segs_mergeable(q, prev, next, &pb, &nb)) > return __bvec_gap_to_prev(q, &pb, nb.bv_offset); > } dracut:/# [ 1.211567] tsc: Refined TSC clocksource calibration: 2297.338 MHz [ 1.212601] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x211d6274d86, max_idle_ns: 440795243673 ns dracut:/# modprobe btrfs [ 8.139179] raid6_pq: module verification failed: signature and/or required key missing - tainting kernel [ 8.207509] raid6: sse2x1 gen() 6827 MB/s [ 8.275512] raid6: sse2x1 xor() 5654 MB/s [ 8.343507] raid6: sse2x2 gen() 11573 MB/s [ 8.411503] raid6: sse2x2 xor() 8826 MB/s [ 8.479504] raid6: sse2x4 gen() 14794 MB/s [ 8.547504] raid6: sse2x4 xor() 10618 MB/s [ 8.547830] raid6: using algorithm sse2x4 gen() 14794 MB/s [ 8.548218] raid6: .... xor() 10618 MB/s, rmw enabled [ 8.548558] raid6: using intx1 recovery algorithm [ 8.549341] xor: measuring software checksum speed [ 8.587533] prefetch64-sse: 15090.000 MB/sec [ 8.627553] generic_sse: 13530.000 MB/sec [ 8.627945] xor: using function: prefetch64-sse (15090.000 MB/sec) [ 8.633795] Btrfs loaded, crc32c=crc32c-generic, assert=on dracut:/# modprobe nvme [ 12.348762] nvme nvme0: pci function 0000:00:04.0 [ 12.386300] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11 dracut:/# [ 12.391707] nvme0n1: p1 dracut:/# mkfs.b[ 36.553376] random: fast init done tr dracut:/# mkfs.btrfs -f /dev/nvme0n1p1 btrfs-progs v4.5.3+20160729 See http://btrfs.wiki.kernel.org for more information. Detected a SSD, turning off metadata duplication. Mkfs with -m dup if you want to force metadata duplication. [ 46.696671] ------------[ cut here ]------------ [ 46.697338] kernel BUG at drivers/nvme/host/pci.c:494! [ 46.697806] invalid opcode: 0000 [#1] SMP [ 46.698175] Modules linked in: nvme(E) nvme_core(E) btrfs(E) xor(E) raid6_pq(E) [ 46.698879] CPU: 1 PID: 18 Comm: kworker/1:0H Tainted: G E 4.11.0-rc6-default+ #43 [ 46.699686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 46.700737] Workqueue: kblockd blk_mq_run_work_fn [ 46.701169] task: ffff88007bd24540 task.stack: ffffc900003bc000 [ 46.701709] RIP: 0010:nvme_queue_rq+0x85d/0x886 [nvme] [ 46.702185] RSP: 0018:ffffc900003bfc78 EFLAGS: 00010286 [ 46.702670] RAX: 0000000000000078 RBX: 0000000000001000 RCX: 000000007f625000 [ 46.703318] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246 [ 46.703968] RBP: ffffc900003bfd50 R08: 000000000013ee00 R09: 0000000000001000 [ 46.704624] R10: ffff88007f1ed000 R11: ffff88007f220000 R12: ffff88007f1ed000 [ 46.705276] R13: 00000000fffffe00 R14: 0000000000000010 R15: 000000000012fe00 [ 46.705927] FS: 0000000000000000(0000) GS:ffff88007ea80000(0000) knlGS:0000000000000000 [ 46.706673] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 46.707199] CR2: 00007ffdd1294000 CR3: 000000007b742000 CR4: 00000000000006e0 [ 46.707846] Call Trace: [ 46.708082] blk_mq_dispatch_rq_list+0x2a0/0x3d0 [ 46.708510] blk_mq_sched_dispatch_requests+0x138/0x160 [ 46.708991] __blk_mq_run_hw_queue+0x8c/0xa0 [ 46.709407] blk_mq_run_work_fn+0x12/0x20 [ 46.709781] process_one_work+0x153/0x400 [ 46.710152] worker_thread+0x12b/0x4b0 [ 46.711698] kthread+0x109/0x140 [ 46.712013] ? rescuer_thread+0x340/0x340 [ 46.712391] ? kthread_park+0x90/0x90 [ 46.712741] ret_from_fork+0x2c/0x40 [ 46.713081] Code: 01 00 48 8b 40 10 48 89 45 a8 49 8b 87 70 01 00 00 48 89 45 b0 0f 84 3e fa ff ff 49 8b 87 88 01 00 00 48 89 45 a0 e9 2e fa ff ff <0f> 0b 4c 8b 0d e2 35 8f e1 eb 80 0f 0b 4c 89 ef c6 07 00 0f 1f [ 46.714861] RIP: nvme_queue_rq+0x85d/0x886 [nvme] RSP: ffffc900003bfc78 [ 46.715810] ---[ end trace 280a594163a124fb ]--- [ 46.796265] ------------[ cut here ]------------ Thanks, Johannes -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 11:53 ` Johannes Thumshirn @ 2017-04-13 12:11 ` Ming Lei [not found] ` <20170413122010.GJ6734@linux-x5ow.site> 2017-04-13 14:45 ` Ming Lei 1 sibling, 1 reply; 13+ messages in thread From: Ming Lei @ 2017-04-13 12:11 UTC (permalink / raw) To: Johannes Thumshirn Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist [-- Attachment #1: Type: text/plain, Size: 1032 bytes --] On Thu, Apr 13, 2017 at 01:53:28PM +0200, Johannes Thumshirn wrote: > On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote: > > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote: > > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic > > > in nvme_setup_prps() because the dma_len will drop below zero but the > > > length not. > > > > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned > > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact > > mkfs command line and size of your emulated NVMe? > > the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a > existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes. > > [...] > > > Could you try the following patch to see if it fixes your issue? > > It's back to the old, erratic behaviour, see log below. Ok, could you apply the attached debug patch and collect the ftrace log? (ftrace_dump_on_oops need to be passed to kernel cmd line). Thanks, Ming [-- Attachment #2: dump-rq.patch --] [-- Type: text/plain, Size: 2383 bytes --] diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 26a5fd05fe88..a813a36d48d9 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -491,6 +491,8 @@ static bool nvme_setup_prps(struct nvme_dev *dev, struct request *req) break; if (dma_len > 0) continue; + if (dma_len < 0) + blk_dump_rq(req, "nvme dma sg gap"); BUG_ON(dma_len < 0); sg = sg_next(sg); dma_addr = sg_dma_address(sg); diff --git a/include/linux/bio.h b/include/linux/bio.h index 8e521194f6fc..f3b001e401d2 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -811,5 +811,29 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page, #endif /* CONFIG_BLK_DEV_INTEGRITY */ +static inline void blk_dump_bio(struct bio *bio, const char *msg) +{ + struct bvec_iter iter; + struct bio_vec bvec; + int i = 0; + unsigned sectors = 0; + + trace_printk("%s-%p: %hx/%hx %u %llu %u\n", + msg, bio, + bio->bi_flags, bio->bi_opf, + bio->bi_phys_segments, + (unsigned long long)bio->bi_iter.bi_sector, + bio->bi_iter.bi_size); + bio_for_each_segment(bvec, bio, iter) { + sectors += bvec.bv_len >> 9; + trace_printk("\t %d: %lu %u %u(%u)\n", i++, + (unsigned long)page_to_pfn(bvec.bv_page), + bvec.bv_offset, + bvec.bv_len, bvec.bv_len >> 12); + } + trace_printk("\t total sectors %u\n", sectors); +} + + #endif /* CONFIG_BLOCK */ #endif /* __LINUX_BIO_H */ diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 7548f332121a..b75d6fe5a1b9 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1698,6 +1698,22 @@ static inline bool req_gap_front_merge(struct request *req, struct bio *bio) return bio_will_gap(req->q, bio, req->bio); } +static inline void blk_dump_rq(const struct request *req, const char *msg) +{ + struct bio *bio; + int i = 0; + + trace_printk("%s: dump bvec for %p(f:%x, seg: %d)\n", + msg, req, req->cmd_flags, + req->nr_phys_segments); + + __rq_for_each_bio(bio, req) { + char num[16]; + snprintf(num, 16, "%d", i++); + blk_dump_bio(bio, num); + } +} + int kblockd_schedule_work(struct work_struct *work); int kblockd_schedule_work_on(int cpu, struct work_struct *work); int kblockd_schedule_delayed_work(struct delayed_work *dwork, unsigned long delay); ^ permalink raw reply related [flat|nested] 13+ messages in thread
[parent not found: <20170413122010.GJ6734@linux-x5ow.site>]
* Re: [PATCH] block: bios with an offset are always gappy [not found] ` <20170413122010.GJ6734@linux-x5ow.site> @ 2017-04-13 13:44 ` Ming Lei 0 siblings, 0 replies; 13+ messages in thread From: Ming Lei @ 2017-04-13 13:44 UTC (permalink / raw) To: Johannes Thumshirn Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 02:20:10PM +0200, Johannes Thumshirn wrote: > On Thu, Apr 13, 2017 at 08:11:40PM +0800, Ming Lei wrote: > > Ok, could you apply the attached debug patch and collect the > > ftrace log? (ftrace_dump_on_oops need to be passed to kernel cmd line). > Thanks Johannes very much for collecting the log. The root cause should be the following: - if 'offset' of one segment isn't zero, we can't make that segment at full size, otherwise the segment ends in the unaligned virt boundary. Give me a while, I will figure out one patch to address the issue. Thanks, Ming ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 11:53 ` Johannes Thumshirn 2017-04-13 12:11 ` Ming Lei @ 2017-04-13 14:45 ` Ming Lei 2017-04-13 14:50 ` Johannes Thumshirn 2017-04-13 20:35 ` Andreas Mohr 1 sibling, 2 replies; 13+ messages in thread From: Ming Lei @ 2017-04-13 14:45 UTC (permalink / raw) To: Johannes Thumshirn Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 01:53:28PM +0200, Johannes Thumshirn wrote: > On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote: > > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote: > > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic > > > in nvme_setup_prps() because the dma_len will drop below zero but the > > > length not. > > > > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned > > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact > > mkfs command line and size of your emulated NVMe? > > the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a > existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes. > > [...] > > > Could you try the following patch to see if it fixes your issue? > > It's back to the old, erratic behaviour, see log below. Johannes, could you test the following patch? Thanks Ming --- diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index b75d6fe5a1b9..cc68dfaef528 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1672,12 +1672,27 @@ static inline bool bios_segs_mergeable(struct request_queue *q, return true; } -static inline bool bio_will_gap(struct request_queue *q, struct bio *prev, - struct bio *next) +static inline bool bio_will_gap(struct request_queue *q, + struct request *prev_rq, + struct bio *prev, + struct bio *next) { if (bio_has_data(prev) && queue_virt_boundary(q)) { struct bio_vec pb, nb; + /* + * don't merge if the 1st bio starts with non-zero + * offset, otherwise it is quite difficult to respect + * sg gap limit. We work hard to merge huge of small + * bios in case of mkfs. + */ + if (prev_rq) + bio_get_first_bvec(prev_rq->bio, &pb); + else + bio_get_first_bvec(prev, &pb); + if (pb.bv_offset) + return true; + bio_get_last_bvec(prev, &pb); bio_get_first_bvec(next, &nb); @@ -1690,12 +1705,12 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev, static inline bool req_gap_back_merge(struct request *req, struct bio *bio) { - return bio_will_gap(req->q, req->biotail, bio); + return bio_will_gap(req->q, req, req->biotail, bio); } static inline bool req_gap_front_merge(struct request *req, struct bio *bio) { - return bio_will_gap(req->q, bio, req->bio); + return bio_will_gap(req->q, NULL, bio, req->bio); } static inline void blk_dump_rq(const struct request *req, const char *msg) ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 14:45 ` Ming Lei @ 2017-04-13 14:50 ` Johannes Thumshirn 2017-04-13 20:35 ` Andreas Mohr 1 sibling, 0 replies; 13+ messages in thread From: Johannes Thumshirn @ 2017-04-13 14:50 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 10:45:10PM +0800, Ming Lei wrote: > On Thu, Apr 13, 2017 at 01:53:28PM +0200, Johannes Thumshirn wrote: > > On Thu, Apr 13, 2017 at 06:02:21PM +0800, Ming Lei wrote: > > > On Thu, Apr 13, 2017 at 10:06:29AM +0200, Johannes Thumshirn wrote: > > > > Doing a mkfs.btrfs on a (qemu emulated) PCIe NVMe causes a kernel panic > > > > in nvme_setup_prps() because the dma_len will drop below zero but the > > > > length not. > > > > > > Looks I can't reproduce the issue in QEMU(32G nvme, either partitioned > > > or not, just use 'mkfs.btrfs /dev/nvme0n1p1'), could you share the exact > > > mkfs command line and size of your emulated NVMe? > > > > the exact cmdline is mkfs.btrfs -f /dev/nvme0n1p1 (-f because there was a > > existing btrfs on the image). The image is 17179869184 (a.k.a 16G) bytes. > > > > [...] > > > > > Could you try the following patch to see if it fixes your issue? > > > > It's back to the old, erratic behaviour, see log below. > > Johannes, could you test the following patch? > > Thanks > Ming Works, awesome thanks! Tested-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 14:45 ` Ming Lei 2017-04-13 14:50 ` Johannes Thumshirn @ 2017-04-13 20:35 ` Andreas Mohr 2017-04-14 1:15 ` Ming Lei 1 sibling, 1 reply; 13+ messages in thread From: Andreas Mohr @ 2017-04-13 20:35 UTC (permalink / raw) To: Ming Lei Cc: Johannes Thumshirn, Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 10:45:10PM +0800, Ming Lei wrote: > + /* > + * don't merge if the 1st bio starts with non-zero > + * offset, otherwise it is quite difficult to respect > + * sg gap limit. We work hard to merge huge of small > + * bios in case of mkfs. "huge of small bios in case" - -ENOPARSE. "huge numbers of"? "huge or small"? HTH, Andreas Mohr ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] block: bios with an offset are always gappy 2017-04-13 20:35 ` Andreas Mohr @ 2017-04-14 1:15 ` Ming Lei 0 siblings, 0 replies; 13+ messages in thread From: Ming Lei @ 2017-04-14 1:15 UTC (permalink / raw) To: Andreas Mohr Cc: Johannes Thumshirn, Jens Axboe, Omar Sandoval, Bart Van Assche, Hannes Reinecke, Christoph Hellwig, Linux Block Layer Mailinglist, Linux Kernel Mailinglist On Thu, Apr 13, 2017 at 10:35:17PM +0200, Andreas Mohr wrote: > On Thu, Apr 13, 2017 at 10:45:10PM +0800, Ming Lei wrote: > > + /* > > + * don't merge if the 1st bio starts with non-zero > > + * offset, otherwise it is quite difficult to respect > > + * sg gap limit. We work hard to merge huge of small > > + * bios in case of mkfs. > > "huge of small bios in case" - -ENOPARSE. > > "huge numbers of"? > "huge or small"? OK, it should be tons of or sort of description. As you can see in the trace, there are 2560 bios merged in one request. Thanks, Ming ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2017-04-14 1:15 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-04-13 8:06 [PATCH] block: bios with an offset are always gappy Johannes Thumshirn
2017-04-13 9:48 ` Christoph Hellwig
2017-04-13 9:56 ` Johannes Thumshirn
2017-04-13 10:01 ` Johannes Thumshirn
2017-04-13 10:02 ` Ming Lei
2017-04-13 10:10 ` Johannes Thumshirn
2017-04-13 11:53 ` Johannes Thumshirn
2017-04-13 12:11 ` Ming Lei
[not found] ` <20170413122010.GJ6734@linux-x5ow.site>
2017-04-13 13:44 ` Ming Lei
2017-04-13 14:45 ` Ming Lei
2017-04-13 14:50 ` Johannes Thumshirn
2017-04-13 20:35 ` Andreas Mohr
2017-04-14 1:15 ` Ming Lei
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox