* Re: [PATCH 1/9] bio-integrity: Do not allocate integrity context for bio w/o data
From: Hannes Reinecke @ 2017-04-05 6:32 UTC (permalink / raw)
To: Dmitry Monakhov, linux-kernel, linux-block, martin.petersen
In-Reply-To: <1491332201-26926-2-git-send-email-dmonakhov@openvz.org>
On 04/04/2017 08:56 PM, Dmitry Monakhov wrote:
> If bio has no data, such as ones from blkdev_issue_flush(),
> then we have nothing to protect.
>
> This patch prevent bugon like follows:
>
> kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
> kernel BUG at mm/slab.c:2773!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: bcache
> CPU: 0 PID: 4428 Comm: xfs_io Tainted: G W 4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
> Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
> task: ffff880137786440 task.stack: ffffc90000ba8000
> RIP: 0010:kfree_debugcheck+0x25/0x2a
> RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
> RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
> RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
> R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
> FS: 00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
> Call Trace:
> kfree+0xc8/0x1b3
> bio_integrity_free+0xc3/0x16b
> bio_free+0x25/0x66
> bio_put+0x14/0x26
> blkdev_issue_flush+0x7a/0x85
> blkdev_fsync+0x35/0x42
> vfs_fsync_range+0x8e/0x9f
> vfs_fsync+0x1c/0x1e
> do_fsync+0x31/0x4a
> SyS_fsync+0x10/0x14
> entry_SYSCALL_64_fastpath+0x1f/0xc2
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> block/bio-integrity.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 5384713..b5009a8 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -175,6 +175,9 @@ bool bio_integrity_enabled(struct bio *bio)
> if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
> return false;
>
> + if (!bio_sectors(bio))
> + return false;
> +
> /* Already protected? */
> if (bio_integrity(bio))
> return false;
>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
hare@suse.de +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: F. Imend�rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG N�rnberg)
^ permalink raw reply
* Re: [PATCH v2 4/5] scsi: Add scsi_restart_hctx()
From: Christoph Hellwig @ 2017-04-05 6:20 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Jens Axboe, linux-block@vger.kernel.org,
Martin K . Petersen, James Bottomley, Hannes Reinecke
In-Reply-To: <BY1PR0401MB15322BB999F894A46F40AE5E810B0@BY1PR0401MB1532.namprd04.prod.outlook.com>
On Tue, Apr 04, 2017 at 03:56:34PM +0000, Bart Van Assche wrote:
> > This looks like generic block layer code, why is it in SCSI?
>
> Hello Christoph,
>
> That's an excellent question. I assume that you are fine with moving
> this code to the block layer?
Yes. In fact I wonder if we need the blk_mq_ops method at all now
that you're using RCU locking and don't need the scsi host_lock.
^ permalink raw reply
* Re: [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: Ming Lei @ 2017-04-05 5:13 UTC (permalink / raw)
To: NeilBrown; +Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-mm, LKML
In-Reply-To: <8760ijiind.fsf@notabene.neil.brown.name>
On Wed, Apr 5, 2017 at 12:27 PM, NeilBrown <neilb@suse.com> wrote:
> On Tue, Apr 04 2017, Christoph Hellwig wrote:
>
>> Looks fine,
>>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>
>> But if you actually care about performance in any way I'd suggest
>> to use the loop device in direct I/O mode..
>
> The losetup on my test VM is too old to support that :-(
> I guess it might be time to upgraded.
>
> It seems that there is not "mount -o direct_loop" or similar, so you
> have to do the losetup and the mount separately. Any thoughts on
I guess the 'direct_loop' can be added into 'mount' directly? but not familiar
with mount utility.
> whether that should be changed ?
There was sysfs interface for controling direct IO in the initial
submission, but
looks it was reviewed out, :-)
Thanks,
Ming Lei
^ permalink raw reply
* Re: [PATCH v2] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: Ming Lei @ 2017-04-05 5:05 UTC (permalink / raw)
To: NeilBrown; +Cc: Jens Axboe, linux-block, linux-mm, LKML
In-Reply-To: <87wpazh3rl.fsf@notabene.neil.brown.name>
On Wed, Apr 5, 2017 at 12:33 PM, NeilBrown <neilb@suse.com> wrote:
>
> When a filesystem is mounted from a loop device, writes are
> throttled by balance_dirty_pages() twice: once when writing
> to the filesystem and once when the loop_handle_cmd() writes
> to the backing file. This double-throttling can trigger
> positive feedback loops that create significant delays. The
> throttling at the lower level is seen by the upper level as
> a slow device, so it throttles extra hard.
>
> The PF_LESS_THROTTLE flag was created to handle exactly this
> circumstance, though with an NFS filesystem mounted from a
> local NFS server. It reduces the throttling on the lower
> layer so that it can proceed largely unthrottled.
>
> To demonstrate this, create a filesystem on a loop device
> and write (e.g. with dd) several large files which combine
> to consume significantly more than the limit set by
> /proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total
> time taken.
>
> When I do this directly on a device (no loop device) the
> total time for several runs (mkfs, mount, write 200 files,
> umount) is fairly stable: 28-35 seconds.
> When I do this over a loop device the times are much worse
> and less stable. 52-460 seconds. Half below 100seconds,
> half above.
> When I apply this patch, the times become stable again,
> though not as fast as the no-loop-back case: 53-72 seconds.
>
> There may be room for further improvement as the total overhead still
> seems too high, but this is a big improvement.
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
> ---
>
> I moved where the flag is set, thanks to suggestion from
> Ming Lei.
> I've preserved the *-by: tags I was offered despite the code
> being different, as the concept is identical.
>
> Thanks,
> NeilBrown
>
>
> drivers/block/loop.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 0ecb6461ed81..44b3506fd086 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -852,6 +852,7 @@ static int loop_prepare_queue(struct loop_device *lo)
> if (IS_ERR(lo->worker_task))
> return -ENOMEM;
> set_user_nice(lo->worker_task, MIN_NICE);
> + lo->worker_task->flags |= PF_LESS_THROTTLE;
> return 0;
> }
>
> --
> 2.12.2
>
^ permalink raw reply
* [PATCH v2] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: NeilBrown @ 2017-04-05 4:33 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, linux-mm, LKML, Ming Lei
In-Reply-To: <871staffus.fsf@notabene.neil.brown.name>
[-- Attachment #1: Type: text/plain, Size: 2242 bytes --]
When a filesystem is mounted from a loop device, writes are
throttled by balance_dirty_pages() twice: once when writing
to the filesystem and once when the loop_handle_cmd() writes
to the backing file. This double-throttling can trigger
positive feedback loops that create significant delays. The
throttling at the lower level is seen by the upper level as
a slow device, so it throttles extra hard.
The PF_LESS_THROTTLE flag was created to handle exactly this
circumstance, though with an NFS filesystem mounted from a
local NFS server. It reduces the throttling on the lower
layer so that it can proceed largely unthrottled.
To demonstrate this, create a filesystem on a loop device
and write (e.g. with dd) several large files which combine
to consume significantly more than the limit set by
/proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total
time taken.
When I do this directly on a device (no loop device) the
total time for several runs (mkfs, mount, write 200 files,
umount) is fairly stable: 28-35 seconds.
When I do this over a loop device the times are much worse
and less stable. 52-460 seconds. Half below 100seconds,
half above.
When I apply this patch, the times become stable again,
though not as fast as the no-loop-back case: 53-72 seconds.
There may be room for further improvement as the total overhead still
seems too high, but this is a big improvement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
---
I moved where the flag is set, thanks to suggestion from
Ming Lei.
I've preserved the *-by: tags I was offered despite the code
being different, as the concept is identical.
Thanks,
NeilBrown
drivers/block/loop.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 0ecb6461ed81..44b3506fd086 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -852,6 +852,7 @@ static int loop_prepare_queue(struct loop_device *lo)
if (IS_ERR(lo->worker_task))
return -ENOMEM;
set_user_nice(lo->worker_task, MIN_NICE);
+ lo->worker_task->flags |= PF_LESS_THROTTLE;
return 0;
}
--
2.12.2
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply related
* Re: [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: NeilBrown @ 2017-04-05 4:31 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, linux-mm, LKML
In-Reply-To: <CACVXFVO54OseKKpZXEju9a+GWYkTFRt9qHT22zzcTjOqGnanmw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2519 bytes --]
On Tue, Apr 04 2017, Ming Lei wrote:
> On Mon, Apr 3, 2017 at 9:18 AM, NeilBrown <neilb@suse.com> wrote:
>>
>> When a filesystem is mounted from a loop device, writes are
>> throttled by balance_dirty_pages() twice: once when writing
>> to the filesystem and once when the loop_handle_cmd() writes
>> to the backing file. This double-throttling can trigger
>> positive feedback loops that create significant delays. The
>> throttling at the lower level is seen by the upper level as
>> a slow device, so it throttles extra hard.
>>
>> The PF_LESS_THROTTLE flag was created to handle exactly this
>> circumstance, though with an NFS filesystem mounted from a
>> local NFS server. It reduces the throttling on the lower
>> layer so that it can proceed largely unthrottled.
>>
>> To demonstrate this, create a filesystem on a loop device
>> and write (e.g. with dd) several large files which combine
>> to consume significantly more than the limit set by
>> /proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total
>> time taken.
>>
>> When I do this directly on a device (no loop device) the
>> total time for several runs (mkfs, mount, write 200 files,
>> umount) is fairly stable: 28-35 seconds.
>> When I do this over a loop device the times are much worse
>> and less stable. 52-460 seconds. Half below 100seconds,
>> half above.
>> When I apply this patch, the times become stable again,
>> though not as fast as the no-loop-back case: 53-72 seconds.
>>
>> There may be room for further improvement as the total overhead still
>> seems too high, but this is a big improvement.
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>> drivers/block/loop.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
>> index 0ecb6461ed81..a7e1dd215fc2 100644
>> --- a/drivers/block/loop.c
>> +++ b/drivers/block/loop.c
>> @@ -1694,8 +1694,11 @@ static void loop_queue_work(struct kthread_work *work)
>> {
>> struct loop_cmd *cmd =
>> container_of(work, struct loop_cmd, work);
>> + int oldflags = current->flags & PF_LESS_THROTTLE;
>>
>> + current->flags |= PF_LESS_THROTTLE;
>> loop_handle_cmd(cmd);
>> + current->flags = (current->flags & ~PF_LESS_THROTTLE) | oldflags;
>> }
>
> You can do it against 'lo->worker_task' instead of doing it in each
> loop_queue_work(),
> and this flag needn't to be restored because the kernel thread is loop
> specialized.
>
good point. I'll do that. Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [PATCH] loop: Add PF_LESS_THROTTLE to block/loop device thread.
From: NeilBrown @ 2017-04-05 4:27 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, linux-mm, LKML
In-Reply-To: <20170404071033.GA25855@infradead.org>
[-- Attachment #1: Type: text/plain, Size: 509 bytes --]
On Tue, Apr 04 2017, Christoph Hellwig wrote:
> Looks fine,
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> But if you actually care about performance in any way I'd suggest
> to use the loop device in direct I/O mode..
The losetup on my test VM is too old to support that :-(
I guess it might be time to upgraded.
It seems that there is not "mount -o direct_loop" or similar, so you
have to do the losetup and the mount separately. Any thoughts on
whether that should be changed ?
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: NeilBrown @ 2017-04-04 21:52 UTC (permalink / raw)
To: Michael Wang, linux-kernel@vger.kernel.org, linux-block,
linux-raid
Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <2a6f8c30-616c-d6fe-1c3f-ab687c145cd7@profitbricks.com>
[-- Attachment #1: Type: text/plain, Size: 4628 bytes --]
On Tue, Apr 04 2017, Michael Wang wrote:
> On 04/04/2017 11:37 AM, NeilBrown wrote:
>> On Tue, Apr 04 2017, Michael Wang wrote:
> [snip]
>>>>
>>>> If sync_request_write() is using a bio that has already been used, it
>>>> should call bio_reset() and fill in the details again.
>>>> However I don't see how that would happen.
>>>> Can you give specific details on the situation that triggers the bug?
>>>
>>> We have storage side mapping lv through scst to server, on server side
>>> we assemble them into multipath device, and then assemble these dm into
>>> two raid1.
>>>
>>> The test is firstly do mkfs.ext4 on raid1 then start fio on it, on storage
>>> side we unmap all the lv (could during mkfs or fio), then on server side
>>> we hit the BUG (reproducible).
>>
>> So I assume the initial resync is still happening at this point?
>> And you unmap *all* the lv's so you expect IO to fail?
>> I can see that the code would behave strangely if you have a
>> bad-block-list configured (which is the default).
>> Do you have a bbl? If you create the array without the bbl, does it
>> still crash?
>
> The resync is at least happen concurrently in this case, we try
> to simulate the case that all the connections dropped, the IO do
> failed, also bunch of kernel log like:
>
> md: super_written gets error=-5
> blk_update_request: I/O error, dev dm-3, sector 64184
> md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848
>
> we expect that to happen, but server should not crash on BUG.
>
> And we haven't done any thing special regarding bbl, the bad_blocks
> in sysfs are all empty.
>
>>
>>>
>>> The path of bio was confirmed by add tracing, it is reused in sync_request_write()
>>> with 'bi_next' once chained inside blk_attempt_plug_merge().
>>
>> I still don't see why it is re-used.
>> I assume you didn't explicitly ask for a check/repair (i.e. didn't write
>> to .../md/sync_action at all?). In that case MD_RECOVERY_REQUESTED is
>> not set.
>
> Just unmap lv on storage side, no operation on server side.
>
>> So sync_request() sends only one bio to generic_make_request():
>> r1_bio->bios[r1_bio->read_disk];
>>
>> then sync_request_write() *doesn't* send that bio again, but does send
>> all the others.
>>
>> So where does it reuse a bio?
>
> If that's the design then it would be strange... the log do showing the path
> of that bio go through sync_request(), will do more investigation.
>
>>
>>>
>>> We also tried to reset the bi_next inside sync_request_write() before
>>> generic_make_request() which also works.
>>>
>>> The testing was done with 4.4, but we found upstream also left bi_next
>>> chained after done in request, thus we post this RFC.
>>>
>>> Regarding raid1, we haven't found the place on path where the bio was
>>> reset... where does it supposed to be?
>>
>> I'm not sure what you mean.
>> We only reset bios when they are being reused.
>> One place is in process_checks() where bio_reset() is called before
>> filling in all the details.
>>
>>
>> Maybe, in sync_request_write(), before
>>
>> wbio->bi_rw = WRITE;
>>
>> add something like
>> if (wbio->bi_next)
>> printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>> i, r1_bio->read_disk, wbio->bi_end_io);
>>
>> that might help narrow down what is happening.
>
> Just triggered again in 4.4, dmesg like:
>
> [ 399.240230] md: super_written gets error=-5
> [ 399.240286] md: super_written gets error=-5
> [ 399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [ 399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]
It is very peculiar for r1_bio->bios[r1_bio->read_disk].bi_end_io to be
end_sync_write.
I can only see this happening when sync_request_write() assigns that,
and this printk is before there.
That seems to suggest that sync_request_write() is being performed on
the same r1_bio twice, which is also very peculiar.
I might have to try to reproduce this myself and see what is happening.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
^ permalink raw reply
* Re: [PATCH] Block SQ/MQ Request Priority
From: Jens Axboe @ 2017-04-04 21:40 UTC (permalink / raw)
To: adam.manzanares, linux-block, linux-kernel; +Cc: hch, Bart.VanAssche
In-Reply-To: <1491319514-6671-1-git-send-email-adam.manzanares@wdc.com>
On 04/04/2017 09:25 AM, adam.manzanares@wdc.com wrote:
> From: Adam Manzanares <adam.manzanares@wdc.com>
>
> In 4.10 I introduced a patch that associates the ioc priority with
> each request in the block layer. This work was done in the single queue
> block layer code. This patch unifies ioc priority to request mapping across
> the single/multi queue block layers.
>
> I have tested this patch with the null block device driver with the following
> parameters.
>
> null_blk queue_mode=2 irqmode=0 use_per_node_hctx=1 nr_devices=1
>
> I have not seen a performance regression with this patch and I would appreciate
> any feedback or additional testing.
>
> I have also verified that io priorities are passed to the device when using
> the SQ and MQ path to a SATA HDD that supports io priorities.
Applied, with an updated subject line, it didn't really describe what
the patch does. Thanks.
--
Jens Axboe
^ permalink raw reply
* [PATCH 6/9] T10: Move opencoded contants to common header
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491332201-26926-1-git-send-email-dmonakhov@openvz.org>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
block/t10-pi.c | 9 +++------
drivers/scsi/lpfc/lpfc_scsi.c | 5 +++--
drivers/scsi/qla2xxx/qla_isr.c | 8 ++++----
drivers/target/target_core_sbc.c | 2 +-
include/linux/t10-pi.h | 2 ++
5 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/block/t10-pi.c b/block/t10-pi.c
index 2c97912..485cecd 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -28,9 +28,6 @@
typedef __be16 (csum_fn) (void *, unsigned int);
-static const __be16 APP_ESCAPE = (__force __be16) 0xffff;
-static const __be32 REF_ESCAPE = (__force __be32) 0xffffffff;
-
static __be16 t10_pi_crc_fn(void *data, unsigned int len)
{
return cpu_to_be16(crc_t10dif(data, len));
@@ -82,7 +79,7 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, csum_fn *fn,
switch (type) {
case 1:
case 2:
- if (pi->app_tag == APP_ESCAPE)
+ if (pi->app_tag == T10_APP_ESCAPE)
goto next;
if (be32_to_cpu(pi->ref_tag) !=
@@ -95,8 +92,8 @@ static int t10_pi_verify(struct blk_integrity_iter *iter, csum_fn *fn,
}
break;
case 3:
- if (pi->app_tag == APP_ESCAPE &&
- pi->ref_tag == REF_ESCAPE)
+ if (pi->app_tag == T10_APP_ESCAPE &&
+ pi->ref_tag == T10_REF_ESCAPE)
goto next;
break;
}
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 54fd0c8..6f6b40e 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -26,6 +26,7 @@
#include <linux/export.h>
#include <linux/delay.h>
#include <asm/unaligned.h>
+#include <linux/t10-pi.h>
#include <linux/crc-t10dif.h>
#include <net/checksum.h>
@@ -2934,8 +2935,8 @@ lpfc_calc_bg_err(struct lpfc_hba *phba, struct lpfc_scsi_buf *lpfc_cmd)
* First check to see if a protection data
* check is valid
*/
- if ((src->ref_tag == 0xffffffff) ||
- (src->app_tag == 0xffff)) {
+ if ((src->ref_tag == T10_REF_ESCAPE) ||
+ (src->app_tag == T10_APP_ESCAPE)) {
start_ref_tag++;
goto skipit;
}
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 3203367..ed4b302 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -1950,9 +1950,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx *sts24)
* For type 3: ref & app tag is all 'f's
* For type 0,1,2: app tag is all 'f's
*/
- if ((a_app_tag == 0xffff) &&
+ if ((a_app_tag == T10_APP_ESCAPE) &&
((scsi_get_prot_type(cmd) != SCSI_PROT_DIF_TYPE3) ||
- (a_ref_tag == 0xffffffff))) {
+ (a_ref_tag == T10_REF_ESCAPE))) {
uint32_t blocks_done, resid;
sector_t lba_s = scsi_get_lba(cmd);
@@ -1994,9 +1994,9 @@ qla2x00_handle_dif_error(srb_t *sp, struct sts_entry_24xx *sts24)
spt = page_address(sg_page(sg)) + sg->offset;
spt += j;
- spt->app_tag = 0xffff;
+ spt->app_tag = T10_APP_ESCAPE;
if (scsi_get_prot_type(cmd) == SCSI_PROT_DIF_TYPE3)
- spt->ref_tag = 0xffffffff;
+ spt->ref_tag = T10_REF_ESCAPE;
}
return 0;
diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
index c194063..927ef44 100644
--- a/drivers/target/target_core_sbc.c
+++ b/drivers/target/target_core_sbc.c
@@ -1446,7 +1446,7 @@ sbc_dif_verify(struct se_cmd *cmd, sector_t start, unsigned int sectors,
(unsigned long long)sector, sdt->guard_tag,
sdt->app_tag, be32_to_cpu(sdt->ref_tag));
- if (sdt->app_tag == cpu_to_be16(0xffff)) {
+ if (sdt->app_tag == T10_APP_ESCAPE) {
dsg_off += block_size;
goto next;
}
diff --git a/include/linux/t10-pi.h b/include/linux/t10-pi.h
index 9fba9dd..f892442 100644
--- a/include/linux/t10-pi.h
+++ b/include/linux/t10-pi.h
@@ -33,6 +33,8 @@ struct t10_pi_tuple {
__be32 ref_tag; /* Target LBA or indirect LBA */
};
+#define T10_APP_ESCAPE cpu_to_be16(0xffff)
+#define T10_REF_ESCAPE cpu_to_be32(0xffffffff)
extern struct blk_integrity_profile t10_pi_type1_crc;
extern struct blk_integrity_profile t10_pi_type1_ip;
--
2.9.3
^ permalink raw reply related
* [PATCH 5/9] bio-integrity: fold bio_integrity_enabled to bio_integrity_prep
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491332201-26926-1-git-send-email-dmonakhov@openvz.org>
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore it's code and fail bio with EIO. Let's return real error to
upper layer, so later caller may react accordingly.
In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled,
so it is reasonable to fold it in to one function.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
Documentation/block/data-integrity.txt | 3 --
block/bio-integrity.c | 88 ++++++++++++++++++----------------
block/blk-core.c | 5 +-
block/blk-mq.c | 8 +---
drivers/nvdimm/blk.c | 13 +----
drivers/nvdimm/btt.c | 13 +----
include/linux/bio.h | 6 ---
7 files changed, 55 insertions(+), 81 deletions(-)
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
index f56ec97..37ac837 100644
--- a/Documentation/block/data-integrity.txt
+++ b/Documentation/block/data-integrity.txt
@@ -202,9 +202,6 @@ will require extra work due to the application tag.
added. It is up to the caller to ensure that the bio does not
change while I/O is in progress.
- bio_integrity_prep() should only be called if
- bio_integrity_enabled() returned 1.
-
5.3 PASSING EXISTING INTEGRITY METADATA
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 6170dad..f2b9f09 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -160,44 +160,6 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
EXPORT_SYMBOL(bio_integrity_add_page);
/**
- * bio_integrity_enabled - Check whether integrity can be passed
- * @bio: bio to check
- *
- * Description: Determines whether bio_integrity_prep() can be called
- * on this bio or not. bio data direction and target device must be
- * set prior to calling. The functions honors the write_generate and
- * read_verify flags in sysfs.
- */
-bool bio_integrity_enabled(struct bio *bio)
-{
- struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-
- if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
- return false;
-
- if (!bio_sectors(bio))
- return false;
-
- /* Already protected? */
- if (bio_integrity(bio))
- return false;
-
- if (bi == NULL)
- return false;
-
- if (bio_data_dir(bio) == READ && bi->profile->verify_fn != NULL &&
- (bi->flags & BLK_INTEGRITY_VERIFY))
- return true;
-
- if (bio_data_dir(bio) == WRITE && bi->profile->generate_fn != NULL &&
- (bi->flags & BLK_INTEGRITY_GENERATE))
- return true;
-
- return false;
-}
-EXPORT_SYMBOL(bio_integrity_enabled);
-
-/**
* bio_integrity_intervals - Return number of integrity intervals for a bio
* @bi: blk_integrity profile for device
* @sectors: Size of the bio in 512-byte sectors
@@ -259,7 +221,7 @@ static int bio_integrity_process(struct bio *bio,
}
/**
- * bio_integrity_prep - Prepare bio for integrity I/O
+ * bio_integrity_setup - Prepare bio for integrity I/O
* @bio: bio to prepare
*
* Description: Allocates a buffer for integrity metadata, maps the
@@ -269,7 +231,7 @@ static int bio_integrity_process(struct bio *bio,
* block device's integrity function. In the READ case, the buffer
* will be prepared for DMA and a suitable end_io handler set up.
*/
-int bio_integrity_prep(struct bio *bio)
+static int bio_integrity_setup(struct bio *bio)
{
struct bio_integrity_payload *bip;
struct blk_integrity *bi;
@@ -352,6 +314,52 @@ int bio_integrity_prep(struct bio *bio)
return 0;
}
+/**
+ * bio_integrity_prep - Prepare bio for integrity I/O
+ * @bio: bio to prepare
+ *
+ * Description: Checks if the bio already has an integrity payload attached.
+ * If it does, the payload has been generated by another kernel subsystem,
+ * and we just pass it through. Otherwise allocates integrity payload.
+ */
+
+int bio_integrity_prep(struct bio *bio)
+{
+ int err;
+ struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
+
+ if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
+ return 0;
+
+ if (!bio_sectors(bio))
+ return 0;
+
+ /* Already protected? */
+ if (bio_integrity(bio))
+ return 0;
+
+ if (bi == NULL)
+ return 0;
+
+ if (bio_data_dir(bio) == READ && bi->profile->verify_fn != NULL &&
+ (bi->flags & BLK_INTEGRITY_VERIFY))
+ goto need_prep;
+
+ if (bio_data_dir(bio) == WRITE && bi->profile->generate_fn != NULL &&
+ (bi->flags & BLK_INTEGRITY_GENERATE))
+ goto need_prep;
+
+ return 0;
+need_prep:
+ err = bio_integrity_setup(bio);
+ if (err) {
+ bio->bi_error = err;
+ bio_endio(bio);
+ }
+ return err;
+}
+
+
EXPORT_SYMBOL(bio_integrity_prep);
/**
diff --git a/block/blk-core.c b/block/blk-core.c
index d772c22..275c1e4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1637,11 +1637,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
blk_queue_split(q, &bio, q->bio_split);
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
- bio->bi_error = -EIO;
- bio_endio(bio);
+ if (bio_integrity_prep(bio))
return BLK_QC_T_NONE;
- }
if (op_is_flush(bio->bi_opf)) {
spin_lock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 08a49c6..b19cd92 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1489,10 +1489,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
blk_queue_bounce(q, &bio);
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
- bio_io_error(bio);
+ if (bio_integrity_prep(bio))
return BLK_QC_T_NONE;
- }
blk_queue_split(q, &bio, q->bio_split);
@@ -1611,10 +1609,8 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
blk_queue_bounce(q, &bio);
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
- bio_io_error(bio);
+ if (bio_integrity_prep(bio))
return BLK_QC_T_NONE;
- }
blk_queue_split(q, &bio, q->bio_split);
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 9faaa96..0b49336 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -179,16 +179,8 @@ static blk_qc_t nd_blk_make_request(struct request_queue *q, struct bio *bio)
int err = 0, rw;
bool do_acct;
- /*
- * bio_integrity_enabled also checks if the bio already has an
- * integrity payload attached. If it does, we *don't* do a
- * bio_integrity_prep here - the payload has been generated by
- * another kernel subsystem, and we just pass it through.
- */
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
- bio->bi_error = -EIO;
- goto out;
- }
+ if (bio_integrity_prep(bio))
+ return BLK_QC_T_NONE;
bip = bio_integrity(bio);
nsblk = q->queuedata;
@@ -212,7 +204,6 @@ static blk_qc_t nd_blk_make_request(struct request_queue *q, struct bio *bio)
if (do_acct)
nd_iostat_end(bio, start);
- out:
bio_endio(bio);
return BLK_QC_T_NONE;
}
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 368795a..3d7a9fe 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1158,16 +1158,8 @@ static blk_qc_t btt_make_request(struct request_queue *q, struct bio *bio)
int err = 0;
bool do_acct;
- /*
- * bio_integrity_enabled also checks if the bio already has an
- * integrity payload attached. If it does, we *don't* do a
- * bio_integrity_prep here - the payload has been generated by
- * another kernel subsystem, and we just pass it through.
- */
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
- bio->bi_error = -EIO;
- goto out;
- }
+ if (bio_integrity_prep(bio))
+ return BLK_QC_T_NONE;
do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
@@ -1194,7 +1186,6 @@ static blk_qc_t btt_make_request(struct request_queue *q, struct bio *bio)
if (do_acct)
nd_iostat_end(bio, start);
-out:
bio_endio(bio);
return BLK_QC_T_NONE;
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7f7bf37..65445c0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -727,7 +727,6 @@ struct biovec_slab {
extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
extern void bio_integrity_free(struct bio *);
extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, unsigned int);
-extern bool bio_integrity_enabled(struct bio *bio);
extern int bio_integrity_prep(struct bio *);
extern void bio_integrity_endio(struct bio *);
extern void bio_integrity_advance(struct bio *, unsigned int);
@@ -744,11 +743,6 @@ static inline void *bio_integrity(struct bio *bio)
return NULL;
}
-static inline bool bio_integrity_enabled(struct bio *bio)
-{
- return false;
-}
-
static inline int bioset_integrity_create(struct bio_set *bs, int pool_size)
{
return 0;
--
2.9.3
^ permalink raw reply related
* [PATCH 9/9] bio-integrity: Restore original iterator on verify stage
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491332201-26926-1-git-send-email-dmonakhov@openvz.org>
Currently ->verify_fn not woks at all because at the moment it is called
bio->bi_iter.bi_size == 0, so we do not iterate integrity bvecs at all.
In order to perform verification we need to know original data vector,
with new bvec rewind API this is trivial.
testcase: https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
block/bio-integrity.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index f2b9f09..f08096d 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -184,9 +184,10 @@ static inline unsigned int bio_integrity_bytes(struct blk_integrity *bi,
/**
* bio_integrity_process - Process integrity metadata for a bio
* @bio: bio to generate/verify integrity metadata for
+ * @proc_iter: iterator to process
* @proc_fn: Pointer to the relevant processing function
*/
-static int bio_integrity_process(struct bio *bio,
+static int bio_integrity_process(struct bio *bio, struct bvec_iter *proc_iter,
integrity_processing_fn *proc_fn)
{
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
@@ -200,10 +201,10 @@ static int bio_integrity_process(struct bio *bio,
iter.disk_name = bio->bi_bdev->bd_disk->disk_name;
iter.interval = 1 << bi->interval_exp;
- iter.seed = bip_get_seed(bip);
+ iter.seed = proc_iter->bi_sector;
iter.prot_buf = prot_buf;
- bio_for_each_segment(bv, bio, bviter) {
+ __bio_for_each_segment(bv, bio, bviter, *proc_iter) {
void *kaddr = kmap_atomic(bv.bv_page);
iter.data_buf = kaddr + bv.bv_offset;
@@ -310,7 +311,8 @@ static int bio_integrity_setup(struct bio *bio)
/* Auto-generate integrity metadata if this is a write */
if (bio_data_dir(bio) == WRITE)
- bio_integrity_process(bio, bi->profile->generate_fn);
+ bio_integrity_process(bio, &bio->bi_iter,
+ bi->profile->generate_fn);
return 0;
}
@@ -376,9 +378,15 @@ static void bio_integrity_verify_fn(struct work_struct *work)
container_of(work, struct bio_integrity_payload, bip_work);
struct bio *bio = bip->bip_bio;
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
-
- bio->bi_error = bio_integrity_process(bio, bi->profile->verify_fn);
-
+ struct bvec_iter iter = bio->bi_iter;
+
+ /* At the moment verify is called bio's iterator was advanced
+ * during split and completion, we need to rewind iterator to
+ * it's original position */
+ bio_rewind_iter(bio, &iter, iter.bi_done);
+ if (!bio->bi_error)
+ bio->bi_error = bio_integrity_process(bio, &iter,
+ bi->profile->verify_fn);
/* Restore original bio completion handler */
bio->bi_end_io = bip->bip_end_io;
bio_endio(bio);
--
2.9.3
^ permalink raw reply related
* [PATCH 8/9] bio: add bvec_iter rewind API
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491332201-26926-1-git-send-email-dmonakhov@openvz.org>
Some ->bi_end_io handlers (for example: pi_verify or decrypt handlers)
need to know original data vector, but after bio traverse io-stack it may
be advanced, splited and relocated many times so it is hard to guess
original iterator. Let's add 'bi_done' conter which accounts number
of bytes iterator was advanced during it's evolution. Later end_io handler
may easily restore original iterator by rewinding iterator to
iter->bi_done.
Note: this change makes sizeof (struct bvec_iter) multiple to 8
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
include/linux/bio.h | 21 +++++++++++++++++++--
include/linux/bvec.h | 26 ++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 2 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 12b7fcd..491fb0c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -166,9 +166,10 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
{
iter->bi_sector += bytes >> 9;
- if (bio_no_advance_iter(bio))
+ if (bio_no_advance_iter(bio)) {
iter->bi_size -= bytes;
- else {
+ iter->bi_done += bytes;
+ } else {
int err;
err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
if (unlikely(err))
@@ -176,6 +177,22 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
}
}
+static inline void bio_rewind_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes)
+{
+ iter->bi_sector -= bytes >> 9;
+
+ if (bio_no_advance_iter(bio)) {
+ iter->bi_size += bytes;
+ iter->bi_done -= bytes;
+ } else {
+ int err;
+ err = bvec_iter_rewind(bio->bi_io_vec, iter, bytes);
+ if (unlikely(err))
+ bio->bi_error = err;
+ }
+}
+
#define __bio_for_each_segment(bvl, bio, iter, start) \
for (iter = (start); \
(iter).bi_size && \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 984a7a8..35cb97b 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -40,6 +40,8 @@ struct bvec_iter {
unsigned int bi_idx; /* current index into bvl_vec */
+ unsigned int bi_done; /* number of bytes completed */
+
unsigned int bi_bvec_done; /* number of bytes completed in
current bvec */
};
@@ -84,6 +86,7 @@ static inline int bvec_iter_advance(const struct bio_vec *bv,
bytes -= len;
iter->bi_size -= len;
iter->bi_bvec_done += len;
+ iter->bi_done += len;
if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
iter->bi_bvec_done = 0;
@@ -93,6 +96,29 @@ static inline int bvec_iter_advance(const struct bio_vec *bv,
return 0;
}
+static inline int bvec_iter_rewind(const struct bio_vec *bv,
+ struct bvec_iter *iter,
+ unsigned bytes)
+{
+ while (bytes) {
+ unsigned len = min(bytes, iter->bi_bvec_done);
+ if (iter->bi_bvec_done == 0 ) {
+ if (WARN_ONCE(iter->bi_idx == 0,
+ "Attempted to rewind iter beyond "
+ "bvec's boundaries\n")) {
+ return -EINVAL;
+ }
+ iter->bi_idx--;
+ iter->bi_bvec_done = __bvec_iter_bvec(bv, *iter)->bv_len;
+ continue;
+ }
+ bytes -= len;
+ iter->bi_size += len;
+ iter->bi_bvec_done -= len;
+ }
+ return 0;
+}
+
#define for_each_bvec(bvl, bio_vec, iter, start) \
for (iter = (start); \
(iter).bi_size && \
--
2.9.3
^ permalink raw reply related
* [PATCH 7/9] Guard bvec iteration logic v3
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491332201-26926-1-git-send-email-dmonakhov@openvz.org>
Currently if some one try to advance bvec beyond it's size we simply
dump WARN_ONCE and continue to iterate beyond bvec array boundaries.
This simply means that we endup dereferencing/corrupting random memory
region.
Sane reaction would be to propagate error back to calling context
But bvec_iter_advance's calling context is not always good for error
handling. For safity reason let truncate iterator size to zero which
will break external iteration loop which prevent us from unpredictable
memory range corruption. And even it caller ignores an error, it will
corrupt it's own bvecs, not others.
This patch does:
- Return error back to caller with hope that it will react on this
- Truncate iterator size
Code was added long time ago here 4550dd6c, luckily no one hit it
in real life :)
changes since V1:
- Replace BUG_ON with error logic.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
drivers/nvdimm/blk.c | 4 +++-
drivers/nvdimm/btt.c | 4 +++-
include/linux/bio.h | 8 ++++++--
include/linux/bvec.h | 11 ++++++++---
4 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 0b49336..c82331b 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -106,7 +106,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk *nsblk,
len -= cur_len;
dev_offset += cur_len;
- bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+ err = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+ if (err)
+ return err;
}
return err;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 3d7a9fe..5a68681 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -942,7 +942,9 @@ static int btt_rw_integrity(struct btt *btt, struct bio_integrity_payload *bip,
len -= cur_len;
meta_nsoff += cur_len;
- bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+ ret = bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
+ if (ret)
+ return ret;
}
return ret;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 65445c0..12b7fcd 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -168,8 +168,12 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
- else
- bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+ else {
+ int err;
+ err = bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+ if (unlikely(err))
+ bio->bi_error = err;
+ }
}
#define __bio_for_each_segment(bvl, bio, iter, start) \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 89b65b8..984a7a8 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -22,6 +22,7 @@
#include <linux/kernel.h>
#include <linux/bug.h>
+#include <linux/errno.h>
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -66,12 +67,15 @@ struct bvec_iter {
.bv_offset = bvec_iter_offset((bvec), (iter)), \
})
-static inline void bvec_iter_advance(const struct bio_vec *bv,
+static inline int bvec_iter_advance(const struct bio_vec *bv,
struct bvec_iter *iter,
unsigned bytes)
{
- WARN_ONCE(bytes > iter->bi_size,
- "Attempted to advance past end of bvec iter\n");
+ if (WARN_ONCE(bytes > iter->bi_size,
+ "Attempted to advance past end of bvec iter\n")) {
+ iter->bi_size = 0;
+ return -EINVAL;
+ }
while (bytes) {
unsigned iter_len = bvec_iter_len(bv, *iter);
@@ -86,6 +90,7 @@ static inline void bvec_iter_advance(const struct bio_vec *bv,
iter->bi_idx++;
}
}
+ return 0;
}
#define for_each_bvec(bvl, bio_vec, iter, start) \
--
2.9.3
^ permalink raw reply related
* [PATCH 4/9] bio-integrity: fix interface for bio_integrity_trim v2
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491332201-26926-1-git-send-email-dmonakhov@openvz.org>
bio_integrity_trim inherent it's interface from bio_trim and accept
offset and size, but this API is error prone because data offset
must always be insync with bio's data offset. That is why we have
integrity update hook in bio_advance()
So only meaningful values are: offset == 0, sectors == bio_sectors(bio)
Let's just remove them completely.
changes from v1:
- remove 'sectors' arguments
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
block/bio-integrity.c | 11 ++---------
block/bio.c | 4 ++--
drivers/md/dm.c | 2 +-
include/linux/bio.h | 5 ++---
4 files changed, 7 insertions(+), 15 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 82a6ffb..6170dad 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -433,22 +433,15 @@ EXPORT_SYMBOL(bio_integrity_advance);
/**
* bio_integrity_trim - Trim integrity vector
* @bio: bio whose integrity vector to update
- * @offset: offset to first data sector
- * @sectors: number of data sectors
*
* Description: Used to trim the integrity vector in a cloned bio.
- * The ivec will be advanced corresponding to 'offset' data sectors
- * and the length will be truncated corresponding to 'len' data
- * sectors.
*/
-void bio_integrity_trim(struct bio *bio, unsigned int offset,
- unsigned int sectors)
+void bio_integrity_trim(struct bio *bio)
{
struct bio_integrity_payload *bip = bio_integrity(bio);
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
- bio_integrity_advance(bio, offset << 9);
- bip->bip_iter.bi_size = bio_integrity_bytes(bi, sectors);
+ bip->bip_iter.bi_size = bio_integrity_bytes(bi, bio_sectors(bio));
}
EXPORT_SYMBOL(bio_integrity_trim);
diff --git a/block/bio.c b/block/bio.c
index fa84323..65da4c9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1878,7 +1878,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
split->bi_iter.bi_size = sectors << 9;
if (bio_integrity(split))
- bio_integrity_trim(split, 0, sectors);
+ bio_integrity_trim(split);
bio_advance(bio, split->bi_iter.bi_size);
@@ -1909,7 +1909,7 @@ void bio_trim(struct bio *bio, int offset, int size)
bio->bi_iter.bi_size = size;
if (bio_integrity(bio))
- bio_integrity_trim(bio, 0, size);
+ bio_integrity_trim(bio);
}
EXPORT_SYMBOL_GPL(bio_trim);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb7597..25c5a8b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1102,7 +1102,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
clone->bi_iter.bi_size = to_bytes(len);
if (bio_integrity(bio))
- bio_integrity_trim(clone, 0, len);
+ bio_integrity_trim(clone);
return 0;
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e52119..7f7bf37 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -731,7 +731,7 @@ extern bool bio_integrity_enabled(struct bio *bio);
extern int bio_integrity_prep(struct bio *);
extern void bio_integrity_endio(struct bio *);
extern void bio_integrity_advance(struct bio *, unsigned int);
-extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
+extern void bio_integrity_trim(struct bio *);
extern int bio_integrity_clone(struct bio *, struct bio *, gfp_t);
extern int bioset_integrity_create(struct bio_set *, int);
extern void bioset_integrity_free(struct bio_set *);
@@ -781,8 +781,7 @@ static inline void bio_integrity_advance(struct bio *bio,
return;
}
-static inline void bio_integrity_trim(struct bio *bio, unsigned int offset,
- unsigned int sectors)
+static inline void bio_integrity_trim(struct bio *bio)
{
return;
}
--
2.9.3
^ permalink raw reply related
* [PATCH 3/9] bio-integrity: bio_integrity_advance must update integrity seed
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491332201-26926-1-git-send-email-dmonakhov@openvz.org>
SCSI drivers do care about bip_seed so we must update it accordingly.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
block/bio-integrity.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index b5009a8..82a6ffb 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -425,6 +425,7 @@ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9);
+ bip->bip_iter.bi_sector += bytes_done >> 9;
bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
}
EXPORT_SYMBOL(bio_integrity_advance);
--
2.9.3
^ permalink raw reply related
* [PATCH 2/9] bio-integrity: bio_trim should truncate integrity vector accordingly
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491332201-26926-1-git-send-email-dmonakhov@openvz.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
block/bio.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/block/bio.c b/block/bio.c
index e75878f..fa84323 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1907,6 +1907,10 @@ void bio_trim(struct bio *bio, int offset, int size)
bio_advance(bio, offset << 9);
bio->bi_iter.bi_size = size;
+
+ if (bio_integrity(bio))
+ bio_integrity_trim(bio, 0, size);
+
}
EXPORT_SYMBOL_GPL(bio_trim);
--
2.9.3
^ permalink raw reply related
* [PATCH 1/9] bio-integrity: Do not allocate integrity context for bio w/o data
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
In-Reply-To: <1491332201-26926-1-git-send-email-dmonakhov@openvz.org>
If bio has no data, such as ones from blkdev_issue_flush(),
then we have nothing to protect.
This patch prevent bugon like follows:
kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
kernel BUG at mm/slab.c:2773!
invalid opcode: 0000 [#1] SMP
Modules linked in: bcache
CPU: 0 PID: 4428 Comm: xfs_io Tainted: G W 4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
task: ffff880137786440 task.stack: ffffc90000ba8000
RIP: 0010:kfree_debugcheck+0x25/0x2a
RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
FS: 00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
Call Trace:
kfree+0xc8/0x1b3
bio_integrity_free+0xc3/0x16b
bio_free+0x25/0x66
bio_put+0x14/0x26
blkdev_issue_flush+0x7a/0x85
blkdev_fsync+0x35/0x42
vfs_fsync_range+0x8e/0x9f
vfs_fsync+0x1c/0x1e
do_fsync+0x31/0x4a
SyS_fsync+0x10/0x14
entry_SYSCALL_64_fastpath+0x1f/0xc2
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
block/bio-integrity.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713..b5009a8 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -175,6 +175,9 @@ bool bio_integrity_enabled(struct bio *bio)
if (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE)
return false;
+ if (!bio_sectors(bio))
+ return false;
+
/* Already protected? */
if (bio_integrity(bio))
return false;
--
2.9.3
^ permalink raw reply related
* [PATCH 0/9] block: T10/DIF Fixes and cleanups v3
From: Dmitry Monakhov @ 2017-04-04 18:56 UTC (permalink / raw)
To: linux-kernel, linux-block, martin.petersen; +Cc: Dmitry Monakhov
This patch set fix various problems spotted during T10/DIF integrity machinery testing.
TOC:
## Fix various bugs in T10/DIF/DIX infrastructure
0001-bio-integrity-Do-not-allocate-integrity-context-for
0002-bio-integrity-bio_trim-should-truncate-integrity-vec
0003-bio-integrity-bio_integrity_advance-must-update-inte
0004-bio-integrity-fix-interface-for-bio_integrity_trim
0005-bio-integrity-fold-bio_integrity_enabled-to-bio_integrity
## Cleanup T10/DIF/DIX infrastructure
0006-T10-Move-opencoded-contants-to-common-header
0007-Guard-bvec-iteration-logic-v3
0008-bio-add-bvec_iter-rewind-API
0009-bio-integrity-Restore-original-iterator-on-verify
Changes since V2
- Cleanups in response to commens from axboe@ and hch@
- Use rewind bvec api to restore original vector on verify
- fix one more bug with bip_seed update on bio split.
Changes since V1
- fix issues potted by kbuild bot
- Replace BUG_ON with error logic for 7'th patch
Testcase: xfstest blockdev/003
https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85
^ permalink raw reply
* Re: [PATCH 6/8] nowait aio: ext4
From: Goldwyn Rodrigues @ 2017-04-04 18:41 UTC (permalink / raw)
To: Christoph Hellwig, Jan Kara
Cc: linux-fsdevel, jack, linux-block, linux-btrfs, linux-ext4,
linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
Goldwyn Rodrigues
In-Reply-To: <20170404084122.GA10252@infradead.org>
On 04/04/2017 03:41 AM, Christoph Hellwig wrote:
> On Tue, Apr 04, 2017 at 09:58:53AM +0200, Jan Kara wrote:
>> FS_NOWAIT looks a bit too generic given these are filesystem feature flags.
>> Can we call it FS_NOWAIT_IO?
>
> It's way to generic as it's a feature of the particular file_operations
> instance. But once we switch to using RWF_* we can just the existing
> per-op feature checks for thos and the per-fs flag should just go away.
>
I am working on incorporating RWF_* flags. However, I am not sure how
RWF_* flags would get rid of FS_NOWAIT/FS_NOWAIT_IO. Since most of
"blocking" information is with the filesystem, it is a per-filesystem
flag to block out (EOPNOTSUPP) the filesystems which do not support it.
--
Goldwyn
^ permalink raw reply
* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Hannes Reinecke @ 2017-04-04 17:10 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
Christoph Hellwig, Bart van Assche, linux-block, Linux SCSI List
In-Reply-To: <CACVXFVPyWL8XyFKC9uT7r4q+QT1+_RvnvJnOskCO3aSnUEVY1w@mail.gmail.com>
On 04/04/2017 05:59 PM, Ming Lei wrote:
> On Tue, Apr 4, 2017 at 8:07 PM, Hannes Reinecke <hare@suse.de> wrote:
>> Hi all,
>>
>> as discussed recently most existing HBAs have a host-wide tagset which
>> does not map easily onto the per-queue tagset model of block mq.
>> This patchset implements a flag BLK_MQ_F_GLOBAL_TAGS for block-mq, which
>> enables the use of a shared tagset for all hardware queues.
>> The second patch adds a flag 'host_tagset' to the SCSI host template,
>> which allows drivers to enable the use of the global tagset.
>>
>> This patchset probably has some performance implications as
>> there is a quite high probability of cache-bouncing when allocating
>> tags. Also I'm not quite sure if the implemented tagset sharing
>> is the correct way to handle things.
>> So this can be considered an RFC.
>>
>> As usual, comments and reviews are welcome.
>
> Just be curious, why is multi hard queue used for this case? Are there
> some real cases in SCSI?
>
Yes.
This is required by basically every driver in drivers/scsi which would
in theory be able to support scsi-mq (ie lpfc, qla2xxx, mpt3sas, and
possibly fnic). Each of them support several submission/completion
queues, but every one of them has a host-wide tag map, too.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
^ permalink raw reply
* Re: [PATCH 0/5] blk-mq: scheduler and hw queue initialization fixes/enhancements
From: Josef Bacik @ 2017-04-04 17:09 UTC (permalink / raw)
To: Omar Sandoval; +Cc: Jens Axboe, linux-block, kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>
> On Apr 3, 2017, at 5:42 PM, Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> Hi, Jens,
>
> This series has some fixes and enhancements for blk-mq:
>
> - Patch 1 is a cleanup in preparation for the rest of the series
> - Patch 2 is a fix necessary for patch 4 when scheduling is enabled,
> making sure we bring up new hardware queues with scheduler tags
> - Patch 3 makes error handling in elevator_switch() more robust, making
> us fall back to none like you recommended last time
> - Patch 4 is the remap fix from last week
> - Patch 5 is an extension of patch 2 for multiqueue schedulers that
> allocate per-hctx data. Nothing in-tree needs it, but Kyber will.
>
> Let me know if you'd prefer deferring this to 4.12 or want to apply 1-4
> for 4.11. These are based on block/for-next, so the latter might require
> a respin.
Tested with nbd, didn't crash, you can add
Tested-by: Josef Bacik <jbacik@fb.com>
Thanks,
Josef
^ permalink raw reply
* Re: [PATCH 0/10 v5] block: Fix block device shutdown related races
From: Thiago Jung Bauermann @ 2017-04-04 17:09 UTC (permalink / raw)
To: Jan Kara
Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
Tejun Heo, Tahsin Erdogan, Omar Sandoval, Lekshmi C. Pillai
In-Reply-To: <20170323003702.27571-1-jack@suse.cz>
Hello,
Am Donnerstag, 23. M=E4rz 2017, 01:36:52 BRT schrieb Jan Kara:
> this is a series with the remaining patches (on top of 4.11-rc2) to fix
> several different races and issues I've found when testing device shutdown
> and reuse. The first patch fixes possible (theoretical) problems when
> opening of a block device races with shutdown of a gendisk structure.
> Patches 2-8 fix oops that is triggered by __blkdev_put() calling
> inode_detach_wb() too early (the problem reported by Thiago). Patches 9 a=
nd
> 10 fix oops due to a bug in gendisk code where get_gendisk() can return
> already freed gendisk structure (again triggered by Omar's stress test).
>=20
> All patches got reviewed by Tejun and also tested by Thiago (thanks!). Je=
ns,
> can you please queue these fixes for the next merge window? Thanks!
Lekshmi tested these patches on top of v4.11-rc4 and they worked fine.
Sorry for the delay, it takes at least 72h of running time to be sure we're=
=20
not hitting the race condition, and I had trouble with v4.11-rc2.
=2D-=20
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* Re: [RFC PATCH 0/2] block,scsi: support host-wide tagset
From: Hannes Reinecke @ 2017-04-04 17:06 UTC (permalink / raw)
To: Omar Sandoval
Cc: Jens Axboe, Omar Sandoval, Martin K. Petersen, James Bottomley,
Christoph Hellwig, Bart van Assche, linux-block, linux-scsi
In-Reply-To: <20170404153234.GA3234@vader>
On 04/04/2017 05:32 PM, Omar Sandoval wrote:
> On Tue, Apr 04, 2017 at 02:07:43PM +0200, Hannes Reinecke wrote:
>> Hi all,
>>
>> as discussed recently most existing HBAs have a host-wide tagset which
>> does not map easily onto the per-queue tagset model of block mq.
>> This patchset implements a flag BLK_MQ_F_GLOBAL_TAGS for block-mq, which
>> enables the use of a shared tagset for all hardware queues.
>> The second patch adds a flag 'host_tagset' to the SCSI host template,
>> which allows drivers to enable the use of the global tagset.
>>
>> This patchset probably has some performance implications as
>> there is a quite high probability of cache-bouncing when allocating
>> tags. Also I'm not quite sure if the implemented tagset sharing
>> is the correct way to handle things.
>> So this can be considered an RFC.
>>
>> As usual, comments and reviews are welcome.
>
> Hi, Hannes,
>
> blk-mq already supports a shared tagset, and scsi-mq already uses that.
> When we initialize a request queue, we add it to a tagset with
> blk_mq_add_queue_set(), where we automatically mark the tagset as shared
> if there is more than one queue using it. What does this do that
> BLK_MQ_F_TAG_SHARED doesn't cover?
>
This is orthogonal to the BLK_MQ_F_TAG_SHARED flag.
That flag is covering the case for several block devices sharing the
same tagset.
My new flag is covering the case where several _hardware_ queues are
sharing the same tagset.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: J. Hawn, J. Guild, F. Imend�rffer, HRB 16746 (AG N�rnberg)
^ permalink raw reply
* [PATCH v2 5/5] blk-mq: introduce Kyber multiqueue I/O scheduler
From: Omar Sandoval @ 2017-04-04 16:49 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491324365.git.osandov@fb.com>
From: Omar Sandoval <osandov@fb.com>
The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
scale to multiple queues. Users configure only two knobs, the target
read and synchronous write latencies, and the scheduler tunes itself to
achieve that latency goal.
The implementation is based on "tokens", built on top of the scalable
bitmap library. Tokens serve as a mechanism for limiting requests. There
are two tiers of tokens: queueing tokens and dispatch tokens.
A queueing token is required to allocate a request. In fact, these
tokens are actually the blk-mq internal scheduler tags, but the
scheduler manages the allocation directly in order to implement its
policy.
Dispatch tokens are device-wide and split up into two scheduling
domains: reads vs. writes. Each hardware queue dispatches batches
round-robin between the scheduling domains as long as tokens are
available for that domain.
These tokens can be used as the mechanism to enable various policies.
The policy Kyber uses is inspired by active queue management techniques
for network routing, similar to blk-wbt. The scheduler monitors
latencies and scales the number of dispatch tokens accordingly. Queueing
tokens are used to prevent starvation of synchronous requests by
asynchronous requests.
Various extensions are possible, including better heuristics and ionice
support. The new scheduler isn't set as the default yet.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Documentation/block/kyber-iosched.txt | 14 +
block/Kconfig.iosched | 9 +
block/Makefile | 1 +
block/kyber-iosched.c | 706 ++++++++++++++++++++++++++++++++++
4 files changed, 730 insertions(+)
create mode 100644 Documentation/block/kyber-iosched.txt
create mode 100644 block/kyber-iosched.c
diff --git a/Documentation/block/kyber-iosched.txt b/Documentation/block/kyber-iosched.txt
new file mode 100644
index 000000000000..e94feacd7edc
--- /dev/null
+++ b/Documentation/block/kyber-iosched.txt
@@ -0,0 +1,14 @@
+Kyber I/O scheduler tunables
+===========================
+
+The only two tunables for the Kyber scheduler are the target latencies for
+reads and synchronous writes. Kyber will throttle requests in order to meet
+these target latencies.
+
+read_lat_nsec
+-------------
+Target latency for reads (in nanoseconds).
+
+write_lat_nsec
+--------------
+Target latency for synchronous writes (in nanoseconds).
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 58fc8684788d..916e69c68fa4 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -69,6 +69,15 @@ config MQ_IOSCHED_DEADLINE
---help---
MQ version of the deadline IO scheduler.
+config MQ_IOSCHED_KYBER
+ tristate "Kyber I/O scheduler"
+ default y
+ ---help---
+ The Kyber I/O scheduler is a low-overhead scheduler suitable for
+ multiqueue and other fast devices. Given target latencies for reads and
+ synchronous writes, it will self-tune queue depths to achieve that
+ goal.
+
endmenu
endif
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..6146d2eaaeaa 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
+obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
new file mode 100644
index 000000000000..30253d0758aa
--- /dev/null
+++ b/block/kyber-iosched.c
@@ -0,0 +1,706 @@
+/*
+ * The Kyber I/O scheduler. Controls latency by throttling queue depths using
+ * scalable techniques.
+ *
+ * Copyright (C) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <https://www.gnu.org/licenses/>.
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/module.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+#include "blk-stat.h"
+
+/* Scheduling domains. */
+enum {
+ KYBER_READ,
+ KYBER_SYNC_WRITE,
+ KYBER_OTHER, /* Async writes, discard, etc. */
+ KYBER_NUM_DOMAINS,
+};
+
+enum {
+ KYBER_MIN_DEPTH = 256,
+
+ /*
+ * In order to prevent starvation of synchronous requests by a flood of
+ * asynchronous requests, we reserve 25% of requests for synchronous
+ * operations.
+ */
+ KYBER_ASYNC_PERCENT = 75,
+};
+
+/*
+ * Initial device-wide depths for each scheduling domain.
+ *
+ * Even for fast devices with lots of tags like NVMe, you can saturate
+ * the device with only a fraction of the maximum possible queue depth.
+ * So, we cap these to a reasonable value.
+ */
+static const unsigned int kyber_depth[] = {
+ [KYBER_READ] = 256,
+ [KYBER_SYNC_WRITE] = 128,
+ [KYBER_OTHER] = 64,
+};
+
+/*
+ * Scheduling domain batch sizes. We favor reads.
+ */
+static const unsigned int kyber_batch_size[] = {
+ [KYBER_READ] = 16,
+ [KYBER_SYNC_WRITE] = 8,
+ [KYBER_OTHER] = 8,
+};
+
+struct kyber_queue_data {
+ struct request_queue *q;
+
+ struct blk_stat_callback *cb;
+
+ /*
+ * The device is divided into multiple scheduling domains based on the
+ * request type. Each domain has a fixed number of in-flight requests of
+ * that type device-wide, limited by these tokens.
+ */
+ struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
+
+ /*
+ * Async request percentage, converted to per-word depth for
+ * sbitmap_get_shallow().
+ */
+ unsigned int async_depth;
+
+ /* Target latencies in nanoseconds. */
+ u64 read_lat_nsec, write_lat_nsec;
+};
+
+struct kyber_hctx_data {
+ spinlock_t lock;
+ struct list_head rqs[KYBER_NUM_DOMAINS];
+ int cur_domain;
+ unsigned int batching;
+};
+
+static unsigned int rq_sched_domain(const struct request *rq)
+{
+ unsigned int op = rq->cmd_flags;
+
+ if ((op & REQ_OP_MASK) == REQ_OP_READ)
+ return KYBER_READ;
+ else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op))
+ return KYBER_SYNC_WRITE;
+ else
+ return KYBER_OTHER;
+}
+
+enum {
+ NONE = 0,
+ GOOD = 1,
+ GREAT = 2,
+ BAD = -1,
+ AWFUL = -2,
+};
+
+#define IS_GOOD(status) ((status) > 0)
+#define IS_BAD(status) ((status) < 0)
+
+static int kyber_lat_status(struct blk_stat_callback *cb,
+ unsigned int sched_domain, u64 target)
+{
+ u64 latency;
+
+ if (!cb->stat[sched_domain].nr_samples)
+ return NONE;
+
+ latency = cb->stat[sched_domain].mean;
+ if (latency >= 2 * target)
+ return AWFUL;
+ else if (latency > target)
+ return BAD;
+ else if (latency <= target / 2)
+ return GREAT;
+ else /* (latency <= target) */
+ return GOOD;
+}
+
+/*
+ * Adjust the read or synchronous write depth given the status of reads and
+ * writes. The goal is that the latencies of the two domains are fair (i.e., if
+ * one is good, then the other is good).
+ */
+static void kyber_adjust_rw_depth(struct kyber_queue_data *kqd,
+ unsigned int sched_domain, int this_status,
+ int other_status)
+{
+ unsigned int orig_depth, depth;
+
+ /*
+ * If this domain had no samples, or reads and writes are both good or
+ * both bad, don't adjust the depth.
+ */
+ if (this_status == NONE ||
+ (IS_GOOD(this_status) && IS_GOOD(other_status)) ||
+ (IS_BAD(this_status) && IS_BAD(other_status)))
+ return;
+
+ orig_depth = depth = kqd->domain_tokens[sched_domain].sb.depth;
+
+ if (other_status == NONE) {
+ depth++;
+ } else {
+ switch (this_status) {
+ case GOOD:
+ if (other_status == AWFUL)
+ depth -= max(depth / 4, 1U);
+ else
+ depth -= max(depth / 8, 1U);
+ break;
+ case GREAT:
+ if (other_status == AWFUL)
+ depth /= 2;
+ else
+ depth -= max(depth / 4, 1U);
+ break;
+ case BAD:
+ depth++;
+ break;
+ case AWFUL:
+ if (other_status == GREAT)
+ depth += 2;
+ else
+ depth++;
+ break;
+ }
+ }
+
+ depth = clamp(depth, 1U, kyber_depth[sched_domain]);
+ if (depth != orig_depth)
+ sbitmap_queue_resize(&kqd->domain_tokens[sched_domain], depth);
+}
+
+/*
+ * Adjust the depth of other requests given the status of reads and synchronous
+ * writes. As long as either domain is doing fine, we don't throttle, but if
+ * both domains are doing badly, we throttle heavily.
+ */
+static void kyber_adjust_other_depth(struct kyber_queue_data *kqd,
+ int read_status, int write_status,
+ bool have_samples)
+{
+ unsigned int orig_depth, depth;
+ int status;
+
+ orig_depth = depth = kqd->domain_tokens[KYBER_OTHER].sb.depth;
+
+ if (read_status == NONE && write_status == NONE) {
+ depth += 2;
+ } else if (have_samples) {
+ if (read_status == NONE)
+ status = write_status;
+ else if (write_status == NONE)
+ status = read_status;
+ else
+ status = max(read_status, write_status);
+ switch (status) {
+ case GREAT:
+ depth += 2;
+ break;
+ case GOOD:
+ depth++;
+ break;
+ case BAD:
+ depth -= max(depth / 4, 1U);
+ break;
+ case AWFUL:
+ depth /= 2;
+ break;
+ }
+ }
+
+ depth = clamp(depth, 1U, kyber_depth[KYBER_OTHER]);
+ if (depth != orig_depth)
+ sbitmap_queue_resize(&kqd->domain_tokens[KYBER_OTHER], depth);
+}
+
+/*
+ * Apply heuristics for limiting queue depths based on gathered latency
+ * statistics.
+ */
+static void kyber_stat_timer_fn(struct blk_stat_callback *cb)
+{
+ struct kyber_queue_data *kqd = cb->data;
+ int read_status, write_status;
+
+ read_status = kyber_lat_status(cb, KYBER_READ, kqd->read_lat_nsec);
+ write_status = kyber_lat_status(cb, KYBER_SYNC_WRITE, kqd->write_lat_nsec);
+
+ kyber_adjust_rw_depth(kqd, KYBER_READ, read_status, write_status);
+ kyber_adjust_rw_depth(kqd, KYBER_SYNC_WRITE, write_status, read_status);
+ kyber_adjust_other_depth(kqd, read_status, write_status,
+ cb->stat[KYBER_OTHER].nr_samples != 0);
+
+ /*
+ * Continue monitoring latencies if we aren't hitting the targets or
+ * we're still throttling other requests.
+ */
+ if (!blk_stat_is_active(kqd->cb) &&
+ ((IS_BAD(read_status) || IS_BAD(write_status) ||
+ kqd->domain_tokens[KYBER_OTHER].sb.depth < kyber_depth[KYBER_OTHER])))
+ blk_stat_activate_msecs(kqd->cb, 100);
+}
+
+static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd)
+{
+ /*
+ * All of the hardware queues have the same depth, so we can just grab
+ * the shift of the first one.
+ */
+ return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
+}
+
+static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
+{
+ struct kyber_queue_data *kqd;
+ unsigned int max_tokens;
+ unsigned int shift;
+ int ret = -ENOMEM;
+ int i;
+
+ kqd = kmalloc_node(sizeof(*kqd), GFP_KERNEL, q->node);
+ if (!kqd)
+ goto err;
+ kqd->q = q;
+
+ kqd->cb = blk_stat_alloc_callback(kyber_stat_timer_fn, rq_sched_domain,
+ KYBER_NUM_DOMAINS, kqd);
+ if (!kqd->cb)
+ goto err_kqd;
+
+ /*
+ * The maximum number of tokens for any scheduling domain is at least
+ * the queue depth of a single hardware queue. If the hardware doesn't
+ * have many tags, still provide a reasonable number.
+ */
+ max_tokens = max_t(unsigned int, q->tag_set->queue_depth,
+ KYBER_MIN_DEPTH);
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+ WARN_ON(!kyber_depth[i]);
+ WARN_ON(!kyber_batch_size[i]);
+ ret = sbitmap_queue_init_node(&kqd->domain_tokens[i],
+ max_tokens, -1, false, GFP_KERNEL,
+ q->node);
+ if (ret) {
+ while (--i >= 0)
+ sbitmap_queue_free(&kqd->domain_tokens[i]);
+ goto err_cb;
+ }
+ sbitmap_queue_resize(&kqd->domain_tokens[i], kyber_depth[i]);
+ }
+
+ shift = kyber_sched_tags_shift(kqd);
+ kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U;
+
+ kqd->read_lat_nsec = 2000000ULL;
+ kqd->write_lat_nsec = 10000000ULL;
+
+ return kqd;
+
+err_cb:
+ blk_stat_free_callback(kqd->cb);
+err_kqd:
+ kfree(kqd);
+err:
+ return ERR_PTR(ret);
+}
+
+static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+ struct kyber_queue_data *kqd;
+ struct elevator_queue *eq;
+
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
+ kqd = kyber_queue_data_alloc(q);
+ if (IS_ERR(kqd)) {
+ kobject_put(&eq->kobj);
+ return PTR_ERR(kqd);
+ }
+
+ eq->elevator_data = kqd;
+ q->elevator = eq;
+
+ blk_stat_add_callback(q, kqd->cb);
+
+ return 0;
+}
+
+static void kyber_exit_sched(struct elevator_queue *e)
+{
+ struct kyber_queue_data *kqd = e->elevator_data;
+ struct request_queue *q = kqd->q;
+ int i;
+
+ blk_stat_remove_callback(q, kqd->cb);
+
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++)
+ sbitmap_queue_free(&kqd->domain_tokens[i]);
+ blk_stat_free_callback(kqd->cb);
+ kfree(kqd);
+}
+
+static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+ struct kyber_hctx_data *khd;
+ int i;
+
+ khd = kmalloc_node(sizeof(*khd), GFP_KERNEL, hctx->numa_node);
+ if (!khd)
+ return -ENOMEM;
+
+ spin_lock_init(&khd->lock);
+
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++)
+ INIT_LIST_HEAD(&khd->rqs[i]);
+
+ khd->cur_domain = 0;
+ khd->batching = 0;
+
+ hctx->sched_data = khd;
+
+ return 0;
+}
+
+static void kyber_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+ kfree(hctx->sched_data);
+}
+
+static int kyber_get_domain_token(struct kyber_queue_data *kqd,
+ unsigned int sched_domain)
+{
+ struct sbitmap_queue *domain_tokens;
+
+ domain_tokens = &kqd->domain_tokens[sched_domain];
+ return __sbitmap_queue_get(domain_tokens);
+}
+
+static int rq_get_domain_token(struct request *rq)
+{
+ return (long)rq->elv.priv[0];
+}
+
+static void rq_set_domain_token(struct request *rq, int token)
+{
+ rq->elv.priv[0] = (void *)(long)token;
+}
+
+static void rq_clear_domain_token(struct kyber_queue_data *kqd,
+ struct request *rq)
+{
+ unsigned int sched_domain;
+ int nr;
+
+ nr = rq_get_domain_token(rq);
+ if (nr != -1) {
+ sched_domain = rq_sched_domain(rq);
+ sbitmap_queue_clear(&kqd->domain_tokens[sched_domain], nr,
+ rq->mq_ctx->cpu);
+ }
+}
+
+static struct request *kyber_get_request(struct request_queue *q,
+ unsigned int op,
+ struct blk_mq_alloc_data *data)
+{
+ struct kyber_queue_data *kqd = q->elevator->elevator_data;
+ struct request *rq;
+
+ /*
+ * We use the scheduler tags as per-hardware queue queueing tokens.
+ * Async requests can be limited at this stage.
+ */
+ if (!op_is_sync(op))
+ data->shallow_depth = kqd->async_depth;
+
+ rq = __blk_mq_alloc_request(data, op);
+ if (rq)
+ rq_set_domain_token(rq, -1);
+ return rq;
+}
+
+static void kyber_put_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct kyber_queue_data *kqd = q->elevator->elevator_data;
+
+ rq_clear_domain_token(kqd, rq);
+ blk_mq_finish_request(rq);
+}
+
+static void kyber_completed_request(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct kyber_queue_data *kqd = q->elevator->elevator_data;
+ unsigned int sched_domain;
+ u64 now, latency, target;
+
+ /*
+ * Check if this request met our latency goal. If not, quickly gather
+ * some statistics and start throttling.
+ */
+ sched_domain = rq_sched_domain(rq);
+ switch (sched_domain) {
+ case KYBER_READ:
+ target = kqd->read_lat_nsec;
+ break;
+ case KYBER_SYNC_WRITE:
+ target = kqd->write_lat_nsec;
+ break;
+ default:
+ return;
+ }
+
+ /* If we are already monitoring latencies, don't check again. */
+ if (blk_stat_is_active(kqd->cb))
+ return;
+
+ now = __blk_stat_time(ktime_to_ns(ktime_get()));
+ if (now < blk_stat_time(&rq->issue_stat))
+ return;
+
+ latency = now - blk_stat_time(&rq->issue_stat);
+
+ if (latency > target)
+ blk_stat_activate_msecs(kqd->cb, 10);
+}
+
+static void kyber_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx,
+ struct kyber_hctx_data *khd)
+{
+ LIST_HEAD(rq_list);
+ struct request *rq, *next;
+
+ blk_mq_flush_busy_ctxs(hctx, &rq_list);
+ list_for_each_entry_safe(rq, next, &rq_list, queuelist) {
+ unsigned int sched_domain;
+
+ sched_domain = rq_sched_domain(rq);
+ list_move_tail(&rq->queuelist, &khd->rqs[sched_domain]);
+ }
+}
+
+static struct request *
+kyber_dispatch_cur_domain(struct blk_mq_hw_ctx *hctx,
+ struct kyber_queue_data *kqd,
+ struct kyber_hctx_data *khd,
+ bool *flushed, bool *no_tokens)
+{
+ struct list_head *rqs;
+ struct request *rq;
+ int nr;
+
+ rqs = &khd->rqs[khd->cur_domain];
+ rq = list_first_entry_or_null(rqs, struct request, queuelist);
+
+ /*
+ * If there wasn't already a pending request and we haven't flushed the
+ * software queues yet, flush the software queues and check again.
+ */
+ if (!rq && !*flushed) {
+ kyber_flush_busy_ctxs(hctx, khd);
+ *flushed = true;
+ rq = list_first_entry_or_null(rqs, struct request, queuelist);
+ }
+
+ if (rq) {
+ nr = kyber_get_domain_token(kqd, khd->cur_domain);
+ if (nr == -1) {
+ *no_tokens = true;
+ } else {
+ khd->batching++;
+ rq_set_domain_token(rq, nr);
+ list_del_init(&rq->queuelist);
+ return rq;
+ }
+ }
+
+ /* There were either no pending requests or no tokens. */
+ return NULL;
+}
+
+/*
+ * Returns a request on success, NULL if there were no requests to dispatch, and
+ * ERR_PTR(-EBUSY) if there were requests to dispatch but no domain tokens for
+ * them.
+ */
+static struct request *__kyber_dispatch_request(struct kyber_queue_data *kqd,
+ struct kyber_hctx_data *khd,
+ struct blk_mq_hw_ctx *hctx)
+{
+ bool flushed = false, no_tokens = false;
+ struct request *rq;
+ int i;
+
+ /*
+ * First, if we are still entitled to batch, try to dispatch a request
+ * from the batch.
+ */
+ if (khd->batching < kyber_batch_size[khd->cur_domain]) {
+ rq = kyber_dispatch_cur_domain(hctx, kqd, khd, &flushed,
+ &no_tokens);
+ if (rq)
+ return rq;
+ }
+
+ /*
+ * Either,
+ * 1. We were no longer entitled to a batch.
+ * 2. The domain we were batching didn't have any requests.
+ * 3. The domain we were batching was out of tokens.
+ *
+ * Start another batch. Note that this wraps back around to the original
+ * domain if no other domains have requests or tokens.
+ */
+ khd->batching = 0;
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+ if (++khd->cur_domain >= KYBER_NUM_DOMAINS)
+ khd->cur_domain = 0;
+
+ rq = kyber_dispatch_cur_domain(hctx, kqd, khd, &flushed,
+ &no_tokens);
+ if (rq)
+ return rq;
+ }
+
+ return no_tokens ? ERR_PTR(-EBUSY) : NULL;
+}
+
+static struct request *kyber_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
+ struct kyber_hctx_data *khd = hctx->sched_data;
+ struct request *rq;
+
+ spin_lock(&khd->lock);
+
+ rq = __kyber_dispatch_request(kqd, khd, hctx);
+ if (IS_ERR(rq)) {
+ /*
+ * We failed to get a domain token. Mark the queue as needing a
+ * restart and try again in case a token was freed before we set
+ * the restart bit.
+ */
+ blk_mq_sched_mark_restart_queue(hctx);
+ rq = __kyber_dispatch_request(kqd, khd, hctx);
+ if (IS_ERR(rq))
+ rq = NULL;
+ }
+
+ spin_unlock(&khd->lock);
+
+ return rq;
+}
+
+static bool kyber_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ struct kyber_hctx_data *khd = hctx->sched_data;
+ int i;
+
+ for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
+ if (!list_empty_careful(&khd->rqs[i]))
+ return true;
+ }
+ return false;
+}
+
+#define KYBER_LAT_SHOW_STORE(op) \
+static ssize_t kyber_##op##_lat_show(struct elevator_queue *e, \
+ char *page) \
+{ \
+ struct kyber_queue_data *kqd = e->elevator_data; \
+ \
+ return sprintf(page, "%llu\n", kqd->op##_lat_nsec); \
+} \
+ \
+static ssize_t kyber_##op##_lat_store(struct elevator_queue *e, \
+ const char *page, size_t count) \
+{ \
+ struct kyber_queue_data *kqd = e->elevator_data; \
+ unsigned long long nsec; \
+ int ret; \
+ \
+ ret = kstrtoull(page, 10, &nsec); \
+ if (ret) \
+ return ret; \
+ \
+ kqd->op##_lat_nsec = nsec; \
+ \
+ return count; \
+}
+KYBER_LAT_SHOW_STORE(read);
+KYBER_LAT_SHOW_STORE(write);
+#undef KYBER_LAT_SHOW_STORE
+
+#define KYBER_LAT_ATTR(op) __ATTR(op##_lat_nsec, 0644, kyber_##op##_lat_show, kyber_##op##_lat_store)
+static struct elv_fs_entry kyber_sched_attrs[] = {
+ KYBER_LAT_ATTR(read),
+ KYBER_LAT_ATTR(write),
+ __ATTR_NULL
+};
+#undef KYBER_LAT_ATTR
+
+static struct elevator_type kyber_sched = {
+ .ops.mq = {
+ .init_sched = kyber_init_sched,
+ .exit_sched = kyber_exit_sched,
+ .init_hctx = kyber_init_hctx,
+ .exit_hctx = kyber_exit_hctx,
+ .get_request = kyber_get_request,
+ .put_request = kyber_put_request,
+ .completed_request = kyber_completed_request,
+ .dispatch_request = kyber_dispatch_request,
+ .has_work = kyber_has_work,
+ },
+ .uses_mq = true,
+ .elevator_attrs = kyber_sched_attrs,
+ .elevator_name = "kyber",
+ .elevator_owner = THIS_MODULE,
+};
+
+static int __init kyber_init(void)
+{
+ return elv_register(&kyber_sched);
+}
+
+static void __exit kyber_exit(void)
+{
+ elv_unregister(&kyber_sched);
+}
+
+module_init(kyber_init);
+module_exit(kyber_exit);
+
+MODULE_AUTHOR("Omar Sandoval");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Kyber I/O scheduler");
--
2.12.2
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox