* [PATCH v2 1/4] block: remove extra calls to wbt_exit()
From: Omar Sandoval @ 2017-03-21 15:56 UTC (permalink / raw)
To: linux-block; +Cc: kernel-team, Omar Sandoval
In-Reply-To: <cover.1490110621.git.osandov@fb.com>
From: Omar Sandoval <osandov@fb.com>
We always call wbt_exit() from blk_release_queue(), so these are
unnecessary.
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
block/blk-core.c | 1 -
block/blk-mq.c | 2 --
2 files changed, 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..e8a9bc0d4bbb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -889,7 +889,6 @@ int blk_init_allocated_queue(struct request_queue *q)
q->exit_rq_fn(q, q->fq->flush_rq);
out_free_flush_queue:
blk_free_flush_queue(q->fq);
- wbt_exit(q);
return -ENOMEM;
}
EXPORT_SYMBOL(blk_init_allocated_queue);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4546f060e80..534f49a90e3a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2431,8 +2431,6 @@ void blk_mq_free_queue(struct request_queue *q)
list_del_init(&q->all_q_node);
mutex_unlock(&all_q_mutex);
- wbt_exit(q);
-
blk_mq_del_queue_tag_set(q);
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
--
2.12.0
^ permalink raw reply related
* [PATCH v2 0/4] block: callback-based statistics
From: Omar Sandoval @ 2017-03-21 15:56 UTC (permalink / raw)
To: linux-block; +Cc: kernel-team
From: Omar Sandoval <osandov@fb.com>
This patchset generalizes the blk-stats infrastructure to allow users to
register a callback to be called at a given time with the statistics of
requests completed during that window. Writeback throttling and hybrid
polling are converted to the new infrastructure. The new Kyber I/O
scheduler uses this, as well (but it needs to be rebased on this v2).
The details are in patch 4, which is the actual conversion. Patches 1-3
are preparation cleanups.
Changes since v1:
- Now the user can subdivide stats into arbitrary buckets. Both in-tree
users just do reads vs. writes, but we can extend poll based on
request size in the future
- blk_stat_arm_callback() became blk_stat_activate_msecs() and
blk_stat_activate_nsecs()
- The poll statistics are exposed in debugfs
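For readers who want a feel for the shape of such an interface, here is a minimal userspace sketch of callback-based statistics with user-defined buckets. It is only an illustration under stated assumptions: every name in it (stat_callback, stat_add, stat_window_end, rw_bucket, store_means) is hypothetical and is not the API introduced by patch 4. The timer is modelled by an explicit stat_window_end() call.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BUCKETS 8

/* Hypothetical userspace model; none of these names are the kernel API. */
struct rq_stat {
	uint64_t nr_samples;
	uint64_t sum;
	uint64_t min;
	uint64_t max;
};

struct stat_callback {
	/* Maps a completed request to a bucket index, or -1 to skip it. */
	int (*bucket_fn)(int is_write, uint64_t bytes);
	/* Called at the end of the window with the per-bucket stats. */
	void (*timer_fn)(struct stat_callback *cb);
	unsigned int nr_buckets;
	struct rq_stat stat[MAX_BUCKETS];
	void *data;
};

/* Clear all buckets so a new window can start. */
void stat_reset(struct stat_callback *cb)
{
	assert(cb->nr_buckets <= MAX_BUCKETS);
	for (unsigned int i = 0; i < cb->nr_buckets; i++) {
		cb->stat[i].nr_samples = 0;
		cb->stat[i].sum = 0;
		cb->stat[i].min = UINT64_MAX;
		cb->stat[i].max = 0;
	}
}

/* Account one completed request in the bucket chosen by bucket_fn. */
void stat_add(struct stat_callback *cb, int is_write, uint64_t bytes,
	      uint64_t latency_ns)
{
	int b = cb->bucket_fn(is_write, bytes);
	struct rq_stat *s;

	if (b < 0 || (unsigned int)b >= cb->nr_buckets)
		return;
	s = &cb->stat[b];
	s->nr_samples++;
	s->sum += latency_ns;
	if (latency_ns < s->min)
		s->min = latency_ns;
	if (latency_ns > s->max)
		s->max = latency_ns;
}

/* End of window: hand the stats to the user, then start over. */
void stat_window_end(struct stat_callback *cb)
{
	cb->timer_fn(cb);
	stat_reset(cb);
}

/* Example bucket function: reads in bucket 0, writes in bucket 1. */
int rw_bucket(int is_write, uint64_t bytes)
{
	(void)bytes;
	return is_write ? 1 : 0;
}

/* Example callback: store each bucket's mean latency in cb->data. */
void store_means(struct stat_callback *cb)
{
	uint64_t *means = cb->data;

	for (unsigned int i = 0; i < cb->nr_buckets; i++)
		means[i] = cb->stat[i].nr_samples ?
			cb->stat[i].sum / cb->stat[i].nr_samples : 0;
}
```

A user would fill in bucket_fn, timer_fn, nr_buckets and data, call stat_reset() once, feed completions through stat_add(), and receive the per-bucket results when the window ends. Bucketing by request size, as suggested for polling, would only require a different bucket_fn.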
Omar Sandoval (4):
block: remove extra calls to wbt_exit()
blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE}
blk-stat: move BLK_RQ_STAT_BATCH definition to blk-stat.c
blk-stat: convert to callback-based statistics reporting
block/blk-core.c | 7 +-
block/blk-mq-debugfs.c | 99 +++++++--------
block/blk-mq.c | 78 ++++++++----
block/blk-mq.h | 1 -
block/blk-stat.c | 315 ++++++++++++++++++++++------------------------
block/blk-stat.h | 182 ++++++++++++++++++++++++---
block/blk-sysfs.c | 31 +----
block/blk-wbt.c | 61 ++++-----
block/blk-wbt.h | 2 +-
include/linux/blk_types.h | 3 -
include/linux/blkdev.h | 10 +-
11 files changed, 454 insertions(+), 335 deletions(-)
--
2.12.0
^ permalink raw reply
* Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: Johannes Thumshirn @ 2017-03-21 13:33 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: axboe, Bart.VanAssche, linux-block
In-Reply-To: <20170320203930.12533-2-hch@lst.de>
On Mon, Mar 20, 2017 at 04:39:27PM -0400, Christoph Hellwig wrote:
> This flag was never used since it was introduced.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply
* Re: unify and streamline the blk-mq make_request implementations V2
From: Bart Van Assche @ 2017-03-21 12:40 UTC (permalink / raw)
To: hch@lst.de, axboe@kernel.dk; +Cc: linux-block@vger.kernel.org
In-Reply-To: <20170320203930.12533-1-hch@lst.de>
On Mon, 2017-03-20 at 16:39 -0400, Christoph Hellwig wrote:
> Changes since V1:
>  - rebase on top of the recent blk_mq_try_issue_directly changes
>  - incorporate comments from Bart
Hi Christoph,
It seems to me like none of the three comments I had posted on patch 4/4
have been addressed. Please have another look at these comments.
Thanks,
Bart.
^ permalink raw reply
* Re: [PATCH rfc 10/10] target: Use non-selective polling
From: Sagi Grimberg @ 2017-03-21 11:35 UTC (permalink / raw)
To: Nicholas A. Bellinger; +Cc: linux-block, linux-rdma, target-devel, linux-nvme
In-Reply-To: <1489881481.27336.23.camel@haakon3.risingtidesystems.com>
> Hey Sagi,
Hey Nic
> Let's make 'batch' into a backend specific attribute so it can be
> changed on-the-fly per device, instead of a hard-coded value.
>
> Here's a quick patch to that end. Feel free to fold it into your
> series.
I will, thanks!
^ permalink raw reply
* Re: [PATCH 3/4] blk-mq: improve blk_mq_try_issue_directly
From: Bart Van Assche @ 2017-03-21 1:35 UTC (permalink / raw)
To: Christoph Hellwig, axboe@kernel.dk
Cc: Bart Van Assche, linux-block@vger.kernel.org
In-Reply-To: <20170320203930.12533-4-hch@lst.de>
On 03/20/2017 04:39 PM, Christoph Hellwig wrote:
> Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
> new wrapper that takes care of RCU / SRCU locking to avoid having
> boilerplate code in the caller which would get duplicated with new callers.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
^ permalink raw reply
* Re: [PATCH 2/4] blk-mq: merge mq and sq make_request instances
From: Bart Van Assche @ 2017-03-21 1:33 UTC (permalink / raw)
To: Christoph Hellwig, axboe@kernel.dk
Cc: Bart Van Assche, linux-block@vger.kernel.org
In-Reply-To: <20170320203930.12533-3-hch@lst.de>
On 03/20/2017 04:39 PM, Christoph Hellwig wrote:
> @@ -1534,7 +1529,36 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> }
>
> plug = current->plug;
> - if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> + if (plug && q->nr_hw_queues == 1) {
> + [ ... ]
> + } else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
> struct request *old_rq = NULL;
>
> blk_mq_bio_to_request(rq, bio);

I think this patch will change the behavior for the plug == NULL &&
q->nr_hw_queues == 1 && is_sync case: with this patch applied the code
under "else if" will be executed for that case but that wasn't the case
before this patch.

Bart.
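
The case described above can be checked with a small model of the post-patch branch selection. This is only a sketch: it reproduces the two conditions from the quoted hunk and nothing else, and the enum and function names are made up for illustration. It does not model the pre-patch single-queue path, so the "before" behavior is taken from the description above rather than shown.

```c
#include <assert.h>
#include <stdbool.h>

enum branch {
	BRANCH_SQ_PLUG,   /* new "plug && nr_hw_queues == 1" branch */
	BRANCH_ELSE_IF,   /* the pre-existing branch, now an "else if" */
	BRANCH_OTHER,     /* neither condition matched */
};

/* Branch selection as written in the quoted hunk, after the patch. */
enum branch new_branch(bool plug, unsigned int nr_hw_queues,
		       bool nomerges, bool is_sync)
{
	assert(nr_hw_queues >= 1);
	if (plug && nr_hw_queues == 1)
		return BRANCH_SQ_PLUG;
	else if ((plug && !nomerges) || is_sync)
		return BRANCH_ELSE_IF;
	return BRANCH_OTHER;
}
```

With plug == NULL (false), nr_hw_queues == 1 and is_sync set, new_branch() returns BRANCH_ELSE_IF regardless of the nomerges setting, which is the combination in question.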
^ permalink raw reply
* Re: [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: Bart Van Assche @ 2017-03-21 1:11 UTC (permalink / raw)
To: Christoph Hellwig, axboe@kernel.dk
Cc: Bart Van Assche, linux-block@vger.kernel.org
In-Reply-To: <20170320203930.12533-2-hch@lst.de>
On 03/20/2017 04:39 PM, Christoph Hellwig wrote:
> This flag was never used since it was introduced.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
^ permalink raw reply
* Re: [PATCH 0/11 v4] block: Fix block device shutdown related races
From: Thiago Jung Bauermann @ 2017-03-21 0:57 UTC (permalink / raw)
To: Jan Kara, Lekshmi C. Pillai
Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
Tejun Heo, Tahsin Erdogan, Omar Sandoval
In-Reply-To: <20170313151410.5586-1-jack@suse.cz>
Hello,
On Monday, 13 March 2017, 16:13:59 BRT, Jan Kara wrote:
> People, please have a look at patches. They are mostly simple however the
> interactions are rather complex so I may have missed something. Also I'm
> happy for any additional testing these patches can get - I've stressed them
> with Omar's script, tested memcg writeback, tested static (not udev managed)
> device inodes.
Lekshmi tested these patches on top of an older kernel version that is
afflicted by this bug (I backported the patches on which this series depends as
well), and it fixed the bug there.
Sorry if that's not as helpful, I had unrelated issues with v4.11-rc2 and
wasn't able to test with it yet. :-/
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
* [RFC PATCH 1/1] nbd: replace kill_bdev() with __invalidate_device()
From: Ming Lin @ 2017-03-20 22:58 UTC (permalink / raw)
To: nbd-general, Josef Bacik, Ratna Manoj Bolla
Cc: linux-block, linux-kernel, jianshu.ljs, xiongwei.jiang, james.liu,
Markus Pargmann
In-Reply-To: <1490050729-3578-1-git-send-email-mlin@kernel.org>
From: Ratna Manoj Bolla <manoj.br@gmail.com>
When a filesystem is mounted on an nbd device and the device is
disconnected, kill_bdev() and the reset of the bdev size to zero destroy
buffer_head mappings underneath the mounted filesystem.
After a bdev size reset (i.e. bdev->bd_inode->i_size = 0) on a disconnect,
followed by a sys_umount(),
generic_shutdown_super()->...
->__sync_blockdev()->...
->blkdev_writepages()->...
->do_invalidatepage()->...
->discard_buffer() discards the superblock buffer_head, which
ext4_commit_super() assumes to be in the mapped state.
[mlin: ported to 4.11-rc2]
Signed-off-by: Ratna Manoj Bolla <manoj.br@gmail.com>
---
drivers/block/nbd.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index cb4ccfc..a6a3643 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -125,7 +125,8 @@ static const char *nbdcmd_to_ascii(int cmd)
static int nbd_size_clear(struct nbd_device *nbd, struct block_device *bdev)
{
- bd_set_size(bdev, 0);
+ if (bdev->bd_openers <= 1)
+ bd_set_size(bdev, 0);
set_capacity(nbd->disk, 0);
kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
@@ -603,6 +604,8 @@ static void nbd_reset(struct nbd_device *nbd)
static void nbd_bdev_reset(struct block_device *bdev)
{
+ if (bdev->bd_openers > 1)
+ return;
set_device_ro(bdev, false);
bdev->bd_inode->i_size = 0;
if (max_part > 0) {
@@ -666,7 +669,8 @@ static int nbd_clear_sock(struct nbd_device *nbd, struct block_device *bdev)
{
sock_shutdown(nbd);
nbd_clear_que(nbd);
- kill_bdev(bdev);
+
+ __invalidate_device(bdev, true);
nbd_bdev_reset(bdev);
/*
* We want to give the run thread a chance to wait for everybody
--
1.8.3.1
^ permalink raw reply related
* [RFC PATCH 0/1] nbd: fix crash when unmapping nbd device with fs still mounted
From: Ming Lin @ 2017-03-20 22:58 UTC (permalink / raw)
To: nbd-general, Josef Bacik, Ratna Manoj Bolla
Cc: linux-block, linux-kernel, jianshu.ljs, xiongwei.jiang, james.liu,
Markus Pargmann
Hi all,
I ran into a BUG_ON(!buffer_mapped(bh)) crash with the script below.
$ rbd-nbd map mypool/myimg
$ mkfs.ext4 /dev/nbd0
$ mount /dev/nbd0 /mnt/
$ rbd-nbd unmap /dev/nbd0
$ umount /mnt
[ 1248.870131] kernel BUG at /home/mlin/linux/fs/buffer.c:3103!
[ 1248.871214] invalid opcode: 0000 [#1] SMP
[ 1248.879468] CPU: 0 PID: 2450 Comm: umount Tainted: G E 4.11.0-rc2+ #2
[ 1248.896579] Call Trace:
[ 1248.897056] __sync_dirty_buffer+0x6e/0xe0
[ 1248.897870] ext4_commit_super+0x1eb/0x290 [ext4]
[ 1248.898795] ext4_put_super+0x2fa/0x3c0 [ext4]
[ 1248.899662] generic_shutdown_super+0x6f/0x100
[ 1248.900525] kill_block_super+0x27/0x70
[ 1248.901257] deactivate_locked_super+0x43/0x70
[ 1248.902112] deactivate_super+0x46/0x60
[ 1248.902869] cleanup_mnt+0x3f/0x80
[ 1248.903526] __cleanup_mnt+0x12/0x20
[ 1248.904218] task_work_run+0x83/0xb0
[ 1248.904941] exit_to_usermode_loop+0x59/0x7b
[ 1248.905769] do_syscall_64+0x165/0x180
[ 1248.907603] entry_SYSCALL64_slow_path+0x25/0x25
Last year, Ratna posted a patch to fix it.
https://lkml.org/lkml/2016/4/20/257
Ratna's script to reproduce the bug.
$ qemu-img create -f qcow2 f.img 1G
$ mkfs.ext4 f.img
$ qemu-nbd -c /dev/nbd0 f.img
$ mount /dev/nbd0 dir
$ killall -KILL qemu-nbd
$ sleep 1
$ ls dir
$ umount dir
I ported Ratna's patch to 4.11-rc2 and confirmed that it fixes the crash.
Jan Kara had some comments about this bug:
http://www.kernelhub.org/?p=2&msg=361407
I hope to fix this bug in the upstream kernel first and then backport it to
our production system.
Please see "PATCH 1/1" for details.
Thanks,
Ming
^ permalink raw reply
* [PATCH 7/7] sd: use ZERO_PAGE for WRITE_SAME payloads
From: Christoph Hellwig @ 2017-03-20 20:43 UTC (permalink / raw)
To: tj, martin.petersen, axboe; +Cc: linux-ide, linux-scsi, linux-block
In-Reply-To: <20170320204319.12628-1-hch@lst.de>
We're never touching the contents of the page, so save a memory
allocation for these cases.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
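As a userspace analogy of the idea, the sketch below hands out one shared, never-written zero buffer for read-only payloads and only frees buffers that were privately allocated. It is an illustration under stated assumptions, not the sd code: PAGE_SZ, payload_get() and payload_put() are hypothetical names, and a static array stands in for the kernel's zero page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define PAGE_SZ 4096

/* Static storage is zero-initialized; this stands in for the zero page. */
static unsigned char shared_zero_page[PAGE_SZ];

/*
 * Return a zeroed payload buffer. Callers that never write to the
 * payload get the shared buffer and so avoid an allocation.
 */
unsigned char *payload_get(bool writable)
{
	if (!writable)
		return shared_zero_page;
	return calloc(1, PAGE_SZ);
}

/*
 * Release a payload buffer. The shared buffer must never be freed,
 * so it is recognised by address. Returns true if memory was freed.
 */
bool payload_put(unsigned char *page)
{
	assert(page != NULL);
	if (page == shared_zero_page)
		return false;
	free(page);
	return true;
}
```

The release path has to distinguish the shared buffer from private ones, which mirrors the extra comparison this patch adds to sd_uninit_command().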
drivers/scsi/sd.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c18fe9ff1f8f..af632e350ab4 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -756,7 +756,7 @@ static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
u32 data_len = sdp->sector_size;
- rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+ rq->special_vec.bv_page = ZERO_PAGE(0);
if (!rq->special_vec.bv_page)
return BLKPREP_DEFER;
rq->special_vec.bv_offset = 0;
@@ -785,7 +785,7 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmnd *cmd, bool unmap)
u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
u32 data_len = sdp->sector_size;
- rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+ rq->special_vec.bv_page = ZERO_PAGE(0);
if (!rq->special_vec.bv_page)
return BLKPREP_DEFER;
rq->special_vec.bv_offset = 0;
@@ -1256,7 +1256,8 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt)
{
struct request *rq = SCpnt->request;
- if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
+ if ((rq->rq_flags & RQF_SPECIAL_PAYLOAD) &&
+ rq->special_vec.bv_page != ZERO_PAGE(0))
__free_page(rq->special_vec.bv_page);
if (SCpnt->cmnd != scsi_req(rq)->cmd) {
--
2.11.0
^ permalink raw reply related
* [PATCH 6/7] sd: support multi-range TRIM for ATA disks
From: Christoph Hellwig @ 2017-03-20 20:43 UTC (permalink / raw)
To: tj, martin.petersen, axboe; +Cc: linux-ide, linux-scsi, linux-block
In-Reply-To: <20170320204319.12628-1-hch@lst.de>
Almost the same scheme as the older multi-range support for NVMe.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
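For reference, the on-the-wire layout this patch fills in can be sketched in plain userspace C: each entry is a 64-bit little-endian word with the LBA in bits 47:0 and the range length in bits 63:48, and a range longer than 0xffff blocks is split across several entries. The helper names below (put_le64, trim_add_range) are hypothetical, and unlike the loop in this patch the sketch advances the entry index after every entry it writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRIM_MAX_RANGE 0xffffu

/* Store v in little-endian byte order regardless of host endianness. */
static void put_le64(uint8_t *dst, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		dst[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Append entries covering [lba, lba + count) to payload, starting at
 * entry *idx. Returns 0 on success, -1 if the payload is full.
 */
int trim_add_range(uint8_t *payload, size_t max_entries, size_t *idx,
		   uint64_t lba, uint32_t count)
{
	assert(payload != NULL && idx != NULL);
	while (count > 0) {
		uint32_t n = count < TRIM_MAX_RANGE ? count : TRIM_MAX_RANGE;

		if (*idx >= max_entries)
			return -1;
		/* Bits 47:0 hold the LBA, bits 63:48 the range length. */
		put_le64(payload + 8 * *idx,
			 (lba & 0xffffffffffffULL) | ((uint64_t)n << 48));
		(*idx)++;
		lba += n;
		count -= n;
	}
	return 0;
}
```

A 512-byte payload holds 64 such entries, so a caller would pass max_entries = 64 and call trim_add_range() once per discontiguous range.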
drivers/scsi/sd.c | 37 +++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index cc8684818305..c18fe9ff1f8f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -643,7 +643,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
{
struct request_queue *q = sdkp->disk->queue;
unsigned int logical_block_size = sdkp->device->sector_size;
- unsigned int max_blocks = 0;
+ unsigned int max_blocks = 0, max_ranges = 0, max_range_size = 0;
q->limits.discard_zeroes_data = 0;
@@ -698,13 +698,19 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
break;
case SD_LBP_ATA_TRIM:
- max_blocks = 65535 * (512 / sizeof(__le64));
+ max_ranges = 512 / sizeof(__le64);
+ max_range_size = USHRT_MAX;
+ max_blocks = max_ranges * max_range_size;
if (sdkp->device->ata_trim_zeroes_data)
q->limits.discard_zeroes_data = 1;
break;
}
blk_queue_max_discard_sectors(q, max_blocks * (logical_block_size >> 9));
+ if (max_ranges)
+ blk_queue_max_discard_segments(q, max_ranges);
+ if (max_range_size)
+ blk_queue_max_discard_segment_size(q, max_range_size);
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
}
@@ -805,9 +811,9 @@ static int sd_setup_ata_trim_cmnd(struct scsi_cmnd *cmd)
{
struct scsi_device *sdp = cmd->device;
struct request *rq = cmd->request;
- u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
- u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
- u32 data_len = sdp->sector_size, i;
+ u32 sector_shift = ilog2(sdp->sector_size);
+ u32 data_len = sdp->sector_size, i = 0;
+ struct bio *bio;
__le64 *buf;
rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
@@ -826,14 +832,21 @@ static int sd_setup_ata_trim_cmnd(struct scsi_cmnd *cmd)
cmd->cmnd[8] = data_len;
buf = page_address(rq->special_vec.bv_page);
- for (i = 0; i < (data_len >> 3); i++) {
- u64 n = min(nr_sectors, 0xffffu);
+ __rq_for_each_bio(bio, rq) {
+ u64 sector = bio->bi_iter.bi_sector >> (sector_shift - 9);
+ u32 nr_sectors = bio->bi_iter.bi_size >> sector_shift;
- buf[i] = cpu_to_le64(sector | (n << 48));
- if (nr_sectors <= 0xffff)
- break;
- sector += 0xffff;
- nr_sectors -= 0xffff;
+ do {
+ u64 n = min(nr_sectors, 0xffffu);
+
+ buf[i] = cpu_to_le64(sector | (n << 48));
+ if (nr_sectors <= 0xffff)
+ break;
+ sector += 0xffff;
+ nr_sectors -= 0xffff;
+ i++;
+
+ } while (!WARN_ON_ONCE(i >= data_len >> 3));
}
cmd->allowed = SD_MAX_RETRIES;
--
2.11.0
^ permalink raw reply related
* [PATCH 5/7] block: add a max_discard_segment_size queue limit
From: Christoph Hellwig @ 2017-03-20 20:43 UTC (permalink / raw)
To: tj, martin.petersen, axboe; +Cc: linux-ide, linux-scsi, linux-block
In-Reply-To: <20170320204319.12628-1-hch@lst.de>
ATA only allows 16 bits for the length of each TRIM range, so we need a
limit on the size of a single discard segment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-core.c | 6 ++++++
block/blk-merge.c | 9 +++++++++
block/blk-settings.c | 14 ++++++++++++++
include/linux/blkdev.h | 8 ++++++++
4 files changed, 37 insertions(+)
diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..3eb3bd89b47a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1486,9 +1486,15 @@ bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
struct bio *bio)
{
unsigned short segments = blk_rq_nr_discard_segments(req);
+ unsigned max_segment_sectors = queue_max_discard_segment_size(q) >> 9;
if (segments >= queue_max_discard_segments(q))
goto no_merge;
+ if (blk_rq_sectors(req) > max_segment_sectors)
+ goto no_merge;
+ if (bio_sectors(bio) > max_segment_sectors)
+ goto no_merge;
+
if (blk_rq_sectors(req) + bio_sectors(bio) >
blk_rq_get_max_sectors(req, blk_rq_pos(req)))
goto no_merge;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..c62a6f0325e0 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -11,6 +11,15 @@
#include "blk.h"
+/*
+ * Split a discard bio if it doesn't fit into the overall discard request size
+ * of the device. Note that we don't split it here if it's over the maximum
+ * discard segment size to avoid creating way too many bios in that case.
+ * We will simply take care of never merging such a larger than segment size
+ * bio into a request that has other bios, and let the low-level driver take
+ * care of splitting the request into multiple ranges in the on the wire
+ * format.
+ */
static struct bio *blk_bio_discard_split(struct request_queue *q,
struct bio *bio,
struct bio_set *bs,
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 1e7174ffc9d4..9d515ae3a405 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -93,6 +93,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
lim->virt_boundary_mask = 0;
lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
+ lim->max_discard_segment_size = UINT_MAX;
lim->max_sectors = lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;
lim->max_dev_sectors = 0;
lim->chunk_sectors = 0;
@@ -132,6 +133,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_discard_segments = 1;
lim->max_hw_sectors = UINT_MAX;
lim->max_segment_size = UINT_MAX;
+ lim->max_discard_segment_size = UINT_MAX;
lim->max_sectors = UINT_MAX;
lim->max_dev_sectors = UINT_MAX;
lim->max_write_same_sectors = UINT_MAX;
@@ -376,6 +378,18 @@ void blk_queue_max_segment_size(struct request_queue *q, unsigned int max_size)
EXPORT_SYMBOL(blk_queue_max_segment_size);
/**
+ * blk_queue_max_discard_segment_size - set max segment size for discards
+ * @q: the request queue for the device
+ * @max_size: max size of a discard segment in bytes
+ **/
+void blk_queue_max_discard_segment_size(struct request_queue *q,
+ unsigned int max_size)
+{
+ q->limits.max_discard_segment_size = max_size;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_discard_segment_size);
+
+/**
* blk_queue_logical_block_size - set logical block size for the queue
* @q: the request queue for the device
* @size: the logical block size, in bytes
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5a7da607ca04..3b3bd646f580 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -333,6 +333,7 @@ struct queue_limits {
unsigned short max_segments;
unsigned short max_integrity_segments;
unsigned short max_discard_segments;
+ unsigned int max_discard_segment_size;
unsigned char misaligned;
unsigned char discard_misaligned;
@@ -1150,6 +1151,8 @@ extern void blk_queue_max_segments(struct request_queue *, unsigned short);
extern void blk_queue_max_discard_segments(struct request_queue *,
unsigned short);
extern void blk_queue_max_segment_size(struct request_queue *, unsigned int);
+extern void blk_queue_max_discard_segment_size(struct request_queue *,
+ unsigned int);
extern void blk_queue_max_discard_sectors(struct request_queue *q,
unsigned int max_discard_sectors);
extern void blk_queue_max_write_same_sectors(struct request_queue *q,
@@ -1415,6 +1418,11 @@ static inline unsigned int queue_max_segment_size(struct request_queue *q)
return q->limits.max_segment_size;
}
+static inline unsigned int queue_max_discard_segment_size(struct request_queue *q)
+{
+ return q->limits.max_discard_segment_size;
+}
+
static inline unsigned short queue_logical_block_size(struct request_queue *q)
{
int retval = 512;
--
2.11.0
^ permalink raw reply related
* [PATCH 4/7] libata: simplify the trim implementation
From: Christoph Hellwig @ 2017-03-20 20:43 UTC (permalink / raw)
To: tj, martin.petersen, axboe; +Cc: linux-ide, linux-scsi, linux-block
In-Reply-To: <20170320204319.12628-1-hch@lst.de>
Don't try to fake up basic SCSI logical block provisioning and WRITE SAME
support, but offer support for the Linux Vendor Specific TRIM command
instead. This simplifies the implementation a lot, and avoids rewriting
the data out buffer in the I/O path. Note that this new command is only
offered to the block layer and will fail for pass through commands.
While this is theoretically a regression in the functionality offered
through SG_IO the previous support was buggy and corrupted user memory
by rewriting the data out buffer in place.
Last but not least this removes the global ata_scsi_rbuf_lock from
the trim path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/ata/libata-scsi.c | 179 ++++++++--------------------------------------
1 file changed, 28 insertions(+), 151 deletions(-)
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index b93d7e33789a..965b9e7dbb7d 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1322,6 +1322,16 @@ static int ata_scsi_dev_config(struct scsi_device *sdev,
blk_queue_flush_queueable(q, false);
+ if (ata_id_has_trim(dev->id) &&
+ !(dev->horkage & ATA_HORKAGE_NOTRIM)) {
+ sdev->ata_trim = 1;
+ if (ata_id_has_zero_after_trim(dev->id) &&
+ (dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM)) {
+ ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+ sdev->ata_trim_zeroes_data = 1;
+ }
+ }
+
dev->sdev = sdev;
return 0;
}
@@ -2383,21 +2393,6 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
*/
min_io_sectors = 1 << ata_id_log2_per_physical_sector(args->id);
put_unaligned_be16(min_io_sectors, &rbuf[6]);
-
- /*
- * Optimal unmap granularity.
- *
- * The ATA spec doesn't even know about a granularity or alignment
- * for the TRIM command. We can leave away most of the unmap related
- * VPD page entries, but we have specifify a granularity to signal
- * that we support some form of unmap - in thise case via WRITE SAME
- * with the unmap bit set.
- */
- if (ata_id_has_trim(args->id)) {
- put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM, &rbuf[36]);
- put_unaligned_be32(1, &rbuf[28]);
- }
-
return 0;
}
@@ -2746,16 +2741,6 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
rbuf[14] = (lowest_aligned >> 8) & 0x3f;
rbuf[15] = lowest_aligned;
- if (ata_id_has_trim(args->id) &&
- !(dev->horkage & ATA_HORKAGE_NOTRIM)) {
- rbuf[14] |= 0x80; /* LBPME */
-
- if (ata_id_has_zero_after_trim(args->id) &&
- dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
- ata_dev_info(dev, "Enabling discard_zeroes_data\n");
- rbuf[14] |= 0x40; /* LBPRZ */
- }
- }
if (ata_id_zoned_cap(args->id) ||
args->dev->class == ATA_DEV_ZAC)
rbuf[12] = (1 << 4); /* RC_BASIS */
@@ -3339,141 +3324,45 @@ static unsigned int ata_scsi_pass_thru(struct ata_queued_cmd *qc)
}
/**
- * ata_format_dsm_trim_descr() - SATL Write Same to DSM Trim
- * @cmd: SCSI command being translated
- * @trmax: Maximum number of entries that will fit in sector_size bytes.
- * @sector: Starting sector
- * @count: Total Range of request in logical sectors
- *
- * Rewrite the WRITE SAME descriptor to be a DSM TRIM little-endian formatted
- * descriptor.
- *
- * Upto 64 entries of the format:
- * 63:48 Range Length
- * 47:0 LBA
- *
- * Range Length of 0 is ignored.
- * LBA's should be sorted order and not overlap.
- *
- * NOTE: this is the same format as ADD LBA(S) TO NV CACHE PINNED SET
- *
- * Return: Number of bytes copied into sglist.
- */
-static size_t ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 trmax,
- u64 sector, u32 count)
-{
- struct scsi_device *sdp = cmd->device;
- size_t len = sdp->sector_size;
- size_t r;
- __le64 *buf;
- u32 i = 0;
- unsigned long flags;
-
- WARN_ON(len > ATA_SCSI_RBUF_SIZE);
-
- if (len > ATA_SCSI_RBUF_SIZE)
- len = ATA_SCSI_RBUF_SIZE;
-
- spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
- buf = ((void *)ata_scsi_rbuf);
- memset(buf, 0, len);
- while (i < trmax) {
- u64 entry = sector |
- ((u64)(count > 0xffff ? 0xffff : count) << 48);
- buf[i++] = __cpu_to_le64(entry);
- if (count <= 0xffff)
- break;
- count -= 0xffff;
- sector += 0xffff;
- }
- r = sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buf, len);
- spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
-
- return r;
-}
-
-/**
- * ata_scsi_write_same_xlat() - SATL Write Same to ATA SCT Write Same
+ * ata_scsi_trim_xlat() - Handle the vendor specific TRIM command.
* @qc: Command to be translated
*
- * Translate a SCSI WRITE SAME command to be either a DSM TRIM command or
- * an SCT Write Same command.
- * Based on WRITE SAME has the UNMAP flag
- * When set translate to DSM TRIM
- * When clear translate to SCT Write Same
+ * Set up a DSM TRIM command (or its queued variant) after sd already
+ * prepared the payload for us.
*/
-static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
+static unsigned int ata_scsi_trim_xlat(struct ata_queued_cmd *qc)
{
struct ata_taskfile *tf = &qc->tf;
struct scsi_cmnd *scmd = qc->scsicmd;
- struct scsi_device *sdp = scmd->device;
- size_t len = sdp->sector_size;
struct ata_device *dev = qc->dev;
- const u8 *cdb = scmd->cmnd;
- u64 block;
- u32 n_block;
- const u32 trmax = len >> 3;
- u32 size;
- u16 fp;
- u8 bp = 0xff;
- u8 unmap = cdb[1] & 0x8;
-
- /* we may not issue DMA commands if no DMA mode is set */
- if (unlikely(!dev->dma_mode))
- goto invalid_opcode;
- if (unlikely(scmd->cmd_len < 16)) {
- fp = 15;
- goto invalid_fld;
+ if (unlikely(!dev->dma_mode)) {
+ ata_scsi_set_sense(dev, scmd, ILLEGAL_REQUEST, 0x20, 0x0);
+ return 1;
}
- scsi_16_lba_len(cdb, &block, &n_block);
- if (!unmap ||
- (dev->horkage & ATA_HORKAGE_NOTRIM) ||
- !ata_id_has_trim(dev->id)) {
- fp = 1;
- bp = 3;
- goto invalid_fld;
- }
- /* If the request is too large the cmd is invalid */
- if (n_block > 0xffff * trmax) {
- fp = 2;
- goto invalid_fld;
+ /* We only allow sending this command through the block layer */
+ if (unlikely(req_op(scmd->request) != REQ_OP_DISCARD)) {
+ ata_scsi_set_sense(dev, scmd, ILLEGAL_REQUEST, 0x20, 0x0);
+ return 1;
}
- /*
- * WRITE SAME always has a sector sized buffer as payload, this
- * should never be a multiple entry S/G list.
- */
- if (!scsi_sg_count(scmd))
- goto invalid_param_len;
-
- /*
- * size must match sector size in bytes
- * For DATA SET MANAGEMENT TRIM in ACS-2 nsect (aka count)
- * is defined as number of 512 byte blocks to be transferred.
- */
-
- size = ata_format_dsm_trim_descr(scmd, trmax, block, n_block);
- if (size != len)
- goto invalid_param_len;
-
if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) {
/* Newer devices support queued TRIM commands */
tf->protocol = ATA_PROT_NCQ;
tf->command = ATA_CMD_FPDMA_SEND;
tf->hob_nsect = ATA_SUBCMD_FPDMA_SEND_DSM & 0x1f;
tf->nsect = qc->tag << 3;
- tf->hob_feature = (size / 512) >> 8;
- tf->feature = size / 512;
+ tf->hob_feature = (scmd->device->sector_size / 512) >> 8;
+ tf->feature = scmd->device->sector_size / 512;
tf->auxiliary = 1;
} else {
tf->protocol = ATA_PROT_DMA;
tf->hob_feature = 0;
tf->feature = ATA_DSM_TRIM;
- tf->hob_nsect = (size / 512) >> 8;
- tf->nsect = size / 512;
+ tf->hob_nsect = (scmd->device->sector_size / 512) >> 8;
+ tf->nsect = scmd->device->sector_size / 512;
tf->command = ATA_CMD_DSM;
}
@@ -3483,18 +3372,6 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
ata_qc_set_pc_nbytes(qc);
return 0;
-
-invalid_fld:
- ata_scsi_set_invalid_field(dev, scmd, fp, bp);
- return 1;
-invalid_param_len:
- /* "Parameter list length error" */
- ata_scsi_set_sense(dev, scmd, ILLEGAL_REQUEST, 0x1a, 0x0);
- return 1;
-invalid_opcode:
- /* "Invalid command operation code" */
- ata_scsi_set_sense(dev, scmd, ILLEGAL_REQUEST, 0x20, 0x0);
- return 1;
}
/**
@@ -4087,9 +3964,6 @@ static inline ata_xlat_func_t ata_get_xlat_func(struct ata_device *dev, u8 cmd)
case WRITE_16:
return ata_scsi_rw_xlat;
- case WRITE_SAME_16:
- return ata_scsi_write_same_xlat;
-
case SYNCHRONIZE_CACHE:
if (ata_try_flush_cache(dev))
return ata_scsi_flush_xlat;
@@ -4116,6 +3990,9 @@ static inline ata_xlat_func_t ata_get_xlat_func(struct ata_device *dev, u8 cmd)
case START_STOP:
return ata_scsi_start_stop_xlat;
+
+ case LINUX_VS_TRIM:
+ return ata_scsi_trim_xlat;
}
return NULL;
--
2.11.0
^ permalink raw reply related
* [PATCH 3/7] libata: remove SCT WRITE SAME support
From: Christoph Hellwig @ 2017-03-20 20:43 UTC (permalink / raw)
To: tj, martin.petersen, axboe; +Cc: linux-ide, linux-scsi, linux-block
In-Reply-To: <20170320204319.12628-1-hch@lst.de>
This was already disabled a while ago because it caused I/O errors,
and it's severely getting in the way of the discard / write zeroes
rework.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/ata/libata-scsi.c | 128 +++++++++++-----------------------------------
1 file changed, 29 insertions(+), 99 deletions(-)
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 1ac70744ae7b..b93d7e33789a 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3393,46 +3393,6 @@ static size_t ata_format_dsm_trim_descr(struct scsi_cmnd *cmd, u32 trmax,
}
/**
- * ata_format_dsm_trim_descr() - SATL Write Same to ATA SCT Write Same
- * @cmd: SCSI command being translated
- * @lba: Starting sector
- * @num: Number of sectors to be zero'd.
- *
- * Rewrite the WRITE SAME payload to be an SCT Write Same formatted
- * descriptor.
- * NOTE: Writes a pattern (0's) in the foreground.
- *
- * Return: Number of bytes copied into sglist.
- */
-static size_t ata_format_sct_write_same(struct scsi_cmnd *cmd, u64 lba, u64 num)
-{
- struct scsi_device *sdp = cmd->device;
- size_t len = sdp->sector_size;
- size_t r;
- u16 *buf;
- unsigned long flags;
-
- spin_lock_irqsave(&ata_scsi_rbuf_lock, flags);
- buf = ((void *)ata_scsi_rbuf);
-
- put_unaligned_le16(0x0002, &buf[0]); /* SCT_ACT_WRITE_SAME */
- put_unaligned_le16(0x0101, &buf[1]); /* WRITE PTRN FG */
- put_unaligned_le64(lba, &buf[2]);
- put_unaligned_le64(num, &buf[6]);
- put_unaligned_le32(0u, &buf[10]); /* pattern */
-
- WARN_ON(len > ATA_SCSI_RBUF_SIZE);
-
- if (len > ATA_SCSI_RBUF_SIZE)
- len = ATA_SCSI_RBUF_SIZE;
-
- r = sg_copy_from_buffer(scsi_sglist(cmd), scsi_sg_count(cmd), buf, len);
- spin_unlock_irqrestore(&ata_scsi_rbuf_lock, flags);
-
- return r;
-}
-
-/**
* ata_scsi_write_same_xlat() - SATL Write Same to ATA SCT Write Same
* @qc: Command to be translated
*
@@ -3468,26 +3428,17 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
}
scsi_16_lba_len(cdb, &block, &n_block);
- if (unmap) {
- /* If trim is not enabled the cmd is invalid. */
- if ((dev->horkage & ATA_HORKAGE_NOTRIM) ||
- !ata_id_has_trim(dev->id)) {
- fp = 1;
- bp = 3;
- goto invalid_fld;
- }
- /* If the request is too large the cmd is invalid */
- if (n_block > 0xffff * trmax) {
- fp = 2;
- goto invalid_fld;
- }
- } else {
- /* If write same is not available the cmd is invalid */
- if (!ata_id_sct_write_same(dev->id)) {
- fp = 1;
- bp = 3;
- goto invalid_fld;
- }
+ if (!unmap ||
+ (dev->horkage & ATA_HORKAGE_NOTRIM) ||
+ !ata_id_has_trim(dev->id)) {
+ fp = 1;
+ bp = 3;
+ goto invalid_fld;
+ }
+ /* If the request is too large the cmd is invalid */
+ if (n_block > 0xffff * trmax) {
+ fp = 2;
+ goto invalid_fld;
}
/*
@@ -3502,49 +3453,28 @@ static unsigned int ata_scsi_write_same_xlat(struct ata_queued_cmd *qc)
* For DATA SET MANAGEMENT TRIM in ACS-2 nsect (aka count)
* is defined as number of 512 byte blocks to be transferred.
*/
- if (unmap) {
- size = ata_format_dsm_trim_descr(scmd, trmax, block, n_block);
- if (size != len)
- goto invalid_param_len;
- if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) {
- /* Newer devices support queued TRIM commands */
- tf->protocol = ATA_PROT_NCQ;
- tf->command = ATA_CMD_FPDMA_SEND;
- tf->hob_nsect = ATA_SUBCMD_FPDMA_SEND_DSM & 0x1f;
- tf->nsect = qc->tag << 3;
- tf->hob_feature = (size / 512) >> 8;
- tf->feature = size / 512;
+ size = ata_format_dsm_trim_descr(scmd, trmax, block, n_block);
+ if (size != len)
+ goto invalid_param_len;
- tf->auxiliary = 1;
- } else {
- tf->protocol = ATA_PROT_DMA;
- tf->hob_feature = 0;
- tf->feature = ATA_DSM_TRIM;
- tf->hob_nsect = (size / 512) >> 8;
- tf->nsect = size / 512;
- tf->command = ATA_CMD_DSM;
- }
- } else {
- size = ata_format_sct_write_same(scmd, block, n_block);
- if (size != len)
- goto invalid_param_len;
+ if (ata_ncq_enabled(dev) && ata_fpdma_dsm_supported(dev)) {
+ /* Newer devices support queued TRIM commands */
+ tf->protocol = ATA_PROT_NCQ;
+ tf->command = ATA_CMD_FPDMA_SEND;
+ tf->hob_nsect = ATA_SUBCMD_FPDMA_SEND_DSM & 0x1f;
+ tf->nsect = qc->tag << 3;
+ tf->hob_feature = (size / 512) >> 8;
+ tf->feature = size / 512;
- tf->hob_feature = 0;
- tf->feature = 0;
- tf->hob_nsect = 0;
- tf->nsect = 1;
- tf->lbah = 0;
- tf->lbam = 0;
- tf->lbal = ATA_CMD_STANDBYNOW1;
- tf->hob_lbah = 0;
- tf->hob_lbam = 0;
- tf->hob_lbal = 0;
- tf->device = ATA_CMD_STANDBYNOW1;
+ tf->auxiliary = 1;
+ } else {
tf->protocol = ATA_PROT_DMA;
- tf->command = ATA_CMD_WRITE_LOG_DMA_EXT;
- if (unlikely(dev->flags & ATA_DFLAG_PIO))
- tf->command = ATA_CMD_WRITE_LOG_EXT;
+ tf->hob_feature = 0;
+ tf->feature = ATA_DSM_TRIM;
+ tf->hob_nsect = (size / 512) >> 8;
+ tf->nsect = size / 512;
+ tf->command = ATA_CMD_DSM;
}
tf->flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE | ATA_TFLAG_LBA48 |
--
2.11.0
^ permalink raw reply related
* [PATCH 2/7] sd: provide a new ata trim provisioning mode
From: Christoph Hellwig @ 2017-03-20 20:43 UTC (permalink / raw)
To: tj, martin.petersen, axboe; +Cc: linux-ide, linux-scsi, linux-block
In-Reply-To: <20170320204319.12628-1-hch@lst.de>
This uses a vendor specific command to pass the ATA TRIM payload to
libata without having to rewrite it in place. Support for it is
indicated by a new flag in struct scsi_device that libata will set
in its slave_configure routine. A second flag indicates if TRIM
will reliably zero data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/scsi/sd.c | 60 +++++++++++++++++++++++++++++++++++++++++++---
drivers/scsi/sd.h | 1 +
include/scsi/scsi_device.h | 2 ++
include/scsi/scsi_proto.h | 3 +++
4 files changed, 63 insertions(+), 3 deletions(-)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index b853f91fb3da..cc8684818305 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -371,6 +371,7 @@ static const char *lbp_mode[] = {
[SD_LBP_WS16] = "writesame_16",
[SD_LBP_WS10] = "writesame_10",
[SD_LBP_ZERO] = "writesame_zero",
+ [SD_LBP_ATA_TRIM] = "ata_trim",
[SD_LBP_DISABLE] = "disabled",
};
@@ -411,7 +412,7 @@ provisioning_mode_store(struct device *dev, struct device_attribute *attr,
sd_config_discard(sdkp, SD_LBP_ZERO);
else if (!strncmp(buf, lbp_mode[SD_LBP_DISABLE], 20))
sd_config_discard(sdkp, SD_LBP_DISABLE);
- else
+ else /* we don't allow manual setting of SD_LBP_ATA_TRIM */
return -EINVAL;
return count;
@@ -653,7 +654,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
* lead to data corruption. If LBPRZ is not set, we honor the
* device preference.
*/
- if (sdkp->lbprz) {
+ if (sdkp->lbprz || sdkp->device->ata_trim) {
q->limits.discard_alignment = 0;
q->limits.discard_granularity = logical_block_size;
} else {
@@ -695,6 +696,12 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
(u32)SD_MAX_WS10_BLOCKS);
q->limits.discard_zeroes_data = 1;
break;
+
+ case SD_LBP_ATA_TRIM:
+ max_blocks = 65535 * (512 / sizeof(__le64));
+ if (sdkp->device->ata_trim_zeroes_data)
+ q->limits.discard_zeroes_data = 1;
+ break;
}
blk_queue_max_discard_sectors(q, max_blocks * (logical_block_size >> 9));
@@ -794,6 +801,49 @@ static int sd_setup_write_same10_cmnd(struct scsi_cmnd *cmd, bool unmap)
return scsi_init_io(cmd);
}
+static int sd_setup_ata_trim_cmnd(struct scsi_cmnd *cmd)
+{
+ struct scsi_device *sdp = cmd->device;
+ struct request *rq = cmd->request;
+ u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+ u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+ u32 data_len = sdp->sector_size, i;
+ __le64 *buf;
+
+ rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+ if (!rq->special_vec.bv_page)
+ return BLKPREP_DEFER;
+ rq->special_vec.bv_offset = 0;
+ rq->special_vec.bv_len = data_len;
+ rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ /*
+ * Use the Linux Vendor Specific TRIM command to pass the TRIM payload
+ * to libata.
+ */
+ cmd->cmd_len = 10;
+ cmd->cmnd[0] = LINUX_VS_TRIM;
+ cmd->cmnd[8] = data_len;
+
+ buf = page_address(rq->special_vec.bv_page);
+ for (i = 0; i < (data_len >> 3); i++) {
+ u64 n = min(nr_sectors, 0xffffu);
+
+ buf[i] = cpu_to_le64(sector | (n << 48));
+ if (nr_sectors <= 0xffff)
+ break;
+ sector += 0xffff;
+ nr_sectors -= 0xffff;
+ }
+
+ cmd->allowed = SD_MAX_RETRIES;
+ cmd->transfersize = data_len;
+ rq->timeout = SD_TIMEOUT;
+ scsi_req(rq)->resid_len = data_len;
+
+ return scsi_init_io(cmd);
+}
+
static void sd_config_write_same(struct scsi_disk *sdkp)
{
struct request_queue *q = sdkp->disk->queue;
@@ -1168,6 +1218,8 @@ static int sd_init_command(struct scsi_cmnd *cmd)
return sd_setup_write_same10_cmnd(cmd, true);
case SD_LBP_ZERO:
return sd_setup_write_same10_cmnd(cmd, false);
+ case SD_LBP_ATA_TRIM:
+ return sd_setup_ata_trim_cmnd(cmd);
default:
return BLKPREP_INVALID;
}
@@ -2739,7 +2791,9 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
sdkp->max_xfer_blocks = get_unaligned_be32(&buffer[8]);
sdkp->opt_xfer_blocks = get_unaligned_be32(&buffer[12]);
- if (buffer[3] == 0x3c) {
+ if (sdkp->device->ata_trim) {
+ sd_config_discard(sdkp, SD_LBP_ATA_TRIM);
+ } else if (buffer[3] == 0x3c) {
unsigned int lba_count, desc_count;
sdkp->max_ws_blocks = (u32)get_unaligned_be64(&buffer[36]);
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 4dac35e96a75..711d48cea5d7 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -56,6 +56,7 @@ enum {
SD_LBP_WS16, /* Use WRITE SAME(16) with UNMAP bit */
SD_LBP_WS10, /* Use WRITE SAME(10) with UNMAP bit */
SD_LBP_ZERO, /* Use WRITE SAME(10) with zero payload */
+ SD_LBP_ATA_TRIM, /* generate an ATA TRIM payload for libata */
SD_LBP_DISABLE, /* Discard disabled due to failed cmd */
};
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 080c7ce9bae8..7b1450b1b130 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -182,6 +182,8 @@ struct scsi_device {
unsigned broken_fua:1; /* Don't set FUA bit */
unsigned lun_in_cdb:1; /* Store LUN bits in CDB[1] */
unsigned synchronous_alua:1; /* Synchronous ALUA commands */
+ unsigned ata_trim:1; /* use ATA TRIM payload for discard */
+ unsigned ata_trim_zeroes_data:1;/* ATA TRIM zeroes discard blocks */
atomic_t disk_events_disable_depth; /* disable depth for disk events */
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 6ba66e01f6df..033b5662a5f5 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -169,6 +169,9 @@
/* Vendor specific CDBs start here */
#define VENDOR_SPECIFIC_CDB 0xc0
+/* used to pass the TRIM payload to libata without rewriting it: */
+#define LINUX_VS_TRIM VENDOR_SPECIFIC_CDB
+
/*
* SCSI command lengths
*/
--
2.11.0
^ permalink raw reply related
* [PATCH 1/7] sd: split sd_setup_discard_cmnd
From: Christoph Hellwig @ 2017-03-20 20:43 UTC (permalink / raw)
To: tj, martin.petersen, axboe; +Cc: linux-ide, linux-scsi, linux-block
In-Reply-To: <20170320204319.12628-1-hch@lst.de>
Split sd_setup_discard_cmnd into one function per provisioning type. While
this creates some very slight duplication of boilerplate code it keeps the
code modular for additions of new provisioning types, and for reusing the
write same functions for the upcoming scsi implementation of the Write Zeroes
operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/scsi/sd.c | 153 ++++++++++++++++++++++++++++++------------------------
1 file changed, 84 insertions(+), 69 deletions(-)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fcfeddc79331..b853f91fb3da 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -701,93 +701,97 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
}
-/**
- * sd_setup_discard_cmnd - unmap blocks on thinly provisioned device
- * @sdp: scsi device to operate on
- * @rq: Request to prepare
- *
- * Will issue either UNMAP or WRITE SAME(16) depending on preference
- * indicated by target device.
- **/
-static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
+static int sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
{
- struct request *rq = cmd->request;
struct scsi_device *sdp = cmd->device;
- struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
- sector_t sector = blk_rq_pos(rq);
- unsigned int nr_sectors = blk_rq_sectors(rq);
- unsigned int len;
- int ret;
+ struct request *rq = cmd->request;
+ u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+ u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+ unsigned int data_len = 24;
char *buf;
- struct page *page;
- sector >>= ilog2(sdp->sector_size) - 9;
- nr_sectors >>= ilog2(sdp->sector_size) - 9;
-
- page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
- if (!page)
+ rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+ if (!rq->special_vec.bv_page)
return BLKPREP_DEFER;
+ rq->special_vec.bv_offset = 0;
+ rq->special_vec.bv_len = data_len;
+ rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
- switch (sdkp->provisioning_mode) {
- case SD_LBP_UNMAP:
- buf = page_address(page);
-
- cmd->cmd_len = 10;
- cmd->cmnd[0] = UNMAP;
- cmd->cmnd[8] = 24;
-
- put_unaligned_be16(6 + 16, &buf[0]);
- put_unaligned_be16(16, &buf[2]);
- put_unaligned_be64(sector, &buf[8]);
- put_unaligned_be32(nr_sectors, &buf[16]);
+ cmd->cmd_len = 10;
+ cmd->cmnd[0] = UNMAP;
+ cmd->cmnd[8] = 24;
- len = 24;
- break;
+ buf = page_address(rq->special_vec.bv_page);
+ put_unaligned_be16(6 + 16, &buf[0]);
+ put_unaligned_be16(16, &buf[2]);
+ put_unaligned_be64(sector, &buf[8]);
+ put_unaligned_be32(nr_sectors, &buf[16]);
- case SD_LBP_WS16:
- cmd->cmd_len = 16;
- cmd->cmnd[0] = WRITE_SAME_16;
- cmd->cmnd[1] = 0x8; /* UNMAP */
- put_unaligned_be64(sector, &cmd->cmnd[2]);
- put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
+ cmd->allowed = SD_MAX_RETRIES;
+ cmd->transfersize = data_len;
+ rq->timeout = SD_TIMEOUT;
+ scsi_req(rq)->resid_len = data_len;
- len = sdkp->device->sector_size;
- break;
+ return scsi_init_io(cmd);
+}
- case SD_LBP_WS10:
- case SD_LBP_ZERO:
- cmd->cmd_len = 10;
- cmd->cmnd[0] = WRITE_SAME;
- if (sdkp->provisioning_mode == SD_LBP_WS10)
- cmd->cmnd[1] = 0x8; /* UNMAP */
- put_unaligned_be32(sector, &cmd->cmnd[2]);
- put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
+static int sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd)
+{
+ struct scsi_device *sdp = cmd->device;
+ struct request *rq = cmd->request;
+ u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+ u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+ u32 data_len = sdp->sector_size;
- len = sdkp->device->sector_size;
- break;
+ rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+ if (!rq->special_vec.bv_page)
+ return BLKPREP_DEFER;
+ rq->special_vec.bv_offset = 0;
+ rq->special_vec.bv_len = data_len;
+ rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
- default:
- ret = BLKPREP_INVALID;
- goto out;
- }
+ cmd->cmd_len = 16;
+ cmd->cmnd[0] = WRITE_SAME_16;
+ cmd->cmnd[1] = 0x8; /* UNMAP */
+ put_unaligned_be64(sector, &cmd->cmnd[2]);
+ put_unaligned_be32(nr_sectors, &cmd->cmnd[10]);
+ cmd->allowed = SD_MAX_RETRIES;
+ cmd->transfersize = data_len;
rq->timeout = SD_TIMEOUT;
+ scsi_req(rq)->resid_len = data_len;
- cmd->transfersize = len;
- cmd->allowed = SD_MAX_RETRIES;
+ return scsi_init_io(cmd);
+}
- rq->special_vec.bv_page = page;
- rq->special_vec.bv_offset = 0;
- rq->special_vec.bv_len = len;
+static int sd_setup_write_same10_cmnd(struct scsi_cmnd *cmd, bool unmap)
+{
+ struct scsi_device *sdp = cmd->device;
+ struct request *rq = cmd->request;
+ u64 sector = blk_rq_pos(rq) >> (ilog2(sdp->sector_size) - 9);
+ u32 nr_sectors = blk_rq_sectors(rq) >> (ilog2(sdp->sector_size) - 9);
+ u32 data_len = sdp->sector_size;
+ rq->special_vec.bv_page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
+ if (!rq->special_vec.bv_page)
+ return BLKPREP_DEFER;
+ rq->special_vec.bv_offset = 0;
+ rq->special_vec.bv_len = data_len;
rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
- scsi_req(rq)->resid_len = len;
- ret = scsi_init_io(cmd);
-out:
- if (ret != BLKPREP_OK)
- __free_page(page);
- return ret;
+ cmd->cmd_len = 10;
+ cmd->cmnd[0] = WRITE_SAME;
+ if (unmap)
+ cmd->cmnd[1] = 0x8; /* UNMAP */
+ put_unaligned_be32(sector, &cmd->cmnd[2]);
+ put_unaligned_be16(nr_sectors, &cmd->cmnd[7]);
+
+ cmd->allowed = SD_MAX_RETRIES;
+ cmd->transfersize = data_len;
+ rq->timeout = SD_TIMEOUT;
+ scsi_req(rq)->resid_len = data_len;
+
+ return scsi_init_io(cmd);
}
static void sd_config_write_same(struct scsi_disk *sdkp)
@@ -1155,7 +1159,18 @@ static int sd_init_command(struct scsi_cmnd *cmd)
switch (req_op(rq)) {
case REQ_OP_DISCARD:
- return sd_setup_discard_cmnd(cmd);
+ switch (scsi_disk(rq->rq_disk)->provisioning_mode) {
+ case SD_LBP_UNMAP:
+ return sd_setup_unmap_cmnd(cmd);
+ case SD_LBP_WS16:
+ return sd_setup_write_same16_cmnd(cmd);
+ case SD_LBP_WS10:
+ return sd_setup_write_same10_cmnd(cmd, true);
+ case SD_LBP_ZERO:
+ return sd_setup_write_same10_cmnd(cmd, false);
+ default:
+ return BLKPREP_INVALID;
+ }
case REQ_OP_WRITE_SAME:
return sd_setup_write_same_cmnd(cmd);
case REQ_OP_FLUSH:
--
2.11.0
^ permalink raw reply related
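The two pieces of arithmetic that every split-out helper in patch 1/7 repeats can be tried in user space. The following is only a sketch, not kernel code: `put_be16()`, `put_be32()`, `put_be64()`, `to_logical_blocks()` and `build_unmap_param_list()` are hypothetical names that stand in for the kernel's `put_unaligned_be*()` helpers, the `>> (ilog2(sdp->sector_size) - 9)` shift, and the buffer setup in sd_setup_unmap_cmnd().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: store big-endian values one byte at a time. */
static void put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)v;
}

static void put_be32(uint8_t *p, uint32_t v)
{
	put_be16(p, (uint16_t)(v >> 16));
	put_be16(p + 2, (uint16_t)v);
}

static void put_be64(uint8_t *p, uint64_t v)
{
	put_be32(p, (uint32_t)(v >> 32));
	put_be32(p + 4, (uint32_t)v);
}

/*
 * Convert a position or length given in 512-byte units into logical
 * blocks. block_size must be a power of two and at least 512.
 */
static uint64_t to_logical_blocks(uint64_t units512, uint32_t block_size)
{
	unsigned int shift = 0;

	assert(block_size >= 512 && (block_size & (block_size - 1)) == 0);
	while ((512u << shift) < block_size)
		shift++;
	return units512 >> shift;
}

/*
 * Build the 24-byte UNMAP parameter list with a single descriptor,
 * using the same offsets as the patch: an 8-byte header followed by
 * one 16-byte block descriptor.
 */
static void build_unmap_param_list(uint8_t buf[24], uint64_t lba,
				   uint32_t nr_blocks)
{
	memset(buf, 0, 24);
	put_be16(&buf[0], 6 + 16);	/* data length after this field */
	put_be16(&buf[2], 16);		/* block descriptor data length */
	put_be64(&buf[8], lba);		/* first logical block to unmap */
	put_be32(&buf[16], nr_blocks);	/* number of logical blocks */
}
```

With a 4096-byte logical block size, a request that starts at 512-byte sector 16 maps to logical block 2, which is the value that would be passed as `lba` here.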
* support ranges TRIM for libata
From: Christoph Hellwig @ 2017-03-20 20:43 UTC (permalink / raw)
To: tj, martin.petersen, axboe; +Cc: linux-ide, linux-scsi, linux-block
This series implements ranged discard for ATA SSDs. Compared to the
initial NVMe support there are two things that complicate the ATA
support:
- ATA only supports 16-bit long ranges
- the whole mess of generating a SCSI command first and then
translating it to an ATA one.
This series adds support for limited range size to the block layer,
and stops translating discard commands - instead we add a new
Vendor Specific SCSI command that contains the TRIM payload when
the device asks for it.
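To illustrate the 16-bit range limit, here is a user-space sketch of how one discard can be cut into TRIM range entries, modelled on the loop in sd_setup_ata_trim_cmnd() from patch 2/7. The names `trim_fill_ranges()`, `TRIM_MAX_RANGE` and `TRIM_ENTRIES` are made up for this example, and the entries are left in host byte order, whereas the patch converts each one with cpu_to_le64().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRIM_MAX_RANGE	0xffffu		/* largest 16-bit sector count */
#define TRIM_ENTRIES	(512 / 8)	/* 64-bit entries per 512-byte block */

/*
 * Fill 'entries' with range entries covering [lba, lba + count).
 * Each entry keeps the start LBA in bits 0-47 and the sector count in
 * bits 48-63. Returns the number of entries used, or 0 if the request
 * does not fit into 'cap' entries (or count is 0).
 */
static size_t trim_fill_ranges(uint64_t *entries, size_t cap,
			       uint64_t lba, uint64_t count)
{
	size_t used = 0;

	assert(entries != NULL);
	while (count > 0) {
		uint64_t n = count < TRIM_MAX_RANGE ? count : TRIM_MAX_RANGE;

		if (used == cap)
			return 0;
		entries[used++] = (lba & 0xffffffffffffULL) | (n << 48);
		lba += n;
		count -= n;
	}
	return used;
}
```

One 512-byte payload block therefore covers at most 65535 * 64 sectors, which matches the `max_blocks` value that patch 2/7 sets for SD_LBP_ATA_TRIM.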
^ permalink raw reply
* [PATCH 4/4] blk-mq: streamline blk_mq_make_request
From: Christoph Hellwig @ 2017-03-20 20:39 UTC (permalink / raw)
To: axboe; +Cc: Bart.VanAssche, linux-block
In-Reply-To: <20170320203930.12533-1-hch@lst.de>
Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos. Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq.c | 67 +++++++++++++++++++++++-----------------------------------
1 file changed, 27 insertions(+), 40 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 48748cb799ed..18e449cc832f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,16 +1534,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
cookie = request_to_qc_t(data.hctx, rq);
+ plug = current->plug;
if (unlikely(is_flush_fua)) {
- if (q->elevator)
- goto elv_insert;
blk_mq_bio_to_request(rq, bio);
- blk_insert_flush(rq);
- goto run_queue;
- }
-
- plug = current->plug;
- if (plug && q->nr_hw_queues == 1) {
+ if (q->elevator) {
+ blk_mq_sched_insert_request(rq, false, true,
+ !is_sync || is_flush_fua, true);
+ } else {
+ blk_insert_flush(rq);
+ blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
+ }
+ } else if (plug && q->nr_hw_queues == 1) {
struct request *last = NULL;
blk_mq_bio_to_request(rq, bio);
@@ -1562,8 +1563,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
else
last = list_entry_rq(plug->mq_list.prev);
- blk_mq_put_ctx(data.ctx);
-
if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
blk_flush_plug_list(plug, false);
@@ -1571,56 +1570,44 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
}
list_add_tail(&rq->queuelist, &plug->mq_list);
- goto done;
- } else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
- struct request *old_rq = NULL;
-
+ } else if (plug && !blk_queue_nomerges(q)) {
blk_mq_bio_to_request(rq, bio);
/*
* We do limited plugging. If the bio can be merged, do that.
* Otherwise the existing request in the plug list will be
* issued. So the plug list will have one request at most
+ *
+ * The plug list might get flushed before this. If that happens,
+ * the plug list is empty and same_queue_rq is invalid.
*/
- if (plug) {
- /*
- * The plug list might get flushed before this. If that
- * happens, same_queue_rq is invalid and plug list is
- * empty
- */
- if (same_queue_rq && !list_empty(&plug->mq_list)) {
- old_rq = same_queue_rq;
- list_del_init(&old_rq->queuelist);
- }
- list_add_tail(&rq->queuelist, &plug->mq_list);
- } else /* is_sync */
- old_rq = rq;
- blk_mq_put_ctx(data.ctx);
- if (old_rq)
- blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
- goto done;
- }
+ if (!list_empty(&plug->mq_list))
+ list_del_init(&same_queue_rq->queuelist);
+ else
+ same_queue_rq = NULL;
- if (q->elevator) {
-elv_insert:
- blk_mq_put_ctx(data.ctx);
+ list_add_tail(&rq->queuelist, &plug->mq_list);
+ if (same_queue_rq)
+ blk_mq_try_issue_directly(data.hctx, same_queue_rq,
+ &cookie);
+ } else if (is_sync) {
+ blk_mq_bio_to_request(rq, bio);
+ blk_mq_try_issue_directly(data.hctx, rq, &cookie);
+ } else if (q->elevator) {
blk_mq_bio_to_request(rq, bio);
blk_mq_sched_insert_request(rq, false, true,
!is_sync || is_flush_fua, true);
- goto done;
- }
- if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
+ } else if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
/*
* For a SYNC request, send it to the hardware immediately. For
* an ASYNC request, just ensure that we run it later on. The
* latter allows for merging opportunities and more efficient
* dispatching.
*/
-run_queue:
blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
}
+
blk_mq_put_ctx(data.ctx);
-done:
return cookie;
}
--
2.11.0
^ permalink raw reply related
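The order of the if/else chain that patch 4/4 leaves in blk_mq_make_request() can be modelled as a small pure function. This is only an illustration of the branch order as it appears in the diff above; `struct submit_state`, `enum issue_path` and `classify()` are hypothetical and exist nowhere in the kernel.

```c
#include <stdbool.h>

/* Which branch of the chain a request would take. */
enum issue_path {
	PATH_FLUSH_SCHED,	/* flush/FUA, elevator attached */
	PATH_FLUSH_DIRECT,	/* flush/FUA, no elevator */
	PATH_PLUG_SINGLE_QUEUE,	/* plug present, one hardware queue */
	PATH_PLUG_LIMITED,	/* plug present, merging allowed */
	PATH_SYNC_DIRECT,	/* sync I/O, try direct issue */
	PATH_SCHED_INSERT,	/* hand over to the I/O scheduler */
	PATH_SW_QUEUE,		/* merge into or run the software queue */
};

/* The inputs that the chain tests, collected in one place. */
struct submit_state {
	bool is_flush_fua;
	bool has_elevator;
	bool has_plug;
	bool nomerges;
	bool is_sync;
	unsigned int nr_hw_queues;
};

/* Mirror the branch order of the patched function, top to bottom. */
static enum issue_path classify(const struct submit_state *s)
{
	if (s->is_flush_fua)
		return s->has_elevator ? PATH_FLUSH_SCHED : PATH_FLUSH_DIRECT;
	if (s->has_plug && s->nr_hw_queues == 1)
		return PATH_PLUG_SINGLE_QUEUE;
	if (s->has_plug && !s->nomerges)
		return PATH_PLUG_LIMITED;
	if (s->is_sync)
		return PATH_SYNC_DIRECT;
	if (s->has_elevator)
		return PATH_SCHED_INSERT;
	return PATH_SW_QUEUE;
}
```

Because every branch falls through to the end of the function, the blk_mq_put_ctx() call can sit once after the chain, which is the reason the commit message gives for the CPU being pinned slightly longer in some cases.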
* [PATCH 3/4] blk-mq: improve blk_mq_try_issue_directly
From: Christoph Hellwig @ 2017-03-20 20:39 UTC (permalink / raw)
To: axboe; +Cc: Bart.VanAssche, linux-block
In-Reply-To: <20170320203930.12533-1-hch@lst.de>
Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
new wrapper that takes care of RCU / SRCU locking to avoid having
boilerplate code in the caller which would get duplicated with new callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 53e49a3f6f0a..48748cb799ed 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1434,7 +1434,7 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
}
-static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
+static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
{
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
@@ -1478,13 +1478,27 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
blk_mq_sched_insert_request(rq, false, true, true, false);
}
+static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+ struct request *rq, blk_qc_t *cookie)
+{
+ if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
+ rcu_read_lock();
+ __blk_mq_try_issue_directly(rq, cookie);
+ rcu_read_unlock();
+ } else {
+ unsigned int srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu);
+ __blk_mq_try_issue_directly(rq, cookie);
+ srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx);
+ }
+}
+
static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
{
const int is_sync = op_is_sync(bio->bi_opf);
const int is_flush_fua = op_is_flush(bio->bi_opf);
struct blk_mq_alloc_data data = { .flags = 0 };
struct request *rq;
- unsigned int request_count = 0, srcu_idx;
+ unsigned int request_count = 0;
struct blk_plug *plug;
struct request *same_queue_rq = NULL;
blk_qc_t cookie;
@@ -1582,18 +1596,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
} else /* is_sync */
old_rq = rq;
blk_mq_put_ctx(data.ctx);
- if (!old_rq)
- goto done;
-
- if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) {
- rcu_read_lock();
- blk_mq_try_issue_directly(old_rq, &cookie);
- rcu_read_unlock();
- } else {
- srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu);
- blk_mq_try_issue_directly(old_rq, &cookie);
- srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx);
- }
+ if (old_rq)
+ blk_mq_try_issue_directly(data.hctx, old_rq, &cookie);
goto done;
}
--
2.11.0
^ permalink raw reply related
* [PATCH 2/4] blk-mq: merge mq and sq make_request instances
From: Christoph Hellwig @ 2017-03-20 20:39 UTC (permalink / raw)
To: axboe; +Cc: Bart.VanAssche, linux-block
In-Reply-To: <20170320203930.12533-1-hch@lst.de>
They are mostly the same code anyway - there is just one small conditional
for the plug case that differs between the two variants.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq.c | 164 +++++++++++----------------------------------------------
1 file changed, 31 insertions(+), 133 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index acf0ddf4af52..53e49a3f6f0a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1478,11 +1478,6 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie)
blk_mq_sched_insert_request(rq, false, true, true, false);
}
-/*
- * Multiple hardware queue variant. This will not use per-process plugs,
- * but will attempt to bypass the hctx queueing if we can go straight to
- * hardware for SYNC IO.
- */
static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
{
const int is_sync = op_is_sync(bio->bi_opf);
@@ -1534,7 +1529,36 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
}
plug = current->plug;
- if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
+ if (plug && q->nr_hw_queues == 1) {
+ struct request *last = NULL;
+
+ blk_mq_bio_to_request(rq, bio);
+
+ /*
+ * @request_count may become stale because of schedule
+ * out, so check the list again.
+ */
+ if (list_empty(&plug->mq_list))
+ request_count = 0;
+ else if (blk_queue_nomerges(q))
+ request_count = blk_plug_queued_count(q);
+
+ if (!request_count)
+ trace_block_plug(q);
+ else
+ last = list_entry_rq(plug->mq_list.prev);
+
+ blk_mq_put_ctx(data.ctx);
+
+ if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
+ blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
+ blk_flush_plug_list(plug, false);
+ trace_block_plug(q);
+ }
+
+ list_add_tail(&rq->queuelist, &plug->mq_list);
+ goto done;
+ } else if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
struct request *old_rq = NULL;
blk_mq_bio_to_request(rq, bio);
@@ -1596,119 +1620,6 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
return cookie;
}
-/*
- * Single hardware queue variant. This will attempt to use any per-process
- * plug for merging and IO deferral.
- */
-static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
-{
- const int is_sync = op_is_sync(bio->bi_opf);
- const int is_flush_fua = op_is_flush(bio->bi_opf);
- struct blk_plug *plug;
- unsigned int request_count = 0;
- struct blk_mq_alloc_data data = { .flags = 0 };
- struct request *rq;
- blk_qc_t cookie;
- unsigned int wb_acct;
-
- blk_queue_bounce(q, &bio);
-
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
- bio_io_error(bio);
- return BLK_QC_T_NONE;
- }
-
- blk_queue_split(q, &bio, q->bio_split);
-
- if (!is_flush_fua && !blk_queue_nomerges(q)) {
- if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
- return BLK_QC_T_NONE;
- } else
- request_count = blk_plug_queued_count(q);
-
- if (blk_mq_sched_bio_merge(q, bio))
- return BLK_QC_T_NONE;
-
- wb_acct = wbt_wait(q->rq_wb, bio, NULL);
-
- trace_block_getrq(q, bio, bio->bi_opf);
-
- rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
- if (unlikely(!rq)) {
- __wbt_done(q->rq_wb, wb_acct);
- return BLK_QC_T_NONE;
- }
-
- wbt_track(&rq->issue_stat, wb_acct);
-
- cookie = request_to_qc_t(data.hctx, rq);
-
- if (unlikely(is_flush_fua)) {
- if (q->elevator)
- goto elv_insert;
- blk_mq_bio_to_request(rq, bio);
- blk_insert_flush(rq);
- goto run_queue;
- }
-
- /*
- * A task plug currently exists. Since this is completely lockless,
- * utilize that to temporarily store requests until the task is
- * either done or scheduled away.
- */
- plug = current->plug;
- if (plug) {
- struct request *last = NULL;
-
- blk_mq_bio_to_request(rq, bio);
-
- /*
- * @request_count may become stale because of schedule
- * out, so check the list again.
- */
- if (list_empty(&plug->mq_list))
- request_count = 0;
- if (!request_count)
- trace_block_plug(q);
- else
- last = list_entry_rq(plug->mq_list.prev);
-
- blk_mq_put_ctx(data.ctx);
-
- if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&
- blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {
- blk_flush_plug_list(plug, false);
- trace_block_plug(q);
- }
-
- list_add_tail(&rq->queuelist, &plug->mq_list);
- return cookie;
- }
-
- if (q->elevator) {
-elv_insert:
- blk_mq_put_ctx(data.ctx);
- blk_mq_bio_to_request(rq, bio);
- blk_mq_sched_insert_request(rq, false, true,
- !is_sync || is_flush_fua, true);
- goto done;
- }
- if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
- /*
- * For a SYNC request, send it to the hardware immediately. For
- * an ASYNC request, just ensure that we run it later on. The
- * latter allows for merging opportunities and more efficient
- * dispatching.
- */
-run_queue:
- blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
- }
-
- blk_mq_put_ctx(data.ctx);
-done:
- return cookie;
-}
-
void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
unsigned int hctx_idx)
{
@@ -2366,10 +2277,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
INIT_LIST_HEAD(&q->requeue_list);
spin_lock_init(&q->requeue_lock);
- if (q->nr_hw_queues > 1)
- blk_queue_make_request(q, blk_mq_make_request);
- else
- blk_queue_make_request(q, blk_sq_make_request);
+ blk_queue_make_request(q, blk_mq_make_request);
/*
* Do this after blk_queue_make_request() overrides it...
@@ -2717,16 +2625,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
set->nr_hw_queues = nr_hw_queues;
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_realloc_hw_ctxs(set, q);
-
- /*
- * Manually set the make_request_fn as blk_queue_make_request
- * resets a lot of the queue settings.
- */
- if (q->nr_hw_queues > 1)
- q->make_request_fn = blk_mq_make_request;
- else
- q->make_request_fn = blk_sq_make_request;
-
blk_mq_queue_reinit(q, cpu_online_mask);
}
--
2.11.0
^ permalink raw reply related
* [PATCH 1/4] blk-mq: remove BLK_MQ_F_DEFER_ISSUE
From: Christoph Hellwig @ 2017-03-20 20:39 UTC (permalink / raw)
To: axboe; +Cc: Bart.VanAssche, linux-block
In-Reply-To: <20170320203930.12533-1-hch@lst.de>
This flag was never used since it was introduced.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq.c | 8 +-------
include/linux/blk-mq.h | 1 -
2 files changed, 1 insertion(+), 8 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..acf0ddf4af52 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1534,13 +1534,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
}
plug = current->plug;
- /*
- * If the driver supports defer issued based on 'last', then
- * queue it up like normal since we can potentially save some
- * CPU this way.
- */
- if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
- !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
+ if (((plug && !blk_queue_nomerges(q)) || is_sync)) {
struct request *old_rq = NULL;
blk_mq_bio_to_request(rq, bio);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b296a9006117..5b3e201c8d4f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,7 +152,6 @@ enum {
BLK_MQ_F_SHOULD_MERGE = 1 << 0,
BLK_MQ_F_TAG_SHARED = 1 << 1,
BLK_MQ_F_SG_MERGE = 1 << 2,
- BLK_MQ_F_DEFER_ISSUE = 1 << 4,
BLK_MQ_F_BLOCKING = 1 << 5,
BLK_MQ_F_NO_SCHED = 1 << 6,
BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
--
2.11.0
^ permalink raw reply related
* unify and streamline the blk-mq make_request implementations V2
From: Christoph Hellwig @ 2017-03-20 20:39 UTC (permalink / raw)
To: axboe; +Cc: Bart.VanAssche, linux-block
A bunch of cleanups to get us a nice I/O submission path.
Changes since V1:
- rebase on top of the recent blk_mq_try_issue_directly changes
- incorporate comments from Bart
^ permalink raw reply
* Re: [PATCH RFC 00/14] Add the BFQ I/O Scheduler to blk-mq
From: Jens Axboe @ 2017-03-20 18:40 UTC (permalink / raw)
To: Bart Van Assche, paolo.valente@linaro.org,
linus.walleij@linaro.org
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
fchecconi@gmail.com, broonie@kernel.org,
avanzini.arianna@gmail.com, tj@kernel.org, ulf.hansson@linaro.org
In-Reply-To: <1489859192.2339.12.camel@sandisk.com>
On 03/18/2017 01:46 PM, Bart Van Assche wrote:
> On Sat, 2017-03-18 at 18:09 +0100, Linus Walleij wrote:
>> On Sat, Mar 18, 2017 at 11:52 AM, Paolo Valente
>> <paolo.valente@linaro.org> wrote:
>>>> Il giorno 14 mar 2017, alle ore 16:32, Bart Van Assche <bart.vanassche@sandisk.com> ha scritto:
>>>> (...) what should
>>>> a developer do who only has access to a small subset of all the storage
>>>> devices that are supported by the Linux kernel and hence who can not run the
>>>> benchmark against every supported storage device?
>>
>> Don't we use the community for that? We are dependent on people
>> downloading and testing our code eventually, I mean sure it's good if
>> we make some reasonable effort to test changes we do, but we are
>> only humans, and we get corrected by the experience of other humans.
>
> Hello Linus,
>
> Do you mean relying on the community to test other storage devices
> before or after a patch is upstream? Relying on the community to file
> bug reports after a patch is upstream would be wrong. The Linux kernel
> should not be used for experiments. As you know patches that are sent
> upstream should not introduce regressions.
I think there are two main aspects to this:
1) Stability issues
2) Performance issues
For stability issues, obviously we expect BFQ to be bug free when
merged. In practical matters, this means that it doesn't have any known
pending issues, since we obviously cannot guarantee that the code is Bug
Free in general.
From a performance perspective, using BFQ is absolutely known to
introduce regressions when used on certain types of storage. It works
well on single queue rotating devices. It'll tank your NVMe device
performance. I don't think this is necessarily a problem. By
default, BFQ will not be enabled anywhere. It's a scheduler that is
available in the system, and users can opt in if they desire to use BFQ.
I'm expecting distros to do the right thing with udev rules here.
> My primary concern about BFQ is that it is a very complicated I/O
> scheduler and also that the concepts used internally in that I/O
> scheduler are far away from the concepts we are used to when reasoning
> about I/O devices. I'm concerned that this will make the BFQ I/O
> scheduler hard to maintain.
That is also my main concern, which is why I'm trying to go through the
code and suggest areas where it can be improved. It'd be great if it were
more modular, for instance; it's somewhat cumbersome to wade through
nine thousand lines of code. It's my hope that we can improve this
aspect of it.
Understanding the actual algorithms is a separate issue. But in that
regard I do think that BFQ is more forgiving than CFQ, since there are
actual papers detailing how it works. If implemented as cleanly as
possible, we can't really make it any easier to understand. It's not a
trivial topic.
--
Jens Axboe
^ permalink raw reply