public inbox for linux-block@vger.kernel.org
* [PATCH V2 00/12] scsi-mq support for ZBC disks
@ 2017-09-07 16:16 Damien Le Moal
  2017-09-07 16:16 ` [PATCH V2 01/12] block: Fix declaration of blk-mq debugfs functions Damien Le Moal
                   ` (12 more replies)
  0 siblings, 13 replies; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

This series implements support for ZBC disks used through the scsi-mq I/O path.

The current scsi layer support for ZBC disks guarantees write request ordering
using a per-zone write lock which prevents issuing multiple write commands
simultaneously to the same zone; doing so avoids reordering of sequential
writes to sequential zones. This method is however ineffective when scsi-mq is
used with zoned block devices. This is due to the different execution model of
blk-mq, which passes a request to the scsi layer for dispatching only after the
request has been removed from the I/O scheduler queue. That is, by the time the
scsi layer tries to lock the target zone of the request, the request may
already be out of order and zone write locking fails to prevent that.

Various approaches have been tried to solve this problem. All of them had the
serious disadvantage of cluttering blk-mq code with zoned block device specific
conditions and processing. As such extensive changes can only turn into a
maintenance nightmare, a radically different solution is proposed here.

This series implements support in the form of a new "zoned" I/O scheduler
based on mq-deadline. Zoned is mostly identical to the mq-deadline scheduler;
it differs from mq-deadline only in the addition of a per-zone write locking
mechanism similar to the one implemented in sd_zbc.c. The zoned scheduler's
zone write locking mechanism is used for the exact same purpose as the one in
the scsi disk driver: limit writes per zone to one to avoid reordering. The
locking context however changes: it moves to the dispatch_request method of
the scheduler, so that the target zones of write requests can be locked before
the requests are issued. In effect, this results in the same behavior as the
legacy scsi path: sequential write ordering is preserved.
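The per-zone write locking described above can be sketched in plain C. This is
a simplified, single-threaded userspace model of the mechanism, not the kernel
code: the function and variable names (zones_wlock, zone_write_trylock, ...)
mirror the sd_zbc.c bitmap but are illustrative, and the kernel uses the atomic
test_and_set_bit()/clear_bit() bitops instead of the plain loads and stores
shown here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_ZONES	1024
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* One bit per zone: set while a write to that zone is in flight. */
static unsigned long zones_wlock[MAX_ZONES / BITS_PER_WORD];

/*
 * Try to take the write lock of a zone before dispatching a write
 * request to it (mimics test_and_set_bit(), without the atomicity).
 */
static bool zone_write_trylock(unsigned int zno)
{
	unsigned long mask = 1UL << (zno % BITS_PER_WORD);
	unsigned long *word = &zones_wlock[zno / BITS_PER_WORD];

	if (*word & mask)
		return false;	/* Zone busy: defer dispatch of the request */
	*word |= mask;
	return true;
}

/* Release the zone write lock on request completion (mimics clear_bit()). */
static void zone_write_unlock(unsigned int zno)
{
	zones_wlock[zno / BITS_PER_WORD] &= ~(1UL << (zno % BITS_PER_WORD));
}
```

With this model, a second write to a locked zone stays queued until the first
one completes, which is exactly how at most one write per zone is kept in
flight.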

The zoned scheduler is added as part of the scsi code under drivers/scsi. This
allows access to the information on the disk zones maintained within the
scsi_disk structure, as this information (e.g. zone types) cannot be retrieved
from the context of an I/O scheduler initialization method (the init_queue()
method).

The series patches are as follows:
- The first 2 patches fix exported symbol declarations to allow compiling an
  I/O scheduler outside of the block/ directory. No functional changes result
  from these patches.
- Patches 3 to 7 reorganize and clean up the scsi ZBC support to facilitate the
  introduction of the scheduler.
- Patch 8 is a bug fix.
- Patch 9 is an optimization for the legacy scsi path which avoids zone
  locking if a write request targets a conventional zone.
- Patch 10 disables zone write locking for disks accessed through scsi-mq.
- Patch 11 introduces the new function scsi_disk_from_queue().
- Patch 12 introduces the zoned blk-mq I/O scheduler.

Comments are as always very much appreciated.

Changes from v1:
* Addressed Bart's comments for the blk-mq patches (declarations files)
* Split (former) patch 4 into multiple patches to facilitate review
* Fixed scsi disk lookup from io scheduler by introducing
  scsi_disk_from_queue()

Damien Le Moal (12):
  block: Fix declaration of blk-mq debugfs functions
  block: Fix declaration of blk-mq scheduler functions
  scsi: sd_zbc: Move ZBC declarations to scsi_proto.h
  scsi: sd_zbc: Move zbc disk declarations to sd_zbc.h
  scsi: sd_zbc: Fix comments and indentation
  scsi: sd_zbc: Rearrange code
  scsi: sd_zbc.c: Use well defined macros
  scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics()
  scsi: sd_zbc: Limit zone write locking to sequential zones
  scsi: sd_zbc: Disable zone write locking with scsi-mq
  scsi: sd: Introduce scsi_disk_from_queue()
  scsi: Introduce ZBC disk I/O scheduler

 Documentation/block/zoned-iosched.txt |  48 ++
 block/Kconfig.iosched                 |  12 +
 block/blk-mq-debugfs.h                |  14 +-
 block/blk-mq-sched.h                  |  11 +-
 drivers/scsi/Makefile                 |   1 +
 drivers/scsi/sd.c                     |  33 ++
 drivers/scsi/sd.h                     |  54 +-
 drivers/scsi/sd_zbc.c                 | 236 +++++----
 drivers/scsi/sd_zbc.h                 |  91 ++++
 drivers/scsi/zoned_iosched.c          | 939 ++++++++++++++++++++++++++++++++++
 include/linux/blk-mq-debugfs.h        |  23 +
 include/linux/blk-mq-sched.h          |  14 +
 include/scsi/scsi_proto.h             |  45 +-
 13 files changed, 1354 insertions(+), 167 deletions(-)
 create mode 100644 Documentation/block/zoned-iosched.txt
 create mode 100644 drivers/scsi/sd_zbc.h
 create mode 100644 drivers/scsi/zoned_iosched.c
 create mode 100644 include/linux/blk-mq-debugfs.h
 create mode 100644 include/linux/blk-mq-sched.h

-- 
2.13.5

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH V2 01/12] block: Fix declaration of blk-mq debugfs functions
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:05   ` Johannes Thumshirn
  2017-09-07 16:16 ` [PATCH V2 02/12] block: Fix declaration of blk-mq scheduler functions Damien Le Moal
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

__blk_mq_debugfs_rq_show() and blk_mq_debugfs_rq_show() are exported
symbols but are only declared in the block internal file
block/blk-mq-debugfs.h, which is not cleanly accessible to files outside
of the block directory.
Move the declaration of these functions to the new file
include/linux/blk-mq-debugfs.h to make the declarations cleanly
available to other modules.

While at it, also move the definition of the blk_mq_debugfs_attr
structure to allow scheduler modules outside of the block directory to
define debugfs attributes.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-mq-debugfs.h         | 14 +-------------
 include/linux/blk-mq-debugfs.h | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/blk-mq-debugfs.h

diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index a182e6f97565..4ffeae84b83a 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -3,19 +3,7 @@
 
 #ifdef CONFIG_BLK_DEBUG_FS
 
-#include <linux/seq_file.h>
-
-struct blk_mq_debugfs_attr {
-	const char *name;
-	umode_t mode;
-	int (*show)(void *, struct seq_file *);
-	ssize_t (*write)(void *, const char __user *, size_t, loff_t *);
-	/* Set either .show or .seq_ops. */
-	const struct seq_operations *seq_ops;
-};
-
-int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
-int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
+#include <linux/blk-mq-debugfs.h>
 
 int blk_mq_debugfs_register(struct request_queue *q);
 void blk_mq_debugfs_unregister(struct request_queue *q);
diff --git a/include/linux/blk-mq-debugfs.h b/include/linux/blk-mq-debugfs.h
new file mode 100644
index 000000000000..211537f61dce
--- /dev/null
+++ b/include/linux/blk-mq-debugfs.h
@@ -0,0 +1,23 @@
+#ifndef BLK_MQ_DEBUGFS_H
+#define BLK_MQ_DEBUGFS_H
+
+#ifdef CONFIG_BLK_DEBUG_FS
+
+#include <linux/blkdev.h>
+#include <linux/seq_file.h>
+
+struct blk_mq_debugfs_attr {
+	const char *name;
+	umode_t mode;
+	int (*show)(void *, struct seq_file *);
+	ssize_t (*write)(void *, const char __user *, size_t, loff_t *);
+	/* Set either .show or .seq_ops. */
+	const struct seq_operations *seq_ops;
+};
+
+int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
+int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
+
+#endif
+
+#endif
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V2 02/12] block: Fix declaration of blk-mq scheduler functions
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
  2017-09-07 16:16 ` [PATCH V2 01/12] block: Fix declaration of blk-mq debugfs functions Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:05   ` Johannes Thumshirn
  2017-09-07 16:16 ` [PATCH V2 03/12] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h Damien Le Moal
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

The functions blk_mq_sched_free_hctx_data(), blk_mq_sched_try_merge(),
blk_mq_sched_try_insert_merge() and blk_mq_sched_request_inserted() are
all exported symbols but are declared only internally in
block/blk-mq-sched.h. Move these declarations to the new file
include/linux/blk-mq-sched.h to make them available to block scheduler
modules implemented outside of the block directory.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-mq-sched.h         | 11 +++--------
 include/linux/blk-mq-sched.h | 14 ++++++++++++++
 2 files changed, 17 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/blk-mq-sched.h

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 9267d0b7c197..f48f3c8edcb3 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -1,19 +1,14 @@
-#ifndef BLK_MQ_SCHED_H
-#define BLK_MQ_SCHED_H
+#ifndef INT_BLK_MQ_SCHED_H
+#define INT_BLK_MQ_SCHED_H
 
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
-void blk_mq_sched_free_hctx_data(struct request_queue *q,
-				 void (*exit)(struct blk_mq_hw_ctx *));
+#include <linux/blk-mq-sched.h>
 
 void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio);
 
-void blk_mq_sched_request_inserted(struct request *rq);
-bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-				struct request **merged_request);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
-bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
 void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
diff --git a/include/linux/blk-mq-sched.h b/include/linux/blk-mq-sched.h
new file mode 100644
index 000000000000..8ae1992e02c6
--- /dev/null
+++ b/include/linux/blk-mq-sched.h
@@ -0,0 +1,14 @@
+#ifndef BLK_MQ_SCHED_H
+#define BLK_MQ_SCHED_H
+
+/*
+ * Scheduler helper functions.
+ */
+void blk_mq_sched_free_hctx_data(struct request_queue *q,
+				 void (*exit)(struct blk_mq_hw_ctx *));
+void blk_mq_sched_request_inserted(struct request *rq);
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
+			    struct request **merged_request);
+bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
+
+#endif
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V2 03/12] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
  2017-09-07 16:16 ` [PATCH V2 01/12] block: Fix declaration of blk-mq debugfs functions Damien Le Moal
  2017-09-07 16:16 ` [PATCH V2 02/12] block: Fix declaration of blk-mq scheduler functions Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:10   ` Johannes Thumshirn
  2017-09-07 16:16 ` [PATCH V2 04/12] scsi: sd_zbc: Move zbc disk declarations to sd_zbc.h Damien Le Moal
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

Move standard macro definitions for the zone types and zone conditions
to scsi_proto.h together with the definitions related to the
REPORT ZONES command.

While at it, explicitly define all values in the enums for clarity.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
---
 drivers/scsi/sd_zbc.c     | 18 ------------------
 include/scsi/scsi_proto.h | 45 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 34 insertions(+), 29 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8aa54779aac1..d445a57f99bd 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -37,24 +37,6 @@
 #include "sd.h"
 #include "scsi_priv.h"
 
-enum zbc_zone_type {
-	ZBC_ZONE_TYPE_CONV = 0x1,
-	ZBC_ZONE_TYPE_SEQWRITE_REQ,
-	ZBC_ZONE_TYPE_SEQWRITE_PREF,
-	ZBC_ZONE_TYPE_RESERVED,
-};
-
-enum zbc_zone_cond {
-	ZBC_ZONE_COND_NO_WP,
-	ZBC_ZONE_COND_EMPTY,
-	ZBC_ZONE_COND_IMP_OPEN,
-	ZBC_ZONE_COND_EXP_OPEN,
-	ZBC_ZONE_COND_CLOSED,
-	ZBC_ZONE_COND_READONLY = 0xd,
-	ZBC_ZONE_COND_FULL,
-	ZBC_ZONE_COND_OFFLINE,
-};
-
 /**
  * Convert a zone descriptor to a zone struct.
  */
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 8c285d9a06d8..39130a9c05bf 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -301,19 +301,42 @@ struct scsi_lun {
 
 /* Reporting options for REPORT ZONES */
 enum zbc_zone_reporting_options {
-	ZBC_ZONE_REPORTING_OPTION_ALL = 0,
-	ZBC_ZONE_REPORTING_OPTION_EMPTY,
-	ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN,
-	ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN,
-	ZBC_ZONE_REPORTING_OPTION_CLOSED,
-	ZBC_ZONE_REPORTING_OPTION_FULL,
-	ZBC_ZONE_REPORTING_OPTION_READONLY,
-	ZBC_ZONE_REPORTING_OPTION_OFFLINE,
-	ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10,
-	ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE,
-	ZBC_ZONE_REPORTING_OPTION_NON_WP = 0x3f,
+	ZBC_ZONE_REPORTING_OPTION_ALL		= 0x00,
+	ZBC_ZONE_REPORTING_OPTION_EMPTY		= 0x01,
+	ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN	= 0x02,
+	ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN	= 0x03,
+	ZBC_ZONE_REPORTING_OPTION_CLOSED	= 0x04,
+	ZBC_ZONE_REPORTING_OPTION_FULL		= 0x05,
+	ZBC_ZONE_REPORTING_OPTION_READONLY	= 0x06,
+	ZBC_ZONE_REPORTING_OPTION_OFFLINE	= 0x07,
+	/* 0x08 to 0x0f are reserved */
+	ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP	= 0x10,
+	ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE	= 0x11,
+	/* 0x12 to 0x3e are reserved */
+	ZBC_ZONE_REPORTING_OPTION_NON_WP	= 0x3f,
 };
 
 #define ZBC_REPORT_ZONE_PARTIAL 0x80
 
+/* Zone types of REPORT ZONES zone descriptors */
+enum zbc_zone_type {
+	ZBC_ZONE_TYPE_CONV		= 0x1,
+	ZBC_ZONE_TYPE_SEQWRITE_REQ	= 0x2,
+	ZBC_ZONE_TYPE_SEQWRITE_PREF	= 0x3,
+	/* 0x4 to 0xf are reserved */
+};
+
+/* Zone conditions of REPORT ZONES zone descriptors */
+enum zbc_zone_cond {
+	ZBC_ZONE_COND_NO_WP		= 0x0,
+	ZBC_ZONE_COND_EMPTY		= 0x1,
+	ZBC_ZONE_COND_IMP_OPEN		= 0x2,
+	ZBC_ZONE_COND_EXP_OPEN		= 0x3,
+	ZBC_ZONE_COND_CLOSED		= 0x4,
+	/* 0x5 to 0xc are reserved */
+	ZBC_ZONE_COND_READONLY		= 0xd,
+	ZBC_ZONE_COND_FULL		= 0xe,
+	ZBC_ZONE_COND_OFFLINE		= 0xf,
+};
+
 #endif /* _SCSI_PROTO_H_ */
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V2 04/12] scsi: sd_zbc: Move zbc disk declarations to sd_zbc.h
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (2 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 03/12] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:11   ` Johannes Thumshirn
  2017-09-07 16:16 ` [PATCH V2 05/12] scsi: sd_zbc: Fix comments and indentation Damien Le Moal
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

Introduce sd_zbc.h to gather ZBC disk related declarations and avoid
cluttering sd.h.

No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/scsi/sd.c     |  1 +
 drivers/scsi/sd.h     | 48 ---------------------------------------
 drivers/scsi/sd_zbc.c |  7 +-----
 drivers/scsi/sd_zbc.h | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 64 insertions(+), 54 deletions(-)
 create mode 100644 drivers/scsi/sd_zbc.h

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c5d05b1c58a5..1cd113df2d3a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -68,6 +68,7 @@
 #include <scsi/scsicam.h>
 
 #include "sd.h"
+#include "sd_zbc.h"
 #include "scsi_priv.h"
 #include "scsi_logging.h"
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 99c4dde9b6bf..c9a627e7eebd 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -277,52 +277,4 @@ static inline int sd_is_zoned(struct scsi_disk *sdkp)
 	return sdkp->zoned == 1 || sdkp->device->type == TYPE_ZBC;
 }
 
-#ifdef CONFIG_BLK_DEV_ZONED
-
-extern int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
-extern void sd_zbc_remove(struct scsi_disk *sdkp);
-extern void sd_zbc_print_zones(struct scsi_disk *sdkp);
-extern int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd);
-extern void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd);
-extern int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd);
-extern int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
-extern void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
-			    struct scsi_sense_hdr *sshdr);
-
-#else /* CONFIG_BLK_DEV_ZONED */
-
-static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
-				    unsigned char *buf)
-{
-	return 0;
-}
-
-static inline void sd_zbc_remove(struct scsi_disk *sdkp) {}
-
-static inline void sd_zbc_print_zones(struct scsi_disk *sdkp) {}
-
-static inline int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
-{
-	/* Let the drive fail requests */
-	return BLKPREP_OK;
-}
-
-static inline void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd) {}
-
-static inline int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
-{
-	return BLKPREP_INVALID;
-}
-
-static inline int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
-{
-	return BLKPREP_INVALID;
-}
-
-static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
-				   unsigned int good_bytes,
-				   struct scsi_sense_hdr *sshdr) {}
-
-#endif /* CONFIG_BLK_DEV_ZONED */
-
 #endif /* _SCSI_DISK_H */
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index d445a57f99bd..a62f8e256a26 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -28,14 +28,9 @@
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
-#include <scsi/scsi_dbg.h>
-#include <scsi/scsi_device.h>
-#include <scsi/scsi_driver.h>
-#include <scsi/scsi_host.h>
-#include <scsi/scsi_eh.h>
 
 #include "sd.h"
-#include "scsi_priv.h"
+#include "sd_zbc.h"
 
 /**
  * Convert a zone descriptor to a zone struct.
diff --git a/drivers/scsi/sd_zbc.h b/drivers/scsi/sd_zbc.h
new file mode 100644
index 000000000000..5e7ecc9b2966
--- /dev/null
+++ b/drivers/scsi/sd_zbc.h
@@ -0,0 +1,62 @@
+#ifndef _SCSI_DISK_ZBC_H
+#define _SCSI_DISK_ZBC_H
+
+#ifdef CONFIG_BLK_DEV_ZONED
+
+int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buffer);
+void sd_zbc_remove(struct scsi_disk *sdkp);
+void sd_zbc_print_zones(struct scsi_disk *sdkp);
+
+int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd);
+void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd);
+
+int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd);
+int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
+void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
+		     struct scsi_sense_hdr *sshdr);
+
+#else /* CONFIG_BLK_DEV_ZONED */
+
+static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
+				    unsigned char *buf)
+{
+	return 0;
+}
+
+static inline void sd_zbc_remove(struct scsi_disk *sdkp)
+{
+}
+
+static inline void sd_zbc_print_zones(struct scsi_disk *sdkp)
+{
+}
+
+static inline int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
+{
+	/* Let the drive fail requests */
+	return BLKPREP_OK;
+}
+
+static inline void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
+{
+}
+
+static inline int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
+{
+	return BLKPREP_INVALID;
+}
+
+static inline int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
+{
+	return BLKPREP_INVALID;
+}
+
+static inline void sd_zbc_complete(struct scsi_cmnd *cmd,
+				   unsigned int good_bytes,
+				   struct scsi_sense_hdr *sshdr)
+{
+}
+
+#endif /* CONFIG_BLK_DEV_ZONED */
+
+#endif /* _SCSI_DISK_ZBC_H */
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V2 05/12] scsi: sd_zbc: Fix comments and indentation
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (3 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 04/12] scsi: sd_zbc: Move zbc disk declarations to sd_zbc.h Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:12   ` Johannes Thumshirn
  2017-09-07 16:16 ` [PATCH V2 06/12] scsi: sd_zbc: Rearrange code Damien Le Moal
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

Fix comment style (do not use the kernel-doc comment style for regular
comments) and add some comments to clarify some functions. Also fix the
indentation of some function signatures and remove a useless blank line.

No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/scsi/sd_zbc.c | 50 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index a62f8e256a26..e0927f8fb829 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -32,11 +32,11 @@
 #include "sd.h"
 #include "sd_zbc.h"
 
-/**
- * Convert a zone descriptor to a zone struct.
+/*
+ * Convert a zone descriptor to a struct blk_zone,
+ * taking care of converting LBA sized values to 512B sectors.
  */
-static void sd_zbc_parse_report(struct scsi_disk *sdkp,
-				u8 *buf,
+static void sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
 				struct blk_zone *zone)
 {
 	struct scsi_device *sdp = sdkp->device;
@@ -58,8 +58,9 @@ static void sd_zbc_parse_report(struct scsi_disk *sdkp,
 		zone->wp = zone->start + zone->len;
 }
 
-/**
+/*
  * Issue a REPORT ZONES scsi command.
+ * For internal use during device validation.
  */
 static int sd_zbc_report_zones(struct scsi_disk *sdkp, unsigned char *buf,
 			       unsigned int buflen, sector_t lba)
@@ -100,6 +101,9 @@ static int sd_zbc_report_zones(struct scsi_disk *sdkp, unsigned char *buf,
 	return 0;
 }
 
+/*
+ * Prepare a REPORT ZONES scsi command for a REQ_OP_ZONE_REPORT request.
+ */
 int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
 {
 	struct request *rq = cmd->request;
@@ -142,6 +146,11 @@ int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd)
 	return BLKPREP_OK;
 }
 
+/*
+ * Process a REPORT ZONES scsi command reply, converting all reported
+ * zone descriptors to struct blk_zone. The conversion is done in-place,
+ * directly in the request specified sg buffer.
+ */
 static void sd_zbc_report_zones_complete(struct scsi_cmnd *scmd,
 					 unsigned int good_bytes)
 {
@@ -208,6 +217,9 @@ static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp,
 	return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift;
 }
 
+/*
+ * Prepare a RESET WRITE POINTER scsi command for a REQ_OP_ZONE_RESET request.
+ */
 int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
 {
 	struct request *rq = cmd->request;
@@ -240,6 +252,11 @@ int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
 	return BLKPREP_OK;
 }
 
+/*
+ * Write lock a sequential zone.
+ * Called from sd_init_cmd() for write requests
+ * (standard write, write same or write zeroes operations).
+ */
 int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 {
 	struct request *rq = cmd->request;
@@ -262,10 +279,7 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	 * Do not issue more than one write at a time per
 	 * zone. This solves write ordering problems due to
 	 * the unlocking of the request queue in the dispatch
-	 * path in the non scsi-mq case. For scsi-mq, this
-	 * also avoids potential write reordering when multiple
-	 * threads running on different CPUs write to the same
-	 * zone (with a synchronized sequential pattern).
+	 * path in the non scsi-mq case.
 	 */
 	if (sdkp->zones_wlock &&
 	    test_and_set_bit(zno, sdkp->zones_wlock))
@@ -277,6 +291,10 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	return BLKPREP_OK;
 }
 
+/*
+ * Write unlock a sequential zone.
+ * Called from sd_uninit_cmd().
+ */
 void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
 {
 	struct request *rq = cmd->request;
@@ -291,8 +309,7 @@ void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
 	}
 }
 
-void sd_zbc_complete(struct scsi_cmnd *cmd,
-		     unsigned int good_bytes,
+void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		     struct scsi_sense_hdr *sshdr)
 {
 	int result = cmd->result;
@@ -336,7 +353,7 @@ void sd_zbc_complete(struct scsi_cmnd *cmd,
 	}
 }
 
-/**
+/*
  * Read zoned block device characteristics (VPD page B6).
  */
 static int sd_zbc_read_zoned_characteristics(struct scsi_disk *sdkp,
@@ -366,11 +383,10 @@ static int sd_zbc_read_zoned_characteristics(struct scsi_disk *sdkp,
 	return 0;
 }
 
-/**
+/*
  * Check reported capacity.
  */
-static int sd_zbc_check_capacity(struct scsi_disk *sdkp,
-				 unsigned char *buf)
+static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf)
 {
 	sector_t lba;
 	int ret;
@@ -526,8 +542,7 @@ static int sd_zbc_setup(struct scsi_disk *sdkp)
 	return 0;
 }
 
-int sd_zbc_read_zones(struct scsi_disk *sdkp,
-		      unsigned char *buf)
+int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 {
 	int ret;
 
@@ -538,7 +553,6 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp,
 		 */
 		return 0;
 
-
 	/* Get zoned block device characteristics */
 	ret = sd_zbc_read_zoned_characteristics(sdkp, buf);
 	if (ret)
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V2 06/12] scsi: sd_zbc: Rearrange code
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (4 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 05/12] scsi: sd_zbc: Fix comments and indentation Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:13   ` Johannes Thumshirn
  2017-09-07 16:16 ` [PATCH V2 07/12] scsi: sd_zbc.c: Use well defined macros Damien Le Moal
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

Move the inline functions sd_zbc_zone_sectors() and sd_zbc_zone_no() to
sd_zbc.h and rearrange sd_zbc_setup() to include the use_16_for_rw and
use_10_for_rw assignments.
Also move the calculation of sdkp->zone_shift together with the assignment
of the verified zone_blocks value in sd_zbc_check_zone_size().
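Since sd_zbc_check_zone_size() only accepts power-of-two zone sizes, the
zone_shift computed with ilog2() turns the zone number lookup into a single
shift, which is what sd_zbc_zone_no() relies on. A userspace sketch of the
arithmetic (zone_shift_of() is a local stand-in for the kernel's ilog2(),
valid for power-of-two inputs; the names are ours, not the kernel's):

```c
#include <assert.h>

/* Local stand-in for the kernel's ilog2() (power-of-two input assumed). */
static unsigned int zone_shift_of(unsigned long zone_blocks)
{
	unsigned int shift = 0;

	while (zone_blocks >>= 1)
		shift++;
	return shift;
}

/* Zone number of a logical block, as in sd_zbc_zone_no(): one shift. */
static unsigned int zone_no(unsigned long long lba, unsigned int zone_shift)
{
	return lba >> zone_shift;
}
```

For example, with 65536-block zones (zone_shift = 16), LBA 3 * 65536 + 10
falls in zone 3.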

No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/scsi/sd_zbc.c | 21 +++++----------------
 drivers/scsi/sd_zbc.h | 17 +++++++++++++++++
 2 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index e0927f8fb829..a25a940243a9 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -206,17 +206,6 @@ static void sd_zbc_report_zones_complete(struct scsi_cmnd *scmd,
 	local_irq_restore(flags);
 }
 
-static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp)
-{
-	return logical_to_sectors(sdkp->device, sdkp->zone_blocks);
-}
-
-static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp,
-					  sector_t sector)
-{
-	return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift;
-}
-
 /*
  * Prepare a RESET WRITE POINTER scsi command for a REQ_OP_ZONE_RESET request.
  */
@@ -516,6 +505,7 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 	}
 
 	sdkp->zone_blocks = zone_blocks;
+	sdkp->zone_shift = ilog2(zone_blocks);
 
 	return 0;
 }
@@ -523,10 +513,13 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 static int sd_zbc_setup(struct scsi_disk *sdkp)
 {
 
+	/* READ16/WRITE16 is mandatory for ZBC disks */
+	sdkp->device->use_16_for_rw = 1;
+	sdkp->device->use_10_for_rw = 0;
+
 	/* chunk_sectors indicates the zone size */
 	blk_queue_chunk_sectors(sdkp->disk->queue,
 			logical_to_sectors(sdkp->device, sdkp->zone_blocks));
-	sdkp->zone_shift = ilog2(sdkp->zone_blocks);
 	sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift;
 	if (sdkp->capacity & (sdkp->zone_blocks - 1))
 		sdkp->nr_zones++;
@@ -589,10 +582,6 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 	if (ret)
 		goto err;
 
-	/* READ16/WRITE16 is mandatory for ZBC disks */
-	sdkp->device->use_16_for_rw = 1;
-	sdkp->device->use_10_for_rw = 0;
-
 	return 0;
 
 err:
diff --git a/drivers/scsi/sd_zbc.h b/drivers/scsi/sd_zbc.h
index 5e7ecc9b2966..b34002f0acbe 100644
--- a/drivers/scsi/sd_zbc.h
+++ b/drivers/scsi/sd_zbc.h
@@ -15,6 +15,23 @@ int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd);
 void sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
 		     struct scsi_sense_hdr *sshdr);
 
+/*
+ * Zone size in number of 512B sectors.
+ */
+static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp)
+{
+	return logical_to_sectors(sdkp->device, sdkp->zone_blocks);
+}
+
+/*
+ * Zone number of the specified sector.
+ */
+static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp,
+					  sector_t sector)
+{
+	return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift;
+}
+
 #else /* CONFIG_BLK_DEV_ZONED */
 
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH V2 07/12] scsi: sd_zbc.c: Use well defined macros
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (5 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 06/12] scsi: sd_zbc: Rearrange code Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:15   ` Johannes Thumshirn
  2017-09-07 16:16 ` [PATCH V2 08/12] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics() Damien Le Moal
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

Instead of open coding, use the min() macro to calculate the report zones
reply buffer length in sd_zbc_check_zone_size() and the round_up() macro
to calculate the number of zones in sd_zbc_setup().
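The round_up() conversion in sd_zbc_setup() is equivalent to the open-coded
"shift, then increment if a partial zone remains" form it replaces, because
zone sizes are powers of two. A userspace sketch checking the equivalence
(the round_up() macro below is a local re-implementation of the kernel macro,
valid only for power-of-two alignment):

```c
#include <assert.h>

/* Local stand-in for the kernel's round_up() (power-of-two align only). */
#define round_up(x, y) ((((x) - 1) | ((y) - 1)) + 1)

/* Open-coded form removed by the patch. */
static unsigned int nr_zones_open_coded(unsigned long long capacity,
					unsigned long zone_blocks,
					unsigned int zone_shift)
{
	unsigned int nr = capacity >> zone_shift;

	if (capacity & (zone_blocks - 1))
		nr++;	/* Account for an incomplete last zone */
	return nr;
}

/* round_up() form introduced by the patch. */
static unsigned int nr_zones_round_up(unsigned long long capacity,
				      unsigned long zone_blocks,
				      unsigned int zone_shift)
{
	return round_up(capacity, zone_blocks) >> zone_shift;
}
```

Both forms agree whether the capacity is an exact multiple of the zone size
or ends with a partial zone.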

No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/scsi/sd_zbc.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index a25a940243a9..8a45db8bd6e7 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -403,7 +403,7 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf)
 	return 0;
 }
 
-#define SD_ZBC_BUF_SIZE 131072
+#define SD_ZBC_BUF_SIZE 131072U
 
 static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 {
@@ -447,10 +447,7 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 		/* Parse REPORT ZONES header */
 		list_length = get_unaligned_be32(&buf[0]) + 64;
 		rec = buf + 64;
-		if (list_length < SD_ZBC_BUF_SIZE)
-			buf_len = list_length;
-		else
-			buf_len = SD_ZBC_BUF_SIZE;
+		buf_len = min(list_length, SD_ZBC_BUF_SIZE);
 
 		/* Parse zone descriptors */
 		while (rec < buf + buf_len) {
@@ -520,9 +517,8 @@ static int sd_zbc_setup(struct scsi_disk *sdkp)
 	/* chunk_sectors indicates the zone size */
 	blk_queue_chunk_sectors(sdkp->disk->queue,
 			logical_to_sectors(sdkp->device, sdkp->zone_blocks));
-	sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift;
-	if (sdkp->capacity & (sdkp->zone_blocks - 1))
-		sdkp->nr_zones++;
+	sdkp->nr_zones =
+		round_up(sdkp->capacity, sdkp->zone_blocks) >> sdkp->zone_shift;
 
 	if (!sdkp->zones_wlock) {
 		sdkp->zones_wlock = kcalloc(BITS_TO_LONGS(sdkp->nr_zones),
-- 
2.13.5


* [PATCH V2 08/12] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics()
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (6 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 07/12] scsi: sd_zbc.c: Use well defined macros Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:17   ` Johannes Thumshirn
  2017-09-07 16:16 ` [PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones Damien Le Moal
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche, stable

The three values starting at byte 8 of the Zoned Block Device
Characteristics VPD page B6h are 32-bit values, not 64-bit. So use
get_unaligned_be32() to retrieve the values instead of get_unaligned_be64().

Fixes: 89d947561077 ("sd: Implement support for ZBC devices")
Cc: <stable@vger.kernel.org>

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
---
 drivers/scsi/sd_zbc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8a45db8bd6e7..8b258254a2bb 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -358,15 +358,15 @@ static int sd_zbc_read_zoned_characteristics(struct scsi_disk *sdkp,
 	if (sdkp->device->type != TYPE_ZBC) {
 		/* Host-aware */
 		sdkp->urswrz = 1;
-		sdkp->zones_optimal_open = get_unaligned_be64(&buf[8]);
-		sdkp->zones_optimal_nonseq = get_unaligned_be64(&buf[12]);
+		sdkp->zones_optimal_open = get_unaligned_be32(&buf[8]);
+		sdkp->zones_optimal_nonseq = get_unaligned_be32(&buf[12]);
 		sdkp->zones_max_open = 0;
 	} else {
 		/* Host-managed */
 		sdkp->urswrz = buf[4] & 1;
 		sdkp->zones_optimal_open = 0;
 		sdkp->zones_optimal_nonseq = 0;
-		sdkp->zones_max_open = get_unaligned_be64(&buf[16]);
+		sdkp->zones_max_open = get_unaligned_be32(&buf[16]);
 	}
 
 	return 0;
-- 
2.13.5


* [PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (7 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 08/12] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics() Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:39   ` Johannes Thumshirn
  2017-09-07 16:16 ` [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq Damien Le Moal
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

Zoned block devices have no write constraints for conventional zones, so write
locking of conventional zones is not necessary and can hurt performance. To
avoid this, introduce the seq_zones bitmap to indicate whether a zone is a
sequential one. Use this information to allow any write to be issued to
conventional zones, locking only sequential zones.

As the bitmap allocation for seq_zones is identical to that of the zone write
lock bitmap, introduce the helper sd_zbc_alloc_zone_bitmap(). Using this
helper, wait for the disk capacity and number of zones to stabilize on the
second revalidation pass before allocating and initializing the bitmaps.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
---
 drivers/scsi/sd.h     |   1 +
 drivers/scsi/sd_zbc.c | 108 ++++++++++++++++++++++++++++++++++++++++++++------
 drivers/scsi/sd_zbc.h |  12 ++++++
 3 files changed, 108 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index c9a627e7eebd..3ea82c8cecd5 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -77,6 +77,7 @@ struct scsi_disk {
 	unsigned int	zone_blocks;
 	unsigned int	zone_shift;
 	unsigned long	*zones_wlock;
+	unsigned long	*seq_zones;
 	unsigned int	zones_optimal_open;
 	unsigned int	zones_optimal_nonseq;
 	unsigned int	zones_max_open;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 8b258254a2bb..bcece82a1d8d 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -252,7 +252,7 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t sector = blk_rq_pos(rq);
 	sector_t zone_sectors = sd_zbc_zone_sectors(sdkp);
-	unsigned int zno = sd_zbc_zone_no(sdkp, sector);
+	unsigned int zno;
 
 	/*
 	 * Note: Checks of the alignment of the write command on
@@ -265,13 +265,15 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 		return BLKPREP_KILL;
 
 	/*
-	 * Do not issue more than one write at a time per
-	 * zone. This solves write ordering problems due to
-	 * the unlocking of the request queue in the dispatch
-	 * path in the non scsi-mq case.
+	 * There is no write constraint on conventional zones, but do not issue
+	 * more than one write at a time per sequential zone. This avoids write
+	 * ordering problems due to the unlocking of the request queue in the
+	 * dispatch path of the non scsi-mq (legacy) case.
 	 */
-	if (sdkp->zones_wlock &&
-	    test_and_set_bit(zno, sdkp->zones_wlock))
+	zno = sd_zbc_zone_no(sdkp, sector);
+	if (!test_bit(zno, sdkp->seq_zones))
+		return BLKPREP_OK;
+	if (test_and_set_bit(zno, sdkp->zones_wlock))
 		return BLKPREP_DEFER;
 
 	WARN_ON_ONCE(cmd->flags & SCMD_ZONE_WRITE_LOCK);
@@ -289,8 +291,9 @@ void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
 	struct request *rq = cmd->request;
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 
-	if (sdkp->zones_wlock && cmd->flags & SCMD_ZONE_WRITE_LOCK) {
+	if (cmd->flags & SCMD_ZONE_WRITE_LOCK) {
 		unsigned int zno = sd_zbc_zone_no(sdkp, blk_rq_pos(rq));
+
 		WARN_ON_ONCE(!test_bit(zno, sdkp->zones_wlock));
 		cmd->flags &= ~SCMD_ZONE_WRITE_LOCK;
 		clear_bit_unlock(zno, sdkp->zones_wlock);
@@ -507,8 +510,67 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 	return 0;
 }
 
+/*
+ * Initialize the sequential zone bitmap to allow identifying sequential zones.
+ */
+static int sd_zbc_setup_seq_zones(struct scsi_disk *sdkp)
+{
+	unsigned long *seq_zones;
+	sector_t block = 0;
+	unsigned char *buf;
+	unsigned char *rec;
+	unsigned int buf_len;
+	unsigned int list_length;
+	unsigned int n = 0;
+	int ret = -ENOMEM;
+
+	/* Allocate zone type bitmap */
+	seq_zones = sd_zbc_alloc_zone_bitmap(sdkp);
+	if (!seq_zones)
+		return -ENOMEM;
+
+	buf = kmalloc(SD_ZBC_BUF_SIZE, GFP_KERNEL);
+	if (!buf)
+		goto out;
+
+	while (block < sdkp->capacity) {
+
+		ret = sd_zbc_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, block);
+		if (ret)
+			goto out;
+
+		/* Parse reported zone descriptors */
+		list_length = get_unaligned_be32(&buf[0]) + 64;
+		rec = buf + 64;
+		buf_len = min(list_length, SD_ZBC_BUF_SIZE);
+		while (rec < buf + buf_len) {
+			if ((rec[0] & 0x0f) != ZBC_ZONE_TYPE_CONV)
+				set_bit(n, seq_zones);
+			block += get_unaligned_be64(&rec[8]);
+			rec += 64;
+			n++;
+		}
+
+	}
+
+	if (n != sdkp->nr_zones)
+		ret = -EIO;
+
+out:
+	kfree(buf);
+	if (ret) {
+		kfree(seq_zones);
+		return ret;
+	}
+
+	sdkp->seq_zones = seq_zones;
+
+	return 0;
+}
+
 static int sd_zbc_setup(struct scsi_disk *sdkp)
 {
+	int ret;
 
 	/* READ16/WRITE16 is mandatory for ZBC disks */
 	sdkp->device->use_16_for_rw = 1;
@@ -520,17 +582,37 @@ static int sd_zbc_setup(struct scsi_disk *sdkp)
 	sdkp->nr_zones =
 		round_up(sdkp->capacity, sdkp->zone_blocks) >> sdkp->zone_shift;
 
+	/*
+	 * Wait for the disk capacity to stabilize before
+	 * initializing zone related information.
+	 */
+	if (sdkp->first_scan)
+		return 0;
+
 	if (!sdkp->zones_wlock) {
-		sdkp->zones_wlock = kcalloc(BITS_TO_LONGS(sdkp->nr_zones),
-					    sizeof(unsigned long),
-					    GFP_KERNEL);
+		sdkp->zones_wlock = sd_zbc_alloc_zone_bitmap(sdkp);
 		if (!sdkp->zones_wlock)
 			return -ENOMEM;
 	}
 
+	if (!sdkp->seq_zones) {
+		ret = sd_zbc_setup_seq_zones(sdkp);
+		if (ret)
+			return ret;
+	}
+
 	return 0;
 }
 
+static void sd_zbc_cleanup(struct scsi_disk *sdkp)
+{
+	kfree(sdkp->seq_zones);
+	sdkp->seq_zones = NULL;
+
+	kfree(sdkp->zones_wlock);
+	sdkp->zones_wlock = NULL;
+}
+
 int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 {
 	int ret;
@@ -582,14 +664,14 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf)
 
 err:
 	sdkp->capacity = 0;
+	sd_zbc_cleanup(sdkp);
 
 	return ret;
 }
 
 void sd_zbc_remove(struct scsi_disk *sdkp)
 {
-	kfree(sdkp->zones_wlock);
-	sdkp->zones_wlock = NULL;
+	sd_zbc_cleanup(sdkp);
 }
 
 void sd_zbc_print_zones(struct scsi_disk *sdkp)
diff --git a/drivers/scsi/sd_zbc.h b/drivers/scsi/sd_zbc.h
index b34002f0acbe..2b63ee352fa2 100644
--- a/drivers/scsi/sd_zbc.h
+++ b/drivers/scsi/sd_zbc.h
@@ -32,6 +32,18 @@ static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp,
 	return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift;
 }
 
+/*
+ * Allocate a zone bitmap (one bit per zone).
+ */
+static inline unsigned long *sd_zbc_alloc_zone_bitmap(struct scsi_disk *sdkp)
+{
+	struct request_queue *q = sdkp->disk->queue;
+
+	return kzalloc_node(BITS_TO_LONGS(sdkp->nr_zones)
+			    * sizeof(unsigned long),
+			    GFP_KERNEL, q->node);
+}
+
 #else /* CONFIG_BLK_DEV_ZONED */
 
 static inline int sd_zbc_read_zones(struct scsi_disk *sdkp,
-- 
2.13.5


* [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (8 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08 12:43   ` Ming Lei
  2017-09-07 16:16 ` [PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue() Damien Le Moal
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

In the case of a ZBC disk used with scsi-mq, zone write locking does
not prevent write reordering in sequential zones. Unlike the legacy
case, zone locking can only be done after the command request is
removed from the scheduler dispatch queue. That is, at the time of
zone locking, the write command may already be out of order.

Disable zone write locking in sd_zbc_write_lock_zone() if the disk is
used with scsi-mq. Write order guarantees can be provided by an
adapted I/O scheduler.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@wdc.com>
---
 drivers/scsi/sd_zbc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index bcece82a1d8d..6b1a7f9c1e90 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -264,6 +264,10 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	    (sector & (zone_sectors - 1)) + blk_rq_sectors(rq) > zone_sectors)
 		return BLKPREP_KILL;
 
+	/* No write locking with scsi-mq */
+	if (rq->q->mq_ops)
+		return BLKPREP_OK;
+
 	/*
 	 * There is no write constraint on conventional zones, but do not issue
 	 * more than one write at a time per sequential zone. This avoids write
-- 
2.13.5


* [PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue()
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (9 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-10  5:16   ` Ming Lei
  2017-09-07 16:16 ` [PATCH V2 12/12] scsi: Introduce ZBC disk I/O scheduler Damien Le Moal
  2017-09-08  8:20 ` [PATCH V2 00/12] scsi-mq support for ZBC disks Christoph Hellwig
  12 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

Using scsi_device_from_queue(), return the scsi_disk structure
associated with a request queue if the device is a disk.
Export this function to make it available to modules.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/scsi/sd.c | 32 ++++++++++++++++++++++++++++++++
 drivers/scsi/sd.h |  2 ++
 2 files changed, 34 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 1cd113df2d3a..ff9fff4de3e6 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -631,6 +631,38 @@ static void scsi_disk_put(struct scsi_disk *sdkp)
 	mutex_unlock(&sd_ref_mutex);
 }
 
+static int scsi_disk_lookup(struct device *dev, void *data)
+{
+	struct scsi_disk **sdkp = data;
+
+	if (!*sdkp && dev->class == &sd_disk_class)
+		*sdkp = to_scsi_disk(dev);
+
+	return 0;
+}
+
+/**
+ * scsi_disk_from_queue - return scsi disk associated with a request_queue
+ * @q: The request queue to return the scsi disk from
+ *
+ * Return the struct scsi_disk associated with a request queue or NULL if the
+ * request_queue does not reference a SCSI device or a if the device is
+ * not a disk.
+ */
+struct scsi_disk *scsi_disk_from_queue(struct request_queue *q)
+{
+	struct scsi_device *sdev = scsi_device_from_queue(q);
+	struct scsi_disk *sdkp = NULL;
+
+	if (!sdev)
+		return NULL;
+
+	device_for_each_child(&sdev->sdev_gendev, &sdkp, scsi_disk_lookup);
+
+	return sdkp;
+}
+EXPORT_SYMBOL_GPL(scsi_disk_from_queue);
+
 #ifdef CONFIG_BLK_SED_OPAL
 static int sd_sec_submit(void *data, u16 spsp, u8 secp, void *buffer,
 		size_t len, bool send)
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 3ea82c8cecd5..92113a9e2b20 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -126,6 +126,8 @@ static inline struct scsi_disk *scsi_disk(struct gendisk *disk)
 	return container_of(disk->private_data, struct scsi_disk, driver);
 }
 
+struct scsi_disk *scsi_disk_from_queue(struct request_queue *q);
+
 #define sd_printk(prefix, sdsk, fmt, a...)				\
         (sdsk)->disk ?							\
 	      sdev_prefix_printk(prefix, (sdsk)->device,		\
-- 
2.13.5


* [PATCH V2 12/12] scsi: Introduce ZBC disk I/O scheduler
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (10 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue() Damien Le Moal
@ 2017-09-07 16:16 ` Damien Le Moal
  2017-09-08  8:20 ` [PATCH V2 00/12] scsi-mq support for ZBC disks Christoph Hellwig
  12 siblings, 0 replies; 38+ messages in thread
From: Damien Le Moal @ 2017-09-07 16:16 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe
  Cc: Christoph Hellwig, Bart Van Assche

The zoned I/O scheduler is mostly identical to mq-deadline and retains
the same configuration attributes. The main difference is that the
zoned scheduler ensures that, at any time, at most one write request
(command) per sequential zone is in flight (has been issued to the
disk), in order to protect against sequential write reordering that
could result from the concurrent dispatching of requests by multiple
contexts.

This is achieved, as in the legacy SCSI I/O path, by write locking
zones when a write request is issued. Subsequent writes to the same
zone have to wait for the completion of the previously issued write
before being dispatched to the disk in turn. This ensures that
sequential writes are processed in the correct order without any
modification to the execution model of blk-mq. In addition, this
also protects against write reordering at the HBA level (e.g. AHCI).

The zoned scheduler code is added under the drivers/scsi directory so
that information managed using the scsi_disk structure can be directly
referenced. This avoids cluttering the block layer with device type
specific code.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 Documentation/block/zoned-iosched.txt |  48 ++
 block/Kconfig.iosched                 |  12 +
 drivers/scsi/Makefile                 |   1 +
 drivers/scsi/sd.h                     |   3 +-
 drivers/scsi/sd_zbc.c                 |  16 +-
 drivers/scsi/sd_zbc.h                 |   8 +-
 drivers/scsi/zoned_iosched.c          | 939 ++++++++++++++++++++++++++++++++++
 7 files changed, 1015 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/block/zoned-iosched.txt
 create mode 100644 drivers/scsi/zoned_iosched.c

diff --git a/Documentation/block/zoned-iosched.txt b/Documentation/block/zoned-iosched.txt
new file mode 100644
index 000000000000..b269125bdc61
--- /dev/null
+++ b/Documentation/block/zoned-iosched.txt
@@ -0,0 +1,48 @@
+Zoned I/O scheduler
+===================
+
+The Zoned I/O scheduler solves the write ordering problems that zoned block
+devices face due to the absence of a global request queue lock in the blk-mq
+infrastructure. Multiple contexts may simultaneously dispatch write requests
+to the same sequential zone of a zoned block device, potentially breaking the
+sequential write order imposed by the device.
+
+The Zoned I/O scheduler is based on the mq-deadline scheduler. It shares the
+same tunables and behaves in a comparable manner. The main difference
+introduced with the zoned scheduler is the handling of write batches. Whereas
+mq-deadline keeps dispatching write requests to the device as long as the
+batching size allows and reads are not starved, the zoned scheduler introduces
+additional constraints:
+1) At most one write request can be issued at any time to a sequential zone
+of the device. This ensures that no reordering of sequential writes to a
+sequential zone can happen once the write request leaves the scheduler
+internal queue (rb tree).
+2) The number of sequential zones that can be simultaneously written is limited
+to the maximum number of open zones advertised by the device. This additional
+condition avoids performance degradation due to excessive open zone resource
+use at the device level.
+
+These conditions do not apply to write requests targeting conventional zones.
+For these, the zoned scheduler behaves exactly like the mq-deadline scheduler.
+
+The zoned I/O scheduler cannot be used with regular block devices. It can only
+be used with host-managed or host-aware zoned block devices.
+Using the zoned I/O scheduler is mandatory with host-managed disks unless the
+disk user tightly controls its own write sequencing to sequential zones. The
+zoned scheduler treats host-aware disks exactly the same way as host-managed
+devices. That is, even though host-aware disks can be randomly written, the
+zoned scheduler still imposes the limit of one write per zone so that
+sequential write sequences are preserved.
+
+For host-managed disks, automating the use of the zoned scheduler can be done
+simply with a udev rule. An example of such a rule is shown below.
+
+# Set zoned scheduler for host-managed zoned block devices
+ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/zoned}=="host-managed", \
+	ATTR{queue/scheduler}="zoned"
+
+Zoned I/O scheduler tunables
+============================
+
+Tunables of the Zoned I/O scheduler are identical to those of the deadline
+I/O scheduler. See Documentation/block/deadline-iosched.txt for details.
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index fd2cefa47d35..b87c67dbf1f6 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -70,6 +70,18 @@ config MQ_IOSCHED_DEADLINE
 	---help---
 	  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_ZONED
+	tristate "Zoned I/O scheduler"
+	depends on BLK_DEV_ZONED
+	depends on SCSI_MOD
+	depends on BLK_DEV_SD
+	default y
+	---help---
+	  MQ deadline IO scheduler with support for zoned block devices.
+
+	  This should be set as the default I/O scheduler for host-managed
+	  zoned block devices. It is optional for host-aware block devices.
+
 config MQ_IOSCHED_KYBER
 	tristate "Kyber I/O scheduler"
 	default y
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index 93dbe58c47c8..740870396a9a 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -176,6 +176,7 @@ hv_storvsc-y			:= storvsc_drv.o
 sd_mod-objs	:= sd.o
 sd_mod-$(CONFIG_BLK_DEV_INTEGRITY) += sd_dif.o
 sd_mod-$(CONFIG_BLK_DEV_ZONED) += sd_zbc.o
+obj-$(CONFIG_MQ_IOSCHED_ZONED) += zoned_iosched.o
 
 sr_mod-objs	:= sr.o sr_ioctl.o sr_vendor.o
 ncr53c8xx-flags-$(CONFIG_SCSI_ZALON) \
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 92113a9e2b20..84f35562aa8c 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -75,7 +75,8 @@ struct scsi_disk {
 #ifdef CONFIG_BLK_DEV_ZONED
 	unsigned int	nr_zones;
 	unsigned int	zone_blocks;
-	unsigned int	zone_shift;
+	unsigned int	zone_sectors;
+	unsigned int	zone_sectors_shift;
 	unsigned long	*zones_wlock;
 	unsigned long	*seq_zones;
 	unsigned int	zones_optimal_open;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 6b1a7f9c1e90..ec34ede9c290 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -223,7 +223,7 @@ int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd)
 	if (sdkp->device->changed)
 		return BLKPREP_KILL;
 
-	if (sector & (sd_zbc_zone_sectors(sdkp) - 1))
+	if (sector & (sdkp->zone_sectors - 1))
 		/* Unaligned request */
 		return BLKPREP_KILL;
 
@@ -251,7 +251,7 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	struct request *rq = cmd->request;
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 	sector_t sector = blk_rq_pos(rq);
-	sector_t zone_sectors = sd_zbc_zone_sectors(sdkp);
+	sector_t zone_sectors = sdkp->zone_sectors;
 	unsigned int zno;
 
 	/*
@@ -274,7 +274,7 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	 * ordering problems due to the unlocking of the request queue in the
 	 * dispatch path of the non scsi-mq (legacy) case.
 	 */
-	zno = sd_zbc_zone_no(sdkp, sector);
+	zno = sd_zbc_request_zone_no(sdkp, rq);
 	if (!test_bit(zno, sdkp->seq_zones))
 		return BLKPREP_OK;
 	if (test_and_set_bit(zno, sdkp->zones_wlock))
@@ -296,7 +296,7 @@ void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 
 	if (cmd->flags & SCMD_ZONE_WRITE_LOCK) {
-		unsigned int zno = sd_zbc_zone_no(sdkp, blk_rq_pos(rq));
+		unsigned int zno = sd_zbc_request_zone_no(sdkp, rq);
 
 		WARN_ON_ONCE(!test_bit(zno, sdkp->zones_wlock));
 		cmd->flags &= ~SCMD_ZONE_WRITE_LOCK;
@@ -509,7 +509,8 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp)
 	}
 
 	sdkp->zone_blocks = zone_blocks;
-	sdkp->zone_shift = ilog2(zone_blocks);
+	sdkp->zone_sectors = logical_to_sectors(sdkp->device, zone_blocks);
+	sdkp->zone_sectors_shift = ilog2(sdkp->zone_sectors);
 
 	return 0;
 }
@@ -574,6 +575,7 @@ static int sd_zbc_setup_seq_zones(struct scsi_disk *sdkp)
 
 static int sd_zbc_setup(struct scsi_disk *sdkp)
 {
+	sector_t zone_blocks = sdkp->zone_blocks;
 	int ret;
 
 	/* READ16/WRITE16 is mandatory for ZBC disks */
@@ -582,9 +584,9 @@ static int sd_zbc_setup(struct scsi_disk *sdkp)
 
 	/* chunk_sectors indicates the zone size */
 	blk_queue_chunk_sectors(sdkp->disk->queue,
-			logical_to_sectors(sdkp->device, sdkp->zone_blocks));
+				logical_to_sectors(sdkp->device, zone_blocks));
 	sdkp->nr_zones =
-		round_up(sdkp->capacity, sdkp->zone_blocks) >> sdkp->zone_shift;
+		round_up(sdkp->capacity, zone_blocks) >> ilog2(zone_blocks);
 
 	/*
 	 * Wait for the disk capacity to stabilize before
diff --git a/drivers/scsi/sd_zbc.h b/drivers/scsi/sd_zbc.h
index 2b63ee352fa2..7971f05de333 100644
--- a/drivers/scsi/sd_zbc.h
+++ b/drivers/scsi/sd_zbc.h
@@ -24,12 +24,12 @@ static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp)
 }
 
 /*
- * Zone number of the specified sector.
+ * Zone number of the specified request.
  */
-static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp,
-					  sector_t sector)
+static inline unsigned int sd_zbc_request_zone_no(struct scsi_disk *sdkp,
+						  struct request *rq)
 {
-	return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift;
+	return blk_rq_pos(rq) >> sdkp->zone_sectors_shift;
 }
 
 /*
diff --git a/drivers/scsi/zoned_iosched.c b/drivers/scsi/zoned_iosched.c
new file mode 100644
index 000000000000..e1a57a7a5271
--- /dev/null
+++ b/drivers/scsi/zoned_iosched.c
@@ -0,0 +1,939 @@
+/*
+ *  Zoned MQ Deadline i/o scheduler - adaptation of the MQ deadline scheduler,
+ *  for zoned block devices used with the blk-mq scheduling framework
+ *
+ *  Copyright (C) 2016 Jens Axboe <axboe@kernel.dk>
+ *  Copyright (C) 2017 Damien Le Moal <damien.lemoal@wdc.com>
+ */
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/blk-mq-sched.h>
+#include <linux/blk-mq-debugfs.h>
+#include <linux/elevator.h>
+#include <linux/bio.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/rbtree.h>
+#include <linux/sbitmap.h>
+#include <linux/seq_file.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+
+#include "sd.h"
+#include "sd_zbc.h"
+
+/*
+ * See Documentation/block/deadline-iosched.txt
+ */
+
+/* max time before a read is submitted. */
+static const int read_expire = HZ / 2;
+
+/* ditto for writes, these limits are SOFT! */
+static const int write_expire = 5 * HZ;
+
+/* max times reads can starve a write */
+static const int writes_starved = 2;
+
+/*
+ * Number of sequential requests treated as one by the above parameters.
+ * For throughput.
+ */
+static const int fifo_batch = 16;
+
+/*
+ * Run time data.
+ */
+struct zoned_data {
+	/*
+	 * requests (zoned_rq s) are present on both sort_list and fifo_list
+	 */
+	struct rb_root sort_list[2];
+	struct list_head fifo_list[2];
+
+	/*
+	 * next in sort order. read, write or both are NULL
+	 */
+	struct request *next_rq[2];
+	unsigned int batching;		/* number of sequential requests made */
+	unsigned int starved;		/* times reads have starved writes */
+
+	/*
+	 * settings that change how the i/o scheduler behaves
+	 */
+	int fifo_expire[2];
+	int fifo_batch;
+	int writes_starved;
+	int front_merges;
+
+	spinlock_t lock;
+	struct list_head dispatch;
+
+	struct scsi_disk *sdkp;
+
+	spinlock_t zones_lock;
+	unsigned long *zones_wlock;
+	unsigned long *seq_zones;
+};
+
+static inline struct rb_root *
+zoned_rb_root(struct zoned_data *zd, struct request *rq)
+{
+	return &zd->sort_list[rq_data_dir(rq)];
+}
+
+/*
+ * get the request after `rq' in sector-sorted order
+ */
+static inline struct request *
+zoned_latter_request(struct request *rq)
+{
+	struct rb_node *node = rb_next(&rq->rb_node);
+
+	if (node)
+		return rb_entry_rq(node);
+
+	return NULL;
+}
+
+static void
+zoned_add_rq_rb(struct zoned_data *zd, struct request *rq)
+{
+	struct rb_root *root = zoned_rb_root(zd, rq);
+
+	elv_rb_add(root, rq);
+}
+
+static inline void
+zoned_del_rq_rb(struct zoned_data *zd, struct request *rq)
+{
+	const int data_dir = rq_data_dir(rq);
+
+	if (zd->next_rq[data_dir] == rq)
+		zd->next_rq[data_dir] = zoned_latter_request(rq);
+
+	elv_rb_del(zoned_rb_root(zd, rq), rq);
+}
+
+/*
+ * remove rq from rbtree and fifo.
+ */
+static void zoned_remove_request(struct request_queue *q, struct request *rq)
+{
+	struct zoned_data *zd = q->elevator->elevator_data;
+
+	list_del_init(&rq->queuelist);
+
+	/*
+	 * We might not be on the rbtree, if we are doing an insert merge
+	 */
+	if (!RB_EMPTY_NODE(&rq->rb_node))
+		zoned_del_rq_rb(zd, rq);
+
+	elv_rqhash_del(q, rq);
+	if (q->last_merge == rq)
+		q->last_merge = NULL;
+}
+
+static void zd_request_merged(struct request_queue *q, struct request *req,
+			      enum elv_merge type)
+{
+	struct zoned_data *zd = q->elevator->elevator_data;
+
+	/*
+	 * if the merge was a front merge, we need to reposition request
+	 */
+	if (type == ELEVATOR_FRONT_MERGE) {
+		elv_rb_del(zoned_rb_root(zd, req), req);
+		zoned_add_rq_rb(zd, req);
+	}
+}
+
+static void zd_merged_requests(struct request_queue *q, struct request *req,
+			       struct request *next)
+{
+	/*
+	 * if next expires before rq, assign its expire time to rq
+	 * and move into next position (next will be deleted) in fifo
+	 */
+	if (!list_empty(&req->queuelist) && !list_empty(&next->queuelist)) {
+		if (time_before((unsigned long)next->fifo_time,
+				(unsigned long)req->fifo_time)) {
+			list_move(&req->queuelist, &next->queuelist);
+			req->fifo_time = next->fifo_time;
+		}
+	}
+
+	/*
+	 * kill knowledge of next, this one is a goner
+	 */
+	zoned_remove_request(q, next);
+}
+
+/*
+ * Return true if a request is a write requests that needs zone
+ * write locking.
+ */
+static inline bool zoned_request_needs_wlock(struct zoned_data *zd,
+					     struct request *rq)
+{
+	unsigned int zno = sd_zbc_request_zone_no(zd->sdkp, rq);
+
+	if (blk_rq_is_passthrough(rq))
+		return false;
+
+	if (!test_bit(zno, zd->seq_zones))
+		return false;
+
+	switch (req_op(rq)) {
+	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_WRITE_SAME:
+	case REQ_OP_WRITE:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/*
+ * Abuse the elv.priv[0] pointer to indicate if a request
+ * has locked its target zone.
+ */
+#define RQ_LOCKED_ZONE		((void *)1UL)
+static inline void zoned_set_request_lock(struct request *rq)
+{
+	rq->elv.priv[0] = RQ_LOCKED_ZONE;
+}
+
+#define RQ_ZONE_NO_LOCK		((void *)0UL)
+static inline void zoned_clear_request_lock(struct request *rq)
+{
+	rq->elv.priv[0] = RQ_ZONE_NO_LOCK;
+}
+
+static inline bool zoned_request_has_lock(struct request *rq)
+{
+	return rq->elv.priv[0] == RQ_LOCKED_ZONE;
+}
+
+/*
+ * Write lock the target zone of a write request.
+ */
+static void zoned_wlock_request_zone(struct zoned_data *zd, struct request *rq)
+{
+	unsigned int zno = sd_zbc_request_zone_no(zd->sdkp, rq);
+
+	WARN_ON_ONCE(zoned_request_has_lock(rq));
+	WARN_ON_ONCE(test_and_set_bit(zno, zd->zones_wlock));
+	zoned_set_request_lock(rq);
+}
+
+/*
+ * Write unlock the target zone of a write request.
+ */
+static void zoned_wunlock_request_zone(struct zoned_data *zd,
+				       struct request *rq)
+{
+	unsigned int zno = sd_zbc_request_zone_no(zd->sdkp, rq);
+	unsigned long flags;
+
+	/*
+	 * Dispatch may be running on a different CPU.
+	 * So do not unlock the zone until it is done or
+	 * a write request in the middle of a sequence may end up
+	 * being dispatched.
+	 */
+	spin_lock_irqsave(&zd->zones_lock, flags);
+
+	WARN_ON_ONCE(!test_and_clear_bit(zno, zd->zones_wlock));
+	zoned_clear_request_lock(rq);
+
+	spin_unlock_irqrestore(&zd->zones_lock, flags);
+}
+
+/*
+ * Test the write lock state of the target zone of a write request.
+ */
+static inline bool zoned_request_zone_is_wlocked(struct zoned_data *zd,
+						 struct request *rq)
+{
+	unsigned int zno = sd_zbc_request_zone_no(zd->sdkp, rq);
+
+	return test_bit(zno, zd->zones_wlock);
+}
+
+/*
+ * move an entry to dispatch queue
+ */
+static void zoned_move_request(struct zoned_data *zd, struct request *rq)
+{
+	const int data_dir = rq_data_dir(rq);
+
+	zd->next_rq[READ] = NULL;
+	zd->next_rq[WRITE] = NULL;
+	zd->next_rq[data_dir] = zoned_latter_request(rq);
+
+	/*
+	 * take it off the sort and fifo list
+	 */
+	zoned_remove_request(rq->q, rq);
+}
+
+/*
+ * zoned_check_fifo returns 1 if there is an expired request on the fifo
+ * of the given data direction, 0 otherwise. Unlike in mq-deadline, the
+ * fifo list may be empty here (write dispatch can fall back to reads),
+ * so check for an empty list instead of requiring a non-empty one.
+ */
+static inline int zoned_check_fifo(struct zoned_data *zd, int ddir)
+{
+	struct request *rq;
+
+	if (list_empty(&zd->fifo_list[ddir]))
+		return 0;
+
+	rq = rq_entry_fifo(zd->fifo_list[ddir].next);
+
+	/*
+	 * rq is expired!
+	 */
+	if (time_after_eq(jiffies, (unsigned long)rq->fifo_time))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Test if a request can be dispatched.
+ */
+static inline bool zoned_can_dispatch_request(struct zoned_data *zd,
+					      struct request *rq)
+{
+	return !zoned_request_needs_wlock(zd, rq) ||
+		!zoned_request_zone_is_wlocked(zd, rq);
+}
+
+/*
+ * For the specified data direction, find the next request that can be
+ * dispatched. Search in increasing sector position.
+ */
+static struct request *
+zoned_next_request(struct zoned_data *zd, int data_dir)
+{
+	struct request *rq = zd->next_rq[data_dir];
+	unsigned long flags;
+
+	if (data_dir == READ)
+		return rq;
+
+	spin_lock_irqsave(&zd->zones_lock, flags);
+	while (rq) {
+		if (zoned_can_dispatch_request(zd, rq))
+			break;
+		rq = zoned_latter_request(rq);
+	}
+	spin_unlock_irqrestore(&zd->zones_lock, flags);
+
+	return rq;
+}
+
+/*
+ * For the specified data direction, find the next request that can be
+ * dispatched. Search in arrival order from the oldest request.
+ */
+static struct request *
+zoned_fifo_request(struct zoned_data *zd, int data_dir)
+{
+	struct request *rq;
+	unsigned long flags;
+
+	if (list_empty(&zd->fifo_list[data_dir]))
+		return NULL;
+
+	if (data_dir == READ)
+		return rq_entry_fifo(zd->fifo_list[READ].next);
+
+	spin_lock_irqsave(&zd->zones_lock, flags);
+
+	list_for_each_entry(rq, &zd->fifo_list[WRITE], queuelist) {
+		if (zoned_can_dispatch_request(zd, rq))
+			goto out;
+	}
+	rq = NULL;
+
+out:
+	spin_unlock_irqrestore(&zd->zones_lock, flags);
+
+	return rq;
+}
+
+/*
+ * __zd_dispatch_request selects the best request according to
+ * read/write batch expiration, fifo_batch, target zone lock state, etc
+ */
+static struct request *__zd_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+	struct zoned_data *zd = hctx->queue->elevator->elevator_data;
+	struct request *rq = NULL, *next_rq;
+	bool reads, writes;
+	int data_dir;
+
+	if (!list_empty(&zd->dispatch)) {
+		rq = list_first_entry(&zd->dispatch, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		goto done;
+	}
+
+	reads = !list_empty(&zd->fifo_list[READ]);
+	writes = !list_empty(&zd->fifo_list[WRITE]);
+
+	/*
+	 * batches are currently reads XOR writes
+	 */
+	rq = zoned_next_request(zd, WRITE);
+	if (!rq)
+		rq = zoned_next_request(zd, READ);
+	if (rq && zd->batching < zd->fifo_batch)
+		/* we have a next request and are still entitled to batch */
+		goto dispatch_request;
+
+	/*
+	 * at this point we are not running a batch. select the appropriate
+	 * data direction (read / write)
+	 */
+
+	if (reads) {
+		if (writes && (zd->starved++ >= zd->writes_starved))
+			goto dispatch_writes;
+
+		data_dir = READ;
+
+		goto dispatch_find_request;
+	}
+
+	/*
+	 * there are either no reads or writes have been starved
+	 */
+
+	if (writes) {
+dispatch_writes:
+		zd->starved = 0;
+
+		/* Really select writes if at least one can be dispatched */
+		if (zoned_fifo_request(zd, WRITE))
+			data_dir = WRITE;
+		else
+			data_dir = READ;
+
+		goto dispatch_find_request;
+	}
+
+	return NULL;
+
+dispatch_find_request:
+	/*
+	 * we are not running a batch, find best request for selected data_dir
+	 */
+	next_rq = zoned_next_request(zd, data_dir);
+	if (zoned_check_fifo(zd, data_dir) || !next_rq) {
+		/*
+		 * A deadline has expired, the last request was in the other
+		 * direction, or we have run out of higher-sectored requests.
+		 * Start again from the request with the earliest expiry time.
+		 */
+		rq = zoned_fifo_request(zd, data_dir);
+	} else {
+		/*
+		 * The last req was the same dir and we have a next request in
+		 * sort order. No expired requests so continue on from here.
+		 */
+		rq = next_rq;
+	}
+
+	if (!rq)
+		return NULL;
+
+	zd->batching = 0;
+
+dispatch_request:
+	/*
+	 * rq is the selected appropriate request.
+	 */
+	zd->batching++;
+	zoned_move_request(zd, rq);
+
+done:
+	/*
+	 * If the request needs its target zone locked, do it.
+	 */
+	if (zoned_request_needs_wlock(zd, rq))
+		zoned_wlock_request_zone(zd, rq);
+	rq->rq_flags |= RQF_STARTED;
+	return rq;
+}
+
+static struct request *zd_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+	struct zoned_data *zd = hctx->queue->elevator->elevator_data;
+	struct request *rq;
+
+	spin_lock(&zd->lock);
+	rq = __zd_dispatch_request(hctx);
+	spin_unlock(&zd->lock);
+
+	return rq;
+}
+
+static int zd_request_merge(struct request_queue *q, struct request **rq,
+			    struct bio *bio)
+{
+	struct zoned_data *zd = q->elevator->elevator_data;
+	sector_t sector = bio_end_sector(bio);
+	struct request *__rq;
+
+	if (!zd->front_merges)
+		return ELEVATOR_NO_MERGE;
+
+	__rq = elv_rb_find(&zd->sort_list[bio_data_dir(bio)], sector);
+	if (__rq) {
+		if (WARN_ON(sector != blk_rq_pos(__rq)))
+			return ELEVATOR_NO_MERGE;
+
+		if (elv_bio_merge_ok(__rq, bio)) {
+			*rq = __rq;
+			return ELEVATOR_FRONT_MERGE;
+		}
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static bool zd_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
+{
+	struct request_queue *q = hctx->queue;
+	struct zoned_data *zd = q->elevator->elevator_data;
+	struct request *free = NULL;
+	bool ret;
+
+	spin_lock(&zd->lock);
+	ret = blk_mq_sched_try_merge(q, bio, &free);
+	spin_unlock(&zd->lock);
+
+	if (free)
+		blk_mq_free_request(free);
+
+	return ret;
+}
+
+/*
+ * add rq to rbtree and fifo
+ */
+static void __zd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
+				bool at_head)
+{
+	struct request_queue *q = hctx->queue;
+	struct zoned_data *zd = q->elevator->elevator_data;
+	const int data_dir = rq_data_dir(rq);
+
+	if (blk_mq_sched_try_insert_merge(q, rq))
+		return;
+
+	blk_mq_sched_request_inserted(rq);
+
+	if (at_head || blk_rq_is_passthrough(rq)) {
+		if (at_head)
+			list_add(&rq->queuelist, &zd->dispatch);
+		else
+			list_add_tail(&rq->queuelist, &zd->dispatch);
+	} else {
+		zoned_add_rq_rb(zd, rq);
+
+		if (rq_mergeable(rq)) {
+			elv_rqhash_add(q, rq);
+			if (!q->last_merge)
+				q->last_merge = rq;
+		}
+
+		/*
+		 * set expire time and add to fifo list
+		 */
+		rq->fifo_time = jiffies + zd->fifo_expire[data_dir];
+		list_add_tail(&rq->queuelist, &zd->fifo_list[data_dir]);
+	}
+}
+
+static void zd_insert_requests(struct blk_mq_hw_ctx *hctx,
+			       struct list_head *list, bool at_head)
+{
+	struct request_queue *q = hctx->queue;
+	struct zoned_data *zd = q->elevator->elevator_data;
+
+	spin_lock(&zd->lock);
+	while (!list_empty(list)) {
+		struct request *rq;
+
+		rq = list_first_entry(list, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+
+		/*
+		 * This may be a requeue of a request that has locked its
+		 * target zone. If this is the case, release the zone lock.
+		 */
+		if (zoned_request_has_lock(rq))
+			zoned_wunlock_request_zone(zd, rq);
+
+		__zd_insert_request(hctx, rq, at_head);
+	}
+	spin_unlock(&zd->lock);
+}
+
+/*
+ * Write unlock the target zone of a completed write request.
+ */
+static void zd_completed_request(struct request *rq)
+{
+	if (zoned_request_has_lock(rq)) {
+		struct zoned_data *zd = rq->q->elevator->elevator_data;
+
+		zoned_wunlock_request_zone(zd, rq);
+	}
+}
+
+static bool zd_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct zoned_data *zd = hctx->queue->elevator->elevator_data;
+
+	return !list_empty_careful(&zd->dispatch) ||
+		!list_empty_careful(&zd->fifo_list[0]) ||
+		!list_empty_careful(&zd->fifo_list[1]);
+}
+
+static struct scsi_disk *zoned_lookup_disk(struct request_queue *q)
+{
+	struct scsi_disk *sdkp;
+
+	if (!blk_queue_is_zoned(q)) {
+		pr_err("zoned: Not a zoned block device\n");
+		return NULL;
+	}
+
+	sdkp = scsi_disk_from_queue(q);
+	if (!sdkp) {
+		pr_err("zoned: Not a SCSI disk\n");
+		return NULL;
+	}
+
+	/* Paranoia check */
+	if (WARN_ON(sdkp->disk->queue != q))
+		return NULL;
+
+	return sdkp;
+}
+
+/*
+ * Initialize elevator private data (zoned_data).
+ */
+static int zd_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+	struct scsi_disk *sdkp;
+	struct zoned_data *zd;
+	struct elevator_queue *eq;
+	int ret;
+
+	sdkp = zoned_lookup_disk(q);
+	if (!sdkp)
+		return -ENODEV;
+
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
+	zd = kzalloc_node(sizeof(*zd), GFP_KERNEL, q->node);
+	if (!zd) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	INIT_LIST_HEAD(&zd->fifo_list[READ]);
+	INIT_LIST_HEAD(&zd->fifo_list[WRITE]);
+	zd->sort_list[READ] = RB_ROOT;
+	zd->sort_list[WRITE] = RB_ROOT;
+	zd->fifo_expire[READ] = read_expire;
+	zd->fifo_expire[WRITE] = write_expire;
+	zd->writes_starved = writes_starved;
+	zd->front_merges = 1;
+	zd->fifo_batch = fifo_batch;
+	spin_lock_init(&zd->lock);
+	INIT_LIST_HEAD(&zd->dispatch);
+
+	zd->sdkp = sdkp;
+	spin_lock_init(&zd->zones_lock);
+
+	zd->zones_wlock = sdkp->zones_wlock;
+	zd->seq_zones = sdkp->seq_zones;
+	if (!zd->zones_wlock || !zd->seq_zones) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	eq->elevator_data = zd;
+	q->elevator = eq;
+
+	return 0;
+
+err:
+	kfree(zd);
+	kobject_put(&eq->kobj);
+
+	return ret;
+}
+
+static void zd_exit_queue(struct elevator_queue *e)
+{
+	struct zoned_data *zd = e->elevator_data;
+
+	WARN_ON(!list_empty(&zd->fifo_list[READ]));
+	WARN_ON(!list_empty(&zd->fifo_list[WRITE]));
+
+	kfree(zd);
+}
+
+/*
+ * sysfs parts below
+ */
+static ssize_t
+zoned_var_show(int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t
+zoned_var_store(int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+	int ret;
+
+	ret = kstrtoint(p, 10, var);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct zoned_data *zd = e->elevator_data;			\
+	int __data = __VAR;						\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return zoned_var_show(__data, (page));			\
+}
+SHOW_FUNCTION(zoned_read_expire_show, zd->fifo_expire[READ], 1);
+SHOW_FUNCTION(zoned_write_expire_show, zd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(zoned_writes_starved_show, zd->writes_starved, 0);
+SHOW_FUNCTION(zoned_front_merges_show, zd->front_merges, 0);
+SHOW_FUNCTION(zoned_fifo_batch_show, zd->fifo_batch, 0);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+static ssize_t __FUNC(struct elevator_queue *e,				\
+		      const char *page, size_t count)			\
+{									\
+	struct zoned_data *zd = e->elevator_data;			\
+	int __data;							\
+	int ret = zoned_var_store(&__data, (page), count);		\
+	if (ret < 0)							\
+		return ret;						\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(zoned_read_expire_store, &zd->fifo_expire[READ],
+	       0, INT_MAX, 1);
+STORE_FUNCTION(zoned_write_expire_store, &zd->fifo_expire[WRITE],
+	       0, INT_MAX, 1);
+STORE_FUNCTION(zoned_writes_starved_store, &zd->writes_starved,
+	       INT_MIN, INT_MAX, 0);
+STORE_FUNCTION(zoned_front_merges_store, &zd->front_merges,
+	       0, 1, 0);
+STORE_FUNCTION(zoned_fifo_batch_store, &zd->fifo_batch,
+	       0, INT_MAX, 0);
+#undef STORE_FUNCTION
+
+#define DD_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, zoned_##name##_show, \
+				      zoned_##name##_store)
+
+static struct elv_fs_entry zoned_attrs[] = {
+	DD_ATTR(read_expire),
+	DD_ATTR(write_expire),
+	DD_ATTR(writes_starved),
+	DD_ATTR(front_merges),
+	DD_ATTR(fifo_batch),
+	__ATTR_NULL
+};
+
+#ifdef CONFIG_BLK_DEBUG_FS
+#define ZONED_DEBUGFS_DDIR_ATTRS(ddir, name)				\
+static void *zoned_##name##_fifo_start(struct seq_file *m,		\
+					  loff_t *pos)			\
+	__acquires(&zd->lock)						\
+{									\
+	struct request_queue *q = m->private;				\
+	struct zoned_data *zd = q->elevator->elevator_data;		\
+									\
+	spin_lock(&zd->lock);						\
+	return seq_list_start(&zd->fifo_list[ddir], *pos);		\
+}									\
+									\
+static void *zoned_##name##_fifo_next(struct seq_file *m, void *v,	\
+					 loff_t *pos)			\
+{									\
+	struct request_queue *q = m->private;				\
+	struct zoned_data *zd = q->elevator->elevator_data;		\
+									\
+	return seq_list_next(v, &zd->fifo_list[ddir], pos);		\
+}									\
+									\
+static void zoned_##name##_fifo_stop(struct seq_file *m, void *v)	\
+	__releases(&zd->lock)						\
+{									\
+	struct request_queue *q = m->private;				\
+	struct zoned_data *zd = q->elevator->elevator_data;		\
+									\
+	spin_unlock(&zd->lock);						\
+}									\
+									\
+static const struct seq_operations zoned_##name##_fifo_seq_ops = {	\
+	.start	= zoned_##name##_fifo_start,				\
+	.next	= zoned_##name##_fifo_next,				\
+	.stop	= zoned_##name##_fifo_stop,				\
+	.show	= blk_mq_debugfs_rq_show,				\
+};									\
+									\
+static int zoned_##name##_next_rq_show(void *data,			\
+					  struct seq_file *m)		\
+{									\
+	struct request_queue *q = data;					\
+	struct zoned_data *zd = q->elevator->elevator_data;		\
+	struct request *rq = zd->next_rq[ddir];				\
+									\
+	if (rq)								\
+		__blk_mq_debugfs_rq_show(m, rq);			\
+	return 0;							\
+}
+ZONED_DEBUGFS_DDIR_ATTRS(READ, read)
+ZONED_DEBUGFS_DDIR_ATTRS(WRITE, write)
+#undef ZONED_DEBUGFS_DDIR_ATTRS
+
+static int zoned_batching_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	struct zoned_data *zd = q->elevator->elevator_data;
+
+	seq_printf(m, "%u\n", zd->batching);
+	return 0;
+}
+
+static int zoned_starved_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	struct zoned_data *zd = q->elevator->elevator_data;
+
+	seq_printf(m, "%u\n", zd->starved);
+	return 0;
+}
+
+static void *zoned_dispatch_start(struct seq_file *m, loff_t *pos)
+	__acquires(&zd->lock)
+{
+	struct request_queue *q = m->private;
+	struct zoned_data *zd = q->elevator->elevator_data;
+
+	spin_lock(&zd->lock);
+	return seq_list_start(&zd->dispatch, *pos);
+}
+
+static void *zoned_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct request_queue *q = m->private;
+	struct zoned_data *zd = q->elevator->elevator_data;
+
+	return seq_list_next(v, &zd->dispatch, pos);
+}
+
+static void zoned_dispatch_stop(struct seq_file *m, void *v)
+	__releases(&zd->lock)
+{
+	struct request_queue *q = m->private;
+	struct zoned_data *zd = q->elevator->elevator_data;
+
+	spin_unlock(&zd->lock);
+}
+
+static const struct seq_operations zoned_dispatch_seq_ops = {
+	.start	= zoned_dispatch_start,
+	.next	= zoned_dispatch_next,
+	.stop	= zoned_dispatch_stop,
+	.show	= blk_mq_debugfs_rq_show,
+};
+
+#define ZONED_QUEUE_DDIR_ATTRS(name)					     \
+	{#name "_fifo_list", 0400, .seq_ops = &zoned_##name##_fifo_seq_ops}, \
+	{#name "_next_rq", 0400, zoned_##name##_next_rq_show}
+static const struct blk_mq_debugfs_attr zoned_queue_debugfs_attrs[] = {
+	ZONED_QUEUE_DDIR_ATTRS(read),
+	ZONED_QUEUE_DDIR_ATTRS(write),
+	{"batching", 0400, zoned_batching_show},
+	{"starved", 0400, zoned_starved_show},
+	{"dispatch", 0400, .seq_ops = &zoned_dispatch_seq_ops},
+	{},
+};
+#undef ZONED_QUEUE_DDIR_ATTRS
+#endif
+
+static struct elevator_type zoned_elv = {
+	.ops.mq = {
+		.insert_requests	= zd_insert_requests,
+		.dispatch_request	= zd_dispatch_request,
+		.completed_request	= zd_completed_request,
+		.next_request		= elv_rb_latter_request,
+		.former_request		= elv_rb_former_request,
+		.bio_merge		= zd_bio_merge,
+		.request_merge		= zd_request_merge,
+		.requests_merged	= zd_merged_requests,
+		.request_merged		= zd_request_merged,
+		.has_work		= zd_has_work,
+		.init_sched		= zd_init_queue,
+		.exit_sched		= zd_exit_queue,
+	},
+
+	.uses_mq	= true,
+#ifdef CONFIG_BLK_DEBUG_FS
+	.queue_debugfs_attrs = zoned_queue_debugfs_attrs,
+#endif
+	.elevator_attrs = zoned_attrs,
+	.elevator_name = "zoned",
+	.elevator_owner = THIS_MODULE,
+};
+
+static int __init zoned_init(void)
+{
+	return elv_register(&zoned_elv);
+}
+
+static void __exit zoned_exit(void)
+{
+	elv_unregister(&zoned_elv);
+}
+
+module_init(zoned_init);
+module_exit(zoned_exit);
+
+MODULE_AUTHOR("Damien Le Moal");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Zoned MQ deadline IO scheduler");
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 01/12] block: Fix declaration of blk-mq debugfs functions
  2017-09-07 16:16 ` [PATCH V2 01/12] block: Fix declaration of blk-mq debugfs functions Damien Le Moal
@ 2017-09-08  8:05   ` Johannes Thumshirn
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  8:05 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

On Fri, Sep 08, 2017 at 01:16:29AM +0900, Damien Le Moal wrote:
> __blk_mq_debugfs_rq_show() and blk_mq_debugfs_rq_show() are exported
> symbols but ar eonly declared in the block internal file
             are only

Otherwise this looks good to me,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 02/12] block: Fix declaration of blk-mq scheduler functions
  2017-09-07 16:16 ` [PATCH V2 02/12] block: Fix declaration of blk-mq scheduler functions Damien Le Moal
@ 2017-09-08  8:05   ` Johannes Thumshirn
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  8:05 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 03/12] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h
  2017-09-07 16:16 ` [PATCH V2 03/12] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h Damien Le Moal
@ 2017-09-08  8:10   ` Johannes Thumshirn
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  8:10 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 04/12] scsi: sd_zbc: Move zbc disk declarations to sd_zbc.h
  2017-09-07 16:16 ` [PATCH V2 04/12] scsi: sd_zbc: Move zbc disk declarations to sd_zbc.h Damien Le Moal
@ 2017-09-08  8:11   ` Johannes Thumshirn
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  8:11 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 05/12] scsi: sd_zbc: Fix comments and indentation
  2017-09-07 16:16 ` [PATCH V2 05/12] scsi: sd_zbc: Fix comments and indentation Damien Le Moal
@ 2017-09-08  8:12   ` Johannes Thumshirn
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  8:12 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 06/12] scsi: sd_zbc: Rearrange code
  2017-09-07 16:16 ` [PATCH V2 06/12] scsi: sd_zbc: Rearrange code Damien Le Moal
@ 2017-09-08  8:13   ` Johannes Thumshirn
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  8:13 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 07/12] scsi: sd_zbc.c: Use well defined macros
  2017-09-07 16:16 ` [PATCH V2 07/12] scsi: sd_zbc.c: Use well defined macros Damien Le Moal
@ 2017-09-08  8:15   ` Johannes Thumshirn
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  8:15 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 08/12] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics()
  2017-09-07 16:16 ` [PATCH V2 08/12] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics() Damien Le Moal
@ 2017-09-08  8:17   ` Johannes Thumshirn
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  8:17 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche, stable

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 00/12] scsi-mq support for ZBC disks
  2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
                   ` (11 preceding siblings ...)
  2017-09-07 16:16 ` [PATCH V2 12/12] scsi: Introduce ZBC disk I/O scheduler Damien Le Moal
@ 2017-09-08  8:20 ` Christoph Hellwig
  2017-09-08 16:12   ` Damien Le Moal
  12 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2017-09-08  8:20 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

I'm really worried about how this is tied into the sd driver.

From a quick look it seems like the only thing you really need in
the scheduler that is done through struct scsi_disk is the
offset + len -> zone number lookup - I'd much rather expose this
from sd upwards and keep the scheduler in the block code.

Except for that, this looks like a fine approach to me.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones
  2017-09-07 16:16 ` [PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones Damien Le Moal
@ 2017-09-08  8:39   ` Johannes Thumshirn
  2017-09-08  9:48     ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  8:39 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

On Fri, Sep 08, 2017 at 01:16:37AM +0900, Damien Le Moal wrote:
> +/*
> + * Allocate a zone bitmap (one bit per zone).
> + */
> +static inline unsigned long *sd_zbc_alloc_zone_bitmap(struct scsi_disk *sdkp)
> +{
> +	struct request_queue *q = sdkp->disk->queue;
> +
> +	return kzalloc_node(BITS_TO_LONGS(sdkp->nr_zones)
> +			    * sizeof(unsigned long),
> +			    GFP_KERNEL, q->node);
> +}

It's a shame we have all this overflow checking kcalloc() and
kmalloc_array() functions but none of them is taking NUMA nodes into
account.
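For what it's worth, the overflow check such a node-aware helper would need is the same one kcalloc() performs today. Below is a minimal userspace sketch: kcalloc_node_sketch() is a hypothetical name (not an existing kernel API), and calloc() stands in for kzalloc_node(), so the node argument is only carried through.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace sketch of a hypothetical node-aware kcalloc(): refuse
 * allocations whose n * size multiplication would overflow, then
 * zero-allocate. In-kernel, calloc() would be kzalloc_node() and
 * the node argument would be honored.
 */
static void *kcalloc_node_sketch(size_t n, size_t size, int node)
{
	(void)node;	/* ignored in this userspace stand-in */

	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would overflow */

	return calloc(n, size);
}
```

With such a helper, the zone bitmap allocation quoted above would become a single call instead of an open-coded BITS_TO_LONGS() multiplication.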

But that has nothing to do with your patch, just a general rant,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones
  2017-09-08  8:39   ` Johannes Thumshirn
@ 2017-09-08  9:48     ` Christoph Hellwig
  2017-09-08  9:50       ` Johannes Thumshirn
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2017-09-08  9:48 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Damien Le Moal, linux-scsi, Martin K . Petersen, linux-block,
	Jens Axboe, Christoph Hellwig, Bart Van Assche

On Fri, Sep 08, 2017 at 10:39:25AM +0200, Johannes Thumshirn wrote:
> It's a shame we have all this overflow checking kcalloc() and
> kmalloc_array() functions but none of them is taking NUMA nodes into
> account.

Something that could be trivially fixed..

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones
  2017-09-08  9:48     ` Christoph Hellwig
@ 2017-09-08  9:50       ` Johannes Thumshirn
  0 siblings, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2017-09-08  9:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, linux-scsi, Martin K . Petersen, linux-block,
	Jens Axboe, Bart Van Assche

On Fri, Sep 08, 2017 at 11:48:09AM +0200, Christoph Hellwig wrote:
> On Fri, Sep 08, 2017 at 10:39:25AM +0200, Johannes Thumshirn wrote:
> > It's a shame we have all this overflow checking kcalloc() and
> > kmalloc_array() functions but none of them is taking NUMA nodes into
> > account.
> 
> Something that could be trivially fixed..

I know I'm on it ;-)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq
  2017-09-07 16:16 ` [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq Damien Le Moal
@ 2017-09-08 12:43   ` Ming Lei
  2017-09-08 16:53     ` Damien Le Moal
  0 siblings, 1 reply; 38+ messages in thread
From: Ming Lei @ 2017-09-08 12:43 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Hi Damien,

On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
> In the case of a ZBC disk used with scsi-mq, zone write locking does
> not prevent write reordering in sequential zones. Unlike the legacy
> case, zone locking can only be done after the command request is
> removed from the scheduler dispatch queue. That is, at the time of
> zone locking, the write command may already be out of order.

Per my understanding, in the legacy case it can be quite tricky to let
the existing I/O schedulers guarantee the write order for a ZBC disk.
I guess a requeue might still cause write reordering even in the legacy
path, since requeues can happen in both scsi_request_fn() and
scsi_io_completion() with q->queue_lock released; meanwhile, a new
request belonging to the same zone can arrive and be inserted into the
queue.
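As a toy illustration of that race (plain C, illustrative only, no kernel APIs): in a FIFO dispatch queue, a requeued command goes back behind writes to the same zone that were queued in the meantime, losing the sequential write order.

```c
/*
 * Toy model of a dispatch queue. Writes are identified by their start
 * sector; requeueing puts a command back at the tail, i.e. behind any
 * write to the same zone that arrived while it was in flight.
 */
static int queue[16];
static int head, tail;

static void enqueue(int sector)
{
	queue[tail++] = sector;
}

static int dispatch(void)
{
	return queue[head++];
}
```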

> 
> Disable zone write locking in sd_zbc_write_lock_zone() if the disk is
> used with scsi-mq. Write order guarantees can be provided by an
> adapted I/O scheduler.

Sounds like a good idea to enforce the ordering in a new scheduler; I
will look at the following patch.

-- 
Ming

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH V2 00/12] scsi-mq support for ZBC disks
  2017-09-08  8:20 ` [PATCH V2 00/12] scsi-mq support for ZBC disks Christoph Hellwig
@ 2017-09-08 16:12   ` Damien Le Moal
  2017-09-11 12:24     ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-08 16:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, Martin K . Petersen, linux-block, Bart Van Assche,
	Jens Axboe

Christoph,

On 9/8/17 01:20, Christoph Hellwig wrote:
> I'm really worried about how this is tied into the sd driver.
> 
> From a quick look it seems like the only thing you really need in
> the scheduler that is done through struct scsi_disk is the
> offset + len -> zone number lookup - I'd much rather expose this
> from sd upwards and keep the scheduler in the block code.

The only things that the scheduler needs are:
1) The zone size and the number of zones of the device (for the bitmaps
allocation and offset->zone number conversion).
2) Zone type for the optimization that avoids locking conventional zones.

(2) is optional. We can do without it, but it is still really nice to
have from a performance perspective, as conventional zones tend to be
used for storing metadata. A lot of small random writes is therefore
more likely, and high queue depth writing would improve performance
significantly.

For (1), the zone size is known through q->limits.chunk_sectors, but
the disk capacity is not known from the request_queue alone, so the
number of zones cannot be calculated. I thought of exporting it through
the queue limits too, but then stacking device mappers on top of ZBC
drives becomes a pain, as the number of zones needs to be recalculated.

The best solution would really be to be able to get the disk capacity
given a request queue. Any ideas on how to do this ?

For optimization (2), there is also the problem that, given a request
queue, the block_device struct is not known, so it is not possible to
issue a REPORT ZONES command (granted, the scheduler could build the
request directly to do that, but that is a really ugly solution).

I did not find any nice solution for either problem, so ended up tying
the scheduler to sd since the scsi_disk struct already had all the info
needed.

> Except for that this looks like a fine approach to me.

Thanks.

-- 
Damien Le Moal,
Western Digital Research


* Re: [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq
  2017-09-08 12:43   ` Ming Lei
@ 2017-09-08 16:53     ` Damien Le Moal
  2017-09-10  5:10       ` Ming Lei
  0 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-08 16:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Ming,

On 9/8/17 05:43, Ming Lei wrote:
> Hi Damien,
> 
> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
>> In the case of a ZBC disk used with scsi-mq, zone write locking does
>> not prevent write reordering in sequential zones. Unlike the legacy
>> case, zone locking can only be done after the command request is
>> removed from the scheduler dispatch queue. That is, at the time of
>> zone locking, the write command may already be out of order.
> 
> Per my understanding, for legacy case, it can be quite tricky to let
> the existed I/O scheduler guarantee the write order for ZBC disk.
> I guess requeue still might cause write reorder even in legacy path,
> since requeue can happen in both scsi_request_fn() and scsi_io_completion()
> with q->queue_lock released, meantime new rq belonging to the same
> zone can come and be inserted to queue.

Yes, the write ordering will always depend on the scheduler doing the
right thing. But cfq, deadline, and even noop all do the right thing
there, even considering the aging case. The next write for a zone will
always be the oldest in the queue for that zone; if it is not, it means
that the application did not write sequentially. Extensive testing in
the legacy case never showed a problem due to the scheduler itself.

scsi_requeue_command() does the unprep (zone unlock) and requeue while
holding the queue lock. So this is atomic with new write command
insertion. Requeued commands are added to the dispatch queue head, and
since a zone will only have a single write in-flight, there is no
reordering possible. The next write command to be issued for a zone is
either the last requeued one or the next one in LBA order. It works.

Note that for write commands that failed due to an unaligned write
error, there is no retry done, so no requeue. The requeue case for
writes would only happen for other conditions (a dead drive being the
most likely in this case).

>> Disable zone write locking in sd_zbc_write_lock_zone() if the disk is
>> used with scsi-mq. Write order guarantees can be provided by an
>> adapted I/O scheduler.
> 
> Sounds a good idea to enhance the order in a new scheduler, will
> look at the following patch.

For blk-mq, I only tried mq-deadline. The zoned scheduler I posted is
based on it. There is no fundamental change to the ordering on insertion,
only different choices on dispatch (using the zone lock).

For rotating rust and blk-mq, I think that serializing calls to dispatch
would naturally improve ordering, and also merging to some extent. Ordering
really gets killed when multiple contexts try to push down requests, with
each context ending up with only a few requests in its local dispatch list.
An initial patch I wrote for ZBC attacked the problem from within blk-mq and
did that serialization. It is no longer mandatory with the zoned scheduler,
but I think it would still be beneficial to both ZBC disks and standard
disks.

Best regards.


-- 
Damien Le Moal,
Western Digital Research


* Re: [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq
  2017-09-08 16:53     ` Damien Le Moal
@ 2017-09-10  5:10       ` Ming Lei
  2017-09-12  8:24         ` Damien Le Moal
  0 siblings, 1 reply; 38+ messages in thread
From: Ming Lei @ 2017-09-10  5:10 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote:
> Ming,
> 
> On 9/8/17 05:43, Ming Lei wrote:
> > Hi Damien,
> > 
> > On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
> >> In the case of a ZBC disk used with scsi-mq, zone write locking does
> >> not prevent write reordering in sequential zones. Unlike the legacy
> >> case, zone locking can only be done after the command request is
> >> removed from the scheduler dispatch queue. That is, at the time of
> >> zone locking, the write command may already be out of order.
> > 
> > Per my understanding, for legacy case, it can be quite tricky to let
> > the existed I/O scheduler guarantee the write order for ZBC disk.
> > I guess requeue still might cause write reorder even in legacy path,
> > since requeue can happen in both scsi_request_fn() and scsi_io_completion()
> > with q->queue_lock released, meantime new rq belonging to the same
> > zone can come and be inserted to queue.
> 
> Yes, the write ordering will always depend on the scheduler doing the
> right thing. But both cfq, deadline and even noop do the right thing
> there, even considering the aging case. The next write for a zone will
> always be the oldest in the queue for that zone, if it is not, it means
> that the application did not write sequentially. Extensive testing in
> the legacy case never showed a problem due to the scheduler itself.

OK. I suggest documenting this no-write-reordering guarantee for ZBC
somewhere, so that people will keep it in mind when trying to change
the current code.

> 
> scsi_requeue_command() does the unprep (zone unlock) and requeue while
> holding the queue lock. So this is atomic with new write command
> insertion. Requeued commands are added to the dispatch queue head, and
> since a zone will only have a single write in-flight, there is no
> reordering possible. The next write command for a zone to go again is
> the last requeued one or the next in lba order. It works.

One special case is a write with FLUSH/FUA, which may be added to the
front of q->queue_head directly. Suppose a write with FUA comes in right
between the requeue and the queue run: write reordering may then be
triggered.

If this assumption is true, the same issue might exist on blk-mq too.

-- 
Ming


* Re: [PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue()
  2017-09-07 16:16 ` [PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue() Damien Le Moal
@ 2017-09-10  5:16   ` Ming Lei
  2017-09-12  8:05     ` Damien Le Moal
  0 siblings, 1 reply; 38+ messages in thread
From: Ming Lei @ 2017-09-10  5:16 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

On Fri, Sep 08, 2017 at 01:16:39AM +0900, Damien Le Moal wrote:
> Using scsi_device_from_queue(), return the scsi_disk structure
> associated with a request queue if the device is a disk.
> Export this function to make it available to modules.
> 

This approach may be a little overkill. The gendisk is actually the
parent of the request queue in the kobject tree (see blk_register_queue());
that might be an easier way to retrieve the disk attached to the queue.

-- 
Ming


* Re: [PATCH V2 00/12] scsi-mq support for ZBC disks
  2017-09-08 16:12   ` Damien Le Moal
@ 2017-09-11 12:24     ` Christoph Hellwig
  2017-09-12  8:38       ` Damien Le Moal
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2017-09-11 12:24 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, Martin K . Petersen, linux-block,
	Bart Van Assche, Jens Axboe

On Fri, Sep 08, 2017 at 09:12:12AM -0700, Damien Le Moal wrote:
> 1) The zone size and the number of zones of the device (for the bitmaps
> allocation and offset->zone number conversion).
> 2) Zone type for the optimization that avoids locking conventional zones.
> 
> (2) is optional. We can do without, but still really nice to have from a
> performance perspective as conventional zones tend to be used for
> storing metadata. So a lot of small random writes is more likely and
> high queue depth writing would improve performance significantly.
> 
> For (1), the zone size is known through q->limits.chunk_sectors. But the
> disk capacity is not known using request_queue only, so the number of
> zones cannot be calculated... I thought of exporting it through queue
> limits too, but then stacking of device mappers using ZBC drives becomes
> a pain as the number of zones needs to be recalculated.

For 4.14-rc+ you should be able to easily get at the gendisk as the
whole submission path is now gendisk based, although it might need
some minor argument reshuffle to pass it instead of the request_queue
in a few places.  Note that for everything passing the request
you can get the gendisk from the request as it contains a pointer.

The only annoying issue is that some of our passthrough character
device callers don't have a gendisk at all, which causes
problems all over and should probably be fixed at some point.


* Re: [PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue()
  2017-09-10  5:16   ` Ming Lei
@ 2017-09-12  8:05     ` Damien Le Moal
  0 siblings, 0 replies; 38+ messages in thread
From: Damien Le Moal @ 2017-09-12  8:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Ming,

On 9/10/17 14:16, Ming Lei wrote:
> On Fri, Sep 08, 2017 at 01:16:39AM +0900, Damien Le Moal wrote:
>> Using scsi_device_from_queue(), return the scsi_disk structure
>> associated with a request queue if the device is a disk.
>> Export this function to make it available to modules.
>>
> 
> This approach may be a little over-kill, actually gendisk is the
> parent of request queue in kobject tree(see blk_register_queue()),
> that might be an easy way to retrieve disk attached to the queue.

Yes. It is a little extreme.
I am currently preparing a V3 that will have a softer touch.

Best regards.

-- 
Damien Le Moal,
Western Digital


* Re: [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq
  2017-09-10  5:10       ` Ming Lei
@ 2017-09-12  8:24         ` Damien Le Moal
  2017-09-12  9:26           ` Ming Lei
  0 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-12  8:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Ming,

On 9/10/17 14:10, Ming Lei wrote:
> On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote:
>> Ming,
>>
>> On 9/8/17 05:43, Ming Lei wrote:
>>> Hi Damien,
>>>
>>> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
>>>> In the case of a ZBC disk used with scsi-mq, zone write locking does
>>>> not prevent write reordering in sequential zones. Unlike the legacy
>>>> case, zone locking can only be done after the command request is
>>>> removed from the scheduler dispatch queue. That is, at the time of
>>>> zone locking, the write command may already be out of order.
>>>
>>> Per my understanding, for legacy case, it can be quite tricky to let
>>> the existed I/O scheduler guarantee the write order for ZBC disk.
>>> I guess requeue still might cause write reorder even in legacy path,
>>> since requeue can happen in both scsi_request_fn() and scsi_io_completion()
>>> with q->queue_lock released, meantime new rq belonging to the same
>>> zone can come and be inserted to queue.
>>
>> Yes, the write ordering will always depend on the scheduler doing the
>> right thing. But both cfq, deadline and even noop do the right thing
>> there, even considering the aging case. The next write for a zone will
>> always be the oldest in the queue for that zone, if it is not, it means
>> that the application did not write sequentially. Extensive testing in
>> the legacy case never showed a problem due to the scheduler itself.
> 
> OK, I suggest to document this guarantee of no write reorder for ZBC
> somewhere, so that people will keep it in mind when trying to change
> the current code.

Have you looked at the comments in sd_zbc.c ? That is explained there.
Granted, this is a little deep in the stack, but this is after all
dependent on the implementation of scsi_request_fn(). I can add comments
there too if you prefer.

>> scsi_requeue_command() does the unprep (zone unlock) and requeue while
>> holding the queue lock. So this is atomic with new write command
>> insertion. Requeued commands are added to the dispatch queue head, and
>> since a zone will only have a single write in-flight, there is no
>> reordering possible. The next write command for a zone to go again is
>> the last requeued one or the next in lba order. It works.
> 
> One special case is write with FLUSH/FUA, which may be added to
> front of q->queue_head directly. Suppose one write with FUA is
> just comes between requeue and run queue, write reorder may be
> triggered.

Zoned disks are recent and all of them support FUA. This means that a
write with FUA will be processed like any other write request (if I read
the code in blk-flush.c correctly). Well, at least in the mq case, which
does a blk_mq_sched_insert_request(), while the direct call to
list_add_tail() in the legacy case is suspicious. Why not
blk_insert_request() ? This may be a problem, but all the testing I have
done on the legacy path never hit it. The reason, I think, is that the
two in-kernel users of zoned devices (f2fs and dm-zoned) do not use
REQ_FUA, nor do they issue flush requests with data.

> If this assumption is true, there might be such issue on blk-mq
> too.

For the same reason as above, I think blk-mq is OK too. But granted, this
all seems a little brittle. I will look into this case more carefully to
solidify the overall support.

Thank you for your comments.

-- 
Damien Le Moal,
Western Digital


* Re: [PATCH V2 00/12] scsi-mq support for ZBC disks
  2017-09-11 12:24     ` Christoph Hellwig
@ 2017-09-12  8:38       ` Damien Le Moal
  2017-09-13 20:17         ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2017-09-12  8:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, Martin K . Petersen, linux-block, Bart Van Assche,
	Jens Axboe, Ming Lei

Christoph,

On 9/11/17 21:24, Christoph Hellwig wrote:
> On Fri, Sep 08, 2017 at 09:12:12AM -0700, Damien Le Moal wrote:
>> 1) The zone size and the number of zones of the device (for the bitmaps
>> allocation and offset->zone number conversion).
>> 2) Zone type for the optimization that avoids locking conventional zones.
>>
>> (2) is optional. We can do without, but still really nice to have from a
>> performance perspective as conventional zones tend to be used for
>> storing metadata. So a lot of small random writes is more likely and
>> high queue depth writing would improve performance significantly.
>>
>> For (1), the zone size is known through q->limits.chunk_sectors. But the
>> disk capacity is not known using request_queue only, so the number of
>> zones cannot be calculated... I thought of exporting it through queue
>> limits too, but then stacking of device mappers using ZBC drives becomes
>> a pain as the number of zones needs to be recalculated.
> 
> For 4.14-rc+ you should be able to easily get at the gendisk as the
> whole submission path is now gendisk based, although it might need
> some minor argument reshuffle to pass it instead of the request_queue
> in a few places.  Note that for everything passing the request
> you can get the gendisk from the request as it contains a pointer.

Thank you for the pointers. Ming also mentioned this. I will look into
it for V3. But in any case, this would only allow allocating the zone
write lock bitmap, as the number of zones of the device can then be
calculated using the device capacity (obtained from the gendisk) and the
device zone size (obtained from the queue limits).

I was thinking of something else very simple, similar to the integrity
code: using a "struct blk_zoned" added to the request queue and
initialized by the low level device driver on disk revalidate. This
structure only needs the number of zones and a zone type bitmap to
indicate sequential zones.

struct blk_zoned {
	unsigned int nr_zones;
	unsigned long *seq_zones;
};

struct request_queue {
	...
#ifdef CONFIG_BLK_DEV_ZONED
	struct blk_zoned zoned;
#endif
	...
};

This would allow implementing (2) trivially and also makes the
information available to any user of the disk (FS and device mappers).

Would this be acceptable ?

> The only annoying issue is that some of our passthrough character
> device callers don't have a gendisk at all, which causes
> problems all over and should probably be fixed at some point.

pass-through requests are completely ignored for zoned block devices and
processed as they come. Any kind of reordering control is the
responsibility of users.

Best regards

-- 
Damien Le Moal,
Western Digital


* Re: [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq
  2017-09-12  8:24         ` Damien Le Moal
@ 2017-09-12  9:26           ` Ming Lei
  2017-09-13  0:13             ` Damien Le Moal
  0 siblings, 1 reply; 38+ messages in thread
From: Ming Lei @ 2017-09-12  9:26 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

On Tue, Sep 12, 2017 at 05:24:02PM +0900, Damien Le Moal wrote:
> Ming,
> 
> On 9/10/17 14:10, Ming Lei wrote:
> > On Fri, Sep 08, 2017 at 09:53:53AM -0700, Damien Le Moal wrote:
> >> Ming,
> >>
> >> On 9/8/17 05:43, Ming Lei wrote:
> >>> Hi Damien,
> >>>
> >>> On Fri, Sep 08, 2017 at 01:16:38AM +0900, Damien Le Moal wrote:
> >>>> In the case of a ZBC disk used with scsi-mq, zone write locking does
> >>>> not prevent write reordering in sequential zones. Unlike the legacy
> >>>> case, zone locking can only be done after the command request is
> >>>> removed from the scheduler dispatch queue. That is, at the time of
> >>>> zone locking, the write command may already be out of order.
> >>>
> >>> Per my understanding, for legacy case, it can be quite tricky to let
> >>> the existed I/O scheduler guarantee the write order for ZBC disk.
> >>> I guess requeue still might cause write reorder even in legacy path,
> >>> since requeue can happen in both scsi_request_fn() and scsi_io_completion()
> >>> with q->queue_lock released, meantime new rq belonging to the same
> >>> zone can come and be inserted to queue.
> >>
> >> Yes, the write ordering will always depend on the scheduler doing the
> >> right thing. But both cfq, deadline and even noop do the right thing
> >> there, even considering the aging case. The next write for a zone will
> >> always be the oldest in the queue for that zone, if it is not, it means
> >> that the application did not write sequentially. Extensive testing in
> >> the legacy case never showed a problem due to the scheduler itself.
> > 
> > OK, I suggest to document this guarantee of no write reorder for ZBC
> > somewhere, so that people will keep it in mind when trying to change
> > the current code.
> 
> Have you looked at the comments in sd_zbc.c ? That is explained there.
> Granted, this is a little deep in the stack, but this is after all
> dependent on the implementation of scsi_request_fn(). I can add comments
> there too if you prefer.

Yeah, I looked at that, but it seems too coarse.

> 
> >> scsi_requeue_command() does the unprep (zone unlock) and requeue while
> >> holding the queue lock. So this is atomic with new write command
> >> insertion. Requeued commands are added to the dispatch queue head, and
> >> since a zone will only have a single write in-flight, there is no
> >> reordering possible. The next write command for a zone to go again is
> >> the last requeued one or the next in lba order. It works.
> > 
> > One special case is write with FLUSH/FUA, which may be added to
> > front of q->queue_head directly. Suppose one write with FUA is
> > just comes between requeue and run queue, write reorder may be
> > triggered.
> 
> Zoned disks are recent and all of them support FUA. This means that a
> write with FUA will be processed like any other write request (if I read
> the code in blk-flush.c correctly). Well, at least for the mq case,
> which does a blk_mq_sched_insert_request(), while the direct call to

blk_mq_sched_bypass_insert() can be called for flush requests too, since
requests in a flush sequence share one driver tag (rq->tag != -1); the rq
can then be added to the front of hctx->dispatch directly.


-- 
Ming


* Re: [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq
  2017-09-12  9:26           ` Ming Lei
@ 2017-09-13  0:13             ` Damien Le Moal
  0 siblings, 0 replies; 38+ messages in thread
From: Damien Le Moal @ 2017-09-13  0:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-scsi, Martin K . Petersen, linux-block, Jens Axboe,
	Christoph Hellwig, Bart Van Assche

Ming,

On 9/12/17 18:26, Ming Lei wrote:
>>> OK, I suggest to document this guarantee of no write reorder for ZBC
>>> somewhere, so that people will keep it in mind when trying to change
>>> the current code.
>>
>> Have you looked at the comments in sd_zbc.c ? That is explained there.
>> Granted, this is a little deep in the stack, but this is after all
>> dependent on the implementation of scsi_request_fn(). I can add comments
>> there too if you prefer.
> 
> Yeah, I looked at that, but seems it is too coarse.

OK. Will improve that in V3.

>> Zoned disks are recent and all of them support FUA. This means that a
>> write with FUA will be processed like any other write request (if I read
>> the code in blk-flush.c correctly). Well, at least for the mq case,
>> which does a blk_mq_sched_insert_request(), while the direct call to
> 
> blk_mq_sched_bypass_insert() can be called for flush requests too,
> since requests in flush sequence share one driver tag(rq->tag != -1),
> then the rq can be added to front of hctx->dispatch directly.

For pure flush commands, any order is OK since these are not actual
writes. The ZBC/ZAC standards do not modify the behavior of
flush/synchronize cache commands.

But I will look in more detail at the PREFLUSH/POSTFLUSH mechanics and
their relation to write request ordering to make sure no corner cases are
left unhandled.

Thanks for the comments.

Best regards.

-- 
Damien Le Moal,
Western Digital


* Re: [PATCH V2 00/12] scsi-mq support for ZBC disks
  2017-09-12  8:38       ` Damien Le Moal
@ 2017-09-13 20:17         ` Christoph Hellwig
  2017-09-14  0:04           ` Damien Le Moal
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2017-09-13 20:17 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, Martin K . Petersen, linux-block,
	Bart Van Assche, Jens Axboe, Ming Lei

On Tue, Sep 12, 2017 at 05:38:05PM +0900, Damien Le Moal wrote:
> struct blk_zoned {
> 	unsigned int nr_zones;
> 	unsigned long *seq_zones;
> };
> 
> struct request_queue {
> 	...
> #ifdef CONFIG_BLK_DEV_ZONED
> 	struct blk_zoned zoned;
> #endif
> 	...

Do we even need a new structure for it?  I'd just hang it off
the gendisk directly.

Except for that the idea looks fine to me.


* Re: [PATCH V2 00/12] scsi-mq support for ZBC disks
  2017-09-13 20:17         ` Christoph Hellwig
@ 2017-09-14  0:04           ` Damien Le Moal
  0 siblings, 0 replies; 38+ messages in thread
From: Damien Le Moal @ 2017-09-14  0:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, Martin K . Petersen, linux-block, Bart Van Assche,
	Jens Axboe, Ming Lei

Christoph,

On 9/14/17 05:17, Christoph Hellwig wrote:
> On Tue, Sep 12, 2017 at 05:38:05PM +0900, Damien Le Moal wrote:
>> struct blk_zoned {
>> 	unsigned int nr_zones;
>> 	unsigned long *seq_zones;
>> };
>>
>> struct request_queue {
>> 	...
>> #ifdef CONFIG_BLK_DEV_ZONED
>> 	struct blk_zoned zoned;
>> #endif
>> 	...
> 
> Do we even need a new structure for it?  I'd just hang it off
> the gendisk directly.

Using a structure just seemed cleaner to me, but it is indeed not really
necessary. As for attaching this info to the gendisk, that would indeed
be equivalent. Having the struct at hand in the request queue just makes
it easier to write small helpers in blkdev.h to access the info, hence
this choice. Should I change it ?

I have a V3 using the blk_zoned struct in the request queue under test.
I will post it shortly.

> Except for that the idea looks fine to me.

Thanks !

-- 
Damien Le Moal,
Western Digital


end of thread, other threads:[~2017-09-14  0:04 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-07 16:16 [PATCH V2 00/12] scsi-mq support for ZBC disks Damien Le Moal
2017-09-07 16:16 ` [PATCH V2 01/12] block: Fix declaration of blk-mq debugfs functions Damien Le Moal
2017-09-08  8:05   ` Johannes Thumshirn
2017-09-07 16:16 ` [PATCH V2 02/12] block: Fix declaration of blk-mq scheduler functions Damien Le Moal
2017-09-08  8:05   ` Johannes Thumshirn
2017-09-07 16:16 ` [PATCH V2 03/12] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h Damien Le Moal
2017-09-08  8:10   ` Johannes Thumshirn
2017-09-07 16:16 ` [PATCH V2 04/12] scsi: sd_zbc: Move zbc disk declarations to sd_zbc.h Damien Le Moal
2017-09-08  8:11   ` Johannes Thumshirn
2017-09-07 16:16 ` [PATCH V2 05/12] scsi: sd_zbc: Fix comments and indentation Damien Le Moal
2017-09-08  8:12   ` Johannes Thumshirn
2017-09-07 16:16 ` [PATCH V2 06/12] scsi: sd_zbc: Rearrange code Damien Le Moal
2017-09-08  8:13   ` Johannes Thumshirn
2017-09-07 16:16 ` [PATCH V2 07/12] scsi: sd_zbc.c: Use well defined macros Damien Le Moal
2017-09-08  8:15   ` Johannes Thumshirn
2017-09-07 16:16 ` [PATCH V2 08/12] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics() Damien Le Moal
2017-09-08  8:17   ` Johannes Thumshirn
2017-09-07 16:16 ` [PATCH V2 09/12] scsi: sd_zbc: Limit zone write locking to sequential zones Damien Le Moal
2017-09-08  8:39   ` Johannes Thumshirn
2017-09-08  9:48     ` Christoph Hellwig
2017-09-08  9:50       ` Johannes Thumshirn
2017-09-07 16:16 ` [PATCH V2 10/12] scsi: sd_zbc: Disable zone write locking with scsi-mq Damien Le Moal
2017-09-08 12:43   ` Ming Lei
2017-09-08 16:53     ` Damien Le Moal
2017-09-10  5:10       ` Ming Lei
2017-09-12  8:24         ` Damien Le Moal
2017-09-12  9:26           ` Ming Lei
2017-09-13  0:13             ` Damien Le Moal
2017-09-07 16:16 ` [PATCH V2 11/12] scsi: sd: Introduce scsi_disk_from_queue() Damien Le Moal
2017-09-10  5:16   ` Ming Lei
2017-09-12  8:05     ` Damien Le Moal
2017-09-07 16:16 ` [PATCH V2 12/12] scsi: Introduce ZBC disk I/O scheduler Damien Le Moal
2017-09-08  8:20 ` [PATCH V2 00/12] scsi-mq support for ZBC disks Christoph Hellwig
2017-09-08 16:12   ` Damien Le Moal
2017-09-11 12:24     ` Christoph Hellwig
2017-09-12  8:38       ` Damien Le Moal
2017-09-13 20:17         ` Christoph Hellwig
2017-09-14  0:04           ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox