* [PATCH 1/8] zbd: Fix partition block device handling
2019-02-20 7:17 [PATCH 0/8] ZBD support fixes Damien Le Moal
@ 2019-02-20 7:17 ` Damien Le Moal
2019-02-20 16:00 ` Bart Van Assche
2019-02-20 7:17 ` [PATCH 2/8] sg: Avoid READ CAPACITY failures Damien Le Moal
` (6 subsequent siblings)
7 siblings, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2019-02-20 7:17 UTC (permalink / raw)
To: fio, Jens Axboe
Cc: Bart Van Assche, Matias Bjorling, Shinichiro Kawasaki,
Dmitry Fomichev
From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
For fio to correctly handle the zonemode=zbd mode with partitions of
zoned block devices, the partition block device file must be identified
as a zoned disk. However, partition block device files do not have
a zoned sysfs file. This patch allows a correct identification of the
device file zone model by accessing the sysfs "zoned" file of the
holder disk for partition devices.
Change get_zbd_model() function to resolve the symbolic link to the
sysfs path to obtain the canonical sysfs path. The canonical sysfs
path of a partition device includes both of the holder device name and
the partition device name. If the given device is a partition device,
cut the partition device name in the canonical sysfs path to access
the "zoned" file in the holder device sysfs path.
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
zbd.c | 38 +++++++++++++++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)
diff --git a/zbd.c b/zbd.c
index 8acda1f6..9b603b97 100644
--- a/zbd.c
+++ b/zbd.c
@@ -228,12 +228,45 @@ static enum blk_zoned_model get_zbd_model(const char *file_name)
char *zoned_attr_path = NULL;
char *model_str = NULL;
struct stat statbuf;
+ char *sys_devno_path = NULL;
+ char *part_attr_path = NULL;
+ char *part_str = NULL;
+ char sys_path[PATH_MAX];
+ ssize_t sz;
+ char *delim = NULL;
if (stat(file_name, &statbuf) < 0)
goto out;
- if (asprintf(&zoned_attr_path, "/sys/dev/block/%d:%d/queue/zoned",
+
+ if (asprintf(&sys_devno_path, "/sys/dev/block/%d:%d",
major(statbuf.st_rdev), minor(statbuf.st_rdev)) < 0)
goto out;
+
+ sz = readlink(sys_devno_path, sys_path, sizeof(sys_path) - 1);
+ if (sz < 0)
+ goto out;
+ sys_path[sz] = '\0';
+
+ /*
+ * If the device is a partition device, cut the device name in the
+ * canonical sysfs path to obtain the sysfs path of the holder device.
+ * e.g.: /sys/devices/.../sda/sda1 -> /sys/devices/.../sda
+ */
+ if (asprintf(&part_attr_path, "/sys/dev/block/%s/partition",
+ sys_path) < 0)
+ goto out;
+ part_str = read_file(part_attr_path);
+ if (part_str && *part_str == '1') {
+ delim = strrchr(sys_path, '/');
+ if (!delim)
+ goto out;
+ *delim = '\0';
+ }
+
+ if (asprintf(&zoned_attr_path,
+ "/sys/dev/block/%s/queue/zoned", sys_path) < 0)
+ goto out;
+
model_str = read_file(zoned_attr_path);
if (!model_str)
goto out;
@@ -246,6 +279,9 @@ static enum blk_zoned_model get_zbd_model(const char *file_name)
out:
free(model_str);
free(zoned_attr_path);
+ free(part_str);
+ free(part_attr_path);
+ free(sys_devno_path);
return model;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH 1/8] zbd: Fix partition block device handling
2019-02-20 7:17 ` [PATCH 1/8] zbd: Fix partition block device handling Damien Le Moal
@ 2019-02-20 16:00 ` Bart Van Assche
2019-02-21 1:47 ` Damien Le Moal
0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2019-02-20 16:00 UTC (permalink / raw)
To: Damien Le Moal, fio, Jens Axboe
Cc: Matias Bjorling, Shinichiro Kawasaki, Dmitry Fomichev
On Wed, 2019-02-20 at 16:17 +0900, Damien Le Moal wrote:
> From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>
> For fio to correctly handle the zonemode=zbd mode with partitions of
> zoned block devices, the partition block device file must be identified
> as a zoned disk. However, partition block device files do not have
> a zoned sysfs file. This patch allows a correct identification of the
> device file zone model by accessing the sysfs "zoned" file of the
> holder disk for partition devices.
>
> Change get_zbd_model() function to resolve the symbolic link to the
> sysfs path to obtain the canonical sysfs path. The canonical sysfs
> path of a partition device includes both of the holder device name and
> the partition device name. If the given device is a partition device,
> cut the partition device name in the canonical sysfs path to access
> the "zoned" file in the holder device sysfs path.
Please explain what makes you think that it makes sense to create
partitions on a zoned block device. An application that wants to write
to a partition of a zoned block device needs access to the whole disk
anyway to get permission to submit the BLKREPORTZONE and BLKRESETZONE
ioctls. So why would anyone create partitions on a zoned block device?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/8] zbd: Fix partition block device handling
2019-02-20 16:00 ` Bart Van Assche
@ 2019-02-21 1:47 ` Damien Le Moal
0 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-02-21 1:47 UTC (permalink / raw)
To: fio@vger.kernel.org, bvanassche@acm.org, axboe@kernel.dk
Cc: Shinichiro Kawasaki, Dmitry Fomichev, osandov@osandov.com,
Matias Bjorling
Bart,
+Omar who has a similar question on blktests side.
On Wed, 2019-02-20 at 08:00 -0800, Bart Van Assche wrote:
> On Wed, 2019-02-20 at 16:17 +0900, Damien Le Moal wrote:
> > From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> >
> > For fio to correctly handle the zonemode=zbd mode with partitions of
> > zoned block devices, the partition block device file must be
> > identified
> > as a zoned disk. However, partition block device files do not have
> > a zoned sysfs file. This patch allows a correct identification of
> > the
> > device file zone model by accessing the sysfs "zoned" file of the
> > holder disk for partition devices.
> >
> > Change get_zbd_model() function to resolve the symbolic link to the
> > sysfs path to obtain the canonical sysfs path. The canonical sysfs
> > path of a partition device includes both of the holder device name
> > and
> > the partition device name. If the given device is a partition
> > device,
> > cut the partition device name in the canonical sysfs path to access
> > the "zoned" file in the holder device sysfs path.
>
> Please explain what makes you think that it makes sense to create
> partitions on a zoned block device. An application that wants to write
> to a partition of a zoned block device needs access to the whole disk
> anyway to get permission to submit the BLKREPORTZONE and BLKRESETZONE
> ioctls. So why would anyone create partitions on a zoned block device?
The BLKREPORTZONE and BLKRESETZONE ioctls do work as expected when
issued against a partition device. That is, the sector values involved
with these ioctls get remapped against the partition sector range with
an addition or substraction of the partition sector offset. For this to
work correctly, the kernel partition code only allows partitions on
zoned block devices if the partitions are zone aligned (please see the
use of the function part_zone_aligned() in the kernel file
block/partition-generic.c). So an application working with a partition
of a zoned block device does not need to issue these ioctls against the
entire disk.
Now for the use case. Indeed, for host-managed disks, this is not a very
compeling use case, nor is it commonly in use in the field as far as I
know. Chunking a host-managed disk into smaller drives with dm-linear is
likely a better option. There is one use case I have seen in the field
though where a partition was created over the conventional zones space
of a disk to create an essentially normal (i.e. randomly writable) disk
to put an ext4 file system on top for a metadata DB to manage data
stored on the remaining sequential zones of the disk. Such choice allows
to simplify the overall system design by enabling the use of proven
components (ext4, DB) rather than writing or porting everything to fit a
pure zoned block device model.
The main use case is for host-aware disks as these can be written
randomly anywhere and so are in essence "normal" disks, more so
considering the fact that these drives SCSI device type is "0x0",
equivalent to regular disk devices. As such, partitioning host-aware
disks in the same manner as regular disks generally are is a valid use
case.
Together with the blktests fixes we posted yesterday, the intent of this
fix is to allow for running tests against partitions to check the kernel
partition sector remapping for zoned block disks. Specific tests in
blktests are not needed, we only need the ability to specify a partition
device file in blktests config (TEST_DEVS variable). This is to improve
test coverage. We also run these tests with the same intent (sector
remapping verification) for dm-linear devices too.
Thank you for your review.
Best regards.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/8] sg: Avoid READ CAPACITY failures
2019-02-20 7:17 [PATCH 0/8] ZBD support fixes Damien Le Moal
2019-02-20 7:17 ` [PATCH 1/8] zbd: Fix partition block device handling Damien Le Moal
@ 2019-02-20 7:17 ` Damien Le Moal
2019-02-20 16:03 ` Bart Van Assche
2019-02-20 7:17 ` [PATCH 3/8] t/zbd: Fix handling of partition devices Damien Le Moal
` (5 subsequent siblings)
7 siblings, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2019-02-20 7:17 UTC (permalink / raw)
To: fio, Jens Axboe
Cc: Bart Van Assche, Matias Bjorling, Shinichiro Kawasaki,
Dmitry Fomichev
From: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Some SCSI devices (very large disks or SMR zoned disks in particular)
do not support the READ CAPACITY(10) command and only reply
successfully to the READ CAPACITY(16) command. This patch forces the
execution READ CAPACITY(16) if READ CAPACITY(10) fails with
CHECK CONDITION.
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
engines/sg.c | 45 ++++++++++++++++++++++++++++++---------------
1 file changed, 30 insertions(+), 15 deletions(-)
diff --git a/engines/sg.c b/engines/sg.c
index 3cc068f3..9105c24c 100644
--- a/engines/sg.c
+++ b/engines/sg.c
@@ -723,6 +723,8 @@ static int fio_sgio_read_capacity(struct thread_data *td, unsigned int *bs,
* io_u structures, which are not initialized until later.
*/
struct sg_io_hdr hdr;
+ unsigned long long hlba;
+ unsigned int blksz = 0;
unsigned char cmd[16];
unsigned char sb[64];
unsigned char buf[32]; // read capacity return
@@ -759,16 +761,21 @@ static int fio_sgio_read_capacity(struct thread_data *td, unsigned int *bs,
return ret;
}
- *bs = ((unsigned long) buf[4] << 24) | ((unsigned long) buf[5] << 16) |
- ((unsigned long) buf[6] << 8) | (unsigned long) buf[7];
- *max_lba = ((unsigned long) buf[0] << 24) | ((unsigned long) buf[1] << 16) |
- ((unsigned long) buf[2] << 8) | (unsigned long) buf[3];
+ if (hdr.info & SG_INFO_CHECK) {
+ /* RCAP(10) might be unsupported by device. Force RCAP(16) */
+ hlba = MAX_10B_LBA;
+ } else {
+ blksz = ((unsigned long) buf[4] << 24) | ((unsigned long) buf[5] << 16) |
+ ((unsigned long) buf[6] << 8) | (unsigned long) buf[7];
+ hlba = ((unsigned long) buf[0] << 24) | ((unsigned long) buf[1] << 16) |
+ ((unsigned long) buf[2] << 8) | (unsigned long) buf[3];
+ }
/*
* If max lba masked by MAX_10B_LBA equals MAX_10B_LBA,
* then need to retry with 16 byte Read Capacity command.
*/
- if (*max_lba == MAX_10B_LBA) {
+ if (hlba == MAX_10B_LBA) {
hdr.cmd_len = 16;
hdr.cmdp[0] = 0x9e; // service action
hdr.cmdp[1] = 0x10; // Read Capacity(16)
@@ -791,19 +798,27 @@ static int fio_sgio_read_capacity(struct thread_data *td, unsigned int *bs,
if (hdr.info & SG_INFO_CHECK)
td_verror(td, EIO, "fio_sgio_read_capacity");
- *bs = (buf[8] << 24) | (buf[9] << 16) | (buf[10] << 8) | buf[11];
- *max_lba = ((unsigned long long)buf[0] << 56) |
- ((unsigned long long)buf[1] << 48) |
- ((unsigned long long)buf[2] << 40) |
- ((unsigned long long)buf[3] << 32) |
- ((unsigned long long)buf[4] << 24) |
- ((unsigned long long)buf[5] << 16) |
- ((unsigned long long)buf[6] << 8) |
- (unsigned long long)buf[7];
+ blksz = (buf[8] << 24) | (buf[9] << 16) | (buf[10] << 8) | buf[11];
+ hlba = ((unsigned long long)buf[0] << 56) |
+ ((unsigned long long)buf[1] << 48) |
+ ((unsigned long long)buf[2] << 40) |
+ ((unsigned long long)buf[3] << 32) |
+ ((unsigned long long)buf[4] << 24) |
+ ((unsigned long long)buf[5] << 16) |
+ ((unsigned long long)buf[6] << 8) |
+ (unsigned long long)buf[7];
+ }
+
+ if (blksz) {
+ *bs = blksz;
+ *max_lba = hlba;
+ ret = 0;
+ } else {
+ ret = EIO;
}
close(fd);
- return 0;
+ return ret;
}
static void fio_sgio_cleanup(struct thread_data *td)
--
2.20.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH 2/8] sg: Avoid READ CAPACITY failures
2019-02-20 7:17 ` [PATCH 2/8] sg: Avoid READ CAPACITY failures Damien Le Moal
@ 2019-02-20 16:03 ` Bart Van Assche
0 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2019-02-20 16:03 UTC (permalink / raw)
To: Damien Le Moal, fio, Jens Axboe
Cc: Matias Bjorling, Shinichiro Kawasaki, Dmitry Fomichev
On Wed, 2019-02-20 at 16:17 +0900, Damien Le Moal wrote:
> From: Dmitry Fomichev <dmitry.fomichev@wdc.com>
>
> Some SCSI devices (very large disks or SMR zoned disks in particular)
> do not support the READ CAPACITY(10) command and only reply
> successfully to the READ CAPACITY(16) command. This patch forces the
> execution READ CAPACITY(16) if READ CAPACITY(10) fails with
> CHECK CONDITION.
>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
> engines/sg.c | 45 ++++++++++++++++++++++++++++++---------------
> 1 file changed, 30 insertions(+), 15 deletions(-)
>
> diff --git a/engines/sg.c b/engines/sg.c
> index 3cc068f3..9105c24c 100644
> --- a/engines/sg.c
> +++ b/engines/sg.c
> @@ -723,6 +723,8 @@ static int fio_sgio_read_capacity(struct thread_data *td, unsigned int *bs,
> * io_u structures, which are not initialized until later.
> */
> struct sg_io_hdr hdr;
> + unsigned long long hlba;
> + unsigned int blksz = 0;
> unsigned char cmd[16];
> unsigned char sb[64];
> unsigned char buf[32]; // read capacity return
> @@ -759,16 +761,21 @@ static int fio_sgio_read_capacity(struct thread_data *td, unsigned int *bs,
> return ret;
> }
>
> - *bs = ((unsigned long) buf[4] << 24) | ((unsigned long) buf[5] << 16) |
> - ((unsigned long) buf[6] << 8) | (unsigned long) buf[7];
> - *max_lba = ((unsigned long) buf[0] << 24) | ((unsigned long) buf[1] << 16) |
> - ((unsigned long) buf[2] << 8) | (unsigned long) buf[3];
> + if (hdr.info & SG_INFO_CHECK) {
> + /* RCAP(10) might be unsupported by device. Force RCAP(16) */
> + hlba = MAX_10B_LBA;
> + } else {
> + blksz = ((unsigned long) buf[4] << 24) | ((unsigned long) buf[5] << 16) |
> + ((unsigned long) buf[6] << 8) | (unsigned long) buf[7];
> + hlba = ((unsigned long) buf[0] << 24) | ((unsigned long) buf[1] << 16) |
> + ((unsigned long) buf[2] << 8) | (unsigned long) buf[3];
> + }
Please introduce get_unaligned_be32() and get_unaligned_be64() functions in fio
such that this code becomes easier to read and to maintain.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 3/8] t/zbd: Fix handling of partition devices
2019-02-20 7:17 [PATCH 0/8] ZBD support fixes Damien Le Moal
2019-02-20 7:17 ` [PATCH 1/8] zbd: Fix partition block device handling Damien Le Moal
2019-02-20 7:17 ` [PATCH 2/8] sg: Avoid READ CAPACITY failures Damien Le Moal
@ 2019-02-20 7:17 ` Damien Le Moal
2019-02-20 7:17 ` [PATCH 4/8] t/zbd: Fix test 2 and 3 result handling Damien Le Moal
` (4 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-02-20 7:17 UTC (permalink / raw)
To: fio, Jens Axboe
Cc: Bart Van Assche, Matias Bjorling, Shinichiro Kawasaki,
Dmitry Fomichev
From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
To allow t/zbd/tests-zbd-support test script to run correctly on
partitions of zoned block devices, fix access to the device properties
through sysfs by referencing the sysfs directory of the holder block
device. Doing so, the "zoned", "logical_block_size" and "mq" attributes
can be correctly accessed.
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
t/zbd/test-zbd-support | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
index 2d727910..d316d880 100755
--- a/t/zbd/test-zbd-support
+++ b/t/zbd/test-zbd-support
@@ -761,7 +761,15 @@ source "$(dirname "$0")/functions" || exit $?
dev=$1
realdev=$(readlink -f "$dev")
basename=$(basename "$realdev")
-disk_size=$(($(<"/sys/block/$basename/size")*512))
+major=$((0x$(stat -L -c '%t' "$realdev")))
+minor=$((0x$(stat -L -c '%T' "$realdev")))
+disk_size=$(($(<"/sys/dev/block/$major:$minor/size")*512))
+# When the target is a partition device, get basename of its holder device to
+# access sysfs path of the holder device
+if [[ -r "/sys/dev/block/$major:$minor/partition" ]]; then
+ realsysfs=$(readlink "/sys/dev/block/$major:$minor")
+ basename=$(basename "${realsysfs%/*}")
+fi
logical_block_size=$(<"/sys/block/$basename/queue/logical_block_size")
case "$(<"/sys/class/block/$basename/queue/zoned")" in
host-managed|host-aware)
--
2.20.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH 4/8] t/zbd: Fix test 2 and 3 result handling
2019-02-20 7:17 [PATCH 0/8] ZBD support fixes Damien Le Moal
` (2 preceding siblings ...)
2019-02-20 7:17 ` [PATCH 3/8] t/zbd: Fix handling of partition devices Damien Le Moal
@ 2019-02-20 7:17 ` Damien Le Moal
2019-02-20 16:06 ` Bart Van Assche
2019-02-20 7:17 ` [PATCH 5/8] t/zbd: Default to using blkzone tool Damien Le Moal
` (3 subsequent siblings)
7 siblings, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2019-02-20 7:17 UTC (permalink / raw)
To: fio, Jens Axboe
Cc: Bart Van Assche, Matias Bjorling, Shinichiro Kawasaki,
Dmitry Fomichev
Removal of the message "No I/O performed" when fio does not execute any
I/O broke zbd tests 2 and 3 as this message is looked after to test for
success. Fix this by looking for a "Run status" line starting with
"WRITE:" for test 2 and "READ:" for test 3. The run status lines are not
printed when no I/O is performed. Testing for the absence of these
strings thus allows to easily test if I/Os where executed or not.
Fixes: ff3aa922570c ("Kill "No I/O performed by ..." message")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
t/zbd/test-zbd-support | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
index d316d880..c6391169 100755
--- a/t/zbd/test-zbd-support
+++ b/t/zbd/test-zbd-support
@@ -141,9 +141,9 @@ test2() {
if [ -z "$is_zbd" ]; then
opts+=("--zonesize=${zone_size}")
fi
- run_fio "${opts[@]}" 2>&1 |
- tee -a "${logfile}.${test_number}" |
- grep -q 'No I/O performed'
+ run_fio "${opts[@]}" >> "${logfile}.${test_number}" 2>&1 || return $?
+ grep -q 'WRITE:' "${logfile}.${test_number}"
+ [ $? != 0 ]
}
# Run fio against an empty zone. This causes fio to report "No I/O performed".
@@ -160,12 +160,12 @@ test3() {
opts+=("--zonesize=${zone_size}")
fi
run_fio "${opts[@]}" >> "${logfile}.${test_number}" 2>&1 || return $?
- grep -q "No I/O performed" "${logfile}.${test_number}"
+ grep -q 'READ:' "${logfile}.${test_number}"
rc=$?
if [ -n "$is_zbd" ]; then
- [ $rc = 0 ]
- else
[ $rc != 0 ]
+ else
+ [ $rc = 0 ]
fi
}
--
2.20.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH 4/8] t/zbd: Fix test 2 and 3 result handling
2019-02-20 7:17 ` [PATCH 4/8] t/zbd: Fix test 2 and 3 result handling Damien Le Moal
@ 2019-02-20 16:06 ` Bart Van Assche
0 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2019-02-20 16:06 UTC (permalink / raw)
To: Damien Le Moal, fio, Jens Axboe
Cc: Matias Bjorling, Shinichiro Kawasaki, Dmitry Fomichev
On Wed, 2019-02-20 at 16:17 +0900, Damien Le Moal wrote:
> Removal of the message "No I/O performed" when fio does not execute any
> I/O broke zbd tests 2 and 3 as this message is looked after to test for
> success. Fix this by looking for a "Run status" line starting with
> "WRITE:" for test 2 and "READ:" for test 3. The run status lines are not
> printed when no I/O is performed. Testing for the absence of these
> strings thus allows to easily test if I/Os where executed or not.
>
> Fixes: ff3aa922570c ("Kill "No I/O performed by ..." message")
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
> t/zbd/test-zbd-support | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
> index d316d880..c6391169 100755
> --- a/t/zbd/test-zbd-support
> +++ b/t/zbd/test-zbd-support
> @@ -141,9 +141,9 @@ test2() {
> if [ -z "$is_zbd" ]; then
> opts+=("--zonesize=${zone_size}")
> fi
> - run_fio "${opts[@]}" 2>&1 |
> - tee -a "${logfile}.${test_number}" |
> - grep -q 'No I/O performed'
> + run_fio "${opts[@]}" >> "${logfile}.${test_number}" 2>&1 || return $?
> + grep -q 'WRITE:' "${logfile}.${test_number}"
> + [ $? != 0 ]
> }
Can the last two lines be changed into the following single line?
! grep -q 'WRITE:' "${logfile}.${test_number}"
Thanks,
Bart.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 5/8] t/zbd: Default to using blkzone tool
2019-02-20 7:17 [PATCH 0/8] ZBD support fixes Damien Le Moal
` (3 preceding siblings ...)
2019-02-20 7:17 ` [PATCH 4/8] t/zbd: Fix test 2 and 3 result handling Damien Le Moal
@ 2019-02-20 7:17 ` Damien Le Moal
2019-02-20 7:17 ` [PATCH 6/8] zbd: Fix zone locking for async I/O engines Damien Le Moal
` (2 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-02-20 7:17 UTC (permalink / raw)
To: fio, Jens Axboe
Cc: Bart Van Assche, Matias Bjorling, Shinichiro Kawasaki,
Dmitry Fomichev
From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
The test-zbd-support script fails to execute for partition devices with
the error message "Open /dev/sdX1 failed (No such file or directory)"
when libzbc tools are used by the script to open the specified
partition device. This is due to libzbc also opening a partition holder
block device file, which when closed causes a partition table
revalidation and the partition device files to be deleted and
recreated by udev through the RRPART ioctl.
To avoid the failure, default to using blkzone for zone report and
reset if supported by the system (util-linux v2.30 and higher) as this
tool does not open the older device and avoids the same problem.
To obtain the device maximum number of open zones, which is not
advertized by blkzone, use sg_inq for SCSI devices and use the default
maximum of 128 for other device types (i.e. null_blk devices in zone
mode).
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
t/zbd/functions | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/t/zbd/functions b/t/zbd/functions
index 173f0ca6..d49555a8 100644
--- a/t/zbd/functions
+++ b/t/zbd/functions
@@ -1,8 +1,7 @@
#!/bin/bash
-# To do: switch to blkzone once blkzone reset works correctly.
-blkzone=
-#blkzone=$(type -p blkzone 2>/dev/null)
+blkzone=$(type -p blkzone 2>/dev/null)
+sg_inq=$(type -p sg_inq 2>/dev/null)
zbc_report_zones=$(type -p zbc_report_zones 2>/dev/null)
zbc_reset_zone=$(type -p zbc_reset_zone 2>/dev/null)
if [ -z "${blkzone}" ] &&
@@ -34,9 +33,23 @@ first_sequential_zone() {
max_open_zones() {
local dev=$1
- if [ -n "${blkzone}" ]; then
- # To do: query the maximum number of open zones using sg_raw
- return 1
+ if [ -n "${sg_inq}" ]; then
+ if ! ${sg_inq} -e --page=0xB6 --len=20 --hex "$dev" 2> /dev/null; then
+ # Non scsi device such as null_blk can not return max open zones.
+ # Use default value.
+ echo 128
+ else
+ ${sg_inq} -e --page=0xB6 --len=20 --hex "$dev" | tail -1 |
+ {
+ read -r offset b0 b1 b2 b3 trailer || return $?
+ # Convert from hex to decimal
+ max_nr_open_zones=$((0x${b0}))
+ max_nr_open_zones=$((max_nr_open_zones * 256 + 0x${b1}))
+ max_nr_open_zones=$((max_nr_open_zones * 256 + 0x${b2}))
+ max_nr_open_zones=$((max_nr_open_zones * 256 + 0x${b3}))
+ echo ${max_nr_open_zones}
+ }
+ fi
else
${zbc_report_zones} "$dev" |
sed -n 's/^[[:blank:]]*Maximum number of open sequential write required zones:[[:blank:]]*//p'
--
2.20.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH 6/8] zbd: Fix zone locking for async I/O engines
2019-02-20 7:17 [PATCH 0/8] ZBD support fixes Damien Le Moal
` (4 preceding siblings ...)
2019-02-20 7:17 ` [PATCH 5/8] t/zbd: Default to using blkzone tool Damien Le Moal
@ 2019-02-20 7:17 ` Damien Le Moal
2019-02-20 7:17 ` [PATCH 7/8] zbd: Avoid async I/O multi-job workload deadlock Damien Le Moal
2019-02-20 7:17 ` [PATCH 8/8] t/zbd: Add multi-job libaio test Damien Le Moal
7 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-02-20 7:17 UTC (permalink / raw)
To: fio, Jens Axboe
Cc: Bart Van Assche, Matias Bjorling, Shinichiro Kawasaki,
Dmitry Fomichev
For a zoned block device with zonemode=zbd, the lock on the target zone
of an I/O is held from the time the I/O is prepared with
zbd_adjust_block() execution in fill_io_u() until the I/O is queued in
td_io_queue(). For a sync I/O engines, this means that the target zone
of an I/O operations is locked throughout the liftime of an I/O unit,
resulting in the serialization of write request preparation and
execution, as well as serialization of write operations and reset zone
operations for a zone, avoiding error inducing reordering.
However, in the case of an async I/O engine, the engine ->commit()
method falls outside of the zone lock serialization for all I/O units
that will be issued by the method execution. This results in potential
reordering of write requests during issuing, as well as simultaneous
queueing of write requests and zone reset operations resulting in
unaligned write errors.
For example, using a 1GB null_blk zoned device, the command:
fio --name=nullb0 --filename=/dev/nullb0 --direct=1 --zonemode=zbd
--bs=4k --rw=randwrite --ioengine=libaio --group_reporting=1
--iodepth=32 --numjobs=4
always fails due to unaligned write errors.
Fix this by refining the control over zone locking and unlocking.
Locking of an I/O target zone is unchanged and done in
zbd_adjust_block(), but the I/O callback function zbd_post_submit()
which updates a zone write pointer and unlocks the zone is split into
two different callbacks zbd_queue_io() and zbd_put_io().
zbd_queue_io() updates the zone write pointer for write operations and
unlocks the target zone only if the I/O operation was not queued or if
the I/O operation completed during the execution of the engine
->queue() method (e.g. a sync I/O engine is being used). The execution
of this I/O callback is done right after executing the I/O engine
->queue() method. The zbd_put_io() callback is used to unlock an I/O
target zone after completion of an I/O from within the put_io_u()
function.
To simplify the code the helper functions zbd_queue_io_u() and
zbd_put_io_u() which respectively call an io_u zbd_queue_io() and
zbd_put_io() callbacks are introduced. These helper functions are
conditionally defined only if CONFIG_LINUX_BLKZONED is set.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
io_u.c | 10 ++-----
io_u.h | 17 +++++++++---
ioengines.c | 6 ++---
zbd.c | 75 +++++++++++++++++++++++++++++++++++++++++------------
zbd.h | 22 ++++++++++++++++
5 files changed, 98 insertions(+), 32 deletions(-)
diff --git a/io_u.c b/io_u.c
index bee99c37..910b7deb 100644
--- a/io_u.c
+++ b/io_u.c
@@ -775,10 +775,7 @@ void put_io_u(struct thread_data *td, struct io_u *io_u)
{
const bool needs_lock = td_async_processing(td);
- if (io_u->post_submit) {
- io_u->post_submit(io_u, io_u->error == 0);
- io_u->post_submit = NULL;
- }
+ zbd_put_io_u(io_u);
if (td->parent)
td = td->parent;
@@ -1340,10 +1337,7 @@ static long set_io_u_file(struct thread_data *td, struct io_u *io_u)
if (!fill_io_u(td, io_u))
break;
- if (io_u->post_submit) {
- io_u->post_submit(io_u, false);
- io_u->post_submit = NULL;
- }
+ zbd_put_io_u(io_u);
put_file_log(td, f);
td_io_close_file(td, f);
diff --git a/io_u.h b/io_u.h
index 97270c94..e75993bd 100644
--- a/io_u.h
+++ b/io_u.h
@@ -92,11 +92,22 @@ struct io_u {
struct workqueue_work work;
};
+#ifdef CONFIG_LINUX_BLKZONED
/*
- * Post-submit callback. Used by the ZBD code. @success == true means
- * that the I/O operation has been queued or completed successfully.
+ * ZBD mode zbd_queue_io callback: called after engine->queue operation
+ * to advance a zone write pointer and eventually unlock the I/O zone.
+ * @q indicates the I/O queue status (busy, queued or completed).
+ * @success == true means that the I/O operation has been queued or
+ * completed successfully.
*/
- void (*post_submit)(const struct io_u *, bool success);
+ void (*zbd_queue_io)(struct io_u *, int q, bool success);
+
+ /*
+ * ZBD mode zbd_put_io callback: called in after completion of an I/O
+ * or commit of an async I/O to unlock the I/O target zone.
+ */
+ void (*zbd_put_io)(const struct io_u *);
+#endif
/*
* Callback for io completion
diff --git a/ioengines.c b/ioengines.c
index 45e769e6..7e5a50cc 100644
--- a/ioengines.c
+++ b/ioengines.c
@@ -329,10 +329,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
}
ret = td->io_ops->queue(td, io_u);
- if (ret != FIO_Q_BUSY && io_u->post_submit) {
- io_u->post_submit(io_u, io_u->error == 0);
- io_u->post_submit = NULL;
- }
+ zbd_queue_io_u(io_u, ret);
unlock_file(td, io_u->file);
@@ -374,6 +371,7 @@ enum fio_q_status td_io_queue(struct thread_data *td, struct io_u *io_u)
if (!td->io_ops->commit) {
io_u_mark_submit(td, 1);
io_u_mark_complete(td, 1);
+ zbd_put_io_u(io_u);
}
if (ret == FIO_Q_COMPLETED) {
diff --git a/zbd.c b/zbd.c
index 9b603b97..310b1732 100644
--- a/zbd.c
+++ b/zbd.c
@@ -1111,37 +1111,44 @@ zbd_find_zone(struct thread_data *td, struct io_u *io_u,
return NULL;
}
-
/**
- * zbd_post_submit - update the write pointer and unlock the zone lock
+ * zbd_queue_io - update the write pointer of a sequential zone
* @io_u: I/O unit
- * @success: Whether or not the I/O unit has been executed successfully
+ * @success: Whether or not the I/O unit has been queued successfully
+ * @q: queueing status (busy, completed or queued).
*
- * For write and trim operations, update the write pointer of all affected
- * zones.
+ * For write and trim operations, update the write pointer of the I/O unit
+ * target zone.
*/
-static void zbd_post_submit(const struct io_u *io_u, bool success)
+static void zbd_queue_io(struct io_u *io_u, int q, bool success)
{
- struct zoned_block_device_info *zbd_info;
+ const struct fio_file *f = io_u->file;
+ struct zoned_block_device_info *zbd_info = f->zbd_info;
struct fio_zone_info *z;
uint32_t zone_idx;
- uint64_t end, zone_end;
+ uint64_t zone_end;
- zbd_info = io_u->file->zbd_info;
if (!zbd_info)
return;
- zone_idx = zbd_zone_idx(io_u->file, io_u->offset);
- end = io_u->offset + io_u->buflen;
- z = &zbd_info->zone_info[zone_idx];
+ zone_idx = zbd_zone_idx(f, io_u->offset);
assert(zone_idx < zbd_info->nr_zones);
+ z = &zbd_info->zone_info[zone_idx];
+
if (z->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
return;
+
if (!success)
goto unlock;
+
+ dprint(FD_ZBD,
+ "%s: queued I/O (%lld, %llu) for zone %u\n",
+ f->file_name, io_u->offset, io_u->buflen, zone_idx);
+
switch (io_u->ddir) {
case DDIR_WRITE:
- zone_end = min(end, (z + 1)->start);
+ zone_end = min((uint64_t)(io_u->offset + io_u->buflen),
+ (z + 1)->start);
pthread_mutex_lock(&zbd_info->mutex);
/*
* z->wp > zone_end means that one or more I/O errors
@@ -1158,10 +1165,42 @@ static void zbd_post_submit(const struct io_u *io_u, bool success)
default:
break;
}
+
unlock:
- pthread_mutex_unlock(&z->mutex);
+ if (!success || q != FIO_Q_QUEUED) {
+ /* BUSY or COMPLETED: unlock the zone */
+ pthread_mutex_unlock(&z->mutex);
+ io_u->zbd_put_io = NULL;
+ }
+}
+
+/**
+ * zbd_put_io - Unlock an I/O unit target zone lock
+ * @io_u: I/O unit
+ */
+static void zbd_put_io(const struct io_u *io_u)
+{
+ const struct fio_file *f = io_u->file;
+ struct zoned_block_device_info *zbd_info = f->zbd_info;
+ struct fio_zone_info *z;
+ uint32_t zone_idx;
+
+ if (!zbd_info)
+ return;
- zbd_check_swd(io_u->file);
+ zone_idx = zbd_zone_idx(f, io_u->offset);
+ assert(zone_idx < zbd_info->nr_zones);
+ z = &zbd_info->zone_info[zone_idx];
+
+ if (z->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
+ return;
+
+ dprint(FD_ZBD,
+ "%s: terminate I/O (%lld, %llu) for zone %u\n",
+ f->file_name, io_u->offset, io_u->buflen, zone_idx);
+
+ assert(pthread_mutex_unlock(&z->mutex) == 0);
+ zbd_check_swd(f);
}
bool zbd_unaligned_write(int error_code)
@@ -1354,8 +1393,10 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
accept:
assert(zb);
assert(zb->cond != BLK_ZONE_COND_OFFLINE);
- assert(!io_u->post_submit);
- io_u->post_submit = zbd_post_submit;
+ assert(!io_u->zbd_queue_io);
+ assert(!io_u->zbd_put_io);
+ io_u->zbd_queue_io = zbd_queue_io;
+ io_u->zbd_put_io = zbd_put_io;
return io_u_accept;
eof:
diff --git a/zbd.h b/zbd.h
index 33e6d8bd..521283b2 100644
--- a/zbd.h
+++ b/zbd.h
@@ -96,6 +96,24 @@ void zbd_file_reset(struct thread_data *td, struct fio_file *f);
bool zbd_unaligned_write(int error_code);
enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u);
char *zbd_write_status(const struct thread_stat *ts);
+
+static inline void zbd_queue_io_u(struct io_u *io_u, enum fio_q_status status)
+{
+ if (io_u->zbd_queue_io) {
+ io_u->zbd_queue_io(io_u, status, io_u->error == 0);
+ io_u->zbd_queue_io = NULL;
+ }
+}
+
+static inline void zbd_put_io_u(struct io_u *io_u)
+{
+ if (io_u->zbd_put_io) {
+ io_u->zbd_put_io(io_u);
+ io_u->zbd_queue_io = NULL;
+ io_u->zbd_put_io = NULL;
+ }
+}
+
#else
static inline void zbd_free_zone_info(struct fio_file *f)
{
@@ -125,6 +143,10 @@ static inline char *zbd_write_status(const struct thread_stat *ts)
{
return NULL;
}
+
+static inline void zbd_queue_io_u(struct io_u *io_u,
+ enum fio_q_status status) {}
+static inline void zbd_put_io_u(struct io_u *io_u) {}
#endif
#endif /* FIO_ZBD_H */
--
2.20.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH 7/8] zbd: Avoid async I/O multi-job workload deadlock
2019-02-20 7:17 [PATCH 0/8] ZBD support fixes Damien Le Moal
` (5 preceding siblings ...)
2019-02-20 7:17 ` [PATCH 6/8] zbd: Fix zone locking for async I/O engines Damien Le Moal
@ 2019-02-20 7:17 ` Damien Le Moal
2019-02-20 7:17 ` [PATCH 8/8] t/zbd: Add multi-job libaio test Damien Le Moal
7 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-02-20 7:17 UTC (permalink / raw)
To: fio, Jens Axboe
Cc: Bart Van Assche, Matias Bjorling, Shinichiro Kawasaki,
Dmitry Fomichev
With zonemode=zbd, for a multi-job workload using asynchronous I/O
engines with a deep I/O queue depth setting, a job that is building a
batch of asynchronous I/Os to submit may end up waiting for an I/O
target zone lock held by another job that is also preparing a batch.
For small devices with few zones and/or a large number of jobs, such
prepare phase zone lock contention can be frequent enough to end up in a
situation where all jobs are waiting for zone locks held by other jobs
and no I/O being executed (so no zone being unlocked).
Avoid this situation by using pthread_mutex_trylock() instead of
pthread_mutex_lock() and by calling io_u_quiesce() to execute queued
I/O units if locking fails. pthread_mutex_lock() is then called to
lock the desired target zone. The execution of io_u_quiesce() forces
I/O execution progress and so zones to be unlocked, avoiding job
deadlock.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
zbd.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/zbd.c b/zbd.c
index 310b1732..2da742b7 100644
--- a/zbd.c
+++ b/zbd.c
@@ -1255,7 +1255,21 @@ enum io_u_action zbd_adjust_block(struct thread_data *td, struct io_u *io_u)
zbd_check_swd(f);
- pthread_mutex_lock(&zb->mutex);
+ /*
+ * Lock the io_u target zone. The zone will be unlocked if io_u offset
+ * is changed or when io_u completes and zbd_put_io() executed.
+ * To avoid multiple jobs doing asynchronous I/Os from deadlocking each
+ * other waiting for zone locks when building an io_u batch, first
+ * only trylock the zone. If the zone is already locked by another job,
+ * process the currently queued I/Os so that I/O progress is made and
+ * zones unlocked.
+ */
+ if (pthread_mutex_trylock(&zb->mutex) != 0) {
+ if (!td_ioengine_flagged(td, FIO_SYNCIO))
+ io_u_quiesce(td);
+ pthread_mutex_lock(&zb->mutex);
+ }
+
switch (io_u->ddir) {
case DDIR_READ:
if (td->runstate == TD_VERIFYING) {
--
2.20.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH 8/8] t/zbd: Add multi-job libaio test
2019-02-20 7:17 [PATCH 0/8] ZBD support fixes Damien Le Moal
` (6 preceding siblings ...)
2019-02-20 7:17 ` [PATCH 7/8] zbd: Avoid async I/O multi-job workload deadlock Damien Le Moal
@ 2019-02-20 7:17 ` Damien Le Moal
7 siblings, 0 replies; 13+ messages in thread
From: Damien Le Moal @ 2019-02-20 7:17 UTC (permalink / raw)
To: fio, Jens Axboe
Cc: Bart Van Assche, Matias Bjorling, Shinichiro Kawasaki,
Dmitry Fomichev
Introduce test case 46 to verify that write ordering is correct and that
no job deadlock occurs in the case of a multi job run with an
asynchronous I/O engine (libaio).
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
t/zbd/test-zbd-support | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
index c6391169..2a470397 100755
--- a/t/zbd/test-zbd-support
+++ b/t/zbd/test-zbd-support
@@ -731,6 +731,17 @@ test45() {
grep -q "fio: first I/O failed. If .* is a zoned block device, consider --zonemode=zbd"
}
+# Random write to sequential zones, libaio, 8 jobs, queue depth 64 per job
+test46() {
+ local size
+
+ size=$((4 * zone_size))
+ run_fio_on_seq --ioengine=libaio --iodepth=64 --rw=randwrite --bs=4K \
+ --group_reporting=1 --numjobs=8 \
+ >> "${logfile}.${test_number}" 2>&1 || return $?
+ check_written $((size * 8)) || return $?
+}
+
tests=()
dynamic_analyzer=()
reset_all_zones=
@@ -802,7 +813,7 @@ case "$(<"/sys/class/block/$basename/queue/zoned")" in
esac
if [ "${#tests[@]}" = 0 ]; then
- for ((i=1;i<=45;i++)); do
+ for ((i=1;i<=46;i++)); do
tests+=("$i")
done
fi
--
2.20.1
^ permalink raw reply related [flat|nested] 13+ messages in thread