* ZBC/FLEX FIO addition ideas
@ 2018-03-06 18:59 Phillip Chen
  2018-03-15 16:15 ` Kris Davis
  0 siblings, 1 reply; 12+ messages in thread
From: Phillip Chen @ 2018-03-06 18:59 UTC (permalink / raw)
  To: fio

Hello again,

We are interested in making changes to the Seagate branch of FIO (ideally changes that would be widely picked up) that would allow FIO to run well on Host Managed/Zoned Block Command (ZBC) drives. Host Managed drives have some simple restrictions which make running workloads on them very challenging. The main hurdles to overcome are that all writes must land at the target zone's write pointer, that all reads must fall below the zone's write pointer, and that reads or writes may span more than one zone.

The largest difficulty is getting random performance data. To do this we will need to add the capability to write randomly between zones (but sequentially within zones) and to read only what has been written. There seem to be two approaches that could work for this, and I'd appreciate any advice as to which approach to pursue.

Approach 1: We could let FIO pick I/O offsets as it usually does, but change those offsets into legal ones before they reach the drive.

Approach 2: We create a random Host Managed specific workload which is aware of where it is legal to read and write and only picks valid offsets. Additionally, we would later make some small changes to the existing sequential workloads to prevent I/O across zone boundaries and to do some preparation (writing over zones we plan to read and resetting the write pointers of zones we want to write over). I am leaning towards this approach because it makes more sense to me.

Either way, we will need to implement some way of keeping track of where it is legal to read and write, as well as send some new commands such as get zone information and reset write pointer(s). Again, it seems like there are two viable approaches.
Approach 1: We can create a new I/O engine which handles all the I/O as well as any additional ZBC-specific commands such as get zone information and reset write pointer. However, we would like to avoid this option because our customers want to use the I/O engine they are comfortable with.

Approach 2: We create a layer above the I/O engine which watches all of the I/O, keeps track of the write pointers on the drive, and issues any additional ZBC commands as necessary.

Thank you,
Phillip Chen

^ permalink raw reply	[flat|nested] 12+ messages in thread
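The offset-legalization idea in Approach 1 (take whatever offset fio picked and rewrite it into a legal one before it reaches the drive) can be sketched roughly as follows. This is only an illustration: the type, function names, and the fixed zone size are invented for the example, and it is not fio code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-zone state tracked by the legalization layer. */
struct zone_state {
	uint64_t start;	/* zone start offset, bytes */
	uint64_t wp;	/* current write pointer, bytes */
};

#define ZONE_SIZE (256ULL * 1024 * 1024)	/* assumed 256 MiB zones */

/*
 * Rewrite an arbitrary offset into a legal one for this zone.
 * Writes are forced onto the zone's write pointer; reads are clamped
 * below it. Returns false when no legal I/O of this length exists.
 */
static bool legalize_offset(const struct zone_state *z, bool is_write,
			    uint64_t *offset, uint64_t len)
{
	if (is_write) {
		*offset = z->wp;	/* writes must land on the wp */
		return z->wp + len <= z->start + ZONE_SIZE;
	}
	if (z->wp - z->start < len)
		return false;		/* zone holds too little data */
	if (*offset + len > z->wp)
		*offset = z->wp - len;	/* clamp the read below the wp */
	return *offset >= z->start;
}
```

With this shape, the rest of fio's offset generation stays untouched; only the final offset is rewritten, which is essentially what the patch later in this thread does in zbc_adjust_block().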
* RE: ZBC/FLEX FIO addition ideas
  2018-03-06 18:59 ZBC/FLEX FIO addition ideas Phillip Chen
@ 2018-03-15 16:15 ` Kris Davis
  2018-03-15 16:30   ` Sitsofe Wheeler
  0 siblings, 1 reply; 12+ messages in thread
From: Kris Davis @ 2018-03-15 16:15 UTC (permalink / raw)
  To: Phillip Chen, fio@vger.kernel.org; +Cc: Jason Jorgensen

Despite the desire to have fio work with Host Managed devices independent of the engine, a Host Managed device operates differently from a traditional block device, and the kernel also has to manage it differently. Thus, I would still recommend creating a new IO engine for use with Host Managed devices; we would not want the additional overhead associated with SMR to impact the standard aio engine. We have mostly used the fio SG engine along with external operations in testing of SMR.

Here is the way we envision a new fio engine might work with Host Managed devices:

The SMR device supports report zone operations. The fio engine performs report zones for all zones on the drive prior to testing, and loads the following items into a list or dictionary for each zone:
1. Zone start LBA
2. Type of zone
   a. Random allowed (conventional HDD format)
   b. Sequential only (SMR format)
   c. Sequential preferred (for device managed products)
3. Zone condition:
   a. Empty
   b. Full
   c. Opened
4. Length of zone (usually the same for all zones, but we would need this information for completeness)
5. Current write pointer

Option for random reads: since the read LBA and length must be less than the current zone write pointer, and cannot cross zone boundaries: randomly choose from open or full zones, then choose a random LBA prior to the write pointer within the zone (while accounting for operation size and zone boundary):
1. If the zone type is random allowed (reported with a write pointer of 0xffffffffffff), then the read is valid as long as it does not cross the zone boundary
2. Sequential only zones
   a. Full zones could be read without restriction as long as the read does not cross zone boundaries
   b. Opened zones would need to verify the following:
      i. Read LBA + length of command is less than the write pointer for the zone
      ii. The read does not cross the zone boundary

Option for random writes: on a write command, allow the engine to create a write command to a random zone, then choose a somewhat random LBA based on zone type and current write pointer. Check zone type:
1. If random allowed, perform the write to the random LBA as long as the length of the command does not cross the zone boundary
2. If sequential, check the LBA against the write pointer and change the LBA to the write pointer location within the same zone
   a. If the length of the write crosses the zone boundary, reset the write pointer of the zone and perform the write at the new write pointer
   b. Update the dictionary with the new write pointer

We may want options for how/when the write pointer is reset after a zone is full: by default, we expect the write pointer to be reset only when a new write is created to a full zone. We're unsure if anyone might wish for other handling. We also need an option for implicit (drive handles) versus explicit (host software handles) open/close of zones.

Kris Davis

-----Original Message-----
From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On Behalf Of Phillip Chen
Sent: Tuesday, March 6, 2018 12:59 PM
To: fio@vger.kernel.org
Subject: ZBC/FLEX FIO addition ideas

[original message quoted in full; snipped]

^ permalink raw reply	[flat|nested] 12+ messages in thread
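Kris's per-zone bookkeeping and random-read selection rules can be sketched as below. This is an illustrative C sketch with invented names; it is neither fio code nor the proposed engine, and the zone record simply mirrors the five fields listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative zone record mirroring the fields in the list above. */
enum zone_type { ZT_RANDOM, ZT_SEQ_ONLY, ZT_SEQ_PREF };
enum zone_cond { ZC_EMPTY, ZC_OPEN, ZC_FULL };

struct zone_rec {
	uint64_t start_lba;	/* 1. zone start LBA */
	uint64_t len;		/* 4. zone length in LBAs */
	uint64_t wp;		/* 5. current write pointer (LBA) */
	enum zone_type type;	/* 2. zone type */
	enum zone_cond cond;	/* 3. zone condition */
};

/*
 * Pick a random read LBA that stays below the write pointer and inside
 * the zone. Random-allowed zones are readable across their whole span;
 * sequential zones only up to the write pointer. Returns false when the
 * zone holds no readable run of io_len blocks.
 */
static bool pick_read_lba(const struct zone_rec *z, uint64_t io_len,
			  uint64_t *lba)
{
	uint64_t limit = (z->type == ZT_RANDOM) ? z->start_lba + z->len
						: z->wp;

	if (z->cond == ZC_EMPTY || limit - z->start_lba < io_len)
		return false;
	/* uniform pick in [start_lba, limit - io_len] */
	*lba = z->start_lba +
	       (uint64_t)rand() % (limit - z->start_lba - io_len + 1);
	return true;
}
```

The write-side rules (jump to the write pointer, reset a full zone) would sit next to this in the same table, which is also where the implicit-vs-explicit open/close option mentioned above would hook in.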
* Re: ZBC/FLEX FIO addition ideas
  2018-03-15 16:15 ` Kris Davis
@ 2018-03-15 16:30   ` Sitsofe Wheeler
  2018-03-15 16:43     ` Bart Van Assche
  0 siblings, 1 reply; 12+ messages in thread
From: Sitsofe Wheeler @ 2018-03-15 16:30 UTC (permalink / raw)
  To: Kris Davis; +Cc: Phillip Chen, fio@vger.kernel.org, Jason Jorgensen

On 15 March 2018 at 16:15, Kris Davis <Kris.Davis@wdc.com> wrote:
> Despite the desire to have fio work with Host Managed devices independent of the engine, a Host Managed device operation is different than a traditional block device, and the kernel also has to manage it differently. Thus, I would still recommend creating a new IO engine for use with Host Managed devices, we would not want the additional overhead associated with SMR to impact the standard aio engine. We have mostly used the fio SG engine along with external operations in testing of SMR.
>
> Here is the way we envision a new fio engine might work with Host Managed devices:

It's definitely going to need something special. I think last time round
(https://www.spinics.net/lists/fio/msg06646.html) I suggested a profile but
perhaps that won't be enough. I doubt an ioengine would be enough because
you're going to have to interact with the next offset code etc. unless
you're going to fake I/Os done to "wrong" regions...

--
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: ZBC/FLEX FIO addition ideas
  2018-03-15 16:30   ` Sitsofe Wheeler
@ 2018-03-15 16:43     ` Bart Van Assche
  2018-03-15 18:06       ` Phillip Chen
  2018-03-17  7:55       ` Sitsofe Wheeler
  0 siblings, 2 replies; 12+ messages in thread
From: Bart Van Assche @ 2018-03-15 16:43 UTC (permalink / raw)
  To: sitsofe@gmail.com, Kris Davis
  Cc: fio@vger.kernel.org, Jason Jorgensen, phillip.a.chen@seagate.com

On Thu, 2018-03-15 at 16:30 +0000, Sitsofe Wheeler wrote:
> On 15 March 2018 at 16:15, Kris Davis <Kris.Davis@wdc.com> wrote:
> > Despite the desire to have fio work with Host Managed devices independent of the engine, a Host Managed device operation is different than a traditional block device, and the kernel also has to
> > manage it differently. Thus, I would still recommend creating a new IO engine for use with Host Managed devices, we would not want the additional overhead associated with SMR to impact the
> > standard aio engine. We have mostly used the fio SG engine along with external operations in testing of SMR.
> >
> > Here is the way we envision a new fio engine might work with Host Managed devices:
>
> It's definitely going to need something special. I think last time
> round (https://www.spinics.net/lists/fio/msg06646.html) I suggested a
> profile but perhaps that won't be enough. I doubt an ioengine would be
> enough because you're going to have to interact with the next offset code
> etc. unless you're going to fake I/Os done to "wrong" regions...

Hello Sitsofe,

Adding support for ZBC drives as a profile has a significant disadvantage, namely that the different I/O patterns (sequential read, sequential write, random read, random write, ...) all have to be reimplemented. That's why I'm considering adding ZBC support by modifying what get_next_block() produces. Can you have a look at the (barely tested) patch below?

Thanks,

Bart.
diff --git a/Makefile b/Makefile
index 8f4871c63528..44dd7f3439f6 100644
--- a/Makefile
+++ b/Makefile
@@ -145,6 +145,9 @@ endif
 ifdef CONFIG_LIBPMEM
   SOURCE += engines/libpmem.c
 endif
+ifdef HAVE_LINUX_BLKZONED_H
+  SOURCE += zbc.c
+endif

 ifeq ($(CONFIG_TARGET_OS), Linux)
   SOURCE += diskutil.c fifo.c blktrace.c cgroup.c trim.c engines/sg.c \
diff --git a/configure b/configure
index 4442a1cc0d75..688e86201320 100755
--- a/configure
+++ b/configure
@@ -2104,6 +2104,27 @@ if compile_prog "" "" "valgrind_dev"; then
 fi
 print_config "Valgrind headers" "$valgrind_dev"

+##########################################
+# <linux/blkzoned.h> probe
+if test "$linux_blkzoned_h" != "yes" ; then
+  linux_blkzoned_h="no"
+fi
+cat > $TMPC << EOF
+#include <linux/blkzoned.h>
+int main(int argc, char **argv)
+{
+  return 0;
+}
+EOF
+if compile_prog "" "" "linux_blkzoned_h"; then
+  linux_blkzoned_h="yes"
+fi
+print_config "<linux/blkzoned.h>" "$linux_blkzoned_h"
+if test "$linux_blkzoned_h" = "yes" ; then
+  output_sym "HAVE_LINUX_BLKZONED_H"
+fi
+
+##########################################
 # check march=armv8-a+crc+crypto
 if test "$march_armv8_a_crc_crypto" != "yes" ; then
   march_armv8_a_crc_crypto="no"
diff --git a/debug.h b/debug.h
index b8718ddc225f..c69c8079beda 100644
--- a/debug.h
+++ b/debug.h
@@ -24,6 +24,7 @@ enum {
 	FD_COMPRESS,
 	FD_STEADYSTATE,
 	FD_HELPERTHREAD,
+	FD_ZBC,
 	FD_DEBUG_MAX,
 };

diff --git a/file.h b/file.h
index 8fd34b136c23..42b304629824 100644
--- a/file.h
+++ b/file.h
@@ -10,6 +10,9 @@
 #include "lib/lfsr.h"
 #include "lib/gauss.h"

+/* Forward declarations */
+struct zoned_block_device_info;
+
 /*
  * The type of object we are working on
  */
@@ -97,6 +100,11 @@ struct fio_file {
 	uint64_t file_offset;
 	uint64_t io_size;

+	/*
+	 * Zoned device information
+	 */
+	struct zoned_block_device_info *zbd_info;
+
 	/*
 	 * Track last end and last start of IO for a given data direction
 	 */
diff --git a/filesetup.c b/filesetup.c
index 7cbce1327f8f..d981c61f5b7f 100644
--- a/filesetup.c
+++ b/filesetup.c
@@ -16,6 +16,7 @@
 #include "hash.h"
 #include "lib/axmap.h"
 #include "rwlock.h"
+#include "zbc.h"

 #ifdef CONFIG_LINUX_FALLOCATE
 #include <linux/falloc.h>
@@ -773,6 +774,9 @@ static int get_file_sizes(struct thread_data *td)
 		 */
 		if (f->real_file_size == -1ULL && td->o.size)
 			f->real_file_size = td->o.size / td->o.nr_files;
+
+		if (f->filetype == FIO_TYPE_BLOCK)
+			zbc_init_zone_info(f);
 	}

 	return err;
@@ -1165,7 +1169,9 @@ done:
 		td->done = 1;

 	td_restore_runstate(td, old_state);
-	return 0;
+
+	return zbc_verify_options();
+
 err_offset:
 	log_err("%s: you need to specify valid offset=\n", o->name);
 err_out:
diff --git a/init.c b/init.c
index e47e5384119b..3ea5ea57e3a1 100644
--- a/init.c
+++ b/init.c
@@ -2266,6 +2266,10 @@ const struct debug_level debug_levels[] = {
 	  .help = "Helper thread logging",
 	  .shift = FD_HELPERTHREAD,
 	},
+	{ .name = "zbc",
+	  .help = "Zoned Block Device logging",
+	  .shift = FD_ZBC,
+	},
 	{ .name = NULL, },
 };

diff --git a/io_u.c b/io_u.c
index 01b36938d1b5..ba4dbb9d8ecc 100644
--- a/io_u.c
+++ b/io_u.c
@@ -14,6 +14,7 @@
 #include "err.h"
 #include "lib/pow2.h"
 #include "minmax.h"
+#include "zbc.h"

 struct io_completion_data {
 	int nr;				/* input */
@@ -558,6 +559,9 @@ static int get_next_offset(struct thread_data *td, struct io_u *io_u,
 	if (get_next_block(td, io_u, ddir, rw_seq_hit, is_random))
 		return 1;

+	if (zbc_adjust_block(td, io_u))
+		return 1;
+
 	if (io_u->offset >= f->io_size) {
 		dprint(FD_IO, "get_next_offset: offset %llu >= io_size %llu\n",
 			(unsigned long long) io_u->offset,
diff --git a/ioengines.c b/ioengines.c
index 965581aa4157..a04a977cca9f 100644
--- a/ioengines.c
+++ b/ioengines.c
@@ -19,6 +19,7 @@

 #include "fio.h"
 #include "diskutil.h"
+#include "zbc.h"

 static FLIST_HEAD(engine_list);

@@ -320,6 +321,8 @@ int td_io_queue(struct thread_data *td, struct io_u *io_u)
 	}

 	ret = td->io_ops->queue(td, io_u);
+	if (ret < FIO_Q_BUSY)
+		zbc_update_wp(td, io_u);

 	unlock_file(td, io_u->file);
diff --git a/zbc.c b/zbc.c
new file mode 100644
index 000000000000..57e1981ee35d
--- /dev/null
+++ b/zbc.c
@@ -0,0 +1,604 @@
+/*
+ * Copyright (C) 2018 Western Digital Corporation or its affiliates.
+ *
+ * This file is released under the GPL.
+ */
+
+#include <errno.h>
+#include <string.h>
+#include <stdlib.h>
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+#include <linux/blkzoned.h>
+#include "file.h"
+#include "fio.h"
+#include "log.h"
+#include "zbc.h"
+
+/* Return the name of the first entry in a directory */
+static char *get_first_dirent(const char *dir_path)
+{
+	char *res = NULL;
+	struct dirent *e;
+	DIR *d;
+
+	d = opendir(dir_path);
+	if (!d)
+		return NULL;
+	while ((e = readdir(d))) {
+		/* Skip "." and ".." */
+		if (e->d_name[0] == '.')
+			continue;
+		res = strdup(e->d_name);
+		break;
+	}
+	closedir(d);
+
+	return res;
+}
+
+/*
+ * Convert a block device name into a SCSI device path, e.g. /dev/sdc into
+ * /sys/class/scsi_device/0:0:0:0.
+ */
+static int bdev_to_scsi(char **scsi_id, const char *bdev)
+{
+	char *bdev_path = NULL, *bdev2 = NULL;
+	struct dirent *e;
+	bool matches;
+	DIR *d;
+	int res = 0;
+
+	if (strncmp(bdev, "/dev/", 5) != 0)
+		return -EINVAL;
+	d = opendir("/sys/class/scsi_device");
+	if (!d)
+		return -ENOMEM;
+	while ((e = readdir(d))) {
+		/* Skip "." and ".." */
+		if (e->d_name[0] == '.')
+			continue;
+		free(bdev_path);
+		bdev_path = NULL;
+		res = -ENOMEM;
+		if (asprintf(&bdev_path,
+			     "/sys/class/scsi_device/%s/device/block",
+			     e->d_name) < 0)
+			break;
+		bdev2 = get_first_dirent(bdev_path);
+		matches = bdev2 && strcmp(bdev2, bdev + 5) == 0;
+		free(bdev2);
+		if (matches) {
+			*scsi_id = strdup(e->d_name);
+			res = 0;
+			break;
+		}
+		res = -ENOENT;
+	}
+	closedir(d);
+
+	free(bdev_path);
+
+	return res;
+}
+
+/*
+ * Get the SCSI device type from VPD page 0x80. That device type is called the
+ * "peripheral device type" in the SCSI SPC-5 standard. Returns -ENXIO if and
+ * only if the device is not a SCSI device.
+ */
+static int get_scsi_device_type(const char *bdev)
+{
+	char *scsi_id = NULL, *vpd_pg80_path = NULL;
+	uint8_t vpd_pg80[8];
+	int vpd_fd;
+	int ret;
+
+	ret = bdev_to_scsi(&scsi_id, bdev);
+	if (ret < 0 && ret != -ENOENT)
+		return ret;
+	if (scsi_id == NULL)
+		return -ENXIO;
+	dprint(FD_ZBC, "Block device %s has SCSI ID %s\n", bdev, scsi_id);
+	ret = -ENOMEM;
+	if (asprintf(&vpd_pg80_path,
+		     "/sys/class/scsi_device/%s/device/vpd_pg80", scsi_id) < 0)
+		goto out;
+
+	vpd_fd = open(vpd_pg80_path, O_RDONLY);
+	if (vpd_fd < 0)
+		goto out;
+	ret = read(vpd_fd, vpd_pg80, sizeof(vpd_pg80));
+	close(vpd_fd);
+	if (ret < sizeof(vpd_pg80))
+		goto out;
+
+	ret = vpd_pg80[0] & 0x1f;
+
+out:
+	free(vpd_pg80_path);
+	free(scsi_id);
+	return ret;
+}
+
+/**
+ * zbc_reset_zone - reset the write pointer of one or more zones
+ * @f: FIO file associated with the disk for which to reset write pointers
+ * @sector: First sector for which to reset the write pointer in units of 512
+ *	bytes.
+ * @nr_sectors: Number of sectors to reset the write pointer of.
+ */
+static int zbc_reset_zone(const struct fio_file *f, uint64_t sector,
+			  uint64_t nr_sectors)
+{
+	struct blk_zone_range zr = {
+		.sector = sector,
+		.nr_sectors = nr_sectors
+	};
+	int ret;
+
+	ret = ioctl(f->fd, BLKRESETZONE, &zr);
+	if (ret < 0)
+		log_err("%s: resetting wp for %lu sectors at sector %lu failed (%d).\n",
+			f->file_name, nr_sectors, sector, errno);
+	return ret;
+}
+
+/*
+ * Read zone information into @buf starting from sector @start_sector.
+ * @fd is a file descriptor that refers to a block device and @bufsz is the
+ * size of @buf.
+ */
+static int read_zone_info(int fd, uint64_t start_sector,
+			  void *buf, unsigned int bufsz)
+{
+	struct blk_zone_report *hdr = buf;
+
+	if (bufsz < sizeof(*hdr))
+		return -EINVAL;
+
+	memset(hdr, 0, sizeof(*hdr));
+
+	hdr->nr_zones = (bufsz - sizeof(*hdr)) / sizeof(struct blk_zone);
+	hdr->sector = start_sector;
+	return ioctl(fd, BLKREPORTZONE, hdr);
+}
+
+/*
+ * Initialize f->zbd_info.
+ */
+int zbc_init_zone_info(struct fio_file *f)
+{
+	const unsigned int bufsz = sizeof(struct blk_zone_report) +
+		32768 * sizeof(struct blk_zone);
+	unsigned int nr_zones;
+	struct blk_zone_report *hdr;
+	const struct blk_zone *z;
+	struct fio_zone_info *p;
+	uint64_t zone_size, start_sector;
+	struct zoned_block_device_info *zbd_info = NULL;
+	void *buf;
+	int fd, i, j, ret = -ENOMEM;
+
+	ret = get_scsi_device_type(f->file_name);
+	if (ret == -ENXIO) {
+		dprint(FD_ZBC, "%s: not a SCSI device\n", f->file_name);
+		ret = 0;
+		goto out;
+	}
+	if (ret < 0)
+		log_info("fio: unable to determine device type for %s.\n",
+			 f->file_name);
+
+	dprint(FD_ZBC, "Block device %s has SCSI device type %#x\n",
+	       f->file_name, ret);
+
+	if (ret != 0x14 /* ZBC */) {
+		ret = 0;
+		goto out;
+	}
+
+	dprint(FD_ZBC, "Reading zone information for device %s\n",
+	       f->file_name);
+
+	buf = malloc(bufsz);
+	if (!buf)
+		goto out;
+
+	fd = open(f->file_name, O_RDONLY | O_LARGEFILE);
+	if (fd < 0) {
+		ret = -errno;
+		goto free;
+	}
+
+	ret = read_zone_info(fd, 0, buf, bufsz);
+	if (ret < 0) {
+		log_info("fio: BLKREPORTZONE(%lu) failed for %s (%d).\n",
+			 0UL, f->file_name, errno);
+		goto close;
+	}
+	hdr = buf;
+	if (hdr->nr_zones < 1) {
+		log_info("fio: %s has invalid zone information.\n",
+			 f->file_name);
+		goto close;
+	}
+	z = (void *)(hdr + 1);
+	zone_size = z->len;
+	nr_zones = (f->real_file_size >> 9) / zone_size;
+
+	dprint(FD_ZBC, "Device %s has %d zones of size %lu\n", f->file_name,
+	       nr_zones, zone_size);
+
+	zbd_info = calloc(1, sizeof(*zbd_info) +
+			  (nr_zones + 1) * sizeof(zbd_info->zone_info[0]));
+	ret = -ENOMEM;
+	if (!zbd_info)
+		goto close;
+	p = &zbd_info->zone_info[0];
+	for (start_sector = 0, j = 0; j < nr_zones;) {
+		z = (void *)(hdr + 1);
+		for (i = 0; i < hdr->nr_zones; i++, j++, z++, p++) {
+			p->start = z->start;
+			p->wp = z->wp;
+			p->type = z->type;
+			if (j > 0 && p->start != p[-1].start + zone_size) {
+				log_info("%s: invalid zone data\n",
+					 f->file_name);
+				ret = -EINVAL;
+				goto close;
+			}
+		}
+		z--;
+		start_sector = z->start + z->len;
+		if (j >= nr_zones)
+			break;
+		ret = read_zone_info(fd, start_sector, buf, bufsz);
+		if (ret < 0) {
+			log_info("fio: BLKREPORTZONE(%lu) failed for %s (%d).\n",
+				 start_sector, f->file_name, errno);
+			goto close;
+		}
+	}
+	/* a sentinel */
+	zbd_info->zone_info[nr_zones].start = start_sector;
+
+	f->zbd_info = zbd_info;
+	f->zbd_info->zone_size = zone_size;
+	f->zbd_info->nr_zones = nr_zones;
+	zbd_info = NULL;
+	ret = 0;
+
+close:
+	free(zbd_info);
+	close(fd);
+free:
+	free(buf);
+out:
+	return ret;
+}
+
+/**
+ * zbc_zone_idx - convert an offset into a zone number
+ * @td: thread data.
+ * @f: file pointer.
+ * @offset: offset in bytes. If this offset equals the disk size then the
+ *	index of the sentinel is returned.
+ */
+static uint32_t zbc_zone_idx(const struct thread_data *td,
+			     const struct fio_file *f, uint64_t offset)
+{
+	uint32_t zone_idx = (offset >> 9) / f->zbd_info->zone_size;
+
+	assert(offset <= f->real_file_size);
+	assert(zone_idx <= f->zbd_info->nr_zones);
+	return zone_idx;
+}
+
+static bool zone_full(const struct fio_file *f, const struct fio_zone_info *z)
+{
+	return z->wp >= z->start + f->zbd_info->zone_size;
+}
+
+static bool is_valid_offset(const struct fio_file *f, uint64_t offset)
+{
+	return (uint64_t)(offset - f->file_offset) < f->io_size;
+}
+
+/* Verify whether direct I/O is used for all ZBC drives. */
+static bool zbc_using_direct_io(void)
+{
+	struct thread_data *td;
+	struct fio_file *f;
+	int i, j;
+
+	for_each_td(td, i) {
+		if (td->o.odirect)
+			continue;
+		for_each_file(td, f, j) {
+			if (f->zbd_info)
+				return false;
+		}
+	}
+
+	return true;
+}
+
+static bool other_job_writes_to(struct thread_data *const td,
+				struct fio_file *const f,
+				const int i, const int j)
+{
+	struct thread_data *td2;
+	struct fio_file *f2;
+	int k, m;
+
+	for_each_td(td2, k) {
+		if ((td->o.td_ddir & (TD_DDIR_WRITE | TD_DDIR_TRIM)) == 0)
+			continue;
+		for_each_file(td2, f2, m) {
+			if (k == i && m == j)
+				continue;
+			if (f2->zbd_info &&
+			    strcmp(f->file_name, f2->file_name) == 0)
+				return true;
+		}
+	}
+
+	return false;
+}
+
+/*
+ * Check whether multiple ZBC write or trim jobs have been specified for the
+ * same drive.
+ *
+ * To do: refine this code such that locking is only required if multiple
+ * ZBC write or trim jobs have been specified for the same drive.
+ */
+static bool zbc_multiple_writers(void)
+{
+	struct thread_data *td;
+	struct fio_file *f;
+	int i, j;
+
+	for_each_td(td, i) {
+		if ((td->o.td_ddir & (TD_DDIR_WRITE | TD_DDIR_TRIM)) == 0)
+			continue;
+		for_each_file(td, f, j)
+			if (f->zbd_info &&
+			    other_job_writes_to(td, f, i, j))
+				return true;
+	}
+
+	return false;
+}
+
+static bool zbc_verify_sizes(void)
+{
+	const struct fio_zone_info *z;
+	struct thread_data *td;
+	struct fio_file *f;
+	uint64_t new_offset, new_start;
+	uint32_t zone_idx;
+	int i, j;
+
+	for_each_td(td, i) {
+		for_each_file(td, f, j) {
+			if (!f->zbd_info)
+				continue;
+			zone_idx = zbc_zone_idx(td, f, f->file_offset);
+			z = &f->zbd_info->zone_info[zone_idx];
+			if ((z->start << 9) != f->file_offset) {
+				new_offset = (z->start +
+					      f->zbd_info->zone_size) << 9;
+				log_info("%s: rounded up offset from %lu to %lu\n",
+					 f->file_name, f->file_offset,
+					 new_offset);
+				f->io_size -= (new_offset - f->file_offset);
+				f->file_offset = new_offset;
+			}
+			zone_idx = zbc_zone_idx(td, f, f->file_offset +
+						f->io_size);
+			z = &f->zbd_info->zone_info[zone_idx];
+			new_start = z->start << 9;
+			if (f->file_offset + f->io_size != new_start) {
+				if (new_start == f->file_offset) {
+					log_info("%s: io_size must be at least one zone\n",
+						 f->file_name);
+					return false;
+				}
+				log_info("%s: rounded down io_size from %lu to %lu\n",
+					 f->file_name, f->io_size, new_start);
+				f->io_size = new_start;
+			}
+		}
+	}
+
+	return true;
+}
+
+int zbc_verify_options(void)
+{
+	if (!zbc_using_direct_io()) {
+		log_err("Using direct I/O is mandatory for ZBC drives\n\n");
+		return 1;
+	}
+
+	if (zbc_multiple_writers()) {
+		log_err("Concurrent writing to ZBC disks is not supported\n\n");
+		return 1;
+	}
+
+	if (!zbc_verify_sizes())
+		return 1;
+
+	return 0;
+}
+
+/**
+ * zbc_adjust_block - adjust the offset and length as necessary for ZBC drives
+ * @td: FIO thread data.
+ * @io_u: FIO I/O unit.
+ *
+ * Returns 0 if the I/O unit should be used and 1 if not.
+ */
+int zbc_adjust_block(const struct thread_data *td, struct io_u *io_u)
+{
+	const struct fio_file *f = io_u->file;
+	uint32_t zone_idx_b, zone_idx_e;
+	struct fio_zone_info *zb, *ze;
+	uint64_t offset, orig_o = io_u->offset;
+	uint32_t orig_len = io_u->buflen;
+	uint64_t delta;
+
+	if (!f->zbd_info)
+		return 0;
+
+	zone_idx_b = zbc_zone_idx(td, f, f->file_offset + io_u->offset);
+	zone_idx_e = zbc_zone_idx(td, f, f->file_offset + io_u->offset +
+				  (io_u->buflen ? io_u->buflen - 1 : 0));
+	zb = &f->zbd_info->zone_info[zone_idx_b];
+	ze = &f->zbd_info->zone_info[zone_idx_e];
+
+	switch (io_u->ddir) {
+	case DDIR_READ:
+		/*
+		 * From the ZBC spec: a read operation past the write pointer
+		 * of a zone shall return logical block data set to the last
+		 * initialization pattern that was set at manufacture time,
+		 * by the FORMAT UNIT command or by the most recent SANITIZE
+		 * command with the service action set to OVERWRITE. Hence,
+		 * for random I/O, do not read past the write pointer.
+		 */
+		if (!td_random(td))
+			return 0;
+		if (io_u->buflen > ((zb->wp - zb->start) << 9))
+			return 1;
+		if (io_u->offset + io_u->buflen > (zb->wp << 9)) {
+			io_u->offset = (zb->wp << 9) - io_u->buflen;
+			dprint(FD_IO,
+			       "changed write offset from %ld into %lld\n",
+			       orig_o, io_u->offset);
+		}
+		return 0;
+	case DDIR_WRITE:
+		if (zb->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
+			return 0;
+		/* Make writes occur at the write pointer */
+		if (zone_full(f, zb)) {
+			if (zbc_reset_zone(f, zb->start,
+					   f->zbd_info->zone_size) < 0)
+				return 1;
+			zb->wp = zb->start;
+			offset = zb->start << 9;
+			dprint(FD_IO,
+			       "reset zone write pointer at offset %ld\n",
+			       offset);
+		} else {
+			offset = zb->wp << 9;
+		}
+		if (!is_valid_offset(f, offset))
+			return 1;
+		io_u->offset = offset - f->file_offset;
+		if (orig_o != io_u->offset)
+			dprint(FD_IO,
+			       "changed write offset from %ld into %lld\n",
+			       orig_o, io_u->offset);
+		/* Shrink write requests that cross zone boundaries. */
+		if (zone_idx_b != zone_idx_e) {
+			io_u->buflen = ((zb->start + f->zbd_info->zone_size)
+					<< 9) - (io_u->offset + f->file_offset);
+			dprint(FD_IO, "Changed length from %u into %lu\n",
+			       orig_len, io_u->buflen);
+		}
+		return 0;
+	case DDIR_TRIM:
+		/* Align trims to zone boundaries. */
+		if (zone_idx_b == zone_idx_e) {
+			if (zb->type == BLK_ZONE_TYPE_SEQWRITE_REQ &&
+			    io_u->buflen < f->zbd_info->zone_size)
+				io_u->buflen = 0;
+		} else {
+			if (zb->type == BLK_ZONE_TYPE_SEQWRITE_REQ) {
+				delta = f->zbd_info->zone_size -
+					(f->file_offset + io_u->offset -
+					 (zb->start << 9));
+				io_u->offset += delta;
+				io_u->buflen -= delta;
+			}
+			if (ze->type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+				io_u->buflen -= (f->file_offset +
+						 io_u->offset + io_u->buflen) -
+						(ze->start << 9);
+		}
+		dprint(FD_IO, "Changed trim range from %lu + %u into %llu + %lu (adjustments: offset + %llu; len - %lu)\n",
+		       orig_o, orig_len, io_u->offset, io_u->buflen,
+		       io_u->offset - orig_o, orig_len - io_u->buflen);
+		return 0;
+	case DDIR_SYNC:
+	case DDIR_DATASYNC:
+	case DDIR_SYNC_FILE_RANGE:
+	case DDIR_WAIT:
+	case DDIR_LAST:
+	case DDIR_INVAL:
+		return 0;
+	}
+
+	assert(false);
+	return 1;
+}
+
+/**
+ * zbc_update_wp - update the write pointer
+ * @td: thread data
+ * @io_u: I/O unit
+ *
+ * For write and trim operations, update the write pointer of all affected
+ * zones.
+ */
+void zbc_update_wp(struct thread_data *td, const struct io_u *io_u)
+{
+	struct zoned_block_device_info *zbd_info;
+	struct fio_zone_info *z;
+	uint32_t zone_idx;
+	uint64_t end;
+
+	if (!io_u->file->zbd_info)
+		return;
+
+	switch (io_u->ddir) {
+	case DDIR_READ:
+	case DDIR_SYNC:
+	case DDIR_DATASYNC:
+	case DDIR_SYNC_FILE_RANGE:
+	case DDIR_WAIT:
+	case DDIR_LAST:
+	case DDIR_INVAL:
+		return;
+	case DDIR_WRITE:
+	case DDIR_TRIM:
+		break;
+	}
+
+	zbd_info = io_u->file->zbd_info;
+	zone_idx = zbc_zone_idx(td, io_u->file, io_u->offset);
+	end = (io_u->offset + io_u->buflen) >> 9;
+	for (z = &zbd_info->zone_info[zone_idx]; z->start < end;
+	     z++, zone_idx++) {
+		assert(zone_idx < zbd_info->nr_zones);
+		if (z->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
+			continue;
+		switch (io_u->ddir) {
+		case DDIR_WRITE:
+			z->wp = end;
+			break;
+		case DDIR_TRIM:
+			z->wp = z->start;
+			break;
+		default:
+			assert(false);
+			break;
+		}
+	}
+}
diff --git a/zbc.h b/zbc.h
new file mode 100644
index 000000000000..de2a39551ca8
--- /dev/null
+++ b/zbc.h
@@ -0,0 +1,69 @@
+/*
+ * Copyright (C) 2018 Western Digital Corporation or its affiliates.
+ *
+ * This file is released under the GPL.
+ */
+
+#ifndef FIO_ZBC_H
+#define FIO_ZBC_H
+
+#include <inttypes.h>
+
+struct fio_file;
+
+/**
+ * struct fio_zone_info - information about a single ZBC zone
+ * @start: zone start in 512 byte units
+ * @wp: zone write pointer location in 512 byte units
+ * @type: zone type as defined by enum blk_zone_type
+ */
+struct fio_zone_info {
+	uint64_t start;
+	uint64_t wp;
+	uint8_t type;
+};
+
+/**
+ * zoned_block_device_info - zoned block device characteristics
+ * @zone_size: size of a single zone in units of 512 bytes
+ * @nr_zones: number of zones
+ * @zone_info: description of the individual zones
+ *
+ * Only devices for which all zones have the same size are supported.
+ * Note: if the capacity is not a multiple of the zone size then the last zone
+ * will be smaller than 'zone_size'.
+ */
+struct zoned_block_device_info {
+	uint64_t zone_size;
+	uint64_t nr_zones;
+	struct fio_zone_info zone_info[0];
+};
+
+#ifdef HAVE_LINUX_BLKZONED_H
+int zbc_init_zone_info(struct fio_file *f);
+int zbc_verify_options(void);
+int zbc_adjust_block(const struct thread_data *td, struct io_u *io_u);
+void zbc_update_wp(struct thread_data *td, const struct io_u *io_u);
+#else
+static inline int zbc_init_zone_info(struct fio_file *f)
+{
+	return 0;
+}
+
+static inline int zbc_verify_options(void)
+{
+	return 0;
+}
+
+static inline int zbc_adjust_block(struct thread_data *td, struct io_u *io_u)
+{
+	return 0;
+}
+
+static inline void zbc_update_wp(struct thread_data *td,
+				 const struct io_u *io_u)
+{
+}
+#endif
+
+#endif /* FIO_ZBC_H */

^ permalink raw reply related	[flat|nested] 12+ messages in thread
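[Editor's note on the patch above: the sector/byte conversions are the easiest part to get wrong. The patch keeps zone starts and zone_size in 512-byte sectors while fio's offset generator hands zbc_adjust_block() byte offsets; the arithmetic behind zbc_zone_idx() can be restated standalone as follows. Function names and the 256 MiB zone size here are assumptions for illustration, not fio code.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a byte offset to a zone index, with zone_size expressed in
 * 512-byte sectors (as in the patch's struct zoned_block_device_info).
 */
static uint32_t zone_idx_from_offset(uint64_t offset_bytes,
				     uint64_t zone_size_sectors)
{
	return (offset_bytes >> 9) / zone_size_sectors;
}

/* Inverse direction: byte offset of a zone's first sector. */
static uint64_t zone_start_bytes(uint32_t idx, uint64_t zone_size_sectors)
{
	return ((uint64_t)idx * zone_size_sectors) << 9;
}
```

For example, with 256 MiB zones (524288 sectors), byte offset 256 MiB falls in zone 1, which is why zbc_adjust_block() shifts zone write pointers left by 9 bits before comparing them against io_u->offset.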
* Re: ZBC/FLEX FIO addition ideas
  2018-03-15 16:43     ` Bart Van Assche
@ 2018-03-15 18:06       ` Phillip Chen
  2018-03-15 18:38         ` Bart Van Assche
  2018-03-17  7:55       ` Sitsofe Wheeler
  1 sibling, 1 reply; 12+ messages in thread
From: Phillip Chen @ 2018-03-15 18:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: sitsofe@gmail.com, Kris Davis, fio@vger.kernel.org, Jason Jorgensen

Would creating new profiles for all the I/O patterns be particularly difficult? I'm sure you're much more familiar with the FIO codebase than I am, but it seems to me that all you'd need to do for randoms is move the logic from the zbc_adjust_block cases upstream into the various methods called by get_off_from_method(), or possibly modify the existing methods to work differently when working on a ZBC drive. For sequentials, it seems like you'd just have to move the logic into get_next_seq_offset().

It also seems to me that it might be better to have get_next_block() pick a valid area to begin with. The main benefit of doing this that I can see would be to allow much more control over the number of open zones, which I think will be of particular interest in testing ZBC drive performance. Additionally, it might be worthwhile to have an option that allows the workload to pick a new zone instead of resetting the write pointer of a zone when writing to a full zone. This would also be made easier with a more upstream approach, because you wouldn't need to retry and get a new offset; you could just avoid full zones entirely. Or you could keep track of which zones are open and add/replace open zones as necessary.
Phillip On Thu, Mar 15, 2018 at 10:43 AM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > On Thu, 2018-03-15 at 16:30 +0000, Sitsofe Wheeler wrote: > >> On 15 March 2018 at 16:15, Kris Davis <Kris.Davis@wdc.com> wrote: > >> > Despite the desire to have fio work with Host Managed devices independent of the engine, a Host Managed device operation is different than a traditional block device, and the kernel also has to > >> > manage it differently. Thus, I would still recommend creating a new IO engine for use with Host Managed devices, we would not want the additional overhead associated with SMR to impact the > >> > standard aio engine. We have mostly used the fio SG engine along with external operations in testing of SMR. > >> > > >> > Here is the way we envision a new fio engine might work with Host Managed devices: > >> > >> It's definitely going to need something special. I think last time > >> round (https://www.spinics.net/lists/fio/msg06646.html) I suggested a > >> profile but perhaps that won't be enough. I doubt an ioengine would be > >> enough because you're going to have to interact with the next offset code > >> etc. unless you're going to fake I/Os done to "wrong" regions... > > > > Hello Sitsofe, > > > > Adding support for ZBC drives as a profile has a significant disadvantage, > > namely that the different I/O patterns (sequential read, sequential write, > > random read, random write, ...) all have to be reimplemented. That's why I'm > > considering to add ZBC support by modifying what get_next_block() produces. > > Can you have a look at the (barely tested) patch below? > > > > Thanks, > > > > Bart.
> > > > > > diff --git a/Makefile b/Makefile > > index 8f4871c63528..44dd7f3439f6 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -145,6 +145,9 @@ endif > > ifdef CONFIG_LIBPMEM > > SOURCE += engines/libpmem.c > > endif > > +ifdef HAVE_LINUX_BLKZONED_H > > + SOURCE += zbc.c > > +endif > > > > ifeq ($(CONFIG_TARGET_OS), Linux) > > SOURCE += diskutil.c fifo.c blktrace.c cgroup.c trim.c engines/sg.c \ > > diff --git a/configure b/configure > > index 4442a1cc0d75..688e86201320 100755 > > --- a/configure > > +++ b/configure > > @@ -2104,6 +2104,27 @@ if compile_prog "" "" "valgrind_dev"; then > > fi > > print_config "Valgrind headers" "$valgrind_dev" > > > > +########################################## > > +# <linux/blkzoned.h> probe > > +if test "$linux_blkzoned_h" != "yes" ; then > > + linux_blkzoned_h="no" > > +fi > > +cat > $TMPC << EOF > > +#include <linux/blkzoned.h> > > +int main(int argc, char **argv) > > +{ > > + return 0; > > +} > > +EOF > > +if compile_prog "" "" "linux_blkzoned_h"; then > > + linux_blkzoned_h="yes" > > +fi > > +print_config "<linux/blkzoned.h>" "$linux_blkzoned_h" > > +if test "$linux_blkzoned_h" = "yes" ; then > > + output_sym "HAVE_LINUX_BLKZONED_H" > > +fi > > + > > +########################################## > > # check march=armv8-a+crc+crypto > > if test "$march_armv8_a_crc_crypto" != "yes" ; then > > march_armv8_a_crc_crypto="no" > > diff --git a/debug.h b/debug.h > > index b8718ddc225f..c69c8079beda 100644 > > --- a/debug.h > > +++ b/debug.h > > @@ -24,6 +24,7 @@ enum { > > FD_COMPRESS, > > FD_STEADYSTATE, > > FD_HELPERTHREAD, > > + FD_ZBC, > > FD_DEBUG_MAX, > > }; > > > > diff --git a/file.h b/file.h > > index 8fd34b136c23..42b304629824 100644 > > --- a/file.h > > +++ b/file.h > > @@ -10,6 +10,9 @@ > > #include "lib/lfsr.h" > > #include "lib/gauss.h" > > > > +/* Forward declarations */ > > +struct zoned_block_device_info; > > + > > /* > > * The type of object we are working on > > */ > > @@ -97,6 +100,11 @@ struct fio_file { > > 
uint64_t file_offset; > > uint64_t io_size; > > > > + /* > > + * Zoned device information > > + */ > > + struct zoned_block_device_info *zbd_info; > > + > > /* > > * Track last end and last start of IO for a given data direction > > */ > > diff --git a/filesetup.c b/filesetup.c > > index 7cbce1327f8f..d981c61f5b7f 100644 > > --- a/filesetup.c > > +++ b/filesetup.c > > @@ -16,6 +16,7 @@ > > #include "hash.h" > > #include "lib/axmap.h" > > #include "rwlock.h" > > +#include "zbc.h" > > > > #ifdef CONFIG_LINUX_FALLOCATE > > #include <linux/falloc.h> > > @@ -773,6 +774,9 @@ static int get_file_sizes(struct thread_data *td) > > */ > > if (f->real_file_size == -1ULL && td->o.size) > > f->real_file_size = td->o.size / td->o.nr_files; > > + > > + if (f->filetype == FIO_TYPE_BLOCK) > > + zbc_init_zone_info(f); > > } > > > > return err; > > @@ -1165,7 +1169,9 @@ done: > > td->done = 1; > > > > td_restore_runstate(td, old_state); > > - return 0; > > + > > + return zbc_verify_options(); > > + > > err_offset: > > log_err("%s: you need to specify valid offset=\n", o->name); > > err_out: > > diff --git a/init.c b/init.c > > index e47e5384119b..3ea5ea57e3a1 100644 > > --- a/init.c > > +++ b/init.c > > @@ -2266,6 +2266,10 @@ const struct debug_level debug_levels[] = { > > .help = "Helper thread logging", > > .shift = FD_HELPERTHREAD, > > }, > > + { .name = "zbc", > > + .help = "Zoned Block Device logging", > > + .shift = FD_ZBC, > > + }, > > { .name = NULL, }, > > }; > > > > diff --git a/io_u.c b/io_u.c > > index 01b36938d1b5..ba4dbb9d8ecc 100644 > > --- a/io_u.c > > +++ b/io_u.c > > @@ -14,6 +14,7 @@ > > #include "err.h" > > #include "lib/pow2.h" > > #include "minmax.h" > > +#include "zbc.h" > > > > struct io_completion_data { > > int nr; /* input */ > > @@ -558,6 +559,9 @@ static int get_next_offset(struct thread_data *td, struct io_u *io_u, > > if (get_next_block(td, io_u, ddir, rw_seq_hit, is_random)) > > return 1; > > > > + if (zbc_adjust_block(td, io_u)) > > + return 1; > > + 
> > if (io_u->offset >= f->io_size) { > > dprint(FD_IO, "get_next_offset: offset %llu >= io_size %llu\n", > > (unsigned long long) io_u->offset, > > diff --git a/ioengines.c b/ioengines.c > > index 965581aa4157..a04a977cca9f 100644 > > --- a/ioengines.c > > +++ b/ioengines.c > > @@ -19,6 +19,7 @@ > > > > #include "fio.h" > > #include "diskutil.h" > > +#include "zbc.h" > > > > static FLIST_HEAD(engine_list); > > > > @@ -320,6 +321,8 @@ int td_io_queue(struct thread_data *td, struct io_u *io_u) > > } > > > > ret = td->io_ops->queue(td, io_u); > > + if (ret < FIO_Q_BUSY) > > + zbc_update_wp(td, io_u); > > > > unlock_file(td, io_u->file); > > > > diff --git a/zbc.c b/zbc.c > > new file mode 100644 > > index 000000000000..57e1981ee35d > > --- /dev/null > > +++ b/zbc.c > > @@ -0,0 +1,604 @@ > > +/* > > + * Copyright (C) 2018 Western Digital Corporation or its affiliates. > > + * > > + * This file is released under the GPL. > > + */ > > + > > +#include <errno.h> > > +#include <string.h> > > +#include <stdlib.h> > > +#include <dirent.h> > > +#include <fcntl.h> > > +#include <sys/ioctl.h> > > +#include <unistd.h> > > +#include <linux/blkzoned.h> > > +#include "file.h" > > +#include "fio.h" > > +#include "log.h" > > +#include "zbc.h" > > + > > +/* Return the name of the first entry in a directory */ > > +static char *get_first_dirent(const char *dir_path) > > +{ > > + char *res = NULL; > > + struct dirent *e; > > + DIR *d; > > + > > + d = opendir(dir_path); > > + if (!d) > > + return NULL; > > + while ((e = readdir(d))) { > > + /* Skip "." and ".." */ > > + if (e->d_name[0] == '.') > > + continue; > > + res = strdup(e->d_name); > > + break; > > + } > > + closedir(d); > > + > > + return res; > > +} > > + > > +/* > > + * Convert a block device name into a SCSI device path, e.g. /dev/sdc into > > + * /sys/class/scsi_device/0:0:0:0. 
> > + */ > > +static int bdev_to_scsi(char **scsi_id, const char *bdev) > > +{ > > + char *bdev_path = NULL, *bdev2 = NULL; > > + struct dirent *e; > > + bool matches; > > + DIR *d; > > + int res = 0; > > + > > + if (strncmp(bdev, "/dev/", 5) != 0) > > + return -EINVAL; > > + d = opendir("/sys/class/scsi_device"); > > + if (!d) > > + return -ENOMEM; > > + while ((e = readdir(d))) { > > + /* Skip "." and ".." */ > > + if (e->d_name[0] == '.') > > + continue; > > + free(bdev_path); > > + bdev_path = NULL; > > + res = -ENOMEM; > > + if (asprintf(&bdev_path, > > + "/sys/class/scsi_device/%s/device/block", > > + e->d_name) < 0) > > + break; > > + bdev2 = get_first_dirent(bdev_path); > > + matches = bdev2 && strcmp(bdev2, bdev + 5) == 0; > > + free(bdev2); > > + if (matches) { > > + *scsi_id = strdup(e->d_name); > > + res = 0; > > + break; > > + } > > + res = -ENOENT; > > + } > > + closedir(d); > > + > > + free(bdev_path); > > + > > + return res; > > +} > > + > > +/* > > + * Get the SCSI device type from VPD page 0x80. That device type is called the > > + * "peripheral device type" in the SCSI SPC-5 standard. Returns -ENXIO if and > > + * only if the device is not a SCSI device. 
> > + */ > > +static int get_scsi_device_type(const char *bdev) > > +{ > > + char *scsi_id = NULL, *vpd_pg80_path = NULL; > > + uint8_t vpd_pg80[8]; > > + int vpd_fd; > > + int ret; > > + > > + ret = bdev_to_scsi(&scsi_id, bdev); > > + if (ret < 0 && ret != -ENOENT) > > + return ret; > > + if (scsi_id == NULL) > > + return -ENXIO; > > + dprint(FD_ZBC, "Block device %s has SCSI ID %s\n", bdev, scsi_id); > > + ret = -ENOMEM; > > + if (asprintf(&vpd_pg80_path, > > + "/sys/class/scsi_device/%s/device/vpd_pg80", scsi_id) < 0) > > + goto out; > > + > > + vpd_fd = open(vpd_pg80_path, O_RDONLY); > > + if (vpd_fd < 0) > > + goto out; > > + ret = read(vpd_fd, vpd_pg80, sizeof(vpd_pg80)); > > + close(vpd_fd); > > + if (ret < (int)sizeof(vpd_pg80)) > > + goto out; > > + > > + ret = vpd_pg80[0] & 0x1f; > > + > > +out: > > + free(vpd_pg80_path); > > + free(scsi_id); > > + return ret; > > +} > > + > > +/** > > + * zbc_reset_zone - reset the write pointer of one or more zones > > + * @f: FIO file associated with the disk for which to reset write pointers > > + * @sector: First sector for which to reset the write pointer in units of 512 > > + * bytes. > > + * @nr_sectors: Number of sectors to reset the write pointer of. > > + */ > > +static int zbc_reset_zone(const struct fio_file *f, uint64_t sector, > > + uint64_t nr_sectors) > > +{ > > + struct blk_zone_range zr = { > > + .sector = sector, > > + .nr_sectors = nr_sectors > > + }; > > + int ret; > > + > > + ret = ioctl(f->fd, BLKRESETZONE, &zr); > > + if (ret < 0) > > + log_err("%s: resetting wp for %lu sectors at sector %lu failed (%d).\n", > > + f->file_name, nr_sectors, sector, errno); > > + return ret; > > +} > > + > > +/* > > + * Read zone information into @buf starting from sector @start_sector. > > + * @fd is a file descriptor that refers to a block device and @bufsz is the > > + * size of @buf.
> > + */ > > +static int read_zone_info(int fd, uint64_t start_sector, > > + void *buf, unsigned int bufsz) > > +{ > > + struct blk_zone_report *hdr = buf; > > + > > + if (bufsz < sizeof(*hdr)) > > + return -EINVAL; > > + > > + memset(hdr, 0, sizeof(*hdr)); > > + > > + hdr->nr_zones = (bufsz - sizeof(*hdr)) / sizeof(struct blk_zone); > > + hdr->sector = start_sector; > > + return ioctl(fd, BLKREPORTZONE, hdr); > > +} > > + > > +/* > > + * Initialize f->zbd_info. > > + */ > > +int zbc_init_zone_info(struct fio_file *f) > > +{ > > + const unsigned int bufsz = sizeof(struct blk_zone_report) + > > + 32768 * sizeof(struct blk_zone); > > + unsigned int nr_zones; > > + struct blk_zone_report *hdr; > > + const struct blk_zone *z; > > + struct fio_zone_info *p; > > + uint64_t zone_size, start_sector; > > + struct zoned_block_device_info *zbd_info = NULL; > > + void *buf; > > + int fd, i, j, ret = -ENOMEM; > > + > > + ret = get_scsi_device_type(f->file_name); > > + if (ret == -ENXIO) { > > + dprint(FD_ZBC, "%s: not a SCSI device\n", f->file_name); > > + ret = 0; > > + goto out; > > + } > > + if (ret < 0) > > + log_info("fio: unable to determine device type for %s.\n", > > + f->file_name); > > + > > + dprint(FD_ZBC, "Block device %s has SCSI device type %#x\n", > > + f->file_name, ret); > > + > > + if (ret != 0x14 /* ZBC */) { > > + ret = 0; > > + goto out; > > + } > > + > > + dprint(FD_ZBC, "Reading zone information for device %s\n", > > + f->file_name); > > + > > + buf = malloc(bufsz); > > + if (!buf) > > + goto out; > > + > > + fd = open(f->file_name, O_RDONLY | O_LARGEFILE); > > + if (fd < 0) { > > + ret = -errno; > > + goto free; > > + } > > + > > + ret = read_zone_info(fd, 0, buf, bufsz); > > + if (ret < 0) { > > + log_info("fio: BLKREPORTZONE(%lu) failed for %s (%d).\n", > > + 0UL, f->file_name, errno); > > + goto close; > > + } > > + hdr = buf; > > + if (hdr->nr_zones < 1) { > > + log_info("fio: %s has invalid zone information.\n", > > + f->file_name); > > + goto 
close; > > + } > > + z = (void *)(hdr + 1); > > + zone_size = z->len; > > + nr_zones = (f->real_file_size >> 9) / zone_size; > > + > > + dprint(FD_ZBC, "Device %s has %d zones of size %lu\n", f->file_name, > > + nr_zones, zone_size); > > + > > + zbd_info = calloc(1, sizeof(*zbd_info) + > > + (nr_zones + 1) * sizeof(zbd_info->zone_info[0])); > > + ret = -ENOMEM; > > + if (!zbd_info) > > + goto close; > > + p = &zbd_info->zone_info[0]; > > + for (start_sector = 0, j = 0; j < nr_zones;) { > > + z = (void *)(hdr + 1); > > + for (i = 0; i < hdr->nr_zones; i++, j++, z++, p++) { > > + p->start = z->start; > > + p->wp = z->wp; > > + p->type = z->type; > > + if (j > 0 && p->start != p[-1].start + zone_size) { > > + log_info("%s: invalid zone data\n", > > + f->file_name); > > + ret = -EINVAL; > > + goto close; > > + } > > + } > > + z--; > > + start_sector = z->start + z->len; > > + if (j >= nr_zones) > > + break; > > + ret = read_zone_info(fd, start_sector, buf, bufsz); > > + if (ret < 0) { > > + log_info("fio: BLKREPORTZONE(%lu) failed for %s (%d).\n", > > + start_sector, f->file_name, errno); > > + goto close; > > + } > > + } > > + /* a sentinel */ > > + zbd_info->zone_info[nr_zones].start = start_sector; > > + > > + f->zbd_info = zbd_info; > > + f->zbd_info->zone_size = zone_size; > > + f->zbd_info->nr_zones = nr_zones; > > + zbd_info = NULL; > > + ret = 0; > > + > > +close: > > + free(zbd_info); > > + close(fd); > > +free: > > + free(buf); > > +out: > > + return ret; > > +} > > + > > +/** > > + * zbc_zone_idx - convert an offset into a zone number > > + * @td: thread data. > > + * @f: file pointer. > > + * @offset: offset in bytes. If this offset equals the disk size then the > > + * index of the sentinel is returned. 
> > + */ > > +static uint32_t zbc_zone_idx(const struct thread_data *td, > > + const struct fio_file *f, uint64_t offset) > > +{ > > + uint32_t zone_idx = (offset >> 9) / f->zbd_info->zone_size; > > + > > + assert(offset <= f->real_file_size); > > + assert(zone_idx <= f->zbd_info->nr_zones); > > + return zone_idx; > > +} > > + > > +static bool zone_full(const struct fio_file *f, const struct fio_zone_info *z) > > +{ > > + return z->wp >= z->start + f->zbd_info->zone_size; > > +} > > + > > +static bool is_valid_offset(const struct fio_file *f, uint64_t offset) > > +{ > > + return (uint64_t)(offset - f->file_offset) < f->io_size; > > +} > > + > > +/* Verify whether direct I/O is used for all ZBC drives. */ > > +static bool zbc_using_direct_io(void) > > +{ > > + struct thread_data *td; > > + struct fio_file *f; > > + int i, j; > > + > > + for_each_td(td, i) { > > + if (td->o.odirect) > > + continue; > > + for_each_file(td, f, j) { > > + if (f->zbd_info) > > + return false; > > + } > > + } > > + > > + return true; > > +} > > + > > +static bool other_job_writes_to(struct thread_data *const td, > > + struct fio_file *const f, > > + const int i, const int j) > > +{ > > + struct thread_data *td2; > > + struct fio_file *f2; > > + int k, m; > > + > > + for_each_td(td2, k) { > > + if ((td2->o.td_ddir & (TD_DDIR_WRITE | TD_DDIR_TRIM)) == 0) > > + continue; > > + for_each_file(td2, f2, m) { > > + if (k == i && m == j) > > + continue; > > + if (f2->zbd_info && > > + strcmp(f->file_name, f2->file_name) == 0) > > + return true; > > + } > > + } > > + > > + return false; > > +} > > + > > +/* > > + * Check whether multiple ZBC write or trim jobs have been specified for the > > + * same drive. > > + * > > + * To do: refine this code such that locking is only required if multiple > > + * ZBC write or trim jobs have been specified for the same drive.
> > + */ > > +static bool zbc_multiple_writers(void) > > +{ > > + struct thread_data *td; > > + struct fio_file *f; > > + int i, j; > > + > > + for_each_td(td, i) { > > + if ((td->o.td_ddir & (TD_DDIR_WRITE | TD_DDIR_TRIM)) == 0) > > + continue; > > + for_each_file(td, f, j) > > + if (f->zbd_info && > > + other_job_writes_to(td, f, i, j)) > > + return true; > > + } > > + > > + return false; > > +} > > + > > +static bool zbc_verify_sizes(void) > > +{ > > + const struct fio_zone_info *z; > > + struct thread_data *td; > > + struct fio_file *f; > > + uint64_t new_offset, new_start; > > + uint32_t zone_idx; > > + int i, j; > > + > > + for_each_td(td, i) { > > + for_each_file(td, f, j) { > > + if (!f->zbd_info) > > + continue; > > + zone_idx = zbc_zone_idx(td, f, f->file_offset); > > + z = &f->zbd_info->zone_info[zone_idx]; > > + if ((z->start << 9) != f->file_offset) { > > + new_offset = (z->start + > > + f->zbd_info->zone_size) << 9; > > + log_info("%s: rounded up offset from %lu to %lu\n", > > + f->file_name, f->file_offset, > > + new_offset); > > + f->io_size -= (new_offset - f->file_offset); > > + f->file_offset = new_offset; > > + } > > + zone_idx = zbc_zone_idx(td, f, f->file_offset + > > + f->io_size); > > + z = &f->zbd_info->zone_info[zone_idx]; > > + new_start = z->start << 9; > > + if (f->file_offset + f->io_size != new_start) { > > + if (new_start == f->file_offset) { > > + log_info("%s: io_size must be at least one zone\n", > > + f->file_name); > > + return false; > > + } > > + log_info("%s: rounded down io_size from %lu to %lu\n", > > + f->file_name, f->io_size, new_start); > > + f->io_size = new_start; > > + } > > + } > > + } > > + > > + return true; > > +} > > + > > +int zbc_verify_options(void) > > +{ > > + if (!zbc_using_direct_io()) { > > + log_err("Using direct I/O is mandatory for ZBC drives\n\n"); > > + return 1; > > + } > > + > > + if (zbc_multiple_writers()) { > > + log_err("Concurrent writing to ZBC disks is not supported\n\n"); > > + return 1; > 
> + } > > + > > + if (!zbc_verify_sizes()) > > + return 1; > > + > > + return 0; > > +} > > + > > +/** > > + * zbc_adjust_block - adjust the offset and length as necessary for ZBC drives > > + * @td: FIO thread data. > > + * @io_u: FIO I/O unit. > > + * > > + * Returns 0 if the I/O unit should be used and 1 if not. > > + */ > > +int zbc_adjust_block(const struct thread_data *td, struct io_u *io_u) > > +{ > > + const struct fio_file *f = io_u->file; > > + uint32_t zone_idx_b, zone_idx_e; > > + struct fio_zone_info *zb, *ze; > > + uint64_t offset, orig_o = io_u->offset; > > + uint32_t orig_len = io_u->buflen; > > + uint64_t delta; > > + > > + if (!f->zbd_info) > > + return 0; > > + > > + zone_idx_b = zbc_zone_idx(td, f, f->file_offset + io_u->offset); > > + zone_idx_e = zbc_zone_idx(td, f, f->file_offset + io_u->offset + > > + (io_u->buflen ? io_u->buflen - 1 : 0)); > > + zb = &f->zbd_info->zone_info[zone_idx_b]; > > + ze = &f->zbd_info->zone_info[zone_idx_e]; > > + > > + switch (io_u->ddir) { > > + case DDIR_READ: > > + /* > > + * From the ZBC spec: a read operation past the write pointer > > + * of a zone shall return logical block data set to the last > > + * initialization pattern that was set at manufacture time, > > + * by the FORMAT UNIT command or by the most recent SANITIZE > > + * command with the service action set to OVERWRITE. Hence, > > + * for random I/O, do not read past the write pointer. 
> > + */ > > + if (!td_random(td)) > > + return 0; > > + if (io_u->buflen > ((zb->wp - zb->start) << 9)) > > + return 1; > > + if (io_u->offset + io_u->buflen > (zb->wp << 9)) { > > + io_u->offset = (zb->wp << 9) - io_u->buflen; > > + dprint(FD_IO, > > + "changed read offset from %ld into %lld\n", > > + orig_o, io_u->offset); > > + } > > + return 0; > > + case DDIR_WRITE: > > + if (zb->type != BLK_ZONE_TYPE_SEQWRITE_REQ) > > + return 0; > > + /* Make writes occur at the write pointer */ > > + if (zone_full(f, zb)) { > > + if (zbc_reset_zone(f, zb->start, > > + f->zbd_info->zone_size) < 0) > > + return 1; > > + zb->wp = zb->start; > > + offset = zb->start << 9; > > + dprint(FD_IO, > > + "reset zone write pointer at offset %ld\n", > > + offset); > > + } else { > > + offset = zb->wp << 9; > > + } > > + if (!is_valid_offset(f, offset)) > > + return 1; > > + io_u->offset = offset - f->file_offset; > > + if (orig_o != io_u->offset) > > + dprint(FD_IO, > > + "changed write offset from %ld into %lld\n", > > + orig_o, io_u->offset); > > + /* Shrink write requests that cross zone boundaries. */ > > + if (zone_idx_b != zone_idx_e) { > > + io_u->buflen = ((zb->start + f->zbd_info->zone_size) > > + << 9) - (io_u->offset + f->file_offset); > > + dprint(FD_IO, "Changed length from %u into %lu\n", > > + orig_len, io_u->buflen); > > + } > > + return 0; > > + case DDIR_TRIM: > > + /* Align trims to zone boundaries.
*/ > > + if (zone_idx_b == zone_idx_e) { > > + if (zb->type == BLK_ZONE_TYPE_SEQWRITE_REQ && > > + io_u->buflen < f->zbd_info->zone_size) > > + io_u->buflen = 0; > > + } else { > > + if (zb->type == BLK_ZONE_TYPE_SEQWRITE_REQ) { > > + delta = f->zbd_info->zone_size - > > + (f->file_offset + io_u->offset - > > + (zb->start << 9)); > > + io_u->offset += delta; > > + io_u->buflen -= delta; > > + } > > + if (ze->type == BLK_ZONE_TYPE_SEQWRITE_REQ) > > + io_u->buflen -= (f->file_offset + > > + io_u->offset + io_u->buflen) - > > + (ze->start << 9); > > + } > > + dprint(FD_IO, "Changed trim range from %lu + %u into %llu + %lu (adjustments: offset + %llu; len - %lu)\n", > > + orig_o, orig_len, io_u->offset, io_u->buflen, > > + io_u->offset - orig_o, orig_len - io_u->buflen); > > + return 0; > > + case DDIR_SYNC: > > + case DDIR_DATASYNC: > > + case DDIR_SYNC_FILE_RANGE: > > + case DDIR_WAIT: > > + case DDIR_LAST: > > + case DDIR_INVAL: > > + return 0; > > + } > > + > > + assert(false); > > + return 1; > > +} > > + > > +/** > > + * zbc_update_wp - update the write pointer > > + * @td: thread data > > + * @io_u: I/O unit > > + * > > + * For write and trim operations, update the write pointer of all affected > > + * zones. 
> > + */ > > +void zbc_update_wp(struct thread_data *td, const struct io_u *io_u) > > +{ > > + struct zoned_block_device_info *zbd_info; > > + struct fio_zone_info *z; > > + uint32_t zone_idx; > > + uint64_t end; > > + > > + if (!io_u->file->zbd_info) > > + return; > > + > > + switch (io_u->ddir) { > > + case DDIR_READ: > > + case DDIR_SYNC: > > + case DDIR_DATASYNC: > > + case DDIR_SYNC_FILE_RANGE: > > + case DDIR_WAIT: > > + case DDIR_LAST: > > + case DDIR_INVAL: > > + return; > > + case DDIR_WRITE: > > + case DDIR_TRIM: > > + break; > > + } > > + > > + zbd_info = io_u->file->zbd_info; > > + zone_idx = zbc_zone_idx(td, io_u->file, io_u->offset); > > + end = (io_u->offset + io_u->buflen) >> 9; > > + for (z = &zbd_info->zone_info[zone_idx]; z->start < end; > > + z++, zone_idx++) { > > + assert(zone_idx < zbd_info->nr_zones); > > + if (z->type != BLK_ZONE_TYPE_SEQWRITE_REQ) > > + continue; > > + switch (io_u->ddir) { > > + case DDIR_WRITE: > > + z->wp = end; > > + break; > > + case DDIR_TRIM: > > + z->wp = z->start; > > + break; > > + default: > > + assert(false); > > + break; > > + } > > + } > > +} > > diff --git a/zbc.h b/zbc.h > > new file mode 100644 > > index 000000000000..de2a39551ca8 > > --- /dev/null > > +++ b/zbc.h > > @@ -0,0 +1,69 @@ > > +/* > > + * Copyright (C) 2018 Western Digital Corporation or its affiliates. > > + * > > + * This file is released under the GPL. 
> > + */ > > + > > +#ifndef FIO_ZBC_H > > +#define FIO_ZBC_H > > + > > +#include <inttypes.h> > > + > > +struct fio_file; > > + > > +/** > > + * struct fio_zone_info - information about a single ZBC zone > > + * @start: zone start in 512 byte units > > + * @wp: zone write pointer location in 512 byte units > > + * @type: zone type as defined by enum blk_zone_type > > + */ > > +struct fio_zone_info { > > + uint64_t start; > > + uint64_t wp; > > + uint8_t type; > > +}; > > + > > +/** > > + * zoned_block_device_info - zoned block device characteristics > > + * @zone_size: size of a single zone in units of 512 bytes > > + * @nr_zones: number of zones > > + * @zone_info: description of the individual zones > > + * > > + * Only devices for which all zones have the same size are supported. > > + * Note: if the capacity is not a multiple of the zone size then the last zone > > + * will be smaller than 'zone_size'. > > + */ > > +struct zoned_block_device_info { > > + uint64_t zone_size; > > + uint64_t nr_zones; > > + struct fio_zone_info zone_info[0]; > > +}; > > + > > +#ifdef HAVE_LINUX_BLKZONED_H > > +int zbc_init_zone_info(struct fio_file *f); > > +int zbc_verify_options(void); > > +int zbc_adjust_block(const struct thread_data *td, struct io_u *io_u); > > +void zbc_update_wp(struct thread_data *td, const struct io_u *io_u); > > +#else > > +static inline int zbc_init_zone_info(struct fio_file *f) > > +{ > > + return 0; > > +} > > + > > +static inline int zbc_verify_options(void) > > +{ > > + return 0; > > +} > > + > > +static inline int zbc_adjust_block(struct thread_data *td, struct io_u *io_u) > > +{ > > + return 0; > > +} > > + > > +static inline void zbc_update_wp(struct thread_data *td, > > + const struct io_u *io_u) > > +{ > > +} > > +#endif > > + > > +#endif /* FIO_ZBC_H */ > > > > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ZBC/FLEX FIO addition ideas 2018-03-15 18:06 ` Phillip Chen @ 2018-03-15 18:38 ` Bart Van Assche 2018-03-15 19:40 ` Phillip Chen 0 siblings, 1 reply; 12+ messages in thread From: Bart Van Assche @ 2018-03-15 18:38 UTC (permalink / raw) To: phillip.a.chen@seagate.com Cc: sitsofe@gmail.com, Kris Davis, Jason Jorgensen, fio@vger.kernel.org On Thu, 2018-03-15 at 12:06 -0600, Phillip Chen wrote: > Would creating new profiles for all the I/O patterns be particularly > difficult? I'm sure you're much more familiar with the FIO codebase > than I am, but it seems to me that all you'd need to do for randoms is > move the logic from the zbc_adjust_block cases upstream into the > various methods called by get_off_from_method(), or possibly modify > the existing methods to work differently when working on a ZBC drive. > For sequentials it seems like you'd just have to move the logic into > get_next_seq_offset(). Hello Phillip, Adding support for ZBC drives through creation of a new profile has the following disadvantages: - It makes it impossible to use another profile (act or tiobench) for the generation of a workload. - It will lead to code duplication. fio already has code for supporting a large number of I/O patterns (sequential, random, ...). If we can avoid code duplication I think we should avoid it. > It also seems to me that it might be better to have get_next_block() > pick a valid area to begin with. What does "valid" mean in this context? Have you noticed that zbc_adjust_block() modifies the I/O request offset and length such that neither write errors nor reads past the end of the write pointer are triggered? > The main benefit to doing this that I > can see would be to allow much more control over the number of open > zones, which I think will be of particular interest in testing ZBC > drive performance. 
Additionally, it might be worthwhile to have an > option allows the workload to pick a new zone instead of resetting the > write pointer of a zone when writing to a full zone. This would also > be made easier with a more upstream approach, because you wouldn't > need to retry and get a new offset, you could just avoid full zones > entirely. Or you could keep track of which zones are open and > add/replace open zones as necessary. With the approach I proposed it is already possible to control the number of open zones, namely by setting the I/O offset (--offset=) to the start of a zone and by setting the I/O size (--io_size=) to (number of zones to test) * (zone size). But I agree with you that the kind of workload that you described would best be implemented as an I/O profile. How about starting with the approach I proposed and adding profile(s) for more advanced ZBC I/O patterns later? Thanks, Bart. ^ permalink raw reply [flat|nested] 12+ messages in thread
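Following the recipe Bart gives above, and assuming a hypothetical drive /dev/sdX with 256 MiB zones, a job that confines random writes to four zones starting at zone 8 might look like the fragment below (the device name and zone size are illustrative):

```ini
[zbc-four-zones]
filename=/dev/sdX
direct=1        ; the patch makes direct I/O mandatory for ZBC drives
rw=randwrite
bs=128k
offset=2g       ; zone 8 * 256 MiB; must land on a zone boundary
io_size=1g      ; (number of zones to test) * (zone size) = 4 * 256 MiB
```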
* Re: ZBC/FLEX FIO addition ideas 2018-03-15 18:38 ` Bart Van Assche @ 2018-03-15 19:40 ` Phillip Chen 2018-03-15 21:01 ` Bart Van Assche 0 siblings, 1 reply; 12+ messages in thread From: Phillip Chen @ 2018-03-15 19:40 UTC (permalink / raw) To: Bart Van Assche Cc: sitsofe@gmail.com, Kris Davis, Jason Jorgensen, fio@vger.kernel.org I haven't used the profile argument for FIO before, so when I read profile I thought it was being used as a synonym for I/O profile or workload. So I think you're right that we shouldn't be creating a new profile for ZBC but rather focusing on the existing FIO generated workloads. When I was talking about "valid" I meant that get_next_block() would generate I/O that would not cause errors or read filler data past the write pointer rather than having zbc_adjust_block() modify the I/O afterwards. You make a good point that the two approaches can easily coexist. If you get your changes added I would like to try to build the additional workloads on top of your existing changes so that I can leverage your write pointer tracking code. Thank you, Phillip On Thu, Mar 15, 2018 at 12:38 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > On Thu, 2018-03-15 at 12:06 -0600, Phillip Chen wrote: >> Would creating new profiles for all the I/O patterns be particularly >> difficult? I'm sure you're much more familiar with the FIO codebase >> than I am, but it seems to me that all you'd need to do for randoms is >> move the logic from the zbc_adjust_block cases upstream into the >> various methods called by get_off_from_method(), or possibly modify >> the existing methods to work differently when working on a ZBC drive. >> For sequentials it seems like you'd just have to move the logic into >> get_next_seq_offset(). > > Hello Phillip, > > Adding support for ZBC drives through creation of a new profile has the > following disadvantages: > - It makes it impossible to use another profile (act or tiobench) for the > generation of a workload.
> - It will lead to code duplication. fio already has code for supporting a > large number of I/O patterns (sequential, random, ...). If we can avoid > code duplication I think we should avoid it. > >> It also seems to me that it might be better to have get_next_block() >> pick a valid area to begin with. > > What does "valid" mean in this context? Have you noticed that > zbc_adjust_block() modifies the I/O request offset and length such that > neither write errors nor reads past the end of the write pointer are > triggered? > >> The main benefit to doing this that I >> can see would be to allow much more control over the number of open >> zones, which I think will be of particular interest in testing ZBC >> drive performance. Additionally, it might be worthwhile to have an >> option allows the workload to pick a new zone instead of resetting the >> write pointer of a zone when writing to a full zone. This would also >> be made easier with a more upstream approach, because you wouldn't >> need to retry and get a new offset, you could just avoid full zones >> entirely. Or you could keep track of which zones are open and >> add/replace open zones as necessary. > > With the approach I proposed it is already possible to control the number of > open zones, namely by setting the I/O offset (--offset=) to the start of a > zone and by setting the I/O size (--io_size=) to (number of zones to test) * > (zone size). But I agree with you that the kind of workload that you > described would best be implemented as an I/O profile. How about starting > with the approach I proposed and adding profile(s) for more advanced ZBC I/O > patterns later? > > Thanks, > > Bart. > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ZBC/FLEX FIO addition ideas
  2018-03-15 19:40 ` Phillip Chen
@ 2018-03-15 21:01 ` Bart Van Assche
  0 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2018-03-15 21:01 UTC (permalink / raw)
To: phillip.a.chen@seagate.com
Cc: sitsofe@gmail.com, Kris Davis, Jason Jorgensen, fio@vger.kernel.org, Damien Le Moal

On Thu, 2018-03-15 at 13:40 -0600, Phillip Chen wrote:
> I haven't used the profile argument for FIO before, so when I read
> profile I thought it was being used as a synonym for I/O profile or
> workload. So I think you're right that we shouldn't be creating a new
> profile for ZBC but rather focusing on the existing FIO generated
> workloads.
> When I was talking about valid I meant that get_next_block() would
> generate I/O that would not cause errors or read filler data past the
> write pointer rather than having zbc_adjust_block() modify the I/O
> afterwards.
> You make a good point that the two approaches can easily coexist. If
> you get your changes added I would like to try and build the
> additional workloads on top of your existing changes so that I can
> leverage your write pointer tracking code.

Hello Phillip,

That sounds like a plan to me :-) Can you have a look at the following code:
https://github.com/bvanassche/fio/tree/zbc. The quick fio tests I ran with
this code work fine with the ZBC and ZAC disks I tried. I will send that
code to Jens once the following two pull requests have been accepted:
* "Improvements for analyzing fio with Valgrind"
  (https://github.com/axboe/fio/pull/560).
* "Add an asprintf() implementation"
  (https://github.com/axboe/fio/pull/561).

Thanks,

Bart.

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ZBC/FLEX FIO addition ideas
  2018-03-15 16:43 ` Bart Van Assche
  2018-03-15 18:06 ` Phillip Chen
@ 2018-03-17  7:55 ` Sitsofe Wheeler
  2018-03-20  2:21 ` Bart Van Assche
  1 sibling, 1 reply; 12+ messages in thread
From: Sitsofe Wheeler @ 2018-03-17 7:55 UTC (permalink / raw)
To: Bart Van Assche
Cc: Kris Davis, fio@vger.kernel.org, Jason Jorgensen, phillip.a.chen@seagate.com

On 15 March 2018 at 16:43, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
>
> Adding support for ZBC drives as a profile has a significant disadvantage,
> namely that the different I/O patterns (sequential read, sequential write,
> random read, random write, ...) all have to be reimplemented. That's why I'm
> considering to add ZBC support by modifying what get_next_block() produces.
> Can you have a look at the (barely tested) patch below?

I don't have access to any ZBC devices but I looked over the patch. This
seems to be a blend of both of Phillip's suggestions and I quite like it
because it centralizes most of the logic within the engine itself bar the
offset fixing.

Something that could help is if ioengines were able to specify a callback
to say whether an offset + size is valid or whether the ioengine wants to
"fix" them. If it says the offset is invalid we just quietly pretend we did
the I/O, reduce the amount of I/O we have to do on this loop and generate
the next offset + size. If it chooses to just fix it then we just use the
different values it fixed the I/O to (assuming it fits alignment, minimum
block sizes etc). How do people see fix mode interacting with things like
bssplit? Do we just ban bssplit when in this mode for now?

Most engines could leave these unset but it would be perfect for this one
and it would stop the typical path having to depend on this particular
ioengine being configured in. If "fix things up" mode is set then I suppose
fio could then go on to ban things like lfsr, randommap and
random_distribution.
I guess one snag with re-rolling in "valid" mode is that you could end up
re-rolling forever. I'm no expert but I'm hoping doing another if (engine
defines block_checking) on every I/O won't have too much of a hit.

Perhaps we just start with "fix" mode, see how that goes, and move on from
there? Something else that could get hairy is verify jobs that are separate
from their write job, because the offsets generated during a subsequent
verify will not necessarily be invalid at the same points or changed the
same way. I suppose inline verify should work assuming write regions
haven't been reset due to becoming full.

--
Sitsofe | http://sucs.org/~sits/

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ZBC/FLEX FIO addition ideas
  2018-03-17  7:55 ` Sitsofe Wheeler
@ 2018-03-20  2:21 ` Bart Van Assche
  2018-03-23 17:30 ` Phillip Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2018-03-20 2:21 UTC (permalink / raw)
To: sitsofe@gmail.com
Cc: Kris Davis, fio@vger.kernel.org, Jason Jorgensen, phillip.a.chen@seagate.com

On Sat, 2018-03-17 at 07:55 +0000, Sitsofe Wheeler wrote:
> I don't have access to any ZBC devices but I looked over the patch.
> This seems to be a blend of both of Phillip's suggestions and I
> quite like it because it centralizes most of the logic within the
> engine itself bar the offset fixing.

Thanks for having taken a look :-)

> Something that could help is
> if ioengines were able to specify a callback to say whether an offset
> + size is valid or whether the ioengine wants to "fix" them. If it
> says the offset is invalid we just quietly pretend we did the I/O,
> reduce the amount of I/O we have to do on this loop and generate the
> next offset + size. If it chooses to just fix it then we just use the
> different values it fixed the I/O to (assuming it fits alignment,
> minimum block sizes etc). How do people see fix mode interacting with
> things like bssplit? Do we just ban bssplit when in this mode for now?
>
> Most engines could leave these unset but it would be perfect for this
> one and it would stop the typical path having to depend on this
> particular ioengine being configured in. If "fix things up" mode is set
> then I suppose fio could then go on to ban things like lfsr, randommap
> and random_distribution. I guess one snag with re-rolling in "valid"
> mode is that you could end up re-rolling forever. I'm no expert but
> I'm hoping doing another if (engine defines block_checking) on every
> I/O won't have too much of a hit.
>
> Perhaps we just start with "fix" mode, see how that goes, and move on
> from there?
> Something else that could get hairy is verify jobs that
> are separate from their write job, because the offsets generated during
> a subsequent verify will not necessarily be invalid at the same points
> or changed the same way. I suppose inline verify should work assuming
> write regions haven't been reset due to becoming full.

It's not clear to me why you would like to add such logic in the I/O
engines. Shouldn't such logic rather be added in a new I/O profile such
that users who want to run ZBC tests have the freedom of choosing an I/O
engine?

I'm not sure that a "fix" mode would work best. Phillip Chen has mentioned
that we need a mode in which the user can control the number of open zones.
If a random block were generated first and the open zone limit applied
after that offset has been generated, then most of the generated random
blocks would have to be discarded, since the number of open zones is
typically much smaller than the number of zones supported by a disk. I'm
currently looking at how to let I/O profiles influence the behavior of
get_next_block() in such a way that the number of open zones can be limited
by an I/O profile without having to reimplement fio core logic in an I/O
profile.

Regarding implementing write verify: the code that I posted does not yet
support verifying written data. However, it's not that hard to modify that
code such that verifying written data becomes possible. My proposal is to
use the following approach:
- Before writing starts, reset the zones that fio will write to, to avoid
  having to reset a zone in the middle of the write phase, something that
  would result in data loss.
- In a single zone, write at the write pointer. This will result in
  sequential writes per zone from the start to the end of the zone.
- When verifying data, read the zone from start to end such that read and
  write offsets match.

Thanks,

Bart.

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ZBC/FLEX FIO addition ideas
  2018-03-20  2:21 ` Bart Van Assche
@ 2018-03-23 17:30 ` Phillip Chen
  2018-03-23 17:35 ` Bart Van Assche
  0 siblings, 1 reply; 12+ messages in thread
From: Phillip Chen @ 2018-03-23 17:30 UTC (permalink / raw)
To: Bart Van Assche
Cc: sitsofe@gmail.com, Kris Davis, fio@vger.kernel.org, Jason Jorgensen

Hello Bart,

I've been trying out the ZBC changes with a few drives and workloads, and
the only limitation I've found so far is that random write workloads with a
queue depth > 1 seem to write beyond the write pointer. This isn't
particularly surprising, but I wanted to double check if this is an
expected limitation, and if so, are there plans to support queued random
writes in the future? Other than that, the changes seem to be working just
as advertised. I think your zone management system will make writing a zone
random workload fairly straightforward. Thank you for the update!

Sitsofe, would you be interested in being sent a ZBC drive? If you would
like to give me an address, we can get one sent to you to help with
validation.

Let me know what you'd like,
Phillip

On Mon, Mar 19, 2018 at 8:21 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Sat, 2018-03-17 at 07:55 +0000, Sitsofe Wheeler wrote:
>> I don't have access to any ZBC devices but I looked over the patch.
>> This seems to be a blend of both of Phillip's suggestions and I
>> quite like it because it centralizes most of the logic within the
>> engine itself bar the offset fixing.
>
> Thanks for having taken a look :-)
>
>> Something that could help is
>> if ioengines were able to specify a callback to say whether an offset
>> + size is valid or whether the ioengine wants to "fix" them. If it
>> says the offset is invalid we just quietly pretend we did the I/O,
>> reduce the amount of I/O we have to do on this loop and generate the
>> next offset + size.
>> If it chooses to just fix it then we just use the
>> different values it fixed the I/O to (assuming it fits alignment,
>> minimum block sizes etc). How do people see fix mode interacting with
>> things like bssplit? Do we just ban bssplit when in this mode for now?
>>
>> Most engines could leave these unset but it would be perfect for this
>> one and it would stop the typical path having to depend on this
>> particular ioengine being configured in. If "fix things up" mode is set
>> then I suppose fio could then go on to ban things like lfsr, randommap
>> and random_distribution. I guess one snag with re-rolling in "valid"
>> mode is that you could end up re-rolling forever. I'm no expert but
>> I'm hoping doing another if (engine defines block_checking) on every
>> I/O won't have too much of a hit.
>>
>> Perhaps we just start with "fix" mode, see how that goes, and move on
>> from there? Something else that could get hairy is verify jobs that
>> are separate from their write job, because the offsets generated during
>> a subsequent verify will not necessarily be invalid at the same points
>> or changed the same way. I suppose inline verify should work assuming
>> write regions haven't been reset due to becoming full.
>
> It's not clear to me why you would like to add such logic in the I/O
> engines. Shouldn't such logic rather be added in a new I/O profile such
> that users who want to run ZBC tests have the freedom of choosing an I/O
> engine?
>
> I'm not sure that a "fix" mode would work best. Phillip Chen has
> mentioned that we need a mode in which the user can control the number of
> open zones. If a random block were generated first and the open zone
> limit applied after that offset has been generated, then most of
> the generated random blocks would have to be discarded, since the number
> of open zones is typically much smaller than the number of zones
> supported by a disk.
> I'm currently looking at how to let I/O profiles
> influence the behavior of get_next_block() in such a way that the number
> of open zones can be limited by an I/O profile without having to
> reimplement fio core logic in an I/O profile.
>
> Regarding implementing write verify: the code that I posted does not yet
> support verifying written data. However, it's not that hard to modify that
> code such that verifying written data becomes possible. My proposal is to
> use the following approach:
> - Before writing starts, reset the zones that fio will write to, to avoid
>   having to reset a zone in the middle of the write phase, something
>   that would result in data loss.
> - In a single zone, write at the write pointer. This will result in
>   sequential writes per zone from the start to the end of the zone.
> - When verifying data, read the zone from start to end such that read and
>   write offsets match.
>
> Thanks,
>
> Bart.

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ZBC/FLEX FIO addition ideas
  2018-03-23 17:30 ` Phillip Chen
@ 2018-03-23 17:35 ` Bart Van Assche
  0 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2018-03-23 17:35 UTC (permalink / raw)
To: phillip.a.chen@seagate.com
Cc: sitsofe@gmail.com, Kris Davis, Jason Jorgensen, fio@vger.kernel.org

On Fri, 2018-03-23 at 11:30 -0600, Phillip Chen wrote:
> I've been trying out the ZBC changes with a few drives and workloads,
> and the only limitation I've found so far is that random write
> workloads with a queue depth > 1 seem to write beyond the write
> pointer. This isn't particularly surprising, but I wanted to double
> check if this is an expected limitation, and if so, are there plans to
> support queued random writes in the future? Other than that, the
> changes seem to be working just as advertised. I think your zone
> management system will make writing a zone random workload fairly
> straightforward. Thank you for the update!

Hello Phillip,

That is unintended behavior. A fix is ready and passes my own tests. I'm
also working on implementing additional functionality, namely to make it
possible to specify the maximum number of open zones when using fio to test
an SMR drive. I will let you know once all new functionality passes my
tests.

Bart.

^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2018-03-23 17:35 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2018-03-06 18:59 ZBC/FLEX FIO addition ideas Phillip Chen
2018-03-15 16:15 ` Kris Davis
2018-03-15 16:30 ` Sitsofe Wheeler
2018-03-15 16:43 ` Bart Van Assche
2018-03-15 18:06 ` Phillip Chen
2018-03-15 18:38 ` Bart Van Assche
2018-03-15 19:40 ` Phillip Chen
2018-03-15 21:01 ` Bart Van Assche
2018-03-17  7:55 ` Sitsofe Wheeler
2018-03-20  2:21 ` Bart Van Assche
2018-03-23 17:30 ` Phillip Chen
2018-03-23 17:35 ` Bart Van Assche