* [RFC PATCH 0/1] dm: add dm-dust, a bad sector emulator
@ 2019-01-07 19:31 Bryan Gurney
2019-01-07 19:31 ` [RFC PATCH 1/1] dm: add dust target Bryan Gurney
0 siblings, 1 reply; 11+ messages in thread
From: Bryan Gurney @ 2019-01-07 19:31 UTC (permalink / raw)
To: snitzer, agk, dm-devel
Cc: jdorminy, jshimkus, tjaskiew, jpittman, Bryan Gurney
This patch adds the device-mapper target dm-dust, which allows the
arbitrary designation of "bad" sectors, at an arbitrary time.
This target behaves similarly to a linear target, with the exception
of the minimum and optimal IO sizes being configurable to 512 or
4096 bytes. At a given time, the user can send a message to the
target to start failing read requests on specific blocks. When the
failure behavior is enabled, reads of blocks configured "bad" will
fail with EIO.
Writes of blocks configured "bad" will result in the following:
1. Remove the block from the "bad block list".
sector).
2. Successfully complete the write.
After this point, the block will successfully contain the written
data, and will service reads and writes normally. This emulates the
behavior of a "remapped sector" on a hard disk drive.
dm-dust provides logging of which blocks have been added or removed
to the "bad block list", as well as logging when a block has been
removed from the bad block list. These messages can be used
alongside the messages from the driver using a dm-dust device to
analyze the driver's behavior when a read fails at a given time.
(This logging can be reduced via a "quiet" mode, if desired.)
Bryan Gurney (1):
dm: add dust target
Documentation/device-mapper/dm-dust.txt | 256 ++++++++++++
drivers/md/Kconfig | 9 +
drivers/md/Makefile | 1 +
drivers/md/dm-dust.c | 499 ++++++++++++++++++++++++
4 files changed, 765 insertions(+)
create mode 100644 Documentation/device-mapper/dm-dust.txt
create mode 100644 drivers/md/dm-dust.c
--
2.17.2
^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH 1/1] dm: add dust target
2019-01-07 19:31 [RFC PATCH 0/1] dm: add dm-dust, a bad sector emulator Bryan Gurney
@ 2019-01-07 19:31 ` Bryan Gurney
2019-01-08 0:10 ` Benjamin Marzinski
0 siblings, 1 reply; 11+ messages in thread
From: Bryan Gurney @ 2019-01-07 19:31 UTC (permalink / raw)
To: snitzer, agk, dm-devel
Cc: jdorminy, jshimkus, tjaskiew, jpittman, Bryan Gurney
Add the dm-dust target, which simulates the behavior of bad sectors
at arbitrary locations, and the ability to enable the emulation of
the read failures at an arbitrary time.
This target behaves similarly to a linear target, with the exception
of the minimum and optimal IO sizes being configurable to 512 or
4096 bytes.
At a given time, the user can send a message to the target to start
failing read requests on specific blocks. When the failure behavior
is enabled, reads of blocks configured "bad" will fail with EIO.
Writes of blocks configured "bad" will result in the following:
1. Remove the block from the "bad block list".
2. Successfully complete the write.
After this point, the block will successfully contain the written
data, and will service reads and writes normally. This emulates the
behavior of a "remapped sector" on a hard disk drive.
Signed-off-by: Joe Shimkus <jshimkus@redhat.com>
Signed-off-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Thomas Jaskiewicz <tjaskiew@redhat.com>
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
---
Documentation/device-mapper/dm-dust.txt | 256 ++++++++++++
drivers/md/Kconfig | 9 +
drivers/md/Makefile | 1 +
drivers/md/dm-dust.c | 499 ++++++++++++++++++++++++
4 files changed, 765 insertions(+)
create mode 100644 Documentation/device-mapper/dm-dust.txt
create mode 100644 drivers/md/dm-dust.c
diff --git a/Documentation/device-mapper/dm-dust.txt b/Documentation/device-mapper/dm-dust.txt
new file mode 100644
index 000000000000..7d7740d5b1c9
--- /dev/null
+++ b/Documentation/device-mapper/dm-dust.txt
@@ -0,0 +1,256 @@
+dm-dust
+=======
+
+This target emulates the behavior of bad sectors at arbitrary
+locations, and the ability to enable the emulation of the failures
+at an arbitrary time.
+
+This target behaves similarly to a linear target, except that the
+minimum and optimal io sizes are DUST_BLOCK_SIZE bytes (512 or
+4096 bytes). At a given time, the user can send a message to the
+target to start failing read requests on specific blocks (to emulate
+the behavior of a hard disk drive with bad sectors).
+
+When the failure behavior is enabled (i.e.: when the output of
+"dmsetup status" displays "fail_read_on_bad_block"), reads of blocks
+in the "bad block list" will fail with EIO ("Input/output error").
+
+Writes of blocks in the "bad block list will result in the following:
+
+1. Remove the block from the "bad block list".
+2. Successfully complete the write.
+
+This emulates the "remapped sector" behavior of a drive with bad
+sectors.
+
+Normally, a drive that is encountering bad sectors will most likely
+encounter more bad sectors, at an unknown time or location.
+With dm-dust, the user can use the "addbadblock" and "removebadblock"
+messages to add arbitrary bad blocks at new locations, and the
+"enable" and "disable" messages to modulate the state of whether the
+configured "bad blocks" will be treated as bad, or bypassed.
+This allows the pre-writing of test data and metadata prior to
+simulating a "failure" event where bad sectors start to appear.
+
+Table parameters:
+-----------------
+<device_path> <blksz>
+
+Mandatory parameters:
+ <device_path>: path to the block device.
+ <blksz>: block size (valid choices: 512, 4096).
+
+Usage instructions:
+-------------------
+
+First, find the size (in 512-byte sectors) of the device to be used:
+
+$ sudo blockdev --getsz /dev/vdb1
+33552384
+
+After building the module, load the module:
+
+$ sudo modprobe pbitdust
+
+Create the device:
+(For a device with a block size of 512 bytes)
+$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 512'
+
+(For a device with a block size of 4096 bytes)
+$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 4096'
+
+Check the status of the read behavior ("bypass" indicates that all I/O
+will be passed through to the underlying device):
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 bypass
+
+$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct
+128+0 records in
+128+0 records out
+
+$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
+128+0 records in
+128+0 records out
+
+Adding and removing bad blocks:
+-------------------------------
+
+At any time (i.e.: whether the device has the "bad block" emulation
+enabled or disabled), bad blocks may be added or removed from the
+device via the "addbadblock" and "removebadblock" messages:
+
+$ sudo dmsetup message dust1 0 addbadblock 60
+kernel: device-mapper: dust: badblock added at block 60
+
+$ sudo dmsetup message dust1 0 addbadblock 67
+kernel: device-mapper: dust: badblock added at block 67
+
+$ sudo dmsetup message dust1 0 addbadblock 72
+kernel: device-mapper: dust: badblock added at block 72
+
+These bad blocks will be stored in the "bad block list".
+While the device is in "bypass" mode, reads and writes will succeed:
+
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 bypass
+
+Enabling block read failures:
+-----------------------------
+
+To enable the "fail read on bad block" behavior, send the "enable" message:
+
+$ sudo dmsetup message dust1 0 enable
+kernel: device-mapper: dust: enabling read failures on bad sectors
+
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 fail_read_on_bad_block
+
+With the device in "fail read on bad block" mode, attempting to read a
+block will encounter an "Input/output error":
+
+$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct
+dd: error reading '/dev/mapper/dust1': Input/output error
+0+0 records in
+0+0 records out
+0 bytes copied, 0.00040651 s, 0.0 kB/s
+
+...and writing to the bad blocks will remove the blocks from the list,
+therefore emulating the "remap" behavior of hard disk drives:
+
+$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
+128+0 records in
+128+0 records out
+
+kernel: device-mapper: dust: block 60 removed from badblocklist by write
+kernel: device-mapper: dust: block 67 removed from badblocklist by write
+kernel: device-mapper: dust: block 72 removed from badblocklist by write
+kernel: device-mapper: dust: block 87 removed from badblocklist by write
+
+Bad block add/remove error handling:
+------------------------------------
+
+Attempting to add a bad block that already exists in the list will
+result in an "Invalid argument" error, as well as a helpful message:
+
+$ sudo dmsetup message dust1 0 addbadblock 88
+device-mapper: message ioctl on dust1 failed: Invalid argument
+kernel: device-mapper: dust: block 88 already in badblocklist
+
+Attempting to remove a bad block that doesn't exist in the list will
+result in an "Invalid argument" error, as well as a helpful message:
+
+$ sudo dmsetup message dust1 0 removebadblock 87
+device-mapper: message ioctl on dust1 failed: Invalid argument
+kernel: device-mapper: dust: block 87 not found in badblocklist
+
+Counting the number of bad blocks in the bad block list:
+--------------------------------------------------------
+
+To count the number of bad blocks configured in the device, run the
+following message command:
+
+$ sudo dmsetup message dust1 0 countbadblocks
+
+A message will print with the number of bad blocks currently
+configured on the device:
+
+kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found
+
+Clearing the bad block list:
+----------------------------
+
+To clear the bad block list (without needing to individually run
+a "removebadblock" message command for every block), run the
+following message command:
+
+$ sudo dmsetup message dust1 0 clearbadblocks
+
+After clearing the bad block list, the following message will appear:
+
+kernel: device-mapper: dust: clearbadblocks: badblocks cleared
+
+If there were no bad blocks to clear, the following message will
+appear:
+
+kernel: device-mapper: dust: clearbadblocks: no badblocks found
+
+Message commands list:
+----------------------
+
+Below is a list of the messages that can be sent to a dust device:
+
+Operations on blocks (requires a <blknum> argument):
+
+addbadblock <blknum>
+queryblock <blknum>
+removebadblock <blknum>
+
+...where <blknum> is a block number within range of the device
+ (corresponding to the block size of the device, 512 or 4096 bytes.)
+
+Single argument message commands:
+
+countbadblocks
+clearbadblocks
+disable
+enable
+quiet
+
+Device removal:
+---------------
+
+When finished, remove the device via the "dmsetup remove" command:
+
+$ sudo dmsetup remove dust1
+
+Quiet mode:
+-----------
+
+On test runs with many bad blocks, it may be desirable to avoid
+excessive logging (from bad blocks added, removed, or "remapped").
+This can be done by enabling "quiet mode" via the following message:
+
+$ sudo dmsetup message dust1 0 quiet
+
+This will suppress log messages from add / remove / removed by write
+operations. Log messages from "countbadblocks" or "queryblock"
+message commands will still print in quiet mode.
+
+The status of quiet mode can be seen by running "dmsetup status":
+
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 fail_read_on_bad_block quiet
+
+To disable quiet mode, send the "quiet" message again:
+
+$ sudo dmsetup message dust1 0 quiet
+
+$ sudo dmsetup status dust1
+0 33552384 dust 252:17 fail_read_on_bad_block verbose
+
+(The presence of "verbose" indicates normal logging.)
+
+"Why not...?"
+-------------
+
+scsi_debug has a "medium error" mode that can fail reads on one
+specified sector (sector 0x1234, hardcoded in the source code), but
+it uses RAM for the persistent storage, which drastically decreases
+the potential device size.
+
+dm-flakey fails all I/O from all block locations at a specified time
+frequency, and not a given point in time.
+
+When a bad sector occurs on a hard disk drive, reads to that sector
+are failed by the device, usually resulting in an error code of EIO
+("I/O error") or ENODATA ("No data available"). However, a write to
+the sector may succeed, and result in the sector becoming readable
+after the device controller no longer experiences errors reading the
+sector (or after a reallocation of the sector). However, there may
+be bad sectors that occur on the device in the future, in a different,
+unpredictable location.
+
+This target seeks to provide a device that can exhibit the behavior
+of a bad sector at a known sector location, at a known time, based
+on a large storage device (at least tens of gigabytes, not occupying
+system memory).
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 3db222509e44..5370473a0999 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -436,6 +436,15 @@ config DM_DELAY
If unsure, say N.
+config DM_DUST
+ tristate "Bad sector simulation target"
+ depends on BLK_DEV_DM
+ ---help---
+ A target that simulates bad sector behavior, and
+ limited optimal_io_size behavior. Useful for testing.
+
+ If unsure, say N.
+
config DM_UEVENT
bool "DM uevents"
depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 822f4e8753bc..75bf0391127a 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
+obj-$(CONFIG_DM_DUST) += dm-dust.o
obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
diff --git a/drivers/md/dm-dust.c b/drivers/md/dm-dust.c
new file mode 100644
index 000000000000..94b43d79b186
--- /dev/null
+++ b/drivers/md/dm-dust.c
@@ -0,0 +1,499 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018 Red Hat, Inc.
+ *
+ * This is a test "dust" device, which fails reads on specified
+ * sectors, emulating the behavior of a hard disk drive sending
+ * a "Read Medium Error" sense.
+ *
+ */
+
+#include <linux/device-mapper.h>
+#include <linux/module.h>
+#include <linux/version.h>
+#include <linux/rbtree.h>
+
+#define DM_MSG_PREFIX "dust"
+
+struct badblock {
+ struct rb_node node;
+ sector_t bb;
+};
+
+struct dust_device {
+ struct dm_dev *dev;
+ bool fail_read_on_bb;
+ struct rb_root badblocklist;
+ unsigned long long badblock_count;
+ spinlock_t dust_lock;
+ unsigned int blksz;
+ unsigned int sect_per_block;
+ bool quiet_mode;
+};
+
+struct badblock *dust_rb_search(struct rb_root *root, sector_t blk)
+{
+ struct rb_node *node = root->rb_node;
+
+ while (node) {
+ struct badblock *bblk = rb_entry(node, struct badblock, node);
+
+ if (bblk->bb > blk)
+ node = node->rb_left;
+ else if (bblk->bb < blk)
+ node = node->rb_right;
+ else
+ return bblk;
+ }
+ return NULL;
+}
+
+bool dust_rb_insert(struct rb_root *root, struct badblock *new)
+{
+ struct badblock *bblk;
+ struct rb_node **link = &root->rb_node, *parent = NULL;
+ sector_t value = new->bb;
+
+ while (*link) {
+ parent = *link;
+ bblk = rb_entry(parent, struct badblock, node);
+
+ if (bblk->bb > value)
+ link = &(*link)->rb_left;
+ else if (bblk->bb < value)
+ link = &(*link)->rb_right;
+ else
+ return false;
+ }
+
+ rb_link_node(&new->node, parent, link);
+ rb_insert_color(&new->node, root);
+ return true;
+}
+
+static int dust_remove_block(struct dust_device *dd, unsigned long long block)
+{
+ struct badblock *bblock;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dd->dust_lock, flags);
+ bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
+
+ if (bblock == NULL) {
+ if (!dd->quiet_mode)
+ DMERR("removebadblock: block %llu not found in badblocklist",
+ block);
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ return -EINVAL;
+ }
+
+ rb_erase(&bblock->node, &dd->badblocklist);
+ dd->badblock_count--;
+ if (!dd->quiet_mode)
+ DMINFO("removebadblock: badblock removed at block %llu", block);
+ kfree(bblock);
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ return 0;
+}
+
+static int dust_add_block(struct dust_device *dd, unsigned long long block)
+{
+ struct badblock *bblock;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dd->dust_lock, flags);
+ bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
+
+ if (bblock != NULL) {
+ if (!dd->quiet_mode)
+ DMERR("addbadblock: block %llu already in badblocklist",
+ block);
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ return -EINVAL;
+ }
+
+ bblock = kmalloc(sizeof(*bblock), GFP_KERNEL);
+ if (bblock == NULL) {
+ if (!dd->quiet_mode)
+ DMERR("addbadblock: badblock allocation failed");
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ return -ENOMEM;
+ }
+
+ bblock->bb = block * dd->sect_per_block;
+ dust_rb_insert(&dd->badblocklist, bblock);
+ dd->badblock_count++;
+ if (!dd->quiet_mode)
+ DMINFO("addbadblock: badblock added at block %llu", block);
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ return 0;
+}
+
+static int dust_query_block(struct dust_device *dd, unsigned long long block)
+{
+ struct badblock *bblock;
+ unsigned long flags;
+
+ spin_lock_irqsave(&dd->dust_lock, flags);
+ bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
+
+ if (bblock != NULL)
+ DMINFO("queryblock: block %llu found in badblocklist", block);
+ else
+ DMINFO("queryblock: block %llu not found in badblocklist",
+ block);
+
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ return 0;
+}
+
+static int dust_map_read(struct dust_device *dd, sector_t thisblock,
+ bool fail_read_on_bb)
+{
+ unsigned long flags;
+ struct badblock *bblk;
+
+ spin_lock_irqsave(&dd->dust_lock, flags);
+ bblk = dust_rb_search(&dd->badblocklist, thisblock);
+
+ if (fail_read_on_bb && bblk) {
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ return DM_MAPIO_KILL;
+ }
+
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ return DM_MAPIO_REMAPPED;
+}
+
+static int dust_map_write(struct dust_device *dd, sector_t thisblock,
+ bool fail_read_on_bb)
+{
+ unsigned long flags;
+ struct badblock *bblk;
+
+ spin_lock_irqsave(&dd->dust_lock, flags);
+ bblk = dust_rb_search(&dd->badblocklist, thisblock);
+
+ if (fail_read_on_bb && bblk) {
+ rb_erase(&bblk->node, &dd->badblocklist);
+ dd->badblock_count--;
+ kfree(bblk);
+ if (!dd->quiet_mode)
+ DMINFO("block %llu removed from badblocklist by write",
+ (unsigned long long)thisblock / dd->sect_per_block);
+ }
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ return DM_MAPIO_REMAPPED;
+}
+
+static inline void dust_set_biosector(struct bio *bio, sector_t sector)
+{
+ bio->bi_iter.bi_sector = sector;
+}
+
+static inline sector_t dust_get_biosector(struct bio *bio)
+{
+ return bio->bi_iter.bi_sector;
+}
+
+static int __dust_clear_badblocks(struct rb_root *tree,
+ unsigned long long count)
+{
+ struct rb_node *node = NULL, *nnode = NULL;
+
+ nnode = rb_first(tree);
+ if (nnode == NULL) {
+ BUG_ON(count != 0);
+ return 0;
+ }
+
+ while (nnode) {
+ node = nnode;
+ nnode = rb_next(node);
+ rb_erase(node, tree);
+ count--;
+ kfree(node);
+ }
+ BUG_ON(count != 0);
+ BUG_ON(tree->rb_node != NULL);
+ return 1;
+}
+
+static int dust_clear_badblocks(struct dust_device *dd)
+{
+ unsigned long flags;
+ struct rb_root badblocklist;
+ unsigned long long badblock_count;
+
+ spin_lock_irqsave(&dd->dust_lock, flags);
+ badblocklist = dd->badblocklist;
+ badblock_count = dd->badblock_count;
+ dd->badblocklist = RB_ROOT;
+ dd->badblock_count = 0;
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+
+ if (__dust_clear_badblocks(&badblocklist, badblock_count) == 0)
+ DMINFO("clearbadblocks: no badblocks found");
+ else
+ DMINFO("clearbadblocks: badblocks cleared");
+ return 0;
+}
+
+/*
+ * Target parameters:
+ *
+ * <device_path> <blksz>
+ *
+ * device_path: path to the block device
+ * blksz: block size (valid choices: 512, 4096)
+ */
+static int dust_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct dust_device *dd;
+ const char *device_path = argv[0];
+ unsigned int blksz;
+ unsigned int sect_per_block;
+
+ if (argc != 2) {
+ ti->error = "requires exactly 2 arguments";
+ return -EINVAL;
+ }
+
+ if (kstrtouint(argv[1], 10, &blksz) || !blksz) {
+ ti->error =
+ "Invalid block size parameter -- please enter 512 or 4096";
+ return -EINVAL;
+ }
+ if (blksz != 512 && blksz != 4096) {
+ ti->error = "Invalid block size -- please enter 512 or 4096";
+ return -EINVAL;
+ }
+
+ sect_per_block = (blksz >> SECTOR_SHIFT);
+
+ dd = kzalloc(sizeof(struct dust_device), GFP_KERNEL);
+
+ if (dd == NULL) {
+ ti->error = "Cannot allocate context";
+ return -ENOMEM;
+ }
+
+ if (dm_get_device(ti, device_path, dm_table_get_mode(ti->table),
+ &dd->dev)) {
+ ti->error = "Device lookup failed";
+ kfree(dd);
+ return -EINVAL;
+ }
+
+ dd->sect_per_block = sect_per_block;
+ dd->blksz = blksz;
+
+ // Whether to fail a read on a "bad" block.
+ // Defaults to false; enabled later by message.
+ dd->fail_read_on_bb = false;
+
+ // Bad block list rbtree.
+ // Just initialize for now.
+ dd->badblocklist = RB_ROOT;
+ dd->badblock_count = 0;
+ spin_lock_init(&dd->dust_lock);
+
+ dd->quiet_mode = false;
+
+ BUG_ON(dm_set_target_max_io_len(ti, dd->sect_per_block) != 0);
+
+ ti->num_discard_bios = 1;
+ ti->num_flush_bios = 1;
+ ti->private = dd;
+ return 0;
+}
+
+static void dust_dtr(struct dm_target *ti)
+{
+ struct dust_device *dd = ti->private;
+
+ __dust_clear_badblocks(&dd->badblocklist, dd->badblock_count);
+ dm_put_device(ti, dd->dev);
+ kfree(dd);
+}
+
+static void dust_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+ struct dust_device *dd = ti->private;
+
+ limits->logical_block_size = dd->blksz;
+ limits->physical_block_size = dd->blksz;
+
+ // The minimum io size for random io
+ blk_limits_io_min(limits, dd->blksz);
+ // The optimal io size for streamed/sequential io
+ blk_limits_io_opt(limits, dd->blksz);
+}
+
+static int dust_map(struct dm_target *ti, struct bio *bio)
+{
+ struct dust_device *dd;
+ sector_t thisblock;
+ int ret;
+
+ dd = ti->private;
+ bio_set_dev(bio, dd->dev->bdev);
+ dust_set_biosector(bio, dm_target_offset(ti, dust_get_biosector(bio)));
+ thisblock = dust_get_biosector(bio);
+
+ if (bio_data_dir(bio) == READ)
+ ret = dust_map_read(dd, thisblock, dd->fail_read_on_bb);
+ if (bio_data_dir(bio) == WRITE)
+ ret = dust_map_write(dd, thisblock, dd->fail_read_on_bb);
+
+ return ret;
+}
+
+static int dust_message(struct dm_target *ti, unsigned int argc, char **argv,
+ char *result_buf, unsigned int maxlen)
+{
+ struct dust_device *dd = ti->private;
+ sector_t size = i_size_read(dd->dev->bdev->bd_inode) >> SECTOR_SHIFT;
+ bool invalid_msg = false;
+ int result = -EINVAL;
+ unsigned long long tmp, block;
+ unsigned long flags;
+ char dummy;
+
+ if (argc == 1) {
+ if (!strcasecmp(argv[0], "addbadblock") ||
+ !strcasecmp(argv[0], "removebadblock") ||
+ !strcasecmp(argv[0], "queryblock")) {
+ DMERR("%s requires an additional argument", argv[0]);
+ } else if (!strcasecmp(argv[0], "disable")) {
+ DMINFO("disabling read failures on bad sectors");
+ dd->fail_read_on_bb = false;
+ result = 0;
+ } else if (!strcasecmp(argv[0], "enable")) {
+ DMINFO("enabling read failures on bad sectors");
+ dd->fail_read_on_bb = true;
+ result = 0;
+ } else if (!strcasecmp(argv[0], "countbadblocks")) {
+ spin_lock_irqsave(&dd->dust_lock, flags);
+ DMINFO("countbadblocks: %llu badblock(s) found",
+ dd->badblock_count);
+ spin_unlock_irqrestore(&dd->dust_lock, flags);
+ result = 0;
+ } else if (!strcasecmp(argv[0], "clearbadblocks")) {
+ result = dust_clear_badblocks(dd);
+ } else if (!strcasecmp(argv[0], "quiet")) {
+ if (!dd->quiet_mode)
+ dd->quiet_mode = true;
+ else
+ dd->quiet_mode = false;
+ result = 0;
+ } else {
+ invalid_msg = true;
+ }
+ } else if (argc == 2) {
+ if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1)
+ return result;
+
+ block = tmp;
+
+ if (block > size / dd->sect_per_block || block < 0) {
+ DMERR("selected block value out of range");
+ return result;
+ }
+
+ if (!strcasecmp(argv[0], "addbadblock"))
+ result = dust_add_block(dd, block);
+ else if (!strcasecmp(argv[0], "removebadblock"))
+ result = dust_remove_block(dd, block);
+ else if (!strcasecmp(argv[0], "queryblock"))
+ result = dust_query_block(dd, block);
+ else
+ invalid_msg = true;
+
+ } else {
+ DMERR("invalid number of arguments '%d'", argc);
+ }
+
+ if (invalid_msg)
+ DMERR("unrecognized dmsetup message '%s' received", argv[0]);
+
+ return result;
+}
+
+static void dust_status(struct dm_target *ti, status_type_t type,
+ unsigned int status_flags, char *result, unsigned int maxlen)
+{
+ struct dust_device *dd = ti->private;
+ unsigned int sz = 0; // used by the DMEMIT macro
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ DMEMIT("%s %s %s", dd->dev->name,
+ dd->fail_read_on_bb ? "fail_read_on_bad_block" :
+ "bypass",
+ dd->quiet_mode ? "quiet" : "verbose");
+ break;
+
+ case STATUSTYPE_TABLE:
+ DMEMIT("%s %u", dd->dev->name, dd->blksz);
+ break;
+ }
+}
+
+static int dust_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
+{
+ struct dust_device *dd = ti->private;
+ struct dm_dev *dev = dd->dev;
+
+ *bdev = dev->bdev;
+
+ // Only pass ioctls through if the device sizes match exactly.
+ if (ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT)
+ return 1;
+
+ return 0;
+}
+
+static int dust_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn,
+ void *data)
+{
+ struct dust_device *dd = ti->private;
+
+ return fn(ti, dd->dev, 0, ti->len, data);
+}
+
+static struct target_type dust_target = {
+ .name = "dust",
+ .version = { 1, 0, 0 },
+ .module = THIS_MODULE,
+ .ctr = dust_ctr,
+ .dtr = dust_dtr,
+ .iterate_devices = dust_iterate_devices,
+ .io_hints = dust_io_hints,
+ .map = dust_map,
+ .message = dust_message,
+ .status = dust_status,
+ .prepare_ioctl = dust_prepare_ioctl,
+};
+
+int __init dm_dust_init(void)
+{
+ int result = dm_register_target(&dust_target);
+
+ if (result < 0)
+ DMERR("dm_register_target failed %d", result);
+
+ return result;
+}
+
+void __exit dm_dust_exit(void)
+{
+ dm_unregister_target(&dust_target);
+}
+
+module_init(dm_dust_init);
+module_exit(dm_dust_exit);
+
+MODULE_DESCRIPTION(DM_NAME " dust testing device");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
--
2.17.2
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/1] dm: add dust target
2019-01-07 19:31 ` [RFC PATCH 1/1] dm: add dust target Bryan Gurney
@ 2019-01-08 0:10 ` Benjamin Marzinski
2019-01-08 15:30 ` Bryan Gurney
0 siblings, 1 reply; 11+ messages in thread
From: Benjamin Marzinski @ 2019-01-08 0:10 UTC (permalink / raw)
To: Bryan Gurney
Cc: snitzer, jdorminy, jshimkus, dm-devel, tjaskiew, jpittman, agk
On Mon, Jan 07, 2019 at 02:31:23PM -0500, Bryan Gurney wrote:
> Add the dm-dust target, which simulates the behavior of bad sectors
> at arbitrary locations, and the ability to enable the emulation of
> the read failures at an arbitrary time.
>
> This target behaves similarly to a linear target, with the exception
> of the minimum and optimal IO sizes being configurable to 512 or
> 4096 bytes.
>
> At a given time, the user can send a message to the target to start
> failing read requests on specific blocks. When the failure behavior
> is enabled, reads of blocks configured "bad" will fail with EIO.
>
> Writes of blocks configured "bad" will result in the following:
>
> 1. Remove the block from the "bad block list".
> 2. Successfully complete the write.
>
> After this point, the block will successfully contain the written
> data, and will service reads and writes normally. This emulates the
> behavior of a "remapped sector" on a hard disk drive.
>
> Signed-off-by: Joe Shimkus <jshimkus@redhat.com>
> Signed-off-by: John Dorminy <jdorminy@redhat.com>
> Signed-off-by: John Pittman <jpittman@redhat.com>
> Signed-off-by: Thomas Jaskiewicz <tjaskiew@redhat.com>
> Signed-off-by: Bryan Gurney <bgurney@redhat.com>
> ---
> Documentation/device-mapper/dm-dust.txt | 256 ++++++++++++
> drivers/md/Kconfig | 9 +
> drivers/md/Makefile | 1 +
> drivers/md/dm-dust.c | 499 ++++++++++++++++++++++++
> 4 files changed, 765 insertions(+)
> create mode 100644 Documentation/device-mapper/dm-dust.txt
> create mode 100644 drivers/md/dm-dust.c
>
> diff --git a/Documentation/device-mapper/dm-dust.txt b/Documentation/device-mapper/dm-dust.txt
> new file mode 100644
> index 000000000000..7d7740d5b1c9
> --- /dev/null
> +++ b/Documentation/device-mapper/dm-dust.txt
> @@ -0,0 +1,256 @@
> +dm-dust
> +=======
> +
> +This target emulates the behavior of bad sectors at arbitrary
> +locations, and the ability to enable the emulation of the failures
> +at an arbitrary time.
> +
> +This target behaves similarly to a linear target, except that the
> +minimum and optimal io sizes are DUST_BLOCK_SIZE bytes (512 or
> +4096 bytes). At a given time, the user can send a message to the
> +target to start failing read requests on specific blocks (to emulate
> +the behavior of a hard disk drive with bad sectors).
> +
> +When the failure behavior is enabled (i.e.: when the output of
> +"dmsetup status" displays "fail_read_on_bad_block"), reads of blocks
> +in the "bad block list" will fail with EIO ("Input/output error").
> +
> +Writes of blocks in the "bad block list will result in the following:
> +
> +1. Remove the block from the "bad block list".
> +2. Successfully complete the write.
> +
> +This emulates the "remapped sector" behavior of a drive with bad
> +sectors.
> +
> +Normally, a drive that is encountering bad sectors will most likely
> +encounter more bad sectors, at an unknown time or location.
> +With dm-dust, the user can use the "addbadblock" and "removebadblock"
> +messages to add arbitrary bad blocks at new locations, and the
> +"enable" and "disable" messages to modulate the state of whether the
> +configured "bad blocks" will be treated as bad, or bypassed.
> +This allows the pre-writing of test data and metadata prior to
> +simulating a "failure" event where bad sectors start to appear.
> +
> +Table parameters:
> +-----------------
> +<device_path> <blksz>
> +
> +Mandatory parameters:
> + <device_path>: path to the block device.
> + <blksz>: block size (valid choices: 512, 4096).
> +
> +Usage instructions:
> +-------------------
> +
> +First, find the size (in 512-byte sectors) of the device to be used:
> +
> +$ sudo blockdev --getsz /dev/vdb1
> +33552384
> +
> +After building the module, load the module:
> +
> +$ sudo modprobe pbitdust
> +
> +Create the device:
> +(For a device with a block size of 512 bytes)
> +$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 512'
> +
> +(For a device with a block size of 4096 bytes)
> +$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 4096'
> +
> +Check the status of the read behavior ("bypass" indicates that all I/O
> +will be passed through to the underlying device):
> +$ sudo dmsetup status dust1
> +0 33552384 dust 252:17 bypass
> +
> +$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct
> +128+0 records in
> +128+0 records out
> +
> +$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
> +128+0 records in
> +128+0 records out
> +
> +Adding and removing bad blocks:
> +-------------------------------
> +
> +At any time (i.e.: whether the device has the "bad block" emulation
> +enabled or disabled), bad blocks may be added or removed from the
> +device via the "addbadblock" and "removebadblock" messages:
> +
> +$ sudo dmsetup message dust1 0 addbadblock 60
> +kernel: device-mapper: dust: badblock added at block 60
> +
> +$ sudo dmsetup message dust1 0 addbadblock 67
> +kernel: device-mapper: dust: badblock added at block 67
> +
> +$ sudo dmsetup message dust1 0 addbadblock 72
> +kernel: device-mapper: dust: badblock added at block 72
> +
> +These bad blocks will be stored in the "bad block list".
> +While the device is in "bypass" mode, reads and writes will succeed:
> +
> +$ sudo dmsetup status dust1
> +0 33552384 dust 252:17 bypass
> +
> +Enabling block read failures:
> +-----------------------------
> +
> +To enable the "fail read on bad block" behavior, send the "enable" message:
> +
> +$ sudo dmsetup message dust1 0 enable
> +kernel: device-mapper: dust: enabling read failures on bad sectors
> +
> +$ sudo dmsetup status dust1
> +0 33552384 dust 252:17 fail_read_on_bad_block
> +
> +With the device in "fail read on bad block" mode, attempting to read a
> +block will encounter an "Input/output error":
> +
> +$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct
> +dd: error reading '/dev/mapper/dust1': Input/output error
> +0+0 records in
> +0+0 records out
> +0 bytes copied, 0.00040651 s, 0.0 kB/s
> +
> +...and writing to the bad blocks will remove the blocks from the list,
> +therefore emulating the "remap" behavior of hard disk drives:
> +
> +$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
> +128+0 records in
> +128+0 records out
> +
> +kernel: device-mapper: dust: block 60 removed from badblocklist by write
> +kernel: device-mapper: dust: block 67 removed from badblocklist by write
> +kernel: device-mapper: dust: block 72 removed from badblocklist by write
> +kernel: device-mapper: dust: block 87 removed from badblocklist by write
> +
> +Bad block add/remove error handling:
> +------------------------------------
> +
> +Attempting to add a bad block that already exists in the list will
> +result in an "Invalid argument" error, as well as a helpful message:
> +
> +$ sudo dmsetup message dust1 0 addbadblock 88
> +device-mapper: message ioctl on dust1 failed: Invalid argument
> +kernel: device-mapper: dust: block 88 already in badblocklist
> +
> +Attempting to remove a bad block that doesn't exist in the list will
> +result in an "Invalid argument" error, as well as a helpful message:
> +
> +$ sudo dmsetup message dust1 0 removebadblock 87
> +device-mapper: message ioctl on dust1 failed: Invalid argument
> +kernel: device-mapper: dust: block 87 not found in badblocklist
> +
> +Counting the number of bad blocks in the bad block list:
> +--------------------------------------------------------
> +
> +To count the number of bad blocks configured in the device, run the
> +following message command:
> +
> +$ sudo dmsetup message dust1 0 countbadblocks
> +
> +A message will print with the number of bad blocks currently
> +configured on the device:
> +
> +kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found
> +
> +Clearing the bad block list:
> +----------------------------
> +
> +To clear the bad block list (without needing to individually run
> +a "removebadblock" message command for every block), run the
> +following message command:
> +
> +$ sudo dmsetup message dust1 0 clearbadblocks
> +
> +After clearing the bad block list, the following message will appear:
> +
> +kernel: device-mapper: dust: clearbadblocks: badblocks cleared
> +
> +If there were no bad blocks to clear, the following message will
> +appear:
> +
> +kernel: device-mapper: dust: clearbadblocks: no badblocks found
> +
> +Message commands list:
> +----------------------
> +
> +Below is a list of the messages that can be sent to a dust device:
> +
> +Operations on blocks (requires a <blknum> argument):
> +
> +addbadblock <blknum>
> +queryblock <blknum>
> +removebadblock <blknum>
> +
> +...where <blknum> is a block number within range of the device
> + (corresponding to the block size of the device, 512 or 4096 bytes.)
> +
> +Single argument message commands:
> +
> +countbadblocks
> +clearbadblocks
> +disable
> +enable
> +quiet
> +
> +Device removal:
> +---------------
> +
> +When finished, remove the device via the "dmsetup remove" command:
> +
> +$ sudo dmsetup remove dust1
> +
> +Quiet mode:
> +-----------
> +
> +On test runs with many bad blocks, it may be desirable to avoid
> +excessive logging (from bad blocks added, removed, or "remapped").
> +This can be done by enabling "quiet mode" via the following message:
> +
> +$ sudo dmsetup message dust1 0 quiet
> +
> +This will suppress log messages from add / remove / removed by write
> +operations. Log messages from "countbadblocks" or "queryblock"
> +message commands will still print in quiet mode.
> +
> +The status of quiet mode can be seen by running "dmsetup status":
> +
> +$ sudo dmsetup status dust1
> +0 33552384 dust 252:17 fail_read_on_bad_block quiet
> +
> +To disable quiet mode, send the "quiet" message again:
> +
> +$ sudo dmsetup message dust1 0 quiet
> +
> +$ sudo dmsetup status dust1
> +0 33552384 dust 252:17 fail_read_on_bad_block verbose
> +
> +(The presence of "verbose" indicates normal logging.)
> +
> +"Why not...?"
> +-------------
> +
> +scsi_debug has a "medium error" mode that can fail reads on one
> +specified sector (sector 0x1234, hardcoded in the source code), but
> +it uses RAM for the persistent storage, which drastically decreases
> +the potential device size.
> +
> +dm-flakey fails all I/O from all block locations at a specified time
> +frequency, and not a given point in time.
> +
> +When a bad sector occurs on a hard disk drive, reads to that sector
> +are failed by the device, usually resulting in an error code of EIO
> +("I/O error") or ENODATA ("No data available"). However, a write to
> +the sector may succeed, and result in the sector becoming readable
> +after the device controller no longer experiences errors reading the
> +sector (or after a reallocation of the sector). However, there may
> +be bad sectors that occur on the device in the future, in a different,
> +unpredictable location.
> +
> +This target seeks to provide a device that can exhibit the behavior
> +of a bad sector at a known sector location, at a known time, based
> +on a large storage device (at least tens of gigabytes, not occupying
> +system memory).
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index 3db222509e44..5370473a0999 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -436,6 +436,15 @@ config DM_DELAY
>
> If unsure, say N.
>
> +config DM_DUST
> + tristate "Bad sector simulation target"
> + depends on BLK_DEV_DM
> + ---help---
> + A target that simulates bad sector behavior, and
> + limited optimal_io_size behavior. Useful for testing.
> +
> + If unsure, say N.
> +
> config DM_UEVENT
> bool "DM uevents"
> depends on BLK_DEV_DM
> diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> index 822f4e8753bc..75bf0391127a 100644
> --- a/drivers/md/Makefile
> +++ b/drivers/md/Makefile
> @@ -48,6 +48,7 @@ obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
> obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
> obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
> obj-$(CONFIG_DM_DELAY) += dm-delay.o
> +obj-$(CONFIG_DM_DUST) += dm-dust.o
> obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
> obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
> obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
> diff --git a/drivers/md/dm-dust.c b/drivers/md/dm-dust.c
> new file mode 100644
> index 000000000000..94b43d79b186
> --- /dev/null
> +++ b/drivers/md/dm-dust.c
> @@ -0,0 +1,499 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2018 Red Hat, Inc.
> + *
> + * This is a test "dust" device, which fails reads on specified
> + * sectors, emulating the behavior of a hard disk drive sending
> + * a "Read Medium Error" sense.
> + *
> + */
> +
> +#include <linux/device-mapper.h>
> +#include <linux/module.h>
> +#include <linux/version.h>
> +#include <linux/rbtree.h>
> +
> +#define DM_MSG_PREFIX "dust"
> +
> +struct badblock {
> + struct rb_node node;
> + sector_t bb;
> +};
> +
> +struct dust_device {
> + struct dm_dev *dev;
> + bool fail_read_on_bb;
> + struct rb_root badblocklist;
> + unsigned long long badblock_count;
> + spinlock_t dust_lock;
> + unsigned int blksz;
> + unsigned int sect_per_block;
> + bool quiet_mode;
> +};
> +
> +struct badblock *dust_rb_search(struct rb_root *root, sector_t blk)
> +{
> + struct rb_node *node = root->rb_node;
> +
> + while (node) {
> + struct badblock *bblk = rb_entry(node, struct badblock, node);
> +
> + if (bblk->bb > blk)
> + node = node->rb_left;
> + else if (bblk->bb < blk)
> + node = node->rb_right;
> + else
> + return bblk;
> + }
> + return NULL;
> +}
> +
> +bool dust_rb_insert(struct rb_root *root, struct badblock *new)
> +{
> + struct badblock *bblk;
> + struct rb_node **link = &root->rb_node, *parent = NULL;
> + sector_t value = new->bb;
> +
> + while (*link) {
> + parent = *link;
> + bblk = rb_entry(parent, struct badblock, node);
> +
> + if (bblk->bb > value)
> + link = &(*link)->rb_left;
> + else if (bblk->bb < value)
> + link = &(*link)->rb_right;
> + else
> + return false;
> + }
> +
> + rb_link_node(&new->node, parent, link);
> + rb_insert_color(&new->node, root);
> + return true;
> +}
> +
> +static int dust_remove_block(struct dust_device *dd, unsigned long long block)
> +{
> + struct badblock *bblock;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&dd->dust_lock, flags);
> + bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
> +
> + if (bblock == NULL) {
> + if (!dd->quiet_mode)
> + DMERR("removebadblock: block %llu not found in badblocklist",
> + block);
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + return -EINVAL;
> + }
> +
> + rb_erase(&bblock->node, &dd->badblocklist);
> + dd->badblock_count--;
> + if (!dd->quiet_mode)
> + DMINFO("removebadblock: badblock removed at block %llu", block);
> + kfree(bblock);
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + return 0;
> +}
> +
> +static int dust_add_block(struct dust_device *dd, unsigned long long block)
> +{
> + struct badblock *bblock;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&dd->dust_lock, flags);
> + bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
> +
> + if (bblock != NULL) {
> + if (!dd->quiet_mode)
> + DMERR("addbadblock: block %llu already in badblocklist",
> + block);
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + return -EINVAL;
> + }
> +
> + bblock = kmalloc(sizeof(*bblock), GFP_KERNEL);
You can't do a GFP_KERNEL allocation while holding a spinlock. I think
it would be better to assume that the user was correctly adding a new
block, do the allocation, and then just call dust_rb_insert() under the
spinlock. dust_rb_insert() will return false if the block is already
inserted
> + if (bblock == NULL) {
> + if (!dd->quiet_mode)
> + DMERR("addbadblock: badblock allocation failed");
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + return -ENOMEM;
> + }
> +
> + bblock->bb = block * dd->sect_per_block;
> + dust_rb_insert(&dd->badblocklist, bblock);
> + dd->badblock_count++;
> + if (!dd->quiet_mode)
> + DMINFO("addbadblock: badblock added at block %llu", block);
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + return 0;
> +}
> +
> +static int dust_query_block(struct dust_device *dd, unsigned long long block)
> +{
> + struct badblock *bblock;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&dd->dust_lock, flags);
> + bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
> +
> + if (bblock != NULL)
> + DMINFO("queryblock: block %llu found in badblocklist", block);
> + else
> + DMINFO("queryblock: block %llu not found in badblocklist",
> + block);
> +
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + return 0;
> +}
> +
> +static int dust_map_read(struct dust_device *dd, sector_t thisblock,
> + bool fail_read_on_bb)
> +{
> + unsigned long flags;
> + struct badblock *bblk;
> +
> + spin_lock_irqsave(&dd->dust_lock, flags);
> + bblk = dust_rb_search(&dd->badblocklist, thisblock);
I don't see the benefit of doing dust_rb_search() if fail_read_on_bb
isn't set. It just slows the read without using the result.
> +
> + if (fail_read_on_bb && bblk) {
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + return DM_MAPIO_KILL;
> + }
> +
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + return DM_MAPIO_REMAPPED;
> +}
> +
> +static int dust_map_write(struct dust_device *dd, sector_t thisblock,
> + bool fail_read_on_bb)
> +{
> + unsigned long flags;
> + struct badblock *bblk;
> +
> + spin_lock_irqsave(&dd->dust_lock, flags);
> + bblk = dust_rb_search(&dd->badblocklist, thisblock);
> +
> + if (fail_read_on_bb && bblk) {
The same goes for here as well.
> + rb_erase(&bblk->node, &dd->badblocklist);
> + dd->badblock_count--;
> + kfree(bblk);
> + if (!dd->quiet_mode)
> + DMINFO("block %llu removed from badblocklist by write",
> + (unsigned long long)thisblock / dd->sect_per_block);
> + }
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + return DM_MAPIO_REMAPPED;
> +}
> +
> +static inline void dust_set_biosector(struct bio *bio, sector_t sector)
> +{
> + bio->bi_iter.bi_sector = sector;
> +}
> +
> +static inline sector_t dust_get_biosector(struct bio *bio)
> +{
> + return bio->bi_iter.bi_sector;
> +}
> +
> +static int __dust_clear_badblocks(struct rb_root *tree,
> + unsigned long long count)
> +{
> + struct rb_node *node = NULL, *nnode = NULL;
> +
> + nnode = rb_first(tree);
> + if (nnode == NULL) {
> + BUG_ON(count != 0);
> + return 0;
> + }
> +
> + while (nnode) {
> + node = nnode;
> + nnode = rb_next(node);
> + rb_erase(node, tree);
> + count--;
> + kfree(node);
> + }
> + BUG_ON(count != 0);
> + BUG_ON(tree->rb_node != NULL);
> + return 1;
> +}
> +
> +static int dust_clear_badblocks(struct dust_device *dd)
> +{
> + unsigned long flags;
> + struct rb_root badblocklist;
> + unsigned long long badblock_count;
> +
> + spin_lock_irqsave(&dd->dust_lock, flags);
> + badblocklist = dd->badblocklist;
> + badblock_count = dd->badblock_count;
> + dd->badblocklist = RB_ROOT;
> + dd->badblock_count = 0;
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> +
> + if (__dust_clear_badblocks(&badblocklist, badblock_count) == 0)
> + DMINFO("clearbadblocks: no badblocks found");
> + else
> + DMINFO("clearbadblocks: badblocks cleared");
> + return 0;
> +}
> +
> +/*
> + * Target parameters:
> + *
> + * <device_path> <blksz>
> + *
> + * device_path: path to the block device
> + * blksz: block size (valid choices: 512, 4096)
> + */
> +static int dust_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> +{
> + struct dust_device *dd;
> + const char *device_path = argv[0];
> + unsigned int blksz;
> + unsigned int sect_per_block;
> +
> + if (argc != 2) {
> + ti->error = "requires exactly 2 arguments";
> + return -EINVAL;
> + }
> +
> + if (kstrtouint(argv[1], 10, &blksz) || !blksz) {
> + ti->error =
> + "Invalid block size parameter -- please enter 512 or 4096";
> + return -EINVAL;
> + }
Why check for !blksz above, when the code only allows 512 and 4096? Why
not just check for those values in the first "if" statement?
-Ben
> + if (blksz != 512 && blksz != 4096) {
> + ti->error = "Invalid block size -- please enter 512 or 4096";
> + return -EINVAL;
> + }
> +
> + sect_per_block = (blksz >> SECTOR_SHIFT);
> +
> + dd = kzalloc(sizeof(struct dust_device), GFP_KERNEL);
> +
> + if (dd == NULL) {
> + ti->error = "Cannot allocate context";
> + return -ENOMEM;
> + }
> +
> + if (dm_get_device(ti, device_path, dm_table_get_mode(ti->table),
> + &dd->dev)) {
> + ti->error = "Device lookup failed";
> + kfree(dd);
> + return -EINVAL;
> + }
> +
> + dd->sect_per_block = sect_per_block;
> + dd->blksz = blksz;
> +
> + // Whether to fail a read on a "bad" block.
> + // Defaults to false; enabled later by message.
> + dd->fail_read_on_bb = false;
> +
> + // Bad block list rbtree.
> + // Just initialize for now.
> + dd->badblocklist = RB_ROOT;
> + dd->badblock_count = 0;
> + spin_lock_init(&dd->dust_lock);
> +
> + dd->quiet_mode = false;
> +
> + BUG_ON(dm_set_target_max_io_len(ti, dd->sect_per_block) != 0);
> +
> + ti->num_discard_bios = 1;
> + ti->num_flush_bios = 1;
> + ti->private = dd;
> + return 0;
> +}
> +
> +static void dust_dtr(struct dm_target *ti)
> +{
> + struct dust_device *dd = ti->private;
> +
> + __dust_clear_badblocks(&dd->badblocklist, dd->badblock_count);
> + dm_put_device(ti, dd->dev);
> + kfree(dd);
> +}
> +
> +static void dust_io_hints(struct dm_target *ti, struct queue_limits *limits)
> +{
> + struct dust_device *dd = ti->private;
> +
> + limits->logical_block_size = dd->blksz;
> + limits->physical_block_size = dd->blksz;
> +
> + // The minimum io size for random io
> + blk_limits_io_min(limits, dd->blksz);
> + // The optimal io size for streamed/sequential io
> + blk_limits_io_opt(limits, dd->blksz);
> +}
> +
> +static int dust_map(struct dm_target *ti, struct bio *bio)
> +{
> + struct dust_device *dd;
> + sector_t thisblock;
> + int ret;
> +
> + dd = ti->private;
> + bio_set_dev(bio, dd->dev->bdev);
> + dust_set_biosector(bio, dm_target_offset(ti, dust_get_biosector(bio)));
> + thisblock = dust_get_biosector(bio);
> +
> + if (bio_data_dir(bio) == READ)
> + ret = dust_map_read(dd, thisblock, dd->fail_read_on_bb);
> + if (bio_data_dir(bio) == WRITE)
> + ret = dust_map_write(dd, thisblock, dd->fail_read_on_bb);
> +
> + return ret;
> +}
> +
> +static int dust_message(struct dm_target *ti, unsigned int argc, char **argv,
> + char *result_buf, unsigned int maxlen)
> +{
> + struct dust_device *dd = ti->private;
> + sector_t size = i_size_read(dd->dev->bdev->bd_inode) >> SECTOR_SHIFT;
> + bool invalid_msg = false;
> + int result = -EINVAL;
> + unsigned long long tmp, block;
> + unsigned long flags;
> + char dummy;
> +
> + if (argc == 1) {
> + if (!strcasecmp(argv[0], "addbadblock") ||
> + !strcasecmp(argv[0], "removebadblock") ||
> + !strcasecmp(argv[0], "queryblock")) {
> + DMERR("%s requires an additional argument", argv[0]);
> + } else if (!strcasecmp(argv[0], "disable")) {
> + DMINFO("disabling read failures on bad sectors");
> + dd->fail_read_on_bb = false;
> + result = 0;
> + } else if (!strcasecmp(argv[0], "enable")) {
> + DMINFO("enabling read failures on bad sectors");
> + dd->fail_read_on_bb = true;
> + result = 0;
> + } else if (!strcasecmp(argv[0], "countbadblocks")) {
> + spin_lock_irqsave(&dd->dust_lock, flags);
> + DMINFO("countbadblocks: %llu badblock(s) found",
> + dd->badblock_count);
> + spin_unlock_irqrestore(&dd->dust_lock, flags);
> + result = 0;
> + } else if (!strcasecmp(argv[0], "clearbadblocks")) {
> + result = dust_clear_badblocks(dd);
> + } else if (!strcasecmp(argv[0], "quiet")) {
> + if (!dd->quiet_mode)
> + dd->quiet_mode = true;
> + else
> + dd->quiet_mode = false;
> + result = 0;
> + } else {
> + invalid_msg = true;
> + }
> + } else if (argc == 2) {
> + if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1)
> + return result;
> +
> + block = tmp;
> +
> + if (block > size / dd->sect_per_block || block < 0) {
> + DMERR("selected block value out of range");
> + return result;
> + }
> +
> + if (!strcasecmp(argv[0], "addbadblock"))
> + result = dust_add_block(dd, block);
> + else if (!strcasecmp(argv[0], "removebadblock"))
> + result = dust_remove_block(dd, block);
> + else if (!strcasecmp(argv[0], "queryblock"))
> + result = dust_query_block(dd, block);
> + else
> + invalid_msg = true;
> +
> + } else {
> + DMERR("invalid number of arguments '%d'", argc);
> + }
> +
> + if (invalid_msg)
> + DMERR("unrecognized dmsetup message '%s' received", argv[0]);
> +
> + return result;
> +}
> +
> +static void dust_status(struct dm_target *ti, status_type_t type,
> + unsigned int status_flags, char *result, unsigned int maxlen)
> +{
> + struct dust_device *dd = ti->private;
> + unsigned int sz = 0; // used by the DMEMIT macro
> +
> + switch (type) {
> + case STATUSTYPE_INFO:
> + DMEMIT("%s %s %s", dd->dev->name,
> + dd->fail_read_on_bb ? "fail_read_on_bad_block" :
> + "bypass",
> + dd->quiet_mode ? "quiet" : "verbose");
> + break;
> +
> + case STATUSTYPE_TABLE:
> + DMEMIT("%s %u", dd->dev->name, dd->blksz);
> + break;
> + }
> +}
> +
> +static int dust_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
> +{
> + struct dust_device *dd = ti->private;
> + struct dm_dev *dev = dd->dev;
> +
> + *bdev = dev->bdev;
> +
> + // Only pass ioctls through if the device sizes match exactly.
> + if (ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT)
> + return 1;
> +
> + return 0;
> +}
> +
> +static int dust_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn,
> + void *data)
> +{
> + struct dust_device *dd = ti->private;
> +
> + return fn(ti, dd->dev, 0, ti->len, data);
> +}
> +
> +static struct target_type dust_target = {
> + .name = "dust",
> + .version = { 1, 0, 0 },
> + .module = THIS_MODULE,
> + .ctr = dust_ctr,
> + .dtr = dust_dtr,
> + .iterate_devices = dust_iterate_devices,
> + .io_hints = dust_io_hints,
> + .map = dust_map,
> + .message = dust_message,
> + .status = dust_status,
> + .prepare_ioctl = dust_prepare_ioctl,
> +};
> +
> +int __init dm_dust_init(void)
> +{
> + int result = dm_register_target(&dust_target);
> +
> + if (result < 0)
> + DMERR("dm_register_target failed %d", result);
> +
> + return result;
> +}
> +
> +void __exit dm_dust_exit(void)
> +{
> + dm_unregister_target(&dust_target);
> +}
> +
> +module_init(dm_dust_init);
> +module_exit(dm_dust_exit);
> +
> +MODULE_DESCRIPTION(DM_NAME " dust testing device");
> +MODULE_AUTHOR("Red Hat, Inc.");
> +MODULE_LICENSE("GPL");
> --
> 2.17.2
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/1] dm: add dust target
2019-01-08 0:10 ` Benjamin Marzinski
@ 2019-01-08 15:30 ` Bryan Gurney
2019-01-08 16:23 ` Mike Snitzer
2019-01-08 17:36 ` [RFC PATCH 1/1] dm: add dust target Benjamin Marzinski
0 siblings, 2 replies; 11+ messages in thread
From: Bryan Gurney @ 2019-01-08 15:30 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: Mike Snitzer, John Dorminy, Joe Shimkus, dm-devel, tjaskiew,
John Pittman, Alasdair G Kergon
On Mon, Jan 7, 2019 at 7:10 PM Benjamin Marzinski <bmarzins@redhat.com> wrote:
>
> On Mon, Jan 07, 2019 at 02:31:23PM -0500, Bryan Gurney wrote:
> > Add the dm-dust target, which simulates the behavior of bad sectors
> > at arbitrary locations, and the ability to enable the emulation of
> > the read failures at an arbitrary time.
> >
> > This target behaves similarly to a linear target, with the exception
> > of the minimum and optimal IO sizes being configurable to 512 or
> > 4096 bytes.
> >
> > At a given time, the user can send a message to the target to start
> > failing read requests on specific blocks. When the failure behavior
> > is enabled, reads of blocks configured "bad" will fail with EIO.
> >
> > Writes of blocks configured "bad" will result in the following:
> >
> > 1. Remove the block from the "bad block list".
> > 2. Successfully complete the write.
> >
> > After this point, the block will successfully contain the written
> > data, and will service reads and writes normally. This emulates the
> > behavior of a "remapped sector" on a hard disk drive.
> >
> > Signed-off-by: Joe Shimkus <jshimkus@redhat.com>
> > Signed-off-by: John Dorminy <jdorminy@redhat.com>
> > Signed-off-by: John Pittman <jpittman@redhat.com>
> > Signed-off-by: Thomas Jaskiewicz <tjaskiew@redhat.com>
> > Signed-off-by: Bryan Gurney <bgurney@redhat.com>
> > ---
> > Documentation/device-mapper/dm-dust.txt | 256 ++++++++++++
> > drivers/md/Kconfig | 9 +
> > drivers/md/Makefile | 1 +
> > drivers/md/dm-dust.c | 499 ++++++++++++++++++++++++
> > 4 files changed, 765 insertions(+)
> > create mode 100644 Documentation/device-mapper/dm-dust.txt
> > create mode 100644 drivers/md/dm-dust.c
> >
> > diff --git a/Documentation/device-mapper/dm-dust.txt b/Documentation/device-mapper/dm-dust.txt
> > new file mode 100644
> > index 000000000000..7d7740d5b1c9
> > --- /dev/null
> > +++ b/Documentation/device-mapper/dm-dust.txt
> > @@ -0,0 +1,256 @@
> > +dm-dust
> > +=======
> > +
> > +This target emulates the behavior of bad sectors at arbitrary
> > +locations, and the ability to enable the emulation of the failures
> > +at an arbitrary time.
> > +
> > +This target behaves similarly to a linear target, except that the
> > +minimum and optimal io sizes are DUST_BLOCK_SIZE bytes (512 or
> > +4096 bytes). At a given time, the user can send a message to the
> > +target to start failing read requests on specific blocks (to emulate
> > +the behavior of a hard disk drive with bad sectors).
> > +
> > +When the failure behavior is enabled (i.e.: when the output of
> > +"dmsetup status" displays "fail_read_on_bad_block"), reads of blocks
> > +in the "bad block list" will fail with EIO ("Input/output error").
> > +
> > +Writes of blocks in the "bad block list will result in the following:
> > +
> > +1. Remove the block from the "bad block list".
> > +2. Successfully complete the write.
> > +
> > +This emulates the "remapped sector" behavior of a drive with bad
> > +sectors.
> > +
> > +Normally, a drive that is encountering bad sectors will most likely
> > +encounter more bad sectors, at an unknown time or location.
> > +With dm-dust, the user can use the "addbadblock" and "removebadblock"
> > +messages to add arbitrary bad blocks at new locations, and the
> > +"enable" and "disable" messages to modulate the state of whether the
> > +configured "bad blocks" will be treated as bad, or bypassed.
> > +This allows the pre-writing of test data and metadata prior to
> > +simulating a "failure" event where bad sectors start to appear.
> > +
> > +Table parameters:
> > +-----------------
> > +<device_path> <blksz>
> > +
> > +Mandatory parameters:
> > + <device_path>: path to the block device.
> > + <blksz>: block size (valid choices: 512, 4096).
> > +
> > +Usage instructions:
> > +-------------------
> > +
> > +First, find the size (in 512-byte sectors) of the device to be used:
> > +
> > +$ sudo blockdev --getsz /dev/vdb1
> > +33552384
> > +
> > +After building the module, load the module:
> > +
> > +$ sudo modprobe pbitdust
> > +
> > +Create the device:
> > +(For a device with a block size of 512 bytes)
> > +$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 512'
> > +
> > +(For a device with a block size of 4096 bytes)
> > +$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 4096'
> > +
> > +Check the status of the read behavior ("bypass" indicates that all I/O
> > +will be passed through to the underlying device):
> > +$ sudo dmsetup status dust1
> > +0 33552384 dust 252:17 bypass
> > +
> > +$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct
> > +128+0 records in
> > +128+0 records out
> > +
> > +$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
> > +128+0 records in
> > +128+0 records out
> > +
> > +Adding and removing bad blocks:
> > +-------------------------------
> > +
> > +At any time (i.e.: whether the device has the "bad block" emulation
> > +enabled or disabled), bad blocks may be added or removed from the
> > +device via the "addbadblock" and "removebadblock" messages:
> > +
> > +$ sudo dmsetup message dust1 0 addbadblock 60
> > +kernel: device-mapper: dust: badblock added at block 60
> > +
> > +$ sudo dmsetup message dust1 0 addbadblock 67
> > +kernel: device-mapper: dust: badblock added at block 67
> > +
> > +$ sudo dmsetup message dust1 0 addbadblock 72
> > +kernel: device-mapper: dust: badblock added at block 72
> > +
> > +These bad blocks will be stored in the "bad block list".
> > +While the device is in "bypass" mode, reads and writes will succeed:
> > +
> > +$ sudo dmsetup status dust1
> > +0 33552384 dust 252:17 bypass
> > +
> > +Enabling block read failures:
> > +-----------------------------
> > +
> > +To enable the "fail read on bad block" behavior, send the "enable" message:
> > +
> > +$ sudo dmsetup message dust1 0 enable
> > +kernel: device-mapper: dust: enabling read failures on bad sectors
> > +
> > +$ sudo dmsetup status dust1
> > +0 33552384 dust 252:17 fail_read_on_bad_block
> > +
> > +With the device in "fail read on bad block" mode, attempting to read a
> > +block will encounter an "Input/output error":
> > +
> > +$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct
> > +dd: error reading '/dev/mapper/dust1': Input/output error
> > +0+0 records in
> > +0+0 records out
> > +0 bytes copied, 0.00040651 s, 0.0 kB/s
> > +
> > +...and writing to the bad blocks will remove the blocks from the list,
> > +therefore emulating the "remap" behavior of hard disk drives:
> > +
> > +$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct
> > +128+0 records in
> > +128+0 records out
> > +
> > +kernel: device-mapper: dust: block 60 removed from badblocklist by write
> > +kernel: device-mapper: dust: block 67 removed from badblocklist by write
> > +kernel: device-mapper: dust: block 72 removed from badblocklist by write
> > +kernel: device-mapper: dust: block 87 removed from badblocklist by write
> > +
> > +Bad block add/remove error handling:
> > +------------------------------------
> > +
> > +Attempting to add a bad block that already exists in the list will
> > +result in an "Invalid argument" error, as well as a helpful message:
> > +
> > +$ sudo dmsetup message dust1 0 addbadblock 88
> > +device-mapper: message ioctl on dust1 failed: Invalid argument
> > +kernel: device-mapper: dust: block 88 already in badblocklist
> > +
> > +Attempting to remove a bad block that doesn't exist in the list will
> > +result in an "Invalid argument" error, as well as a helpful message:
> > +
> > +$ sudo dmsetup message dust1 0 removebadblock 87
> > +device-mapper: message ioctl on dust1 failed: Invalid argument
> > +kernel: device-mapper: dust: block 87 not found in badblocklist
> > +
> > +Counting the number of bad blocks in the bad block list:
> > +--------------------------------------------------------
> > +
> > +To count the number of bad blocks configured in the device, run the
> > +following message command:
> > +
> > +$ sudo dmsetup message dust1 0 countbadblocks
> > +
> > +A message will print with the number of bad blocks currently
> > +configured on the device:
> > +
> > +kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found
> > +
> > +Clearing the bad block list:
> > +----------------------------
> > +
> > +To clear the bad block list (without needing to individually run
> > +a "removebadblock" message command for every block), run the
> > +following message command:
> > +
> > +$ sudo dmsetup message dust1 0 clearbadblocks
> > +
> > +After clearing the bad block list, the following message will appear:
> > +
> > +kernel: device-mapper: dust: clearbadblocks: badblocks cleared
> > +
> > +If there were no bad blocks to clear, the following message will
> > +appear:
> > +
> > +kernel: device-mapper: dust: clearbadblocks: no badblocks found
> > +
> > +Message commands list:
> > +----------------------
> > +
> > +Below is a list of the messages that can be sent to a dust device:
> > +
> > +Operations on blocks (requires a <blknum> argument):
> > +
> > +addbadblock <blknum>
> > +queryblock <blknum>
> > +removebadblock <blknum>
> > +
> > +...where <blknum> is a block number within range of the device
> > + (corresponding to the block size of the device, 512 or 4096 bytes.)
> > +
> > +Single argument message commands:
> > +
> > +countbadblocks
> > +clearbadblocks
> > +disable
> > +enable
> > +quiet
> > +
> > +Device removal:
> > +---------------
> > +
> > +When finished, remove the device via the "dmsetup remove" command:
> > +
> > +$ sudo dmsetup remove dust1
> > +
> > +Quiet mode:
> > +-----------
> > +
> > +On test runs with many bad blocks, it may be desirable to avoid
> > +excessive logging (from bad blocks added, removed, or "remapped").
> > +This can be done by enabling "quiet mode" via the following message:
> > +
> > +$ sudo dmsetup message dust1 0 quiet
> > +
> > +This will suppress log messages from add / remove / removed by write
> > +operations. Log messages from "countbadblocks" or "queryblock"
> > +message commands will still print in quiet mode.
> > +
> > +The status of quiet mode can be seen by running "dmsetup status":
> > +
> > +$ sudo dmsetup status dust1
> > +0 33552384 dust 252:17 fail_read_on_bad_block quiet
> > +
> > +To disable quiet mode, send the "quiet" message again:
> > +
> > +$ sudo dmsetup message dust1 0 quiet
> > +
> > +$ sudo dmsetup status dust1
> > +0 33552384 dust 252:17 fail_read_on_bad_block verbose
> > +
> > +(The presence of "verbose" indicates normal logging.)
> > +
> > +"Why not...?"
> > +-------------
> > +
> > +scsi_debug has a "medium error" mode that can fail reads on one
> > +specified sector (sector 0x1234, hardcoded in the source code), but
> > +it uses RAM for the persistent storage, which drastically decreases
> > +the potential device size.
> > +
> > +dm-flakey fails all I/O from all block locations at a specified time
> > +frequency, and not a given point in time.
> > +
> > +When a bad sector occurs on a hard disk drive, reads to that sector
> > +are failed by the device, usually resulting in an error code of EIO
> > +("I/O error") or ENODATA ("No data available"). However, a write to
> > +the sector may succeed, and result in the sector becoming readable
> > +after the device controller no longer experiences errors reading the
> > +sector (or after a reallocation of the sector). However, there may
> > +be bad sectors that occur on the device in the future, in a different,
> > +unpredictable location.
> > +
> > +This target seeks to provide a device that can exhibit the behavior
> > +of a bad sector at a known sector location, at a known time, based
> > +on a large storage device (at least tens of gigabytes, not occupying
> > +system memory).
> > diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> > index 3db222509e44..5370473a0999 100644
> > --- a/drivers/md/Kconfig
> > +++ b/drivers/md/Kconfig
> > @@ -436,6 +436,15 @@ config DM_DELAY
> >
> > If unsure, say N.
> >
> > +config DM_DUST
> > + tristate "Bad sector simulation target"
> > + depends on BLK_DEV_DM
> > + ---help---
> > + A target that simulates bad sector behavior, and
> > + limited optimal_io_size behavior. Useful for testing.
> > +
> > + If unsure, say N.
> > +
> > config DM_UEVENT
> > bool "DM uevents"
> > depends on BLK_DEV_DM
> > diff --git a/drivers/md/Makefile b/drivers/md/Makefile
> > index 822f4e8753bc..75bf0391127a 100644
> > --- a/drivers/md/Makefile
> > +++ b/drivers/md/Makefile
> > @@ -48,6 +48,7 @@ obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
> > obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
> > obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
> > obj-$(CONFIG_DM_DELAY) += dm-delay.o
> > +obj-$(CONFIG_DM_DUST) += dm-dust.o
> > obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
> > obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
> > obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
> > diff --git a/drivers/md/dm-dust.c b/drivers/md/dm-dust.c
> > new file mode 100644
> > index 000000000000..94b43d79b186
> > --- /dev/null
> > +++ b/drivers/md/dm-dust.c
> > @@ -0,0 +1,499 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (c) 2018 Red Hat, Inc.
> > + *
> > + * This is a test "dust" device, which fails reads on specified
> > + * sectors, emulating the behavior of a hard disk drive sending
> > + * a "Read Medium Error" sense.
> > + *
> > + */
> > +
> > +#include <linux/device-mapper.h>
> > +#include <linux/module.h>
> > +#include <linux/version.h>
> > +#include <linux/rbtree.h>
> > +
> > +#define DM_MSG_PREFIX "dust"
> > +
> > +struct badblock {
> > + struct rb_node node;
> > + sector_t bb;
> > +};
> > +
> > +struct dust_device {
> > + struct dm_dev *dev;
> > + bool fail_read_on_bb;
> > + struct rb_root badblocklist;
> > + unsigned long long badblock_count;
> > + spinlock_t dust_lock;
> > + unsigned int blksz;
> > + unsigned int sect_per_block;
> > + bool quiet_mode;
> > +};
> > +
> > +struct badblock *dust_rb_search(struct rb_root *root, sector_t blk)
> > +{
> > + struct rb_node *node = root->rb_node;
> > +
> > + while (node) {
> > + struct badblock *bblk = rb_entry(node, struct badblock, node);
> > +
> > + if (bblk->bb > blk)
> > + node = node->rb_left;
> > + else if (bblk->bb < blk)
> > + node = node->rb_right;
> > + else
> > + return bblk;
> > + }
> > + return NULL;
> > +}
> > +
> > +bool dust_rb_insert(struct rb_root *root, struct badblock *new)
> > +{
> > + struct badblock *bblk;
> > + struct rb_node **link = &root->rb_node, *parent = NULL;
> > + sector_t value = new->bb;
> > +
> > + while (*link) {
> > + parent = *link;
> > + bblk = rb_entry(parent, struct badblock, node);
> > +
> > + if (bblk->bb > value)
> > + link = &(*link)->rb_left;
> > + else if (bblk->bb < value)
> > + link = &(*link)->rb_right;
> > + else
> > + return false;
> > + }
> > +
> > + rb_link_node(&new->node, parent, link);
> > + rb_insert_color(&new->node, root);
> > + return true;
> > +}
> > +
> > +static int dust_remove_block(struct dust_device *dd, unsigned long long block)
> > +{
> > + struct badblock *bblock;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&dd->dust_lock, flags);
> > + bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
> > +
> > + if (bblock == NULL) {
> > + if (!dd->quiet_mode)
> > + DMERR("removebadblock: block %llu not found in badblocklist",
> > + block);
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + return -EINVAL;
> > + }
> > +
> > + rb_erase(&bblock->node, &dd->badblocklist);
> > + dd->badblock_count--;
> > + if (!dd->quiet_mode)
> > + DMINFO("removebadblock: badblock removed at block %llu", block);
> > + kfree(bblock);
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + return 0;
> > +}
> > +
> > +static int dust_add_block(struct dust_device *dd, unsigned long long block)
> > +{
> > + struct badblock *bblock;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&dd->dust_lock, flags);
> > + bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
> > +
> > + if (bblock != NULL) {
> > + if (!dd->quiet_mode)
> > + DMERR("addbadblock: block %llu already in badblocklist",
> > + block);
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + return -EINVAL;
> > + }
> > +
> > + bblock = kmalloc(sizeof(*bblock), GFP_KERNEL);
>
> You can't do a GFP_KERNEL allocation while holding a spinlock. I think
> it would be better to assume that the user was correctly adding a new
> block, do the allocation, and then just call dust_rb_insert() under the
> spinlock. dust_rb_insert() will return false if the block is already
> inserted
Ah, okay. I'll be spinning a v2 of this patch. (I need to make some
updates to the dm-dust.txt documentation file as well, so those
updates will be in there.)
After reading through your description of the timeline, I created this
revision of dust_add_block() (which compiles, and seems to work
properly in my test module):
static int dust_add_block(struct dust_device *dd, unsigned long long block)
{
struct badblock *bblock;
unsigned long flags;
bblock = kmalloc(sizeof(*bblock), GFP_KERNEL);
if (bblock == NULL) {
if (!dd->quiet_mode)
DMERR("addbadblock: badblock allocation failed");
return -ENOMEM;
}
spin_lock_irqsave(&dd->dust_lock, flags);
bblock->bb = block * dd->sect_per_block;
if (!dust_rb_insert(&dd->badblocklist, bblock)) {
if (!dd->quiet_mode)
DMERR("addbadblock: block %llu already in badblocklist",
block);
spin_unlock_irqrestore(&dd->dust_lock, flags);
kfree(bblock);
return -EINVAL;
}
dd->badblock_count++;
if (!dd->quiet_mode)
DMINFO("addbadblock: badblock added at block %llu", block);
spin_unlock_irqrestore(&dd->dust_lock, flags);
return 0;
}
Test results:
# dmsetup message dust1 0 addbadblock 60
kernel: device-mapper: dust: addbadblock: badblock added at block 60
# dmsetup message dust1 0 addbadblock 60
kernel: device-mapper: dust: addbadblock: block 60 already in badblocklist
# dmsetup message dust1 0 addbadblock 60
kernel: device-mapper: dust: addbadblock: block 60 already in badblocklist
The kmalloc() is before the spinlock, and if the block is already in
the rbtree, it unlocks, and frees the unused "bblock". Does this look
good?
> > + if (bblock == NULL) {
> > + if (!dd->quiet_mode)
> > + DMERR("addbadblock: badblock allocation failed");
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + return -ENOMEM;
> > + }
> > +
> > + bblock->bb = block * dd->sect_per_block;
> > + dust_rb_insert(&dd->badblocklist, bblock);
> > + dd->badblock_count++;
> > + if (!dd->quiet_mode)
> > + DMINFO("addbadblock: badblock added at block %llu", block);
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + return 0;
> > +}
> > +
> > +static int dust_query_block(struct dust_device *dd, unsigned long long block)
> > +{
> > + struct badblock *bblock;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&dd->dust_lock, flags);
> > + bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
> > +
> > + if (bblock != NULL)
> > + DMINFO("queryblock: block %llu found in badblocklist", block);
> > + else
> > + DMINFO("queryblock: block %llu not found in badblocklist",
> > + block);
> > +
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + return 0;
> > +}
> > +
> > +static int dust_map_read(struct dust_device *dd, sector_t thisblock,
> > + bool fail_read_on_bb)
> > +{
> > + unsigned long flags;
> > + struct badblock *bblk;
> > +
> > + spin_lock_irqsave(&dd->dust_lock, flags);
> > + bblk = dust_rb_search(&dd->badblocklist, thisblock);
>
> I don't see the benefit of doing dust_rb_search() if fail_read_on_bb
> isn't set. It just slows the read without using the result.
This is something that I have to admit was unintentional, but I like
it: it keeps the timing consistent between the "enabled" and
"disabled" modes (a.k.a.: the "fail_read_on_bad_block" and "bypass"
modes).
The primary goal of this target is to be a close simulation of a drive
with bad sectors. Therefore, it would be good to have a consistent
"seek time" between the two modes. There could be test cases where a
problem only appears on a "faster" (or "slower") device, but that's
not what this test device is intended to test.
Considering the above, can we leave this as is? (I can leave a
comment explaining the desire for timing consistency between the
"enabled" and "disabled" states.)
> > +
> > + if (fail_read_on_bb && bblk) {
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + return DM_MAPIO_KILL;
> > + }
> > +
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + return DM_MAPIO_REMAPPED;
> > +}
> > +
> > +static int dust_map_write(struct dust_device *dd, sector_t thisblock,
> > + bool fail_read_on_bb)
> > +{
> > + unsigned long flags;
> > + struct badblock *bblk;
> > +
> > + spin_lock_irqsave(&dd->dust_lock, flags);
> > + bblk = dust_rb_search(&dd->badblocklist, thisblock);
> > +
> > + if (fail_read_on_bb && bblk) {
>
> The same goes for here as well.
Please see my above comment for dust_map_read().
> > + rb_erase(&bblk->node, &dd->badblocklist);
> > + dd->badblock_count--;
> > + kfree(bblk);
> > + if (!dd->quiet_mode)
> > + DMINFO("block %llu removed from badblocklist by write",
> > + (unsigned long long)thisblock / dd->sect_per_block);
> > + }
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + return DM_MAPIO_REMAPPED;
> > +}
> > +
> > +static inline void dust_set_biosector(struct bio *bio, sector_t sector)
> > +{
> > + bio->bi_iter.bi_sector = sector;
> > +}
> > +
> > +static inline sector_t dust_get_biosector(struct bio *bio)
> > +{
> > + return bio->bi_iter.bi_sector;
> > +}
> > +
> > +static int __dust_clear_badblocks(struct rb_root *tree,
> > + unsigned long long count)
> > +{
> > + struct rb_node *node = NULL, *nnode = NULL;
> > +
> > + nnode = rb_first(tree);
> > + if (nnode == NULL) {
> > + BUG_ON(count != 0);
> > + return 0;
> > + }
> > +
> > + while (nnode) {
> > + node = nnode;
> > + nnode = rb_next(node);
> > + rb_erase(node, tree);
> > + count--;
> > + kfree(node);
> > + }
> > + BUG_ON(count != 0);
> > + BUG_ON(tree->rb_node != NULL);
> > + return 1;
> > +}
> > +
> > +static int dust_clear_badblocks(struct dust_device *dd)
> > +{
> > + unsigned long flags;
> > + struct rb_root badblocklist;
> > + unsigned long long badblock_count;
> > +
> > + spin_lock_irqsave(&dd->dust_lock, flags);
> > + badblocklist = dd->badblocklist;
> > + badblock_count = dd->badblock_count;
> > + dd->badblocklist = RB_ROOT;
> > + dd->badblock_count = 0;
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > +
> > + if (__dust_clear_badblocks(&badblocklist, badblock_count) == 0)
> > + DMINFO("clearbadblocks: no badblocks found");
> > + else
> > + DMINFO("clearbadblocks: badblocks cleared");
> > + return 0;
> > +}
> > +
> > +/*
> > + * Target parameters:
> > + *
> > + * <device_path> <blksz>
> > + *
> > + * device_path: path to the block device
> > + * blksz: block size (valid choices: 512, 4096)
> > + */
> > +static int dust_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> > +{
> > + struct dust_device *dd;
> > + const char *device_path = argv[0];
> > + unsigned int blksz;
> > + unsigned int sect_per_block;
> > +
> > + if (argc != 2) {
> > + ti->error = "requires exactly 2 arguments";
> > + return -EINVAL;
> > + }
> > +
> > + if (kstrtouint(argv[1], 10, &blksz) || !blksz) {
> > + ti->error =
> > + "Invalid block size parameter -- please enter 512 or 4096";
> > + return -EINVAL;
> > + }
>
> Why check for !blksz above, when the code only allows 512 and 4096? Why
> not just check for those values in the first "if" statement?
>
This is actually a two-stage check: the first stage checks for
non-numeric input:
# dmsetup create dust1 --table '0 10483712 dust /dev/vdb1 foo'
stdout:
"device-mapper: reload ioctl on dust1 failed: Invalid argument"
printk:
"kernel: device-mapper: table: 253:2: dust: Invalid block size
parameter -- please enter 512 or 4096"
...and the second stage checks for the usable block sizes:
# dmsetup create dust1 --table '0 10483712 dust /dev/vdb1 520'
stdout:
"device-mapper: reload ioctl on dust1 failed: Invalid argument"
printk:
"kernel: device-mapper: table: 253:2: dust: Invalid block size --
please enter 512 or 4096"
These different messages help to clarify whether the parameter isn't a
usable integer, or isn't an integer at all. (Perhaps the message are
a bit too similar).
Given the above, should there still be a 2-stage check, or should this
be folded into one that simply checks whether blksz is 512 or 4096?
Thanks,
Bryan
>
> > + if (blksz != 512 && blksz != 4096) {
> > + ti->error = "Invalid block size -- please enter 512 or 4096";
> > + return -EINVAL;
> > + }
> > +
> > + sect_per_block = (blksz >> SECTOR_SHIFT);
> > +
> > + dd = kzalloc(sizeof(struct dust_device), GFP_KERNEL);
> > +
> > + if (dd == NULL) {
> > + ti->error = "Cannot allocate context";
> > + return -ENOMEM;
> > + }
> > +
> > + if (dm_get_device(ti, device_path, dm_table_get_mode(ti->table),
> > + &dd->dev)) {
> > + ti->error = "Device lookup failed";
> > + kfree(dd);
> > + return -EINVAL;
> > + }
> > +
> > + dd->sect_per_block = sect_per_block;
> > + dd->blksz = blksz;
> > +
> > + // Whether to fail a read on a "bad" block.
> > + // Defaults to false; enabled later by message.
> > + dd->fail_read_on_bb = false;
> > +
> > + // Bad block list rbtree.
> > + // Just initialize for now.
> > + dd->badblocklist = RB_ROOT;
> > + dd->badblock_count = 0;
> > + spin_lock_init(&dd->dust_lock);
> > +
> > + dd->quiet_mode = false;
> > +
> > + BUG_ON(dm_set_target_max_io_len(ti, dd->sect_per_block) != 0);
> > +
> > + ti->num_discard_bios = 1;
> > + ti->num_flush_bios = 1;
> > + ti->private = dd;
> > + return 0;
> > +}
> > +
> > +static void dust_dtr(struct dm_target *ti)
> > +{
> > + struct dust_device *dd = ti->private;
> > +
> > + __dust_clear_badblocks(&dd->badblocklist, dd->badblock_count);
> > + dm_put_device(ti, dd->dev);
> > + kfree(dd);
> > +}
> > +
> > +static void dust_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > +{
> > + struct dust_device *dd = ti->private;
> > +
> > + limits->logical_block_size = dd->blksz;
> > + limits->physical_block_size = dd->blksz;
> > +
> > + // The minimum io size for random io
> > + blk_limits_io_min(limits, dd->blksz);
> > + // The optimal io size for streamed/sequential io
> > + blk_limits_io_opt(limits, dd->blksz);
> > +}
> > +
> > +static int dust_map(struct dm_target *ti, struct bio *bio)
> > +{
> > + struct dust_device *dd;
> > + sector_t thisblock;
> > + int ret;
> > +
> > + dd = ti->private;
> > + bio_set_dev(bio, dd->dev->bdev);
> > + dust_set_biosector(bio, dm_target_offset(ti, dust_get_biosector(bio)));
> > + thisblock = dust_get_biosector(bio);
> > +
> > + if (bio_data_dir(bio) == READ)
> > + ret = dust_map_read(dd, thisblock, dd->fail_read_on_bb);
> > + if (bio_data_dir(bio) == WRITE)
> > + ret = dust_map_write(dd, thisblock, dd->fail_read_on_bb);
> > +
> > + return ret;
> > +}
> > +
> > +static int dust_message(struct dm_target *ti, unsigned int argc, char **argv,
> > + char *result_buf, unsigned int maxlen)
> > +{
> > + struct dust_device *dd = ti->private;
> > + sector_t size = i_size_read(dd->dev->bdev->bd_inode) >> SECTOR_SHIFT;
> > + bool invalid_msg = false;
> > + int result = -EINVAL;
> > + unsigned long long tmp, block;
> > + unsigned long flags;
> > + char dummy;
> > +
> > + if (argc == 1) {
> > + if (!strcasecmp(argv[0], "addbadblock") ||
> > + !strcasecmp(argv[0], "removebadblock") ||
> > + !strcasecmp(argv[0], "queryblock")) {
> > + DMERR("%s requires an additional argument", argv[0]);
> > + } else if (!strcasecmp(argv[0], "disable")) {
> > + DMINFO("disabling read failures on bad sectors");
> > + dd->fail_read_on_bb = false;
> > + result = 0;
> > + } else if (!strcasecmp(argv[0], "enable")) {
> > + DMINFO("enabling read failures on bad sectors");
> > + dd->fail_read_on_bb = true;
> > + result = 0;
> > + } else if (!strcasecmp(argv[0], "countbadblocks")) {
> > + spin_lock_irqsave(&dd->dust_lock, flags);
> > + DMINFO("countbadblocks: %llu badblock(s) found",
> > + dd->badblock_count);
> > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > + result = 0;
> > + } else if (!strcasecmp(argv[0], "clearbadblocks")) {
> > + result = dust_clear_badblocks(dd);
> > + } else if (!strcasecmp(argv[0], "quiet")) {
> > + if (!dd->quiet_mode)
> > + dd->quiet_mode = true;
> > + else
> > + dd->quiet_mode = false;
> > + result = 0;
> > + } else {
> > + invalid_msg = true;
> > + }
> > + } else if (argc == 2) {
> > + if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1)
> > + return result;
> > +
> > + block = tmp;
> > +
> > + if (block > size / dd->sect_per_block || block < 0) {
> > + DMERR("selected block value out of range");
> > + return result;
> > + }
> > +
> > + if (!strcasecmp(argv[0], "addbadblock"))
> > + result = dust_add_block(dd, block);
> > + else if (!strcasecmp(argv[0], "removebadblock"))
> > + result = dust_remove_block(dd, block);
> > + else if (!strcasecmp(argv[0], "queryblock"))
> > + result = dust_query_block(dd, block);
> > + else
> > + invalid_msg = true;
> > +
> > + } else {
> > + DMERR("invalid number of arguments '%d'", argc);
> > + }
> > +
> > + if (invalid_msg)
> > + DMERR("unrecognized dmsetup message '%s' received", argv[0]);
> > +
> > + return result;
> > +}
> > +
> > +static void dust_status(struct dm_target *ti, status_type_t type,
> > + unsigned int status_flags, char *result, unsigned int maxlen)
> > +{
> > + struct dust_device *dd = ti->private;
> > + unsigned int sz = 0; // used by the DMEMIT macro
> > +
> > + switch (type) {
> > + case STATUSTYPE_INFO:
> > + DMEMIT("%s %s %s", dd->dev->name,
> > + dd->fail_read_on_bb ? "fail_read_on_bad_block" :
> > + "bypass",
> > + dd->quiet_mode ? "quiet" : "verbose");
> > + break;
> > +
> > + case STATUSTYPE_TABLE:
> > + DMEMIT("%s %u", dd->dev->name, dd->blksz);
> > + break;
> > + }
> > +}
> > +
> > +static int dust_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
> > +{
> > + struct dust_device *dd = ti->private;
> > + struct dm_dev *dev = dd->dev;
> > +
> > + *bdev = dev->bdev;
> > +
> > + // Only pass ioctls through if the device sizes match exactly.
> > + if (ti->len != i_size_read(dev->bdev->bd_inode) >> SECTOR_SHIFT)
> > + return 1;
> > +
> > + return 0;
> > +}
> > +
> > +static int dust_iterate_devices(struct dm_target *ti, iterate_devices_callout_fn fn,
> > + void *data)
> > +{
> > + struct dust_device *dd = ti->private;
> > +
> > + return fn(ti, dd->dev, 0, ti->len, data);
> > +}
> > +
> > +static struct target_type dust_target = {
> > + .name = "dust",
> > + .version = { 1, 0, 0 },
> > + .module = THIS_MODULE,
> > + .ctr = dust_ctr,
> > + .dtr = dust_dtr,
> > + .iterate_devices = dust_iterate_devices,
> > + .io_hints = dust_io_hints,
> > + .map = dust_map,
> > + .message = dust_message,
> > + .status = dust_status,
> > + .prepare_ioctl = dust_prepare_ioctl,
> > +};
> > +
> > +int __init dm_dust_init(void)
> > +{
> > + int result = dm_register_target(&dust_target);
> > +
> > + if (result < 0)
> > + DMERR("dm_register_target failed %d", result);
> > +
> > + return result;
> > +}
> > +
> > +void __exit dm_dust_exit(void)
> > +{
> > + dm_unregister_target(&dust_target);
> > +}
> > +
> > +module_init(dm_dust_init);
> > +module_exit(dm_dust_exit);
> > +
> > +MODULE_DESCRIPTION(DM_NAME " dust testing device");
> > +MODULE_AUTHOR("Red Hat, Inc.");
> > +MODULE_LICENSE("GPL");
> > --
> > 2.17.2
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/1] dm: add dust target
2019-01-08 15:30 ` Bryan Gurney
@ 2019-01-08 16:23 ` Mike Snitzer
2019-01-08 19:52 ` Bryan Gurney
2019-01-08 17:36 ` [RFC PATCH 1/1] dm: add dust target Benjamin Marzinski
1 sibling, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2019-01-08 16:23 UTC (permalink / raw)
To: Bryan Gurney
Cc: John Dorminy, Joe Shimkus, dm-devel, tjaskiew, John Pittman,
Alasdair G Kergon
On Tue, Jan 08 2019 at 10:30am -0500,
Bryan Gurney <bgurney@redhat.com> wrote:
> On Mon, Jan 7, 2019 at 7:10 PM Benjamin Marzinski <bmarzins@redhat.com> wrote:
> >
> > On Mon, Jan 07, 2019 at 02:31:23PM -0500, Bryan Gurney wrote:
> > > +
> > > +static int dust_map_read(struct dust_device *dd, sector_t thisblock,
> > > + bool fail_read_on_bb)
> > > +{
> > > + unsigned long flags;
> > > + struct badblock *bblk;
> > > +
> > > + spin_lock_irqsave(&dd->dust_lock, flags);
> > > + bblk = dust_rb_search(&dd->badblocklist, thisblock);
> >
> > I don't see the benefit of doing dust_rb_search() if fail_read_on_bb
> > isn't set. It just slows the read without using the result.
>
> This is something that I have to admit was unintentional, but I like
> it: it keeps the timing consistent between the "enabled" and
> "disabled" modes (a.k.a.: the "fail_read_on_bad_block" and "bypass"
> modes).
>
> The primary goal of this target is to be a close simulation of a drive
> with bad sectors. Therefore, it would be good to have a consistent
> "seek time" between the two modes. There could be test cases where a
> problem only appears on a "faster" (or "slower") device, but that's
> not what this test device is intended to test.
>
> Considering the above, can we leave this as is? (I can leave a
> comment explaining the desire for timing consistency between the
> "enabled" and "disabled" states.)
But this extra work isn't "seek time", it is CPU time. So no, there is
no benefit to doing this.
> > > + if (fail_read_on_bb && bblk) {
> > > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > > + return DM_MAPIO_KILL;
> > > + }
> > > +
> > > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > > + return DM_MAPIO_REMAPPED;
> > > +}
> > > +
> > > +static int dust_map_write(struct dust_device *dd, sector_t thisblock,
> > > + bool fail_read_on_bb)
> > > +{
> > > + unsigned long flags;
> > > + struct badblock *bblk;
> > > +
> > > + spin_lock_irqsave(&dd->dust_lock, flags);
> > > + bblk = dust_rb_search(&dd->badblocklist, thisblock);
> > > +
> > > + if (fail_read_on_bb && bblk) {
> >
> > The same goes for here as well.
>
> Please see my above comment for dust_map_read().
Nope, see above ;)
> > > +/*
> > > + * Target parameters:
> > > + *
> > > + * <device_path> <blksz>
> > > + *
> > > + * device_path: path to the block device
> > > + * blksz: block size (valid choices: 512, 4096)
> > > + */
> > > +static int dust_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> > > +{
> > > + struct dust_device *dd;
> > > + const char *device_path = argv[0];
> > > + unsigned int blksz;
> > > + unsigned int sect_per_block;
> > > +
> > > + if (argc != 2) {
> > > + ti->error = "requires exactly 2 arguments";
> > > + return -EINVAL;
> > > + }
> > > +
> > > + if (kstrtouint(argv[1], 10, &blksz) || !blksz) {
> > > + ti->error =
> > > + "Invalid block size parameter -- please enter 512 or 4096";
> > > + return -EINVAL;
> > > + }
> >
> > Why check for !blksz above, when the code only allows 512 and 4096? Why
> > not just check for those values in the first "if" statement?
> >
>
> This is actually a two-stage check: the first stage checks for
> non-numeric input:
>
> # dmsetup create dust1 --table '0 10483712 dust /dev/vdb1 foo'
> stdout:
> "device-mapper: reload ioctl on dust1 failed: Invalid argument"
> printk:
> "kernel: device-mapper: table: 253:2: dust: Invalid block size
> parameter -- please enter 512 or 4096"
>
> ...and the second stage checks for the usable block sizes:
>
> # dmsetup create dust1 --table '0 10483712 dust /dev/vdb1 520'
> stdout:
> "device-mapper: reload ioctl on dust1 failed: Invalid argument"
> printk:
> "kernel: device-mapper: table: 253:2: dust: Invalid block size --
> please enter 512 or 4096"
>
> These different messages help to clarify whether the parameter isn't a
> usable integer, or isn't an integer at all. (Perhaps the message are
> a bit too similar).
>
> Given the above, should there still be a 2-stage check, or should this
> be folded into one that simply checks whether blksz is 512 or 4096?
I'm fine with it as is, but I have a different concern: why does this
target need to override the queue_limits at all? What is the benefit to
rigidly imposing the provided 512 or 4096 bytes for all of
{logical_block,physical_block,minimum_io,optimal_io}_size?
If the underlying device is a 4K native device it simply cannot handle
512b IO. At a minimum you need to verify that the underlying device can
even handle 512b IO.
In general, this ability to override the block core's normal limits
stacking (based on underlying device capabilities) needs proper
justification.
Mike
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/1] dm: add dust target
2019-01-08 15:30 ` Bryan Gurney
2019-01-08 16:23 ` Mike Snitzer
@ 2019-01-08 17:36 ` Benjamin Marzinski
1 sibling, 0 replies; 11+ messages in thread
From: Benjamin Marzinski @ 2019-01-08 17:36 UTC (permalink / raw)
To: Bryan Gurney
Cc: Mike Snitzer, John Dorminy, Joe Shimkus, dm-devel, tjaskiew,
John Pittman, Alasdair G Kergon
On Tue, Jan 08, 2019 at 10:30:32AM -0500, Bryan Gurney wrote:
> On Mon, Jan 7, 2019 at 7:10 PM Benjamin Marzinski <bmarzins@redhat.com> wrote:
> >
> > On Mon, Jan 07, 2019 at 02:31:23PM -0500, Bryan Gurney wrote:
> > > +
> > > +static int dust_add_block(struct dust_device *dd, unsigned long long block)
> > > +{
> > > + struct badblock *bblock;
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&dd->dust_lock, flags);
> > > + bblock = dust_rb_search(&dd->badblocklist, block * dd->sect_per_block);
> > > +
> > > + if (bblock != NULL) {
> > > + if (!dd->quiet_mode)
> > > + DMERR("addbadblock: block %llu already in badblocklist",
> > > + block);
> > > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + bblock = kmalloc(sizeof(*bblock), GFP_KERNEL);
> >
> > You can't do a GFP_KERNEL allocation while holding a spinlock. I think
> > it would be better to assume that the user was correctly adding a new
> > block, do the allocation, and then just call dust_rb_insert() under the
> > spinlock. dust_rb_insert() will return false if the block is already
> > inserted
>
> Ah, okay. I'll be spinning a v2 of this patch. (I need to make some
> updates to the dm-dust.txt documentation file as well, so those
> updates will be in there.)
>
> After reading through your description of the timeline, I created this
> revision of dust_add_block() (which compiles, and seems to work
> properly in my test module):
>
> static int dust_add_block(struct dust_device *dd, unsigned long long block)
> {
> struct badblock *bblock;
> unsigned long flags;
>
> bblock = kmalloc(sizeof(*bblock), GFP_KERNEL);
> if (bblock == NULL) {
> if (!dd->quiet_mode)
> DMERR("addbadblock: badblock allocation failed");
> return -ENOMEM;
> }
>
> spin_lock_irqsave(&dd->dust_lock, flags);
> bblock->bb = block * dd->sect_per_block;
> if (!dust_rb_insert(&dd->badblocklist, bblock)) {
> if (!dd->quiet_mode)
> DMERR("addbadblock: block %llu already in badblocklist",
> block);
> spin_unlock_irqrestore(&dd->dust_lock, flags);
> kfree(bblock);
> return -EINVAL;
> }
>
> dd->badblock_count++;
> if (!dd->quiet_mode)
> DMINFO("addbadblock: badblock added at block %llu", block);
> spin_unlock_irqrestore(&dd->dust_lock, flags);
> return 0;
> }
>
> Test results:
>
> # dmsetup message dust1 0 addbadblock 60
> kernel: device-mapper: dust: addbadblock: badblock added at block 60
> # dmsetup message dust1 0 addbadblock 60
> kernel: device-mapper: dust: addbadblock: block 60 already in badblocklist
> # dmsetup message dust1 0 addbadblock 60
> kernel: device-mapper: dust: addbadblock: block 60 already in badblocklist
>
> The kmalloc() is before the spinlock, and if the block is already in
> the rbtree, it unlocks, and frees the unused "bblock". Does this look
> good?
>
Yep. That looks fine (other than your ">" and "&" characters getting
lost in translation)
-Ben
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/1] dm: add dust target
2019-01-08 16:23 ` Mike Snitzer
@ 2019-01-08 19:52 ` Bryan Gurney
2019-01-08 20:29 ` Mike Snitzer
2019-01-09 1:03 ` Damien Le Moal
0 siblings, 2 replies; 11+ messages in thread
From: Bryan Gurney @ 2019-01-08 19:52 UTC (permalink / raw)
To: Mike Snitzer
Cc: John Dorminy, Joe Shimkus, dm-devel, tjaskiew, John Pittman,
Alasdair G Kergon
On Tue, Jan 8, 2019 at 11:23 AM Mike Snitzer <snitzer@redhat.com> wrote:
>
> On Tue, Jan 08 2019 at 10:30am -0500,
> Bryan Gurney <bgurney@redhat.com> wrote:
>
> > On Mon, Jan 7, 2019 at 7:10 PM Benjamin Marzinski <bmarzins@redhat.com> wrote:
> > >
> > > On Mon, Jan 07, 2019 at 02:31:23PM -0500, Bryan Gurney wrote:
> > > > +
> > > > +static int dust_map_read(struct dust_device *dd, sector_t thisblock,
> > > > + bool fail_read_on_bb)
> > > > +{
> > > > + unsigned long flags;
> > > > + struct badblock *bblk;
> > > > +
> > > > + spin_lock_irqsave(&dd->dust_lock, flags);
> > > > + bblk = dust_rb_search(&dd->badblocklist, thisblock);
> > >
> > > I don't see the benefit of doing dust_rb_search() if fail_read_on_bb
> > > isn't set. It just slows the read without using the result.
> >
> > This is something that I have to admit was unintentional, but I like
> > it: it keeps the timing consistent between the "enabled" and
> > "disabled" modes (a.k.a.: the "fail_read_on_bad_block" and "bypass"
> > modes).
> >
> > The primary goal of this target is to be a close simulation of a drive
> > with bad sectors. Therefore, it would be good to have a consistent
> > "seek time" between the two modes. There could be test cases where a
> > problem only appears on a "faster" (or "slower") device, but that's
> > not what this test device is intended to test.
> >
> > Considering the above, can we leave this as is? (I can leave a
> > comment explaining the desire for timing consistency between the
> > "enabled" and "disabled" states.)
>
> But this extra work isn't "seek time", it is CPU time. So no, there is
> no benefit to doing this.
>
OK; I will fix it to avoid searching the rbtree if the device is in
the "disabled" mode.
> > > > + if (fail_read_on_bb && bblk) {
> > > > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > > > + return DM_MAPIO_KILL;
> > > > + }
> > > > +
> > > > + spin_unlock_irqrestore(&dd->dust_lock, flags);
> > > > + return DM_MAPIO_REMAPPED;
> > > > +}
> > > > +
> > > > +static int dust_map_write(struct dust_device *dd, sector_t thisblock,
> > > > + bool fail_read_on_bb)
> > > > +{
> > > > + unsigned long flags;
> > > > + struct badblock *bblk;
> > > > +
> > > > + spin_lock_irqsave(&dd->dust_lock, flags);
> > > > + bblk = dust_rb_search(&dd->badblocklist, thisblock);
> > > > +
> > > > + if (fail_read_on_bb && bblk) {
> > >
> > > The same goes for here as well.
> >
> > Please see my above comment for dust_map_read().
>
> Nope, see above ;)
>
> > > > +/*
> > > > + * Target parameters:
> > > > + *
> > > > + * <device_path> <blksz>
> > > > + *
> > > > + * device_path: path to the block device
> > > > + * blksz: block size (valid choices: 512, 4096)
> > > > + */
> > > > +static int dust_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> > > > +{
> > > > + struct dust_device *dd;
> > > > + const char *device_path = argv[0];
> > > > + unsigned int blksz;
> > > > + unsigned int sect_per_block;
> > > > +
> > > > + if (argc != 2) {
> > > > + ti->error = "requires exactly 2 arguments";
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + if (kstrtouint(argv[1], 10, &blksz) || !blksz) {
> > > > + ti->error =
> > > > + "Invalid block size parameter -- please enter 512 or 4096";
> > > > + return -EINVAL;
> > > > + }
> > >
> > > Why check for !blksz above, when the code only allows 512 and 4096? Why
> > > not just check for those values in the first "if" statement?
> > >
> >
> > This is actually a two-stage check: the first stage checks for
> > non-numeric input:
> >
> > # dmsetup create dust1 --table '0 10483712 dust /dev/vdb1 foo'
> > stdout:
> > "device-mapper: reload ioctl on dust1 failed: Invalid argument"
> > printk:
> > "kernel: device-mapper: table: 253:2: dust: Invalid block size
> > parameter -- please enter 512 or 4096"
> >
> > ...and the second stage checks for the usable block sizes:
> >
> > # dmsetup create dust1 --table '0 10483712 dust /dev/vdb1 520'
> > stdout:
> > "device-mapper: reload ioctl on dust1 failed: Invalid argument"
> > printk:
> > "kernel: device-mapper: table: 253:2: dust: Invalid block size --
> > please enter 512 or 4096"
> >
> > These different messages help to clarify whether the parameter isn't a
> > usable integer, or isn't an integer at all. (Perhaps the message are
> > a bit too similar).
> >
> > Given the above, should there still be a 2-stage check, or should this
> > be folded into one that simply checks whether blksz is 512 or 4096?
>
> I'm fine with it as is, but I have a different concern: why does this
> target need to override the queue_limits at all? What is the benefit to
> rigidly imposing the provided 512 or 4096 bytes for all of
> {logical_block,physical_block,minimum_io,optimal_io}_size?
>
This is a bit of a complex question, which may cover multiple areas.
I unfortunately don't have as much expertise in the "backend" of the
original test target code at the heart of dm-dust; I'm still learning
it as I go.
On the "frontend": The "bad block list" has a "grain size" of the
block size of the device (therefore, "block 1" of a 4096-byte block
device starts at byte offset 4096, while "block 1" of a 512-byte block
device starts at byte offset 512.)
There is definitely value in dm-dust emulating a 4K native device, as
it will help to reproduce bugs. In the absence of a real device that
is 4K native, it can be created on top of a 512-byte logical block
size device to create a "512-byte write detector", of sorts.
(An early version of dm-dust was actually a reproducer for the commit
"isofs: reject hardware sector size > 2048 bytes".)
> If the underlying device is a 4K native device it simply cannot handle
> 512b IO. At a minimum you need to verify that the underlying device can
> even handle 512b IO.
>
This specific case is a bug, and I want to fix it. (During earlier
testing, I was wondering why it was possible to set up a 512-byte
dm-dust device on top of a 4K native device.)
I need to add a check in the "dmsetup create" process, to verify if
the underlying device's logical_block_size is small enough to handle
the minimum I/O size.
Thanks,
Bryan
> In general, this ability to override the block core's normal limits
> stacking (based on underlying device capabilities) needs proper
> justification.
>
> Mike
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/1] dm: add dust target
2019-01-08 19:52 ` Bryan Gurney
@ 2019-01-08 20:29 ` Mike Snitzer
2019-01-09 1:03 ` Damien Le Moal
1 sibling, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2019-01-08 20:29 UTC (permalink / raw)
To: Bryan Gurney
Cc: John Dorminy, Joe Shimkus, dm-devel, tjaskiew, John Pittman,
Alasdair G Kergon
On Tue, Jan 08 2019 at 2:52pm -0500,
Bryan Gurney <bgurney@redhat.com> wrote:
> On Tue, Jan 8, 2019 at 11:23 AM Mike Snitzer <snitzer@redhat.com> wrote:
> >
> > I'm fine with it as is, but I have a different concern: why does this
> > target need to override the queue_limits at all? What is the benefit to
> > rigidly imposing the provided 512 or 4096 bytes for all of
> > {logical_block,physical_block,minimum_io,optimal_io}_size?
> >
>
> This is a bit of a complex question, which may cover multiple areas.
> I unfortunately don't have as much expertise in the "backend" of the
> original test target code at the heart of dm-dust; I'm still learning
> it as I go.
>
> On the "frontend": The "bad block list" has a "grain size" of the
> block size of the device (therefore, "block 1" of a 4096-byte block
> device starts at byte offset 4096, while "block 1" of a 512-byte block
> device starts at byte offset 512.)
>
> There is definitely value in dm-dust emulating a 4K native device, as
> it will help to reproduce bugs. In the absence of a real device that
> is 4K native, it can be created on top of a 512-byte logical block
> size device to create a "512-byte write detector", of sorts.
>
> (An early version of dm-dust was actually a reproducer for the commit
> "isofs: reject hardware sector size > 2048 bytes".)
>
> > If the underlying device is a 4K native device it simply cannot handle
> > 512b IO. At a minimum you need to verify that the underlying device can
> > even handle 512b IO.
> >
>
> This specific case is a bug, and I want to fix it. (During earlier
> testing, I was wondering why it was possible to set up a 512-byte
> dm-dust device on top of a 4K native device.)
>
> I need to add a check in the "dmsetup create" process, to verify if
> the underlying device's logical_block_size is small enough to handle
> the minimum I/O size.
Would prefer to see this target accept any "block size" and _not_ use
that size for the queue_limits. Quite a few DM targets provide such
arbitrary addressing (e.g. dm-stripe, dm-thin, dm-cache).
You'd then just verify that the user provided "block size" isn't smaller
than logical_block_size.
Any novelty of allowing the user to specify 512b vs 4K for the block
size could easily be provided by something basic like scsi_debug as the
underlying device.
Mike
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/1] dm: add dust target
2019-01-08 19:52 ` Bryan Gurney
2019-01-08 20:29 ` Mike Snitzer
@ 2019-01-09 1:03 ` Damien Le Moal
2019-01-09 14:29 ` SMR 512e drive firmware advertising misleading limits? [was: Re: [RFC PATCH 1/1] dm: add dust target] Mike Snitzer
1 sibling, 1 reply; 11+ messages in thread
From: Damien Le Moal @ 2019-01-09 1:03 UTC (permalink / raw)
To: Bryan Gurney, Mike Snitzer
Cc: John Dorminy, Joe Shimkus, dm-devel@redhat.com,
tjaskiew@redhat.com, John Pittman, Alasdair G Kergon
On 2019/01/09 4:53, Bryan Gurney wrote:
> On Tue, Jan 8, 2019 at 11:23 AM Mike Snitzer <snitzer@redhat.com> wrote:
>>
>> On Tue, Jan 08 2019 at 10:30am -0500,
>> Bryan Gurney <bgurney@redhat.com> wrote:
>>
>>> On Mon, Jan 7, 2019 at 7:10 PM Benjamin Marzinski <bmarzins@redhat.com> wrote:
>>>>
>>>> On Mon, Jan 07, 2019 at 02:31:23PM -0500, Bryan Gurney wrote:
>>>>> +
>>>>> +static int dust_map_read(struct dust_device *dd, sector_t thisblock,
>>>>> + bool fail_read_on_bb)
>>>>> +{
>>>>> + unsigned long flags;
>>>>> + struct badblock *bblk;
>>>>> +
>>>>> + spin_lock_irqsave(&dd->dust_lock, flags);
>>>>> + bblk = dust_rb_search(&dd->badblocklist, thisblock);
>>>>
>>>> I don't see the benefit of doing dust_rb_search() if fail_read_on_bb
>>>> isn't set. It just slows the read without using the result.
>>>
>>> This is something that I have to admit was unintentional, but I like
>>> it: it keeps the timing consistent between the "enabled" and
>>> "disabled" modes (a.k.a.: the "fail_read_on_bad_block" and "bypass"
>>> modes).
>>>
>>> The primary goal of this target is to be a close simulation of a drive
>>> with bad sectors. Therefore, it would be good to have a consistent
>>> "seek time" between the two modes. There could be test cases where a
>>> problem only appears on a "faster" (or "slower") device, but that's
>>> not what this test device is intended to test.
>>>
>>> Considering the above, can we leave this as is? (I can leave a
>>> comment explaining the desire for timing consistency between the
>>> "enabled" and "disabled" states.)
>>
>> But this extra work isn't "seek time", it is CPU time. So no, there is
>> no benefit to doing this.
>>
>
> OK; I will fix it to avoid searching the rbtree if the device is in
> the "disabled" mode.
I agree with Mike here. And considering "seek" time when dealing with bad
sectors emulation does not really matter anyway. Seek happens only once, for the
head to go above the track containing the sector. After that, the command
execution time when a read hits a bad sector is largely dominated by rotational
latency as the disk retries reading the bad sector again and again repeatedly.
That means every read takes one revolution of the platter (8.3 ms on a regular
7200 rpm disk). The total execution time can literally reach seconds for a
single command as the disk goes through different stages of error recovering
procedures (which can succeed, or fail in the end). So even keeping the rbtree
search, you would be very far off a precise emulation.
>
>>>>> + if (fail_read_on_bb && bblk) {
>>>>> + spin_unlock_irqrestore(&dd->dust_lock, flags);
>>>>> + return DM_MAPIO_KILL;
>>>>> + }
>>>>> +
>>>>> + spin_unlock_irqrestore(&dd->dust_lock, flags);
>>>>> + return DM_MAPIO_REMAPPED;
>>>>> +}
>>>>> +
>>>>> +static int dust_map_write(struct dust_device *dd, sector_t thisblock,
>>>>> + bool fail_read_on_bb)
>>>>> +{
>>>>> + unsigned long flags;
>>>>> + struct badblock *bblk;
>>>>> +
>>>>> + spin_lock_irqsave(&dd->dust_lock, flags);
>>>>> + bblk = dust_rb_search(&dd->badblocklist, thisblock);
>>>>> +
>>>>> + if (fail_read_on_bb && bblk) {
>>>>
>>>> The same goes for here as well.
>>>
>>> Please see my above comment for dust_map_read().
>>
>> Nope, see above ;)
>>
>>>>> +/*
>>>>> + * Target parameters:
>>>>> + *
>>>>> + * <device_path> <blksz>
>>>>> + *
>>>>> + * device_path: path to the block device
>>>>> + * blksz: block size (valid choices: 512, 4096)
>>>>> + */
>>>>> +static int dust_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>>>>> +{
>>>>> + struct dust_device *dd;
>>>>> + const char *device_path = argv[0];
>>>>> + unsigned int blksz;
>>>>> + unsigned int sect_per_block;
>>>>> +
>>>>> + if (argc != 2) {
>>>>> + ti->error = "requires exactly 2 arguments";
>>>>> + return -EINVAL;
>>>>> + }
>>>>> +
>>>>> + if (kstrtouint(argv[1], 10, &blksz) || !blksz) {
>>>>> + ti->error =
>>>>> + "Invalid block size parameter -- please enter 512 or 4096";
>>>>> + return -EINVAL;
>>>>> + }
>>>>
>>>> Why check for !blksz above, when the code only allows 512 and 4096? Why
>>>> not just check for those values in the first "if" statement?
>>>>
>>>
>>> This is actually a two-stage check: the first stage checks for
>>> non-numeric input:
>>>
>>> # dmsetup create dust1 --table '0 10483712 dust /dev/vdb1 foo'
>>> stdout:
>>> "device-mapper: reload ioctl on dust1 failed: Invalid argument"
>>> printk:
>>> "kernel: device-mapper: table: 253:2: dust: Invalid block size
>>> parameter -- please enter 512 or 4096"
>>>
>>> ...and the second stage checks for the usable block sizes:
>>>
>>> # dmsetup create dust1 --table '0 10483712 dust /dev/vdb1 520'
>>> stdout:
>>> "device-mapper: reload ioctl on dust1 failed: Invalid argument"
>>> printk:
>>> "kernel: device-mapper: table: 253:2: dust: Invalid block size --
>>> please enter 512 or 4096"
>>>
>>> These different messages help to clarify whether the parameter isn't a
>>> usable integer, or isn't an integer at all. (Perhaps the message are
>>> a bit too similar).
>>>
>>> Given the above, should there still be a 2-stage check, or should this
>>> be folded into one that simply checks whether blksz is 512 or 4096?
>>
>> I'm fine with it as is, but I have a different concern: why does this
>> target need to override the queue_limits at all? What is the benefit to
>> rigidly imposing the provided 512 or 4096 bytes for all of
>> {logical_block,physical_block,minimum_io,optimal_io}_size?
>>
>
> This is a bit of a complex question, which may cover multiple areas.
> I unfortunately don't have as much expertise in the "backend" of the
> original test target code at the heart of dm-dust; I'm still learning
> it as I go.
>
> On the "frontend": The "bad block list" has a "grain size" of the
> block size of the device (therefore, "block 1" of a 4096-byte block
> device starts at byte offset 4096, while "block 1" of a 512-byte block
> device starts at byte offset 512.)
>
> There is definitely value in dm-dust emulating a 4K native device, as
> it will help to reproduce bugs. In the absence of a real device that
> is 4K native, it can be created on top of a 512-byte logical block
> size device to create a "512-byte write detector", of sorts.
I do not understand this statement. If you emulate a 4K phys sector disk on top
of a 512 native disk, emulating the errors based on bad sectors set with a
granularity of 512B do not make sense since the real drive will NEVER report
partially good or bad sectors. The entire 4K is good or it is not. So the
emulation side should only manage 4K entries, regardless of the underlying disk
actual LBA size.
Also please note that these days, most enterprise class disks can have there
sector size changed between 4K and 512B with a low level format operation.
You may want to try that on the disks you have.
>
> (An early version of dm-dust was actually a reproducer for the commit
> "isofs: reject hardware sector size > 2048 bytes".)
>
>> If the underlying device is a 4K native device it simply cannot handle
>> 512b IO. At a minimum you need to verify that the underlying device can
>> even handle 512b IO.
>>
>
> This specific case is a bug, and I want to fix it. (During earlier
> testing, I was wondering why it was possible to set up a 512-byte
> dm-dust device on top of a 4K native device.)
>
> I need to add a check in the "dmsetup create" process, to verify if
> the underlying device's logical_block_size is small enough to handle
> the minimum I/O size.
You may want to check against the physical_block_size, and not the logical. SMR
disks that are 512 e (512B logical and 4K physical) can handle reads in 512B
units but writes have to be 4K. These are exceptions though and kind of breaking
all LBA definitions known to men... This check on the physical_block_size can be
limited to setups where the underlying disk is SMR.
>
>
> Thanks,
>
> Bryan
>
>> In general, this ability to override the block core's normal limits
>> stacking (based on underlying device capabilities) needs proper
>> justification.
>>
>> Mike
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 11+ messages in thread
* SMR 512e drive firmware advertising misleading limits? [was: Re: [RFC PATCH 1/1] dm: add dust target]
2019-01-09 1:03 ` Damien Le Moal
@ 2019-01-09 14:29 ` Mike Snitzer
2019-01-10 4:58 ` Damien Le Moal
0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2019-01-09 14:29 UTC (permalink / raw)
To: Damien Le Moal
Cc: John Dorminy, Joe Shimkus, dm-devel@redhat.com,
tjaskiew@redhat.com, John Pittman, Alasdair G Kergon,
Bryan Gurney
On Tue, Jan 08 2019 at 8:03pm -0500,
Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
> You may want to check against the physical_block_size, and not the logical. SMR
> disks that are 512 e (512B logical and 4K physical) can handle reads in 512B
> units but writes have to be 4K. These are exceptions though and kind of breaking
> all LBA definitions known to men... This check on the physical_block_size can be
> limited to setups where the underlying disk is SMR.
Damien, this sounds like a serious issue with SMR firmware. If the
512e SMR drive cannot handle 512b writes then it has no business
advertising 512B logical. It should be advertising 4K logical.
What am I missing?
Mike
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SMR 512e drive firmware advertising misleading limits? [was: Re: [RFC PATCH 1/1] dm: add dust target]
2019-01-09 14:29 ` SMR 512e drive firmware advertising misleading limits? [was: Re: [RFC PATCH 1/1] dm: add dust target] Mike Snitzer
@ 2019-01-10 4:58 ` Damien Le Moal
0 siblings, 0 replies; 11+ messages in thread
From: Damien Le Moal @ 2019-01-10 4:58 UTC (permalink / raw)
To: Mike Snitzer
Cc: John Dorminy, Shimkus, dm-devel@redhat.com, Joe,
tjaskiew@redhat.com, John Pittman, Alasdair G Kergon,
Bryan Gurney
Mike,
On 2019/01/09 23:29, Mike Snitzer wrote:
> On Tue, Jan 08 2019 at 8:03pm -0500,
> Damien Le Moal <Damien.LeMoal@wdc.com> wrote:
>
>> You may want to check against the physical_block_size, and not the logical. SMR
>> disks that are 512 e (512B logical and 4K physical) can handle reads in 512B
>> units but writes have to be 4K. These are exceptions though and kind of breaking
>> all LBA definitions known to men... This check on the physical_block_size can be
>> limited to setups where the underlying disk is SMR.
>
> Damien, this sounds like a serious issue with SMR firmware. If the
> 512e SMR drive cannot handle 512b writes then it has no business
> advertising 512B logical. It should be advertising 4K logical.
>
> What am I missing?
You are absolutely correct, and I am with you on this one ! 512e disks are an
insanity for SMR and should not exist. But unfortunately, they do. The problem
is not the disk firmware, but the specs. ZBC and ZAC clearly state that writes
into sequential zones have to be aligned on physical sector boundaries. This
constraint applies only to sequential zones. Conventional zones can and should
handle any write request, even those not aligned on 4K boundaries on 512e disks.
Form ZBC r05:
"4.4.3.4.1 Writing in sequential write required zones
[...]
If the device server processes a write command with:
a) the starting LBA in a sequential write required zone set to a value
that is in the write required zone but
that is not equal to the write pointer; or
b) an ending LBA that is not equal to the last logical block within a
physical block (see SBC-4),
then the device server shall not write any data and shall terminate the command
with CHECK CONDITION status, with the sense key set to ILLEGAL REQUEST, and the
additional sense code set to UNALIGNED WRITE COMMAND."
This kind of make sense considering that track shingling allows sectors to be
written only once, but it seems that nobody in the standard committee clearly
realized the implications of this on 512e disk models. And so we end up with
this situation were everything is fine for 512n and 4Kn, but 512e is definitely
broken.
Disk FW implementation could handle partial 4K sector writes with some caching
since only the physical sectors containing the LBA pointed by zone write
pointers need caching. That is at most as many sectors as the number of
sequential zones. But the specs are like this now, and the disk FW
implementation has no choice but to follow the text.
Note that we discussed in the past adding checks in the block stack or scsi
stack for writes that are not 4K aligned on 512e SMR disks, but in the end
decided against it. That does not buy us anything and so we let the drive fail
these unaligned writes instead of failing the commands before dispatching.
Luckily, all the in kernel software we have supporting SMR use 4K writes (f2fs,
btrfs prototype and dm-zoned). So this problem never really triggers.
Best regards.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2019-01-10 4:58 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-01-07 19:31 [RFC PATCH 0/1] dm: add dm-dust, a bad sector emulator Bryan Gurney
2019-01-07 19:31 ` [RFC PATCH 1/1] dm: add dust target Bryan Gurney
2019-01-08 0:10 ` Benjamin Marzinski
2019-01-08 15:30 ` Bryan Gurney
2019-01-08 16:23 ` Mike Snitzer
2019-01-08 19:52 ` Bryan Gurney
2019-01-08 20:29 ` Mike Snitzer
2019-01-09 1:03 ` Damien Le Moal
2019-01-09 14:29 ` SMR 512e drive firmware advertising misleading limits? [was: Re: [RFC PATCH 1/1] dm: add dust target] Mike Snitzer
2019-01-10 4:58 ` Damien Le Moal
2019-01-08 17:36 ` [RFC PATCH 1/1] dm: add dust target Benjamin Marzinski
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.