* data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
@ 2018-07-23 16:33 Mike Snitzer
  2018-07-24  6:00 ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2018-07-23 16:33 UTC (permalink / raw)
  To: linux-nvme, linux-block, dm-devel

Hi,

I've opened the following public BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1607527

Feel free to add comments to that BZ if you have a redhat bugzilla
account.

But otherwise, happy to get as much feedback and discussion going purely
on the relevant lists.  I've taken ~1.5 weeks to categorize and isolate
this issue.  But I've reached a point where I'm getting diminishing
returns and could _really_ use the collective eyeballs and expertise of
the community.  This is by far one of the nastiest cases of corruption
I've seen in a while.  Not sure where the ultimate cause of corruption
lies (that's the money question) but it _feels_ rooted in NVMe and is
unique to this particular workload, which I stumbled onto via customer
escalation and then trying to replicate an rbd device using a more
approachable one (request-based DM multipath in this case).

From the BZ's comment#0:

The following occurs with latest v4.18-rc3 and v4.18-rc6 and also occurs
with v4.15.

When corruption occurs from this test it also destroys the DOS partition
table (created during step 0 below).  Yeah, corruption is _that_ bad.
Almost like the corruption is temporal (recently accessed regions of the
NVMe device)?

Anyway: I stumbled onto rampant corruption when using request-based DM
multipath on top of an NVMe device (not exclusive to a particular drive
either; it happens with NVMe devices from multiple vendors).  But the
corruption only occurs if the request-based multipath IO is issued to an
NVMe device in parallel with other IO issued to the _same_ underlying
NVMe device by the DM cache target.

See the topology detailed below (at the very end of this comment);
basically all 3 devices that are used to create a DM cache device need
to be backed by the same NVMe device (via partitions or linear volumes).
Again, using request-based DM multipath for dm-cache's "slow" device is
_required_ to reproduce.  Not 100% clear why really... other than
request-based DM multipath builds large IOs (due to merging).

--- Additional comment from Mike Snitzer on 2018-07-20 10:14:09 EDT ---

To reproduce this issue using device-mapper-test-suite:

0) Partition an NVMe device.  First primary partition with at least 5GB,
second primary partition with at least 48GB.

NOTE: larger partitions (e.g. 1: 50GB, 2: >= 220GB) can be used to
reproduce the XFS corruption much quicker.
1) Create a request-based multipath device on top of an NVMe device,
e.g.:

#!/bin/sh
modprobe dm-service-time
DEVICE=/dev/nvme1n1p2
SIZE=`blockdev --getsz $DEVICE`
echo "0 $SIZE multipath 2 queue_mode mq 0 1 1 service-time 0 1 2 $DEVICE 1000 1" | dmsetup create nvme_mpath

# Just a note for how to fail/reinstate the path:
# dmsetup message nvme_mpath 0 "fail_path $DEVICE"
# dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

2) Checkout device-mapper-test-suite from my github repo:

git clone git://github.com/snitm/device-mapper-test-suite.git
cd device-mapper-test-suite
git checkout -b devel origin/devel

3) Follow device-mapper-test-suite's README.md to get it all setup.

4) Configure /root/.dmtest/config with something like:

profile :nvme_shared do
  metadata_dev '/dev/nvme1n1p1'
  #data_dev '/dev/nvme1n1p2'
  data_dev '/dev/mapper/nvme_mpath'
end

default_profile :nvme_shared

------

NOTE: the configured 'metadata_dev' gets carved up by
device-mapper-test-suite to provide both dm-cache's metadata device and
the "fast" data device.  The configured 'data_dev' is used for
dm-cache's "slow" data device.

5) Run the test:

# tail -f /var/log/messages &
# time dmtest run --suite cache -n /split_large_file/

6) If the multipath device failed the lone NVMe path you'll need to
reinstate the path before the next iteration of your test, e.g. (from #1
above):

dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

(A consolidated script sketch for steps 0-6 follows the topology
listings at the end of this message.)

--- Additional comment from Mike Snitzer on 2018-07-20 12:02:45 EDT ---

(In reply to Mike Snitzer from comment #6)

> SO seems pretty clear something is still wrong with request-based DM
> multipath ontop of NVMe... sadly we don't have any negative check in
> blk-core, NVMe or elsewhere to offer any clue :(

Building on this comment:

"Anyway, fact that I'm getting this corruption on multiple different
NVMe drives: I am definitely concerned that this BZ is due to a bug
somewhere in NVMe core (or block core code that is specific to NVMe)."

I'm left thinking that request-based DM multipath is somehow causing
NVMe's SG lists or other infrastructure to be "wrong", and that this is
resulting in corruption.

I get corruption on dm-cache's metadata device (which is theoretically
unrelated, as it is a separate device from the "slow" dm-cache data
device) if the dm-cache slow data device is backed by request-based
dm-multipath on top of NVMe (which is a partition from the _same_ NVMe
device that is used by the dm-cache metadata device).

Basically I'm back to thinking NVMe is corrupting the data due to the IO
pattern or nature of the cloned requests dm-multipath is issuing.  And
it is causing corruption to other NVMe partitions on the same parent
NVMe device.  Certainly that is a concerning hypothesis, but I'm not
seeing much else that would explain this weird corruption.

If I don't use the same NVMe device (with multiple partitions) for _all_
3 sub-devices that dm-cache needs, I don't see the corruption.

It is almost like the mix of IO issued for DM cache's metadata device
(on nvme1n1p1 via a dm-linear volume) and "fast" data device (also on
nvme1n1p1 via a dm-linear volume), in conjunction with the IO issued by
request-based DM multipath to NVMe for the "slow" device (on nvme1n1p2),
is triggering NVMe to respond negatively.
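One way to narrow down whether the corrupting writes are even landing
where they were aimed would be to trace the whole NVMe device while the
test runs and compare the sector ranges that actually reach the hardware
against the partition boundaries exported in sysfs.  A rough sketch
(device names as used above):

# partition boundaries, in 512-byte sectors
cat /sys/block/nvme1n1/nvme1n1p1/start /sys/block/nvme1n1/nvme1n1p1/size
cat /sys/block/nvme1n1/nvme1n1p2/start /sys/block/nvme1n1/nvme1n1p2/size

# trace everything hitting the whole device for the duration of one
# dmtest iteration, then stop the trace
blktrace -d /dev/nvme1n1 -o nvme1n1 &
# ... run the test ...
kill %1

# dump the trace and inspect the sector + length columns for writes
# that fall outside the partition they were supposedly destined for
blkparse -i nvme1n1 -o nvme1n1.parsed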
But this same observation can be made on completely different hardware
using 2 totally different NVMe devices:

testbed1: Intel Corporation Optane SSD 900P Series (2700)
testbed2: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)

Which is why it feels like some bug in Linux (be it dm-rq.c, blk-core.c,
blk-merge.c or the common NVMe driver).

Topology before starting the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1          259:1    0 745.2G  0 disk
├─nvme1n1p2      259:5    0 695.2G  0 part
│ └─nvme_mpath   253:2    0 695.2G  0 dm
└─nvme1n1p1      259:4    0    50G  0 part

Topology during the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME                     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                  259:1    0 745.2G  0 disk
├─nvme1n1p2              259:5    0 695.2G  0 part
│ └─nvme_mpath           253:2    0 695.2G  0 dm
│   └─test-dev-458572    253:5    0    48G  0 dm
│     └─test-dev-613083  253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
└─nvme1n1p1              259:4    0    50G  0 part
  ├─test-dev-126378      253:4    0     4G  0 dm
  │ └─test-dev-613083    253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
  └─test-dev-652491      253:3    0    40M  0 dm
    └─test-dev-613083    253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds

Pruning that tree a bit (removing the dm-cache device 253:6) for
clarity:

# lsblk /dev/nvme1n1
NAME                     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                  259:1    0 745.2G  0 disk
├─nvme1n1p2              259:5    0 695.2G  0 part
│ └─nvme_mpath           253:2    0 695.2G  0 dm
│   └─test-dev-458572    253:5    0    48G  0 dm
└─nvme1n1p1              259:4    0    50G  0 part
  ├─test-dev-126378      253:4    0     4G  0 dm
  └─test-dev-652491      253:3    0    40M  0 dm

40M device is the dm-cache "metadata" device
 4G device is the dm-cache "fast" data device
48G device is the dm-cache "slow" data device

^ permalink raw reply	[flat|nested] 11+ messages in thread
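To tie the reproduction steps above together, here is a minimal driver
sketch.  The parted sizes, the loop count and the partition-table check
are assumptions layered on top of steps 0-6; the dmsetup and dmtest
invocations are taken verbatim from them, and steps 2-4 (test-suite
checkout and /root/.dmtest/config) are still assumed to have been done
by hand:

#!/bin/sh
# sketch: partition the device (step 0), build the request-based
# multipath device (step 1), then loop the dm-cache test (steps 5-6),
# checking the DOS partition table after every iteration since the
# corruption has been seen to destroy it.
DISK=/dev/nvme1n1
DEVICE=${DISK}p2

parted -s $DISK mklabel msdos \
	mkpart primary 1MiB 50GiB \
	mkpart primary 50GiB 100%

modprobe dm-service-time
SIZE=`blockdev --getsz $DEVICE`
echo "0 $SIZE multipath 2 queue_mode mq 0 1 1 service-time 0 1 2 $DEVICE 1000 1" \
	| dmsetup create nvme_mpath

sfdisk -d $DISK > /tmp/ptable.orig

for i in 1 2 3 4 5; do
	dmtest run --suite cache -n /split_large_file/
	dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"
	sfdisk -d $DISK > /tmp/ptable.now
	cmp -s /tmp/ptable.orig /tmp/ptable.now || \
		echo "partition table changed after iteration $i"
done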
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device 2018-07-23 16:33 data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device Mike Snitzer @ 2018-07-24 6:00 ` Hannes Reinecke 2018-07-24 13:07 ` Mike Snitzer 2018-07-24 14:25 ` Bart Van Assche 0 siblings, 2 replies; 11+ messages in thread From: Hannes Reinecke @ 2018-07-24 6:00 UTC (permalink / raw) To: Mike Snitzer, linux-nvme, linux-block, dm-devel On 07/23/2018 06:33 PM, Mike Snitzer wrote: > Hi, > > I've opened the following public BZ: > https://bugzilla.redhat.com/show_bug.cgi?id=1607527 > > Feel free to add comments to that BZ if you have a redhat bugzilla > account. > > But otherwise, happy to get as much feedback and discussion going purely > on the relevant lists. I've taken ~1.5 weeks to categorize and isolate > this issue. But I've reached a point where I'm getting diminishing > returns and could _really_ use the collective eyeballs and expertise of > the community. This is by far one of the most nasty cases of corruption > I've seen in a while. Not sure where the ultimate cause of corruption > lies (that the money question) but it _feels_ rooted in NVMe and is > unique to this particular workload I've stumbled onto via customer > escalation and then trying to replicate an rbd device using a more > approachable one (request-based DM multipath in this case). > I might be stating the obvious, but so far we only have considered request-based multipath as being active for the _entire_ device. To my knowledge we've never tested that when running on a partition. So, have you tested that request-based multipathing works on a partition _at all_? I'm not sure if partition mapping is done correctly here; we never remap the start of the request (nor bio, come to speak of it), so it looks as if we would be doing the wrong things here. Have you checked that partition remapping is done correctly? Cheers, Hannes ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24  6:00 ` Hannes Reinecke
@ 2018-07-24 13:07   ` Mike Snitzer
  2018-07-24 13:22     ` Laurence Oberman
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Mike Snitzer @ 2018-07-24 13:07 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-nvme, linux-block, dm-devel

On Tue, Jul 24 2018 at  2:00am -0400,
Hannes Reinecke <hare@suse.de> wrote:

> On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> >Hi,
> >
> >I've opened the following public BZ:
> >https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> >
> >Feel free to add comments to that BZ if you have a redhat bugzilla
> >account.
> >
> >But otherwise, happy to get as much feedback and discussion going purely
> >on the relevant lists.  I've taken ~1.5 weeks to categorize and isolate
> >this issue.  But I've reached a point where I'm getting diminishing
> >returns and could _really_ use the collective eyeballs and expertise of
> >the community.  This is by far one of the most nasty cases of corruption
> >I've seen in a while.  Not sure where the ultimate cause of corruption
> >lies (that the money question) but it _feels_ rooted in NVMe and is
> >unique to this particular workload I've stumbled onto via customer
> >escalation and then trying to replicate an rbd device using a more
> >approachable one (request-based DM multipath in this case).
> >
> I might be stating the obvious, but so far we only have considered
> request-based multipath as being active for the _entire_ device.
> To my knowledge we've never tested that when running on a partition.

True.  We only ever support mapping the partitions on top of
request-based multipath (via dm-linear volumes created by kpartx).

> So, have you tested that request-based multipathing works on a
> partition _at all_?  I'm not sure if partition mapping is done
> correctly here; we never remap the start of the request (nor bio,
> come to speak of it), so it looks as if we would be doing the wrong
> things here.
>
> Have you checked that partition remapping is done correctly?

It clearly doesn't work.  Not quite following why, but...

After running the test the partition table at the start of the whole
NVMe device is overwritten by XFS.  So likely the IO destined for
dm-cache's "slow" device (the dm-mpath device on the NVMe partition)
was issued to the whole NVMe device:

# pvcreate /dev/nvme1n1
WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it? [y/n]

# vgcreate test /dev/nvme1n1
# lvcreate -n slow -L 512G test
WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe it?
[y/n]: y
  Wiping xfs signature on /dev/test/slow.
  Logical volume "slow" created.

Isn't this a failing of block core's partitioning?  Why should a target
that is given an entire partition of a device need to be concerned with
remapping IO?  Shouldn't block core handle that mapping?

Anyway, yesterday I went so far as to hack together request-based
support for DM linear (because request-based DM cannot stack on
bio-based DM).
With this, request-based linear devices instead of conventional
partitioning, I no longer see the XFS corruption when running the test:

 drivers/md/dm-linear.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index d10964d41fd7..d4a65dd20c6e 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -12,6 +12,7 @@
 #include <linux/dax.h>
 #include <linux/slab.h>
 #include <linux/device-mapper.h>
+#include <linux/blk-mq.h>
 
 #define DM_MSG_PREFIX "linear"
 
@@ -24,7 +25,7 @@ struct linear_c {
 };
 
 /*
- * Construct a linear mapping: <dev_path> <offset>
+ * Construct a linear mapping: <dev_path> <offset> [<# optional params> <optional params>]
  */
 static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 {
@@ -57,6 +58,11 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 	}
 
+	// FIXME: need to parse optional args
+	// FIXME: model alloc_multipath_stage2()?
+	// Call: dm_table_set_type()
+	dm_table_set_type(ti->table, DM_TYPE_MQ_REQUEST_BASED);
+
 	ti->num_flush_bios = 1;
 	ti->num_discard_bios = 1;
 	ti->num_secure_erase_bios = 1;
@@ -113,6 +119,37 @@ static int linear_end_io(struct dm_target *ti, struct bio *bio,
 	return DM_ENDIO_DONE;
 }
 
+static int linear_clone_and_map(struct dm_target *ti, struct request *rq,
+				union map_info *map_context,
+				struct request **__clone)
+{
+	struct linear_c *lc = ti->private;
+	struct block_device *bdev = lc->dev->bdev;
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	struct request *clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE,
+						BLK_MQ_REQ_NOWAIT);
+	if (IS_ERR(clone)) {
+		if (blk_queue_dying(q) || !q->mq_ops)
+			return DM_MAPIO_DELAY_REQUEUE;
+
+		return DM_MAPIO_REQUEUE;
+	}
+
+	clone->__sector = linear_map_sector(ti, rq->__sector);
+	clone->bio = clone->biotail = NULL;
+	clone->rq_disk = bdev->bd_disk;
+	clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
+	*__clone = clone;
+
+	return DM_MAPIO_REMAPPED;
+}
+
+static void linear_release_clone(struct request *clone)
+{
+	blk_put_request(clone);
+}
+
 static void linear_status(struct dm_target *ti, status_type_t type,
 			  unsigned status_flags, char *result, unsigned maxlen)
 {
@@ -207,13 +244,15 @@ static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
 
 static struct target_type linear_target = {
 	.name   = "linear",
-	.version = {1, 4, 0},
-	.features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
+	.version = {1, 5, 0},
+	.features = DM_TARGET_IMMUTABLE | DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
 	.module = THIS_MODULE,
 	.ctr    = linear_ctr,
 	.dtr    = linear_dtr,
 	.map    = linear_map,
 	.end_io = linear_end_io,
+	.clone_and_map_rq = linear_clone_and_map,
+	.release_clone_rq = linear_release_clone,
 	.status = linear_status,
 	.prepare_ioctl = linear_prepare_ioctl,
 	.iterate_devices = linear_iterate_devices,

^ permalink raw reply related	[flat|nested] 11+ messages in thread
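For completeness on how such a request-based dm-linear device would be
used in place of a kernel partition: the dm-linear table format is
"<logical start> <length> linear <device> <offset>", so the two
partitions from step 0 can instead be expressed as linear mappings over
the whole device.  A sketch (offsets and lengths are illustrative, in
512-byte sectors; align them to the real layout being replaced):

#!/bin/sh
# sketch: dm-linear stand-ins for nvme1n1p1/nvme1n1p2
DISK=/dev/nvme1n1
P1_START=2048			# ~1MiB
P1_LEN=104857600		# 50GiB in 512-byte sectors
P2_START=$((P1_START + P1_LEN))
P2_LEN=$((`blockdev --getsz $DISK` - P2_START))

echo "0 $P1_LEN linear $DISK $P1_START" | dmsetup create nvme_lin_fast
echo "0 $P2_LEN linear $DISK $P2_START" | dmsetup create nvme_lin_slow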
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device 2018-07-24 13:07 ` Mike Snitzer @ 2018-07-24 13:22 ` Laurence Oberman 2018-07-24 13:51 ` Hannes Reinecke 2018-07-24 17:42 ` Christoph Hellwig 2 siblings, 0 replies; 11+ messages in thread From: Laurence Oberman @ 2018-07-24 13:22 UTC (permalink / raw) To: Mike Snitzer, Hannes Reinecke; +Cc: linux-nvme, linux-block, dm-devel On Tue, 2018-07-24 at 09:07 -0400, Mike Snitzer wrote: > On Tue, Jul 24 2018 at 2:00am -0400, > Hannes Reinecke <hare@suse.de> wrote: > > > On 07/23/2018 06:33 PM, Mike Snitzer wrote: > > > Hi, > > > > > > I've opened the following public BZ: > > > https://bugzilla.redhat.com/show_bug.cgi?id=1607527 > > > > > > Feel free to add comments to that BZ if you have a redhat > > > bugzilla > > > account. > > > > > > But otherwise, happy to get as much feedback and discussion going > > > purely > > > on the relevant lists. I've taken ~1.5 weeks to categorize and > > > isolate > > > this issue. But I've reached a point where I'm getting > > > diminishing > > > returns and could _really_ use the collective eyeballs and > > > expertise of > > > the community. This is by far one of the most nasty cases of > > > corruption > > > I've seen in a while. Not sure where the ultimate cause of > > > corruption > > > lies (that the money question) but it _feels_ rooted in NVMe and > > > is > > > unique to this particular workload I've stumbled onto via > > > customer > > > escalation and then trying to replicate an rbd device using a > > > more > > > approachable one (request-based DM multipath in this case). > > > > > > > I might be stating the obvious, but so far we only have considered > > request-based multipath as being active for the _entire_ device. > > To my knowledge we've never tested that when running on a > > partition. > > True. We only ever support mapping the partitions ontop of > request-based multipath (via dm-linear volumes created by kpartx). > > > So, have you tested that request-based multipathing works on a > > partition _at all_? I'm not sure if partition mapping is done > > correctly here; we never remap the start of the request (nor bio, > > come to speak of it), so it looks as if we would be doing the wrong > > things here. > > > > Have you checked that partition remapping is done correctly? > > It clearly doesn't work. Not quite following why but... > > After running the test the partition table at the start of the whole > NVMe device is overwritten by XFS. So likely the IO destined to the > dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to > the > whole NVMe device: > > # pvcreate /dev/nvme1n1 > WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it? > [y/n] > > # vgcreate test /dev/nvme1n1 > # lvcreate -n slow -L 512G test > WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe > it? > [y/n]: y > Wiping xfs signature on /dev/test/slow. > Logical volume "slow" created. > > Isn't this a failing of block core's partitioning? Why should a > target > that is given the entire partition of a device need to be concerned > with > remapping IO? Shouldn't block core handle that mapping? > > Anyway, yesterday I went so far as to hack together request-based > support for DM linear (because request-based DM cannot stack on > bio-based DM) . 
With this, request-based linear devices instead of > conventional partitioning, I no longer see the XFS corruption when > running the test: > > drivers/md/dm-linear.c | 45 > ++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 42 insertions(+), 3 deletions(-) > > diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c > index d10964d41fd7..d4a65dd20c6e 100644 > --- a/drivers/md/dm-linear.c > +++ b/drivers/md/dm-linear.c > @@ -12,6 +12,7 @@ > #include <linux/dax.h> > #include <linux/slab.h> > #include <linux/device-mapper.h> > +#include <linux/blk-mq.h> > > #define DM_MSG_PREFIX "linear" > > @@ -24,7 +25,7 @@ struct linear_c { > }; > > /* > - * Construct a linear mapping: <dev_path> <offset> > + * Construct a linear mapping: <dev_path> <offset> [<# optional > params> <optional params>] > */ > static int linear_ctr(struct dm_target *ti, unsigned int argc, char > **argv) > { > @@ -57,6 +58,11 @@ static int linear_ctr(struct dm_target *ti, > unsigned int argc, char **argv) > goto bad; > } > > + // FIXME: need to parse optional args > + // FIXME: model alloc_multipath_stage2()? > + // Call: dm_table_set_type() > + dm_table_set_type(ti->table, DM_TYPE_MQ_REQUEST_BASED); > + > ti->num_flush_bios = 1; > ti->num_discard_bios = 1; > ti->num_secure_erase_bios = 1; > @@ -113,6 +119,37 @@ static int linear_end_io(struct dm_target *ti, > struct bio *bio, > return DM_ENDIO_DONE; > } > > +static int linear_clone_and_map(struct dm_target *ti, struct request > *rq, > + union map_info *map_context, > + struct request **__clone) > +{ > + struct linear_c *lc = ti->private; > + struct block_device *bdev = lc->dev->bdev; > + struct request_queue *q = bdev_get_queue(bdev); > + > + struct request *clone = blk_get_request(q, rq->cmd_flags | > REQ_NOMERGE, > + BLK_MQ_REQ_NOWAIT); > + if (IS_ERR(clone)) { > + if (blk_queue_dying(q) || !q->mq_ops) > + return DM_MAPIO_DELAY_REQUEUE; > + > + return DM_MAPIO_REQUEUE; > + } > + > + clone->__sector = linear_map_sector(ti, rq->__sector); > + clone->bio = clone->biotail = NULL; > + clone->rq_disk = bdev->bd_disk; > + clone->cmd_flags |= REQ_FAILFAST_TRANSPORT; > + *__clone = clone; > + > + return DM_MAPIO_REMAPPED; > +} > + > +static void linear_release_clone(struct request *clone) > +{ > + blk_put_request(clone); > +} > + > static void linear_status(struct dm_target *ti, status_type_t type, > unsigned status_flags, char *result, > unsigned maxlen) > { > @@ -207,13 +244,15 @@ static size_t linear_dax_copy_to_iter(struct > dm_target *ti, pgoff_t pgoff, > > static struct target_type linear_target = { > .name = "linear", > - .version = {1, 4, 0}, > - .features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM, > + .version = {1, 5, 0}, > + .features = DM_TARGET_IMMUTABLE | DM_TARGET_PASSES_INTEGRITY > | DM_TARGET_ZONED_HM, > .module = THIS_MODULE, > .ctr = linear_ctr, > .dtr = linear_dtr, > .map = linear_map, > .end_io = linear_end_io, > + .clone_and_map_rq = linear_clone_and_map, > + .release_clone_rq = linear_release_clone, > .status = linear_status, > .prepare_ioctl = linear_prepare_ioctl, > .iterate_devices = linear_iterate_devices, > > > With Oracle setups and multipath, we have plenty of customers using non NVME LUNS (i.e. F/C) with 1 single partition on top of a request based multipath with no issues. Same for file systems on top of multipath devices with a single partition Its very uncommon for sharing a disk with multiple partitions, and multipath. 
It has to be the multiple partitions, but we should test on non-NVMe
storage with multiple partitions in the lab setup, I guess, to make
sure.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device 2018-07-24 13:07 ` Mike Snitzer 2018-07-24 13:22 ` Laurence Oberman @ 2018-07-24 13:51 ` Hannes Reinecke 2018-07-24 13:57 ` Laurence Oberman 2018-07-24 17:42 ` Christoph Hellwig 2 siblings, 1 reply; 11+ messages in thread From: Hannes Reinecke @ 2018-07-24 13:51 UTC (permalink / raw) To: Mike Snitzer; +Cc: linux-nvme, linux-block, dm-devel On 07/24/2018 03:07 PM, Mike Snitzer wrote: > On Tue, Jul 24 2018 at 2:00am -0400, > Hannes Reinecke <hare@suse.de> wrote: > >> On 07/23/2018 06:33 PM, Mike Snitzer wrote: >>> Hi, >>> >>> I've opened the following public BZ: >>> https://bugzilla.redhat.com/show_bug.cgi?id=1607527 >>> >>> Feel free to add comments to that BZ if you have a redhat bugzilla >>> account. >>> >>> But otherwise, happy to get as much feedback and discussion going purely >>> on the relevant lists. I've taken ~1.5 weeks to categorize and isolate >>> this issue. But I've reached a point where I'm getting diminishing >>> returns and could _really_ use the collective eyeballs and expertise of >>> the community. This is by far one of the most nasty cases of corruption >>> I've seen in a while. Not sure where the ultimate cause of corruption >>> lies (that the money question) but it _feels_ rooted in NVMe and is >>> unique to this particular workload I've stumbled onto via customer >>> escalation and then trying to replicate an rbd device using a more >>> approachable one (request-based DM multipath in this case). >>> >> I might be stating the obvious, but so far we only have considered >> request-based multipath as being active for the _entire_ device. >> To my knowledge we've never tested that when running on a partition. > > True. We only ever support mapping the partitions ontop of > request-based multipath (via dm-linear volumes created by kpartx). > >> So, have you tested that request-based multipathing works on a >> partition _at all_? I'm not sure if partition mapping is done >> correctly here; we never remap the start of the request (nor bio, >> come to speak of it), so it looks as if we would be doing the wrong >> things here. >> >> Have you checked that partition remapping is done correctly? > > It clearly doesn't work. Not quite following why but... > > After running the test the partition table at the start of the whole > NVMe device is overwritten by XFS. So likely the IO destined to the > dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to the > whole NVMe device: > > # pvcreate /dev/nvme1n1 > WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it? [y/n] > > # vgcreate test /dev/nvme1n1 > # lvcreate -n slow -L 512G test > WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe it? > [y/n]: y > Wiping xfs signature on /dev/test/slow. > Logical volume "slow" created. > > Isn't this a failing of block core's partitioning? Why should a target > that is given the entire partition of a device need to be concerned with > remapping IO? Shouldn't block core handle that mapping? > Only if the device is marked a 'partitionable', which device-mapper devices are not. But I thought you knew that ... > Anyway, yesterday I went so far as to hack together request-based > support for DM linear (because request-based DM cannot stack on > bio-based DM) . 
> With this, request-based linear devices instead of
> conventional partitioning, I no longer see the XFS corruption when
> running the test:

_Actually_, I would've done it the other way around; after all, where's
the point in running dm-multipath on a partition?
Anything running on the other partitions would suffer from the issues
dm-multipath is designed to handle (temporary path loss etc), so I'm not
quite sure what you are trying to achieve with your testcase.
Can you enlighten me?

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device 2018-07-24 13:51 ` Hannes Reinecke @ 2018-07-24 13:57 ` Laurence Oberman 2018-07-24 15:18 ` Mike Snitzer 0 siblings, 1 reply; 11+ messages in thread From: Laurence Oberman @ 2018-07-24 13:57 UTC (permalink / raw) To: Hannes Reinecke, Mike Snitzer; +Cc: linux-nvme, linux-block, dm-devel On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote: > On 07/24/2018 03:07 PM, Mike Snitzer wrote: > > On Tue, Jul 24 2018 at 2:00am -0400, > > Hannes Reinecke <hare@suse.de> wrote: > > > > > On 07/23/2018 06:33 PM, Mike Snitzer wrote: > > > > Hi, > > > > > > > > I've opened the following public BZ: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1607527 > > > > > > > > Feel free to add comments to that BZ if you have a redhat > > > > bugzilla > > > > account. > > > > > > > > But otherwise, happy to get as much feedback and discussion > > > > going purely > > > > on the relevant lists. I've taken ~1.5 weeks to categorize and > > > > isolate > > > > this issue. But I've reached a point where I'm getting > > > > diminishing > > > > returns and could _really_ use the collective eyeballs and > > > > expertise of > > > > the community. This is by far one of the most nasty cases of > > > > corruption > > > > I've seen in a while. Not sure where the ultimate cause of > > > > corruption > > > > lies (that the money question) but it _feels_ rooted in NVMe > > > > and is > > > > unique to this particular workload I've stumbled onto via > > > > customer > > > > escalation and then trying to replicate an rbd device using a > > > > more > > > > approachable one (request-based DM multipath in this case). > > > > > > > > > > I might be stating the obvious, but so far we only have > > > considered > > > request-based multipath as being active for the _entire_ device. > > > To my knowledge we've never tested that when running on a > > > partition. > > > > True. We only ever support mapping the partitions ontop of > > request-based multipath (via dm-linear volumes created by kpartx). > > > > > So, have you tested that request-based multipathing works on a > > > partition _at all_? I'm not sure if partition mapping is done > > > correctly here; we never remap the start of the request (nor bio, > > > come to speak of it), so it looks as if we would be doing the > > > wrong > > > things here. > > > > > > Have you checked that partition remapping is done correctly? > > > > It clearly doesn't work. Not quite following why but... > > > > After running the test the partition table at the start of the > > whole > > NVMe device is overwritten by XFS. So likely the IO destined to > > the > > dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to > > the > > whole NVMe device: > > > > # pvcreate /dev/nvme1n1 > > WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe > > it? [y/n] > > > > # vgcreate test /dev/nvme1n1 > > # lvcreate -n slow -L 512G test > > WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe > > it? > > [y/n]: y > > Wiping xfs signature on /dev/test/slow. > > Logical volume "slow" created. > > > > Isn't this a failing of block core's partitioning? Why should a > > target > > that is given the entire partition of a device need to be concerned > > with > > remapping IO? Shouldn't block core handle that mapping? > > > > Only if the device is marked a 'partitionable', which device-mapper > devices are not. > But I thought you knew that ... 
> > > Anyway, yesterday I went so far as to hack together request-based > > support for DM linear (because request-based DM cannot stack on > > bio-based DM) . With this, request-based linear devices instead of > > conventional partitioning, I no longer see the XFS corruption when > > running the test: > > > > _Actually_, I would've done it the other way around; after all, > where't > the point in running dm-multipath on a partition? > Anything running on the other partitions would suffer from the > issues > dm-multipath is designed to handle (temporary path loss etc), so I'm > not > quite sure what you are trying to achieve with your testcase. > Can you enlighten me? > > Cheers, > > Hannes This came about because a customer is using nvme for a dm-cache device and created multiple partitions so as to use the same nvme to cache multiple different "slower" devices. The corruption was noticed in XFS and I engaged Mike to assist in figuring out what was going on. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24 13:57 ` Laurence Oberman
@ 2018-07-24 15:18   ` Mike Snitzer
  2018-07-24 15:31     ` Laurence Oberman
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2018-07-24 15:18 UTC (permalink / raw)
  To: Hannes Reinecke, Laurence Oberman; +Cc: linux-nvme, linux-block, dm-devel

On Tue, Jul 24 2018 at  9:57am -0400,
Laurence Oberman <loberman@redhat.com> wrote:

> On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> >
> > _Actually_, I would've done it the other way around; after all,
> > where't the point in running dm-multipath on a partition?
> > Anything running on the other partitions would suffer from the
> > issues dm-multipath is designed to handle (temporary path loss etc),
> > so I'm not quite sure what you are trying to achieve with your
> > testcase.
> > Can you enlighten me?
> >
> > Cheers,
> >
> > Hannes

I wasn't looking to deploy this (multipath on a partition) in production
or suggest it to others.  It was a means to experiment.  More below.

> This came about because a customer is using nvme for a dm-cache device
> and created multiple partitions so as to use the same nvme to cache
> multiple different "slower" devices. The corruption was noticed in XFS
> and I engaged Mike to assist in figuring out what was going on.

Yes, so the topology for the customer's setup is:

1) MD raid1 on 2 NVMe partitions (from separate NVMe devices).
2) Then DM cache's "fast" and "metadata" devices layered on a dm-linear
   mapping on top of the MD raid1.
3) Then Ceph's rbd for DM cache's slow device.

I was just looking to simplify the stack to try to assess why XFS
corruption was being seen, without all the insanity.

One issue was corruption due to incorrect shutdown ordering (the network
was getting shut down out from underneath rbd, and in turn DM cache
couldn't complete its IO migrations during cache_postsuspend()).

So I elected to try using DM multipath with queue_if_no_path to try to
replicate rbd losing the network _without_ needing a full Ceph/rbd
setup.

The rest is history... a rat-hole of corruption that is likely very
different from the customer's setup.

Mike

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device 2018-07-24 15:18 ` Mike Snitzer @ 2018-07-24 15:31 ` Laurence Oberman 0 siblings, 0 replies; 11+ messages in thread From: Laurence Oberman @ 2018-07-24 15:31 UTC (permalink / raw) To: Mike Snitzer, Hannes Reinecke Cc: linux-nvme, linux-block, dm-devel, Brett Hull On Tue, 2018-07-24 at 11:18 -0400, Mike Snitzer wrote: > On Tue, Jul 24 2018 at 9:57am -0400, > Laurence Oberman <loberman@redhat.com> wrote: > > > On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote: > > > > > > _Actually_, I would've done it the other way around; after all, > > > where't the point in running dm-multipath on a partition? > > > Anything running on the other partitions would suffer from the > > > issues dm-multipath is designed to handle (temporary path loss > > > etc), so I'm > > > not quite sure what you are trying to achieve with your testcase. > > > Can you enlighten me? > > > > > > Cheers, > > > > > > Hannes > > I wasn't looking to deply this (multipath on partition) in production > or > suggest it to others. It was a means to experiment. More below. > > > This came about because a customer is using nvme for a dm-cache > > device > > and created multiple partitions so as to use the same nvme to cache > > multiple different "slower" devices. The corruption was noticed in > > XFS > > and I engaged Mike to assist in figuring out what was going on. > > Yes, so topology for the customer's setup is: > > 1) MD raid1 on 2 NVMe partitions (from separate NVMe devices). > 2) Then DM cache's "fast" and "metadata" devices layered on dm-linear > mapping ontop of the MD raid1. > 3) Then Ceph's rbd for DM-cache's slow device. > > I was just looking to simplify the stack to try to assess why XFS > corruption was being seen without all the insanity. > > One issue was corruption due to incorrect shutdown order (network was > getting shutdown out from underneath rbd, and in turn DM-cache > couldn't > complete its IO migrations during cache_postsuspend()). > > So I elected to try using DM multipath with queue_if_no_path to try > to > replicate rbd losing network _without_ needing a full Ceph/rbd setup. > > The rest is history... a rat-hole of corruption that likely is very > different than the customer's setup. > > Mike Not to muddy the waters here, and as Mike said the issue he tripped over may not be the direct issue we originally started with. In the lab reproducer with rbd as a slow devices we do not have an MD raided nvme for the dm-cache, but we still see the corruption only on the rbd based test. We used the nvme partitioned but no DM raid to try an F/C device- mapper-multipath LUNS cached via dm-cache. The last test we ran where we did not see corruption was a partition where the second partition was used to cache F/C luns nvme0n1 259:0 0 372.6G 0 disk ├─nvme0n1p1 259:1 0 150G 0 part └─nvme0n1p2 259:2 0 150G 0 part ├─cache_FC-nvme_blk_cache_cdata 253:42 0 20G 0 lvm │ └─cache_FC-fc_disk 253:45 0 48G 0 lvm /cache_FC └─cache_FC-nvme_blk_cache_cmeta 253:43 0 40M 0 lvm └─cache_FC-fc_disk 253:45 0 48G 0 lvm /cache_FC cache_FC-fc_disk (253:45) ├─cache_FC-fc_disk_corig (253:44) │ └─3600140508da66c2c9ee4cc6aface1bab (253:36) Multipath │ ├─ (68:224) │ ├─ (69:240) │ ├─ (8:192) │ └─ (8:64) ├─cache_FC-nvme_blk_cache_cdata (253:42) │ └─ (259:2) └─cache_FC-nvme_blk_cache_cmeta (253:43) └─ (259:2) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24 13:07   ` Mike Snitzer
  2018-07-24 13:22     ` Laurence Oberman
  2018-07-24 13:51     ` Hannes Reinecke
@ 2018-07-24 17:42     ` Christoph Hellwig
  2 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2018-07-24 17:42 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Hannes Reinecke, linux-nvme, linux-block, dm-devel

On Tue, Jul 24, 2018 at 09:07:03AM -0400, Mike Snitzer wrote:
> True.  We only ever support mapping the partitions ontop of
> request-based multipath (via dm-linear volumes created by kpartx).
>
> > So, have you tested that request-based multipathing works on a
> > partition _at all_?  I'm not sure if partition mapping is done
> > correctly here; we never remap the start of the request (nor bio,
> > come to speak of it), so it looks as if we would be doing the wrong
> > things here.
> >
> > Have you checked that partition remapping is done correctly?
>
> It clearly doesn't work.  Not quite following why but...

blk_insert_cloned_request seems to be missing a call to
blk_partition_remap.  Given that no one but dm-multipath uses this
request clone insert helper, and people generally run multipath on the
whole device, this is a code path that is almost never exercised.

^ permalink raw reply	[flat|nested] 11+ messages in thread
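For readers not steeped in block core: in the bio path, partition
remapping happens at submission time (generic_make_request() ends up
calling blk_partition_remap(), which shifts the bio onto the parent disk
by adding the partition's start sector).  Nothing equivalent happens to
a request cloned by request-based DM and inserted via
blk_insert_cloned_request(), so the clone keeps partition-relative
sector numbers while being executed against the whole-disk queue.
Purely as an illustration of the missing step (not a proposed or actual
fix), the adjustment amounts to something like:

	/*
	 * Illustrative sketch only: shift a cloned request onto the
	 * parent disk when the original IO was aimed at a partition.
	 * 'part' is the partition block_device the request came in on.
	 */
	static void remap_clone_to_whole_disk(struct request *clone,
					      struct block_device *part)
	{
		/* get_start_sect() returns the partition's start sector */
		clone->__sector += get_start_sect(part);
	}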
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24  6:00 ` Hannes Reinecke
  2018-07-24 13:07   ` Mike Snitzer
@ 2018-07-24 14:25   ` Bart Van Assche
  2018-07-24 15:07     ` Mike Snitzer
  1 sibling, 1 reply; 11+ messages in thread
From: Bart Van Assche @ 2018-07-24 14:25 UTC (permalink / raw)
  To: dm-devel@redhat.com, linux-block@vger.kernel.org, hare@suse.de,
	linux-nvme@lists.infradead.org, snitzer@redhat.com

On Tue, 2018-07-24 at 08:00 +0200, Hannes Reinecke wrote:
> So, have you tested that request-based multipathing works on a partition
> _at all_?  I'm not sure if partition mapping is done correctly here; we
> never remap the start of the request (nor bio, come to speak of it), so
> it looks as if we would be doing the wrong things here.
>
> Have you checked that partition remapping is done correctly?

I think generic_make_request() takes care of partition remapping by
calling blk_partition_remap().  generic_make_request() is called by
submit_bio().  Is that sufficient to cover all dm drivers?

Bart.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device 2018-07-24 14:25 ` Bart Van Assche @ 2018-07-24 15:07 ` Mike Snitzer 0 siblings, 0 replies; 11+ messages in thread From: Mike Snitzer @ 2018-07-24 15:07 UTC (permalink / raw) To: Bart Van Assche Cc: dm-devel@redhat.com, linux-block@vger.kernel.org, hare@suse.de, linux-nvme@lists.infradead.org On Tue, Jul 24 2018 at 10:25am -0400, Bart Van Assche <Bart.VanAssche@wdc.com> wrote: > On Tue, 2018-07-24 at 08:00 +0200, Hannes Reinecke wrote: > > So, have you tested that request-based multipathing works on a partition > > _at all_? I'm not sure if partition mapping is done correctly here; we > > never remap the start of the request (nor bio, come to speak of it), so > > it looks as if we would be doing the wrong things here. > > > > Have you checked that partition remapping is done correctly? > > I think generic_make_request() takes care of partition remapping by calling > blk_partition_remap(). generic_make_request() is called by submit_bio(). Is > that sufficient to cover all dm drivers? Seems not for request-based DM (see my previous reply in this thread). But bio-based DM-multipath seems to work just fine. ^ permalink raw reply [flat|nested] 11+ messages in thread
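For anyone who wants to try the working bio-based configuration
mentioned above: dm-multipath accepts a queue_mode feature of bio, rq or
mq, so the step 1 script from the first message only needs its
queue_mode changed.  A sketch (assumes the service-time selector
arguments carry over unchanged):

#!/bin/sh
# same table as step 1, but bio-based instead of blk-mq request-based
modprobe dm-service-time
DEVICE=/dev/nvme1n1p2
SIZE=`blockdev --getsz $DEVICE`
echo "0 $SIZE multipath 2 queue_mode bio 0 1 1 service-time 0 1 2 $DEVICE 1000 1" \
	| dmsetup create nvme_mpath_bio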
end of thread, other threads:[~2018-07-24 18:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-23 16:33 data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device Mike Snitzer
2018-07-24  6:00 ` Hannes Reinecke
2018-07-24 13:07   ` Mike Snitzer
2018-07-24 13:22     ` Laurence Oberman
2018-07-24 13:51     ` Hannes Reinecke
2018-07-24 13:57       ` Laurence Oberman
2018-07-24 15:18         ` Mike Snitzer
2018-07-24 15:31           ` Laurence Oberman
2018-07-24 17:42     ` Christoph Hellwig
2018-07-24 14:25 ` Bart Van Assche
2018-07-24 15:07   ` Mike Snitzer