linux-block.vger.kernel.org archive mirror
* data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
@ 2018-07-23 16:33 Mike Snitzer
  2018-07-24  6:00 ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2018-07-23 16:33 UTC (permalink / raw)
  To: linux-nvme, linux-block, dm-devel

Hi,

I've opened the following public BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1607527

Feel free to add comments to that BZ if you have a redhat bugzilla
account.

But otherwise, happy to get as much feedback and discussion going purely
on the relevant lists.  I've taken ~1.5 weeks to categorize and isolate
this issue.  But I've reached a point where I'm getting diminishing
returns and could _really_ use the collective eyeballs and expertise of
the community.  This is by far one of the nastiest cases of corruption
I've seen in a while.  Not sure where the ultimate cause of the corruption
lies (that's the money question) but it _feels_ rooted in NVMe and is
unique to this particular workload, which I stumbled onto via a customer
escalation and then trying to replicate an rbd device with a more
approachable one (request-based DM multipath in this case).

From the BZ's comment #0:

The following occurs with the latest v4.18-rc3 and v4.18-rc6 and also occurs
with v4.15.  When corruption occurs from this test it also destroys the
DOS partition table (created during step 0 below)... yeah, the corruption is
_that_ bad.  Almost as if the corruption is temporal (hitting recently
accessed regions of the NVMe device)?

Anyway: I stumbled onto rampant corruption when using request-based DM
multipath on top of an NVMe device (not exclusive to a particular drive
either; it happens to NVMe devices from multiple vendors).  But the
corruption only occurs if the request-based multipath IO is issued to an
NVMe device in parallel to other IO issued to the _same_ underlying NVMe
by the DM cache target.  See the topology detailed below (at the very end
of this comment)... basically all 3 devices that are used to create a DM
cache device need to be backed by the same NVMe device (via partitions
or linear volumes).

Again, using request-based DM multipath for dm-cache's "slow" device is
_required_ to reproduce.  Not 100% clear why, really... other than that
request-based DM multipath builds large IOs (due to merging).

--- Additional comment from Mike Snitzer on 2018-07-20 10:14:09 EDT ---

To reproduce this issue using device-mapper-test-suite:

0) Partition an NVMe device.  First primary partition with at least
5GB, second primary partition with at least 48GB.
NOTE: larger partitions (e.g. 1: 50GB 2: >= 220GB) can be used to
reproduce XFS corruption much quicker.

1) create a request-based multipath device on top of an NVMe device,
e.g.:

#!/bin/sh

modprobe dm-service-time

DEVICE=/dev/nvme1n1p2
SIZE=`blockdev --getsz $DEVICE`

echo "0 $SIZE multipath 2 queue_mode mq 0 1 1 service-time 0 1 2 $DEVICE
1000 1" | dmsetup create nvme_mpath

# Just a note for how to fail/reinstate path:
# dmsetup message nvme_mpath 0 "fail_path $DEVICE"
# dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

2) checkout device-mapper-test-suite from my github repo:

git clone git://github.com/snitm/device-mapper-test-suite.git
cd device-mapper-test-suite
git checkout -b devel origin/devel

3) follow device-mapper-test-suite's README.md to get it all setup

4) Configure /root/.dmtest/config with something like:

profile :nvme_shared do
   metadata_dev '/dev/nvme1n1p1'
   #data_dev '/dev/nvme1n1p2'
   data_dev '/dev/mapper/nvme_mpath'
end

default_profile :nvme_shared

------
NOTE: configured 'metadata_dev' gets carved up by
device-mapper-test-suite to provide both the dm-cache's metadata device
and the "fast" data device.  The configured 'data_dev' is used for
dm-cache's "slow" data device.

5) run the test:
# tail -f /var/log/messages &
# time dmtest run --suite cache -n /split_large_file/

6) If the multipath device has failed the lone NVMe path, you'll need to
reinstate the path before the next iteration of the test, e.g. (from step 1
above):
 dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

--- Additional comment from Mike Snitzer on 2018-07-20 12:02:45 EDT ---

(In reply to Mike Snitzer from comment #6)

> So, seems pretty clear something is still wrong with request-based DM
> multipath on top of NVMe... sadly we don't have any negative check in
> blk-core, NVMe or elsewhere to offer any clue :(

Building on this comment:

"Anyway, fact that I'm getting this corruption on multiple different
NVMe drives: I am definitely concerned that this BZ is due to a bug
somewhere in NVMe core (or block core code that is specific to NVMe)."

I'm left thinking that request-based DM multipath is somehow causing
NVMe's SG lists or other infrastructure to be "wrong", and that this is
resulting in corruption.  I get corruption on dm-cache's metadata
device (which should be unrelated, as it's a separate device from the
"slow" dm-cache data device) if the dm-cache slow data device is backed
by request-based dm-multipath on top of NVMe (a partition from the
_same_ NVMe device that backs the dm-cache metadata device).

Basically I'm back to thinking NVMe is corrupting the data due to the IO
pattern or nature of the cloned requests dm-multipath is issuing.  And
it is causing corruption to other NVMe partitions on the same parent
NVMe device.  Certainly that is a concerning hypothesis but I'm not
seeing much else that would explain this weird corruption.

If I don't use the same NVMe device (with multiple partitions) for _all_
3 sub-devices that dm-cache needs, I don't see the corruption.  It is
almost as if the mix of IO issued for DM cache's metadata device (on
nvme1n1p1 using dm-linear) and "fast" device (also on nvme1n1p1 via a
dm-linear volume), in conjunction with IO issued by request-based DM
multipath to NVMe for the "slow" device (on nvme1n1p2), is triggering NVMe to respond
negatively.  But this same observation can be made on completely
different hardware using 2 totally different NVMe devices:
testbed1: Intel Corporation Optane SSD 900P Series (2700)
testbed2: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)

Which is why it feels like some bug in Linux (be it dm-rq.c, blk-core.c,
blk-merge.c or the common NVMe driver).

topology before starting the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1        259:1    0 745.2G  0 disk
├─nvme1n1p2    259:5    0 695.2G  0 part
│ └─nvme_mpath 253:2    0 695.2G  0 dm
└─nvme1n1p1    259:4    0    50G  0 part

topology during the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                 259:1    0 745.2G  0 disk
├─nvme1n1p2             259:5    0 695.2G  0 part
│ └─nvme_mpath          253:2    0 695.2G  0 dm
│   └─test-dev-458572   253:5    0    48G  0 dm
│     └─test-dev-613083 253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
└─nvme1n1p1             259:4    0    50G  0 part
  ├─test-dev-126378     253:4    0     4G  0 dm
  │ └─test-dev-613083   253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
  └─test-dev-652491     253:3    0    40M  0 dm
    └─test-dev-613083   253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds

pruning that tree a bit (removing the dm-cache device 253:6) for
clarity:

# lsblk /dev/nvme1n1
NAME                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                 259:1    0 745.2G  0 disk
├─nvme1n1p2             259:5    0 695.2G  0 part
│ └─nvme_mpath          253:2    0 695.2G  0 dm
│   └─test-dev-458572   253:5    0    48G  0 dm
└─nvme1n1p1             259:4    0    50G  0 part
  ├─test-dev-126378     253:4    0     4G  0 dm
  └─test-dev-652491     253:3    0    40M  0 dm

40M device is dm-cache "metadata" device
4G device is dm-cache "fast" data device
48G device is dm-cache "slow" data device

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-23 16:33 data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device Mike Snitzer
@ 2018-07-24  6:00 ` Hannes Reinecke
  2018-07-24 13:07   ` Mike Snitzer
  2018-07-24 14:25   ` Bart Van Assche
  0 siblings, 2 replies; 11+ messages in thread
From: Hannes Reinecke @ 2018-07-24  6:00 UTC (permalink / raw)
  To: Mike Snitzer, linux-nvme, linux-block, dm-devel

On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> Hi,
> 
> I've opened the following public BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> 
> Feel free to add comments to that BZ if you have a redhat bugzilla
> account.
> 
> But otherwise, happy to get as much feedback and discussion going purely
> on the relevant lists.  I've taken ~1.5 weeks to categorize and isolate
> this issue.  But I've reached a point where I'm getting diminishing
> returns and could _really_ use the collective eyeballs and expertise of
> the community.  This is by far one of the most nasty cases of corruption
> I've seen in a while.  Not sure where the ultimate cause of corruption
> lies (that the money question) but it _feels_ rooted in NVMe and is
> unique to this particular workload I've stumbled onto via customer
> escalation and then trying to replicate an rbd device using a more
> approachable one (request-based DM multipath in this case).
> 
I might be stating the obvious, but so far we have only considered
request-based multipath as being active for the _entire_ device.
To my knowledge we've never tested it running on a partition.

So, have you tested that request-based multipathing works on a partition
_at all_? I'm not sure if partition mapping is done correctly here; we
never remap the start of the request (nor bio, come to speak of it), so
it looks as if we would be doing the wrong things here.

Have you checked that partition remapping is done correctly?

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24  6:00 ` Hannes Reinecke
@ 2018-07-24 13:07   ` Mike Snitzer
  2018-07-24 13:22     ` Laurence Oberman
                       ` (2 more replies)
  2018-07-24 14:25   ` Bart Van Assche
  1 sibling, 3 replies; 11+ messages in thread
From: Mike Snitzer @ 2018-07-24 13:07 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: linux-nvme, linux-block, dm-devel

On Tue, Jul 24 2018 at  2:00am -0400,
Hannes Reinecke <hare@suse.de> wrote:

> On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> >Hi,
> >
> >I've opened the following public BZ:
> >https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> >
> >Feel free to add comments to that BZ if you have a redhat bugzilla
> >account.
> >
> >But otherwise, happy to get as much feedback and discussion going purely
> >on the relevant lists.  I've taken ~1.5 weeks to categorize and isolate
> >this issue.  But I've reached a point where I'm getting diminishing
> >returns and could _really_ use the collective eyeballs and expertise of
> >the community.  This is by far one of the most nasty cases of corruption
> >I've seen in a while.  Not sure where the ultimate cause of corruption
> >lies (that the money question) but it _feels_ rooted in NVMe and is
> >unique to this particular workload I've stumbled onto via customer
> >escalation and then trying to replicate an rbd device using a more
> >approachable one (request-based DM multipath in this case).
> >
> I might be stating the obvious, but so far we only have considered
> request-based multipath as being active for the _entire_ device.
> To my knowledge we've never tested that when running on a partition.

True.  We have only ever supported mapping partitions on top of
request-based multipath (via dm-linear volumes created by kpartx).

> So, have you tested that request-based multipathing works on a
> partition _at all_? I'm not sure if partition mapping is done
> correctly here; we never remap the start of the request (nor bio,
> come to speak of it), so it looks as if we would be doing the wrong
> things here.
> 
> Have you checked that partition remapping is done correctly?

It clearly doesn't work.  Not quite following why, but...

After running the test, the partition table at the start of the whole
NVMe device has been overwritten by XFS.  So the IO destined for
dm-cache's "slow" device (the dm-mpath device on an NVMe partition) was
likely issued to the whole NVMe device:

# pvcreate /dev/nvme1n1
WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it? [y/n]

# vgcreate test /dev/nvme1n1
# lvcreate -n slow -L 512G test
WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe it?
[y/n]: y
  Wiping xfs signature on /dev/test/slow.
  Logical volume "slow" created.

Isn't this a failing of block core's partitioning?  Why should a target
that is given an entire partition of a device need to be concerned with
remapping IO?  Shouldn't block core handle that mapping?

Anyway, yesterday I went so far as to hack together request-based
support for DM linear (because request-based DM cannot stack on
bio-based DM).  With this, using request-based linear devices instead of
conventional partitioning, I no longer see the XFS corruption when
running the test:

 drivers/md/dm-linear.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index d10964d41fd7..d4a65dd20c6e 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -12,6 +12,7 @@
 #include <linux/dax.h>
 #include <linux/slab.h>
 #include <linux/device-mapper.h>
+#include <linux/blk-mq.h>
 
 #define DM_MSG_PREFIX "linear"
 
@@ -24,7 +25,7 @@ struct linear_c {
 };
 
 /*
- * Construct a linear mapping: <dev_path> <offset>
+ * Construct a linear mapping: <dev_path> <offset> [<# optional params> <optional params>]
  */
 static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 {
@@ -57,6 +58,11 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad;
 	}
 
+	// FIXME: need to parse optional args
+	// FIXME: model  alloc_multipath_stage2()?
+	// Call: dm_table_set_type()
+	dm_table_set_type(ti->table, DM_TYPE_MQ_REQUEST_BASED);
+
 	ti->num_flush_bios = 1;
 	ti->num_discard_bios = 1;
 	ti->num_secure_erase_bios = 1;
@@ -113,6 +119,37 @@ static int linear_end_io(struct dm_target *ti, struct bio *bio,
 	return DM_ENDIO_DONE;
 }
 
+static int linear_clone_and_map(struct dm_target *ti, struct request *rq,
+				union map_info *map_context,
+				struct request **__clone)
+{
+	struct linear_c *lc = ti->private;
+	struct block_device *bdev = lc->dev->bdev;
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	struct request *clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE,
+						BLK_MQ_REQ_NOWAIT);
+	if (IS_ERR(clone)) {
+		if (blk_queue_dying(q) || !q->mq_ops)
+			return DM_MAPIO_DELAY_REQUEUE;
+
+		return DM_MAPIO_REQUEUE;
+	}
+
+	clone->__sector = linear_map_sector(ti, rq->__sector);
+	clone->bio = clone->biotail = NULL;
+	clone->rq_disk = bdev->bd_disk;
+	clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
+	*__clone = clone;
+
+	return DM_MAPIO_REMAPPED;
+}
+
+static void linear_release_clone(struct request *clone)
+{
+	blk_put_request(clone);
+}
+
 static void linear_status(struct dm_target *ti, status_type_t type,
 			  unsigned status_flags, char *result, unsigned maxlen)
 {
@@ -207,13 +244,15 @@ static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
 
 static struct target_type linear_target = {
 	.name   = "linear",
-	.version = {1, 4, 0},
-	.features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
+	.version = {1, 5, 0},
+	.features = DM_TARGET_IMMUTABLE | DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
 	.module = THIS_MODULE,
 	.ctr    = linear_ctr,
 	.dtr    = linear_dtr,
 	.map    = linear_map,
 	.end_io = linear_end_io,
+	.clone_and_map_rq = linear_clone_and_map,
+	.release_clone_rq = linear_release_clone,
 	.status = linear_status,
 	.prepare_ioctl = linear_prepare_ioctl,
 	.iterate_devices = linear_iterate_devices,
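
[Editorial aside, not part of the patch above: the hack leans on dm-linear's
existing sector remapping helper.  For context, here is a paraphrase of that
helper as recalled from the v4.18-era drivers/md/dm-linear.c (treat it as a
sketch, not a verbatim quote); it applies exactly the start-offset step that
a cloned request aimed directly at a raw partition never receives.]

/*
 * Sketch of dm-linear's linear_map_sector(), paraphrased from memory of
 * the v4.18-era source: shift the target-relative sector by the linear
 * target's start offset before handing the IO down to the underlying
 * device.
 */
static sector_t linear_map_sector(struct dm_target *ti, sector_t bi_sector)
{
	struct linear_c *lc = ti->private;

	return lc->start + dm_target_offset(ti, bi_sector);
}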

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24 13:07   ` Mike Snitzer
@ 2018-07-24 13:22     ` Laurence Oberman
  2018-07-24 13:51     ` Hannes Reinecke
  2018-07-24 17:42     ` Christoph Hellwig
  2 siblings, 0 replies; 11+ messages in thread
From: Laurence Oberman @ 2018-07-24 13:22 UTC (permalink / raw)
  To: Mike Snitzer, Hannes Reinecke; +Cc: linux-nvme, linux-block, dm-devel

On Tue, 2018-07-24 at 09:07 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at  2:00am -0400,
> Hannes Reinecke <hare@suse.de> wrote:
> 
> > On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> > > Hi,
> > > 
> > > I've opened the following public BZ:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> > > 
> > > Feel free to add comments to that BZ if you have a redhat
> > > bugzilla
> > > account.
> > > 
> > > But otherwise, happy to get as much feedback and discussion going
> > > purely
> > > on the relevant lists.  I've taken ~1.5 weeks to categorize and
> > > isolate
> > > this issue.  But I've reached a point where I'm getting
> > > diminishing
> > > returns and could _really_ use the collective eyeballs and
> > > expertise of
> > > the community.  This is by far one of the most nasty cases of
> > > corruption
> > > I've seen in a while.  Not sure where the ultimate cause of
> > > corruption
> > > lies (that the money question) but it _feels_ rooted in NVMe and
> > > is
> > > unique to this particular workload I've stumbled onto via
> > > customer
> > > escalation and then trying to replicate an rbd device using a
> > > more
> > > approachable one (request-based DM multipath in this case).
> > > 
> > 
> > I might be stating the obvious, but so far we only have considered
> > request-based multipath as being active for the _entire_ device.
> > To my knowledge we've never tested that when running on a
> > partition.
> 
> True.  We only ever support mapping the partitions ontop of
> request-based multipath (via dm-linear volumes created by kpartx).
> 
> > So, have you tested that request-based multipathing works on a
> > partition _at all_? I'm not sure if partition mapping is done
> > correctly here; we never remap the start of the request (nor bio,
> > come to speak of it), so it looks as if we would be doing the wrong
> > things here.
> > 
> > Have you checked that partition remapping is done correctly?
> 
> It clearly doesn't work.  Not quite following why but...
> 
> After running the test the partition table at the start of the whole
> NVMe device is overwritten by XFS.  So likely the IO destined to the
> dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to
> the
> whole NVMe device:
> 
> # pvcreate /dev/nvme1n1
> WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it?
> [y/n]
> 
> # vgcreate test /dev/nvme1n1
> # lvcreate -n slow -L 512G test
> WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe
> it?
> [y/n]: y
>   Wiping xfs signature on /dev/test/slow.
>   Logical volume "slow" created.
> 
> Isn't this a failing of block core's partitioning?  Why should a
> target
> that is given the entire partition of a device need to be concerned
> with
> remapping IO?  Shouldn't block core handle that mapping?
> 
> Anyway, yesterday I went so far as to hack together request-based
> support for DM linear (because request-based DM cannot stack on
> bio-based DM) .  With this, request-based linear devices instead of
> conventional partitioning, I no longer see the XFS corruption when
> running the test:
> 
>  drivers/md/dm-linear.c | 45
> ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index d10964d41fd7..d4a65dd20c6e 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -12,6 +12,7 @@
>  #include <linux/dax.h>
>  #include <linux/slab.h>
>  #include <linux/device-mapper.h>
> +#include <linux/blk-mq.h>
>  
>  #define DM_MSG_PREFIX "linear"
>  
> @@ -24,7 +25,7 @@ struct linear_c {
>  };
>  
>  /*
> - * Construct a linear mapping: <dev_path> <offset>
> + * Construct a linear mapping: <dev_path> <offset> [<# optional
> params> <optional params>]
>   */
>  static int linear_ctr(struct dm_target *ti, unsigned int argc, char
> **argv)
>  {
> @@ -57,6 +58,11 @@ static int linear_ctr(struct dm_target *ti,
> unsigned int argc, char **argv)
>  		goto bad;
>  	}
>  
> +	// FIXME: need to parse optional args
> +	// FIXME: model  alloc_multipath_stage2()?
> +	// Call: dm_table_set_type()
> +	dm_table_set_type(ti->table, DM_TYPE_MQ_REQUEST_BASED);
> +
>  	ti->num_flush_bios = 1;
>  	ti->num_discard_bios = 1;
>  	ti->num_secure_erase_bios = 1;
> @@ -113,6 +119,37 @@ static int linear_end_io(struct dm_target *ti,
> struct bio *bio,
>  	return DM_ENDIO_DONE;
>  }
>  
> +static int linear_clone_and_map(struct dm_target *ti, struct request
> *rq,
> +				union map_info *map_context,
> +				struct request **__clone)
> +{
> +	struct linear_c *lc = ti->private;
> +	struct block_device *bdev = lc->dev->bdev;
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	struct request *clone = blk_get_request(q, rq->cmd_flags |
> REQ_NOMERGE,
> +						BLK_MQ_REQ_NOWAIT);
> +	if (IS_ERR(clone)) {
> +		if (blk_queue_dying(q) || !q->mq_ops)
> +			return DM_MAPIO_DELAY_REQUEUE;
> +
> +		return DM_MAPIO_REQUEUE;
> +	}
> +
> +	clone->__sector = linear_map_sector(ti, rq->__sector);
> +	clone->bio = clone->biotail = NULL;
> +	clone->rq_disk = bdev->bd_disk;
> +	clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
> +	*__clone = clone;
> +
> +	return DM_MAPIO_REMAPPED;
> +}
> +
> +static void linear_release_clone(struct request *clone)
> +{
> +	blk_put_request(clone);
> +}
> +
>  static void linear_status(struct dm_target *ti, status_type_t type,
>  			  unsigned status_flags, char *result,
> unsigned maxlen)
>  {
> @@ -207,13 +244,15 @@ static size_t linear_dax_copy_to_iter(struct
> dm_target *ti, pgoff_t pgoff,
>  
>  static struct target_type linear_target = {
>  	.name   = "linear",
> -	.version = {1, 4, 0},
> -	.features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
> +	.version = {1, 5, 0},
> +	.features = DM_TARGET_IMMUTABLE | DM_TARGET_PASSES_INTEGRITY
> | DM_TARGET_ZONED_HM,
>  	.module = THIS_MODULE,
>  	.ctr    = linear_ctr,
>  	.dtr    = linear_dtr,
>  	.map    = linear_map,
>  	.end_io = linear_end_io,
> +	.clone_and_map_rq = linear_clone_and_map,
> +	.release_clone_rq = linear_release_clone,
>  	.status = linear_status,
>  	.prepare_ioctl = linear_prepare_ioctl,
>  	.iterate_devices = linear_iterate_devices,
> 
> 
> 

With Oracle setups and multipath, we have plenty of customers using
non-NVMe LUNs (i.e. F/C) with a single partition on top of a request-based
multipath device with no issues.
Same for filesystems on top of multipath devices with a single
partition.

It's very uncommon to share a disk between multiple partitions and
multipath.

It has to be the multiple partitions, but we should test non-NVMe with
multiple partitions in the lab setup, I guess, to make sure.
^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24 13:07   ` Mike Snitzer
  2018-07-24 13:22     ` Laurence Oberman
@ 2018-07-24 13:51     ` Hannes Reinecke
  2018-07-24 13:57       ` Laurence Oberman
  2018-07-24 17:42     ` Christoph Hellwig
  2 siblings, 1 reply; 11+ messages in thread
From: Hannes Reinecke @ 2018-07-24 13:51 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-nvme, linux-block, dm-devel

On 07/24/2018 03:07 PM, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at  2:00am -0400,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 07/23/2018 06:33 PM, Mike Snitzer wrote:
>>> Hi,
>>>
>>> I've opened the following public BZ:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1607527
>>>
>>> Feel free to add comments to that BZ if you have a redhat bugzilla
>>> account.
>>>
>>> But otherwise, happy to get as much feedback and discussion going purely
>>> on the relevant lists.  I've taken ~1.5 weeks to categorize and isolate
>>> this issue.  But I've reached a point where I'm getting diminishing
>>> returns and could _really_ use the collective eyeballs and expertise of
>>> the community.  This is by far one of the most nasty cases of corruption
>>> I've seen in a while.  Not sure where the ultimate cause of corruption
>>> lies (that the money question) but it _feels_ rooted in NVMe and is
>>> unique to this particular workload I've stumbled onto via customer
>>> escalation and then trying to replicate an rbd device using a more
>>> approachable one (request-based DM multipath in this case).
>>>
>> I might be stating the obvious, but so far we only have considered
>> request-based multipath as being active for the _entire_ device.
>> To my knowledge we've never tested that when running on a partition.
> 
> True.  We only ever support mapping the partitions ontop of
> request-based multipath (via dm-linear volumes created by kpartx).
> 
>> So, have you tested that request-based multipathing works on a
>> partition _at all_? I'm not sure if partition mapping is done
>> correctly here; we never remap the start of the request (nor bio,
>> come to speak of it), so it looks as if we would be doing the wrong
>> things here.
>>
>> Have you checked that partition remapping is done correctly?
> 
> It clearly doesn't work.  Not quite following why but...
> 
> After running the test the partition table at the start of the whole
> NVMe device is overwritten by XFS.  So likely the IO destined to the
> dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to the
> whole NVMe device:
> 
> # pvcreate /dev/nvme1n1
> WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it? [y/n]
> 
> # vgcreate test /dev/nvme1n1
> # lvcreate -n slow -L 512G test
> WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe it?
> [y/n]: y
>    Wiping xfs signature on /dev/test/slow.
>    Logical volume "slow" created.
> 
> Isn't this a failing of block core's partitioning?  Why should a target
> that is given the entire partition of a device need to be concerned with
> remapping IO?  Shouldn't block core handle that mapping?
> 
Only if the device is marked as 'partitionable', which device-mapper
devices are not.
But I thought you knew that ...

> Anyway, yesterday I went so far as to hack together request-based
> support for DM linear (because request-based DM cannot stack on
> bio-based DM) .  With this, request-based linear devices instead of
> conventional partitioning, I no longer see the XFS corruption when
> running the test:
> 
_Actually_, I would've done it the other way around; after all, what's
the point in running dm-multipath on a partition?
Anything running on the other partitions would suffer from the issues
dm-multipath is designed to handle (temporary path loss etc), so I'm not
quite sure what you are trying to achieve with your testcase.
Can you enlighten me?

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24 13:51     ` Hannes Reinecke
@ 2018-07-24 13:57       ` Laurence Oberman
  2018-07-24 15:18         ` Mike Snitzer
  0 siblings, 1 reply; 11+ messages in thread
From: Laurence Oberman @ 2018-07-24 13:57 UTC (permalink / raw)
  To: Hannes Reinecke, Mike Snitzer; +Cc: linux-nvme, linux-block, dm-devel

On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> On 07/24/2018 03:07 PM, Mike Snitzer wrote:
> > On Tue, Jul 24 2018 at  2:00am -0400,
> > Hannes Reinecke <hare@suse.de> wrote:
> > 
> > > On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> > > > Hi,
> > > > 
> > > > I've opened the following public BZ:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> > > > 
> > > > Feel free to add comments to that BZ if you have a redhat
> > > > bugzilla
> > > > account.
> > > > 
> > > > But otherwise, happy to get as much feedback and discussion
> > > > going purely
> > > > on the relevant lists.  I've taken ~1.5 weeks to categorize and
> > > > isolate
> > > > this issue.  But I've reached a point where I'm getting
> > > > diminishing
> > > > returns and could _really_ use the collective eyeballs and
> > > > expertise of
> > > > the community.  This is by far one of the most nasty cases of
> > > > corruption
> > > > I've seen in a while.  Not sure where the ultimate cause of
> > > > corruption
> > > > lies (that the money question) but it _feels_ rooted in NVMe
> > > > and is
> > > > unique to this particular workload I've stumbled onto via
> > > > customer
> > > > escalation and then trying to replicate an rbd device using a
> > > > more
> > > > approachable one (request-based DM multipath in this case).
> > > > 
> > > 
> > > I might be stating the obvious, but so far we only have
> > > considered
> > > request-based multipath as being active for the _entire_ device.
> > > To my knowledge we've never tested that when running on a
> > > partition.
> > 
> > True.  We only ever support mapping the partitions ontop of
> > request-based multipath (via dm-linear volumes created by kpartx).
> > 
> > > So, have you tested that request-based multipathing works on a
> > > partition _at all_? I'm not sure if partition mapping is done
> > > correctly here; we never remap the start of the request (nor bio,
> > > come to speak of it), so it looks as if we would be doing the
> > > wrong
> > > things here.
> > > 
> > > Have you checked that partition remapping is done correctly?
> > 
> > It clearly doesn't work.  Not quite following why but...
> > 
> > After running the test the partition table at the start of the
> > whole
> > NVMe device is overwritten by XFS.  So likely the IO destined to
> > the
> > dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to
> > the
> > whole NVMe device:
> > 
> > # pvcreate /dev/nvme1n1
> > WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe
> > it? [y/n]
> > 
> > # vgcreate test /dev/nvme1n1
> > # lvcreate -n slow -L 512G test
> > WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe
> > it?
> > [y/n]: y
> >    Wiping xfs signature on /dev/test/slow.
> >    Logical volume "slow" created.
> > 
> > Isn't this a failing of block core's partitioning?  Why should a
> > target
> > that is given the entire partition of a device need to be concerned
> > with
> > remapping IO?  Shouldn't block core handle that mapping?
> > 
> 
> Only if the device is marked a 'partitionable', which device-mapper 
> devices are not.
> But I thought you knew that ...
> 
> > Anyway, yesterday I went so far as to hack together request-based
> > support for DM linear (because request-based DM cannot stack on
> > bio-based DM) .  With this, request-based linear devices instead of
> > conventional partitioning, I no longer see the XFS corruption when
> > running the test:
> > 
> 
> _Actually_, I would've done it the other way around; after all,
> where't 
> the point in running dm-multipath on a partition?
> Anything running on the other partitions would suffer from the
> issues 
> dm-multipath is designed to handle (temporary path loss etc), so I'm
> not 
> quite sure what you are trying to achieve with your testcase.
> Can you enlighten me?
> 
> Cheers,
> 
> Hannes

This came about because a customer is using NVMe for a dm-cache device
and created multiple partitions so as to use the same NVMe device to cache
multiple different "slower" devices.  The corruption was noticed in XFS
and I engaged Mike to assist in figuring out what was going on.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24  6:00 ` Hannes Reinecke
  2018-07-24 13:07   ` Mike Snitzer
@ 2018-07-24 14:25   ` Bart Van Assche
  2018-07-24 15:07     ` Mike Snitzer
  1 sibling, 1 reply; 11+ messages in thread
From: Bart Van Assche @ 2018-07-24 14:25 UTC (permalink / raw)
  To: dm-devel@redhat.com, linux-block@vger.kernel.org, hare@suse.de,
	linux-nvme@lists.infradead.org, snitzer@redhat.com

On Tue, 2018-07-24 at 08:00 +0200, Hannes Reinecke wrote:
> So, have you tested that request-based multipathing works on a partition
> _at all_? I'm not sure if partition mapping is done correctly here; we
> never remap the start of the request (nor bio, come to speak of it), so
> it looks as if we would be doing the wrong things here.
> 
> Have you checked that partition remapping is done correctly?

I think generic_make_request() takes care of partition remapping by calling
blk_partition_remap(). generic_make_request() is called by submit_bio(). Is
that sufficient to cover all dm drivers?

Bart.
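
[Editorial aside, not part of Bart's mail: a minimal sketch of what that
bio-path remapping amounts to, heavily simplified from the v4.18-era
blk-core.c (error handling, RCU and accounting dropped; the function name
below is made up for illustration, while the fields and helpers are real).
The point is that the remap happens on the *bio* in generic_make_request(),
so a cloned *request* inserted directly via blk_insert_cloned_request()
never passes through it.]

/*
 * Illustrative sketch only: shift a partition-relative bio to an
 * absolute sector on the whole disk, roughly what blk_partition_remap()
 * does on the submit_bio()/generic_make_request() path.
 */
static void partition_remap_sketch(struct bio *bio)
{
	struct hd_struct *p = disk_get_part(bio->bi_disk, bio->bi_partno);

	if (p) {
		bio->bi_iter.bi_sector += p->start_sect; /* offset into the whole disk */
		bio->bi_partno = 0;                      /* now addressed at the disk */
		disk_put_part(p);
	}
}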

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24 14:25   ` Bart Van Assche
@ 2018-07-24 15:07     ` Mike Snitzer
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2018-07-24 15:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: dm-devel@redhat.com, linux-block@vger.kernel.org, hare@suse.de,
	linux-nvme@lists.infradead.org

On Tue, Jul 24 2018 at 10:25am -0400,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:

> On Tue, 2018-07-24 at 08:00 +0200, Hannes Reinecke wrote:
> > So, have you tested that request-based multipathing works on a partition 
> > _at all_? I'm not sure if partition mapping is done correctly here; we 
> > never remap the start of the request (nor bio, come to speak of it), so 
> > it looks as if we would be doing the wrong things here.
> > 
> > Have you checked that partition remapping is done correctly?
> 
> I think generic_make_request() takes care of partition remapping by calling
> blk_partition_remap(). generic_make_request() is called by submit_bio(). Is
> that sufficient to cover all dm drivers?

Seems not for request-based DM (see my previous reply in this thread).

But bio-based DM-multipath seems to work just fine.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24 13:57       ` Laurence Oberman
@ 2018-07-24 15:18         ` Mike Snitzer
  2018-07-24 15:31           ` Laurence Oberman
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2018-07-24 15:18 UTC (permalink / raw)
  To: Hannes Reinecke, Laurence Oberman; +Cc: linux-nvme, linux-block, dm-devel

On Tue, Jul 24 2018 at  9:57am -0400,
Laurence Oberman <loberman@redhat.com> wrote:

> On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> > 
> > _Actually_, I would've done it the other way around; after all,
> > where't the point in running dm-multipath on a partition?
> > Anything running on the other partitions would suffer from the
> > issues dm-multipath is designed to handle (temporary path loss etc), so I'm
> > not quite sure what you are trying to achieve with your testcase.
> > Can you enlighten me?
> > 
> > Cheers,
> > 
> > Hannes

I wasn't looking to deploy this (multipath on a partition) in production or
suggest it to others.  It was a means to experiment.  More below.

> This came about because a customer is using nvme for a dm-cache device
> and created multiple partitions so as to use the same nvme to cache
> multiple different "slower" devices. The corruption was noticed in XFS
> and I engaged Mike to assist in figuring out what was going on.

Yes, so topology for the customer's setup is:

1) MD raid1 on 2 NVMe partitions (from separate NVMe devices).
2) Then DM cache's "fast" and "metadata" devices layered on a dm-linear
   mapping on top of the MD raid1.
3) Then Ceph's rbd for DM-cache's slow device.

I was just looking to simplify the stack to try to assess why XFS
corruption was being seen without all the insanity.

One issue was corruption due to incorrect shutdown ordering (the network was
getting shut down out from underneath rbd, and in turn DM-cache couldn't
complete its IO migrations during cache_postsuspend()).

So I elected to try using DM multipath with queue_if_no_path to
replicate rbd losing the network _without_ needing a full Ceph/rbd setup.

The rest is history... a rat-hole of corruption that is likely very
different from the customer's setup.

Mike

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24 15:18         ` Mike Snitzer
@ 2018-07-24 15:31           ` Laurence Oberman
  0 siblings, 0 replies; 11+ messages in thread
From: Laurence Oberman @ 2018-07-24 15:31 UTC (permalink / raw)
  To: Mike Snitzer, Hannes Reinecke
  Cc: linux-nvme, linux-block, dm-devel, Brett Hull

On Tue, 2018-07-24 at 11:18 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at  9:57am -0400,
> Laurence Oberman <loberman@redhat.com> wrote:
> 
> > On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> > > 
> > > _Actually_, I would've done it the other way around; after all,
> > > where't the point in running dm-multipath on a partition?
> > > Anything running on the other partitions would suffer from the
> > > issues dm-multipath is designed to handle (temporary path loss
> > > etc), so I'm
> > > not quite sure what you are trying to achieve with your testcase.
> > > Can you enlighten me?
> > > 
> > > Cheers,
> > > 
> > > Hannes
> 
> I wasn't looking to deply this (multipath on partition) in production
> or
> suggest it to others.  It was a means to experiment.  More below.
> 
> > This came about because a customer is using nvme for a dm-cache
> > device
> > and created multiple partitions so as to use the same nvme to cache
> > multiple different "slower" devices. The corruption was noticed in
> > XFS
> > and I engaged Mike to assist in figuring out what was going on.
> 
> Yes, so topology for the customer's setup is:
> 
> 1) MD raid1 on 2 NVMe partitions (from separate NVMe devices).
> 2) Then DM cache's "fast" and "metadata" devices layered on dm-linear
>    mapping ontop of the MD raid1.
> 3) Then Ceph's rbd for DM-cache's slow device.
> 
> I was just looking to simplify the stack to try to assess why XFS
> corruption was being seen without all the insanity.
> 
> One issue was corruption due to incorrect shutdown order (network was
> getting shutdown out from underneath rbd, and in turn DM-cache
> couldn't
> complete its IO migrations during cache_postsuspend()).
> 
> So I elected to try using DM multipath with queue_if_no_path to try
> to
> replicate rbd losing network _without_ needing a full Ceph/rbd setup.
> 
> The rest is history... a rat-hole of corruption that likely is very
> different than the customer's setup.
> 
> Mike
Not to muddy the waters here: as Mike said, the issue he tripped over
may not be the direct issue we originally started with.

In the lab reproducer with rbd as the slow device we do not have an
MD-raided NVMe for the dm-cache, but we still see the corruption only on
the rbd-based test.

We used the NVMe partitioned, but with no DM raid, to try F/C
device-mapper-multipath LUNs cached via dm-cache.

The last test we ran, where we did not see corruption, used a partitioned
NVMe where the second partition was used to cache the F/C LUNs:

nvme0n1                             259:0    0 372.6G  0 disk
├─nvme0n1p1                         259:1    0   150G  0 part
└─nvme0n1p2                         259:2    0   150G  0 part
  ├─cache_FC-nvme_blk_cache_cdata   253:42   0    20G  0 lvm
  │ └─cache_FC-fc_disk              253:45   0    48G  0 lvm  /cache_FC
  └─cache_FC-nvme_blk_cache_cmeta   253:43   0    40M  0 lvm
    └─cache_FC-fc_disk              253:45   0    48G  0 lvm  /cache_FC

cache_FC-fc_disk (253:45)
 ├─cache_FC-fc_disk_corig (253:44)
 │  └─3600140508da66c2c9ee4cc6aface1bab (253:36) Multipath
 │     ├─ (68:224)
 │     ├─ (69:240)
 │     ├─ (8:192)
 │     └─ (8:64)
 ├─cache_FC-nvme_blk_cache_cdata (253:42)
 │  └─ (259:2)
 └─cache_FC-nvme_blk_cache_cmeta (253:43)
    └─ (259:2)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
  2018-07-24 13:07   ` Mike Snitzer
  2018-07-24 13:22     ` Laurence Oberman
  2018-07-24 13:51     ` Hannes Reinecke
@ 2018-07-24 17:42     ` Christoph Hellwig
  2 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2018-07-24 17:42 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Hannes Reinecke, linux-nvme, linux-block, dm-devel

On Tue, Jul 24, 2018 at 09:07:03AM -0400, Mike Snitzer wrote:
> True.  We only ever support mapping the partitions ontop of
> request-based multipath (via dm-linear volumes created by kpartx).
> 
> > So, have you tested that request-based multipathing works on a
> > partition _at all_? I'm not sure if partition mapping is done
> > correctly here; we never remap the start of the request (nor bio,
> > come to speak of it), so it looks as if we would be doing the wrong
> > things here.
> > 
> > Have you checked that partition remapping is done correctly?
> 
> It clearly doesn't work.  Not quite following why but...

blk_insert_cloned_request seems to be missing a call to
blk_partition_remap.  Given that no one but dm-multipath uses this
request clone insert helper, and people generally run multipath on
the whole device, this is a code path that is almost never exercised.
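
[Editorial aside, not part of Christoph's mail: a hedged sketch of what such
a remap could look like if it were done where request-based DM prepares the
clone, e.g. alongside the blk_get_request() call in a clone_and_map path.
The helper name is made up; 'clone' and 'bdev' stand for the cloned request
and the (possibly partitioned) block device the target opened.  Whether the
right place is really the DM clone path or blk_insert_cloned_request itself
is exactly the open question here.]

/*
 * Hypothetical illustration only, not the upstream fix: apply the
 * partition's start offset to the cloned request, mirroring what
 * blk_partition_remap() does for bios.
 */
static void remap_clone_to_partition(struct request *clone,
				     struct block_device *bdev)
{
	/* For a partition, bd_contains points at the whole-disk bdev
	 * and differs from bdev itself. */
	if (bdev != bdev->bd_contains)
		clone->__sector += bdev->bd_part->start_sect;
}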

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-07-24 18:50 UTC | newest]

Thread overview: 11+ messages
2018-07-23 16:33 data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device Mike Snitzer
2018-07-24  6:00 ` Hannes Reinecke
2018-07-24 13:07   ` Mike Snitzer
2018-07-24 13:22     ` Laurence Oberman
2018-07-24 13:51     ` Hannes Reinecke
2018-07-24 13:57       ` Laurence Oberman
2018-07-24 15:18         ` Mike Snitzer
2018-07-24 15:31           ` Laurence Oberman
2018-07-24 17:42     ` Christoph Hellwig
2018-07-24 14:25   ` Bart Van Assche
2018-07-24 15:07     ` Mike Snitzer
