* data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
@ 2018-07-24 13:07 ` Mike Snitzer
0 siblings, 0 replies; 32+ messages in thread
From: Mike Snitzer @ 2018-07-24 13:07 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: linux-nvme, linux-block, dm-devel
On Tue, Jul 24 2018 at 2:00am -0400,
Hannes Reinecke <hare@suse.de> wrote:
> On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> >Hi,
> >
> >I've opened the following public BZ:
> >https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> >
> >Feel free to add comments to that BZ if you have a redhat bugzilla
> >account.
> >
> >But otherwise, happy to get as much feedback and discussion going purely
> >on the relevant lists. I've taken ~1.5 weeks to categorize and isolate
> >this issue. But I've reached a point where I'm getting diminishing
> >returns and could _really_ use the collective eyeballs and expertise of
> >the community. This is by far one of the most nasty cases of corruption
> >I've seen in a while. Not sure where the ultimate cause of corruption
> >lies (that's the money question) but it _feels_ rooted in NVMe and is
> >unique to this particular workload I've stumbled onto via customer
> >escalation and then trying to replicate an rbd device using a more
> >approachable one (request-based DM multipath in this case).
> >
> I might be stating the obvious, but so far we only have considered
> request-based multipath as being active for the _entire_ device.
> To my knowledge we've never tested that when running on a partition.
True. We only ever support mapping the partitions on top of
request-based multipath (via dm-linear volumes created by kpartx).
> So, have you tested that request-based multipathing works on a
> partition _at all_? I'm not sure if partition mapping is done
> correctly here; we never remap the start of the request (nor bio,
> come to speak of it), so it looks as if we would be doing the wrong
> things here.
>
> Have you checked that partition remapping is done correctly?
It clearly doesn't work. Not quite following why but...
After running the test, the partition table at the start of the whole
NVMe device is overwritten by XFS. So the IO destined for dm-cache's
"slow" device (a dm-mpath device on an NVMe partition) was likely issued
to the whole NVMe device:
# pvcreate /dev/nvme1n1
WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it? [y/n]
# vgcreate test /dev/nvme1n1
# lvcreate -n slow -L 512G test
WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe it? [y/n]: y
Wiping xfs signature on /dev/test/slow.
Logical volume "slow" created.
Isn't this a failing of block core's partitioning? Why should a target
that is given the entire partition of a device need to be concerned with
remapping IO? Shouldn't block core handle that mapping?
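To make the suspected failure concrete, here is a minimal userspace
sketch of the remapping step in question. It is not kernel code: the
struct and function names (toy_partition, toy_partition_remap,
toy_no_remap) are hypothetical and exist only for illustration. The
assumption is that a partition-relative sector has to have the
partition's start sector added before it reaches the whole device; if
that step is skipped, sector 0 of the partition (where the XFS
superblock lives) lands on sector 0 of the disk (where the partition
table lives).

```c
#include <stdint.h>

typedef uint64_t sector_t; /* 512-byte sector index */

/* Hypothetical, simplified stand-in for a partition; not the kernel's struct. */
struct toy_partition {
	sector_t start_sect; /* first sector of the partition on the whole disk */
	sector_t nr_sects;   /* partition length in sectors */
};

/* Translate a partition-relative sector into a whole-disk sector. */
static sector_t toy_partition_remap(const struct toy_partition *p, sector_t sector)
{
	return p->start_sect + sector;
}

/* Model of the suspected bug: the sector is passed through unchanged. */
static sector_t toy_no_remap(const struct toy_partition *p, sector_t sector)
{
	(void)p; /* the partition start is ignored entirely */
	return sector;
}
```

With a partition starting at sector 2048, toy_partition_remap() sends a
write to partition sector 0 to disk sector 2048, while toy_no_remap()
sends it to disk sector 0, which would match the xfs signature showing
up at offset 0 of /dev/nvme1n1.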
Anyway, yesterday I went so far as to hack together request-based
support for DM linear (because request-based DM cannot stack on
bio-based DM). With this, using request-based linear devices instead of
conventional partitioning, I no longer see the XFS corruption when
running the test:
drivers/md/dm-linear.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 42 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index d10964d41fd7..d4a65dd20c6e 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -12,6 +12,7 @@
#include <linux/dax.h>
#include <linux/slab.h>
#include <linux/device-mapper.h>
+#include <linux/blk-mq.h>
#define DM_MSG_PREFIX "linear"
@@ -24,7 +25,7 @@ struct linear_c {
};
/*
- * Construct a linear mapping: <dev_path> <offset>
+ * Construct a linear mapping: <dev_path> <offset> [<# optional params> <optional params>]
*/
static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
@@ -57,6 +58,11 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad;
}
+ // FIXME: need to parse optional args
+ // FIXME: model alloc_multipath_stage2()?
+ // Call: dm_table_set_type()
+ dm_table_set_type(ti->table, DM_TYPE_MQ_REQUEST_BASED);
+
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->num_secure_erase_bios = 1;
@@ -113,6 +119,37 @@ static int linear_end_io(struct dm_target *ti, struct bio *bio,
return DM_ENDIO_DONE;
}
+static int linear_clone_and_map(struct dm_target *ti, struct request *rq,
+ union map_info *map_context,
+ struct request **__clone)
+{
+ struct linear_c *lc = ti->private;
+ struct block_device *bdev = lc->dev->bdev;
+ struct request_queue *q = bdev_get_queue(bdev);
+
+ struct request *clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE,
+ BLK_MQ_REQ_NOWAIT);
+ if (IS_ERR(clone)) {
+ if (blk_queue_dying(q) || !q->mq_ops)
+ return DM_MAPIO_DELAY_REQUEUE;
+
+ return DM_MAPIO_REQUEUE;
+ }
+
+ clone->__sector = linear_map_sector(ti, rq->__sector);
+ clone->bio = clone->biotail = NULL;
+ clone->rq_disk = bdev->bd_disk;
+ clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
+ *__clone = clone;
+
+ return DM_MAPIO_REMAPPED;
+}
+
+static void linear_release_clone(struct request *clone)
+{
+ blk_put_request(clone);
+}
+
static void linear_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
{
@@ -207,13 +244,15 @@ static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
static struct target_type linear_target = {
.name = "linear",
- .version = {1, 4, 0},
- .features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
+ .version = {1, 5, 0},
+ .features = DM_TARGET_IMMUTABLE | DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
.module = THIS_MODULE,
.ctr = linear_ctr,
.dtr = linear_dtr,
.map = linear_map,
.end_io = linear_end_io,
+ .clone_and_map_rq = linear_clone_and_map,
+ .release_clone_rq = linear_release_clone,
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
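
For readers following the linear_map_sector() call in the patch, the
arithmetic can be sketched in userspace as below. This is a hedged
model, not the kernel implementation: struct toy_linear and
toy_linear_map_sector() are hypothetical names, and the formula assumes
(as I understand dm-linear) that the mapped sector is the target's
start offset on the underlying device plus the sector's offset within
the target.

```c
#include <stdint.h>

typedef uint64_t sector_t; /* 512-byte sector index */

/* Hypothetical userspace model of one linear target; not the kernel structs. */
struct toy_linear {
	sector_t begin; /* where this target starts within the DM device */
	sector_t len;   /* target length in sectors */
	sector_t start; /* offset of the mapping on the underlying device */
};

/* Map a DM-device sector to a sector on the underlying device. */
static sector_t toy_linear_map_sector(const struct toy_linear *t, sector_t sector)
{
	/* offset within the target, shifted by the mapping's start offset */
	return t->start + (sector - t->begin);
}
```

Because the offset is applied explicitly by the target, a linear device
created with a start of 2048 behaves like a partition starting at
sector 2048 without relying on the block core's partition remapping.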
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
2018-07-24 13:07 ` Mike Snitzer
(?)
@ 2018-07-24 13:22 ` Laurence Oberman
-1 siblings, 0 replies; 32+ messages in thread
From: Laurence Oberman @ 2018-07-24 13:22 UTC (permalink / raw)
To: Mike Snitzer, Hannes Reinecke; +Cc: linux-block, dm-devel, linux-nvme
On Tue, 2018-07-24 at 09:07 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at 2:00am -0400,
> Hannes Reinecke <hare@suse.de> wrote:
>
> > On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> > > Hi,
> > >
> > > I've opened the following public BZ:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> > >
> > > Feel free to add comments to that BZ if you have a redhat
> > > bugzilla
> > > account.
> > >
> > > But otherwise, happy to get as much feedback and discussion going
> > > purely
> > > on the relevant lists. I've taken ~1.5 weeks to categorize and
> > > isolate
> > > this issue. But I've reached a point where I'm getting
> > > diminishing
> > > returns and could _really_ use the collective eyeballs and
> > > expertise of
> > > the community. This is by far one of the most nasty cases of
> > > corruption
> > > I've seen in a while. Not sure where the ultimate cause of
> > > corruption
> > > lies (that's the money question) but it _feels_ rooted in NVMe and
> > > is
> > > unique to this particular workload I've stumbled onto via
> > > customer
> > > escalation and then trying to replicate an rbd device using a
> > > more
> > > approachable one (request-based DM multipath in this case).
> > >
> >
> > I might be stating the obvious, but so far we only have considered
> > request-based multipath as being active for the _entire_ device.
> > To my knowledge we've never tested that when running on a
> > partition.
>
> True. We only ever support mapping the partitions on top of
> request-based multipath (via dm-linear volumes created by kpartx).
>
> > So, have you tested that request-based multipathing works on a
> > partition _at all_? I'm not sure if partition mapping is done
> > correctly here; we never remap the start of the request (nor bio,
> > come to speak of it), so it looks as if we would be doing the wrong
> > things here.
> >
> > Have you checked that partition remapping is done correctly?
>
> It clearly doesn't work. Not quite following why but...
>
> After running the test the partition table at the start of the whole
> NVMe device is overwritten by XFS. So likely the IO destined to the
> dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to
> the
> whole NVMe device:
>
> # pvcreate /dev/nvme1n1
> WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it?
> [y/n]
>
> # vgcreate test /dev/nvme1n1
> # lvcreate -n slow -L 512G test
> WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe
> it?
> [y/n]: y
> Wiping xfs signature on /dev/test/slow.
> Logical volume "slow" created.
>
> Isn't this a failing of block core's partitioning? Why should a
> target
> that is given the entire partition of a device need to be concerned
> with
> remapping IO? Shouldn't block core handle that mapping?
>
> Anyway, yesterday I went so far as to hack together request-based
> support for DM linear (because request-based DM cannot stack on
> bio-based DM). With this, using request-based linear devices instead of
> conventional partitioning, I no longer see the XFS corruption when
> running the test:
>
> drivers/md/dm-linear.c | 45
> ++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index d10964d41fd7..d4a65dd20c6e 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -12,6 +12,7 @@
> #include <linux/dax.h>
> #include <linux/slab.h>
> #include <linux/device-mapper.h>
> +#include <linux/blk-mq.h>
>
> #define DM_MSG_PREFIX "linear"
>
> @@ -24,7 +25,7 @@ struct linear_c {
> };
>
> /*
> - * Construct a linear mapping: <dev_path> <offset>
> + * Construct a linear mapping: <dev_path> <offset> [<# optional
> params> <optional params>]
> */
> static int linear_ctr(struct dm_target *ti, unsigned int argc, char
> **argv)
> {
> @@ -57,6 +58,11 @@ static int linear_ctr(struct dm_target *ti,
> unsigned int argc, char **argv)
> goto bad;
> }
>
> + // FIXME: need to parse optional args
> + // FIXME: model alloc_multipath_stage2()?
> + // Call: dm_table_set_type()
> + dm_table_set_type(ti->table, DM_TYPE_MQ_REQUEST_BASED);
> +
> ti->num_flush_bios = 1;
> ti->num_discard_bios = 1;
> ti->num_secure_erase_bios = 1;
> @@ -113,6 +119,37 @@ static int linear_end_io(struct dm_target *ti,
> struct bio *bio,
> return DM_ENDIO_DONE;
> }
>
> +static int linear_clone_and_map(struct dm_target *ti, struct request
> *rq,
> + union map_info *map_context,
> + struct request **__clone)
> +{
> + struct linear_c *lc = ti->private;
> + struct block_device *bdev = lc->dev->bdev;
> + struct request_queue *q = bdev_get_queue(bdev);
> +
> + struct request *clone = blk_get_request(q, rq->cmd_flags |
> REQ_NOMERGE,
> + BLK_MQ_REQ_NOWAIT);
> + if (IS_ERR(clone)) {
> + if (blk_queue_dying(q) || !q->mq_ops)
> + return DM_MAPIO_DELAY_REQUEUE;
> +
> + return DM_MAPIO_REQUEUE;
> + }
> +
> + clone->__sector = linear_map_sector(ti, rq->__sector);
> + clone->bio = clone->biotail = NULL;
> + clone->rq_disk = bdev->bd_disk;
> + clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
> + *__clone = clone;
> +
> + return DM_MAPIO_REMAPPED;
> +}
> +
> +static void linear_release_clone(struct request *clone)
> +{
> + blk_put_request(clone);
> +}
> +
> static void linear_status(struct dm_target *ti, status_type_t type,
> unsigned status_flags, char *result,
> unsigned maxlen)
> {
> @@ -207,13 +244,15 @@ static size_t linear_dax_copy_to_iter(struct
> dm_target *ti, pgoff_t pgoff,
>
> static struct target_type linear_target = {
> .name = "linear",
> - .version = {1, 4, 0},
> - .features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
> + .version = {1, 5, 0},
> + .features = DM_TARGET_IMMUTABLE | DM_TARGET_PASSES_INTEGRITY
> | DM_TARGET_ZONED_HM,
> .module = THIS_MODULE,
> .ctr = linear_ctr,
> .dtr = linear_dtr,
> .map = linear_map,
> .end_io = linear_end_io,
> + .clone_and_map_rq = linear_clone_and_map,
> + .release_clone_rq = linear_release_clone,
> .status = linear_status,
> .prepare_ioctl = linear_prepare_ioctl,
> .iterate_devices = linear_iterate_devices,
>
>
>
With Oracle setups and multipath, we have plenty of customers using
non-NVMe LUNs (i.e. F/C) with a single partition on top of
request-based multipath with no issues.
The same goes for file systems on top of multipath devices with a
single partition.
It's very uncommon to share a disk between multiple partitions and
multipath.
It has to be the multiple partitions, but I guess we should test on
non-NVMe with multiple partitions in the lab setup to make sure.
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
@ 2018-07-24 13:22 ` Laurence Oberman
0 siblings, 0 replies; 32+ messages in thread
From: Laurence Oberman @ 2018-07-24 13:22 UTC (permalink / raw)
To: Mike Snitzer, Hannes Reinecke; +Cc: linux-nvme, linux-block, dm-devel
On Tue, 2018-07-24 at 09:07 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at 2:00am -0400,
> Hannes Reinecke <hare@suse.de> wrote:
>
> > On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> > > Hi,
> > >
> > > I've opened the following public BZ:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> > >
> > > Feel free to add comments to that BZ if you have a redhat
> > > bugzilla
> > > account.
> > >
> > > But otherwise, happy to get as much feedback and discussion going
> > > purely
> > > on the relevant lists. I've taken ~1.5 weeks to categorize and
> > > isolate
> > > this issue. But I've reached a point where I'm getting
> > > diminishing
> > > returns and could _really_ use the collective eyeballs and
> > > expertise of
> > > the community. This is by far one of the most nasty cases of
> > > corruption
> > > I've seen in a while. Not sure where the ultimate cause of
> > > corruption
> > > lies (that the money question) but it _feels_ rooted in NVMe and
> > > is
> > > unique to this particular workload I've stumbled onto via
> > > customer
> > > escalation and then trying to replicate an rbd device using a
> > > more
> > > approachable one (request-based DM multipath in this case).
> > >
> >
> > I might be stating the obvious, but so far we only have considered
> > request-based multipath as being active for the _entire_ device.
> > To my knowledge we've never tested that when running on a
> > partition.
>
> True. We only ever support mapping the partitions ontop of
> request-based multipath (via dm-linear volumes created by kpartx).
>
> > So, have you tested that request-based multipathing works on a
> > partition _at all_? I'm not sure if partition mapping is done
> > correctly here; we never remap the start of the request (nor bio,
> > come to speak of it), so it looks as if we would be doing the wrong
> > things here.
> >
> > Have you checked that partition remapping is done correctly?
>
> It clearly doesn't work. Not quite following why but...
>
> After running the test the partition table at the start of the whole
> NVMe device is overwritten by XFS. So likely the IO destined to the
> dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to
> the
> whole NVMe device:
>
> # pvcreate /dev/nvme1n1
> WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it?
> [y/n]
>
> # vgcreate test /dev/nvme1n1
> # lvcreate -n slow -L 512G test
> WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe
> it?
> [y/n]: y
> Wiping xfs signature on /dev/test/slow.
> Logical volume "slow" created.
>
> Isn't this a failing of block core's partitioning? Why should a
> target
> that is given the entire partition of a device need to be concerned
> with
> remapping IO? Shouldn't block core handle that mapping?
>
> Anyway, yesterday I went so far as to hack together request-based
> support for DM linear (because request-based DM cannot stack on
> bio-based DM) . With this, request-based linear devices instead of
> conventional partitioning, I no longer see the XFS corruption when
> running the test:
>
> drivers/md/dm-linear.c | 45
> ++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index d10964d41fd7..d4a65dd20c6e 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -12,6 +12,7 @@
> #include <linux/dax.h>
> #include <linux/slab.h>
> #include <linux/device-mapper.h>
> +#include <linux/blk-mq.h>
>
> #define DM_MSG_PREFIX "linear"
>
> @@ -24,7 +25,7 @@ struct linear_c {
> };
>
> /*
> - * Construct a linear mapping: <dev_path> <offset>
> + * Construct a linear mapping: <dev_path> <offset> [<# optional
> params> <optional params>]
> */
> static int linear_ctr(struct dm_target *ti, unsigned int argc, char
> **argv)
> {
> @@ -57,6 +58,11 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>  		goto bad;
>  	}
>  
> +	// FIXME: need to parse optional args
> +	// FIXME: model alloc_multipath_stage2()?
> +	// Call: dm_table_set_type()
> +	dm_table_set_type(ti->table, DM_TYPE_MQ_REQUEST_BASED);
> +
>  	ti->num_flush_bios = 1;
>  	ti->num_discard_bios = 1;
>  	ti->num_secure_erase_bios = 1;
> @@ -113,6 +119,37 @@ static int linear_end_io(struct dm_target *ti, struct bio *bio,
>  	return DM_ENDIO_DONE;
>  }
>  
> +static int linear_clone_and_map(struct dm_target *ti, struct request *rq,
> +				union map_info *map_context,
> +				struct request **__clone)
> +{
> +	struct linear_c *lc = ti->private;
> +	struct block_device *bdev = lc->dev->bdev;
> +	struct request_queue *q = bdev_get_queue(bdev);
> +
> +	struct request *clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE,
> +						BLK_MQ_REQ_NOWAIT);
> +	if (IS_ERR(clone)) {
> +		if (blk_queue_dying(q) || !q->mq_ops)
> +			return DM_MAPIO_DELAY_REQUEUE;
> +
> +		return DM_MAPIO_REQUEUE;
> +	}
> +
> +	clone->__sector = linear_map_sector(ti, rq->__sector);
> +	clone->bio = clone->biotail = NULL;
> +	clone->rq_disk = bdev->bd_disk;
> +	clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
> +	*__clone = clone;
> +
> +	return DM_MAPIO_REMAPPED;
> +}
> +
> +static void linear_release_clone(struct request *clone)
> +{
> +	blk_put_request(clone);
> +}
> +
>  static void linear_status(struct dm_target *ti, status_type_t type,
>  			  unsigned status_flags, char *result, unsigned maxlen)
>  {
> @@ -207,13 +244,15 @@ static size_t linear_dax_copy_to_iter(struct dm_target *ti, pgoff_t pgoff,
>  
>  static struct target_type linear_target = {
>  	.name = "linear",
> -	.version = {1, 4, 0},
> -	.features = DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
> +	.version = {1, 5, 0},
> +	.features = DM_TARGET_IMMUTABLE | DM_TARGET_PASSES_INTEGRITY | DM_TARGET_ZONED_HM,
>  	.module = THIS_MODULE,
>  	.ctr = linear_ctr,
>  	.dtr = linear_dtr,
>  	.map = linear_map,
>  	.end_io = linear_end_io,
> +	.clone_and_map_rq = linear_clone_and_map,
> +	.release_clone_rq = linear_release_clone,
>  	.status = linear_status,
>  	.prepare_ioctl = linear_prepare_ioctl,
>  	.iterate_devices = linear_iterate_devices,
>
>
>
With Oracle setups and multipath, we have plenty of customers using
non-NVMe LUNs (i.e. F/C) with a single partition on top of request-based
multipath with no issues.
The same goes for file systems on top of multipath devices with a single
partition.
It's very uncommon to share a disk between multiple partitions and
multipath.
It has to be the multiple partitions, but I guess we should test non-NVMe
with multiple partitions in the lab setup to make sure.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
2018-07-24 13:07 ` Mike Snitzer
(?)
@ 2018-07-24 13:51 ` Hannes Reinecke
-1 siblings, 0 replies; 32+ messages in thread
From: Hannes Reinecke @ 2018-07-24 13:51 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-block, dm-devel, linux-nvme
On 07/24/2018 03:07 PM, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at 2:00am -0400,
> Hannes Reinecke <hare@suse.de> wrote:
>
>> On 07/23/2018 06:33 PM, Mike Snitzer wrote:
>>> Hi,
>>>
>>> I've opened the following public BZ:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1607527
>>>
>>> Feel free to add comments to that BZ if you have a redhat bugzilla
>>> account.
>>>
>>> But otherwise, happy to get as much feedback and discussion going purely
>>> on the relevant lists. I've taken ~1.5 weeks to categorize and isolate
>>> this issue. But I've reached a point where I'm getting diminishing
>>> returns and could _really_ use the collective eyeballs and expertise of
>>> the community. This is by far one of the most nasty cases of corruption
>>> I've seen in a while. Not sure where the ultimate cause of corruption
>>> lies (that's the money question) but it _feels_ rooted in NVMe and is
>>> unique to this particular workload I've stumbled onto via customer
>>> escalation and then trying to replicate an rbd device using a more
>>> approachable one (request-based DM multipath in this case).
>>>
>> I might be stating the obvious, but so far we only have considered
>> request-based multipath as being active for the _entire_ device.
>> To my knowledge we've never tested that when running on a partition.
>
> True. We only ever support mapping the partitions on top of
> request-based multipath (via dm-linear volumes created by kpartx).
>
>> So, have you tested that request-based multipathing works on a
>> partition _at all_? I'm not sure if partition mapping is done
>> correctly here; we never remap the start of the request (nor bio,
>> come to speak of it), so it looks as if we would be doing the wrong
>> things here.
>>
>> Have you checked that partition remapping is done correctly?
>
> It clearly doesn't work. Not quite following why but...
>
> After running the test the partition table at the start of the whole
> NVMe device is overwritten by XFS. So likely the IO destined to the
> dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to the
> whole NVMe device:
>
> # pvcreate /dev/nvme1n1
> WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it? [y/n]
>
> # vgcreate test /dev/nvme1n1
> # lvcreate -n slow -L 512G test
> WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe it? [y/n]: y
> Wiping xfs signature on /dev/test/slow.
> Logical volume "slow" created.
>
> Isn't this a failing of block core's partitioning? Why should a target
> that is given the entire partition of a device need to be concerned with
> remapping IO? Shouldn't block core handle that mapping?
>
Only if the device is marked as 'partitionable', which device-mapper
devices are not.
But I thought you knew that ...
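
For illustration only, the remapping under discussion can be modeled in
userspace. This is a minimal sketch, not kernel code: struct part_model,
struct linear_model and both helper functions are hypothetical names
invented for this example. It assumes a partition remap simply adds the
partition's start sector, and that a dm-linear style mapping adds the
backing-device offset to the sector's offset within the target.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;	/* 512-byte sector index */

/* Hypothetical stand-in for a partition's geometry on the whole disk. */
struct part_model {
	sector_t start_sect;	/* first sector of the partition */
	sector_t nr_sects;	/* partition length in sectors */
};

/* Hypothetical stand-in for a single dm-linear target. */
struct linear_model {
	sector_t ti_begin;	/* where the target starts in the DM device */
	sector_t dev_start;	/* offset into the backing device */
};

/*
 * Partition remap: shift a partition-relative sector so that it becomes
 * a whole-disk sector.
 */
static inline sector_t model_partition_remap(const struct part_model *p,
					     sector_t s)
{
	assert(s < p->nr_sects);	/* IO must stay inside the partition */
	return p->start_sect + s;
}

/*
 * dm-linear style mapping: offset within the target plus the start
 * offset on the backing device.
 */
static inline sector_t model_linear_map(const struct linear_model *lc,
					sector_t s)
{
	assert(s >= lc->ti_begin);	/* sector must belong to this target */
	return lc->dev_start + (s - lc->ti_begin);
}
```

With a partition starting at sector 1050624 (an arbitrary value chosen
for this example), filesystem sector 0 maps to whole-disk sector 1050624
under either function. If that shift is skipped, the same write lands on
sector 0 of the whole disk, where the partition table lives, which would
be consistent with the xfs signature reported at offset 0 of
/dev/nvme1n1 above.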
> Anyway, yesterday I went so far as to hack together request-based
> support for DM linear (because request-based DM cannot stack on
> bio-based DM). With this, request-based linear devices instead of
> conventional partitioning, I no longer see the XFS corruption when
> running the test:
>
_Actually_, I would've done it the other way around; after all, where's
the point in running dm-multipath on a partition?
Anything running on the other partitions would suffer from the issues
dm-multipath is designed to handle (temporary path loss etc), so I'm not
quite sure what you are trying to achieve with your testcase.
Can you enlighten me?
Cheers,
Hannes
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
2018-07-24 13:51 ` Hannes Reinecke
(?)
@ 2018-07-24 13:57 ` Laurence Oberman
-1 siblings, 0 replies; 32+ messages in thread
From: Laurence Oberman @ 2018-07-24 13:57 UTC (permalink / raw)
To: Hannes Reinecke, Mike Snitzer; +Cc: linux-block, dm-devel, linux-nvme
On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> On 07/24/2018 03:07 PM, Mike Snitzer wrote:
> > On Tue, Jul 24 2018 at 2:00am -0400,
> > Hannes Reinecke <hare@suse.de> wrote:
> >
> > > On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> > > > Hi,
> > > >
> > > > I've opened the following public BZ:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> > > >
> > > > Feel free to add comments to that BZ if you have a redhat bugzilla
> > > > account.
> > > >
> > > > But otherwise, happy to get as much feedback and discussion going
> > > > purely on the relevant lists. I've taken ~1.5 weeks to categorize
> > > > and isolate this issue. But I've reached a point where I'm getting
> > > > diminishing returns and could _really_ use the collective eyeballs
> > > > and expertise of the community. This is by far one of the most
> > > > nasty cases of corruption I've seen in a while. Not sure where the
> > > > ultimate cause of corruption lies (that's the money question) but
> > > > it _feels_ rooted in NVMe and is unique to this particular workload
> > > > I've stumbled onto via customer escalation and then trying to
> > > > replicate an rbd device using a more approachable one
> > > > (request-based DM multipath in this case).
> > > >
> > >
> > > I might be stating the obvious, but so far we only have considered
> > > request-based multipath as being active for the _entire_ device.
> > > To my knowledge we've never tested that when running on a partition.
> >
> > True. We only ever support mapping the partitions on top of
> > request-based multipath (via dm-linear volumes created by kpartx).
> >
> > > So, have you tested that request-based multipathing works on a
> > > partition _at all_? I'm not sure if partition mapping is done
> > > correctly here; we never remap the start of the request (nor bio,
> > > come to speak of it), so it looks as if we would be doing the wrong
> > > things here.
> > >
> > > Have you checked that partition remapping is done correctly?
> >
> > It clearly doesn't work. Not quite following why but...
> >
> > After running the test the partition table at the start of the whole
> > NVMe device is overwritten by XFS. So likely the IO destined to the
> > dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to
> > the whole NVMe device:
> >
> > # pvcreate /dev/nvme1n1
> > WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe it? [y/n]
> >
> > # vgcreate test /dev/nvme1n1
> > # lvcreate -n slow -L 512G test
> > WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe it? [y/n]: y
> > Wiping xfs signature on /dev/test/slow.
> > Logical volume "slow" created.
> >
> > Isn't this a failing of block core's partitioning? Why should a target
> > that is given the entire partition of a device need to be concerned
> > with remapping IO? Shouldn't block core handle that mapping?
> >
>
> Only if the device is marked as 'partitionable', which device-mapper
> devices are not.
> But I thought you knew that ...
>
> > Anyway, yesterday I went so far as to hack together request-based
> > support for DM linear (because request-based DM cannot stack on
> > bio-based DM). With this, request-based linear devices instead of
> > conventional partitioning, I no longer see the XFS corruption when
> > running the test:
> >
>
> _Actually_, I would've done it the other way around; after all,
> where's the point in running dm-multipath on a partition?
> Anything running on the other partitions would suffer from the issues
> dm-multipath is designed to handle (temporary path loss etc), so I'm
> not quite sure what you are trying to achieve with your testcase.
> Can you enlighten me?
>
> Cheers,
>
> Hannes
This came about because a customer is using nvme for a dm-cache device
and created multiple partitions so as to use the same nvme to cache
multiple different "slower" devices. The corruption was noticed in XFS
and I engaged Mike to assist in figuring out what was going on.
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
2018-07-24 13:57 ` Laurence Oberman
(?)
@ 2018-07-24 15:18 ` Mike Snitzer
-1 siblings, 0 replies; 32+ messages in thread
From: Mike Snitzer @ 2018-07-24 15:18 UTC (permalink / raw)
To: Hannes Reinecke, Laurence Oberman; +Cc: linux-block, dm-devel, linux-nvme
On Tue, Jul 24 2018 at 9:57am -0400,
Laurence Oberman <loberman@redhat.com> wrote:
> On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> >
> > _Actually_, I would've done it the other way around; after all,
> > where's the point in running dm-multipath on a partition?
> > Anything running on the other partitions would suffer from the
> > issues dm-multipath is designed to handle (temporary path loss etc), so I'm
> > not quite sure what you are trying to achieve with your testcase.
> > Can you enlighten me?
> >
> > Cheers,
> >
> > Hannes
I wasn't looking to deploy this (multipath on partition) in production or
suggest it to others. It was a means to experiment. More below.
> This came about because a customer is using nvme for a dm-cache device
> and created multiple partitions so as to use the same nvme to cache
> multiple different "slower" devices. The corruption was noticed in XFS
> and I engaged Mike to assist in figuring out what was going on.
Yes, so topology for the customer's setup is:
1) MD raid1 on 2 NVMe partitions (from separate NVMe devices).
2) Then DM cache's "fast" and "metadata" devices layered on dm-linear
mapping on top of the MD raid1.
3) Then Ceph's rbd for DM-cache's slow device.
I was just looking to simplify the stack to try to assess why XFS
corruption was being seen without all the insanity.
One issue was corruption due to incorrect shutdown order (network was
getting shut down out from underneath rbd, and in turn DM-cache couldn't
complete its IO migrations during cache_postsuspend()).
So I elected to try using DM multipath with queue_if_no_path to try to
replicate rbd losing network _without_ needing a full Ceph/rbd setup.
The rest is history... a rat-hole of corruption that likely is very
different than the customer's setup.
Mike
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
2018-07-24 15:18 ` Mike Snitzer
(?)
@ 2018-07-24 15:31 ` Laurence Oberman
-1 siblings, 0 replies; 32+ messages in thread
From: Laurence Oberman @ 2018-07-24 15:31 UTC (permalink / raw)
To: Mike Snitzer, Hannes Reinecke
Cc: linux-block, Brett Hull, dm-devel, linux-nvme
On Tue, 2018-07-24 at 11:18 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at 9:57am -0400,
> Laurence Oberman <loberman@redhat.com> wrote:
>
> > On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> > >
> > > _Actually_, I would've done it the other way around; after all,
> > > where's the point in running dm-multipath on a partition?
> > > Anything running on the other partitions would suffer from the
> > > issues dm-multipath is designed to handle (temporary path loss
> > > etc), so I'm not quite sure what you are trying to achieve with
> > > your testcase.
> > > Can you enlighten me?
> > >
> > > Cheers,
> > >
> > > Hannes
>
> I wasn't looking to deploy this (multipath on partition) in production
> or suggest it to others. It was a means to experiment. More below.
>
> > This came about because a customer is using nvme for a dm-cache
> > device and created multiple partitions so as to use the same nvme
> > to cache multiple different "slower" devices. The corruption was
> > noticed in XFS and I engaged Mike to assist in figuring out what
> > was going on.
>
> Yes, so topology for the customer's setup is:
>
> 1) MD raid1 on 2 NVMe partitions (from separate NVMe devices).
> 2) Then DM cache's "fast" and "metadata" devices layered on dm-linear
> mapping ontop of the MD raid1.
> 3) Then Ceph's rbd for DM-cache's slow device.
>
> I was just looking to simplify the stack to try to assess why XFS
> corruption was being seen without all the insanity.
>
> One issue was corruption due to incorrect shutdown order (network was
> getting shut down out from underneath rbd, and in turn DM-cache
> couldn't complete its IO migrations during cache_postsuspend()).
>
> So I elected to try using DM multipath with queue_if_no_path to try
> to replicate rbd losing network _without_ needing a full Ceph/rbd
> setup.
>
> The rest is history... a rat-hole of corruption that likely is very
> different than the customer's setup.
>
> Mike
Not to muddy the waters here, but as Mike said, the issue he tripped
over may not be the issue we originally started with.
In the lab reproducer with rbd as the slow device we do not have an MD
raided nvme for the dm-cache, but we still see the corruption, and only
on the rbd based test.
We used the partitioned nvme, with no DM raid, to try F/C
device-mapper-multipath LUNs cached via dm-cache.
The last test we ran where we did not see corruption used an nvme whose
second partition was used to cache F/C LUNs:
nvme0n1                             259:0    0 372.6G  0 disk
├─nvme0n1p1                         259:1    0   150G  0 part
└─nvme0n1p2                         259:2    0   150G  0 part
  ├─cache_FC-nvme_blk_cache_cdata   253:42   0    20G  0 lvm
  │ └─cache_FC-fc_disk              253:45   0    48G  0 lvm   /cache_FC
  └─cache_FC-nvme_blk_cache_cmeta   253:43   0    40M  0 lvm
    └─cache_FC-fc_disk              253:45   0    48G  0 lvm   /cache_FC
cache_FC-fc_disk (253:45)
 ├─cache_FC-fc_disk_corig (253:44)
 │  └─3600140508da66c2c9ee4cc6aface1bab (253:36) Multipath
 │     ├─ (68:224)
 │     ├─ (69:240)
 │     ├─ (8:192)
 │     └─ (8:64)
 ├─cache_FC-nvme_blk_cache_cdata (253:42)
 │  └─ (259:2)
 └─cache_FC-nvme_blk_cache_cmeta (253:43)
    └─ (259:2)
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
2018-07-24 13:07 ` Mike Snitzer
(?)
@ 2018-07-24 17:42 ` Christoph Hellwig
-1 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2018-07-24 17:42 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-block, dm-devel, linux-nvme
On Tue, Jul 24, 2018 at 09:07:03AM -0400, Mike Snitzer wrote:
> True. We only ever support mapping the partitions ontop of
> request-based multipath (via dm-linear volumes created by kpartx).
>
> > So, have you tested that request-based multipathing works on a
> > partition _at all_? I'm not sure if partition mapping is done
> > correctly here; we never remap the start of the request (nor bio,
> > come to speak of it), so it looks as if we would be doing the wrong
> > things here.
> >
> > Have you checked that partition remapping is done correctly?
>
> It clearly doesn't work. Not quite following why but...
blk_insert_cloned_request seems to be missing a call to
blk_partition_remap. Given that no one but dm-multipath uses this
request clone insert helper, and people generally run multipath on
the whole device, this is a code path that is almost never exercised.
^ permalink raw reply [flat|nested] 32+ messages in thread