From: Laurence Oberman <loberman@redhat.com>
To: Hannes Reinecke <hare@suse.de>, Mike Snitzer <snitzer@redhat.com>
Cc: linux-block@vger.kernel.org, dm-devel@redhat.com,
linux-nvme@lists.infradead.org
Subject: Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
Date: Tue, 24 Jul 2018 09:57:03 -0400
Message-ID: <1532440623.9819.4.camel@redhat.com>
In-Reply-To: <27cadfb3-2442-3931-6d58-58aa6adb2e2b@suse.de>
On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> On 07/24/2018 03:07 PM, Mike Snitzer wrote:
> > On Tue, Jul 24 2018 at 2:00am -0400,
> > Hannes Reinecke <hare@suse.de> wrote:
> >
> > > On 07/23/2018 06:33 PM, Mike Snitzer wrote:
> > > > Hi,
> > > >
> > > > I've opened the following public BZ:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1607527
> > > >
> > > > Feel free to add comments to that BZ if you have a redhat
> > > > bugzilla
> > > > account.
> > > >
> > > > But otherwise, happy to get as much feedback and discussion
> > > > going purely
> > > > on the relevant lists. I've taken ~1.5 weeks to categorize and
> > > > isolate
> > > > this issue. But I've reached a point where I'm getting
> > > > diminishing
> > > > returns and could _really_ use the collective eyeballs and
> > > > expertise of
> > > > the community. This is by far one of the most nasty cases of
> > > > corruption
> > > > I've seen in a while. Not sure where the ultimate cause of
> > > > corruption
> > > > lies (that's the money question) but it _feels_ rooted in NVMe
> > > > and is
> > > > unique to this particular workload I've stumbled onto via
> > > > customer
> > > > escalation and then trying to replicate an rbd device using a
> > > > more
> > > > approachable one (request-based DM multipath in this case).
> > > >
> > >
> > > I might be stating the obvious, but so far we only have
> > > considered
> > > request-based multipath as being active for the _entire_ device.
> > > To my knowledge we've never tested that when running on a
> > > partition.
> >
> > True. We only ever support mapping the partitions on top of
> > request-based multipath (via dm-linear volumes created by kpartx).
> >
> > > So, have you tested that request-based multipathing works on a
> > > partition _at all_? I'm not sure if partition mapping is done
> > > correctly here; we never remap the start of the request (nor bio,
> > > come to speak of it), so it looks as if we would be doing the
> > > wrong
> > > things here.
> > >
> > > Have you checked that partition remapping is done correctly?
> >
> > It clearly doesn't work. Not quite following why but...
> >
> > After running the test the partition table at the start of the
> > whole
> > NVMe device is overwritten by XFS. So likely the IO destined to
> > the
> > dm-cache's "slow" (dm-mpath device on NVMe partition) was issued to
> > the
> > whole NVMe device:
> >
> > # pvcreate /dev/nvme1n1
> > WARNING: xfs signature detected on /dev/nvme1n1 at offset 0. Wipe
> > it? [y/n]
> >
> > # vgcreate test /dev/nvme1n1
> > # lvcreate -n slow -L 512G test
> > WARNING: xfs signature detected on /dev/test/slow at offset 0. Wipe
> > it?
> > [y/n]: y
> > Wiping xfs signature on /dev/test/slow.
> > Logical volume "slow" created.
> >
> > Isn't this a failing of block core's partitioning? Why should a
> > target
> > that is given the entire partition of a device need to be concerned
> > with
> > remapping IO? Shouldn't block core handle that mapping?
> >
>
> Only if the device is marked as 'partitionable', which device-mapper
> devices are not.
> But I thought you knew that ...
>
> > Anyway, yesterday I went so far as to hack together request-based
> > support for DM linear (because request-based DM cannot stack on
> > bio-based DM). With this, request-based linear devices instead of
> > conventional partitioning, I no longer see the XFS corruption when
> > running the test:
> >
>
> _Actually_, I would've done it the other way around; after all,
> where's
> the point in running dm-multipath on a partition?
> Anything running on the other partitions would suffer from the
> issues
> dm-multipath is designed to handle (temporary path loss etc), so I'm
> not
> quite sure what you are trying to achieve with your testcase.
> Can you enlighten me?
>
> Cheers,
>
> Hannes
This came about because a customer is using an NVMe device for dm-cache
and created multiple partitions on it so that the same NVMe device can
cache several different "slower" devices. The corruption was noticed in
XFS, and I engaged Mike to help figure out what was going on.
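
For anyone following along, here is a minimal sketch of the remapping Hannes is asking about. It is a toy model, not kernel code: the Partition type, both function names, and the 2048-sector start are assumptions made up for illustration. It only shows why a write to sector 0 of a partition lands on the partition table if the partition's start offset is never added, which would match the xfs signature Mike found at offset 0 of /dev/nvme1n1.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # bytes; the conventional Linux block-layer sector size


@dataclass(frozen=True)
class Partition:
    """Hypothetical partition, described by start and length in sectors."""
    start_sector: int
    nr_sectors: int


def remap_to_whole_device(part: Partition, sector: int) -> int:
    """Translate a partition-relative sector to a whole-device sector."""
    if not 0 <= sector < part.nr_sectors:
        raise ValueError("sector outside partition")
    return part.start_sector + sector


def missing_remap(part: Partition, sector: int) -> int:
    """Model the suspected failure: the partition offset is never added."""
    return sector


# Assumed layout: partition starts at sector 2048 and spans 2^20 sectors.
part = Partition(start_sector=2048, nr_sectors=1 << 20)
superblock = 0  # XFS keeps its primary superblock at the start of the volume

correct = remap_to_whole_device(part, superblock)
wrong = missing_remap(part, superblock)

# Byte offsets on the whole device for the two cases.
print(correct * SECTOR_SIZE, wrong * SECTOR_SIZE)  # prints "1048576 0"
```

With the offset applied, the superblock lands 1 MiB into the device; without it, the superblock lands at byte 0 of the whole device.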