From mboxrd@z Thu Jan  1 00:00:00 1970
From: Laurence Oberman <loberman@redhat.com>
Subject: Re: data corruption with 'splt' workload to XFS on DM
 cache with its 3 underlying devices being on same NVMe device
Date: Tue, 24 Jul 2018 11:31:48 -0400
Message-ID: <1532446308.9819.7.camel@redhat.com>
References: <20180723163357.GA29658@redhat.com>
	<e761830f-e3f7-3e88-1697-b4b150e84e5f@suse.de>
	<20180724130703.GA30804@redhat.com>
	<27cadfb3-2442-3931-6d58-58aa6adb2e2b@suse.de>
	<1532440623.9819.4.camel@redhat.com> <20180724151845.GB3235@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Return-path: <dm-devel-bounces@redhat.com>
In-Reply-To: <20180724151845.GB3235@redhat.com>
List-Unsubscribe: <https://www.redhat.com/mailman/options/dm-devel>,
	<mailto:dm-devel-request@redhat.com?subject=unsubscribe>
List-Archive: <https://www.redhat.com/archives/dm-devel>
List-Post: <mailto:dm-devel@redhat.com>
List-Help: <mailto:dm-devel-request@redhat.com?subject=help>
List-Subscribe: <https://www.redhat.com/mailman/listinfo/dm-devel>,
	<mailto:dm-devel-request@redhat.com?subject=subscribe>
Sender: dm-devel-bounces@redhat.com
Errors-To: dm-devel-bounces@redhat.com
To: Mike Snitzer <snitzer@redhat.com>, Hannes Reinecke <hare@suse.de>
Cc: linux-block@vger.kernel.org, Brett Hull <bhull@redhat.com>, dm-devel@redhat.com, linux-nvme@lists.infradead.org
List-Id: dm-devel.ids

T24gVHVlLCAyMDE4LTA3LTI0IGF0IDExOjE4IC0wNDAwLCBNaWtlIFNuaXR6ZXIgd3JvdGU6Cj4g
T24gVHVlLCBKdWwgMjQgMjAxOCBhdMKgwqA5OjU3YW0gLTA0MDAsCj4gTGF1cmVuY2UgT2Jlcm1h
biA8bG9iZXJtYW5AcmVkaGF0LmNvbT4gd3JvdGU6Cj4gCj4gPiBPbiBUdWUsIDIwMTgtMDctMjQg
YXQgMTU6NTEgKzAyMDAsIEhhbm5lcyBSZWluZWNrZSB3cm90ZToKPiA+ID4gCj4gPiA+IF9BY3R1
YWxseV8sIEkgd291bGQndmUgZG9uZSBpdCB0aGUgb3RoZXIgd2F5IGFyb3VuZDsgYWZ0ZXIgYWxs
LAo+ID4gPiB3aGVyZSd0IHRoZSBwb2ludCBpbiBydW5uaW5nIGRtLW11bHRpcGF0aCBvbiBhIHBh
cnRpdGlvbj8KPiA+ID4gQW55dGhpbmcgcnVubmluZyBvbiB0aGUgb3RoZXIgcGFydGl0aW9ucyB3
b3VsZCBzdWZmZXIgZnJvbSB0aGUKPiA+ID4gaXNzdWVzIGRtLW11bHRpcGF0aCBpcyBkZXNpZ25l
ZCB0byBoYW5kbGUgKHRlbXBvcmFyeSBwYXRoIGxvc3MKPiA+ID4gZXRjKSwgc28gSSdtCj4gPiA+
IG5vdCBxdWl0ZSBzdXJlIHdoYXQgeW91IGFyZSB0cnlpbmcgdG8gYWNoaWV2ZSB3aXRoIHlvdXIg
dGVzdGNhc2UuCj4gPiA+IENhbiB5b3UgZW5saWdodGVuIG1lPwo+ID4gPiAKPiA+ID4gQ2hlZXJz
LAo+ID4gPiAKPiA+ID4gSGFubmVzCj4gCj4gSSB3YXNuJ3QgbG9va2luZyB0byBkZXBseSB0aGlz
IChtdWx0aXBhdGggb24gcGFydGl0aW9uKSBpbiBwcm9kdWN0aW9uCj4gb3IKPiBzdWdnZXN0IGl0
IHRvIG90aGVycy7CoMKgSXQgd2FzIGEgbWVhbnMgdG8gZXhwZXJpbWVudC7CoMKgTW9yZSBiZWxv
dy4KPiAKPiA+IFRoaXMgY2FtZSBhYm91dCBiZWNhdXNlIGEgY3VzdG9tZXIgaXMgdXNpbmcgbnZt
ZSBmb3IgYSBkbS1jYWNoZQo+ID4gZGV2aWNlCj4gPiBhbmQgY3JlYXRlZCBtdWx0aXBsZSBwYXJ0
aXRpb25zIHNvIGFzIHRvIHVzZSB0aGUgc2FtZSBudm1lIHRvIGNhY2hlCj4gPiBtdWx0aXBsZSBk
aWZmZXJlbnQgInNsb3dlciIgZGV2aWNlcy4gVGhlIGNvcnJ1cHRpb24gd2FzIG5vdGljZWQgaW4K
PiA+IFhGUwo+ID4gYW5kIEkgZW5nYWdlZCBNaWtlIHRvIGFzc2lzdCBpbiBmaWd1cmluZyBvdXQg
d2hhdCB3YXMgZ29pbmcgb24uCj4gCj4gWWVzLCBzbyB0b3BvbG9neSBmb3IgdGhlIGN1c3RvbWVy
J3Mgc2V0dXAgaXM6Cj4gCj4gMSkgTUQgcmFpZDEgb24gMiBOVk1lIHBhcnRpdGlvbnMgKGZyb20g
c2VwYXJhdGUgTlZNZSBkZXZpY2VzKS4KPiAyKSBUaGVuIERNIGNhY2hlJ3MgImZhc3QiIGFuZCAi
bWV0YWRhdGEiIGRldmljZXMgbGF5ZXJlZCBvbiBkbS1saW5lYXIKPiDCoMKgwqBtYXBwaW5nIG9u
dG9wIG9mIHRoZSBNRCByYWlkMS4KPiAzKSBUaGVuIENlcGgncyByYmQgZm9yIERNLWNhY2hlJ3Mg
c2xvdyBkZXZpY2UuCj4gCj4gSSB3YXMganVzdCBsb29raW5nIHRvIHNpbXBsaWZ5IHRoZSBzdGFj
ayB0byB0cnkgdG8gYXNzZXNzIHdoeSBYRlMKPiBjb3JydXB0aW9uIHdhcyBiZWluZyBzZWVuIHdp
dGhvdXQgYWxsIHRoZSBpbnNhbml0eS4KPiAKPiBPbmUgaXNzdWUgd2FzIGNvcnJ1cHRpb24gZHVl
IHRvIGluY29ycmVjdCBzaHV0ZG93biBvcmRlciAobmV0d29yayB3YXMKPiBnZXR0aW5nIHNodXRk
b3duIG91dCBmcm9tIHVuZGVybmVhdGggcmJkLCBhbmQgaW4gdHVybiBETS1jYWNoZQo+IGNvdWxk
bid0Cj4gY29tcGxldGUgaXRzIElPIG1pZ3JhdGlvbnMgZHVyaW5nIGNhY2hlX3Bvc3RzdXNwZW5k
KCkpLgo+IAo+IFNvIEkgZWxlY3RlZCB0byB0cnkgdXNpbmcgRE0gbXVsdGlwYXRoIHdpdGggcXVl
dWVfaWZfbm9fcGF0aCB0byB0cnkKPiB0bwo+IHJlcGxpY2F0ZSByYmQgbG9zaW5nIG5ldHdvcmsg
X3dpdGhvdXRfIG5lZWRpbmcgYSBmdWxsIENlcGgvcmJkIHNldHVwLgo+IAo+IFRoZSByZXN0IGlz
IGhpc3RvcnkuLi4gYSByYXQtaG9sZSBvZiBjb3JydXB0aW9uIHRoYXQgbGlrZWx5IGlzIHZlcnkK
PiBkaWZmZXJlbnQgdGhhbiB0aGUgY3VzdG9tZXIncyBzZXR1cC4KPiAKPiBNaWtlCk5vdCB0byBt
dWRkeSB0aGUgd2F0ZXJzIGhlcmUsIGFuZCBhcyBNaWtlIHNhaWQgdGhlIGlzc3VlIGhlIHRyaXBw
ZWQKb3ZlciBtYXkgbm90IGJlIHRoZSBkaXJlY3QgaXNzdWUgd2Ugb3JpZ2luYWxseSBzdGFydGVk
IHdpdGguCgpJbiB0aGUgbGFiIHJlcHJvZHVjZXIgd2l0aCByYmQgYXMgYSBzbG93IGRldmljZXMg
d2UgZG8gbm90IGhhdmUgYW4gTUQKcmFpZGVkIG52bWUgZm9yIHRoZSBkbS1jYWNoZSwgYnV0IHdl
IHN0aWxsIHNlZSB0aGUgY29ycnVwdGlvbiBvbmx5IG9uCnRoZSByYmQgYmFzZWQgdGVzdC4KCldl
IHVzZWQgdGhlIG52bWUgcGFydGl0aW9uZWQgYnV0IG5vIERNIHJhaWQgdG8gdHJ5IGFuIEYvQyBk
ZXZpY2UtCm1hcHBlci1tdWx0aXBhdGggTFVOUyBjYWNoZWQgdmlhIGRtLWNhY2hlLgoKVGhlIGxh
c3QgdGVzdCB3ZSByYW4gd2hlcmUgd2UgZGlkIG5vdCBzZWUgY29ycnVwdGlvbiB3YXMgYSBwYXJ0
aXRpb24Kd2hlcmUgdGhlIHNlY29uZCBwYXJ0aXRpb24gd2FzIHVzZWQgdG8gY2FjaGUgRi9DIGx1
bnMKCm52bWUwbjHCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKg
wqDCoMKgwqDCoMKgMjU5OjDCoMKgwqDCoDAgMzcyLjZHwqDCoDAgZGlza8KgwqAK4pSc4pSAbnZt
ZTBuMXAxwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqAy
NTk6McKgwqDCoMKgMMKgwqDCoDE1MEfCoMKgMCBwYXJ0wqDCoArilJTilIBudm1lMG4xcDLCoMKg
wqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoDI1OToywqDCoMKg
wqAwwqDCoMKgMTUwR8KgwqAwIHBhcnTCoMKgCsKgIOKUnOKUgGNhY2hlX0ZDLW52bWVfYmxrX2Nh
Y2hlX2NkYXRhwqDCoMKgMjUzOjQywqDCoMKgMMKgwqDCoMKgMjBHwqDCoDAgbHZtwqDCoMKgCsKg
IOKUgiDilJTilIBjYWNoZV9GQy1mY19kaXNrwqDCoMKgwqDCoMKgwqDCoMKgwqDCoMKgwqDCoDI1
Mzo0NcKgwqDCoDDCoMKgwqDCoDQ4R8KgwqAwCmx2bcKgwqDCoC9jYWNoZV9GQwrCoCDilJTilIBj
YWNoZV9GQy1udm1lX2Jsa19jYWNoZV9jbWV0YcKgwqDCoDI1Mzo0M8KgwqDCoDDCoMKgwqDCoDQw
TcKgwqAwIGx2bcKgwqDCoArCoMKgwqDCoOKUlOKUgGNhY2hlX0ZDLWZjX2Rpc2vCoMKgwqDCoMKg
wqDCoMKgwqDCoMKgwqDCoMKgMjUzOjQ1wqDCoMKgMMKgwqDCoMKgNDhHwqDCoDAKbHZtwqDCoMKg
L2NhY2hlX0ZDCgpjYWNoZV9GQy1mY19kaXNrICgyNTM6NDUpCsKg4pSc4pSAY2FjaGVfRkMtZmNf
ZGlza19jb3JpZyAoMjUzOjQ0KQrCoOKUgsKgwqDilJTilIAzNjAwMTQwNTA4ZGE2NmMyYzllZTRj
YzZhZmFjZTFiYWIgKDI1MzozNikgTXVsdGlwYXRoCsKg4pSCwqDCoMKgwqDCoOKUnOKUgCAoNjg6
MjI0KQrCoOKUgsKgwqDCoMKgwqDilJzilIAgKDY5OjI0MCkKwqDilILCoMKgwqDCoMKg4pSc4pSA
ICg4OjE5MikKwqDilILCoMKgwqDCoMKg4pSU4pSAICg4OjY0KQrCoOKUnOKUgGNhY2hlX0ZDLW52
bWVfYmxrX2NhY2hlX2NkYXRhICgyNTM6NDIpCsKg4pSCwqDCoOKUlOKUgCAoMjU5OjIpCsKg4pSU
4pSAY2FjaGVfRkMtbnZtZV9ibGtfY2FjaGVfY21ldGEgKDI1Mzo0MykKwqDCoMKgwqDilJTilIAg
KDI1OToyKQoKLS0KZG0tZGV2ZWwgbWFpbGluZyBsaXN0CmRtLWRldmVsQHJlZGhhdC5jb20KaHR0
cHM6Ly93d3cucmVkaGF0LmNvbS9tYWlsbWFuL2xpc3RpbmZvL2RtLWRldmVs

From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-block-owner@vger.kernel.org>
Received: from mail-qk0-f181.google.com ([209.85.220.181]:36408 "EHLO
        mail-qk0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S2388248AbeGXQiw (ORCPT
        <rfc822;linux-block@vger.kernel.org>);
        Tue, 24 Jul 2018 12:38:52 -0400
Received: by mail-qk0-f181.google.com with SMTP id a132-v6so2873476qkg.3
        for <linux-block@vger.kernel.org>; Tue, 24 Jul 2018 08:31:51 -0700 (PDT)
Message-ID: <1532446308.9819.7.camel@redhat.com>
Subject: Re: data corruption with 'splt' workload to XFS on DM cache with
 its 3 underlying devices being on same NVMe device
From: Laurence Oberman <loberman@redhat.com>
To: Mike Snitzer <snitzer@redhat.com>, Hannes Reinecke <hare@suse.de>
Cc: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
        dm-devel@redhat.com, Brett Hull <bhull@redhat.com>
Date: Tue, 24 Jul 2018 11:31:48 -0400
In-Reply-To: <20180724151845.GB3235@redhat.com>
References: <20180723163357.GA29658@redhat.com>
         <e761830f-e3f7-3e88-1697-b4b150e84e5f@suse.de>
         <20180724130703.GA30804@redhat.com>
         <27cadfb3-2442-3931-6d58-58aa6adb2e2b@suse.de>
         <1532440623.9819.4.camel@redhat.com> <20180724151845.GB3235@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Sender: linux-block-owner@vger.kernel.org
List-Id: linux-block@vger.kernel.org

On Tue, 2018-07-24 at 11:18 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at  9:57am -0400,
> Laurence Oberman <loberman@redhat.com> wrote:
> 
> > On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> > > 
> > > _Actually_, I would've done it the other way around; after all,
> > > where's the point in running dm-multipath on a partition?
> > > Anything running on the other partitions would suffer from the
> > > issues dm-multipath is designed to handle (temporary path loss
> > > etc), so I'm not quite sure what you are trying to achieve with
> > > your testcase.
> > > Can you enlighten me?
> > > 
> > > Cheers,
> > > 
> > > Hannes
> 
> I wasn't looking to deploy this (multipath on partition) in production
> or suggest it to others.  It was a means to experiment.  More below.
> 
> > This came about because a customer is using nvme for a dm-cache
> > device and created multiple partitions so as to use the same nvme to
> > cache multiple different "slower" devices. The corruption was noticed
> > in XFS and I engaged Mike to assist in figuring out what was going on.
> 
> Yes, so topology for the customer's setup is:
> 
> 1) MD raid1 on 2 NVMe partitions (from separate NVMe devices).
> 2) Then DM cache's "fast" and "metadata" devices layered on dm-linear
>    mapping on top of the MD raid1.
> 3) Then Ceph's rbd for DM-cache's slow device.
> 
> I was just looking to simplify the stack to try to assess why XFS
> corruption was being seen without all the insanity.
> 
> One issue was corruption due to incorrect shutdown order (network was
> getting shut down out from underneath rbd, and in turn DM-cache couldn't
> complete its IO migrations during cache_postsuspend()).
> 
> So I elected to try using DM multipath with queue_if_no_path to try to
> replicate rbd losing network _without_ needing a full Ceph/rbd setup.
> 
> The rest is history... a rat-hole of corruption that likely is very
> different than the customer's setup.
> 
> Mike

Not to muddy the waters here, but as Mike said, the issue he tripped
over may not be the issue we originally started with.

In the lab reproducer with rbd as the slow device we do not have an
MD-raided nvme for the dm-cache, but we still see the corruption, and
only in the rbd-based test.

We used the partitioned nvme, but no DM raid, to try F/C
device-mapper-multipath LUNs cached via dm-cache.

In the last test we ran where we did not see corruption, the second
partition of the nvme was used to cache F/C LUNs:

nvme0n1                             259:0    0 372.6G  0 disk
├─nvme0n1p1                         259:1    0   150G  0 part
└─nvme0n1p2                         259:2    0   150G  0 part
  ├─cache_FC-nvme_blk_cache_cdata   253:42   0    20G  0 lvm
  │ └─cache_FC-fc_disk              253:45   0    48G  0 lvm   /cache_FC
  └─cache_FC-nvme_blk_cache_cmeta   253:43   0    40M  0 lvm
    └─cache_FC-fc_disk              253:45   0    48G  0 lvm   /cache_FC

cache_FC-fc_disk (253:45)
 ├─cache_FC-fc_disk_corig (253:44)
 │  └─3600140508da66c2c9ee4cc6aface1bab (253:36) Multipath
 │     ├─ (68:224)
 │     ├─ (69:240)
 │     ├─ (8:192)
 │     └─ (8:64)
 ├─cache_FC-nvme_blk_cache_cdata (253:42)
 │  └─ (259:2)
 └─cache_FC-nvme_blk_cache_cmeta (253:43)
    └─ (259:2)
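
For anyone who wants to check a similar stack for shared backing devices,
here is a minimal Python sketch that reads a dependency tree in the format
shown above and reports which leaf devices sit under more than one
component of the cache device. It is only an illustration and was not part
of the testing described in this thread; the function names (parse,
leaves_by_component, shared_devices) are made up for this example, and the
parser assumes the three-column indentation used in the tree above.

```python
import re
from collections import defaultdict

# Dependency tree copied from the output above.
TREE = """\
cache_FC-fc_disk (253:45)
 ├─cache_FC-fc_disk_corig (253:44)
 │  └─3600140508da66c2c9ee4cc6aface1bab (253:36) Multipath
 │     ├─ (68:224)
 │     ├─ (69:240)
 │     ├─ (8:192)
 │     └─ (8:64)
 ├─cache_FC-nvme_blk_cache_cdata (253:42)
 │  └─ (259:2)
 └─cache_FC-nvme_blk_cache_cmeta (253:43)
    └─ (259:2)
"""

# indent (spaces and vertical bars), optional branch marker, optional name,
# then the (major:minor) pair.
NODE = re.compile(
    r"^(?P<indent>[\s│]*)(?P<branch>[├└]─)?\s*(?P<name>\S*?)\s*\((?P<dev>\d+:\d+)\)"
)


def parse(tree):
    """Return a list of (depth, name, major:minor) tuples in file order."""
    nodes = []
    for line in tree.splitlines():
        m = NODE.match(line)
        if not m:
            continue
        # Assumption: each nesting level adds three columns of indentation.
        depth = 0 if m["branch"] is None else len(m["indent"]) // 3 + 1
        nodes.append((depth, m["name"], m["dev"]))
    return nodes


def leaves_by_component(nodes):
    """Map each direct child of the root to the leaf devices beneath it."""
    result = defaultdict(set)
    component = None
    for i, (depth, name, dev) in enumerate(nodes):
        if depth == 1:
            component = name
        # A node is a leaf if the next node is not nested deeper.
        is_leaf = i + 1 == len(nodes) or nodes[i + 1][0] <= depth
        if is_leaf and component is not None:
            result[component].add(dev)
    return dict(result)


def shared_devices(components):
    """Return leaf devices that back more than one component."""
    users = defaultdict(list)
    for comp, devs in components.items():
        for dev in devs:
            users[dev].append(comp)
    return {dev: sorted(c) for dev, c in users.items() if len(c) > 1}


if __name__ == "__main__":
    for dev, comps in shared_devices(leaves_by_component(parse(TREE))).items():
        print(dev, "is shared by", ", ".join(comps))
```

Run against the tree above, this reports that 259:2 (nvme0n1p2) backs both
the cdata and cmeta volumes, while the origin sits on the four multipath
paths.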

From mboxrd@z Thu Jan  1 00:00:00 1970
From: loberman@redhat.com (Laurence Oberman)
Date: Tue, 24 Jul 2018 11:31:48 -0400
Subject: data corruption with 'splt' workload to XFS on DM cache with
 its 3 underlying devices being on same NVMe device
In-Reply-To: <20180724151845.GB3235@redhat.com>
References: <20180723163357.GA29658@redhat.com>
 <e761830f-e3f7-3e88-1697-b4b150e84e5f@suse.de>
 <20180724130703.GA30804@redhat.com>
 <27cadfb3-2442-3931-6d58-58aa6adb2e2b@suse.de>
 <1532440623.9819.4.camel@redhat.com> <20180724151845.GB3235@redhat.com>
Message-ID: <1532446308.9819.7.camel@redhat.com>

On Tue, 2018-07-24 at 11:18 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at  9:57am -0400,
> Laurence Oberman <loberman@redhat.com> wrote:
> 
> > On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> > > 
> > > _Actually_, I would've done it the other way around; after all,
> > > where's the point in running dm-multipath on a partition?
> > > Anything running on the other partitions would suffer from the
> > > issues dm-multipath is designed to handle (temporary path loss
> > > etc), so I'm not quite sure what you are trying to achieve with
> > > your testcase.
> > > Can you enlighten me?
> > > 
> > > Cheers,
> > > 
> > > Hannes
> 
> I wasn't looking to deploy this (multipath on partition) in production
> or suggest it to others.  It was a means to experiment.  More below.
> 
> > This came about because a customer is using nvme for a dm-cache
> > device and created multiple partitions so as to use the same nvme to
> > cache multiple different "slower" devices. The corruption was noticed
> > in XFS and I engaged Mike to assist in figuring out what was going on.
> 
> Yes, so topology for the customer's setup is:
> 
> 1) MD raid1 on 2 NVMe partitions (from separate NVMe devices).
> 2) Then DM cache's "fast" and "metadata" devices layered on dm-linear
>    mapping on top of the MD raid1.
> 3) Then Ceph's rbd for DM-cache's slow device.
> 
> I was just looking to simplify the stack to try to assess why XFS
> corruption was being seen without all the insanity.
> 
> One issue was corruption due to incorrect shutdown order (network was
> getting shut down out from underneath rbd, and in turn DM-cache couldn't
> complete its IO migrations during cache_postsuspend()).
> 
> So I elected to try using DM multipath with queue_if_no_path to try to
> replicate rbd losing network _without_ needing a full Ceph/rbd setup.
> 
> The rest is history... a rat-hole of corruption that likely is very
> different than the customer's setup.
> 
> Mike

Not to muddy the waters here, but as Mike said, the issue he tripped
over may not be the issue we originally started with.

In the lab reproducer with rbd as the slow device we do not have an
MD-raided nvme for the dm-cache, but we still see the corruption, and
only in the rbd-based test.

We used the partitioned nvme, but no DM raid, to try F/C
device-mapper-multipath LUNs cached via dm-cache.

In the last test we ran where we did not see corruption, the second
partition of the nvme was used to cache F/C LUNs:

nvme0n1                             259:0    0 372.6G  0 disk
├─nvme0n1p1                         259:1    0   150G  0 part
└─nvme0n1p2                         259:2    0   150G  0 part
  ├─cache_FC-nvme_blk_cache_cdata   253:42   0    20G  0 lvm
  │ └─cache_FC-fc_disk              253:45   0    48G  0 lvm   /cache_FC
  └─cache_FC-nvme_blk_cache_cmeta   253:43   0    40M  0 lvm
    └─cache_FC-fc_disk              253:45   0    48G  0 lvm   /cache_FC

cache_FC-fc_disk (253:45)
 ├─cache_FC-fc_disk_corig (253:44)
 │  └─3600140508da66c2c9ee4cc6aface1bab (253:36) Multipath
 │     ├─ (68:224)
 │     ├─ (69:240)
 │     ├─ (8:192)
 │     └─ (8:64)
 ├─cache_FC-nvme_blk_cache_cdata (253:42)
 │  └─ (259:2)
 └─cache_FC-nvme_blk_cache_cmeta (253:43)
    └─ (259:2)