From mboxrd@z Thu Jan 1 00:00:00 1970
From: Laurence Oberman <loberman@redhat.com>
Subject: Re: data corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
Date: Tue, 24 Jul 2018 11:31:48 -0400
Message-ID: <1532446308.9819.7.camel@redhat.com>
References: <20180723163357.GA29658@redhat.com> <20180724130703.GA30804@redhat.com> <27cadfb3-2442-3931-6d58-58aa6adb2e2b@suse.de> <1532440623.9819.4.camel@redhat.com> <20180724151845.GB3235@redhat.com>
In-Reply-To: <20180724151845.GB3235@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Sender: dm-devel-bounces@redhat.com
Errors-To: dm-devel-bounces@redhat.com
To: Mike Snitzer, Hannes Reinecke
Cc: linux-block@vger.kernel.org, Brett Hull, dm-devel@redhat.com, linux-nvme@lists.infradead.org
List-Id: dm-devel.ids

On Tue, 2018-07-24 at 11:18 -0400, Mike Snitzer wrote:
> On Tue, Jul 24 2018 at  9:57am -0400,
> Laurence Oberman <loberman@redhat.com> wrote:
>
> > On Tue, 2018-07-24 at 15:51 +0200, Hannes Reinecke wrote:
> > >
> > > _Actually_, I would've done it the other way around; after all,
> > > where's the point in running dm-multipath on a partition?
> > > Anything running on the other partitions would suffer from the
> > > issues dm-multipath is designed to handle (temporary path loss
> > > etc), so I'm not quite sure what you are trying to achieve with
> > > your testcase.
> > > Can you enlighten me?
> > >
> > > Cheers,
> > >
> > > Hannes
>
> I wasn't looking to deploy this (multipath on partition) in production
> or suggest it to others.
> It was a means to experiment.  More below.
>
> > This came about because a customer is using nvme for a dm-cache
> > device and created multiple partitions so as to use the same nvme
> > to cache multiple different "slower" devices. The corruption was
> > noticed in XFS and I engaged Mike to assist in figuring out what
> > was going on.
>
> Yes, so topology for the customer's setup is:
>
> 1) MD raid1 on 2 NVMe partitions (from separate NVMe devices).
> 2) Then DM cache's "fast" and "metadata" devices layered on dm-linear
>    mapping on top of the MD raid1.
> 3) Then Ceph's rbd for DM-cache's slow device.
>
> I was just looking to simplify the stack to try to assess why XFS
> corruption was being seen without all the insanity.
>
> One issue was corruption due to incorrect shutdown order (network was
> getting shut down out from underneath rbd, and in turn DM-cache
> couldn't complete its IO migrations during cache_postsuspend()).
>
> So I elected to try using DM multipath with queue_if_no_path to try
> to replicate rbd losing network _without_ needing a full Ceph/rbd
> setup.
>
> The rest is history... a rat-hole of corruption that likely is very
> different than the customer's setup.
>
> Mike

Not to muddy the waters here, and as Mike said, the issue he tripped
over may not be the direct issue we originally started with.

In the lab reproducer with rbd as the slow device we do not have an MD
raided nvme for the dm-cache, but we still see the corruption only on
the rbd based test.

We used the partitioned nvme, but no DM raid, to try F/C
device-mapper-multipath LUNs cached via dm-cache.

The last test we ran, in which we did not see corruption, used a
partitioned nvme where the second partition was used to cache F/C LUNs:

nvme0n1                             259:0    0 372.6G  0 disk
├─nvme0n1p1                         259:1    0   150G  0 part
└─nvme0n1p2                         259:2    0   150G  0 part
  ├─cache_FC-nvme_blk_cache_cdata   253:42   0    20G  0 lvm
  │ └─cache_FC-fc_disk              253:45   0    48G  0 lvm   /cache_FC
  └─cache_FC-nvme_blk_cache_cmeta   253:43   0    40M  0 lvm
    └─cache_FC-fc_disk              253:45   0    48G  0 lvm   /cache_FC

cache_FC-fc_disk (253:45)
 ├─cache_FC-fc_disk_corig (253:44)
 │  └─3600140508da66c2c9ee4cc6aface1bab (253:36) Multipath
 │     ├─ (68:224)
 │     ├─ (69:240)
 │     ├─ (8:192)
 │     └─ (8:64)
 ├─cache_FC-nvme_blk_cache_cdata (253:42)
 │  └─ (259:2)
 └─cache_FC-nvme_blk_cache_cmeta (253:43)
    └─ (259:2)