From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mx3-rdu2.redhat.com ([66.187.233.73]:54202 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S2388147AbeGWRgA (ORCPT ); Mon, 23 Jul 2018 13:36:00 -0400
Date: Mon, 23 Jul 2018 12:33:57 -0400
From: Mike Snitzer
To: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, dm-devel@redhat.com
Subject: data
corruption with 'splt' workload to XFS on DM cache with its 3 underlying devices being on same NVMe device
Message-ID: <20180723163357.GA29658@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Sender: linux-block-owner@vger.kernel.org
List-Id: linux-block@vger.kernel.org

Hi,

I've opened the following public BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1607527

Feel free to add comments to that BZ if you have a redhat bugzilla
account.

But otherwise, happy to get as much feedback and discussion going purely
on the relevant lists.  I've taken ~1.5 weeks to categorize and isolate
this issue.  But I've reached a point where I'm getting diminishing
returns and could _really_ use the collective eyeballs and expertise of
the community.  This is by far one of the nastiest cases of corruption
I've seen in a while.  Not sure where the ultimate cause of corruption
lies (that's the money question) but it _feels_ rooted in NVMe and is
unique to this particular workload I've stumbled onto via customer
escalation and then trying to replicate an rbd device using a more
approachable one (request-based DM multipath in this case).

From the BZ's comment#0:

The following occurs with latest v4.18-rc3 and v4.18-rc6 and also occurs
with v4.15.  When corruption occurs from this test it also destroys the
DOS partition table (created during step 0 below)... yeah, corruption is
_that_ bad.  Almost like the corruption is temporal (recently accessed
regions of the NVMe device)?

Anyway: I stumbled onto rampant corruption when using request-based DM
multipath on top of an NVMe device (not exclusive to a particular drive
either, it happens with NVMe devices from multiple vendors).  But the
corruption only occurs if the request-based multipath IO is issued to an
NVMe device in parallel to other IO issued to the _same_ underlying NVMe
by the DM cache target.  See topology detailed below (at the very end of
this comment)...
basically all 3 devices that are used to create a DM cache device need
to be backed by the same NVMe device (via partitions or linear volumes).

Again, using request-based DM multipath for dm-cache's "slow" device is
_required_ to reproduce.  Not 100% clear why really... other than
request-based DM multipath builds large IOs (due to merging).

--- Additional comment from Mike Snitzer on 2018-07-20 10:14:09 EDT ---

To reproduce this issue using device-mapper-test-suite:

0) Partition an NVMe device.  First primary partition with at least
5GB, second primary partition with at least 48GB.
NOTE: larger partitions (e.g. 1: 50GB 2: >= 220GB) can be used to
reproduce XFS corruption much quicker.

1) create a request-based multipath device on top of an NVMe device,
e.g.:

#!/bin/sh

modprobe dm-service-time

DEVICE=/dev/nvme1n1p2
SIZE=$(blockdev --getsz "$DEVICE")

# The table must be passed to dmsetup as a single line.
echo "0 $SIZE multipath 2 queue_mode mq 0 1 1 service-time 0 1 2 $DEVICE 1000 1" | dmsetup create nvme_mpath

# Just a note for how to fail/reinstate path:
# dmsetup message nvme_mpath 0 "fail_path $DEVICE"
# dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

2) check out device-mapper-test-suite from my github repo:

git clone git://github.com/snitm/device-mapper-test-suite.git
cd device-mapper-test-suite
git checkout -b devel origin/devel

3) follow device-mapper-test-suite's README.md to get it all set up

4) Configure /root/.dmtest/config with something like:

profile :nvme_shared do
   metadata_dev '/dev/nvme1n1p1'
   #data_dev '/dev/nvme1n1p2'
   data_dev '/dev/mapper/nvme_mpath'
end

default_profile :nvme_shared

------
NOTE: configured 'metadata_dev' gets carved up by
device-mapper-test-suite to provide both the dm-cache's metadata device
and the "fast" data device.  The configured 'data_dev' is used for
dm-cache's "slow" data device.

5) run the test:
# tail -f /var/log/messages &
# time dmtest run --suite cache -n /split_large_file/

6) If the multipath device failed the lone NVMe path, you'll need to
reinstate the path before the next iteration of your test, e.g. (from #1
above):
 dmsetup message nvme_mpath 0 "reinstate_path $DEVICE"

--- Additional comment from Mike Snitzer on 2018-07-20 12:02:45 EDT ---

(In reply to Mike Snitzer from comment #6)

> SO seems pretty clear something is still wrong with request-based DM
> multipath ontop of NVMe... sadly we don't have any negative check in
> blk-core, NVMe or elsewhere to offer any clue :(

Building on this comment:

"Anyway, fact that I'm getting this corruption on multiple different
NVMe drives: I am definitely concerned that this BZ is due to a bug
somewhere in NVMe core (or block core code that is specific to NVMe)."

I'm left thinking that request-based DM multipath is somehow causing
NVMe's SG lists or other infrastructure to be "wrong" and it is
resulting in corruption.  I get corruption to the dm-cache's metadata
device (which is theoretically unrelated, as it's a separate device
from the "slow" dm-cache data device) if the dm-cache slow data device
is backed by request-based dm-multipath on top of NVMe (which is a
partition from the _same_ NVMe device that is used by the dm-cache
metadata device).

Basically I'm back to thinking NVMe is corrupting the data due to the IO
pattern or nature of the cloned requests dm-multipath is issuing.  And
it is causing corruption to other NVMe partitions on the same parent
NVMe device.  Certainly that is a concerning hypothesis but I'm not
seeing much else that would explain this weird corruption.

If I don't use the same NVMe device (with multiple partitions) for _all_
3 sub-devices that dm-cache needs I don't see the corruption.
It is almost like the mix of IO issued by DM cache's metadata (on
nvme1n1p1 using dm-linear) and "fast" device (also on nvme1n1p1 via
dm-linear volume) in conjunction with IO issued by request-based DM
multipath to NVMe for "slow" device (on nvme1n1p2) is triggering NVMe to
respond negatively.  But this same observation can be made on completely
different hardware using 2 totally different NVMe devices:
testbed1: Intel Corporation Optane SSD 900P Series (2700)
testbed2: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)

Which is why it feels like some bug in Linux (be it dm-rq.c, blk-core.c,
blk-merge.c or the common NVMe driver).

topology before starting the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1        259:1    0 745.2G  0 disk
├─nvme1n1p2    259:5    0 695.2G  0 part
│ └─nvme_mpath 253:2    0 695.2G  0 dm
└─nvme1n1p1    259:4    0    50G  0 part

topology during the device-mapper-test-suite test:

# lsblk /dev/nvme1n1
NAME                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                 259:1    0 745.2G  0 disk
├─nvme1n1p2             259:5    0 695.2G  0 part
│ └─nvme_mpath          253:2    0 695.2G  0 dm
│   └─test-dev-458572   253:5    0    48G  0 dm
│     └─test-dev-613083 253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
└─nvme1n1p1             259:4    0    50G  0 part
  ├─test-dev-126378     253:4    0     4G  0 dm
  │ └─test-dev-613083   253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds
  └─test-dev-652491     253:3    0    40M  0 dm
    └─test-dev-613083   253:6    0    48G  0 dm   /root/snitm/git/device-mapper-test-suite/kernel_builds

pruning that tree a bit (removing the dm-cache device 253:6) for
clarity:

# lsblk /dev/nvme1n1
NAME                    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme1n1                 259:1    0 745.2G  0 disk
├─nvme1n1p2             259:5    0 695.2G  0 part
│ └─nvme_mpath          253:2    0 695.2G  0 dm
│   └─test-dev-458572   253:5    0    48G  0 dm
└─nvme1n1p1             259:4    0    50G  0 part
  ├─test-dev-126378     253:4    0     4G  0 dm
  └─test-dev-652491     253:3    0    40M  0 dm

40M device is dm-cache "metadata" device
4G device is dm-cache "fast" data device
48G device is dm-cache "slow" data device
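
For anyone scripting steps 0 and 1 above, here is a minimal sketch that
builds the same multipath table and checks the partition minimums before
creating the device.  The helper names mpath_table and at_least_gib are
made up for illustration (they are not part of dmsetup or
device-mapper-test-suite), and treating "GB" as GiB is an assumption; the
table string itself is the one from step 1.

```shell
#!/bin/sh
# Hypothetical helper: print the single-line dm table from step 1 for a
# one-path, blk-mq request-based multipath device.
#   $1 = device size in 512-byte sectors (what `blockdev --getsz` reports)
#   $2 = path to the underlying block device
mpath_table() {
    printf '0 %s multipath 2 queue_mode mq 0 1 1 service-time 0 1 2 %s 1000 1\n' "$1" "$2"
}

# Hypothetical helper: succeed only if a size of $1 sectors is at least
# $2 GiB (2097152 sectors of 512 bytes per GiB).  Step 0 asks for at
# least 5GB and 48GB; reading those as GiB is an assumption.
at_least_gib() {
    [ "$1" -ge $(( $2 * 2097152 )) ]
}

# Example usage (left commented out: needs root and a real NVMe partition):
# DEVICE=/dev/nvme1n1p2
# SIZE=$(blockdev --getsz "$DEVICE")
# at_least_gib "$SIZE" 48 && mpath_table "$SIZE" "$DEVICE" | dmsetup create nvme_mpath
```

Building the table with printf keeps it on one line, which avoids the
wrapped echo string turning into two table lines when pasted from mail.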