dm_cache BUG() after power loss

From: Steinar H. Gunderson @ 2018-10-09 16:00 UTC
  To: dm-devel

Hi,

We had a power loss event, and when a server with dm-cache came up again,
it panicked (see below for the panic text). I couldn't find any other way to
remedy this than to blow away the metadata volume, which I assumed was safe
as the cache is in writethrough mode (after several catastrophic events with
dm-cache earlier, I don't trust writeback anymore). Unfortunately, this was
seemingly not enough, as the underlying devices came back with various levels
of corruption and eventually had to be restored from backup. (It's running
without dm-cache now.)

Here's the panic:

[   13.388089] device-mapper: cache: You have created a cache device with a lot of individual cache blocks (1114672)
[   13.388089] All these mappings can consume a lot of kernel memory, and take some time to read/write.
[   13.388089] Please consider increasing the cache block size to reduce the overall cache block count.
[   13.452782] device-mapper: array: array_block_check failed: blocknr 1082331758718 != wanted 2156
[   13.462194] device-mapper: block manager: array validator check failed for block 2156
[   13.470643] device-mapper: array: get_ablock failed
[   13.475869] device-mapper: cache metadata: dm_array_cursor_next for mapping failed
[   13.484075] ------------[ cut here ]------------
[   13.489036] kernel BUG at drivers/md/dm-bufio.c:1180!
[   13.494443] invalid opcode: 0000 [#1] SMP PTI
[   13.499144] CPU: 34 PID: 5918 Comm: dmsetup Not tainted 4.18.11 #1
[   13.505671] Hardware name: Supermicro X10DRi/X10DRI-T, BIOS 1.1 04/14/2015
[   13.512885] RIP: 0010:dm_bufio_release+0x18/0x74 [dm_bufio]
[   13.518797] Code: 43 18 48 b8 00 02 00 00 00 00 ad de 48 89 43 20 5b c3 55 53 48 8b 6f 60 48 89 fb 48 89 ef e8 04 7e 20 e1 8b 43 3c 85 c0 75 02 <0f> 0b ff c8 85 c0 89 43 3c 75 47 31 c9 ba 01 00 00 00 be 03 00 00
[   13.538588] RSP: 0018:ffffc90000a03ba0 EFLAGS: 00010246
[   13.544157] RAX: 0000000000000000 RBX: ffff881025a6cea0 RCX: 00000000ffffffff
[   13.551632] RDX: ffff8810303a3980 RSI: ffff881025a6cea0 RDI: ffff881032b37800
[   13.559110] RBP: ffff881032b37800 R08: 0000000000000000 R09: ffff8800000b8c80
[   13.566591] R10: ffffc90000a03b00 R11: ffffffff82194947 R12: 0000000000000000
[   13.574071] R13: ffff881027631340 R14: 00000000000011e4 R15: ffffffffa0127443
[   13.581550] FS:  00007f3cd0894400(0000) GS:ffff88103f480000(0000) knlGS:0000000000000000
[   13.590256] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.596337] CR2: 0000563e475af7a8 CR3: 000000102f802006 CR4: 00000000001606e0
[   13.603812] Call Trace:
[   13.606614]  dm_array_cursor_end+0x1c/0x27 [dm_persistent_data]
[   13.612880]  dm_cache_load_mappings+0x2be/0x2fe [dm_cache]
[   13.618717]  ? retrieve_status+0x176/0x176 [dm_mod]
[   13.623935]  cache_preresume+0xc6/0x195 [dm_cache]
[   13.629067]  dm_table_resume_targets+0x38/0xaa [dm_mod]
[   13.634641]  dm_resume+0x7e/0xa7 [dm_mod]
[   13.639000]  dev_suspend+0x15b/0x1bc [dm_mod]
[   13.643706]  ctl_ioctl+0x2f8/0x394 [dm_mod]
[   13.648238]  dm_ctl_ioctl+0x5/0x8 [dm_mod]
[   13.652681]  vfs_ioctl+0x19/0x26
[   13.656247]  do_vfs_ioctl+0x4d0/0x547
[   13.660255]  ? handle_mm_fault+0x151/0x1b9
[   13.664695]  ksys_ioctl+0x4b/0x6b
[   13.668355]  __x64_sys_ioctl+0x11/0x14
[   13.672446]  do_syscall_64+0x4a/0xd3
[   13.676365]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   13.681758] RIP: 0033:0x7f3ccff7edd7
[   13.685670] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
[   13.705454] RSP: 002b:00007ffcdf6a6f18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   13.713640] RAX: ffffffffffffffda RBX: 000055f036494280 RCX: 00007f3ccff7edd7
[   13.721117] RDX: 000055f036494280 RSI: 00000000c138fd06 RDI: 0000000000000003
[   13.728595] RBP: 000000000000000f R08: 00007f3cd048a648 R09: 00007ffcdf6a6d80
[   13.736073] R10: 00007f3cd0489b53 R11: 0000000000000246 R12: 000055f0364942b0
[   13.743553] R13: 00007f3cd0489b53 R14: 000055f036493030 R15: 0000000000000001
[   13.751034] Modules linked in: raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid1 raid10 raid6_pq raid0 md_mod sd_mod usbhid dm_cache_smq dm_cache dm_bio_prison dm_persistent_data dm_bufio dm_mod libcrc32c crc32c_generic ixgbe i2c_i801 mdio ehci_pci crc32c_intel mpt3sas ahci ptp raid_class i2c_core ehci_hcd libahci pps_core unix
[   13.783877] ---[ end trace 1140618cbf25a884 ]---
[   13.792516] RIP: 0010:dm_bufio_release+0x18/0x74 [dm_bufio]
[   13.798431] Code: 43 18 48 b8 00 02 00 00 00 00 ad de 48 89 43 20 5b c3 55 53 48 8b 6f 60 48 89 fb 48 89 ef e8 04 7e 20 e1 8b 43 3c 85 c0 75 02 <0f> 0b ff c8 85 c0 89 43 3c 75 47 31 c9 ba 01 00 00 00 be 03 00 00
[   13.818221] RSP: 0018:ffffc90000a03ba0 EFLAGS: 00010246
[   13.823787] RAX: 0000000000000000 RBX: ffff881025a6cea0 RCX: 00000000ffffffff
[   13.831260] RDX: ffff8810303a3980 RSI: ffff881025a6cea0 RDI: ffff881032b37800
[   13.838742] RBP: ffff881032b37800 R08: 0000000000000000 R09: ffff8800000b8c80
[   13.846220] R10: ffffc90000a03b00 R11: ffffffff82194947 R12: 0000000000000000
[   13.853701] R13: ffff881027631340 R14: 00000000000011e4 R15: ffffffffa0127443
[   13.861182] FS:  00007f3cd0894400(0000) GS:ffff88103f480000(0000) knlGS:0000000000000000
[   13.869889] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.875988] CR2: 0000563e475af7a8 CR3: 000000102f802006 CR4: 00000000001606e0
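
In case it helps anyone reading the array check failure, here is a minimal sketch (Python, unrelated to any dm tooling; the helper name parse_mismatch is made up for illustration) that pulls the two block numbers out of the message above and prints them in hex. It assumes the message format is exactly what appears in this log.

```python
import re

# The two relevant lines, copied verbatim from the log above.
LOG = """\
[   13.452782] device-mapper: array: array_block_check failed: blocknr 1082331758718 != wanted 2156
[   13.462194] device-mapper: block manager: array validator check failed for block 2156
"""

# Pattern for the mismatch message as it appears in this log (an assumption
# about the format; other kernel versions may word it differently).
MISMATCH = re.compile(r"array_block_check failed: blocknr (\d+) != wanted (\d+)")


def parse_mismatch(log):
    """Return (found, wanted) from the first mismatch line, or None if absent."""
    m = MISMATCH.search(log)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


found, wanted = parse_mismatch(LOG)
print(f"found  = {found:#x}")
print(f"wanted = {wanted:#x}")
# Split the stored value into 32-bit halves to make its shape easier to see.
print(f"high32 = {found >> 32:#x}, low32 = {found & 0xffffffff:#x}")
```

The block number read from disk comes out as 0xfc0000007e against a wanted 0x86c, so it does not look like a neighbouring block that was merely misplaced. I have no interpretation of that value beyond this.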

/* Steinar */
-- 
Homepage: https://www.sesse.net/
