* bcache failure hangs something in kernel
@ 2017-10-12 12:49 Alexandr Kuznetsov
  2017-10-12 18:12 ` Michael Lyle
  0 siblings, 1 reply; 13+ messages in thread

From: Alexandr Kuznetsov @ 2017-10-12 12:49 UTC (permalink / raw)
To: linux-bcache

Hello.

Can anyone help me? Two days ago I encountered a bcache failure and since
then I can't boot my Ubuntu 16.04 amd64 system. Now, when the cache and
backing devices meet each other during the register process, something
hangs inside the kernel and messages like these appear in dmesg:

[  839.113067] INFO: task bcache-register:2303 blocked for more than 120 seconds.
[  839.113077]       Not tainted 4.4.0-97-generic #120-Ubuntu
[  839.113079] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  839.113082] bcache-register D ffff8801256f3a88     0  2303      1 0x00000004
[  839.113089]  ffff8801256f3a88 ffff88008edc0dd0 ffff88013560b800 ffff880135bd5400
[  839.113093]  ffff8801256f4000 ffff88007a9f8000 0000000000000000 0000000000000000
[  839.113096]  0000000000000000 ffff8801256f3aa0 ffffffff8183f6b5 ffff88007a9f8000
[  839.113099] Call Trace:
[  839.113112]  [<ffffffff8183f6b5>] schedule+0x35/0x80
[  839.113133]  [<ffffffffc039c2b8>] bch_bucket_alloc+0x1d8/0x350 [bcache]
[  839.113139]  [<ffffffff810c4410>] ? wake_atomic_t_function+0x60/0x60
[  839.113148]  [<ffffffffc039c5c1>] __bch_bucket_alloc_set+0xf1/0x150 [bcache]
[  839.113157]  [<ffffffffc039c66e>] bch_bucket_alloc_set+0x4e/0x70 [bcache]
[  839.113168]  [<ffffffffc03b0529>] __uuid_write+0x59/0x130 [bcache]
[  839.113179]  [<ffffffffc03b0ed6>] bch_uuid_write+0x16/0x40 [bcache]
[  839.113189]  [<ffffffffc03b1ad5>] bch_cached_dev_attach+0xf5/0x490 [bcache]
[  839.113199]  [<ffffffffc03af5ad>] ? __write_super+0x13d/0x170 [bcache]
[  839.113210]  [<ffffffffc03b0eb0>] ? bcache_write_super+0x190/0x1a0 [bcache]
[  839.113225]  [<ffffffffc03b2958>] run_cache_set+0x5e8/0x8f0 [bcache]
[  839.113236]  [<ffffffffc03b3f62>] register_bcache+0xdc2/0x1140 [bcache]
[  839.113242]  [<ffffffff813fcd2f>] kobj_attr_store+0xf/0x20
[  839.113247]  [<ffffffff81290f27>] sysfs_kf_write+0x37/0x40
[  839.113250]  [<ffffffff8129030d>] kernfs_fop_write+0x11d/0x170
[  839.113255]  [<ffffffff8120f888>] __vfs_write+0x18/0x40
[  839.113258]  [<ffffffff81210219>] vfs_write+0xa9/0x1a0
[  839.113261]  [<ffffffff81210ed5>] SyS_write+0x55/0xc0
[  839.113264]  [<ffffffff818437f2>] entry_SYSCALL_64_fastpath+0x16/0x71

No /dev/bcache* devices appear and the whole system switches into a
strange state; for example, it cannot reboot gracefully - it freezes.

My data storage configuration is:
- /dev/md2 as the caching device: an mdadm RAID1 over two 64GiB
  partitions on two 128GB SSDs.
- /dev/md0 as primary storage (mdadm RAID5), split into 55 100GiB
  partitions plus the remainder as a 56th partition, giving the
  /dev/md0p<1-56> devices.
- /dev/md0p* used as backing devices, producing the /dev/bcache<0-55>
  cached devices.
- /dev/bcache* used as PVs for LVM.

Two days ago I experimented with remote LVM volume creation/deletion
using ssh commands, and something hung. The system could not reboot
gracefully and was later hard-reset. After that it refuses to boot.

bcache-super-show on the cache device and all backing devices says that
everything is fine. 54 backing devices show:

dev.data.cache_mode	1 [writeback]
dev.data.cache_state	1 [clean]
cset.uuid		d93ae507-b4bb-48ef-8d64-fa9329a08a39

One backing device (md0p3) shows:

dev.data.cache_mode	1 [writeback]
dev.data.cache_state	1 [dirty]
cset.uuid		d93ae507-b4bb-48ef-8d64-fa9329a08a39

And one strange device (md0p2) shows:

dev.data.cache_mode	1 [writeback]
dev.data.cache_state	0 [detached]
cset.uuid		9a6aeb43-5f33-45ca-a1b0-a1277e3e5c44

Is it possible for a device to be detached in writeback mode with a
strange cset.uuid?

After that I copied images of the cache device and 2 backing devices
(with dd) as material for recovery experiments. But I can't do anything:
when caching and backing devices meet each other during register, no
matter in which order, something bad happens inside the kernel, the
/dev/bcache* devices do not appear, and commands like
'cat /sys/block/md0p1/bcache/running' hang indefinitely.

Is it possible to recover data in this situation?

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: bcache failure hangs something in kernel
  2017-10-12 18:12 ` Michael Lyle
  2017-10-13  7:59   ` Alexandr Kuznetsov
  0 siblings, 1 reply; 13+ messages in thread

From: Michael Lyle @ 2017-10-12 18:12 UTC (permalink / raw)
To: Alexandr Kuznetsov, linux-bcache

Hi--

Sorry you are having trouble.

On 10/12/2017 05:49 AM, Alexandr Kuznetsov wrote:
> Hello.
>
> Can anyone help me? Two days ago I encountered a bcache failure and
> since then I can't boot my Ubuntu 16.04 amd64 system.
> Now, when the cache and backing devices meet each other during the
> register process, something hangs inside the kernel and messages like
> these appear in dmesg:

[snip]

> 54 backing devices show:
> dev.data.cache_mode	1 [writeback]
> dev.data.cache_state	1 [clean]
> cset.uuid		d93ae507-b4bb-48ef-8d64-fa9329a08a39
> One backing device (md0p3) shows:
> dev.data.cache_mode	1 [writeback]
> dev.data.cache_state	1 [dirty]
> cset.uuid		d93ae507-b4bb-48ef-8d64-fa9329a08a39
> And one strange device (md0p2) shows:
> dev.data.cache_mode	1 [writeback]
> dev.data.cache_state	0 [detached]
> cset.uuid		9a6aeb43-5f33-45ca-a1b0-a1277e3e5c44

It looks like the superblock of md0p2 and other data structures were
probably corrupted during the lvm commands, and this in turn is
triggering bugs in bcache (bcache should detect the situation and abort
everything, but instead is left with bucket_lock held and freezes).

One thing you could possibly do is blacklist bcache in your /etc/modules
and then attach all the devices one by one (not including md0p2), to get
at the data on all the other volumes.

Also, 54 of the backing devices are clean-- they have no dirty data in
the cache-- so they can be mounted directly if you want.

Mike
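[The one-by-one registration Mike suggests can be sketched as a small
script. This is only a sketch under the layout described in the report:
the device names are from Alexandr's setup, registration requires root
and the bcache module, and the side-effecting part is left commented out
so nothing is touched until you decide to run it.]

```shell
#!/bin/sh
# Sketch of one-by-one registration, skipping the suspect md0p2.
# Device names follow the layout in the report; adjust to your system.

SKIP="md0p2"   # the backing device with the foreign cset.uuid

should_skip() {
    # Skip the one device whose superblock looks corrupted.
    [ "$(basename "$1")" = "$SKIP" ]
}

register_all() {
    modprobe bcache   # needed if bcache was blacklisted at boot
    for dev in /dev/md2 /dev/md0p*; do
        should_skip "$dev" && continue
        # Writing a device path here asks the kernel to register it.
        echo "$dev" > /sys/fs/bcache/register_quiet || true
    done
}

# Uncomment when ready (requires root):
# register_all
```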
* Re: bcache failure hangs something in kernel
  2017-10-13  7:59 ` Alexandr Kuznetsov
  2017-10-13  8:11   ` Michael Lyle

From: Alexandr Kuznetsov @ 2017-10-13 7:59 UTC (permalink / raw)
To: Michael Lyle, linux-bcache

Hi

> It looks like probably the superblock of md0p2 and other data structures
> were corrupted during the lvm commands, and in turn this is triggering
> bugs with bcache (bcache should detect the situation and abort
> everything, but instead is left with the bucket_lock held and freezes).

This immediately raises questions about the reliability and safety of lvm
and bcache. I thought lvm was an old, mature and safe technology, but
here it got stuck, was then manually interrupted, and the result is
catastrophic data corruption. lvm sits on top of that sandwich of block
devices, on the layer of /dev/bcache* devices. Another question here is
how could lvm, however crazy, damage data outside of the /dev/bcache*
devices? This suggests that some necessary I/O buffer range checks are
missing inside bcache.

> One thing you could possibly do is blacklist bcache in your
> /etc/modules, and then attach all the devices one by one (not including
> md0p2), to get at the data on all the other volumes.
>
> Also, 54 of the backing devices are clean-- they have no dirty data in
> the cache-- so they can be mounted directly if you want.

Unfortunately these md0p* block devices are not separate from each other:
there is one 2TB volume on top of them inside lvm. Loss of one 100GiB
part and dirty data in another 100GiB part can kill the entire file
system with very high probability. Yesterday I read that bcache failures
are nasty, because file system root data often resides on the cache and
is dirty on the backing device.

Is there any fsck-like tool that can check and maybe try to recover data
from the caching and backing devices? Or could the developers get these
corrupted images to experiment on for bugfixing?
* Re: bcache failure hangs something in kernel
  2017-10-13  8:11 ` Michael Lyle
  2017-10-13  9:10   ` Alexandr Kuznetsov
  2017-11-14 13:27   ` Nix

From: Michael Lyle @ 2017-10-13 8:11 UTC (permalink / raw)
To: Alexandr Kuznetsov, linux-bcache

On 10/13/2017 12:59 AM, Alexandr Kuznetsov wrote:
> Hi
>
>> It looks like probably the superblock of md0p2 and other data
>> structures were corrupted during the lvm commands, and in turn this is
>> triggering bugs with bcache (bcache should detect the situation and
>> abort everything, but instead is left with the bucket_lock held and
>> freezes).
> This immediately raises questions about the reliability and safety of
> lvm and bcache.

Neither is safe if you overwrite the superblock with an errant command.
If you pvcreate'd on the backing device directly, or did something
similar, that would be expected to go badly.

> I thought lvm was an old, mature and safe technology, but here it got
> stuck, was then manually interrupted, and the result is catastrophic
> data corruption. lvm sits on top of that sandwich of block devices, on
> the layer of /dev/bcache* devices. Another question here is how could
> lvm, however crazy, damage data outside of the /dev/bcache* devices?
> This suggests that some necessary I/O buffer range checks are missing
> inside bcache.

I don't know what commands you ran. I've never seen/heard of a bcache
superblock being corrupted, and I believe the mappings/shrink are
appropriate.

> Unfortunately these md0p* block devices are not separate from each
> other: there is one 2TB volume on top of them inside lvm. Loss of one
> 100GiB part and dirty data in another 100GiB part can kill the entire
> file system with very high probability. Yesterday I read that bcache
> failures are nasty, because file system root data often resides on the
> cache and is dirty on the backing device.
> Is there any fsck-like tool that can check and maybe try to recover
> data from the caching and backing devices? Or could the developers get
> these corrupted images to experiment on for bugfixing?

Sorry, no. Other filesystems / block devices will not behave well if you
overwrite their superblock, either. This is not behavior bcache is
expected to recover gracefully from (though it shouldn't hang).

Re: the dirty data in the 100GB part, having a filesystem with a
superblock marked dirty is fine if the cache device is available.

Mike
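[Mike's point that clean backing devices can be mounted directly comes
with a caveat: the bcache superblock sits at the start of the backing
device, so the real data (here, the LVM PV) begins at an offset. On
common formats that offset is sector 16 (8 KiB), but that is an
assumption -- confirm it against the dev.data.first_sector value that
bcache-super-show reports for your devices. A hedged sketch using a
read-only loop device:]

```shell
#!/bin/sh
# Sketch: read a *clean* backing device's data without bcache, by
# mapping a read-only loop device past the bcache superblock. The
# 16-sector default data offset is an assumption -- verify it with
# bcache-super-show before trusting it.

data_offset_bytes() {
    # Convert a first-sector value (512-byte sectors) to a byte offset.
    echo $(( $1 * 512 ))
}

# Example (run as root; device name is illustrative):
# first=$(bcache-super-show /dev/md0p1 | awk '/first_sector/ {print $2}')
# losetup -f -r -o "$(data_offset_bytes "$first")" --show /dev/md0p1
# pvscan   # the new loop device should now be seen as an LVM PV
```

The read-only flag (-r) matters here: until you are sure the data is
consistent, nothing should write through the loop device.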
* Re: bcache failure hangs something in kernel
  2017-10-13  9:10 ` Alexandr Kuznetsov
  2017-10-13  9:13   ` Michael Lyle

From: Alexandr Kuznetsov @ 2017-10-13 9:10 UTC (permalink / raw)
To: Michael Lyle, linux-bcache

> Neither is safe if you overwrite the superblock with an errant command.
> If you pvcreate'd on the backing device directly, or did something
> similar, that would be expected to go badly.
>
> I don't know what commands you ran. I've never seen/heard of a bcache
> superblock being corrupted, and I believe the mappings/shrink are
> appropriate.

I was not manipulating the backing devices or lvm PVs directly. I was not
doing anything illegal from the lvm or bcache point of view; otherwise I
would not write here, because then I would know that I killed the file
system myself. There were only lvcreate and lvremove commands, which
create and remove logical volumes inside lvm, nothing more; there was no
direct access outside of the /dev/bcache* devices.

That's why I wrote that some necessary I/O buffer range checks seem to be
missing inside bcache. How could bcache allow data outside of the bcache*
devices to be damaged, if all access to them went through bcache and not
directly? I'm sure that's a bug. And why does bcache freeze when it meets
corrupted data, instead of reporting errors? I'm sure that's a bug too.

> Sorry, no. Other filesystems / block devices will not behave well if
> you overwrite their superblock, either. This is not behavior bcache is
> expected to recover gracefully from (though it shouldn't hang).
>
> Re: the dirty data in the 100GB part, having a filesystem with a
> superblock marked dirty is fine if the cache device is available.
>
> Mike

The cache device is available and it looks fine at first, but the system
behaves the same way even when the cache meets a backing device marked
"clean". So I am not sure that everything is fine with the caching
device. I need at least to check it, and I am asking for an appropriate
tool for that. Does one exist?

Registering any device (caching or backing) alone behaves nicely... at
first sight. But if bcache tries to connect the cache and a backing
device marked "clean" to each other during the register process, it
hangs.

That's all nasty, because I was not doing anything wrong, but I lost my
data due to bugs in both lvm and bcache :(
* Re: bcache failure hangs something in kernel
  2017-10-13  9:13 ` Michael Lyle
  2017-10-13 10:11   ` Alexandr Kuznetsov

From: Michael Lyle @ 2017-10-13 9:13 UTC (permalink / raw)
To: Alexandr Kuznetsov, linux-bcache

On 10/13/2017 02:10 AM, Alexandr Kuznetsov wrote:
[snip]
> There were only lvcreate and lvremove commands, which create and remove
> logical volumes inside lvm, nothing more; there was no direct access
> outside of the /dev/bcache* devices.

I'm sorry you've lost data. I've run bcache and lvm a lot at scale and
haven't seen anything like this, nor are there any other reports as far
as I'm aware. I am new as bcache maintainer, but I have a pretty high
degree of confidence it doesn't overwrite its own superblocks randomly.

If you come up with a repro, I'll be glad to look at it.

Mike
* Re: bcache failure hangs something in kernel
  2017-10-13 10:11 ` Alexandr Kuznetsov

From: Alexandr Kuznetsov @ 2017-10-13 10:11 UTC (permalink / raw)
To: Michael Lyle, linux-bcache

> I'm sorry you've lost data. I've run bcache and lvm a lot at scale and
> haven't seen anything like this, nor are there any other reports as far
> as I'm aware. I am new as bcache maintainer but have a pretty high
> degree of confidence it doesn't overwrite its own superblocks randomly.
>
> If you come up with a repro I'll be glad to look at it.
>
> Mike

So, it's the first report of such a failure... OK, when I reinstall the
system and have some free time, I will try to reproduce this situation,
if it was not caused by random galactic rays.

Last question: bcache is now used worldwide, and yet no consistency check
and repair tool exists? I have not found an answer to this question. If
one exists, it would help in failure situations with minor damage.
* Re: bcache failure hangs something in kernel
  2017-11-14 13:27 ` Nix
  2017-11-14 17:20   ` Michael Lyle

From: Nix @ 2017-11-14 13:27 UTC (permalink / raw)
To: Michael Lyle; +Cc: Alexandr Kuznetsov, linux-bcache

On 13 Oct 2017, Michael Lyle said:

> On 10/13/2017 12:59 AM, Alexandr Kuznetsov wrote:
>> I thought lvm was an old, mature and safe technology, but here it got
>> stuck, was then manually interrupted, and the result is catastrophic
>> data corruption. lvm sits on top of that sandwich of block devices, on
>> the layer of /dev/bcache* devices. Another question here is how could
>> lvm, however crazy, damage data outside of the /dev/bcache* devices?
>> This suggests that some necessary I/O buffer range checks are missing
>> inside bcache.
>
> I don't know what commands you ran. I've never seen/heard of a bcache
> superblock corrupted, and I believe the mappings/shrink are appropriate.

I have also had corruption on a writethrough bcache (atop RAID-6, with
LVM PVs within it) causing (rootfs) mount failure: bucket corruption,
IIRC. Every time I rebooted I got warnings that bcache couldn't clean up
in time, and I suspect this caused the corruption in the end (fairly
fast, actually: less than a month after I started using bcache, and it
had only just finished populating).

The thing is in none mode at the moment, waiting for me to revamp my
shutdown process to rotate the initramfs into place at shutdown so I can
unmount the rootfs and stop the bcache, in the hope that that might give
it a chance to shut down neatly. (Even so, finding that dirty shutdown
can corrupt the bcache is unpleasant. I guess nobody does powerfail
tests? How do most people shut down their bcache-on-rootfs systems?)

-- 
NULL && (void)
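[For the "how do people shut down bcache-on-rootfs" question: bcache does
expose sysfs knobs for an orderly stop, which a late shutdown script can
poke before the final reboot. The sketch below is an assumption-laden
illustration of that interface; the optional root-prefix argument is not
part of any real tool and exists only so the function can be dry-run
tested against a fake /sys tree.]

```shell
#!/bin/sh
# Sketch: ask bcache to tear down cleanly before reboot. Writing 1 to
# the per-device and per-cache-set "stop" files requests an orderly
# shutdown. The root prefix argument is for dry-run testing only.

stop_bcache() {
    root="${1:-/}"
    # Stop each cached device first...
    for b in "$root"sys/block/bcache*/bcache; do
        [ -d "$b" ] && echo 1 > "$b/stop"
    done
    # ...then the cache set(s) themselves (directories named by UUID).
    for cs in "$root"sys/fs/bcache/*-*; do
        [ -d "$cs" ] && echo 1 > "$cs/stop"
    done
    return 0
}

# In a real shutdown path (as root, after unmounting filesystems):
# stop_bcache
```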
* Re: bcache failure hangs something in kernel
  2017-11-14 17:20 ` Michael Lyle
  2017-11-14 18:25   ` Nix
  2017-11-15  8:44   ` Alexandr Kuznetsov

From: Michael Lyle @ 2017-11-14 17:20 UTC (permalink / raw)
To: Nix; +Cc: Alexandr Kuznetsov, linux-bcache

On Tue, Nov 14, 2017 at 5:27 AM, Nix <nix@esperi.org.uk> wrote:
> Every time I rebooted I got warnings that bcache couldn't clean up in
> time, and I suspect this caused corruption in the end (fairly fast,
> actually, less than a month after starting using bcache: it had only
> just finished populating).

What did you see? A message like this is normal:

[    2.224767] bcache: bch_journal_replay() journal replay done, 432 keys in 243 entries, seq 40691386

but anything else is strange... If you were consistently seeing other
messages, that means something unusual was happening (an already
badly-corrupted volume or bad hardware).

> The thing is in none mode at the moment, waiting for me to revamp my
> shutdown process to rotate the initramfs into place at shutdown so I
> can unmount the rootfs and stop the bcache, in the hope that that might
> give it a chance to shut down neatly. (Even so, finding that dirty
> shutdown can corrupt the bcache is unpleasant. I guess nobody does
> powerfail tests? How do most people shut down their bcache-on-rootfs
> systems?)

I've been doing a fair amount of powerfail tests with OK results.
Unfortunately, my production bcache machines had some unplanned power
failure testing in the past week as well. Also I sometimes write crashy
kernel code and end up testing unclean shutdown that way.

I have not experienced bcache corruption yet, though that doesn't say
there's not an issue. This doesn't sound at all like what Alexandr
experienced, either. There's no "bucket corruption" message in the
kernel -- maybe you saw a "bad btree" message at bucket xxxxx, block
xxx, dev xxxx?

What SSD are you using? A known issue is that there are families of SSDs
that do not do the right thing on shutdown -- e.g. some devices based
around LSI/SandForce that do emergency-writeback-from-RAM and have
underprovisioned / missing capacitors.

Mike
* Re: bcache failure hangs something in kernel
  2017-11-14 18:25 ` Nix
  2017-11-14 19:03   ` Michael Lyle

From: Nix @ 2017-11-14 18:25 UTC (permalink / raw)
To: Michael Lyle; +Cc: Alexandr Kuznetsov, linux-bcache

On 14 Nov 2017, Michael Lyle outgrape:

> On Tue, Nov 14, 2017 at 5:27 AM, Nix <nix@esperi.org.uk> wrote:
>> Every time I rebooted I got warnings that bcache couldn't clean up in
>> time, and I suspect this caused corruption in the end (fairly fast,
>> actually, less than a month after starting using bcache: it had only
>> just finished populating).
>
> What did you see? A message like this is normal:
>
> [    2.224767] bcache: bch_journal_replay() journal replay done, 432
> keys in 243 entries, seq 40691386
>
> but anything else is strange... If you were consistently seeing other
> messages that means something unusual was happening (an already
> badly-corrupted volume or bad hardware).

This registration code:

# Register all bcaches.
if [ -f /sys/fs/bcache/register_quiet ]; then
    for name in /dev/sd*[0-9]* /dev/md/*; do
        echo $name > /sys/fs/bcache/register_quiet 2>&1
    done
    # New devices registered: create them, after a short delay
    # to let the registration happen.
    sleep 1
    /sbin/mdev -s
fi

... did this (including the messages showing that the md array it's
caching is happy):

[   11.281907] md: md125 stopped.
[   11.294948] md/raid:md125: device sda3 operational as raid disk 0
[   11.305620] md/raid:md125: device sdf3 operational as raid disk 4
[   11.315899] md/raid:md125: device sdd3 operational as raid disk 3
[   11.325770] md/raid:md125: device sdc3 operational as raid disk 2
[   11.335245] md/raid:md125: device sdb3 operational as raid disk 1
[   11.344688] md/raid:md125: raid level 6 active with 5 out of 5 devices, algorithm 2
[   11.353810] md125: detected capacity change from 0 to 15761089757184
[   11.468956] bcache: prio_read() bad csum reading priorities
[   11.478010] bcache: prio_read() bad magic reading priorities
[   11.497911] bcache: error on 314dcdd2-9869-4110-99cc-9cd3a861afa6:
[   11.497914] bad checksum at bucket 28262, block 0, 36185 keys
[   11.507021] , disabling caching
[   11.529823] bcache: register_cache() registered cache device sde2
[   11.539054] bcache: cache_set_free() Cache set 314dcdd2-9869-4110-99cc-9cd3a861afa6 unregistered
[   11.558596] bcache: register_bdev() registered backing device md125

The hardware is fine (zero other problems ever encountered: the RAM is
all ECC and has had zero errors ever, and the disks and SSD are
otherwise faultless -- so far! -- and are in fairly heavy use for other
things, like the XFS and RAID journals, without incident). The cached
(XFS) volume was in use as my rootfs (not only / but also /usr/src and
/home: 4TiB, ~10% full) until the previous reboot. There had been a lot
of writes to the cache device because I'd only enabled the cache a week
before, rebooting several times after doing that as part of system
bringup.

Reboots with the cache enabled always featured a message from bcache an
instant before reboot saying it had timed out: from the code, the
timeout is based on a (short!) delay without any concern for whether,
say, the SSD is in the middle of writing a bunch of data, and the delay
is way too short for the SSD in question (an ATA-connected DC3510) to
write more than a GiB or so, a small fraction of the 350GiB I have
devoted to bcache.

I note that the SMART data's bus reset count on the SSD suggests that
rebooting resets the bus as part of POST (the count of bus resets is
identical to the count of OS reboots plus firmware upgrades from the
IPMI event log), which likely halts any ongoing writes. I suspect this
alone could explain the problem, but it's all speculation. However,
SMART also says

0x04  0x010  4  0  ---  Resets Between Cmd Acceptance and Completion

which I believe suggests that this cannot be the problem. Indeed,

199 CRC_Error_Count       -OSRCK  100  100  000  -  0
171 Program_Fail_Count    -O--CK  100  100  000  -  0
172 Erase_Fail_Count      -O--CK  100  100  000  -  0

suggests zero problems. However, I do also see

174 Unsafe_Shutdown_Count -O--CK  100  100  000  -  17

(whatever *that* means. What defines a safe shutdown to Intel SSDs?
Search me.)

> I have not experienced bcache corruption yet, though that doesn't say
> there's not an issue. This doesn't sound at all like what Alexandr
> experienced, either.

Indeed not.

> There's no "bucket corruption" message in the kernel -- maybe you saw
> a "bad btree" message at bucket xxxxx, block xxx, dev xxxx?

dmesg above. Sorry, vague memory caused trouble there. I posted a fuller
description on 7th June, which got no response:
<https://www.spinics.net/lists/linux-bcache/msg04668.html>

> What SSD are you using? A known issue is that there are families of
> SSDs that do not do the right thing on shutdown -- e.g. some devices
> based around LSI/SandForce that do emergency-writeback-from-RAM and
> have underprovisioned / missing capacitors.

This is an Intel DC3510. I believe Intel SSDs are the one brand that
actually *works* reliably when powerfail happens (btw, Corsair are
rumoured to be even worse, sometimes bricking the whole device on
powerfail! "Corsair" seems to be an appropriate manufacturer name in
that case). SMART says:

175 Power_Loss_Cap_Test   PO--CK  100  100  010  -  5350 (33 1413)

which I believe means "all is fine". isdct says that things are fine
too:

DeviceStatus : Healthy
EnduranceAnalyzer : 1102.31 years
LatencyTrackingEnabled : False
WriteCacheEnabled : True
WriteCacheReorderingStateEnabled : True
WriteCacheState : 1
WriteCacheSupported : True
WriteErrorRecoveryTimer : 0

(Hm, is it write caching that's doing it? Not given that the thing has
capacitors, surely.)

(In any case this machine has never experienced a power loss. :) )
* Re: bcache failure hangs something in kernel
  2017-11-14 19:03 ` Michael Lyle
  2017-11-17 20:13   ` Nix

From: Michael Lyle @ 2017-11-14 19:03 UTC (permalink / raw)
To: Nix; +Cc: Alexandr Kuznetsov, linux-bcache

On Tue, Nov 14, 2017 at 10:25 AM, Nix <nix@esperi.org.uk> wrote:
> [   11.497914] bad checksum at bucket 28262, block 0, 36185 keys

That's no good-- shouldn't have checksum errors. It means either the
metadata we wrote got corrupted by the disk, or a metadata write didn't
happen in the order we requested.

> Reboots with the cache enabled always featured a message from bcache an
> instant before reboot saying it had timed out: from the code, the
> timeout is based on a (short!) delay without any concern for whether,
> say, the SSD is in the middle of writing a bunch of data, and the delay
> is way too short for the SSD in question (an ATA-connected DC3510) to
> write more than a GiB or so, a small fraction of the 350GiB I have
> devoted to bcache.

I've seen things hit this couple-second timeout before. It basically
means that garbage collection is busy analyzing stuff on the disk and
doesn't get around to checking the "should I exit now?" flag in time.
Not ideal, but relatively harmless. (It's not trying to write back the
dirty data at this phase or anything.)

> I note that the SMART data's bus reset count on the SSD suggests that
> rebooting resets the bus as part of POST (the count of bus resets is
> identical to the count of OS reboots plus firmware upgrades from the
> IPMI event log), which likely halts any ongoing writes.

Even if it did, as long as acknowledged IO is written it's OK. That is,
it's OK for anything we're trying to write to be lost, as long as the
drive hasn't told us it's done and then later that write gets "undone".

I think there has to be something somewhat unique to your environment--
at an environment I used to administrate (before working on bcache
myself), there were about 100 bcache roots in writeback mode-- and we
both unceremoniously lost power with an active workload a couple of
times and did several clean shutdowns for upgrades without losing a
volume to corruption (though we did lose many disks that didn't feel
like working at all again after power failure). And now I have a bad
arc-fault circuit breaker in my home that has dumped power on my two
ext4 root-on-bcache-on-md machines three times in the past couple of
weeks without issue.

Each of my production machines has 15 unsafe shutdowns in smartctl -- a
number that I can't quite explain, because I think the real number
should be 7-8 or so... and my bcache development test rig has 145 (!).

Mike
* Re: bcache failure hangs something in kernel
  2017-11-17 20:13 ` Nix

From: Nix @ 2017-11-17 20:13 UTC (permalink / raw)
To: Michael Lyle; +Cc: Alexandr Kuznetsov, linux-bcache

On 14 Nov 2017, Michael Lyle stated:

> On Tue, Nov 14, 2017 at 10:25 AM, Nix <nix@esperi.org.uk> wrote:
>> [   11.497914] bad checksum at bucket 28262, block 0, 36185 keys
>
> That's no good-- shouldn't have checksum errors. It means either the
> metadata we wrote got corrupted by the disk, or a metadata write
> didn't happen in the order we requested.

Ugh!!! That would cause definite problems for any fs...

>> the delay is way too short for the SSD in question (an ATA-connected
>> DC3510) to write more than a GiB or so, a small fraction of the 350GiB
>> I have devoted to bcache.
>
> I've seen things hit this couple-second timeout before. It basically
> means that garbage collection is busy analyzing stuff on the disk and
> doesn't get around to checking the "should I exit now?" flag in time.

Note that at its peak the cache had 120GiB of stuff in it. The cache is
350GiB. I find it hard to understand why GC would be running at all, let
alone taking ages to do anything.

> Even if it did, as long as acknowledged IO is written it's OK. That
> is, it's OK for anything we're trying to write to be lost, as long as
> the drive hasn't told us it's done and then later that write gets
> "undone".
>
> I think there has to be something somewhat unique to your
> environment-- at an environment I used to administrate (before working

Oh, I'm sure it is. One of the uniquenesses is that my shutdown
procedure is gross: kill as many processes as possible, toposort and
lazily unmount everything, wait a bit, sync, wait a bit more, reboot...
nothing saner seems to work reliably in the presence of the maze of bind
mounts and unshared fs hierarchies on my system. Hence my plan to
revisit this and redesign it so it can reliably unmount everything,
pivot to an initramfs, unmount the root, and stop the bcache before I
try to enable the caches again.

(There is nothing unusual about the hardware, and the storage stack is
just WD disks -> partitions -> md6 -> bcache -> LVM PV (and then xfs and
LUKSed xfs in that). The LVM PV is part of a VG that extends over
unbcached md6 too. The SSD is just partitioned, with one partition
devoted to a cache device. No unusual controllers or anything, just
ordinary Intel S2600CWTR built-in mobo ATA stuff.)

> have a bad arc-fault circuit breaker in my home that has dumped power
> on my two ext4 root-on-bcache-on-md machines three times in the past
> couple weeks without issue. Each of my production machines has 15
> unsafe shutdowns in smartctl -- a number that I can't quite explain
> because I think the real number should be 7-8 or so... and my bcache
> development test rig has 145 (!).

Hm. Maybe I should re-enable it and see what happens? If it goes wrong,
is there anything I can do with the wreckage to help track this down?
(In particular with the wreckage left on the cache device after I've
flipped it back into none mode?)

-- 
NULL && (void)
* Re: bcache failure hangs something in kernel
  2017-11-15  8:44 ` Alexandr Kuznetsov

From: Alexandr Kuznetsov @ 2017-11-15 8:44 UTC (permalink / raw)
To: Michael Lyle, Nix; +Cc: linux-bcache

> I have not experienced bcache corruption yet, though that doesn't say
> there's not an issue. This doesn't sound at all like what Alexandr
> experienced, either.

By the way, I discovered the cause of the lvm hangups - they are not
bugs in lvm. I use the lvcreate command in fully automated mode in a
shell script, with all output suppressed (> /dev/null 2>&1), but when
lvcreate detects a filesystem superblock on a new volume, it asks the
user whether to wipe it or not. User interaction is impossible in that
scenario, so lvm hangs indefinitely, leaving the global lvm lock held.
This lock causes any lvm operation to hang (SIGTERM or even SIGKILL do
not help), and even the entire system cannot reboot gracefully. When I
met this situation, the reset button was the last thing I was able to
do, and I think exactly this hard reset caused the corruption on the
bcache caching device.
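[The failure mode described here - lvcreate blocking on a hidden
interactive prompt because its output was discarded - can be avoided by
telling LVM up front how to answer. The flags below exist in current
LVM2 (--yes auto-confirms prompts; -W y wipes detected signatures on the
new LV), but check your version's lvcreate(8) man page before relying on
them. The DRY_RUN switch is purely illustrative, for testing the command
construction without touching any volume group.]

```shell
#!/bin/sh
# Sketch: unattended LV creation that cannot hang on the "wipe detected
# signature?" prompt. --yes answers prompts automatically and -W y wipes
# known signatures on the new LV; verify both against your LVM2 version.

mklv() {
    vg="$1"; name="$2"; size="$3"
    set -- lvcreate --yes -W y -L "$size" -n "$name" "$vg"
    if [ -n "$DRY_RUN" ]; then
        echo "$@"              # show the command instead of running it
    else
        "$@" > /dev/null 2>&1  # safe to silence now: no prompt possible
    fi
}

# mklv vg0 scratch 10G
```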
end of thread, other threads:[~2017-11-17 20:13 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
2017-10-12 12:49 bcache failure hangs something in kernel Alexandr Kuznetsov
2017-10-12 18:12 ` Michael Lyle
2017-10-13  7:59   ` Alexandr Kuznetsov
2017-10-13  8:11     ` Michael Lyle
2017-10-13  9:10       ` Alexandr Kuznetsov
2017-10-13  9:13         ` Michael Lyle
2017-10-13 10:11           ` Alexandr Kuznetsov
2017-11-14 13:27       ` Nix
2017-11-14 17:20         ` Michael Lyle
2017-11-14 18:25           ` Nix
2017-11-14 19:03             ` Michael Lyle
2017-11-17 20:13               ` Nix
2017-11-15  8:44           ` Alexandr Kuznetsov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox