* bcache failure hangs something in kernel
@ 2017-10-12 12:49 Alexandr Kuznetsov
2017-10-12 18:12 ` Michael Lyle
0 siblings, 1 reply; 13+ messages in thread
From: Alexandr Kuznetsov @ 2017-10-12 12:49 UTC (permalink / raw)
To: linux-bcache
Hello.
Can anyone help me? Two days ago I encountered a bcache failure, and since
then I can't boot my Ubuntu 16.04 amd64 system.
Now, when the cache and backing devices meet each other during the register
process, something hangs inside the kernel and messages like this appear in
dmesg:
[ 839.113067] INFO: task bcache-register:2303 blocked for more than 120 seconds.
[ 839.113077] Not tainted 4.4.0-97-generic #120-Ubuntu
[ 839.113079] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 839.113082] bcache-register D ffff8801256f3a88 0 2303 1 0x00000004
[ 839.113089] ffff8801256f3a88 ffff88008edc0dd0 ffff88013560b800 ffff880135bd5400
[ 839.113093] ffff8801256f4000 ffff88007a9f8000 0000000000000000 0000000000000000
[ 839.113096] 0000000000000000 ffff8801256f3aa0 ffffffff8183f6b5 ffff88007a9f8000
[ 839.113099] Call Trace:
[ 839.113112] [<ffffffff8183f6b5>] schedule+0x35/0x80
[ 839.113133] [<ffffffffc039c2b8>] bch_bucket_alloc+0x1d8/0x350 [bcache]
[ 839.113139] [<ffffffff810c4410>] ? wake_atomic_t_function+0x60/0x60
[ 839.113148] [<ffffffffc039c5c1>] __bch_bucket_alloc_set+0xf1/0x150 [bcache]
[ 839.113157] [<ffffffffc039c66e>] bch_bucket_alloc_set+0x4e/0x70 [bcache]
[ 839.113168] [<ffffffffc03b0529>] __uuid_write+0x59/0x130 [bcache]
[ 839.113179] [<ffffffffc03b0ed6>] bch_uuid_write+0x16/0x40 [bcache]
[ 839.113189] [<ffffffffc03b1ad5>] bch_cached_dev_attach+0xf5/0x490 [bcache]
[ 839.113199] [<ffffffffc03af5ad>] ? __write_super+0x13d/0x170 [bcache]
[ 839.113210] [<ffffffffc03b0eb0>] ? bcache_write_super+0x190/0x1a0 [bcache]
[ 839.113225] [<ffffffffc03b2958>] run_cache_set+0x5e8/0x8f0 [bcache]
[ 839.113236] [<ffffffffc03b3f62>] register_bcache+0xdc2/0x1140 [bcache]
[ 839.113242] [<ffffffff813fcd2f>] kobj_attr_store+0xf/0x20
[ 839.113247] [<ffffffff81290f27>] sysfs_kf_write+0x37/0x40
[ 839.113250] [<ffffffff8129030d>] kernfs_fop_write+0x11d/0x170
[ 839.113255] [<ffffffff8120f888>] __vfs_write+0x18/0x40
[ 839.113258] [<ffffffff81210219>] vfs_write+0xa9/0x1a0
[ 839.113261] [<ffffffff81210ed5>] SyS_write+0x55/0xc0
[ 839.113264] [<ffffffff818437f2>] entry_SYSCALL_64_fastpath+0x16/0x71
No /dev/bcache* devices appear, and the whole system switches into a strange
state; for example, it cannot reboot gracefully - it freezes.
My data storage configuration is:
/dev/md2 as the caching device: an mdadm RAID1 of two 64GiB partitions
on two 128GB SSDs.
/dev/md0 as primary storage (mdadm RAID5), split into 55 100GiB
partitions plus the remainder as a 56th partition, which gives the
/dev/md0p<1-56> devices.
/dev/md0p* used as backing devices, producing the /dev/bcache<0-55>
cached devices.
/dev/bcache* used as PVs for LVM.
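For reference, a stack like the one described above would typically be assembled with bcache-tools roughly as follows. This is a hedged sketch only: the device names and the cache-set UUID are taken from this report, the exact options originally used are unknown, and the function is defined but not invoked here.

```shell
#!/bin/sh
# Sketch of the usual bcache-tools commands for this layout (RAID1 SSD
# cache, RAID5 partitions as backing devices). Run the steps as root on
# the real machine; all device names are illustrative.
build_stack() {
    make-bcache -C /dev/md2                    # format the cache device

    for part in /dev/md0p*; do
        make-bcache -B "$part"                 # format each backing device
    done

    # Attach every resulting bcache device to the cache set, then use it
    # as an LVM PV (cset.uuid as later reported by bcache-super-show).
    cset=d93ae507-b4bb-48ef-8d64-fa9329a08a39
    for n in $(seq 0 55); do
        echo "$cset" > "/sys/block/bcache$n/bcache/attach"
        pvcreate "/dev/bcache$n"
    done
}
```

The 56 backing devices map one-to-one onto /dev/bcache0 through /dev/bcache55, which is why a single corrupted member can stall registration of the whole set.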
Two days ago I was experimenting with remote LVM volume creation/deletion
via ssh commands, and something hung. The system could not reboot
gracefully and later had to be hard-reset. Since then it refuses to boot.
bcache-super-show on the cache device and on all the backing devices says
that everything is fine.
54 backing devices show:
dev.data.cache_mode 1 [writeback]
dev.data.cache_state 1 [clean]
cset.uuid d93ae507-b4bb-48ef-8d64-fa9329a08a39
One backing device (md0p3) shows:
dev.data.cache_mode 1 [writeback]
dev.data.cache_state 1 [dirty]
cset.uuid d93ae507-b4bb-48ef-8d64-fa9329a08a39
And one strange device (md0p2) shows:
dev.data.cache_mode 1 [writeback]
dev.data.cache_state 0 [detached]
cset.uuid 9a6aeb43-5f33-45ca-a1b0-a1277e3e5c44
Is it possible for a device to be detached in writeback mode with a
strange cset.uuid?
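A quick way to survey the state of all the backing devices at once is to parse the bcache-super-show output mechanically. A sketch (assuming bcache-tools is installed; the field layout matches the output quoted above, and the device glob follows this layout):

```shell
#!/bin/sh
# Print "cache_state cset.uuid" for one device's bcache-super-show
# output, read from stdin. Field positions match lines like
# "dev.data.cache_state  0 [detached]".
summarize() {
    awk '/^dev\.data\.cache_state/ { state = $3 }
         /^cset\.uuid/             { uuid  = $2 }
         END                       { print state, uuid }'
}

# Survey every backing device (run as root on the affected machine):
for dev in /dev/md0p*; do
    [ -b "$dev" ] || continue          # skip if the glob matched nothing
    printf '%s: %s\n' "$dev" "$(bcache-super-show "$dev" | summarize)"
done
```

A device whose cset.uuid differs from the rest, like md0p2 above, stands out immediately in this listing.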
After that I copied images of the cache device and two backing devices
(with dd) to experiment with recovery. But I can't do anything: when the
caching and backing devices meet each other during register, no matter in
which order, something bad happens inside the kernel, no /dev/bcache*
devices appear, and commands like 'cat /sys/block/md0p1/bcache/running'
hang forever.
Is it possible to recover data in this situation?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: bcache failure hangs something in kernel
2017-10-12 12:49 bcache failure hangs something in kernel Alexandr Kuznetsov
@ 2017-10-12 18:12 ` Michael Lyle
2017-10-13 7:59 ` Alexandr Kuznetsov
0 siblings, 1 reply; 13+ messages in thread
From: Michael Lyle @ 2017-10-12 18:12 UTC (permalink / raw)
To: Alexandr Kuznetsov, linux-bcache
Hi-- sorry you are having trouble.
On 10/12/2017 05:49 AM, Alexandr Kuznetsov wrote:
> Hello.
>
> Can anyone help me? Two days ago I encountered a bcache failure, and since
> then I can't boot my Ubuntu 16.04 amd64 system.
> Now, when the cache and backing devices meet each other during the register
> process, something hangs inside the kernel and messages like this appear in
> dmesg:
[snip]
> 54 backing devices show:
> dev.data.cache_mode 1 [writeback]
> dev.data.cache_state 1 [clean]
> cset.uuid d93ae507-b4bb-48ef-8d64-fa9329a08a39
> One backing device (md0p3) shows:
> dev.data.cache_mode 1 [writeback]
> dev.data.cache_state 1 [dirty]
> cset.uuid d93ae507-b4bb-48ef-8d64-fa9329a08a39
> And one strange device (md0p2) shows:
> dev.data.cache_mode 1 [writeback]
> dev.data.cache_state 0 [detached]
> cset.uuid 9a6aeb43-5f33-45ca-a1b0-a1277e3e5c44
It looks like probably the superblock of md0p2 and other data structures
were corrupted during the lvm commands, and in turn this is triggering
bugs with bcache (bcache should detect the situation and abort
everything, but instead is left with the bucket_lock held and freezes).
One thing you could possibly do is blacklist bcache in your
/etc/modules, and then attach all the devices one by one (not including
md0p2), to get at the data on all the other volumes.
Also, 54 of the backing devices are clean-- they have no dirty data in
the cache-- so they can be mounted directly if you want.
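A rough sketch of that recovery procedure, under stated assumptions: the bcache module is blacklisted from autoloading and loaded manually, the device names follow Alexandr's layout, and md0p2 is the one device to leave out.

```shell
#!/bin/sh
# Sketch: with bcache kept from autoloading (so nothing registers at
# boot), load the module and register the cache plus the healthy backing
# devices one by one, skipping the suspect md0p2. Device names are
# assumptions based on the layout described in this thread.
SKIP="md0p2"

should_register() {
    # Succeed unless the basename of $1 is in the $SKIP list.
    b=$(basename "$1")
    case " $SKIP " in
        *" $b "*) return 1 ;;
    esac
    return 0
}

register_all() {
    modprobe bcache
    echo /dev/md2 > /sys/fs/bcache/register      # the cache device
    for dev in /dev/md0p*; do
        [ -b "$dev" ] || continue
        should_register "$dev" || continue
        echo "$dev" > /sys/fs/bcache/register
    done
}

# Run as root on the affected machine:
#   register_all
```

Registering one device at a time also makes it obvious which member triggers the hang, since the write to /sys/fs/bcache/register blocks at that point.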
Mike
* Re: bcache failure hangs something in kernel
2017-10-12 18:12 ` Michael Lyle
@ 2017-10-13 7:59 ` Alexandr Kuznetsov
2017-10-13 8:11 ` Michael Lyle
0 siblings, 1 reply; 13+ messages in thread
From: Alexandr Kuznetsov @ 2017-10-13 7:59 UTC (permalink / raw)
To: Michael Lyle, linux-bcache
Hi
> It looks like probably the superblock of md0p2 and other data structures
> were corrupted during the lvm commands, and in turn this is triggering
> bugs with bcache (bcache should detect the situation and abort
> everything, but instead is left with the bucket_lock held and freezes).
This immediately raises questions about the reliability and safety of lvm and
bcache.
I thought that lvm was an old, mature, and safe technology, but here it got
stuck, was then manually interrupted, and the result is catastrophic data corruption.
lvm sits on top of that sandwich of block devices, on the layer of
/dev/bcache* devices. Another question here is how a crazy lvm could
damage data outside of the /dev/bcache* devices? This means that some
necessary io buffer range checks are missing inside bcache.
> One thing you could possibly do is blacklist bcache in your
> /etc/modules, and then attach all the devices one by one (not including
> md0p2), to get at the data on all the other volumes.
>
> Also, 54 of the backing devices are clean-- they have no dirty data in
> the cache-- so they can be mounted directly if you want.
Unfortunately these md0p* block devices are not separate from each other
- there is one 2TB volume on top of them inside lvm. Loss of one 100GiB
part plus dirty data in another 100GiB part can kill the entire file
system with very high probability. Yesterday I read that bcache failures
are nasty, because file system root data often resides in the cache and is
dirty on the backing device.
Does any fsck-like tool exist that can check and maybe try to recover
data from the caching and backing devices? Or could the developers get
these corrupted images to experiment with for bugfixing?
* Re: bcache failure hangs something in kernel
2017-10-13 7:59 ` Alexandr Kuznetsov
@ 2017-10-13 8:11 ` Michael Lyle
2017-10-13 9:10 ` Alexandr Kuznetsov
2017-11-14 13:27 ` Nix
0 siblings, 2 replies; 13+ messages in thread
From: Michael Lyle @ 2017-10-13 8:11 UTC (permalink / raw)
To: Alexandr Kuznetsov, linux-bcache
On 10/13/2017 12:59 AM, Alexandr Kuznetsov wrote:
> Hi
>
>> It looks like probably the superblock of md0p2 and other data structures
>> were corrupted during the lvm commands, and in turn this is triggering
>> bugs with bcache (bcache should detect the situation and abort
>> everything, but instead is left with the bucket_lock held and freezes).
> This immediately raises questions about the reliability and safety of lvm and
> bcache.
Neither is safe if you overwrite the superblock with an errant command.
If you pvcreate'd on the backing device directly, or did something
similarly, that would be expected to go badly.
> I thought that lvm was an old, mature, and safe technology, but here it got
> stuck, was then manually interrupted, and the result is catastrophic data corruption.
> lvm sits on top of that sandwich of block devices, on the layer of
> /dev/bcache* devices. Another question here is how a crazy lvm could
> damage data outside of the /dev/bcache* devices? This means that some
> necessary io buffer range checks are missing inside bcache.
I don't know what commands you ran. I've never seen/heard of a bcache
superblock corrupted, and I believe the mappings/shrink are appropriate.
> Unfortunately these md0p* block devices are not separate from each other
> - there is one 2TB volume on top of them inside lvm. Loss of one 100GiB
> part plus dirty data in another 100GiB part can kill the entire file
> system with very high probability. Yesterday I read that bcache failures
> are nasty, because file system root data often resides in the cache and is
> dirty on the backing device.
> Does any fsck-like tool exist that can check and maybe try to recover
> data from the caching and backing devices? Or could the developers get
> these corrupted images to experiment with for bugfixing?
Sorry, no. Other filesystems / block devices will not behave well if
you overwrite their superblock, either. This is not behavior bcache is
expected to recover gracefully from (though it shouldn't hang).
re: the dirty data in the 100GB part, having a filesystem with a
superblock marked dirty is fine if the cache device is available.
Mike
* Re: bcache failure hangs something in kernel
2017-10-13 8:11 ` Michael Lyle
@ 2017-10-13 9:10 ` Alexandr Kuznetsov
2017-10-13 9:13 ` Michael Lyle
2017-11-14 13:27 ` Nix
1 sibling, 1 reply; 13+ messages in thread
From: Alexandr Kuznetsov @ 2017-10-13 9:10 UTC (permalink / raw)
To: Michael Lyle, linux-bcache
> On 10/13/2017 12:59 AM, Alexandr Kuznetsov wrote:
>> Hi
>>
>>> It looks like probably the superblock of md0p2 and other data structures
>>> were corrupted during the lvm commands, and in turn this is triggering
>>> bugs with bcache (bcache should detect the situation and abort
>>> everything, but instead is left with the bucket_lock held and freezes).
>> This immediately raises questions about the reliability and safety of lvm and
>> bcache.
> Neither is safe if you overwrite the superblock with an errant command.
> If you pvcreate'd on the backing device directly, or did something
> similarly, that would be expected to go badly.
>
>> I thought that lvm was an old, mature, and safe technology, but here it got
>> stuck, was then manually interrupted, and the result is catastrophic data corruption.
>> lvm sits on top of that sandwich of block devices, on the layer of
>> /dev/bcache* devices. Another question here is how a crazy lvm could
>> damage data outside of the /dev/bcache* devices? This means that some
>> necessary io buffer range checks are missing inside bcache.
> I don't know what commands you ran. I've never seen/heard of a bcache
> superblock corrupted, and I believe the mappings/shrink are appropriate.
I was not manipulating the backing devices or LVM PVs directly. I was
not doing anything illegal from the lvm or bcache point of view; otherwise
I would not write here, because then I would know that I had killed the
file system myself.
There were only lvcreate and lvremove commands that create and remove
logical volumes inside lvm, nothing more; there wasn't any direct access
outside of the /dev/bcache* devices. That's why I wrote "This means that some
necessary io buffer range checks are missing inside bcache". So how was
bcache allowed to damage data outside of the bcache* devices if all access
to them went through bcache, not directly? I'm sure that's a bug. Why does
bcache freeze when it meets corrupted data instead of reporting errors?
I'm sure that's a bug.
> Sorry, no. Other filesystems / block devices will not behave well if
> you overwrite their superblock, either. This is not behavior bcache is
> expected to recover gracefully from (though it shouldn't hang).
>
> re: the dirty data in the 100GB part, having a filesystem with a
> superblock marked dirty is fine if the cache device is available.
>
> Mike
The cache device is available and it looks fine at first, but it behaves
the same way even if the cache meets any backing device that is marked
"clean". So I am not sure that everything is fine with the caching device.
I need at least to check it, and I'm asking for an appropriate tool for
that. Does one exist?
Registering any device (caching or backing) alone behaves nicely... at
first sight. But if bcache tries to connect the cache and a backing device
marked "clean" to each other during the register process, it hangs. That's
all nasty, because I was not doing anything wrong, but I lost my data due
to bugs in both lvm and bcache :(
* Re: bcache failure hangs something in kernel
2017-10-13 9:10 ` Alexandr Kuznetsov
@ 2017-10-13 9:13 ` Michael Lyle
2017-10-13 10:11 ` Alexandr Kuznetsov
0 siblings, 1 reply; 13+ messages in thread
From: Michael Lyle @ 2017-10-13 9:13 UTC (permalink / raw)
To: Alexandr Kuznetsov, linux-bcache
On 10/13/2017 02:10 AM, Alexandr Kuznetsov wrote:
[snip]
> I was not manipulating the backing devices or LVM PVs directly. I was
> not doing anything illegal from the lvm or bcache point of view; otherwise
> I would not write here, because then I would know that I had killed the
> file system myself.
> There were only lvcreate and lvremove commands that create and remove
> logical volumes inside lvm, nothing more; there wasn't any direct access
> outside of the /dev/bcache* devices. That's why I wrote "This means that some
> necessary io buffer range checks are missing inside bcache". So how was
> bcache allowed to damage data outside of the bcache* devices if all access
> to them went through bcache, not directly? I'm sure that's a bug. Why does
> bcache freeze when it meets corrupted data instead of reporting errors?
> I'm sure that's a bug.
I'm sorry you've lost data. I've run bcache and lvm a lot at scale and
haven't seen anything like this, nor are there any other reports as far
as I'm aware. I am new as bcache maintainer but have a pretty high
degree of confidence it doesn't overwrite its own superblocks randomly.
If you come up with a repro I'll be glad to look at it.
Mike
* Re: bcache failure hangs something in kernel
2017-10-13 9:13 ` Michael Lyle
@ 2017-10-13 10:11 ` Alexandr Kuznetsov
0 siblings, 0 replies; 13+ messages in thread
From: Alexandr Kuznetsov @ 2017-10-13 10:11 UTC (permalink / raw)
To: Michael Lyle, linux-bcache
>
> I'm sorry you've lost data. I've run bcache and lvm a lot at scale and
> haven't seen anything like this, nor are there any other reports as far
> as I'm aware. I am new as bcache maintainer but have a pretty high
> degree of confidence it doesn't overwrite its own superblocks randomly.
>
> If you come up with a repro I'll be glad to look at it.
>
> Mike
So, it's the first report of such a failure... OK, when I reinstall the
system and have some free time, I will try to reproduce this situation,
if it was not caused by random cosmic rays.
Last question: bcache is now used worldwide, and yet no consistency
check & repair tool exists? I have not found an answer to this question.
If one existed, it would help in failure situations with minor damage.
* Re: bcache failure hangs something in kernel
2017-10-13 8:11 ` Michael Lyle
2017-10-13 9:10 ` Alexandr Kuznetsov
@ 2017-11-14 13:27 ` Nix
2017-11-14 17:20 ` Michael Lyle
1 sibling, 1 reply; 13+ messages in thread
From: Nix @ 2017-11-14 13:27 UTC (permalink / raw)
To: Michael Lyle; +Cc: Alexandr Kuznetsov, linux-bcache
On 13 Oct 2017, Michael Lyle said:
> On 10/13/2017 12:59 AM, Alexandr Kuznetsov wrote:
>> I thought that lvm was an old, mature, and safe technology, but here it got
>> stuck, was then manually interrupted, and the result is catastrophic data corruption.
>> lvm sits on top of that sandwich of block devices, on the layer of
>> /dev/bcache* devices. Another question here is how a crazy lvm could
>> damage data outside of the /dev/bcache* devices? This means that some
>> necessary io buffer range checks are missing inside bcache.
>
> I don't know what commands you ran. I've never seen/heard of a bcache
> superblock corrupted, and I believe the mappings/shrink are appropriate.
I have also had corruption on a writethrough bcache (atop RAID-6, with
LVM PVs within it) causing (rootfs) mount failure: bucket corruption,
IIRC. Every time I rebooted I got warnings that bcache couldn't clean up
in time, and I suspect this caused corruption in the end (fairly fast,
actually, less than a month after starting using bcache: it had only
just finished populating).
The thing is in none mode at the moment, waiting for me to revamp my
shutdown process to rotate the initramfs into place at shutdown so I can
unmount the rootfs and stop the bcache, in the hope that that might give
it a chance to shut down neatly. (Even so, finding that dirty shutdown
can corrupt the bcache is unpleasant. I guess nobody does powerfail
tests? How do most people shut down their bcache-on-rootfs systems?)
--
NULL && (void)
* Re: bcache failure hangs something in kernel
2017-11-14 13:27 ` Nix
@ 2017-11-14 17:20 ` Michael Lyle
2017-11-14 18:25 ` Nix
2017-11-15 8:44 ` Alexandr Kuznetsov
0 siblings, 2 replies; 13+ messages in thread
From: Michael Lyle @ 2017-11-14 17:20 UTC (permalink / raw)
To: Nix; +Cc: Alexandr Kuznetsov, linux-bcache
On Tue, Nov 14, 2017 at 5:27 AM, Nix <nix@esperi.org.uk> wrote:
> Every time I rebooted I got warnings that bcache couldn't clean up
> in time, and I suspect this caused corruption in the end (fairly fast,
> actually, less than a month after starting using bcache: it had only
> just finished populating).
What did you see? A message like this is normal:
[ 2.224767] bcache: bch_journal_replay() journal replay done, 432
keys in 243 entries, seq 40691386
but anything else is strange... If you were consistently seeing other
messages that means something unusual was happening (an already
badly-corrupted volume or bad hardware).
> The thing is in none mode at the moment, waiting for me to revamp my
> shutdown process to rotate the initramfs into place at shutdown so I can
> unmount the rootfs and stop the bcache, in the hope that that might give
> it a chance to shut down neatly. (Even so, finding that dirty shutdown
> can corrupt the bcache is unpleasant. I guess nobody does powerfail
> tests? How do most people shut down their bcache-on-rootfs systems?)
I've been doing a fair amount of powerfail tests with OK results.
Unfortunately, my production bcache machines had some unplanned power
failure testing in the past week as well. Also I sometimes write
crashy kernel code and end up testing unclean shutdown that way.
I have not experienced bcache corruption yet, though that doesn't say
there's not an issue. This doesn't sound at all like what Alexandr
experienced, either.
There's no bucket corruption message in the kernel -- maybe you
saw a bad btree message at bucket xxxxx, block xxx, dev xxxx?
What SSD are you using? A known issue is that there are families of
SSDs that do not do the right thing on shutdown-- e.g. some devices
based around LSI/SandForce that do emergency-writeback-from-RAM that
have underprovisioned / missing capacitors.
Mike
* Re: bcache failure hangs something in kernel
2017-11-14 17:20 ` Michael Lyle
@ 2017-11-14 18:25 ` Nix
2017-11-14 19:03 ` Michael Lyle
2017-11-15 8:44 ` Alexandr Kuznetsov
1 sibling, 1 reply; 13+ messages in thread
From: Nix @ 2017-11-14 18:25 UTC (permalink / raw)
To: Michael Lyle; +Cc: Alexandr Kuznetsov, linux-bcache
On 14 Nov 2017, Michael Lyle outgrape:
> On Tue, Nov 14, 2017 at 5:27 AM, Nix <nix@esperi.org.uk> wrote:
>> Every time I rebooted I got warnings that bcache couldn't clean up
>> in time, and I suspect this caused corruption in the end (fairly fast,
>> actually, less than a month after starting using bcache: it had only
>> just finished populating).
>
> What did you see? A message like this is normal:
>
> [ 2.224767] bcache: bch_journal_replay() journal replay done, 432
> keys in 243 entries, seq 40691386
>
> but anything else is strange... If you were consistently seeing other
> messages that means something unusual was happening (an already
> badly-corrupted volume or bad hardware).
This registration code:
# Register all bcaches.
if [ -f /sys/fs/bcache/register_quiet ]; then
    for name in /dev/sd*[0-9]* /dev/md/*; do
        echo "$name" > /sys/fs/bcache/register_quiet 2>/dev/null
    done
    # New devices registered: create them, after a short delay
    # to let the registration happen.
    sleep 1
    /sbin/mdev -s
fi
... did this (including the messages showing that the md array it's
caching is happy):
[ 11.281907] md: md125 stopped.
[ 11.294948] md/raid:md125: device sda3 operational as raid disk 0
[ 11.305620] md/raid:md125: device sdf3 operational as raid disk 4
[ 11.315899] md/raid:md125: device sdd3 operational as raid disk 3
[ 11.325770] md/raid:md125: device sdc3 operational as raid disk 2
[ 11.335245] md/raid:md125: device sdb3 operational as raid disk 1
[ 11.344688] md/raid:md125: raid level 6 active with 5 out of 5 devices, algorithm 2
[ 11.353810] md125: detected capacity change from 0 to 15761089757184
[ 11.468956] bcache: prio_read() bad csum reading priorities
[ 11.478010] bcache: prio_read() bad magic reading priorities
[ 11.497911] bcache: error on 314dcdd2-9869-4110-99cc-9cd3a861afa6:
[ 11.497914] bad checksum at bucket 28262, block 0, 36185 keys
[ 11.507021] , disabling caching
[ 11.529823] bcache: register_cache() registered cache device sde2
[ 11.539054] bcache: cache_set_free() Cache set 314dcdd2-9869-4110-99cc-9cd3a861afa6 unregistered
[ 11.558596] bcache: register_bdev() registered backing device md125
The hardware is fine (zero other problems ever encountered: the RAM is
all ECC and has had zero errors ever, and the disks and SSD are
otherwise faultless -- so far! -- and are in fairly heavy use for other
things, like the XFS and RAID journals, without incident). The cached
(XFS) volume was in use as my rootfs (not only / but also /usr/src and
/home: 4TiB, ~10% full) until the previous reboot. There had been a lot
of writes to the cache device because I'd only enabled the cache a week
before, rebooting several times after doing that as part of system
bringup.
Reboots with the cache enabled always featured a message from bcache an
instant before reboot saying it had timed out: from the code, the
timeout is based on a (short!) delay without any concern for whether,
say, the SSD is in the middle of writing a bunch of data, and the delay
is way too short for the SSD in question (an ATA-connected DC3510) to
write more than a GiB or so, a small fraction of the 350GiB I have
devoted to bcache.
I note that the SMART data's bus reset count on the SSD suggests that
rebooting resets the bus as part of POST (the count of bus resets is
identical to the count of OS reboots plus firmware upgrades from the
IPMI event log), which likely halts any ongoing writes. I suspect this
alone could explain the problem, but it's all speculation.
However, SMART also says
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
which I believe suggests that this cannot be the problem.
Indeed,
199 CRC_Error_Count -OSRCK 100 100 000 - 0
171 Program_Fail_Count -O--CK 100 100 000 - 0
172 Erase_Fail_Count -O--CK 100 100 000 - 0
suggests zero problems. However I do also see
174 Unsafe_Shutdown_Count -O--CK 100 100 000 - 17
(whatever *that* means. What defines a safe shutdown to Intel SSDs?
Search me.)
> I have not experienced bcache corruption yet, though that doesn't say
> there's not an issue. This doesn't sound at all like what Alexandr
> experienced, either.
Indeed not.
> There's not a bucket corruption message in the kernel, -- maybe you
> saw a bad btree message at bucket xxxxx, block xxx, dev xxxx?
dmesg above. Sorry, vague memory caused trouble there. I posted a
fuller description on 7th June, which got no response:
<https://www.spinics.net/lists/linux-bcache/msg04668.html>
> What SSD are you using? A known issue is that there are families of
> SSDs that do not do the right thing on shutdown-- e.g. some devices
> based around LSI/SandForce that do emergency-writeback-from-RAM that
> have underprovisioned / missing capacitors.
This is an Intel DC3510. I believe Intel SSDs are the single model that
actually *work* reliably when powerfail happens (btw Corsair are
rumoured to be even worse, sometimes bricking the whole device on
powerfail! "Corsair" seems to be an appropriate manufacturer name in
this case). SMART says:
175 Power_Loss_Cap_Test PO--CK 100 100 010 - 5350 (33 1413)
which I believe means "all is fine". isdct says that things are fine too:
DeviceStatus : Healthy
EnduranceAnalyzer : 1102.31 years
LatencyTrackingEnabled : False
WriteCacheEnabled : True
WriteCacheReorderingStateEnabled : True
WriteCacheState : 1
WriteCacheSupported : True
WriteErrorRecoveryTimer : 0
(Hm is it write caching that's doing it? Not given that the thing has
capacitors, surely.)
(In any case this machine has never experienced a power loss. :) )
* Re: bcache failure hangs something in kernel
2017-11-14 18:25 ` Nix
@ 2017-11-14 19:03 ` Michael Lyle
2017-11-17 20:13 ` Nix
0 siblings, 1 reply; 13+ messages in thread
From: Michael Lyle @ 2017-11-14 19:03 UTC (permalink / raw)
To: Nix; +Cc: Alexandr Kuznetsov, linux-bcache
On Tue, Nov 14, 2017 at 10:25 AM, Nix <nix@esperi.org.uk> wrote:
> [ 11.497914] bad checksum at bucket 28262, block 0, 36185 keys
That's no good-- shouldn't have checksum errors. It means either the
metadata we wrote got corrupted by the disk, or a metadata write
didn't happen in the order we requested.
> Reboots with the cache enabled always featured a message from bcache an
> instant before reboot saying it had timed out: from the code, the
> timeout is based on a (short!) delay without any concern for whether,
> say, the SSD is in the middle of writing a bunch of data, and the delay
> is way too short for the SSD in question (an ATA-connected DC3510) to
> write more than a GiB or so, a small fraction of the 350GiB I have
> devoted to bcache.
I've seen things hit this couple second timeout before. It basically
means that garbage collection is busy analyzing stuff on the disk and
doesn't get around to checking the "should I exit now?" flag in time.
Not ideal but relatively harmless. (It's not trying to write back the
dirty data at this phase or anything).
> I note that the SMART data's bus reset count on the SSD suggests that
> rebooting resets the bus as part of POST (the count of bus resets is
> identical to the count of OS reboots plus firmware upgrades from the
> IPMI event log), which likely halts any ongoing writes.
Even if it did, as long as acknowledged IO is written it's OK. That
is, it's OK for anything we're trying to write to be lost, as long as
the drive hasn't told us it's done and then later that write gets
"undone".
I think there has to be something somewhat unique to your
environment-- at an environment I used to administrate (before working
on bcache myself), there were about 100 bcache roots in writeback
mode-- and we both unceremoniously lost power with active workload a
couple of times and did several clean shutdowns for upgrades without
losing a volume to corruption (though we did lose many disks that
didn't feel like working at all again after power failure). And now I
have a bad arc-fault circuit breaker in my home that has dumped power
on my two ext4 root-on bcache-on md machines three times in the past
couple weeks without issue. Each of my production machines has 15
unsafe shutdowns in smartctl -- a number that I can't quite explain
because I think the real number should be 7-8 or so... and my bcache
development test rig has 145 (!).
Mike
* Re: bcache failure hangs something in kernel
2017-11-14 17:20 ` Michael Lyle
2017-11-14 18:25 ` Nix
@ 2017-11-15 8:44 ` Alexandr Kuznetsov
1 sibling, 0 replies; 13+ messages in thread
From: Alexandr Kuznetsov @ 2017-11-15 8:44 UTC (permalink / raw)
To: Michael Lyle, Nix; +Cc: linux-bcache
> I have not experienced bcache corruption yet, though that doesn't say
> there's not an issue. This doesn't sound at all like what Alexandr
> experienced, either.
By the way, I discovered the cause of the lvm hangups - it's not a bug in
lvm. I use the lvcreate command in fully automated mode in a shell script
with all output suppressed (> /dev/null 2>&1), but when lvcreate detects a
filesystem superblock in a new volume, it asks the user whether to wipe it
out or not. User interaction is impossible in that scenario, so lvm hangs
forever, leaving the global lvm lock acquired. This lock causes any
further lvm operation to hang (SIGTERM or even SIGKILL does not help), and
even the entire system cannot reboot gracefully. When I hit this
situation, the reset button was the only thing left I could do, and I
think exactly this hard reset caused the corruption on the bcache caching
device.
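The failure mode described above can be avoided by making lvcreate non-interactive. A sketch (assuming a reasonably recent LVM, where lvcreate accepts --yes to answer wipe prompts automatically; the VG/LV names are illustrative):

```shell
#!/bin/sh
# Sketch: lvcreate found an old filesystem signature in the newly
# allocated extents and asked "Wipe it? [y/n]", which a fully silenced
# script can never answer. Passing --yes answers such prompts
# automatically, so the command cannot block waiting for input.
lv_create_cmd() {
    vg=$1; lv=$2; size=$3
    echo "lvcreate --yes -L $size -n $lv $vg"
}

# What the automated script could run over ssh instead of a bare lvcreate:
#   $(lv_create_cmd vg0 data01 100G) >/dev/null 2>&1
```

Alternatively, wiping stale signatures from freshly created volumes (for example with wipefs) before reuse removes the condition that triggers the prompt in the first place.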
* Re: bcache failure hangs something in kernel
2017-11-14 19:03 ` Michael Lyle
@ 2017-11-17 20:13 ` Nix
0 siblings, 0 replies; 13+ messages in thread
From: Nix @ 2017-11-17 20:13 UTC (permalink / raw)
To: Michael Lyle; +Cc: Alexandr Kuznetsov, linux-bcache
On 14 Nov 2017, Michael Lyle stated:
> On Tue, Nov 14, 2017 at 10:25 AM, Nix <nix@esperi.org.uk> wrote:
>> [ 11.497914] bad checksum at bucket 28262, block 0, 36185 keys
>
> That's no good-- shouldn't have checksum errors. It means either the
> metadata we wrote got corrupted by the disk, or a metadata write
> didn't happen in the order we requested.
Ugh!!! That would cause definite problems for any fs...
>> is way too short for the SSD in question (an ATA-connected DC3510) to
>> write more than a GiB or so, a small fraction of the 350GiB I have
>> devoted to bcache.
>
> I've seen things hit this couple second timeout before. It basically
> means that garbage collection is busy analyzing stuff on the disk and
> doesn't get around to checking the "should I exit now?" flag in time.
Note that at its peak the cache had 120GiB of stuff in it. The cache is
350GiB. I find it hard to understand why GC would be running at all,
let alone taking ages to do anything.
> Even if it did, as long as acknowledged IO is written it's OK. That
> is, it's OK for anything we're trying to write to be lost, as long as
> the drive hasn't told us it's done and then later that write gets
> "undone".
>
> I think there has to be something somewhat unique to your
> environment-- at an environment I used to administrate (before working
Oh I'm sure it is. One of the uniquenesses is that my shutdown procedure
is gross: kill as many processes as possible, toposort and lazily
unmount everything, wait a bit, sync, wait a bit more, reboot... nothing
saner seems to work reliably in the presence of the maze of bind mounts
and unshared fs hierarchies on my system.
Hence my plan to revisit this and redesign it so it can reliably unmount
everything, pivot to an initramfs, unmount the root, and stop the bcache
before I try to enable the caches again.
(There is nothing unusual about the hardware, and the storage stack is
just WD disks -> partitions -> md6 -> bcache -> LVM PV (and then xfs and
LUKSed xfs in that). The LVM PV is part of a VG that extends over
unbcached md6 too. The SSD is just partitioned with one partition
devoted to a cache device. No unusual controllers or anything, just
ordinary Intel S2600CWTR built-in mobo ATA stuff.)
> have a bad arc-fault circuit breaker in my home that has dumped power
> on my two ext4 root-on bcache-on md machines three times in the past
> couple weeks without issue. Each of my production machines has 15
> unsafe shutdowns in smartctl -- a number that I can't quite explain
> because I think the real number should be 7-8 or so... and my bcache
> development test rig has 145 (!).
Hm. Maybe I should re-enable it and see what happens? If it goes wrong,
is there anything I can do with the wreckage to help track this down?
(In particular the wreckage left on the cache device after I've flipped
it back into none mode?)
--
NULL && (void)