linux-bcache.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [bcachefs]  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
@ 2016-09-03 21:29 Marcin
  2016-09-03 21:57 ` Marcin
  2016-09-04  0:17 ` Kent Overstreet
  0 siblings, 2 replies; 7+ messages in thread
From: Marcin @ 2016-09-03 21:29 UTC (permalink / raw)
  To: linux-bcache

Hi!
Kernel at commit c820493652e830dc050e1418301e1bdec5691a1e

I createt to devices, fast has size
# blockdev --getsz /dev/sde1
20971520
and slower device:
# blockdev --getsz /dev/sdd1
2930209551

I was copying files from one disk to bcache, after some time I got:
  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]

# cat 
/sys/fs/bcache/5ad3d751-cb40-4bb7-9546-3e5b48ccd3cf/cache0/dirty_bytes  
/sys/fs/bcache/5ad3d751-cb40-4bb7-9546-3e5b48ccd3cf/cache1/dirty_bytes
9269264384
2540163072

dmesg says:
[ 9088.110001] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
[bch_copygc_read:5328]
[ 9088.110001] Modules linked in: bcache lz4_compress zlib_deflate 
libcrc32c lz4_decompress sha1_generic cpufreq_ondemand msr af_packet 
fuse pcrypt nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack ipv6 
coretemp hwmon kvm_intel kvm irqbypass i2c_i801 i2c_core sky2 hpsa 
acpi_cpufreq scsi_transport_sas processor button raid1 md_mod 
pata_jmicron ehci_pci ehci_hcd dm_mod
[ 9088.110001] CPU: 0 PID: 5328 Comm: bch_copygc_read Not tainted 4.7.0+ 
#2
[ 9088.110001] Hardware name: Gigabyte Technology Co., Ltd. 
965P-DS3/965P-DS3, BIOS F14 06/25/2009
[ 9088.110001] task: ffff88001b561a80 ti: ffff880002810000 task.ti: 
ffff880002810000
[ 9088.110001] RIP: 0010:[<ffffffffc028795b>]  [<ffffffffc028795b>] 
bch_moving_gc+0xcb/0xd90 [bcache]
[ 9088.110001] RSP: 0018:ffff880002813d20  EFLAGS: 00000287
[ 9088.110001] RAX: ffffc900001f6e78 RBX: ffff8800102c5000 RCX: 
000000000003c000
[ 9088.110001] RDX: ffffc9000027f000 RSI: ffff8800102c5000 RDI: 
ffff8800102c5608
[ 9088.110001] RBP: ffff880002813e40 R08: ffff880002813d90 R09: 
0000000002000000
[ 9088.110001] R10: 0000000000000bc6 R11: 0000000000064000 R12: 
ffff8800014d0000
[ 9088.110001] R13: ffff8800102c5000 R14: 0000000000027800 R15: 
0000000001b6c6e0
[ 9088.110001] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) 
knlGS:0000000000000000
[ 9088.110001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9088.110001] CR2: 000056048ce06f62 CR3: 000000003d0ce000 CR4: 
00000000000006f0
[ 9088.110001] Stack:
[ 9088.110001]  0000000000015f00 0000000000015f00 ffffffff8107bd3f 
ffff8800102c5000
[ 9088.110001]  ffff880002813de0 0000000000000000 ffff880002813e68 
ffff880002813e40
[ 9088.110001]  0000000000000002 0000000000064000 ffff8800102c5608 
000000000000000f
[ 9088.110001] Call Trace:
[ 9088.110001]  [<ffffffff8107bd3f>] ? sched_clock_cpu+0x7f/0xa0
[ 9088.110001]  [<ffffffffc028878c>] bch_moving_gc_thread+0x16c/0x180 
[bcache]
[ 9088.110001]  [<ffffffffc0288620>] ? bch_moving_gc+0xd90/0xd90 
[bcache]
[ 9088.110001]  [<ffffffff8106e874>] kthread+0xc4/0xe0
[ 9088.110001]  [<ffffffff8146b0bf>] ret_from_fork+0x1f/0x40
[ 9088.110001]  [<ffffffff8106e7b0>] ? kthread_worker_fn+0x150/0x150
[ 9088.110001] Code: 00 48 8b 8b a8 02 00 00 48 8b 93 a8 05 00 00 48 8d 
04 40 48 8d 0c 49 48 8d 04 82 48 8d 14 8a 48 39 d0 0f 83 7e 08 00 00 48 
89 de <44> 8b 48 04 c6 40 09 00 45 85 c9 74 2e f7 40 04 01 00 01 00 75


Compression wasn't turned on.

Marcin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [bcachefs]  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
  2016-09-03 21:29 [bcachefs] BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328] Marcin
@ 2016-09-03 21:57 ` Marcin
  2016-09-04  0:17 ` Kent Overstreet
  1 sibling, 0 replies; 7+ messages in thread
From: Marcin @ 2016-09-03 21:57 UTC (permalink / raw)
  To: linux-bcache

W dniu 2016-09-03 23:29, Marcin napisał(a):
> Hi!
> Kernel at commit c820493652e830dc050e1418301e1bdec5691a1e

It looks that I was messing with bcache-encryption and this bug appears 
on bcache-encryption branch.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [bcachefs]  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
  2016-09-03 21:29 [bcachefs] BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328] Marcin
  2016-09-03 21:57 ` Marcin
@ 2016-09-04  0:17 ` Kent Overstreet
  2016-09-04 18:21   ` Marcin
  1 sibling, 1 reply; 7+ messages in thread
From: Kent Overstreet @ 2016-09-04  0:17 UTC (permalink / raw)
  To: Marcin; +Cc: linux-bcache

On Sat, Sep 03, 2016 at 11:29:49PM +0200, Marcin wrote:
> Hi!
> Kernel at commit c820493652e830dc050e1418301e1bdec5691a1e
> 
> I createt to devices, fast has size
> # blockdev --getsz /dev/sde1
> 20971520
> and slower device:
> # blockdev --getsz /dev/sdd1
> 2930209551
> 
> I was copying files from one disk to bcache, after some time I got:
>  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]

Thanks for the report - can you run addr2line with your vmlinux file, and the
RIP?

addr2line -i -e vmlinux ffffffffc028795b

> 
> # cat /sys/fs/bcache/5ad3d751-cb40-4bb7-9546-3e5b48ccd3cf/cache0/dirty_bytes
> /sys/fs/bcache/5ad3d751-cb40-4bb7-9546-3e5b48ccd3cf/cache1/dirty_bytes
> 9269264384
> 2540163072
> 
> dmesg says:
> [ 9088.110001] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
> [bch_copygc_read:5328]
> [ 9088.110001] Modules linked in: bcache lz4_compress zlib_deflate libcrc32c
> lz4_decompress sha1_generic cpufreq_ondemand msr af_packet fuse pcrypt
> nf_nat_ftp nf_conntrack_ftp nf_nat nf_conntrack ipv6 coretemp hwmon
> kvm_intel kvm irqbypass i2c_i801 i2c_core sky2 hpsa acpi_cpufreq
> scsi_transport_sas processor button raid1 md_mod pata_jmicron ehci_pci
> ehci_hcd dm_mod
> [ 9088.110001] CPU: 0 PID: 5328 Comm: bch_copygc_read Not tainted 4.7.0+ #2
> [ 9088.110001] Hardware name: Gigabyte Technology Co., Ltd.
> 965P-DS3/965P-DS3, BIOS F14 06/25/2009
> [ 9088.110001] task: ffff88001b561a80 ti: ffff880002810000 task.ti:
> ffff880002810000
> [ 9088.110001] RIP: 0010:[<ffffffffc028795b>]  [<ffffffffc028795b>]
> bch_moving_gc+0xcb/0xd90 [bcache]
> [ 9088.110001] RSP: 0018:ffff880002813d20  EFLAGS: 00000287
> [ 9088.110001] RAX: ffffc900001f6e78 RBX: ffff8800102c5000 RCX:
> 000000000003c000
> [ 9088.110001] RDX: ffffc9000027f000 RSI: ffff8800102c5000 RDI:
> ffff8800102c5608
> [ 9088.110001] RBP: ffff880002813e40 R08: ffff880002813d90 R09:
> 0000000002000000
> [ 9088.110001] R10: 0000000000000bc6 R11: 0000000000064000 R12:
> ffff8800014d0000
> [ 9088.110001] R13: ffff8800102c5000 R14: 0000000000027800 R15:
> 0000000001b6c6e0
> [ 9088.110001] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000)
> knlGS:0000000000000000
> [ 9088.110001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9088.110001] CR2: 000056048ce06f62 CR3: 000000003d0ce000 CR4:
> 00000000000006f0
> [ 9088.110001] Stack:
> [ 9088.110001]  0000000000015f00 0000000000015f00 ffffffff8107bd3f
> ffff8800102c5000
> [ 9088.110001]  ffff880002813de0 0000000000000000 ffff880002813e68
> ffff880002813e40
> [ 9088.110001]  0000000000000002 0000000000064000 ffff8800102c5608
> 000000000000000f
> [ 9088.110001] Call Trace:
> [ 9088.110001]  [<ffffffff8107bd3f>] ? sched_clock_cpu+0x7f/0xa0
> [ 9088.110001]  [<ffffffffc028878c>] bch_moving_gc_thread+0x16c/0x180
> [bcache]
> [ 9088.110001]  [<ffffffffc0288620>] ? bch_moving_gc+0xd90/0xd90 [bcache]
> [ 9088.110001]  [<ffffffff8106e874>] kthread+0xc4/0xe0
> [ 9088.110001]  [<ffffffff8146b0bf>] ret_from_fork+0x1f/0x40
> [ 9088.110001]  [<ffffffff8106e7b0>] ? kthread_worker_fn+0x150/0x150
> [ 9088.110001] Code: 00 48 8b 8b a8 02 00 00 48 8b 93 a8 05 00 00 48 8d 04
> 40 48 8d 0c 49 48 8d 04 82 48 8d 14 8a 48 39 d0 0f 83 7e 08 00 00 48 89 de
> <44> 8b 48 04 c6 40 09 00 45 85 c9 74 2e f7 40 04 01 00 01 00 75
> 
> 
> Compression wasn't turned on.
> 
> Marcin
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [bcachefs]  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
  2016-09-04  0:17 ` Kent Overstreet
@ 2016-09-04 18:21   ` Marcin
  2016-09-06  2:24     ` Kent Overstreet
  0 siblings, 1 reply; 7+ messages in thread
From: Marcin @ 2016-09-04 18:21 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcache

W dniu 2016-09-04 02:17, Kent Overstreet napisał(a):

Hi!

> On Sat, Sep 03, 2016 at 11:29:49PM +0200, Marcin wrote:
>> Hi!
>> Kernel at commit c820493652e830dc050e1418301e1bdec5691a1e
>> 
>> I createt to devices, fast has size
>> # blockdev --getsz /dev/sde1
>> 20971520
>> and slower device:
>> # blockdev --getsz /dev/sdd1
>> 2930209551
>> 
>> I was copying files from one disk to bcache, after some time I got:
>>  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
> 
> Thanks for the report - can you run addr2line with your vmlinux file, 
> and the
> RIP?
> 
> addr2line -i -e vmlinux ffffffffc028795b

It returned:
??:0

Probably due to I'm using bcache as module.
<long story>
As I mentioned before I wasn't sure which branch I used to test. So I 
started to reproduce problem again and watch which branch is used. I 
encountered strange behaviour, random BUGs in kernel, even random 
reboots. It looked like a hardware problem but I never got such problems 
using this box as backup & samba server. Memtest86+ didn't found any 
problem, cpuburn didn't trigger anything suspicious. Gcc compiles 
without any strange failures. Nothing suspicious happened when I didn't 
use bcachefs. So i spent a couple of hours trying to find out what is 
going on. I tried bcachefs without tiering. When I formated /dev/sde1 
then everything was ok. But when I tried to use /dev/sdd1 something 
happened:
# bcache format /dev/sdd1
/dev/sdd1 contains a bcache filesystem
Proceed anyway? (y,n) y
UUID:                   78a90acf-0247-4edf-a56f-f4f82a51e13e
Set UUID:               a0a6e9fa-d7bb-4cec-8c42-b7e4bbccb2d8
version:                6
nbuckets:               3815462
block_size:             1
bucket_size:            768
nr_in_set:              1
nr_this_dev:            0
first_bucket:           3

# LC_ALL=en_US.utf-8 mount -t bcache /dev/sdd1 /mnt/test/
mount: mount(2) failed: No such file or directory

Kernel throws:
zserwerek kernel: [ 5832.483306] bcache: bch_open_as_blockdevs() 
register_cache_set err Bad bucket size

Hmmm. So I resized /dev/sdd1 from 1.4TB to 100GB. Formated again, and 
mounted without problem. Hmm. I created tiered bcachefs with smaller 
/dev/sdd1 and it works stable.

<conclucion>
Please look at line with "bucket size":
bucket_size:            768
If bucket size is higher than (probably) 512 then I can't mount simple 
(without tiering) bcachefs filesystem. If I use such big device in 
tiered bcachefs I'm expieriencing random problems with stability of box.
I think that bug in mail's subject is only random symptom of problem 
when device is formated with bucket size >512.
What is going inside kernel in this case, is overwittem memory of other 
processes?


Marcin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [bcachefs]  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
  2016-09-04 18:21   ` Marcin
@ 2016-09-06  2:24     ` Kent Overstreet
  2016-09-06 11:08       ` Marcin Mirosław
  0 siblings, 1 reply; 7+ messages in thread
From: Kent Overstreet @ 2016-09-06  2:24 UTC (permalink / raw)
  To: Marcin; +Cc: linux-bcache

On Sun, Sep 04, 2016 at 08:21:17PM +0200, Marcin wrote:
> W dniu 2016-09-04 02:17, Kent Overstreet napisał(a):
> 
> Hi!
> 
> > On Sat, Sep 03, 2016 at 11:29:49PM +0200, Marcin wrote:
> > > Hi!
> > > Kernel at commit c820493652e830dc050e1418301e1bdec5691a1e
> > > 
> > > I createt to devices, fast has size
> > > # blockdev --getsz /dev/sde1
> > > 20971520
> > > and slower device:
> > > # blockdev --getsz /dev/sdd1
> > > 2930209551
> > > 
> > > I was copying files from one disk to bcache, after some time I got:
> > >  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
> > 
> > Thanks for the report - can you run addr2line with your vmlinux file,
> > and the
> > RIP?
> > 
> > addr2line -i -e vmlinux ffffffffc028795b
> 
> It returned:
> ??:0
> 
> Probably due to I'm using bcache as module.
> <long story>
> As I mentioned before I wasn't sure which branch I used to test.

In case I didn't mention before - bcache-dev. This bug in the bcache-encryption
branch is a bit disconcerting though since my tests never hit it, but don't
worry about it - I'll chase it down.

> Please look at line with "bucket size":
> bucket_size:            768
> If bucket size is higher than (probably) 512 then I can't mount simple
> (without tiering) bcachefs filesystem. If I use such big device in tiered
> bcachefs I'm expieriencing random problems with stability of box.
> I think that bug in mail's subject is only random symptom of problem when
> device is formated with bucket size >512.
> What is going inside kernel in this case, is overwittem memory of other
> processes?

Whoops - that one is a bug in bcache-tools, non power of two bucket sizes aren't
supported (might be someday, but aren't currently). I just pushed a fix for that
to bcache-tools.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [bcachefs] BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
  2016-09-06  2:24     ` Kent Overstreet
@ 2016-09-06 11:08       ` Marcin Mirosław
  2016-09-06 13:08         ` Kent Overstreet
  0 siblings, 1 reply; 7+ messages in thread
From: Marcin Mirosław @ 2016-09-06 11:08 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcache

W dniu 06.09.2016 o 04:24, Kent Overstreet pisze:
Hi!

> On Sun, Sep 04, 2016 at 08:21:17PM +0200, Marcin wrote:
>> W dniu 2016-09-04 02:17, Kent Overstreet napisał(a):
>>
>> Hi!
>>
>>> On Sat, Sep 03, 2016 at 11:29:49PM +0200, Marcin wrote:
>>>> Hi!
>>>> Kernel at commit c820493652e830dc050e1418301e1bdec5691a1e
>>>>
>>>> I createt to devices, fast has size
>>>> # blockdev --getsz /dev/sde1
>>>> 20971520
>>>> and slower device:
>>>> # blockdev --getsz /dev/sdd1
>>>> 2930209551
>>>>
>>>> I was copying files from one disk to bcache, after some time I got:
>>>>  BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
>>>
>>> Thanks for the report - can you run addr2line with your vmlinux file,
>>> and the
>>> RIP?
>>>
>>> addr2line -i -e vmlinux ffffffffc028795b
>>
>> It returned:
>> ??:0
>>
>> Probably due to I'm using bcache as module.
>> <long story>
>> As I mentioned before I wasn't sure which branch I used to test.
> 
> In case I didn't mention before - bcache-dev. This bug in the bcache-encryption
> branch is a bit disconcerting though since my tests never hit it, but don't
> worry about it - I'll chase it down.

I think that bug "BUG: soft lockup" is due to problem with bucket size.
I saw many random, different bugs when second tiered device had bucket
size equal to 768.

>> Please look at line with "bucket size":
>> bucket_size:            768
>> If bucket size is higher than (probably) 512 then I can't mount simple
>> (without tiering) bcachefs filesystem. If I use such big device in tiered
>> bcachefs I'm expieriencing random problems with stability of box.
>> I think that bug in mail's subject is only random symptom of problem when
>> device is formated with bucket size >512.
>> What is going inside kernel in this case, is overwittem memory of other
>> processes?
> 
> Whoops - that one is a bug in bcache-tools, non power of two bucket sizes aren't
> supported (might be someday, but aren't currently). I just pushed a fix for that
> to bcache-tools.
One mor thing, when I tested tiering with one device formated with
unsupported bucket sizethis command worked:
# mount /dev/sde1:/dev/sdd1 /mnt/test
but this one didn't:
# mount /dev/sdd1:/dev/sde1 /mnt/test

so:
<low priority wish>
it could be good to check if on disk format of every device is correct
and supported.

Thank you,
Marcin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [bcachefs] BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328]
  2016-09-06 11:08       ` Marcin Mirosław
@ 2016-09-06 13:08         ` Kent Overstreet
  0 siblings, 0 replies; 7+ messages in thread
From: Kent Overstreet @ 2016-09-06 13:08 UTC (permalink / raw)
  To: Marcin Mirosław; +Cc: linux-bcache

On Tue, Sep 06, 2016 at 01:08:57PM +0200, Marcin Mirosław wrote:
> <low priority wish>
> it could be good to check if on disk format of every device is correct
> and supported.

Whoops - that should be fixed now.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-09-06 13:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-09-03 21:29 [bcachefs] BUG: soft lockup - CPU#0 stuck for 22s! [bch_copygc_read:5328] Marcin
2016-09-03 21:57 ` Marcin
2016-09-04  0:17 ` Kent Overstreet
2016-09-04 18:21   ` Marcin
2016-09-06  2:24     ` Kent Overstreet
2016-09-06 11:08       ` Marcin Mirosław
2016-09-06 13:08         ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).