* Deadlock on 3.18.5
@ 2015-02-04 23:05 Juergen Fitschen
  2015-02-05  4:03 ` Duncan
  2015-02-05 11:04 ` Juergen Fitschen
  0 siblings, 2 replies; 6+ messages in thread
From: Juergen Fitschen @ 2015-02-04 23:05 UTC (permalink / raw)
To: linux-btrfs

Hey,

I just got a deadlock on Linux 3.18.5. It occurred while processing some data: all in all, reading from the btrfs volume, decompressing it (the bottleneck in this process), doing some calculations and finally writing the results back to a new file on the volume.

I am using a freshly installed Ubuntu 14.04 in a virtual machine. The block device used for btrfs is an LVM volume offered by the KVM hypervisor.

Unfortunately I was using an old version of my user space tools (version 3.12-1), so it might be related to that?

The volume was mounted with the following mount options:

> default,compress-force=lzo,subvol=data

Related kernel stack trace:

> Feb 4 22:00:41 qdock03 kernel: [22560.284120] INFO: task btrfs-transacti:508 blocked for more than 120 seconds.
> Feb 4 22:00:41 qdock03 kernel: [22560.284200] Tainted: G W 3.18.5-031805-generic #201501292218
> Feb 4 22:00:41 qdock03 kernel: [22560.284267] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 4 22:00:41 qdock03 kernel: [22560.284339] btrfs-transacti D 0000000000000001 0 508 2 0x00000000
> Feb 4 22:00:41 qdock03 kernel: [22560.284345] ffff881ece617b28 0000000000000046 ffffffffc0136e90 00000000008bee9a
> Feb 4 22:00:41 qdock03 kernel: [22560.284349] ffff881ece617fd8 0000000000013640 ffff881ece095b00 0000000000013640
> Feb 4 22:00:41 qdock03 kernel: [22560.284352] ffff881ed0638000 ffff881ecfdf5000 ffff881ece617b98 ffff881f79c33f40
> Feb 4 22:00:41 qdock03 kernel: [22560.284356] Call Trace:
> Feb 4 22:00:41 qdock03 kernel: [22560.284397] [<ffffffffc0136e90>] ? btree_io_failed_hook+0x60/0x60 [btrfs]
> Feb 4 22:00:41 qdock03 kernel: [22560.284406] [<ffffffff817b0750>] ? bit_wait_timeout+0x80/0x80
> Feb 4 22:00:41 qdock03 kernel: [22560.284411] [<ffffffff817afd29>] schedule+0x29/0x70
> Feb 4 22:00:41 qdock03 kernel: [22560.284415] [<ffffffff817afdff>] io_schedule+0x8f/0xd0
> Feb 4 22:00:41 qdock03 kernel: [22560.284419] [<ffffffff817b077c>] bit_wait_io+0x2c/0x50
> Feb 4 22:00:41 qdock03 kernel: [22560.284423] [<ffffffff817b04e5>] __wait_on_bit+0x65/0x90
> Feb 4 22:00:41 qdock03 kernel: [22560.284428] [<ffffffff8117888b>] ? find_get_pages_tag+0xcb/0x170
> Feb 4 22:00:41 qdock03 kernel: [22560.284449] [<ffffffffc015ce8e>] ? clear_state_bit+0xae/0x170 [btrfs]
> Feb 4 22:00:41 qdock03 kernel: [22560.284455] [<ffffffff81177e37>] wait_on_page_bit+0xc7/0xd0
> Feb 4 22:00:41 qdock03 kernel: [22560.284460] [<ffffffff810b4cf0>] ? wake_atomic_t_function+0x40/0x40
> Feb 4 22:00:41 qdock03 kernel: [22560.284465] [<ffffffff81178004>] filemap_fdatawait_range+0xf4/0x180
> Feb 4 22:00:41 qdock03 kernel: [22560.284483] [<ffffffffc013f41f>] btrfs_wait_marked_extents+0x8f/0x140 [btrfs]
> Feb 4 22:00:41 qdock03 kernel: [22560.284498] [<ffffffffc013f532>] btrfs_write_and_wait_transaction+0x62/0xb0 [btrfs]
> Feb 4 22:00:41 qdock03 kernel: [22560.284514] [<ffffffffc013fe04>] btrfs_commit_transaction+0x734/0xae0 [btrfs]
> Feb 4 22:00:41 qdock03 kernel: [22560.284528] [<ffffffffc013d4d5>] transaction_kthread+0x1d5/0x250 [btrfs]
> Feb 4 22:00:41 qdock03 kernel: [22560.284543] [<ffffffffc013d300>] ? open_ctree+0x1f50/0x1f50 [btrfs]
> Feb 4 22:00:41 qdock03 kernel: [22560.284548] [<ffffffff81093a79>] kthread+0xc9/0xe0
> Feb 4 22:00:41 qdock03 kernel: [22560.284552] [<ffffffff810939b0>] ? flush_kthread_worker+0x90/0x90
> Feb 4 22:00:41 qdock03 kernel: [22560.284556] [<ffffffff817b3ebc>] ret_from_fork+0x7c/0xb0
> Feb 4 22:00:41 qdock03 kernel: [22560.284559] [<ffffffff810939b0>] ? flush_kthread_worker+0x90/0x90

Regards, Juergen

^ permalink raw reply	[flat|nested] 6+ messages in thread
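Should the hang show up again, a minimal sketch of what could be captured from the guest before rebooting, in addition to the hung-task trace above (the thread name and output path are only examples, and /proc/<pid>/stack is only available with CONFIG_STACKTRACE):

  # make sure magic SysRq is enabled, then dump all blocked (D-state) tasks to the kernel log
  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger
  # kernel stack of the stuck transaction kthread (name truncated to 15 chars by the kernel)
  cat /proc/$(pgrep btrfs-transacti)/stack
  # save the surrounding kernel log for the report
  dmesg > /tmp/hung-task-dmesg.txt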
* Re: Deadlock on 3.18.5
  2015-02-04 23:05 Deadlock on 3.18.5 Juergen Fitschen
@ 2015-02-05  4:03 ` Duncan
  2015-02-05 11:04 ` Juergen Fitschen
  1 sibling, 0 replies; 6+ messages in thread
From: Duncan @ 2015-02-05 4:03 UTC (permalink / raw)
To: linux-btrfs

Juergen Fitschen posted on Thu, 05 Feb 2015 00:05:33 +0100 as excerpted:

> I just got a deadlock on Linux 3.18.5. [...]
> I am using a freshly installed Ubuntu 14.04 in a virtual machine. The
> block device used for btrfs is an LVM volume offered by the KVM
> hypervisor.
> Unfortunately I was using an old version of my user space tools
> (version 3.12-1), so it might be related to that?

I know little to nothing about the complexities of btrfs in a VM, so I won't attempt an answer there. But about userspace specifically I can say this...

For normal operation, the btrfs userspace version doesn't matter so much. It's the kernel that's vital, as userspace mostly just tells the kernel what to do at a relatively high level (mount/umount/snapshot/delete-snapshot/etc.) and the kernel code handles pretty much everything at the more critical lower levels.

OTOH, if there are problems and you are trying to fix them using userspace tools such as btrfs check, restore, etc., /that/ is when the /userspace/ version needs to be current, as it's then userspace doing the low-level stuff as well.

Since it was during normal operations that this deadlock occurred, it's almost certainly a kernel problem (unless the problem is ultimately traced to a bug in the original mkfs.btrfs that created the filesystem, or the like), and the kernel was current, so it's a current problem. =:^(

Hopefully a dev or other regular with more VM experience will be along to help you further, but I could deal with this much, so I did. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

^ permalink raw reply	[flat|nested] 6+ messages in thread
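A quick sketch of how to check which side is old — the running kernel versus the userspace tools (the Ubuntu package name is an assumption for 14.04):

  uname -r             # running kernel, e.g. 3.18.5-031805-generic
  btrfs --version      # btrfs-progs (userspace) version
  dpkg -l btrfs-tools  # assumed to be the Ubuntu 14.04 package that ships the btrfs tool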
* Re: Deadlock on 3.18.5
  2015-02-04 23:05 Deadlock on 3.18.5 Juergen Fitschen
  2015-02-05  4:03 ` Duncan
@ 2015-02-05 11:04 ` Juergen Fitschen
  2015-02-05 12:47   ` Austin S Hemmelgarn
  1 sibling, 1 reply; 6+ messages in thread
From: Juergen Fitschen @ 2015-02-05 11:04 UTC (permalink / raw)
To: linux-btrfs

Hey,

It's me again. First of all: thanks for the reply, Duncan :)

After detecting the deadlock and posting the stack trace yesterday evening, I left the machine alone and didn't reboot it. The monitoring told me that the whole server (including the hypervisor) became unreachable a few minutes after I fetched the stack trace from syslog.

But now - as I just realised - the server is back alive, the process finished successfully and the volume is accessible. So the deadlock was not a real deadlock, or it resolved itself magically without me doing anything. The syslog reports several stack traces from kworker for about two hours (the time the server was not reachable). Within the first hour syslog was completely silent. Within the second hour the kernel complained that rcu_sched detected stalls on the CPU, at an interval of about three minutes.

What do you think?

Regards, Juergen

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Deadlock on 3.18.5
  2015-02-05 11:04 ` Juergen Fitschen
@ 2015-02-05 12:47   ` Austin S Hemmelgarn
  2015-02-05 15:24     ` Juergen Fitschen
  0 siblings, 1 reply; 6+ messages in thread
From: Austin S Hemmelgarn @ 2015-02-05 12:47 UTC (permalink / raw)
To: Juergen Fitschen, linux-btrfs

On 2015-02-05 06:04, Juergen Fitschen wrote:
> Hey,
>
> It's me again. First of all: thanks for the reply, Duncan :)
>
> After detecting the deadlock and posting the stack trace yesterday evening, I left the machine alone and didn't reboot it. The monitoring told me that the whole server (including the hypervisor) became unreachable a few minutes after I fetched the stack trace from syslog.
>
> But now - as I just realised - the server is back alive, the process finished successfully and the volume is accessible. So the deadlock was not a real deadlock, or it resolved itself magically without me doing anything. The syslog reports several stack traces from kworker for about two hours (the time the server was not reachable). Within the first hour syslog was completely silent. Within the second hour the kernel complained that rcu_sched detected stalls on the CPU, at an interval of about three minutes.
>
> What do you think?

I've actually seen similar behavior without the virtualization when doing large, filesystem-intensive operations with compression enabled. I don't know if this is significant, but it seems to be worse with lzo compression than zlib, and also seems to be worse when compression is enabled at the filesystem level instead of through 'chattr +c'.

I'm not certain, but I think it might have something to do with the somewhat brain-dead default parameters in the default I/O scheduler (the so-called 'completely fair queue', which, as I've said before, was obviously named by a mathematician and not based on its actual behavior), although it seems to be much worse when using the deadline and no-op I/O schedulers.

^ permalink raw reply	[flat|nested] 6+ messages in thread
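For reference, a minimal sketch of the two ways of enabling compression being compared here (device, mount point and directory names are just examples):

  # filesystem-wide, via mount option (the original report used compress-force=lzo)
  mount -o compress=lzo /dev/vdb /mnt/data
  # per-directory/per-file, via the inode attribute; files created underneath inherit it
  chattr +c /mnt/data/results
  lsattr -d /mnt/data/results   # should now show the 'c' attribute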
* Re: Deadlock on 3.18.5
  2015-02-05 12:47 ` Austin S Hemmelgarn
@ 2015-02-05 15:24   ` Juergen Fitschen
  2015-02-05 16:48     ` Austin S Hemmelgarn
  0 siblings, 1 reply; 6+ messages in thread
From: Juergen Fitschen @ 2015-02-05 15:24 UTC (permalink / raw)
To: Austin S Hemmelgarn, linux-btrfs

> On 05 Feb 2015, at 13:47, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
>
> I've actually seen similar behavior without the virtualization when doing large, filesystem-intensive operations with compression enabled.
> I don't know if this is significant, but it seems to be worse with lzo compression than zlib, and also seems to be worse when compression is enabled at the filesystem level instead of through 'chattr +c'.

Zlib isn't as performant as lzo, so zlib creates a bottleneck at the CPU and thereby limits the I/O the volume is exposed to. So our problem might be related to intensive operations on the volume.

> I'm not certain, but I think it might have something to do with the somewhat brain-dead default parameters in the default I/O scheduler (the so-called 'completely fair queue', which, as I've said before, was obviously named by a mathematician and not based on its actual behavior), although it seems to be much worse when using the deadline and no-op I/O schedulers.

Good idea. I had a look at my configuration of the "stack" for the block devices and their queuing and caching. My setup looks like this (with default settings - I made no adjustments):

* 2 HDDs
* Hewlett-Packard Company Smart Array Gen8 Controllers (rev 01)
  [With 1 GB write cache. Other black magic seems to be included. Combines both HDDs into a RAID1.]
* Block device driver
* IO scheduler: deadline
* LVM
* QEMU
  [With writeback cache. Should I change it to "none"? The storage controller has a write cache included.]
* virtio-blk
* btrfs

As you can see, only one I/O scheduler is involved. The VM by default seems not to use any I/O scheduler. I checked this by executing "cat /sys/block/vd*/queue/scheduler" on the VM and it reported "none".

Regards, Juergen

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Deadlock on 3.18.5
  2015-02-05 15:24 ` Juergen Fitschen
@ 2015-02-05 16:48   ` Austin S Hemmelgarn
  0 siblings, 0 replies; 6+ messages in thread
From: Austin S Hemmelgarn @ 2015-02-05 16:48 UTC (permalink / raw)
To: Juergen Fitschen, linux-btrfs

On 2015-02-05 10:24, Juergen Fitschen wrote:
>
>> On 05 Feb 2015, at 13:47, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
>>
>> I've actually seen similar behavior without the virtualization when doing large, filesystem-intensive operations with compression enabled.
>> I don't know if this is significant, but it seems to be worse with lzo compression than zlib, and also seems to be worse when compression is enabled at the filesystem level instead of through 'chattr +c'.
> Zlib isn't as performant as lzo, so zlib creates a bottleneck at the CPU and thereby limits the I/O the volume is exposed to. So our problem might be related to intensive operations on the volume.
>
>> I'm not certain, but I think it might have something to do with the somewhat brain-dead default parameters in the default I/O scheduler (the so-called 'completely fair queue', which, as I've said before, was obviously named by a mathematician and not based on its actual behavior), although it seems to be much worse when using the deadline and no-op I/O schedulers.
> Good idea. I had a look at my configuration of the "stack" for the block devices and their queuing and caching. My setup looks like this (with default settings - I made no adjustments):
>
> * 2 HDDs
> * Hewlett-Packard Company Smart Array Gen8 Controllers (rev 01)
>   [With 1 GB write cache. Other black magic seems to be included. Combines both HDDs into a RAID1.]
> * Block device driver
> * IO scheduler: deadline
> * LVM
> * QEMU
>   [With writeback cache. Should I change it to "none"? The storage controller has a write cache included.]
> * virtio-blk
> * btrfs
>
> As you can see, only one I/O scheduler is involved. The VM by default seems not to use any I/O scheduler. I checked this by executing "cat /sys/block/vd*/queue/scheduler" on the VM and it reported "none".

Yeah, thankfully Linux is smart enough to turn off the I/O scheduler for block devices that it can see are virtualized.

At the very least, I would suggest changing QEMU to not use caching. I've found that host-side caching for virtualized block devices tends to just make things slower unless the block device is imported over the network (i.e. iSCSI/ATAoE/NBD). This is especially significant when you have a storage controller with such a big write cache (I would make sure that the write cache on the storage controller is non-volatile first, though; if it isn't, you should probably use writethrough mode for QEMU's caching).

Additionally, you might want to try using CFQ for the I/O scheduler on the host side, albeit with some non-default parameters (the deadline scheduler tends to get very laggy with really heavy random-access workloads). I've found that it does do well when you actually take the time to fine tune things.

The particular parameters I would suggest some experimentation with for CFQ are:

* Under /sys/block/<device>/queue: nomerges, rq_affinity, max_sectors_kb
* Under /sys/block/<device>/queue/iosched: group_idle, quantum, slice_idle, back_seek_max, back_seek_penalty

There is good information on what each of these does in the kernel sources under Documentation/block/queue-sysfs.txt and Documentation/block/cfq-iosched.txt.

Once you find a set of parameters that works well, I'd suggest writing some simple udev rules to automatically set them on boot/device enumeration.

FWIW, I've found that the following parameters provide near-optimal performance for the SSD in my laptop:

queue/nomerges=1
queue/rq_affinity=2
queue/max_sectors_kb=16387 (16MB, which is 4x the erase-block size on the SSD)
queue/iosched/group_idle=8
queue/iosched/quantum=128 (this also happens to be equal to the device's NCQ queue depth)
queue/iosched/slice_idle=0

Using these settings, the time from the boot-loader handing off execution to the kernel to having a login prompt is about 45 seconds. With the default CFQ parameters, it takes almost 150 seconds, so fine tuning here can provide a very noticeable performance improvement.

^ permalink raw reply	[flat|nested] 6+ messages in thread
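A sketch of the udev-rule approach suggested above, reusing the example values from the laptop SSD (the rule file name and matched device are assumptions, and the numbers would need retuning for the HP RAID1 setup):

  cat <<'EOF' > /etc/udev/rules.d/60-io-tuning.rules
  # select CFQ first so that the iosched/* attributes below actually exist
  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sda", ATTR{queue/scheduler}="cfq"
  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sda", ATTR{queue/nomerges}="1", ATTR{queue/rq_affinity}="2", ATTR{queue/iosched/group_idle}="8", ATTR{queue/iosched/quantum}="128", ATTR{queue/iosched/slice_idle}="0"
  EOF
  # apply without waiting for the next boot/device enumeration
  udevadm control --reload-rules
  udevadm trigger --subsystem-match=block --action=change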
end of thread, other threads:[~2015-02-05 16:48 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-04 23:05 Deadlock on 3.18.5 Juergen Fitschen
2015-02-05  4:03 ` Duncan
2015-02-05 11:04 ` Juergen Fitschen
2015-02-05 12:47   ` Austin S Hemmelgarn
2015-02-05 15:24     ` Juergen Fitschen
2015-02-05 16:48       ` Austin S Hemmelgarn