linux-btrfs.vger.kernel.org archive mirror
* Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04
@ 2014-07-29  8:04 Peter Waller
  2014-07-29  9:20 ` Peter Waller
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Waller @ 2014-07-29  8:04 UTC (permalink / raw)
  To: linux-btrfs

Hi All,

I've reported a bug with Ubuntu here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711

The machine in question has one BTRFS volume which is 87% full and
lives on a Logical Volume Manager (LVM) block device on top of a
single Amazon Elastic Block Store (EBS) device.

We have other machines in a similar configuration which have not
displayed this behaviour.

The one thing which makes this machine different is that it has
directories which contain many thousands of files. We don't make heavy
use of subvolumes or snapshots.
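
For context, a rough way to spot such directories (the mount point
below is a placeholder, not our real path) is something like:

  find /mnt/data -xdev -type d \
    -exec sh -c 'printf "%s %s\n" "$(ls -A "$1" | wc -l)" "$1"' sh {} \; \
    | sort -rn | head

which counts the entries in each directory and lists the largest.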

More details follow:

# cat /proc/version_signature
Ubuntu 3.13.0-32.57-generic 3.13.11.4

The machine had a soft-lockup with messages like this appearing on the console:

[246736.752053] INFO: rcu_sched self-detected stall on CPU { 0}
(t=2220246 jiffies g=35399662 c=35399661 q=0)
[246736.756059] INFO: rcu_sched detected stalls on CPUs/tasks: { 0}
(detected by 1, t=2220247 jiffies, g=35399662, c=35399661, q=0)
[246764.192014] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u30:2:1828]
[246764.212058] BUG: soft lockup - CPU#1 stuck for 23s! [btrfs-transacti:492]


After the first lockup and reboot, the following messages were in
dmesg. I ignored them because, after some research, I saw that they
had been changed to warnings and were considered non-harmful, and a
btrfs scrub performed afterwards completed without error:


[ 77.609490] BTRFS error (device dm-0): block group 10766778368 has
wrong amount of free space
[ 77.613678] BTRFS error (device dm-0): failed to load free space
cache for block group 10766778368
[ 77.643801] BTRFS error (device dm-0): block group 19356712960 has
wrong amount of free space
[ 77.648952] BTRFS error (device dm-0): failed to load free space
cache for block group 19356712960
[ 77.926325] BTRFS error (device dm-0): block group 20430454784 has
wrong amount of free space
[ 77.931078] BTRFS error (device dm-0): failed to load free space
cache for block group 20430454784
[ 78.111437] BTRFS error (device dm-0): block group 21504196608 has
wrong amount of free space
[ 78.116165] BTRFS error (device dm-0): failed to load free space
cache for block group 21504196608
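
For reference, such a scrub can be run and checked with (the mount
point is a placeholder):

  $ sudo btrfs scrub start /mnt/data
  $ sudo btrfs scrub status /mnt/data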


After the second lockup and reboot, these messages appeared:


[ 45.390221] BTRFS error (device dm-0): free space inode generation
(0) did not match free space cache generation (70012)
[ 45.413472] BTRFS error (device dm-0): free space inode generation
(0) did not match free space cache generation (70012)
[ 467.423961] BTRFS error (device dm-0): block group 518646661120 has
wrong amount of free space
[ 467.429251] BTRFS error (device dm-0): failed to load free space
cache for block group 518646661120


I would like to know whether this second set of messages is harmful
and whether any remedial action is needed in response to them.
Searching for messages similar to my lockup, I found a report which
suggested the problem may be fixed in 3.14.
Any advice appreciated,

Thanks,

- Peter

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04
  2014-07-29  8:04 Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04 Peter Waller
@ 2014-07-29  9:20 ` Peter Waller
  2014-07-30 10:02   ` Peter Waller
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Waller @ 2014-07-29  9:20 UTC (permalink / raw)
  To: linux-btrfs

Someone on IRC suggested that I clear the free space cache:

> sudo mount -o remount,clear_cache /path/to/dev /path/to/mount
> sudo mount -o remount,space_cache /path/to/dev /path/to/mount


The first command printed `btrfs: disk space caching is enabled` and
the second repeated it, which made me think the second command may
have been unnecessary.
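
A quick way to confirm which mount options actually took effect after
the remount (the mount point is a placeholder) is:

  $ grep /mnt/data /proc/mounts
  $ dmesg | tail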


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04
  2014-07-29  9:20 ` Peter Waller
@ 2014-07-30 10:02   ` Peter Waller
  2014-07-31 15:07     ` Peter Waller
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Waller @ 2014-07-30 10:02 UTC (permalink / raw)
  To: linux-btrfs

The crashes became more frequent: the time between lockups went from
~12 days to ~7 days, ~2 days, ~6 hours, and finally ~1 hour.

Then we upgraded to 3.15.7-031507-generic on the advice of
#ubuntu-kernel and #btrfs on IRC, and it has since been stable for
19.5 hours.

I dug around more in our logs and realised that a previous machine
lockup was probably due to the same problem, so I have an additional
stack trace, reproduced below.

Thanks to those who chipped in on IRC.

- Peter

> [1093202.136107] INFO: task kworker/u30:1:31455 blocked for more than 120 seconds.
> [1093202.141596]       Tainted: GF            3.13.0-30-generic #54-Ubuntu
> [1093202.146201] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [1093202.152408] kworker/u30:1   D ffff8801efc34440     0 31455      2 0x00000000
> [1093202.152416] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1)
> [1093202.152419]  ffff880040d3b768 0000000000000002 ffff8800880f0000 ffff880040d3bfd8
> [1093202.152422]  0000000000014440 0000000000014440 ffff8800880f0000 ffff8801efc34cd8
> [1093202.152426]  ffff8801effcefe8 ffff880040d3b7f0 0000000000000002 ffffffff8114e010
> [1093202.152429] Call Trace:
> [1093202.152435]  [<ffffffff8114e010>] ? wait_on_page_read+0x60/0x60
> [1093202.152439]  [<ffffffff8171ea6d>] io_schedule+0x9d/0x140
> [1093202.152442]  [<ffffffff8114e01e>] sleep_on_page+0xe/0x20
> [1093202.152445]  [<ffffffff8171eff8>] __wait_on_bit_lock+0x48/0xb0
> [1093202.152449]  [<ffffffff8109df64>] ? vtime_common_task_switch+0x24/0x40
> [1093202.152452]  [<ffffffff8114e12a>] __lock_page+0x6a/0x70
> [1093202.152456]  [<ffffffff810aaee0>] ? autoremove_wake_function+0x40/0x40
> [1093202.152474]  [<ffffffffa00e8a0d>] lock_delalloc_pages+0x13d/0x1d0 [btrfs]
> [1093202.152487]  [<ffffffffa00eaf2b>] find_lock_delalloc_range.constprop.43+0x14b/0x1f0 [btrfs]
> [1093202.152499]  [<ffffffffa00ebfa1>] __extent_writepage+0x131/0x750 [btrfs]
> [1093202.152509]  [<ffffffffa00eb040>] ? end_extent_writepage+0x70/0x70 [btrfs]
> [1093202.152521]  [<ffffffffa00ec832>] extent_write_cache_pages.isra.30.constprop.50+0x272/0x3d0 [btrfs]
> [1093202.152532]  [<ffffffffa00edafd>] extent_writepages+0x4d/0x70 [btrfs]
> [1093202.152544]  [<ffffffffa00d3b20>] ? btrfs_real_readdir+0x5b0/0x5b0 [btrfs]
> [1093202.152555]  [<ffffffffa00d1cb8>] btrfs_writepages+0x28/0x30 [btrfs]
> [1093202.152559]  [<ffffffff8115a46e>] do_writepages+0x1e/0x40
> [1093202.152562]  [<ffffffff811e5c50>] __writeback_single_inode+0x40/0x210
> [1093202.152565]  [<ffffffff811e6a07>] writeback_sb_inodes+0x247/0x3e0
> [1093202.152568]  [<ffffffff811e6c3f>] __writeback_inodes_wb+0x9f/0xd0
> [1093202.152571]  [<ffffffff811e6eb3>] wb_writeback+0x243/0x2c0
> [1093202.152574]  [<ffffffff811e86d8>] bdi_writeback_workfn+0x108/0x430
> [1093202.152577]  [<ffffffff810974a8>] ? finish_task_switch+0x128/0x170
> [1093202.152581]  [<ffffffff810838a2>] process_one_work+0x182/0x450
> [1093202.152585]  [<ffffffff81084641>] worker_thread+0x121/0x410
> [1093202.152588]  [<ffffffff81084520>] ? rescuer_thread+0x3e0/0x3e0
> [1093202.152591]  [<ffffffff8108b322>] kthread+0xd2/0xf0
> [1093202.152594]  [<ffffffff8108b250>] ? kthread_create_on_node+0x1d0/0x1d0
> [1093202.152598]  [<ffffffff8172ac3c>] ret_from_fork+0x7c/0xb0
> [1093202.152601]  [<ffffffff8108b250>] ? kthread_create_on_node+0x1d0/0x1d0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04
  2014-07-30 10:02   ` Peter Waller
@ 2014-07-31 15:07     ` Peter Waller
  2014-07-31 15:10       ` Peter Waller
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Waller @ 2014-07-31 15:07 UTC (permalink / raw)
  To: linux-btrfs

I've now reproduced this on 3.15.7-031507-generic and
3.16.0-031600rc7-generic, and have a test case where I can reliably
cause the crash after about 30 seconds of disk activity.

The test case just involves taking a directory tree of ~400GB of
files and copying every file to a new file with .new appended, in the
same directory as the original.
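
In shell terms the test is roughly the following (a sketch with a
placeholder path, not the exact script I used):

  find /mnt/data -type f ! -name '*.new' \
    -exec sh -c 'cp -- "$1" "$1.new"' sh {} \;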

The soft-lockup makes a kernel thread fully occupy a CPU. On the
advice of kdave from IRC I took a stack trace every 10s whilst the
machine remained responsive, which was for 5-10 minutes before the
machine became unresponsive to the network.
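
Such traces can be collected with something along these lines (a
rough sketch assuming the sysrq interface is enabled; not necessarily
my exact commands):

  while sleep 10; do
      # sysrq 'w' dumps the stacks of blocked tasks into the kernel log;
      # sysrq 'l' would dump backtraces of all active CPUs instead
      echo w | sudo tee /proc/sysrq-trigger > /dev/null
      sudo dmesg -c >> /tmp/btrfs-traces.log
  done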

The locked-up thread seems to be reliably stuck in btrfs_find_space_cluster.

The kernel log with the 10s traces is linked below (3.16.0-031600rc7-generic):

https://gist.github.com/pwaller/d40df3badb5cc8a574ef


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04
  2014-07-31 15:07     ` Peter Waller
@ 2014-07-31 15:10       ` Peter Waller
  2014-08-01 11:59         ` Peter Waller
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Waller @ 2014-07-31 15:10 UTC (permalink / raw)
  To: linux-btrfs

I should add that I have reproduced this even after doing `mount -o
clear_cache /dev/... /mnt/...`, unmounting, and remounting with `-o
space_cache`. After the machine lockup and reboot, warnings of the
following form appear:

> [ 117.288248] BTRFS warning (device dm-0): block group 694165700608 has wrong amount of free space
> [ 117.288253] BTRFS warning (device dm-0): failed to load free space cache for block group 694165700608, rebuild it now
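
To be explicit, the sequence described above is of this form (device
and mount point are placeholders, as before):

  $ sudo mount -o clear_cache $DEV $MNT   # forces the free space cache to be rebuilt
  $ sudo umount $MNT
  $ sudo mount -o space_cache $DEV $MNT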

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04
  2014-07-31 15:10       ` Peter Waller
@ 2014-08-01 11:59         ` Peter Waller
  2014-08-05 10:23           ` Peter Waller
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Waller @ 2014-08-01 11:59 UTC (permalink / raw)
  To: linux-btrfs

I've reproduced these issues on a single-core machine. Unlike the
other machines, which seem to deadlock after 5-10 minutes, it doesn't
appear to become completely unresponsive even after 12 hours of
copying, but it does spend the vast majority of the time (possibly
all of it) at 100% system CPU with no I/O traffic.

The filesystem was originally created on an Ubuntu Saucy 13.10
machine (and therefore with an older kernel and older btrfs-tools) on
a PV guest with a non-SSD block device.

The current machine is an HVM guest and the data is now on an SSD.

Below are links to the dstat output and the dmesg log which show this
(note the sys column on the left and the IOPS column on the right).
These were taken whilst copying files of varying sizes from many
directories, each to a sibling file in the same directory.

`dstat -tcdnymlr 5`, which shows high sys CPU but no IO traffic:
https://gist.github.com/pwaller/cb8d088ebceb2707d24b

INFO: task sync:5906 blocked for more than 120 seconds.
https://gist.github.com/pwaller/574a369ea4b65fe125b9#file-dmesg-log-L541

INFO: task btrfs-transacti:2531 blocked for more than 120 seconds.
https://gist.github.com/pwaller/574a369ea4b65fe125b9#file-dmesg-log-L1015

INFO: task btrfs-flush_del:16764 blocked for more than 120 seconds.
https://gist.github.com/pwaller/574a369ea4b65fe125b9#file-dmesg-log-L1041

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04
  2014-08-01 11:59         ` Peter Waller
@ 2014-08-05 10:23           ` Peter Waller
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Waller @ 2014-08-05 10:23 UTC (permalink / raw)
  To: linux-btrfs

My current interpretation is that this is some pathological condition
caused by never rebalancing while being nearly out of space for
allocating more metadata, and that it is therefore rarely seen by
anyone else (because most users rebalance regularly).
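
One way to see how full the allocation is, and in particular how much
room is left for metadata (the mount point is a placeholder), is:

  $ sudo btrfs filesystem show /mnt/data
  $ sudo btrfs filesystem df /mnt/data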

See this thread for details about rebalancing and running out of
space, "ENOSPC with mkdir and rename" (2014-08-02):

http://thread.gmane.org/gmane.comp.file-systems.btrfs/37415

I haven't had any lockups since July, and I'm now trialling a nightly
rebalance:

$ btrfs filesystem balance start -dusage=50 -musage=10 "$mount"
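
A nightly schedule can be as simple as a cron entry along these lines
(a sketch; the btrfs path, mount point and log file are placeholders):

  # /etc/cron.d/btrfs-balance (hypothetical): run the balance at 03:00 every night
  0 3 * * * root /sbin/btrfs filesystem balance start -dusage=50 -musage=10 /mnt/data >> /var/log/btrfs-balance.log 2>&1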

I'll report back if I encounter further problems.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-08-05 10:23 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-29  8:04 Machine lockup due to btrfs-transaction on AWS EC2 Ubuntu 14.04 Peter Waller
2014-07-29  9:20 ` Peter Waller
2014-07-30 10:02   ` Peter Waller
2014-07-31 15:07     ` Peter Waller
2014-07-31 15:10       ` Peter Waller
2014-08-01 11:59         ` Peter Waller
2014-08-05 10:23           ` Peter Waller
