* [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
@ 2015-07-20 13:56 fdmanana
2015-07-21 23:53 ` Omar Sandoval
0 siblings, 1 reply; 5+ messages in thread
From: fdmanana @ 2015-07-20 13:56 UTC (permalink / raw)
To: linux-btrfs; +Cc: osandov, Filipe Manana
From: Filipe Manana <fdmanana@suse.com>
Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
finishing block group creation"), introduced in 4.2-rc1, the following
test was failing due to exhaustion of the system array in the superblock:
#!/bin/bash
truncate -s 100T big.img
mkfs.btrfs big.img
mount -o loop big.img /mnt/loop
num=5
sz=10T
for ((i = 0; i < $num; i++)); do
echo fallocate $i $sz
fallocate -l $sz /mnt/loop/testfile$i
done
btrfs filesystem sync /mnt/loop
for ((i = 0; i < $num; i++)); do
echo rm $i
rm /mnt/loop/testfile$i
btrfs filesystem sync /mnt/loop
done
umount /mnt/loop
This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
allocation of system block groups. This happened because the test creates
a large number of data block groups per transaction and, when committing
the transaction, we start the writeout of the block group caches for all
the new (dirty) block groups, which results in pre-allocating space
for each block group's free space cache using the same transaction handle.
That in turn often leads to the creation of more block groups, all of which
get attached to the new_bgs list of the same transaction handle, to the
point of getting a list with over 1500 elements. Creating each new block
group requires reserving space in the chunk block reserve and often
creating a new system block group too.
So that made us quickly exhaust the chunk block reserve/system space info,
because as of the commit mentioned before we reserve space for each
new block group in the chunk block reserve. Before that commit we would
not, and would at most allocate one new system block group, therefore
only ensuring that there was enough space in the system space info to
allocate 1 new block group even if we ended up allocating thousands of
new block groups using the same transaction handle. That worked most of
the time because the required space computed at check_system_chunk() is
very pessimistic (it assumes a chunk tree height of BTRFS_MAX_LEVEL (8) and
that all nodes/leafs in a path will be COWed and split), and since the
updates to the chunk tree all happen at btrfs_create_pending_block_groups()
it is unlikely that a path needs to be COWed more than once (unless
writepages() for the btree inode is called by mm in between), so that
compensated for the need of creating any new nodes/leafs in the chunk
tree.
So fix this by ensuring we don't accumulate too large a list of new block
groups in a transaction handle's new_bgs list: insert/update the
chunk tree for all accumulated new block groups and release the unused
space from the chunk block reserve whenever the list becomes sufficiently
large. This is a generic solution, even though the problem currently can
only happen when starting the writeout of the free space caches for all
dirty block groups (btrfs_start_dirty_block_groups()).
Reported-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/extent-tree.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 171312d..07204bf 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4227,6 +4227,24 @@ out:
space_info->chunk_alloc = 0;
spin_unlock(&space_info->lock);
mutex_unlock(&fs_info->chunk_mutex);
+ /*
+ * When we allocate a new chunk we reserve space in the chunk block
+ * reserve to make sure we can COW nodes/leafs in the chunk tree or
+ * add new nodes/leafs to it if we end up needing to do it when
+ * inserting the chunk item and updating device items as part of the
+ * second phase of chunk allocation, performed by
+ * btrfs_finish_chunk_alloc(). So make sure we don't accumulate a
+ * large number of new block groups to create in our transaction
+ * handle's new_bgs list to avoid exhausting the chunk block reserve
+ * in extreme cases - like having a single transaction create many new
+ * block groups when starting to write out the free space caches of all
+ * the block groups that were made dirty during the lifetime of the
+ * transaction.
+ */
+ if (trans->chunk_bytes_reserved >= (2 * 1024 * 1024ull)) {
+ btrfs_create_pending_block_groups(trans, trans->root);
+ btrfs_trans_release_chunk_metadata(trans);
+ }
return ret;
}
--
2.1.3
* Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
2015-07-20 13:56 [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock fdmanana
@ 2015-07-21 23:53 ` Omar Sandoval
2015-12-13 10:29 ` Alex Lyakas
0 siblings, 1 reply; 5+ messages in thread
From: Omar Sandoval @ 2015-07-21 23:53 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs, Filipe Manana
On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
> finishing block group creation"), introduced in 4.2-rc1, the following
> test was failing due to exhaustion of the system array in the superblock:
> [...]
> Reported-by: Omar Sandoval <osandov@fb.com>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Thanks a lot for taking a look.
Tested-by: Omar Sandoval <osandov@fb.com>
--
Omar
* Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
2015-07-21 23:53 ` Omar Sandoval
@ 2015-12-13 10:29 ` Alex Lyakas
2015-12-13 15:45 ` Filipe Manana
0 siblings, 1 reply; 5+ messages in thread
From: Alex Lyakas @ 2015-12-13 10:29 UTC (permalink / raw)
To: Omar Sandoval; +Cc: fdmanana, linux-btrfs, Filipe Manana
Hi Filipe Manana,
Can't the call to btrfs_create_pending_block_groups() cause a
deadlock, like in
http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
call updates the device tree, and we may be calling do_chunk_alloc()
from find_free_extent() when holding a lock on the device tree root
(because we want to COW a block of the device tree).
My understanding from Josef's chunk allocator rework
(http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
when allocating a new chunk we do not immediately update the
device/chunk tree. We keep the new chunk in "pending_chunks" and in
"new_bgs" on a transaction handle, and we actually update the
chunk/device tree only when we are done with a particular transaction
handle. This way we avoid that sort of deadlocks.
But this patch breaks this rule, as it may make us update the
device/chunk tree in the context of chunk allocation, which is the
scenario that the rework was meant to avoid.
Can you please point me at what I am missing?
Thanks,
Alex.
On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval <osandov@fb.com> wrote:
> [...]
* Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
2015-12-13 10:29 ` Alex Lyakas
@ 2015-12-13 15:45 ` Filipe Manana
2015-12-13 16:15 ` Alex Lyakas
0 siblings, 1 reply; 5+ messages in thread
From: Filipe Manana @ 2015-12-13 15:45 UTC (permalink / raw)
To: Alex Lyakas; +Cc: Omar Sandoval, linux-btrfs, Filipe Manana
On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas <alex@zadarastorage.com> wrote:
> Hi Filipe Manana,
>
> Can't the call to btrfs_create_pending_block_groups() cause a
> deadlock, like in
> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
> call updates the device tree, and we may be calling do_chunk_alloc()
> from find_free_extent() when holding a lock on the device tree root
> (because we want to COW a block of the device tree).
>
> My understanding from Josef's chunk allocator rework
> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
> when allocating a new chunk we do not immediately update the
> device/chunk tree. We keep the new chunk in "pending_chunks" and in
> "new_bgs" on a transaction handle, and we actually update the
> chunk/device tree only when we are done with a particular transaction
> handle. This way we avoid that sort of deadlocks.
>
> But this patch breaks this rule, as it may make us update the
> device/chunk tree in the context of chunk allocation, which is the
> scenario that the rework was meant to avoid.
>
> Can you please point me at what I am missing?
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff
* Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
2015-12-13 15:45 ` Filipe Manana
@ 2015-12-13 16:15 ` Alex Lyakas
0 siblings, 0 replies; 5+ messages in thread
From: Alex Lyakas @ 2015-12-13 16:15 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs, Filipe Manana
Thank you, Filipe. Now it is clearer.
Fortunately, in my kernel 3.18 do_chunk_alloc() does not call
btrfs_create_pending_block_groups(), so I cannot hit this deadlock.
But I can hit the issue that this call is meant to fix.
Thanks,
Alex.
On Sun, Dec 13, 2015 at 5:45 PM, Filipe Manana <fdmanana@kernel.org> wrote:
> [...]
end of thread, other threads:[~2015-12-13 16:15 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-07-20 13:56 [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock fdmanana
2015-07-21 23:53 ` Omar Sandoval
2015-12-13 10:29 ` Alex Lyakas
2015-12-13 15:45 ` Filipe Manana
2015-12-13 16:15 ` Alex Lyakas