From: Vladimir Panteleev <thecybershadow@gmail.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: "kernel BUG" and segmentation fault with "device delete"
Date: Thu, 8 Aug 2019 20:40:53 +0000 [thread overview]
Message-ID: <9f0f001c-54a6-9db4-3f1d-668d90fde023@gmail.com> (raw)
In-Reply-To: <811c2c41-e795-9562-0e8b-033b404bf43d@gmail.com>
Did more digging today. Here is where the -ENOSPC is coming from:
btrfs_run_delayed_refs -> // WARN here
__btrfs_run_delayed_refs ->
btrfs_run_delayed_refs_for_head ->
run_one_delayed_ref ->
run_delayed_data_ref ->
__btrfs_inc_extent_ref ->
insert_extent_backref ->
insert_extent_data_ref ->
btrfs_insert_empty_item ->
btrfs_insert_empty_items ->
btrfs_search_slot ->
split_leaf ->
alloc_tree_block_no_bg_flush ->
btrfs_alloc_tree_block ->
use_block_rsv ->
block_rsv_use_bytes / reserve_metadata_bytes
In use_block_rsv, first block_rsv_use_bytes (with the
BTRFS_BLOCK_RSV_DELREFS one) fails, then reserve_metadata_bytes fails,
then block_rsv_use_bytes with global_rsv fails again.
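To make sure I'm reading that fallback order right, here is a toy Python model of it (not kernel code; the names mirror the kernel functions, the structures and numbers are made up, and the logic is heavily simplified):

```python
# Toy model of the three-step reservation fallback in use_block_rsv.
# The real code lives in the btrfs extent-tree/block-rsv code; this only
# models the ordering: given rsv -> fresh reservation -> global rsv.
ENOSPC = -28

def block_rsv_use_bytes(rsv, num_bytes):
    # Succeed only if this rsv still has enough reserved bytes.
    if rsv["reserved"] >= num_bytes:
        rsv["reserved"] -= num_bytes
        return 0
    return ENOSPC

def reserve_metadata_bytes(space_info, num_bytes):
    # Try to carve a fresh reservation out of free metadata space.
    if space_info["free"] >= num_bytes:
        space_info["free"] -= num_bytes
        return 0
    return ENOSPC

def use_block_rsv(block_rsv, global_rsv, space_info, blocksize):
    # 1) the rsv handed in (BTRFS_BLOCK_RSV_DELREFS in this trace)
    if block_rsv_use_bytes(block_rsv, blocksize) == 0:
        return 0
    # 2) try to reserve fresh metadata bytes
    if reserve_metadata_bytes(space_info, blocksize) == 0:
        return 0
    # 3) last resort: dip into the global reserve
    return block_rsv_use_bytes(global_rsv, blocksize)

# The failing situation: all three steps come back empty-handed.
delrefs = {"reserved": 0}
global_rsv = {"reserved": 0}
space_info = {"free": 0}
print(use_block_rsv(delrefs, global_rsv, space_info, 16384))  # -28
```

In the trace above all three steps fail, which is the only way use_block_rsv can return -ENOSPC in this model.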
My understanding of this in plain English is as follows: btrfs attempted
to finalize a transaction and add the queued backreferences. When doing
so, it ran out of space in a B-tree, and attempted to allocate a new
tree block; however, in doing so, it hit the limit it reserved for
itself for how much space it was going to use during that operation, so
it gave up on the whole thing, which led everything to go downhill from
there. Is this anywhere close to being accurate?
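To spell out the failure mode I mean: an item insert forces a leaf split, the split needs a fresh tree block, and the tree block has to be paid for out of a reservation that is already empty. A toy Python model (entirely hypothetical capacity and names, nothing like real btrfs leaves):

```python
# Toy model: inserting into a full b-tree leaf requires a split, and
# the split must pay for a new tree block from the reservation.
ENOSPC = -28
LEAF_CAPACITY = 4  # made-up; real btrfs leaves hold far more items

def insert_item(leaf, item, rsv_reserved, blocksize=16384):
    if len(leaf) < LEAF_CAPACITY:
        leaf.append(item)
        return 0, rsv_reserved
    # leaf full -> split_leaf -> btrfs_alloc_tree_block -> use_block_rsv
    if rsv_reserved < blocksize:
        return ENOSPC, rsv_reserved  # nothing left to pay with
    leaf.append(item)                # (the actual split is elided here)
    return 0, rsv_reserved - blocksize

leaf, rsv = [], 0
for ref in range(5):
    err, rsv = insert_item(leaf, ref, rsv)
print(err)  # -28: the fifth insert needs a split, but the rsv is empty
```

With a non-zero reservation the fifth insert would succeed and simply drain the rsv by one blocksize; with it at zero, the insert fails exactly the way the backtrace does.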
BTW, the DELREFS rsv is 0 / 7GB reserved/free. So, it looks like it
didn't expect to allocate the new tree node at all? Perhaps it should be
using some other rsv for those?
Am I on the right track, or should I be discussing this elsewhere / with
someone else?
On 20/07/2019 10.59, Vladimir Panteleev wrote:
> Hi,
>
> I've done a few experiments and here are my findings.
>
> First I probably should describe the filesystem: it is a snapshot
> archive, containing a lot of snapshots for 4 subvolumes, totaling 2487
> subvolumes/snapshots. There are also a few files (inside the snapshots)
> that are probably very fragmented. This is probably what causes the bug.
>
> Observations:
>
> - If I delete all snapshots, the bug disappears (device delete succeeds).
> - If I delete all but any single subvolume's snapshots, the bug disappears.
> - For two of the subvolumes, deleting that subvolume's snapshots makes
> the bug disappear; deleting either of the other two subvolumes'
> snapshots does not.
>
> It looks like data from two of the subvolumes' snapshots is involved
> in triggering the bug.
>
> In theory, I guess it would be possible to reduce the filesystem to the
> minimal one causing the bug by iteratively deleting snapshots / files
> and checking if the bug manifests, but it would be extremely
> time-consuming, probably requiring weeks.
>
> Anything else I can do to help diagnose / fix it? Or should I just order
> more HDDs and clone the RAID10 the right way?
>
> On 06/07/2019 05.51, Qu Wenruo wrote:
>>
>>
>> On 2019/7/6 1:13 PM, Vladimir Panteleev wrote:
>> [...]
>>>> I'm not sure if it's the degraded mount cause the problem, as the
>>>> enospc_debug output looks like reserved/pinned/over-reserved space has
>>>> taken up all space, while no new chunk get allocated.
>>>
>>> The problem happens after replace-ing the missing device (which succeeds
>>> in full) and then attempting to remove it, i.e. without a degraded
>>> mount.
>>>
>>>> Would you please try to balance metadata to see if the ENOSPC still
>>>> happens?
>>>
>>> The problem also manifests when attempting to rebalance the metadata.
>>
>> Have you tried to balance just one or two metadata block groups?
>> E.g using -mdevid or -mvrange?
>>
>> And did the problem always happen at the same block group?
>>
>> Thanks,
>> Qu
>>>
>>> Thanks!
>>>
>>
>
--
Best regards,
Vladimir