From: "Sebastian Döring" <moralapostel@gmail.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Balance loops: what we know so far
Date: Wed, 29 Apr 2020 14:23:41 +0200
Message-ID: <CADkZQamNXN6NjBqxun=uupRX1mOKEYDyVN4-ow8Q1qWf=vQouA@mail.gmail.com>
In-Reply-To: <b1e24eb6-a8ee-dd87-b8b7-0c0e95b2a060@gmx.com>
I think I'm also running into this kind of balancing loop. I'm seeing
thousands of these repeating messages in syslog:
Apr 29 11:56:49 rockpro64 kernel: [12044.645247] BTRFS info (device
dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:49 rockpro64 kernel: [12045.134649] BTRFS info (device
dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:50 rockpro64 kernel: [12045.648773] BTRFS info (device
dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:50 rockpro64 kernel: [12046.172767] BTRFS info (device
dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:51 rockpro64 kernel: [12046.666079] BTRFS info (device
dm-3): found 36652 extents, stage: move data extents
Apr 29 11:56:51 rockpro64 kernel: [12047.199652] BTRFS info (device
dm-3): found 36652 extents, stage: move data extents
Can I help you to debug this in any way? I'm not too familiar with the
internal btrfs tools.
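If it helps, I could start by grabbing something along these lines (just a
guess at what might be useful; /mnt/data below stands in for wherever dm-3
is actually mounted here):

uname -r                          # running kernel
btrfs balance status /mnt/data    # is balance still running, and how far along
btrfs filesystem usage /mnt/data
dmesg | grep -c 'found 36652 extents, stage: move data extents'   # rough loop count

Just let me know what else you need.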
Best regards,
Sebastian
On Wed, 29 Apr 2020 at 07:36, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/4/28 10:51 PM, Zygo Blaxell wrote:
> > On Tue, Apr 28, 2020 at 05:54:21PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2020/4/28 12:55 PM, Zygo Blaxell wrote:
> >>> On Mon, Apr 27, 2020 at 03:07:29PM +0800, Qu Wenruo wrote:
> >>>>
> >>>>
> >>>> On 2020/4/12 5:14 AM, Zygo Blaxell wrote:
> >>>>> Since 5.1, btrfs has been prone to getting stuck in semi-infinite loops
> >>>>> in balance and device shrink/remove:
> >>>>>
> >>>>> [Sat Apr 11 16:59:32 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>> [Sat Apr 11 16:59:33 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>> [Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>> [Sat Apr 11 16:59:34 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>> [Sat Apr 11 16:59:35 2020] BTRFS info (device dm-0): found 29 extents, stage: update data pointers
> >>>>>
> >>>>> This is a block group while it's looping, as seen by python-btrfs:
> >>>>>
> >>>>> # share/python-btrfs/examples/show_block_group_contents.py 1934913175552 /media/testfs/
> >>>>> block group vaddr 1934913175552 length 1073741824 flags DATA used 939167744 used_pct 87
> >> [...]
> >>>>>
> >>>>> All of the extent data backrefs are removed by the balance, but the
> >>>>> loop keeps trying to get rid of the shared data backrefs. It has
> >>>>> no effect on them, but keeps trying anyway.
> >>>>
> >>>> I guess this shows a pretty good clue.
> >>>>
> >>>> I was always thinking about the reloc tree, but in your case, it's data
> >>>> reloc tree owning them.
> >>>
> >>> In that case, yes. Metadata balances loop too, in the "move data extents"
> >>> stage while data balances loop in the "update data pointers" stage.
> >>
> >> Would you please take an image dump of the fs when the runaway balance
> >> happens?
> >>
> >> Dumps of both a looping metadata block group and a looping data block
> >> group would greatly help.
> >
> > There are two problems with this:
> >
> > 1) my smallest test filesystems have 29GB of metadata,
> >
> > 2) the problem is not reproducible with an image.
>
> OK, then when such a loop happens, what we need is:
> - Which block group is looping
> Obviously
>
> - The extent tree dump of that block group
> You have done a pretty good job on that.
>
> - Follow the backref to reach the root
> In the balance case, the offending backref should always use the parent
> bytenr, so you need to walk up step by step to reach the root, which
> should have a backref pointing back to itself.
>
> - Root tree dump of the looping fs.
> To find out which reloc tree the backref chain belongs to, and to show
> the ref count of that reloc tree.
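>
> For concreteness, a rough sketch of how those dumps could be collected
> with btrfs-progs (the device node below is a placeholder, and the vaddr
> is the looping block group from the example above; note that dump-tree
> reads the device directly, so output taken while the fs is mounted and
> balancing may be slightly inconsistent):
>
> # extent tree: find the looping block group and its remaining backrefs
> btrfs inspect-internal dump-tree -t extent /dev/dm-0 > extent-tree.txt
> grep -A 30 '(1934913175552 BLOCK_GROUP_ITEM' extent-tree.txt
>
> # root tree: all tree roots, including the reloc trees and their refs
> btrfs inspect-internal dump-tree -t root /dev/dm-0 > root-tree.txt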
>
> >
> > I've tried using VM snapshots to put a filesystem into a reproducible
> > looping state. A block group that loops on one boot doesn't repeatably
> > loop on another boot from the same initial state; however, once a
> > booted system starts looping, it continues to loop even after the
> > balance is cancelled and restarted, whether on the same block group or
> > on other random block groups.
> >
> > I have production filesystems with tens of thousands of block groups
> > and almost all of them loop (as I said before, I cannot complete any
> > RAID reshapes with 5.1+ kernels). They can't _all_ be bad.
> >
> > Cancelling a balance (usually) doesn't recover. Rebooting does.
> > The change that triggered this changes the order of operations in
> > the kernel. That smells like a runtime thing to me.
> >
> >>>> In that case, data reloc tree is only cleaned up at the end of
> >>>> btrfs_relocate_block_group().
> >>>>
> >>>> Thus it is never cleaned up until we exit the balance loop.
> >>>>
> >>>> I'm not sure why this is happening only after I extended the lifespan of
> >>>> reloc tree (not data reloc tree).
> >>>
> >>> I have been poking around with printk to trace what it's doing in the
> >>> looping and non-looping cases. It seems to be very similar up to
> >>> calling merge_reloc_root, merge_reloc_roots, unset_reloc_control,
> >>> btrfs_block_rsv_release, btrfs_commit_transaction, clean_dirty_subvols,
> >>> btrfs_free_block_rsv. In the looping cases, everything up to those
> >>> functions seems the same on every loop except the first one.
> >>>
> >>> In the non-looping cases, those functions do something different than
> >>> the looping cases: the extents disappear in the next loop, and the
> >>> balance finishes.
> >>>
> >>> I haven't figured out _what_ is different yet. I need more cycles to
> >>> look at it.
> >>>
> >>> Your extend-the-lifespan-of-reloc-tree patch moves one of the
> >>> functions--clean_dirty_subvols (or btrfs_drop_snapshot)--to a different
> >>> place in the call sequence. It was in merge_reloc_roots before the
> >>> transaction commit, now it's in relocate_block_group after transaction
> >>> commit. My guess is that the problem lies somewhere in how the behavior
> >>> of these functions has been changed by calling them in a different
> >>> sequence.
> >>>
> >>>> But anyway, would you like to give a try of the following patch?
> >>>> https://patchwork.kernel.org/patch/11511241/
> >>>
> >>> I'm not sure how this patch could work. We are hitting the found_extents
> >>> counter every time through the loop. It's returning thousands of extents
> >>> each time.
> >>>
> >>>> It should make us exit the balance so long as we have no extra
> >>>> extent to relocate.
> >>>
> >>> The problem is not that we have no extents to relocate. The problem is
> >>> that we don't successfully get rid of the extents we do find, so we keep
> >>> finding them over and over again.
> >>
> >> That's very strange.
> >>
> >> As you can see, relocate_block_group() will clean up the reloc trees.
> >>
> >> This means either we have reloc trees still in use and not cleaned up,
> >> or some tracing mechanism is not working properly.
> >
> > Can you point out where in the kernel that happens? If we throw some
> > printks at it we might see something.
>
> The merge happens at merge_reloc_roots().
> The cleanup happens at clean_dirty_subvols().
>
> In theory, all reloc trees will either be marked as orphan (due to new
> changes in the original subvolume) or be merged in merge_reloc_roots(),
> and then queued for cleanup.
> Then in clean_dirty_subvols() all the reloc trees get cleaned up.
>
> If any reloc tree remains after clean_dirty_subvols(), it could be the
> cause of such a loop.
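>
> If sprinkling printk()s around is a hassle, the function tracer might
> show the same thing, assuming both functions survived inlining and show
> up in available_filter_functions. Something along these lines:
>
> cd /sys/kernel/debug/tracing
> grep -E 'merge_reloc_roots|clean_dirty_subvols' available_filter_functions
> echo merge_reloc_roots clean_dirty_subvols > set_ftrace_filter
> echo function > current_tracer
> echo 1 > tracing_on
> # reproduce the looping balance, then save the recorded hits:
> cat trace > /tmp/balance-loop-trace.txt
>
> That would at least tell us whether clean_dirty_subvols() actually runs
> on every pass of the loop.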
>
> Thanks,
> Qu
> >
> >> Anyway, an image dump with the dead-looping block group specified would
> >> provide a good hint for this long-standing problem.
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> In testing, the patch has no effect:
> >>>
> >>> [Mon Apr 27 23:36:15 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:21 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:27 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:32 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:38 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:44 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:50 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:36:56 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:01 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:07 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:13 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:19 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:24 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>> [Mon Apr 27 23:37:30 2020] BTRFS info (device dm-0): found 4800 extents, stage: update data pointers
> >>>
> >>> The above is the tail end of 3320 loops on a single block group.
> >>>
> >>> I switched to a metadata block group and it's on the 9th loop:
> >>>
> >>> # btrfs balance start -mconvert=raid1 /media/testfs/
> >>> [Tue Apr 28 00:09:47 2020] BTRFS info (device dm-0): found 34977 extents, stage: move data extents
> >>> [Tue Apr 28 00:12:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:18:46 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:23:24 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:25:54 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:28:17 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:30:35 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:32:45 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>> [Tue Apr 28 00:37:01 2020] BTRFS info (device dm-0): found 26475 extents, stage: move data extents
> >>>
> >>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> This is "semi-infinite" because it is possible for the balance to
> >>>>> terminate if something removes those 29 extents (e.g. by looking up the
> >>>>> extent vaddrs with 'btrfs ins log' then feeding the references to 'btrfs
> >>>>> fi defrag' will reduce the number of inline shared data backref objects.
> >>>>> When it's reduced all the way to zero, balance starts up again, usually
> >>>>> promptly getting stuck on the very next block group. If the _only_
> >>>>> thing running on the filesystem is balance, it will not stop looping.
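> >>>>>
> >>>>> Spelled out, that manual workaround is roughly the following ('ins log'
> >>>>> being short for inspect-internal logical-resolve; the vaddr is a
> >>>>> placeholder for one of the stuck extents):
> >>>>>
> >>>>> # resolve a stuck extent vaddr to the file paths referencing it
> >>>>> btrfs inspect-internal logical-resolve <extent vaddr> /media/testfs
> >>>>> # rewriting those files is what drops the inline shared data backrefs
> >>>>> btrfs filesystem defragment <each path printed above>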
> >>>>>
> >>>>> Bisection points to commit d2311e698578 "btrfs: relocation: Delay reloc
> >>>>> tree deletion after merge_reloc_roots" as the first commit where the
> >>>>> balance loops can be reproduced.
> >>>>>
> >>>>> I tested with commit 59b2c371052c "btrfs: check commit root generation
> >>>>> in should_ignore_root" as well as the rest of misc-next, but the balance
> >>>>> loops are still easier to reproduce than to avoid.
> >>>>>
> >>>>> Once it starts happening on a filesystem, it seems to happen very
> >>>>> frequently. It is not possible to reshape a RAID array of more than a
> >>>>> few hundred GB on kernels after 5.0. I can get maybe 50-100 block groups
> >>>>> completed in a resize or balance after a fresh boot, then balance gets
> >>>>> stuck in loops after that. With the fast balance cancel patches it's
> >>>>> possible to recover from the loop, but futile, since the next balance
> >>>>> will almost always also loop, even if it is passed a different block
> >>>>> group. I've had to downgrade to 5.0 or 4.19 to complete any RAID
> >>>>> reshaping work.
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
> >
>