balancing every night broke balancing so now I can't balance anymore?

Linux Btrfs filesystem development

* balancing every night broke balancing so now I can't balance anymore?
@ 2017-05-13 20:54 Marc MERLIN
  2017-05-14  7:34 ` Duncan
  2017-05-14 19:13 ` Hans van Kranenburg
  0 siblings, 2 replies; 13+ messages in thread
From: Marc MERLIN @ 2017-05-13 20:54 UTC (permalink / raw)
  To: linux-btrfs

Kernel 4.11, btrfs-progs v4.7.3

I run scrub and balance every night, been doing this for 1.5 years on this
filesystem.
But it has just started failing:
saruman:~# btrfs balance start -musage=0  /mnt/btrfs_pool1
Done, had to relocate 0 out of 235 chunks
saruman:~# btrfs balance start -dusage=0  /mnt/btrfs_pool1
Done, had to relocate 0 out of 235 chunks

saruman:~# btrfs balance start -musage=1  /mnt/btrfs_pool1
ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device
saruman:~# btrfs balance start -dusage=10  /mnt/btrfs_pool1
Done, had to relocate 0 out of 235 chunks
saruman:~# btrfs balance start -dusage=20  /mnt/btrfs_pool1
ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device
There may be more info in syslog - try dmesg | tail

BTRFS info (device dm-2): 1 enospc errors during balance
BTRFS info (device dm-2): relocating block group 598566305792 flags data
BTRFS info (device dm-2): 1 enospc errors during balance
BTRFS info (device dm-2): 1 enospc errors during balance
BTRFS info (device dm-2): relocating block group 598566305792 flags data
BTRFS info (device dm-2): 1 enospc errors during balance

saruman:~# btrfs fi show /mnt/btrfs_pool1/
Label: 'btrfs_pool1'  uuid: bc115001-a8d1-445c-9ec9-6050620efd0a
	Total devices 1 FS bytes used 169.73GiB
	devid    1 size 228.67GiB used 228.67GiB path /dev/mapper/pool1

saruman:~# btrfs fi usage /mnt/btrfs_pool1/
Overall:
    Device size:		 228.67GiB
    Device allocated:		 228.67GiB
    Device unallocated:		   1.00MiB
    Device missing:		     0.00B
    Used:			 171.25GiB
    Free (estimated):		  55.32GiB	(min: 55.32GiB)
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,single: Size:221.60GiB, Used:166.28GiB
   /dev/mapper/pool1	 221.60GiB

Metadata,single: Size:7.03GiB, Used:4.96GiB
   /dev/mapper/pool1	   7.03GiB

System,single: Size:32.00MiB, Used:48.00KiB
   /dev/mapper/pool1	  32.00MiB

Unallocated:
   /dev/mapper/pool1	   1.00MiB


How did I get into such a misbalanced state when I balance every night?

My filesystem is not full, I can write just fine, but I sure cannot
rebalance now.

Besides adding another device to add space, is there a way around this
and more generally not getting into that state anymore considering that
I already rebalance every night?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-13 20:54 balancing every night broke balancing so now I can't balance anymore? Marc MERLIN
@ 2017-05-14  7:34 ` Duncan
  2017-05-14 19:13 ` Hans van Kranenburg
  1 sibling, 0 replies; 13+ messages in thread
From: Duncan @ 2017-05-14  7:34 UTC (permalink / raw)
  To: linux-btrfs

Marc MERLIN posted on Sat, 13 May 2017 13:54:31 -0700 as excerpted:

> Kernel 4.11, btrfs-progs v4.7.3
> 
> I run scrub and balance every night, been doing this for 1.5 years on
> this filesystem.
> But it has just started failing:

> saruman:~# btrfs balance start -musage=0  /mnt/btrfs_pool1
> Done, had to relocate 0 out of 235 chunks

> saruman:~# btrfs balance start -dusage=0  /mnt/btrfs_pool1
> Done, had to relocate 0 out of 235 chunks

Those aren't failing (as you likely know, but to explain for others 
following along), there's nothing to do as there are no entirely empty 
chunks.

But...

> saruman:~# btrfs balance start -musage=1  /mnt/btrfs_pool1
> ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device

> saruman:~# btrfs balance start -dusage=10  /mnt/btrfs_pool1
> Done, had to relocate 0 out of 235 chunks

> saruman:~# btrfs balance start -dusage=20  /mnt/btrfs_pool1
> ERROR: error during balancing '/mnt/btrfs_pool1': No space left on device

... Errors there.  ENOSPC

[from dmesg]
> BTRFS info (device dm-2): 1 enospc errors during balance
> BTRFS info (device dm-2): relocating block group 598566305792 flags data
> BTRFS info (device dm-2): 1 enospc errors during balance
> BTRFS info (device dm-2): 1 enospc errors during balance
> BTRFS info (device dm-2): relocating block group 598566305792 flags data
> BTRFS info (device dm-2): 1 enospc errors during balance

> saruman:~# btrfs fi show /mnt/btrfs_pool1/
> Label: 'btrfs_pool1'  uuid: bc115001-a8d1-445c-9ec9-6050620efd0a
> 	Total devices 1 FS bytes used 169.73GiB
>       devid    1 size 228.67GiB used 228.67GiB path /dev/mapper/pool1

> saruman:~# btrfs fi usage /mnt/btrfs_pool1/
> Overall:
>     Device size:		 228.67GiB
>     Device allocated:		 228.67GiB
>     Device unallocated:	   1.00MiB
>     Device missing:		     0.00B
>     Used:			 171.25GiB
>     Free (estimated):		  55.32GiB	(min: 55.32GiB)
>     Data ratio:		      1.00
>     Metadata ratio:		      1.00
>     Global reserve:     	 512.00MiB	(used: 0.00B)
> 
> Data,single: Size:221.60GiB, Used:166.28GiB
>    /dev/mapper/pool1	 221.60GiB
> 
> Metadata,single: Size:7.03GiB, Used:4.96GiB
>    /dev/mapper/pool1	   7.03GiB
> 
> System,single: Size:32.00MiB, Used:48.00KiB
>    /dev/mapper/pool1	  32.00MiB
> 
> Unallocated:
>    /dev/mapper/pool1	   1.00MiB

So we see it's fully chunk-allocated, no unallocated space, but gigs and 
gigs of empty space within the chunk allocations, data chunks in 
particular.

> How did I get into such a misbalanced state when I balance every night?
> 
> My filesystem is not full, I can write just fine, but I sure cannot
> rebalance now.

Well, you can write just fine... for now.

After accounting for the global reserve coming out of metadata's reported 
free, there's about 1.5 GiB space in the metadata, and about 55 GiB of 
space in the data, so you should actually be able to write for some time 
before running out of either.

You just can't rebalance to chunk-defrag and reclaim chunks to 
unallocated, so they can be used for the other chunk type if necessary.
You're correct to be worried about this, but it's not immediately urgent.
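
For anyone following along, the arithmetic behind those two figures can 
be checked against the btrfs fi usage output quoted above.  This is only 
a sketch: the numbers are copied from that output, the helper name is 
made up, and the assumption that the global reserve comes out of 
metadata's free space is the one stated above.

```python
# Figures copied from the `btrfs fi usage` output in this thread (GiB).
DATA_SIZE, DATA_USED = 221.60, 166.28
META_SIZE, META_USED = 7.03, 4.96
GLOBAL_RESERVE = 0.50  # 512.00MiB, assumed to come out of metadata's free space


def free_inside_chunks(size_gib, used_gib, reserved_gib=0.0):
    """Space that is allocated to chunks but not holding anything yet."""
    return round(size_gib - used_gib - reserved_gib, 2)


data_free = free_inside_chunks(DATA_SIZE, DATA_USED)
meta_free = free_inside_chunks(META_SIZE, META_USED, GLOBAL_RESERVE)
print(f"data: {data_free} GiB free, metadata: {meta_free} GiB free")
```

The data figure matches the "Free (estimated)" line because, with only 
1.00MiB unallocated, essentially all the free space is inside data chunks.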

> Besides adding another device to add space, is there a way around this
> and more generally not getting into that state anymore considering that
> I already rebalance every night?

What you /haven't/ yet said is what your nightly rebalance command, 
presumably scheduled, with -dusage and -musage, actually is.  How did you 
determine the usage amount to feed to the command, and was it dynamic, 
presumably determined by some script and changing based on the amount of 
unutilized space trapped within the data chunks, or static, the same 
usage command given every nite?

The other thing we don't have, and you might not have any idea either if 
it was simply scheduled and you hadn't been specifically checking, is a 
trendline of whether the post-balance unallocated space has been reducing 
over time, while the post-balance unutilized space within the data chunks 
was growing, or whether it happened all of a sudden.
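
To get that trendline going forward, something like the following could 
be run right after the nightly balance and its result logged.  It is a 
rough sketch, not an existing tool: the function names are made up, and 
the parsing assumes the "Device unallocated:" line looks the way it does 
in the output quoted above.

```python
import re

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def unallocated_bytes(usage_text):
    """Pull 'Device unallocated' out of `btrfs fi usage` text, in bytes."""
    m = re.search(r"Device unallocated:\s+([\d.]+)([KMGT]iB|B)", usage_text)
    if m is None:
        raise ValueError("no 'Device unallocated' line found")
    return int(float(m.group(1)) * UNITS[m.group(2)])


def is_shrinking(samples):
    """True if every post-balance sample is lower than the one before it."""
    return len(samples) > 1 and all(b < a for a, b in zip(samples, samples[1:]))


# Two lines in the format shown in the quoted output.
sample = "    Device allocated:\t\t 228.67GiB\n    Device unallocated:\t\t   1.00MiB\n"
print(unallocated_bytes(sample))
```

Feeding a week or two of those numbers to is_shrinking() would at least 
tell a steady decline from a sudden drop.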

If you've been following current discussion threads here, you may already 
know one possible specific trigger, as discussed, and more generically, 
there could be other specific triggers in the same general category.

In that thread the specific culprit appeared to be btrfs behavior with 
the (autodetected based on device rotational value as reported by sysfs) 
ssd mount option, in particular as it interacted with systemd's journal 
files, but it would apply to anything else with a similar write pattern.

The overall btrfs usage pattern was problematic in much the same way 
yours apparently is, except that he caught it before full allocation and 
you didn't: btrfs kept allocating new chunks, even tho there was 
plenty of space left within existing chunks, none of which were entirely 
empty (so they didn't get auto-reclaimed to unallocated), but few of 
which were anything like entirely filled, either.

If you go look at that thread (which I'd specify only I'd have to go look 
for it too, and the OP on that thread is list-active so will likely reply 
on this thread as well), there's some very nice chunk-usage 
visualizations linked of what btrfs was doing.

Well he's the coder I'm not, so he could actually dive into btrfs code, 
and combined with experiments, eventually traced it down to the behavior 
of the (auto-enabled based on rotational, tho it didn't really apply in 
his case) ssd mount option.

It turns out that at a low level, what the ssd mount option actually does 
is force data-block allocations to be 2 MiB at a time.  The idea is to 
match the very often 2-4 MiB ssd erase-block size, so writes ideally 
correspond to erase blocks and if that range in the file is rewritten or 
the file deleted, it'll be a 1:1 erase and (possible) rewrite, at least 
for the data.

While that works well for normal files generally written in reasonably 
large (full-file or MiB at a time) chunks and often not rewritten, it 
turns out systemd's journal files are near worst-case, at least if 
subject to regular snapshotting.

The journal pattern is (IIRC from the thread) to fallocate the file, 
typically several MB, write an index at the front, and then write journal 
entries as they come in from the /back/ /forward/, naturally rewriting 
the index each time as well.

Of course as btrfs users and systemd devs discovered early on, this write 
pattern is worst-case for COW filesystems such as btrfs.  The early 
result was that systemd quickly set the journal directory +C/NOCOW by 
default, ideally making it rewrite-in-place like they were used to on 
other filesystems and (they thought) eliminating the problem.

Except... as we list-regulars at least know by now, nocow doesn't mean 
nocow when a file is both repeatedly rewritten and snapshotted, because 
snapshots lock in the current content so the first write thereafter MUST 
be COW, a phenomenon often referred to as cow1.  While the effect isn't 
so bad with an occasional snapshot and/or rewrite, once the snapshots and 
rewrites are coming fast and regularly enough, the effect is very close 
to standard COW, despite the nominal NOCOW.
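
A toy model makes the cow1 effect easier to see.  This is purely 
illustrative, assuming nothing about the real btrfs code: the class and 
function names are made up, and all it tracks is whether a block is 
still referenced by a snapshot when it gets rewritten.

```python
class NocowFile:
    """Toy model of a NOCOW file: tracks which blocks a snapshot still shares."""

    def __init__(self, nblocks):
        self.shared = [False] * nblocks  # is the block referenced by a snapshot?
        self.cow_writes = 0
        self.inplace_writes = 0

    def snapshot(self):
        # A snapshot pins the current content of every block.
        self.shared = [True] * len(self.shared)

    def write(self, block):
        if self.shared[block]:
            # First write after a snapshot must go to a new location (cow1).
            self.cow_writes += 1
            self.shared[block] = False
        else:
            self.inplace_writes += 1


def rewrite_workload(snapshot_every, writes=100):
    """Rewrite block 0 `writes` times, snapshotting every N writes (0 = never)."""
    f = NocowFile(nblocks=4)
    for i in range(writes):
        if snapshot_every and i % snapshot_every == 0:
            f.snapshot()
        f.write(0)
    return f.cow_writes, f.inplace_writes


print(rewrite_workload(snapshot_every=0))   # no snapshots at all
print(rewrite_workload(snapshot_every=50))  # occasional snapshots
print(rewrite_workload(snapshot_every=1))   # a snapshot before every write
```

With no snapshots every rewrite is in place, with occasional ones only a 
couple of writes are COWed, and with a snapshot before every write the 
nominally NOCOW file behaves exactly like a COW one.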

The effect of regular snapshots on systemd's journal files, generally 
rewritten a single journal record at a time, except that the records are 
written from the end of the file forward, with the index at the beginning 
also rewritten...

Combined with the effect of the (auto-enabled) ssd mount option forcing 
each of those writes to (what would be) a separate 2-MiB erase-block...

Is **SERIOUSLY** fragmented journal files!!

Now consider what COW in the context of regular snapshotting does to those 
files, already seriously fragmented by the regular rewrites.  The original 
full-block allocations can't be released until **NO** references to them 
continue to exist in old snapshots.  So the whole original 8 MiB or 16 
MiB or whatever then near empty journal-file allocation continues to 
remain, until all parts of it have been rewritten.  But because the 
journal records are filled in from the back forward, the last 4-KB block 
won't be rewritten until the file is nearly full and about to be rotated 
out of active use.  By then, it'll have all those single-record entries 
added a record or two at a time, so will be effectively fully fragmented!

And all those snapshots will be locking all those fragments of the file 
in its various snapshotted states in place, including the original intact 
near-empty first write of the whole fallocated file, until all those 
snapshots are deleted!

And with the ssd mount option, all those locked-in-place single journal 
record fragments are going to be 2 MiB each!

Of course many database and VM image file formats have similar rewrite-
triggered problems on COW, exacerbated by snapshotting triggering cow1, 
even if the file is nominally NOCOW.

The observed behavior was that new chunks would be allocated and filled, 
2 MiB at a time.  Eventually snapshot deletion would start clearing 
things out, but with continued write activity of the journal and other 
stuff as well, the chunk would remain partially full, but eventually with 
no continuous spaces left large enough to write 2 MiB at a time into, so 
another chunk would be allocated.

This repeated time after time, with each newly allocated chunk eventually 
eaten into Swiss cheese, as chunk allocations continued to grow, even tho 
actual data usage remained near steady-state.
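
To put the Swiss cheese in concrete terms, here's a toy model.  It is 
emphatically not the btrfs allocator, just a sketch under the assumption 
described above that a data write needs 2 MiB of contiguous free space; 
the chunk size, the hole pattern and the function names are all invented 
for illustration.

```python
BLOCK = 4 * 1024                    # 4 KiB
CLUSTER = 2 * 1024 * 1024 // BLOCK  # 512 blocks = 2 MiB


def longest_free_run(chunk):
    """Length, in blocks, of the longest run of free (False) blocks."""
    best = run = 0
    for used in chunk:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


def fits(chunk, nblocks):
    """Can a write needing `nblocks` contiguous free blocks go in this chunk?"""
    return longest_free_run(chunk) >= nblocks


# A 64 MiB toy chunk with one stray 4 KiB block left behind every 1 MiB.
chunk = [False] * (64 * 1024 * 1024 // BLOCK)
for i in range(0, len(chunk), 256):
    chunk[i] = True

used_pct = 100 * sum(chunk) / len(chunk)
print(f"{used_pct:.2f}% used, longest free run {longest_free_run(chunk)} blocks")
print("2 MiB write fits:", fits(chunk, CLUSTER))
print("4 KiB write fits:", fits(chunk, 1))
```

The toy chunk is well under 1% used, yet a 2 MiB write finds no home in 
it, while a 4 KiB write would fit almost anywhere.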

With code study and eventually the confirmation of experimentation, he 
eventually traced the problem, on the btrfs side at least, down to the 
ssd mount option.  Turning that off allowed the allocator to fill in all 
those previously empty single-4K-block holes in the Swiss cheese, and the 
problem disappeared.  (His rebalance scripts were sophisticated enough to 
use btrfs debugging to pick the worst fragmented chunks and rebalance 
them specifically, just a couple chunks on each call, so as soon as the 2 
MiB write problem disappeared, his scripts gradually filled in the 
existing Swiss cheese, eliminating chunks as they did so.)

Note that this was actually on an enterprise-level storage aggregation 
device, a whole bunch of spinning rust underneath, but not exposing 
physical blocks or rotation information to the kernel and thus to btrfs.  
So all btrfs saw was non-rotational, and it (wrongly in this case) auto-
enabled the ssd mount option based on that.  It wasn't a deliberately 
added mount option, tho it wasn't, at that time, deliberately disabled, 
either.

He fixed that by specifically adding nossd to his mount options.

The other btrfs angled piece of the solution was to put the journal files 
in their own dedicated subvolume, so they didn't get snapshotted with the 
parent.  He decided he didn't need journal snapshots anyway, at least not 
at the cost of the trouble they were causing.

I'm guessing that you have something similar going on.  It may be the ssd 
mount option and journald files.  It might be rewrite-pattern VM images 
instead of or in addition to the journald files.  It might be database 
files.  It might be something else similar.  But they're probably being 
snapshotted, and that's killing the nocow if you have it enabled.

Meanwhile, here, I /am/ on ssds and have the option enabled, but I'm not 
seeing anything similar, despite my running systemd as well.  That's for 
several reasons:

1) No snapshotting at all, here.  I run smallish partitions and btrfs' of 
under 100 GiB each, and simply copy the entire filesystem tree 
(directories and files) to a freshly mkfs-ed backup copy (with multiple 
such backup copies), as my backup method.

2) No systemd journal files on btrfs, here.  journald.conf has 
Storage=volatile set, so it only keeps the tmpfs files (which I've 
reconfigured size-wise to retain a full session).  Meanwhile, I run a 
conventional syslog-ng with systemd passing on journald entries to it 
too, and syslog manages the actually stored logs... in conventional 
greppable text-based append-only files that are much *MUCH* easier for 
btrfs to reasonably handle. =:^)

3) I run the autodefrag mount option.  This rewrites fragmented ranges as 
necessary, so I expect even if I was doing journald files on btrfs, and 
snapshotting them, along with the ssd I do have, I'd not have the same 
sort of issue.

4) The compress=lzo mount option I also run may affect this sort of thing 
too, due to its 128-KiB compression-block size, but I'm not sure what the 
exact effect would be, and with journald writing the index up front in 
the first block that btrfs quick-tests for compressibility, and the fact 
that I don't use compress-force, it may be that btrfs wouldn't compress 
the journal files anyway, thus no effect.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-13 20:54 balancing every night broke balancing so now I can't balance anymore? Marc MERLIN
  2017-05-14  7:34 ` Duncan
@ 2017-05-14 19:13 ` Hans van Kranenburg
  2017-05-14 20:15   ` Marc MERLIN
  1 sibling, 1 reply; 13+ messages in thread
From: Hans van Kranenburg @ 2017-05-14 19:13 UTC (permalink / raw)
  To: Marc MERLIN, linux-btrfs

On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> Kernel 4.11, btrfs-progs v4.7.3
> 
> I run scrub and balance every night, been doing this for 1.5 years on this
> filesystem.

What are the exact commands you run every day?

> But it has just started failing:
> [...]

> saruman:~# btrfs fi usage /mnt/btrfs_pool1/
> Overall:
>     Device size:		 228.67GiB
>     Device allocated:		 228.67GiB
>     Device unallocated:		   1.00MiB
>     Device missing:		     0.00B
>     Used:			 171.25GiB
>     Free (estimated):		  55.32GiB	(min: 55.32GiB)
>     Data ratio:			      1.00
>     Metadata ratio:		      1.00
>     Global reserve:		 512.00MiB	(used: 0.00B)
> 
> Data,single: Size:221.60GiB, Used:166.28GiB
>    /dev/mapper/pool1	 221.60GiB
> 
> Metadata,single: Size:7.03GiB, Used:4.96GiB
>    /dev/mapper/pool1	   7.03GiB
> 
> System,single: Size:32.00MiB, Used:48.00KiB
>    /dev/mapper/pool1	  32.00MiB
> 
> Unallocated:
>    /dev/mapper/pool1	   1.00MiB
> 
> How did I get into such a misbalanced state when I balance every night?

I don't know, since I don't know what you do exactly. :)

> My filesystem is not full, I can write just fine, but I sure cannot
> rebalance now.

Yes, because you have quite some allocated but unused space. If btrfs
cannot just allocate more chunks, it starts trying a bit harder to reuse
all the empty spots in the already existing chunks.

> Besides adding another device to add space, is there a way around this
> and more generally not getting into that state anymore considering that
> I already rebalance every night?

Add monitoring and alerting on the amount of unallocated space.

FWIW, this is what I use for that purpose:

https://packages.debian.org/sid/munin-plugins-btrfs
https://packages.debian.org/sid/monitoring-plugins-btrfs
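
If neither package is an option, the check itself is small. The sketch
below is not taken from either plugin: the function name is made up and
the 10% / 5% thresholds are arbitrary placeholders that you would need to
tune for your own filesystem.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # exit codes in the usual monitoring-plugin style


def check_unallocated(device_size, unallocated, warn_pct=10.0, crit_pct=5.0):
    """Return (status, message) for the share of the device still unallocated."""
    pct = 100.0 * unallocated / device_size
    if pct < crit_pct:
        status, label = CRITICAL, "CRITICAL"
    elif pct < warn_pct:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    return status, f"{label}: {pct:.2f}% of device unallocated"


GIB, MIB = 2**30, 2**20
# Values from the `btrfs fi usage` output quoted above.
print(check_unallocated(228.67 * GIB, 1.00 * MIB))
```

With thresholds like these, the alert would have fired long before the
filesystem got down to 1.00MiB unallocated.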

And, of course the btrfs-heatmap program keeps being a fun tool to
create visual timelapses of your filesystem, so you can learn how your
usage pattern is resulting in allocation of space by btrfs, and so that
you can visually see what the effect of your btrfs balance attempts is:

https://github.com/knorrie/btrfs-heatmap/
https://packages.debian.org/sid/btrfs-heatmap
https://apps.fedoraproject.org/packages/btrfs-heatmap
https://aur.archlinux.org/packages/python-btrfs-heatmap/

-- 
Hans van Kranenburg

* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-14 19:13 ` Hans van Kranenburg
@ 2017-05-14 20:15   ` Marc MERLIN
  2017-05-14 20:57     ` Lionel Bouton
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Marc MERLIN @ 2017-05-14 20:15 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: linux-btrfs

On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> > Kernel 4.11, btrfs-progs v4.7.3
> > 
> > I run scrub and balance every night, been doing this for 1.5 years on this
> > filesystem.
> 
> What are the exact commands you run every day?
 
http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
(at the bottom)
every night:
1) scrub
2) balance -musage=0
3) balance -musage=20
4) balance -dusage=0
5) balance -dusage=20
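
Spelled out as commands, the five steps would look roughly like this.
It is only a sketch of the sequence listed here, not the script from the
page linked above: the helper name is made up, and the -B flag on scrub
(stay in the foreground until it finishes) is an assumption about how
the steps are chained.

```python
def nightly_commands(mountpoint):
    """Return the nightly sequence as argv lists, in the order listed above."""
    cmds = [["btrfs", "scrub", "start", "-B", mountpoint]]
    for kind in ("m", "d"):      # metadata first, then data
        for usage in (0, 20):    # the usage=0 pass first, then usage=20
            cmds.append(["btrfs", "balance", "start",
                         f"-{kind}usage={usage}", mountpoint])
    return cmds


for argv in nightly_commands("/mnt/btrfs_pool1"):
    print(" ".join(argv))
```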

> > How did I get into such a misbalanced state when I balance every night?
> 
> I don't know, since I don't know what you do exactly. :)
 
Now you do :)

> > My filesystem is not full, I can write just fine, but I sure cannot
> > rebalance now.
> 
> Yes, because you have quite some allocated but unused space. If btrfs
> cannot just allocate more chunks, it starts trying a bit harder to reuse
> all the empty spots in the already existing chunks.

Ok. shouldn't balance fix problems just like this?
I have 60GB-ish free, or in this case that's also >25%, that's a lot

Speaking of unallocated, I have more now:
    Device unallocated:		 993.00MiB

This kind of just magically fixed itself during snapshot rotation and
deletion I think.
Sure enough, balance works again, but this feels pretty fragile.
Looking again:
    Device size:		 228.67GiB
    Device allocated:		 227.70GiB
    Device unallocated:		 993.00MiB
    Free (estimated):		  58.53GiB	(min: 58.53GiB)

You're saying that I need unallocated space for new chunks to be
created, which is required by balance.
Should btrfs not take care of keeping some space for me?
Shouldn't a nightly balance, which I'm already doing, help even more
with this?

> > Besides adding another device to add space, is there a way around this
> > and more generally not getting into that state anymore considering that
> > I already rebalance every night?
> 
> Add monitoring and alerting on the amount of unallocated space.
> 
> FWIW, this is what I use for that purpose:
> 
> https://packages.debian.org/sid/munin-plugins-btrfs
> https://packages.debian.org/sid/monitoring-plugins-btrfs
> 
> And, of course the btrfs-heatmap program keeps being a fun tool to
> create visual timelapses of your filesystem, so you can learn how your
> usage pattern is resulting in allocation of space by btrfs, and so that
> you can visually see what the effect of your btrfs balance attempts is:

That's interesting, but ultimately, users shouldn't have to micromanage
their filesystem to that level, even btrfs.

a) What is wrong in my nightly script that I should fix/improve?
b) How do I recover from my current state?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-14 20:15   ` Marc MERLIN
@ 2017-05-14 20:57     ` Lionel Bouton
  2017-05-14 21:30       ` Kai Krakow
  2017-05-14 21:21     ` Hugo Mills
  2017-05-14 21:22     ` Kai Krakow
  2 siblings, 1 reply; 13+ messages in thread
From: Lionel Bouton @ 2017-05-14 20:57 UTC (permalink / raw)
  To: Marc MERLIN, Hans van Kranenburg; +Cc: linux-btrfs

On 14/05/2017 at 22:15, Marc MERLIN wrote:
> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
>> On 05/13/2017 10:54 PM, Marc MERLIN wrote:
>>> Kernel 4.11, btrfs-progs v4.7.3
>>>
>>> I run scrub and balance every night, been doing this for 1.5 years on this
>>> filesystem.
>> What are the exact commands you run every day?
>  
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20
> 4) balance -dusage=0
> 5) balance -dusage=20

usage=20 is pretty low: this means you don't try to reallocate and
regroup together block groups that are filled more than 20%.
Constantly using this setting has left lots of allocated block groups
that are mostly empty on your filesystem (a little more than 20% used).

The rebalance subject is a bit complex. With an empty filesystem you
almost don't need it as group creation is sparse and it's OK to have
mostly empty groups. When your filesystem begins to fill up you have to
raise the usage target to be able to reclaim space (as the fs fills up
most of your groups do too) so that new block creation can happen.

I've coded one Ruby script which tries to balance between the cost of
reallocating groups and the need for it. The basic idea is that it tries
to keep the proportion of free space "wasted" by being allocated
although it isn't used below a threshold. It will bring this proportion
down enough through balance that minor reallocation won't trigger a new
balance right away. It should handle pathological conditions as well as
possible and it won't spend more than 2 hours working on a single
filesystem by default. We deploy this as a daily cron script through
Puppet on all our systems and it works very well (I didn't have to use
balance manually to manage free space since we did that).
Note that by default it sleeps a random amount of time to avoid IO
spikes on VMs running on the same host. You can either edit it or pass
it "0" which will be used for the max amount of time to sleep bypassing
this precaution.
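
To make the "wasted" proportion concrete, here is the idea in a few
lines of Python. This is not the script itself, only a sketch of the
metric: the function names and the 0.5 threshold are made up for
illustration and are not the script's defaults.

```python
def wasted_free_ratio(device_size, allocated, used):
    """Share of the free space that is allocated to block groups but unused."""
    free = device_size - used
    if free <= 0:
        return 0.0
    return (allocated - used) / free


def needs_balance(device_size, allocated, used, threshold=0.5):
    """True when too much of the free space sits inside allocated groups."""
    return wasted_free_ratio(device_size, allocated, used) > threshold


# Figures from the `btrfs fi usage` output earlier in the thread (GiB).
ratio = wasted_free_ratio(228.67, 228.67, 171.25)
print(f"{ratio:.0%} of the free space is locked inside allocated block groups")
print("balance needed:", needs_balance(228.67, 228.67, 171.25))
```

On the filesystem in this thread the ratio is as bad as it gets: at this
precision, all of the free space is inside already allocated groups.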

Here is the latest version : https://pastebin.com/Rrw1GLtx
Given its current size, I should probably push it on github...

I've seen other maintenance scripts mentioned on this list so you might
find something simpler or more targeted to your needs by browsing through the
list's history.

Best regards,

Lionel

* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-14 20:57     ` Lionel Bouton
@ 2017-05-14 21:30       ` Kai Krakow
  2017-05-14 23:08         ` Lionel Bouton
  0 siblings, 1 reply; 13+ messages in thread
From: Kai Krakow @ 2017-05-14 21:30 UTC (permalink / raw)
  To: linux-btrfs

On Sun, 14 May 2017 22:57:26 +0200
Lionel Bouton <lionel-subscription@bouton.name> wrote:

> I've coded one Ruby script which tries to balance between the cost of
> reallocating groups and the need for it. The basic idea is that it
> tries to keep the proportion of free space "wasted" by being allocated
> although it isn't used below a threshold. It will bring this
> proportion down enough through balance that minor reallocation won't
> trigger a new balance right away. It should handle pathological
> conditions as well as possible and it won't spend more than 2 hours
> working on a single filesystem by default. We deploy this as a daily
> cron script through Puppet on all our systems and it works very well
> (I didn't have to use balance manually to manage free space since we
> did that). Note that by default it sleeps a random amount of time to
> avoid IO spikes on VMs running on the same host. You can either edit
> it or pass it "0" which will be used for the max amount of time to
> sleep bypassing this precaution.
> 
> Here is the latest version : https://pastebin.com/Rrw1GLtx
> Given its current size, I should probably push it on github...

Yes, please... ;-)

> I've seen other maintenance scripts mentioned on this list so you
> might find something simpler or more targeted to your needs by browsing
> through the list's history.


-- 
Regards,
Kai

Replies to list-only preferred.


* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-14 21:30       ` Kai Krakow
@ 2017-05-14 23:08         ` Lionel Bouton
  0 siblings, 0 replies; 13+ messages in thread
From: Lionel Bouton @ 2017-05-14 23:08 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 14/05/2017 at 23:30, Kai Krakow wrote:
> On Sun, 14 May 2017 22:57:26 +0200
> Lionel Bouton <lionel-subscription@bouton.name> wrote:
>
>> I've coded one Ruby script which tries to balance between the cost of
>> reallocating groups and the need for it.[...]
>> Given its current size, I should probably push it on github...
> Yes, please... ;-)

Most of our BTRFS filesystems are used by Ceph OSD, so here it is :

https://github.com/jtek/ceph-utils/blob/master/btrfs-auto-rebalance.rb

Best regards,

Lionel

* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-14 20:15   ` Marc MERLIN
  2017-05-14 20:57     ` Lionel Bouton
@ 2017-05-14 21:21     ` Hugo Mills
  2017-05-14 23:16       ` Marc MERLIN
  2017-05-14 21:22     ` Kai Krakow
  2 siblings, 1 reply; 13+ messages in thread
From: Hugo Mills @ 2017-05-14 21:21 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Hans van Kranenburg, linux-btrfs

On Sun, May 14, 2017 at 01:15:09PM -0700, Marc MERLIN wrote:
> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> > On 05/13/2017 10:54 PM, Marc MERLIN wrote:
> > > Kernel 4.11, btrfs-progs v4.7.3
> > > 
> > > I run scrub and balance every night, been doing this for 1.5 years on this
> > > filesystem.
> > 
> > What are the exact commands you run every day?
>  
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20

   In most cases, this is going to make ENOSPC problems worse, not
better. The reason for doing this kind of balance is to recover unused
space and allow it to be reallocated. The typical behaviour is that
data gets overallocated, and it's metadata which runs out. So, the
last thing you want to be doing is reducing the metadata allocation,
because that's the scarce resource.

   Also, I'd usually recommend using limit=n, where n is approximately
the amount of data overallocation (allocated space less used
space). It's much more controllable than usage.
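
   As a rough illustration with the numbers from this thread (a sketch
only: the helper is made up, and it assumes data chunks of about 1GiB,
so that the overallocation in GiB is also roughly the number of chunks):

```python
def dlimit_filter(data_allocated_gib, data_used_gib, chunk_gib=1.0):
    """Build a -dlimit=N balance filter from the data overallocation.

    chunk_gib=1.0 is an assumption about the typical data chunk size.
    Returns None when there is less than one chunk's worth to recover.
    """
    overallocated = data_allocated_gib - data_used_gib
    n = int(overallocated // chunk_gib)
    if n < 1:
        return None
    return f"-dlimit={n}"


# Data,single: Size:221.60GiB, Used:166.28GiB
flt = dlimit_filter(221.60, 166.28)
print("btrfs balance start", flt, "/mnt/btrfs_pool1")
```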

   Hugo.

> 4) balance -dusage=0
> 5) balance -dusage=20
> 
> > > How did I get into such a misbalanced state when I balance every night?
> > 
> > I don't know, since I don't know what you do exactly. :)
>  
> Now you do :)
> 
> > > My filesystem is not full, I can write just fine, but I sure cannot
> > > rebalance now.
> > 
> > Yes, because you have quite some allocated but unused space. If btrfs
> > cannot just allocate more chunks, it starts trying a bit harder to reuse
> > all the empty spots in the already existing chunks.
> 
> Ok. shouldn't balance fix problems just like this?
> I have 60GB-ish free, or in this case that's also >25%, that's a lot
> 
> Speaking of unallocated, I have more now:
>     Device unallocated:		 993.00MiB
> 
> This kind of just magically fixed itself during snapshot rotation and
> deletion I think.
> Sure enough, balance works again, but this feels pretty fragile.
> Looking again:
>     Device size:		 228.67GiB
>     Device allocated:		 227.70GiB
>     Device unallocated:		 993.00MiB
>     Free (estimated):		  58.53GiB	(min: 58.53GiB)
> 
> You're saying that I need unallocated space for new chunks to be
> created, which is required by balance.
> Should btrfs not take care of keeping some space for me?
> Shouldn't a nightly balance, which I'm already doing, help even more
> with this?
> 
> > > Besides adding another device to add space, is there a way around this
> > > and more generally not getting into that state anymore considering that
> > > I already rebalance every night?
> > 
> > Add monitoring and alerting on the amount of unallocated space.
> > 
> > FWIW, this is what I use for that purpose:
> > 
> > https://packages.debian.org/sid/munin-plugins-btrfs
> > https://packages.debian.org/sid/monitoring-plugins-btrfs
> > 
> > And, of course the btrfs-heatmap program keeps being a fun tool to
> > create visual timelapses of your filesystem, so you can learn how your
> > usage pattern is resulting in allocation of space by btrfs, and so that
> > you can visually see what the effect of your btrfs balance attempts is:
> 
> That's interesting, but ultimately, users shouldn't have to micromanage
> their filesystem to that level, even btrfs.
> 
> a) What is wrong in my nightly script that I should fix/improve?
> b) How do I recover from my current state?
> 
> Thanks,
> Marc

-- 
Hugo Mills             | You stay in the theatre because you're afraid of
hugo@... carfax.org.uk | having no money? There's irony...
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                     Slings and Arrows



* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-14 21:21     ` Hugo Mills
@ 2017-05-14 23:16       ` Marc MERLIN
  2017-05-15  8:14         ` Hugo Mills
  0 siblings, 1 reply; 13+ messages in thread
From: Marc MERLIN @ 2017-05-14 23:16 UTC (permalink / raw)
  To: Hugo Mills, Hans van Kranenburg, linux-btrfs

On Sun, May 14, 2017 at 09:21:11PM +0000, Hugo Mills wrote:
> > 2) balance -musage=0
> > 3) balance -musage=20
> 
>    In most cases, this is going to make ENOSPC problems worse, not
> better. The reason for doing this kind of balance is to recover unused
> space and allow it to be reallocated. The typical behaviour is that
> data gets overallocated, and it's metadata which runs out. So, the
> last thing you want to be doing is reducing the metadata allocation,
> because that's the scarce resource.
> 
>    Also, I'd usually recommend using limit=n, where n is approximately
> the amount of data overallocation (allocated space less used
> space). It's much more controllable than usage.


Thanks for that.
So, would you just remove the balance -musage=20 altogether?

As for limit= I'm not sure if it would be helpful since I run this
nightly. Anything that doesn't get done tonight due to limit, would be
done tomorrow?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-14 23:16       ` Marc MERLIN
@ 2017-05-15  8:14         ` Hugo Mills
  2017-05-15 11:30           ` Lionel Bouton
  2017-05-15 12:34           ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 13+ messages in thread
From: Hugo Mills @ 2017-05-15  8:14 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Hans van Kranenburg, linux-btrfs


On Sun, May 14, 2017 at 04:16:52PM -0700, Marc MERLIN wrote:
> On Sun, May 14, 2017 at 09:21:11PM +0000, Hugo Mills wrote:
> > > 2) balance -musage=0
> > > 3) balance -musage=20
> > 
> >    In most cases, this is going to make ENOSPC problems worse, not
> > better. The reason for doing this kind of balance is to recover unused
> > space and allow it to be reallocated. The typical behaviour is that
> > data gets overallocated, and it's metadata which runs out. So, the
> > last thing you want to be doing is reducing the metadata allocation,
> > because that's the scarce resource.
> > 
> >    Also, I'd usually recommend using limit=n, where n is approximately
> > the amount of data overallocation (allocated space less used
> > space). It's much more controllable than usage.
> 
> 
> Thanks for that.
> So, would you just remove the balance -musage=20 altogether?

   Yes.

> As for limit= I'm not sure if it would be helpful since I run this
> nightly. Anything that doesn't get done tonight due to limit, would be
> done tomorrow?

   I'm suggesting limit= on its own. It's a fixed amount of work
compared to usage=, which may not do anything at all. For example,
it's perfectly possible to have a filesystem which is, say, 30% full,
and yet is still a fully-allocated filesystem with more than 20% of
every chunk used. In that case your usage= wouldn't balance anything,
and you'd still be left in the situation of risking ENOSPC from
running out of metadata.
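
   To make that scenario concrete, here is a minimal shell sketch. It
is only an illustration: count_selected is a hypothetical helper, the
per-chunk percentages are made up, and it assumes that usage=N picks
chunks whose usage is below N percent.

```shell
# count_selected THRESHOLD PCT...
# Prints how many of the given per-chunk usage percentages fall below
# THRESHOLD, i.e. roughly how many chunks a usage=THRESHOLD filter
# would select (assumption: the filter means "usage below N percent").
count_selected() {
    threshold=$1
    shift
    selected=0
    for pct in "$@"; do
        if [ "$pct" -lt "$threshold" ]; then
            selected=$((selected + 1))
        fi
    done
    echo "$selected"
}

# Hypothetical fully-allocated filesystem: every chunk is 25-35% used.
count_selected 20 25 30 35 28 32
```

   With every chunk above 20% used, the count is 0, so usage=20 has
nothing to relocate, whereas limit=n would still process up to n
chunks.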

   All you need to do is ensure that you have enough unallocated space
for the metadata to expand into if it needs to. That's the ultimate
goal of all this.
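
   One possible way to keep an eye on that, as a rough sketch rather
than a recommended tool: the two function names and the 5 GiB
threshold in the usage comment are arbitrary assumptions, and the
parsing assumes that "btrfs filesystem usage -b" prints a
"Device unallocated:" line with a raw byte count.

```shell
# unallocated_bytes: reads `btrfs filesystem usage -b` style output on
# stdin and prints the raw byte count from "Device unallocated:".
unallocated_bytes() {
    awk '/Device unallocated:/ { print $3; exit }'
}

# low_on_unallocated MIN_BYTES: reads the same output on stdin and
# succeeds (exit status 0) when unallocated space is below MIN_BYTES.
low_on_unallocated() {
    min_bytes=$1
    have=$(unallocated_bytes)
    [ -n "$have" ] && [ "$have" -lt "$min_bytes" ]
}

# Example use (not executed here; the threshold is an assumption):
#   btrfs filesystem usage -b /mnt/btrfs_pool1 \
#       | low_on_unallocated $((5 * 1024 * 1024 * 1024)) \
#       && echo "unallocated space is running low"
```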

   If you have SSDs, it may also be beneficial to use nossd as a mount
option, because that seems to have some pathology in overallocating
chunks in normal usage. Hans investigated this in detail a month or
two ago.

   Hugo.

-- 
Hugo Mills             | "You know, the British have always been nice to mad
hugo@... carfax.org.uk | people."
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                         Laura Jesson, Brief Encounter



* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-15  8:14         ` Hugo Mills
@ 2017-05-15 11:30           ` Lionel Bouton
  2017-05-15 12:34           ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 13+ messages in thread
From: Lionel Bouton @ 2017-05-15 11:30 UTC (permalink / raw)
  To: Hugo Mills, Marc MERLIN, Hans van Kranenburg, linux-btrfs

On 15/05/2017 at 10:14, Hugo Mills wrote:
> [...]
>> As for limit= I'm not sure if it would be helpful since I run this
>> nightly. Anything that doesn't get done tonight due to limit, would be
>> done tomorrow?
>    I'm suggesting limit= on its own. It's a fixed amount of work
> compared to usage=, which may not do anything at all. For example,
> it's perfectly possible to have a filesystem which is, say, 30% full,
> and yet is still a fully-allocated filesystem with more than 20% of
> every chunk used. In that case your usage= wouldn't balance anything,
> and you'd still be left in the situation of risking ENOSPC from
> running out of metadata.

Hugo, since I haven't had any feedback on my approach to this problem,
could you have a look at my script, or simply at the principle? Compared
with using limit, is there any drawback to calling balance multiple times
with increasing usage values (the same value for data and metadata) until
enough free space is recovered?

For reference:

https://github.com/jtek/ceph-utils/blob/master/btrfs-auto-rebalance.rb
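
The principle can be sketched in a few lines of shell. This is not the
Ruby script linked above, only a minimal illustration of the idea: the
function names, the 5 GiB default threshold, and the list of usage
values are assumptions, and the probe assumes that
"btrfs filesystem usage -b" prints a "Device unallocated:" line in raw
bytes.

```shell
# Threshold below which we keep balancing (5 GiB, an arbitrary assumption).
MIN_UNALLOCATED_BYTES=${MIN_UNALLOCATED_BYTES:-5368709120}

# enough_unallocated MOUNTPOINT: succeeds once unallocated space on the
# filesystem reaches the threshold.
enough_unallocated() {
    have=$(btrfs filesystem usage -b "$1" \
        | awk '/Device unallocated:/ { print $3; exit }')
    [ -n "$have" ] && [ "$have" -ge "$MIN_UNALLOCATED_BYTES" ]
}

# run_balance USAGE MOUNTPOINT: one balance pass with the same usage
# value for data and metadata.
run_balance() {
    btrfs balance start -dusage="$1" -musage="$1" "$2"
}

# stepped_balance MOUNTPOINT USAGE...: tries each usage value in order
# and stops as soon as enough space is unallocated or a pass fails.
stepped_balance() {
    mountpoint=$1
    shift
    for usage in "$@"; do
        if enough_unallocated "$mountpoint"; then
            return 0
        fi
        run_balance "$usage" "$mountpoint" || return 1
    done
    enough_unallocated "$mountpoint"
}

# Example use (not executed here):
#   stepped_balance /mnt/btrfs_pool1 5 10 20 40 60
```

Stopping at the first failed pass is a design choice in this sketch; a
pass that hits ENOSPC at a low usage value is unlikely to succeed at a
higher one.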


Lionel


* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-15  8:14         ` Hugo Mills
  2017-05-15 11:30           ` Lionel Bouton
@ 2017-05-15 12:34           ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2017-05-15 12:34 UTC (permalink / raw)
  To: Hugo Mills, Marc MERLIN, Hans van Kranenburg, linux-btrfs

On 2017-05-15 04:14, Hugo Mills wrote:
> On Sun, May 14, 2017 at 04:16:52PM -0700, Marc MERLIN wrote:
>> On Sun, May 14, 2017 at 09:21:11PM +0000, Hugo Mills wrote:
>>>> 2) balance -musage=0
>>>> 3) balance -musage=20
>>>
>>>    In most cases, this is going to make ENOSPC problems worse, not
>>> better. The reason for doing this kind of balance is to recover unused
>>> space and allow it to be reallocated. The typical behaviour is that
>>> data gets overallocated, and it's metadata which runs out. So, the
>>> last thing you want to be doing is reducing the metadata allocation,
>>> because that's the scarce resource.
>>>
>>>    Also, I'd usually recommend using limit=n, where n is approximately
>>> the amount of data overallocation (allocated space less used
>>> space). It's much more controllable than usage.
>>
>>
>> Thanks for that.
>> So, would you just remove the balance -musage=20 altogether?
>
>    Yes.
The advantages to doing that depend also on how much excess free space 
you have and what your usual usage is.  If you're balancing a filesystem 
for a mail server that has lots of free space, you may indeed want to 
re-balance metadata chunks regularly because you're likely to be 
rewriting significant amounts of metadata regularly.
>
>> As for limit= I'm not sure if it would be helpful since I run this
>> nightly. Anything that doesn't get done tonight due to limit, would be
>> done tomorrow?
>
>    I'm suggesting limit= on its own. It's a fixed amount of work
> compared to usage=, which may not do anything at all. For example,
> it's perfectly possible to have a filesystem which is, say, 30% full,
> and yet is still a fully-allocated filesystem with more than 20% of
> every chunk used. In that case your usage= wouldn't balance anything,
> and you'd still be left in the situation of risking ENOSPC from
> running out of metadata.
FWIW, I normally use '-dusage=80 -mlimit=16' for my nightly balances. 
The usage filter at 80% means you won't waste time re-balancing full or 
mostly full chunks, and the limit filter of 16 takes on average about 5 
minutes on the consumer SSDs I have.
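
As a small sketch of how those filters fit on one command line: the
helper below only prints the command (a dry run) instead of executing
it. nightly_balance_cmd is a hypothetical name, and the mount point is
the one from earlier in this thread, used purely as an example.

```shell
# nightly_balance_cmd MOUNTPOINT DATA_USAGE META_LIMIT
# Prints (does not run) a balance command that combines a data usage
# filter with a metadata limit filter.
nightly_balance_cmd() {
    printf 'btrfs balance start -dusage=%s -mlimit=%s %s\n' "$2" "$3" "$1"
}

# Dry run with the values mentioned above.
nightly_balance_cmd /mnt/btrfs_pool1 80 16
```

Printing the command first makes it easy to review in a cron log before
deciding to run it for real.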
>
>    All you need to do is ensure that you have enough unallocated space
> for the metadata to expand into if it needs to. That's the ultimate
> goal of all this.
>
>    If you have SSDs, it may also be beneficial to use nossd as a mount
> option, because that seems to have some pathology in overallocating
> chunks in normal usage. Hans investigated this in detail a month or
> two ago.


* Re: balancing every night broke balancing so now I can't balance anymore?
  2017-05-14 20:15   ` Marc MERLIN
  2017-05-14 20:57     ` Lionel Bouton
  2017-05-14 21:21     ` Hugo Mills
@ 2017-05-14 21:22     ` Kai Krakow
  2 siblings, 0 replies; 13+ messages in thread
From: Kai Krakow @ 2017-05-14 21:22 UTC (permalink / raw)
  To: linux-btrfs

On Sun, 14 May 2017 13:15:09 -0700
Marc MERLIN <marc@merlins.org> wrote:

> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> > On 05/13/2017 10:54 PM, Marc MERLIN wrote:  
> > > Kernel 4.11, btrfs-progs v4.7.3
> > > 
> > > I run scrub and balance every night, been doing this for 1.5
> > > years on this filesystem.  
> > 
> > What are the exact commands you run every day?  
>  
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20
> 4) balance -dusage=0
> 5) balance -dusage=20
> 
> > > How did I get into such a misbalanced state when I balance every
> > > night?  
> > 
> > I don't know, since I don't know what you do exactly. :)  
>  
> Now you do :)
> 
> > > My filesystem is not full, I can write just fine, but I sure
> > > cannot rebalance now.  
> > 
> > Yes, because you have quite some allocated but unused space. If
> > btrfs cannot just allocate more chunks, it starts trying a bit
> > harder to reuse all the empty spots in the already existing
> > chunks.  
> 
> Ok. shouldn't balance fix problems just like this?
> I have 60GB-ish free, or in this case that's also >25%, that's a lot
> 
> Speaking of unallocated, I have more now:
>     Device unallocated:		 993.00MiB
> 
> This kind of just magically fixed itself during snapshot rotation and
> deletion I think.
> Sure enough, balance works again, but this feels pretty fragile.
> Looking again:
>     Device size:		 228.67GiB
>     Device allocated:		 227.70GiB
>     Device unallocated:		 993.00MiB
>     Free (estimated):		  58.53GiB	(min: 58.53GiB)
> 
> You're saying that I need unallocated space for new chunks to be
> created, which is required by balance.
> Should btrfs not take care of keeping some space for me?
> Shouldn't a nightly balance, which I'm already doing, help even more
> with this?
> 
> > > Besides adding another device to add space, is there a way around
> > > this and more generally not getting into that state anymore
> > > considering that I already rebalance every night?  
> > 
> > Add monitoring and alerting on the amount of unallocated space.
> > 
> > FWIW, this is what I use for that purpose:
> > 
> > https://packages.debian.org/sid/munin-plugins-btrfs
> > https://packages.debian.org/sid/monitoring-plugins-btrfs
> > 
> > And, of course the btrfs-heatmap program keeps being a fun tool to
> > create visual timelapses of your filesystem, so you can learn how
> > your usage pattern is resulting in allocation of space by btrfs,
> > and so that you can visually see what the effect of your btrfs
> > balance attempts is:  
> 
> That's interesting, but ultimately, users shouldn't have to
> micromanage their filesystem to that level, even btrfs.
> 
> a) What is wrong in my nightly script that I should fix/improve?

You may want to try
https://www.spinics.net/lists/linux-btrfs/msg52076.html

> b) How do I recover from my current state?

That script may work its way through.

-- 
Regards,
Kai

Replies to list-only preferred.



end of thread, other threads:[~2017-05-15 12:34 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz, follow: Atom feed)
-- links below jump to the message on this page --
2017-05-13 20:54 balancing every night broke balancing so now I can't balance anymore? Marc MERLIN
2017-05-14  7:34 ` Duncan
2017-05-14 19:13 ` Hans van Kranenburg
2017-05-14 20:15   ` Marc MERLIN
2017-05-14 20:57     ` Lionel Bouton
2017-05-14 21:30       ` Kai Krakow
2017-05-14 23:08         ` Lionel Bouton
2017-05-14 21:21     ` Hugo Mills
2017-05-14 23:16       ` Marc MERLIN
2017-05-15  8:14         ` Hugo Mills
2017-05-15 11:30           ` Lionel Bouton
2017-05-15 12:34           ` Austin S. Hemmelgarn
2017-05-14 21:22     ` Kai Krakow
