* Did btrfs filesystem defrag just make things worse?
@ 2015-07-10 20:57 Donald Pearson
2015-07-11 2:52 ` Chris Murphy
2015-07-11 4:30 ` Duncan
0 siblings, 2 replies; 6+ messages in thread
From: Donald Pearson @ 2015-07-10 20:57 UTC (permalink / raw)
To: Btrfs BTRFS
If I'm reading this right, my most fragmented file
(Training-flat.vmdk) is now almost 3x more fragmented?
[root@san01 tank]# filefrag /mnt2/tank/virtual_machines/virtual_machines/Training/*
/mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk: 1444 extents found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.nvram: 1 extent found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmdk: 1 extent found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmsd: 0 extents found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmx: 1 extent found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmxf: 1 extent found
/mnt2/tank/virtual_machines/virtual_machines/Training/vmware-1.log: 4 extents found
/mnt2/tank/virtual_machines/virtual_machines/Training/vmware.log: 3 extents found
[root@san01 tank]# lsattr /mnt2/tank/virtual_machines/virtual_machines/Training/
--------c------C /mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmxf
--------c------C /mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmsd
--------c------C /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk
--------c------C /mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmdk
--------c------C /mnt2/tank/virtual_machines/virtual_machines/Training/Training.nvram
--------c------C /mnt2/tank/virtual_machines/virtual_machines/Training/vmware-1.log
--------c------C /mnt2/tank/virtual_machines/virtual_machines/Training/vmware.log
--------c------C /mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmx
[root@san01 tank]# cd ~
[root@san01 ~]# cd git
[root@san01 git]# cd btrfs-progs/
[root@san01 btrfs-progs]# ./btrfs fi defragment
btrfs filesystem defragment: too few arguments
usage: btrfs filesystem defragment [options] <file>|<dir> [<file>|<dir>...]
Defragment a file or a directory
-v be verbose
-r defragment files recursively
-c[zlib,lzo] compress the file while defragmenting
-f flush data to disk immediately after defragmenting
-s start defragment only from byte onward
-l len defragment only up to len bytes
-t size minimal size of file to be considered for defragmenting
[root@san01 btrfs-progs]# ./btrfs fi defragment -vr /mnt2/tank/virtual_machines
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmxf
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmsd
/mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmdk
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.nvram
/mnt2/tank/virtual_machines/virtual_machines/Training/vmware-1.log
/mnt2/tank/virtual_machines/virtual_machines/Training/vmware.log
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmx
/mnt2/tank/virtual_machines/no_vms_here!
/mnt2/tank/virtual_machines/archived/DonW7/DonW7-disk1.vmdk
/mnt2/tank/virtual_machines/archived/DonW7/DonW7.mf
/mnt2/tank/virtual_machines/archived/DonW7/DonW7.ovf
btrfs-progs v4.1
[root@san01 btrfs-progs]# filefrag /mnt2/tank/virtual_machines/virtual_machines/Training/*
/mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk: 4090 extents found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.nvram: 1 extent found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmdk: 1 extent found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmsd: 0 extents found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmx: 1 extent found
/mnt2/tank/virtual_machines/virtual_machines/Training/Training.vmxf: 1 extent found
/mnt2/tank/virtual_machines/virtual_machines/Training/vmware-1.log: 4 extents found
/mnt2/tank/virtual_machines/virtual_machines/Training/vmware.log: 2 extents found
* Re: Did btrfs filesystem defrag just make things worse?
2015-07-10 20:57 Did btrfs filesystem defrag just make things worse? Donald Pearson
@ 2015-07-11 2:52 ` Chris Murphy
2015-07-11 4:30 ` Duncan
1 sibling, 0 replies; 6+ messages in thread
From: Chris Murphy @ 2015-07-11 2:52 UTC (permalink / raw)
To: Donald Pearson; +Cc: Btrfs BTRFS
Is compression enabled? That's the most likely reason: filefrag gives
unreliable results when compression is enabled.
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg26127.html
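A quick way to double-check the mount side (untested, adjust the mount
point) is something like:
  findmnt -t btrfs -o TARGET,OPTIONS    # look for compress= or compress-force=
Note the lowercase 'c' in your lsattr output also requests compression
per-file, independent of the mount option.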
Chris Murphy
* Re: Did btrfs filesystem defrag just make things worse?
2015-07-10 20:57 Did btrfs filesystem defrag just make things worse? Donald Pearson
2015-07-11 2:52 ` Chris Murphy
@ 2015-07-11 4:30 ` Duncan
2015-07-11 12:18 ` Donald Pearson
1 sibling, 1 reply; 6+ messages in thread
From: Duncan @ 2015-07-11 4:30 UTC (permalink / raw)
To: linux-btrfs
Donald Pearson posted on Fri, 10 Jul 2015 15:57:46 -0500 as excerpted:
> If I'm reading this right, my most fragmented file
> (Training-flat.vmdk) is now almost 3x more fragmented?
[snip to context for brevity]
> # filefrag /mnt2/tank/virtual_machines/virtual_machines/Training/*
> /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk:
> 1444 extents found
> # lsattr /mnt2/tank/virtual_machines/virtual_machines/Training/
> --------c------C
> /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk
> # ./btrfs fi defragment -vr /mnt2/tank/virtual_machines
> /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk
> btrfs-progs v4.1
> # filefrag /mnt2/tank/virtual_machines/virtual_machines/Training/*
> /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk:
> 4090 extents found
FWIW RE the lsattr results, the chattr manpage says c=compressed, C=nocow
(disabling copy-on-write).
The combination of the compressed and nocow attributes _seriously_
complicates the picture because their interaction is unknown,
given the information provided.
1) As the chattr manpage warns, on btrfs the C/nocow attribute should
be set on new/empty files. If it is set on a file that already has data
blocks, it is undefined when nocow actually takes effect and the file's
blocks become stable.
The sequence of events in your post doesn't detail when the file was
marked nocow, so we basically haven't a clue whether that nocow
attribute has actually taken effect and the file stabilized, or not.
2) On btrfs, nocow has the effect of disabling both compression and
checksumming. This is necessary because btrfs takes advantage of cow:
it only has to calculate checksums on newly written blocks, not the
whole previous extent, and newly written data may compress more or less
effectively than the old, changing the compressed size. That's no
problem with cow, since the new data is written elsewhere, but it's a
serious problem with nocow, since now you have new data that's likely
of a different size trying to fit into the same space as the old data.
So nocow would ordinarily override and disable compression, but
due to issue #1, we don't know whether nocow is actually on or not,
so we don't know if compression is actually disabled or not.
3) Filefrag doesn't know about btrfs compression, which works in
128 KiB blocks, and counts each 128 KiB block as a separate extent.
Due to #2, we don't know if the file is actually compressed or not,
and you didn't list the size of the file either, but if it /is/
compressed, we'd expect filefrag to report one fragment per 128 KiB,
1/8 MiB, so divide those 4090 reported extents by 8... 511+ MiB,
basically half a GiB.
If that's about the size of the file, then we can gather from
the evidence that:
a) The file isn't actually nocow as that would have disabled compression.
b) The file had the compression and nocow attributes set after creation,
so nocow didn't apply, and the defrag actually triggered the compression
of a previously uncompressed file, thereby causing filefrag to report
individual 128 KiB compression blocks as extents.
If the file is *NOT* about half a GiB in size, then something else
happened.
Among other things, your existing free space may be so fragmented that
defrag couldn't find sufficiently large blocks of space to reduce
fragmentation, so it actually got _more_ fragmented.
It's also worth noting that by default, defrag skips what it considers
"large" extents (IIRC 256 KiB... or was it 256 MiB?). Use the -t option
set to something like 2G to tell it to defrag everything. (Given btrfs'
nominal data chunk size of 1 GiB, going above that isn't going to do a
lot of good in any case, and don't use values above 3G, as an integer
overflow bug in btrfs-progs earlier than the just-released 4.1.1
effectively turns such values into very small ones instead.)
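That would look something like this (untested, same path as in your
session):
  ./btrfs fi defragment -vr -t 2G /mnt2/tank/virtual_machines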
Whatever the case, having both the c/compression and C/nocow attributes
on those files isn't a good idea, since one or the other will be
a lie. And if you really want nocow, as indicated above, the file
must be created with nocow. The easiest way to do that is to set nocow
on the directory the files will be created in. For existing files, after
the dir they will be placed in is set nocow, copy them into place in such
a way that the new copy is actually created. (The easiest two ways to
ensure that are (a) copy/move across a filesystem boundary, or (b) cat
the file into place using redirection: cat src > dest.)
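A rough, untested sketch of that, with a made-up directory name:
  mkdir /mnt2/tank/vm-nocow
  chattr +C /mnt2/tank/vm-nocow
  cat /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk \
      > /mnt2/tank/vm-nocow/Training-flat.vmdk
  lsattr /mnt2/tank/vm-nocow/Training-flat.vmdk    # should now show C but not c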
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Did btrfs filesystem defrag just make things worse?
2015-07-11 4:30 ` Duncan
@ 2015-07-11 12:18 ` Donald Pearson
2015-07-11 15:24 ` Duncan
0 siblings, 1 reply; 6+ messages in thread
From: Donald Pearson @ 2015-07-11 12:18 UTC (permalink / raw)
To: Duncan; +Cc: Btrfs BTRFS
Thanks for the replies!
I'll clear up the unknowns.
The nocow attribute was set on the folder prior to any data being written.
So based on your understanding the files are truly nocow and must therefore
be not compressed (an inherited attribute from the subvolume).
I think it's safe to say the file wasn't skipped if the number of extents
changed?
So if it isn't really compressed, and it wasn't skipped, is there any
reason to still think filefrag is confused about the results?
Total used space is about 1T and free space is approximately 9t.
Training-flat.vmdk is 60g
Thanks again,
Donald
On Fri, Jul 10, 2015 at 11:30 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Donald Pearson posted on Fri, 10 Jul 2015 15:57:46 -0500 as excerpted:
>
>> If I'm reading this right, my most fragmented file
>> (Training-flat.vmdk) is now almost 3x more fragmented?
>
> [snip to context for brevity]
>
>> # filefrag /mnt2/tank/virtual_machines/virtual_machines/Training/*
>> /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk:
>> 1444 extents found
>
>> # lsattr /mnt2/tank/virtual_machines/virtual_machines/Training/
>> --------c------C
>> /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk
>
>> # ./btrfs fi defragment -vr /mnt2/tank/virtual_machines
>> /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk
>> btrfs-progs v4.1
>
>> # filefrag /mnt2/tank/virtual_machines/virtual_machines/Training/*
>> /mnt2/tank/virtual_machines/virtual_machines/Training/Training-flat.vmdk:
>> 4090 extents found
>
> FWIW RE the lsattr results, the chattr manpage says c=compressed, C=nocow
> (disabling copy-on-write).
>
> The combination of the compressed and nocow attributes _seriously_
> complicates the picture because their interaction is unknown,
> given the information provided.
>
> 1) As the chattr manpage warns, on btrfs the C/nocow attribute should
> be set on new/empty files. If it is set on a file that already has data
> blocks, it is undefined when nocow actually takes effect and the file's
> blocks become stable.
>
> The sequence of events in your post doesn't detail when the file was
> marked nocow, so we basically haven't a clue whether that nocow
> attribute has actually taken effect and the file stabilized, or not.
>
> 2) On btrfs, nocow has the effect of disabling both compression and
> checksumming. This is necessary because btrfs takes advantage of cow:
> it only has to calculate checksums on newly written blocks, not the
> whole previous extent, and newly written data may compress more or less
> effectively than the old, changing the compressed size. That's no
> problem with cow, since the new data is written elsewhere, but it's a
> serious problem with nocow, since now you have new data that's likely
> of a different size trying to fit into the same space as the old data.
>
> So nocow would ordinarily override and disable compression, but
> due to issue #1, we don't know whether nocow is actually on or not,
> so we don't know if compression is actually disabled or not.
>
> 3) Filefrag doesn't know about btrfs compression, which works in
> 128 KiB blocks, and counts each 128 KiB block as a separate extent.
>
> Due to #2, we don't know if the file is actually compressed or not,
> and you didn't list the size of the file either, but if it /is/
> compressed, we'd expect filefrag to report one fragment per 128 KiB,
> 1/8 MiB, so divide those 4090 reported extents by 8... 511+ MiB,
> basically half a GiB.
>
> If that's about the size of the file, then we can gather from
> the evidence that:
>
> a) The file isn't actually nocow as that would have disabled compression.
>
> b) The file had the compression and nocow attributes set after creation,
> so nocow didn't apply, and the defrag actually triggered the compression
> of a previously uncompressed file, thereby causing filefrag to report
> individual 128 KiB compression blocks as extents.
>
> If the file is *NOT* about half a GiB in size, then something else
> happened.
>
> Among other things, your existing free space may be so fragmented that
> defrag couldn't find sufficiently large blocks of space to reduce
> fragmentation, so it actually got _more_ fragmented.
>
> It's also worth noting that by default, defrag skips what it considers
> "large" extents (IIRC 256 KiB... or was it 256 MiB?). Use the -t option
> set to something like 2G to tell it to defrag everything. (Given btrfs'
> nominal data chunk size of 1 GiB, going above that isn't going to do a
> lot of good in any case, and don't use values above 3G, as an integer
> overflow bug in btrfs-progs earlier than the just-released 4.1.1
> effectively turns such values into very small ones instead.)
>
> Whatever the case, having both the c/compression and C/nocow attributes
> on those files isn't a good idea, since one or the other will be
> a lie. And if you really want nocow, as indicated above, the file
> must be created with nocow. The easiest way to do that is to set nocow
> on the directory the files will be created in. For existing files, after
> the dir they will be placed in is set nocow, copy them into place in such
> a way that the new copy is actually created. (The easiest two ways to
> ensure that are (a) copy/move across a filesystem boundary, or (b) cat
> the file into place using redirection: cat src > dest.)
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
>
* Re: Did btrfs filesystem defrag just make things worse?
2015-07-11 12:18 ` Donald Pearson
@ 2015-07-11 15:24 ` Duncan
2015-07-13 11:41 ` Austin S Hemmelgarn
0 siblings, 1 reply; 6+ messages in thread
From: Duncan @ 2015-07-11 15:24 UTC (permalink / raw)
To: linux-btrfs
Donald Pearson posted on Sat, 11 Jul 2015 07:18:00 -0500 as excerpted:
> The nocow attribute was set on the folder prior to any data being
> written. So based on your understanding the files are truly nocow and
> must therefore be not compressed (an inherited attribute from the
> subvolume).
>
> I think it's safe to say the file wasn't skipped if the number of
> extents changed?
True, but individual extents within the file may have been. Obviously
the whole file wasn't skipped, tho. And if the file is 60 gig, even the
higher number of extents filefrag is reporting isn't what it should be if
it's compressed and filefrag is mistaking compression blocks for
extents. So it's unlikely to be compressed.
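(At one reported extent per 128 KiB, a fully compressed 60 GiB file
would show on the order of 60 * 8192 = 491,520 extents, nowhere near
the 4090 you're seeing.)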
> So if it isn't really compressed, and it wasn't skipped, is there any
> reason to still think filefrag is confused about the results?
>
> Total used space is about 1T and free space is approximately 9t.
> Training-flat.vmdk is 60g
Replying either in-line/in-context, or under the quote, makes further
replies in context far easier...
If the file isn't compressed, then filefrag is likely correct. However,
if you're not either running with autodefrag set at mount, or regularly
defragging the entire filesystem, it's quite possible that defrag can't
find unfragmented space -- see the (possible) explanation below.
I'm not a coder, only a list regular and btrfs user, and I'm not sure on
this, but there have been several reports of this nature on the list
recently, and I have a theory. Maybe the devs can step in and either
confirm or shoot it down.
The theory is this.
Background: Btrfs allocates space in two stages, first allocating big
chunks (nominally 1 GiB for data, 256 MiB for metadata, of course it's
data chunks we're talking here) from unallocated space, then placing
files in those chunks until they are full and more chunks need to be allocated.
These 1 GiB data chunks, BTW, mean that the best-case for a 60 GiB file
is likely to be 60 or 61 extents with each one taking a full chunk,
possibly with the first and/or the last not being a full chunk.
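(You can see that two-stage allocation with something like
btrfs filesystem df /mnt2/tank -- assuming that's your mountpoint --
which reports allocated chunk space (total) versus what's actually used
within it; newer btrfs-progs also have btrfs filesystem usage.)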
Of course during normal use, files get deleted as well, thereby clearing
space in existing chunks. But this space will be fragmented, with a mix
of unallocated extents and still remaining files. The allocator will I
/believe/ (this is where people who can actually read the code come in)
try to use up space in existing chunks before allocating additional
space, possibly subject to some reasonable extent minimum size, below
which btrfs will simply allocate another chunk.
Obviously, if you always mount with autodefrag, both the file and space
fragmentation should be kept reasonably low, as files will be rewritten
if autodefrag detects too much fragmentation as they are written.
(Actually, autodefrag doesn't rewrite directly, it schedules the file for
rewrite by a separate cleanup task that follows behind.) But with normal
filesystem activity, file deletion, partial copy-on-write file-rewrite,
space fragmentation, and therefore file fragmentation, is still going to
happen to some extent over time, it'll simply take longer.
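(Concretely, autodefrag is a mount option, e.g. mounting with
-o autodefrag or adding autodefrag to that filesystem's fstab options;
it isn't something you run by hand.)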
A regularly scheduled defrag of the whole filesystem should similarly
help with the problem, tho I don't think it's likely to be quite as good
as autodefrag, as the time between fragmentation and rewrite will be
longer, thus allowing more time for space fragmentation as well.
Then we have defrag itself. In theory, it can prioritize one of two
things:
1) Prioritize reduced fragmentation, at the expense of higher data chunk
allocation. In the extreme, this would mean always choosing to allocate
a new chunk and use it if the file (or remainder of the file not yet
defragged) was larger than the largest free extent in existing data
chunks.
The problem with this is that over time, the number of partially used
data chunks goes up as new ones are allocated to defrag into, but sub-1
GiB files that are already defragged are left where they are. Of course
a balance can help here, by combining multiple partial chunks into fewer
full chunks, but unless a balance is run...
2) Prioritize chunk utilization, at the expense of leaving some
fragmentation, despite massive amounts of unallocated space.
This is what I've begun to suspect defrag does. With a bunch of free but
fragmented space in existing chunks, defrag could actually increase
fragmentation, as the space in existing chunks is so fragmented a rewrite
is forced to use more, smaller extents, because that's all there is free,
until another chunk is allocated.
As I mentioned above for normal file allocation, it's quite possible that
there's some minimum extent size (greater than the bare minimum 4 KiB
block size) where the allocator will give up and allocate a new data
chunk, but if so, perhaps this size needs to be bumped upward, as it seems a
bit low, today.
Meanwhile, there's a number of exacerbating factors to consider as well.
* Snapshots and other shared references lock extents in place.
Defrag doesn't touch anything but the subvolume it's actually pointed at
for the defrag. Other subvolumes and shared-reference files will
continue to keep the extents they reference locked in place. And COW
will rewrite blocks of a file, but the old reference extent remains
locked, until all references to it are cleared -- the entire file (or at
least all blocks that were in that extent) must be rewritten, and no
snapshots or other references to it remain, before it can be freed.
For a few kernel cycles btrfs had snapshot-aware-defrag, but that
implementation didn't scale well at all, so it was disabled until it
could be rewritten, and that rewrite hasn't occurred yet. So snapshot-
aware-defrag remains disabled, and defrag only works on the subvolume
it's actually pointed at.
As a result, if defrag rewrites a snapshotted file, it actually doubles
the space that file takes, as it makes a new copy, breaking the reference
link between it and the copy in the snapshot.
Of course, with the space not freed up, this will, over time, tend to
fragment space that is freed even more heavily.
* Chunk reclamation.
This is the relatively new development that I think is triggering the
surge in defrag not defragging reports we're seeing now.
Until quite recently, btrfs could allocate new chunks, but it couldn't,
on its own, deallocate empty chunks. What tended to happen over time was
that people would find all the filesystem space taken up by empty or
mostly empty data chunks, and btrfs would start spitting ENOSPC errors
when it needed to allocate new metadata chunks but couldn't, as all the
space was in empty data chunks. A balance could fix it, often relatively
quickly with a -dusage=0 or -dusage=10 filter or the like, but it was a
manual process, btrfs wouldn't do it on its own.
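For reference, that manual fix would be something like this (mountpoint
assumed to be /mnt2/tank):
  btrfs balance start -dusage=0 /mnt2/tank    # drop completely empty data chunks
  btrfs balance start -dusage=10 /mnt2/tank   # also repack chunks under 10% used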
Recently the devs (mostly) fixed that, and btrfs will automatically
reclaim entirely empty chunks on its own now. It still doesn't reclaim
partially empty chunks automatically; a manual rebalance must still be
used to combine multiple partially empty chunks into fewer full chunks;
but it does well enough to make the previous problem pretty rare -- we
don't see the hundreds of GiB of empty data chunks allocated any more,
like we used to.
Which fixed the one problem, but if my theory is correct, it exacerbated
the defrag issue, which I think was there before but seldom triggered so
it generally wasn't noticed.
What I believe is happening now compared to before, based on the rash of
reports we're seeing, is that before, space fragmentation in allocated
data chunks seldom became an issue, because people tended to accumulate
all these extra empty data chunks, leaving defrag all that unfragmented
empty space to rewrite the new extents into as it did the defrag.
But now, all those empty data chunks are reclaimed, leaving defrag only
the heavily space-fragmented partially used chunks. So now we're getting
all these reports of defrag actually making the problem worse, not better!
Again, I'm not a dev and thus can't simply look at the code to see. But
to my sysadmin's troubleshooting eye the theory fits the observed
behavior.
If this theory is correct, now that btrfs does chunk reclaim, defrag needs
to be rewritten to more heavily prioritize actual defrag, at the expense
of allocating additional data chunks when necessary.
Tho it's likely that going to the always-allocate extreme isn't ideal
either, and that the better fix is adding (or adjusting upward) a minimum
extent size which, if it can't be satisfied in existing chunks, triggers
allocation of a new chunk to write the defragged extent into (provided
there's unallocated space from which to allocate it, of course).
Devs? Impossible hogwash based on the code, or actually plausible?
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Did btrfs filesystem defrag just make things worse?
2015-07-11 15:24 ` Duncan
@ 2015-07-13 11:41 ` Austin S Hemmelgarn
0 siblings, 0 replies; 6+ messages in thread
From: Austin S Hemmelgarn @ 2015-07-13 11:41 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 2015-07-11 11:24, Duncan wrote:
> I'm not a coder, only a list regular and btrfs user, and I'm not sure on
> this, but there have been several reports of this nature on the list
> recently, and I have a theory. Maybe the devs can step in and either
> confirm or shoot it down.
While I am a coder, I'm not a BTRFS developer, so what I say below may
still be incorrect.
>
[...trimmed for brevity...]
> Of course during normal use, files get deleted as well, thereby clearing
> space in existing chunks. But this space will be fragmented, with a mix
> of unallocated extents and still remaining files. The allocator will I
> /believe/ (this is where people who can actually read the code come in)
> try to use up space in existing chunks before allocating additional
> space, possibly subject to some reasonable extent minimum size, below
> which btrfs will simply allocate another chunk.
AFAICT, this is in fact the case.
>
> 1) Prioritize reduced fragmentation, at the expense of higher data chunk
> allocation. In the extreme, this would mean always choosing to allocate
> a new chunk and use it if the file (or remainder of the file not yet
> defragged) was larger than the largest free extent in existing data
> chunks.
>
> The problem with this is that over time, the number of partially used
> data chunks goes up as new ones are allocated to defrag into, but sub-1
> GiB files that are already defragged are left where they are. Of course
> a balance can help here, by combining multiple partial chunks into fewer
> full chunks, but unless a balance is run...
>
> 2) Prioritize chunk utilization, at the expense of leaving some
> fragmentation, despite massive amounts of unallocated space.
>
> This is what I've begun to suspect defrag does. With a bunch of free but
> fragmented space in existing chunks, defrag could actually increase
> fragmentation, as the space in existing chunks is so fragmented a rewrite
> is forced to use more, smaller extents, because that's all there is free,
> until another chunk is allocated.
>
> As I mentioned above for normal file allocation, it's quite possible that
> there's some minimum extent size (greater than the bare minimum 4 KiB
> block size) where the allocator will give up and allocate a new data
> chunk, but if so, perhaps this size needs to be bumped upward, as it seems a
> bit low, today.
If I'm reading the code correctly, defrag does indeed try to avoid
allocating a new chunk if at all possible.
>
>
> Meanwhile, there's a number of exacerbating factors to consider as well.
>
> * Snapshots and other shared references lock extents in place.
>
> Defrag doesn't touch anything but the subvolume it's actually pointed at
> for the defrag. Other subvolumes and shared-reference files will
> continue to keep the extents they reference locked in place. And COW
> will rewrite blocks of a file, but the old reference extent remains
> locked, until all references to it are cleared -- the entire file (or at
> least all blocks that were in that extent) must be rewritten, and no
> snapshots or other references to it remain, before it can be freed.
>
> For a few kernel cycles btrfs had snapshot-aware-defrag, but that
> implementation didn't scale well at all, so it was disabled until it
> could be rewritten, and that rewrite hasn't occurred yet. So snapshot-
> aware-defrag remains disabled, and defrag only works on the subvolume
> it's actually pointed at.
>
> As a result, if defrag rewrites a snapshotted file, it actually doubles
> the space that file takes, as it makes a new copy, breaking the reference
> link between it and the copy in the snapshot.
>
> Of course, with the space not freed up, this will, over time, tend to
> fragment space that is freed even more heavily.
To mitigate this, one can run offline data deduplication (duperemove is
the tool I'd suggest for this), although there are caveats to doing that
as well.
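(Something like duperemove -dr /mnt2/tank/virtual_machines would be a
typical invocation -- the -d flag is what actually submits the dedupe
requests, without it the tool only reports duplicates -- but treat that
as a starting point rather than a recipe.)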
>
> * Chunk reclamation.
>
> This is the relatively new development that I think is triggering the
> surge in defrag not defragging reports we're seeing now.
>
> Until quite recently, btrfs could allocate new chunks, but it couldn't,
> on its own, deallocate empty chunks. What tended to happen over time was
> that people would find all the filesystem space taken up by empty or
> mostly empty data chunks, and btrfs would start spitting ENOSPC errors
> when it needed to allocate new metadata chunks but couldn't, as all the
> space was in empty data chunks. A balance could fix it, often relatively
> quickly with a -dusage=0 or -dusage=10 filter or the like, but it was a
> manual process, btrfs wouldn't do it on its own.
>
> Recently the devs (mostly) fixed that, and btrfs will automatically
> reclaim entirely empty chunks on its own now. It still doesn't reclaim
> partially empty chunks automatically; a manual rebalance must still be
> used to combine multiple partially empty chunks into fewer full chunks;
> but it does well enough to make the previous problem pretty rare -- we
> don't see the hundreds of GiB of empty data chunks allocated any more,
> like we used to.
>
> Which fixed the one problem, but if my theory is correct, it exacerbated
> the defrag issue, which I think was there before but seldom triggered so
> it generally wasn't noticed.
>
> What I believe is happening now compared to before, based on the rash of
> reports we're seeing, is that before, space fragmentation in allocated
> data chunks seldom became an issue, because people tended to accumulate
> all these extra empty data chunks, leaving defrag all that unfragmented
> empty space to rewrite the new extents into as it did the defrag.
>
> But now, all those empty data chunks are reclaimed, leaving defrag only
> the heavily space-fragmented partially used chunks. So now we're getting
> all these reports of defrag actually making the problem worse, not better!
I believe that this is in fact the root cause. Personally, I would love
to be able to turn this off without having to patch the kernel. Since
it went in, not only does it (apparently) cause issues with defrag, but
DISCARD/TRIM support is broken, and most of my (heavily rewritten)
filesystems are running noticeably slower as well. I'm going to start a
discussion regarding this in another thread, however, as it doesn't just
affect defrag.