From: Gian-Carlo Pascutto <gcp@sjeng.org>
To: linux-btrfs@vger.kernel.org
Cc: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Subject: Re: Big disk space usage difference, even after defrag, on identical data
Date: Mon, 13 Apr 2015 13:32:17 +0200
Message-ID: <552BA941.1000409@sjeng.org>
In-Reply-To: <20150413040436.GB4711@hungrycats.org>

On 13-04-15 06:04, Zygo Blaxell wrote:

>> I would think that compression differences or things like
>> fragmentation or bookending for modified files shouldn't affect
>> this, because the first filesystem has been
>> defragmented/recompressed and didn't shrink.
>> 
>> So what can explain this? Where did the 66G go?
> 
> There are a few places:  the kernel may have decided your files are
> not compressible and disabled compression on them (some older kernels
> did this with great enthusiasm);

As stated in the previous mail, this is 3.19.1. Moreover, the data is
either uniformly compressible or not at all. Lastly, note that the
*exact same* mount options are being used on *the exact same kernel*
with *the exact same data*. Getting a different compressibility
decision given the same inputs would point to bugs.

> your files might have preallocated space from the fallocate system
> call (which disables compression and allocates contiguous space, so
> defrag will not touch it).

So defrag -clzo or -czlib won't actually re-compress mostly-contiguous
files? That's evil. I have no idea whether PostgreSQL allocates files
that way, though.
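
For reference, this is roughly what preallocation looks like from
userspace. It is only a sketch: it uses Python's os.posix_fallocate, the
helper names preallocate() and demo() are made up for illustration, and
it says nothing about what PostgreSQL actually does.

```python
import os
import tempfile

def preallocate(path, size_bytes):
    """Reserve size_bytes for path without writing any data.

    Returns (apparent size, bytes reported as allocated by stat).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to reserve the range [0, size_bytes).
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)
    st = os.stat(path)
    # st_blocks is counted in 512-byte units on Linux.
    return st.st_size, st.st_blocks * 512

def demo():
    # Hypothetical demo: preallocate 1 MiB in a scratch directory.
    with tempfile.TemporaryDirectory() as tmp:
        return preallocate(os.path.join(tmp, "prealloc.dat"), 1 << 20)
```

A file created like this has its full size reserved before any data is
written, which is the situation described in the quoted text above.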

> 'filefrag -v' can tell you if this is happening to your files.

Not sure how to interpret that. Without "-v", I see most of the (DB)
data has 2-5 extents per Gigabyte. A few have 8192 extents per Gigabyte.

In the copy that takes 66G less, every (compressible) file has about
8192 extents per Gigabyte, and the others have 5 or 6.

So you may be right that some DB files are "wedged" in a format that
btrfs can't compress. I forced the files to be rewritten (VACUUM FULL)
and that "fixed" the problem.
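
The 8192 figure is what one would expect if the data is stored in
128K extents, since 1 GiB / 128 KiB = 8192. A rough sketch of that
arithmetic follows; the function names and the 25% tolerance are my own
assumptions, and a badly fragmented uncompressed file could show a
similar extent density, so treat it as a hint only.

```python
GIB = 1024 ** 3
COMPRESSED_EXTENT_MAX = 128 * 1024  # 128 KiB, the compressed extent limit

def extents_per_gib(extent_count, file_size_bytes):
    """Normalize an extent count (e.g. from filefrag) to extents per GiB."""
    return extent_count * GIB / file_size_bytes

def looks_compressed(extent_count, file_size_bytes, tolerance=0.25):
    """Heuristic: extent density close to or above 1 GiB / 128 KiB."""
    expected = GIB / COMPRESSED_EXTENT_MAX  # 8192 extents per GiB
    density = extents_per_gib(extent_count, file_size_bytes)
    return density >= expected * (1 - tolerance)
```

With the numbers above, a 1 GiB file with 8192 extents passes the check
and one with 5 extents does not.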

> In practice database files take about double the amount of space
> they appear to because of extent shingling.

This is what I called "bookending" in the original mail; I didn't know
the correct name. I understand that doing updates can result in N^2/2
or thereabouts disk space usage. However:

> Defragmenting the files helps free space temporarily; however, space
> usage will quickly grow again until it returns to the steady state
> around 2x the file size.

As stated in the original mail, the filesystem was *freshly
defragmented* so that can't have been the cause.
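
To make the N^2/2 figure for bookending concrete, here is a toy model.
It assumes an extent stays fully allocated for as long as any single
block of the file still references it. That is a simplification, and
the function names are made up for illustration.

```python
def allocated_blocks(file_blocks, writes):
    """Return blocks held on disk after a series of (start, length) writes.

    Each write creates one new extent. An old extent is released only
    when no logical block of the file references it any more.
    """
    owner = [None] * file_blocks   # extent id referenced by each block
    extent_size = {}               # extent id -> size in blocks
    for extent_id, (start, length) in enumerate(writes):
        extent_size[extent_id] = length
        for block in range(start, start + length):
            owner[block] = extent_id
        # Drop extents that nothing references any more.
        live = {e for e in owner if e is not None}
        extent_size = {e: s for e, s in extent_size.items() if e in live}
    return sum(extent_size.values())

def worst_case_writes(n):
    """Write n blocks, then rewrite the first n-1, then n-2, ... down to 1."""
    return [(0, n - k) for k in range(n)]
```

For an 8-block file, the worst-case pattern pins 8 + 7 + ... + 1 = 36
blocks, i.e. N(N+1)/2, while rewriting the whole file in one go leaves
only 8 allocated.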

> Until this is fixed, the most space-efficient approach seems to be to
> force compression (so the maximum extent is 128K instead of 1GB)

Would that fix the problem with fallocate()d files?

-- 
GCP

