Re: price to pay for nocow file bit?

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Lennart Poettering <lennart@poettering.net>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: price to pay for nocow file bit?
Date: Thu, 8 Jan 2015 10:56:11 -0500
Message-ID: <20150108155610.GA12859@hungrycats.org>
In-Reply-To: <20150107174315.GA21865@gardel-login>

On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
> Heya!
> 
> Currently, systemd-journald's disk access patterns (appending to the
> end of files, then updating a few pointers in the front) result in
> awfully fragmented journal files on btrfs, which has a pretty
> negative effect on performance when accessing them.
> 
> Now, to improve things a bit, I yesterday made a change to journald,
> to issue the btrfs defrag ioctl when a journal file is rotated,
> i.e. when we know that no further writes will ever be done on the
> file. 
> 
> However, I wonder now if I should go one step further even, and use
> the equivalent of "chattr +C" (i.e. nocow) on all journal files. I am
> wondering what price I would precisely have to pay for
> that. Judging by this earlier thread:
> 
>         http://www.spinics.net/lists/linux-btrfs/msg33134.html
> 
> it's mostly about data integrity, which is something I can live with,
> given the conservative write patterns of journald, and the fact that
> we do our own checksumming and careful data validation. I mean, if
> btrfs in this mode provides no worse data integrity semantics than
> ext4 I am fully fine with losing this feature for these files.

This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.
This would work on ext4, xfs, and others, and provide the same benefit
(or even better) without filesystem-specific code.  journald would
preallocate a contiguous chunk past the end of the file for appends, and
on btrfs the first write to each block will not be COWed or compressed
(I'm hand-waving away some details here related to small writes, file
tails, and inline storage, but the end result is the same).  If there's a
configured target size for journals then allocate that amount; otherwise,
double the allocated size each time the visible file size reaches a power
of two so that the number of fragments is logarithmic over file size.
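
A minimal sketch of that doubling policy in C, in case it helps. The
function names (prealloc_target, journal_preallocate), the min_alloc
parameter, and the caller-tracked "allocated" counter are all hypothetical
and not taken from journald; the only real API used is Linux fallocate(2)
with FALLOC_FL_KEEP_SIZE. It assumes min_alloc is a power of two.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>      /* fallocate(), FALLOC_FL_KEEP_SIZE */
#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical policy helper: return the smallest value of the form
 * min_alloc * 2^k that is strictly greater than visible_size, so the
 * reservation doubles each time the visible size reaches the previous
 * target. Capped at 2^62 so the result always fits in a 64-bit off_t.
 */
uint64_t prealloc_target(uint64_t visible_size, uint64_t min_alloc)
{
    const uint64_t cap = UINT64_C(1) << 62;
    uint64_t target = min_alloc;

    assert(min_alloc > 0);
    while (target <= visible_size && target < cap)
        target *= 2;
    return target;
}

/*
 * Hypothetical wrapper: extend the reservation past EOF when needed.
 * *allocated is the caller's record of how many bytes are already
 * reserved. FALLOC_FL_KEEP_SIZE leaves the visible file size unchanged.
 * Returns 0 on success or when nothing needs doing, -1 with errno set
 * if fallocate() fails (e.g. EOPNOTSUPP on an unsupporting filesystem).
 */
int journal_preallocate(int fd, uint64_t visible_size,
                        uint64_t *allocated, uint64_t min_alloc)
{
    uint64_t target = prealloc_target(visible_size, min_alloc);

    if (target <= *allocated)
        return 0;               /* reservation is already large enough */

    if (fallocate(fd, FALLOC_FL_KEEP_SIZE,
                  (off_t)*allocated, (off_t)(target - *allocated)) != 0)
        return -1;

    *allocated = target;
    return 0;
}
```

The caller would invoke journal_preallocate() before each append with the
current visible size. With a 1 MiB minimum, the reservation would grow to
2 MiB when the file reaches 1 MiB, to 4 MiB at 2 MiB, and so on, which is
where the logarithmic fragment count comes from.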

This should get you what you want without all the dangerous messing around
with data integrity controls and defragmentation.  Defragmentation has a
number of negative side-effects of its own:  it searches for free space
aggressively and holds locks that can block writes for a long time (I've
learned the hard way that this can be over 20 minutes for a 1GB file, long
enough to trigger hardware watchdog resets).  There are some other good
reasons to never defragment, but they don't arise in journald's use cases.

I, for one, use btrfs scrub to detect data corruption that occurs during
early stages of disk failure.  I'd object strongly to applications
randomly turning off data integrity features without being explicitly
configured to do so, especially those that do most of the writing.
It would create areas of the disk that are blind spots when testing for
storage corruption errors, and in journald's case those blind spots would
be among the most significant sources of data about storage corruption.

I don't really care if applications can survive corrupted data--as the
owner of the storage, I need to be aware that storage-level corruption is
happening.  I don't want to have to test different areas of the filesystem
with a dozen different application-specific tools.  That particular
insanity is one of the reasons why I now use btrfs and not ext4.

> Hence I am mostly interested in what else is lost if this flag is
> turned on by default for all journal files journald creates: 
> 
> Does this have any effect on functionality? As I understood snapshots
> still work fine for files marked like that, and so do
> reflinks. Any drawback functionality-wise? Apparently file compression
> support is lost if the bit is set? (which I can live with too, journal
> files are internally compressed anyway)
> 
> What about performance? Do any operations get substantially slower by
> setting this bit? For example, what happens if I take a snapshot of
> files with this bit set and then modify the file, does this result in
> a full (and hence slow) copy of the file on that occasion? 
> 
> I am trying to understand the pros and cons of turning this bit on,
> before I can make this change. So far I see one big pro, but I wonder
> if there's any major con I should think about?
> 
> Thanks,
> 
> Lennart
