linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Goffredo Baroncelli <kreijack@libero.it>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Cc: systemd-devel@lists.freedesktop.org
Subject: Re: Slow startup of systemd-journal on BTRFS
Date: Sat, 14 Jun 2014 09:52:39 +0200	[thread overview]
Message-ID: <539BFF47.8060006@libero.it> (raw)
In-Reply-To: <pan$625ac$a8aa7477$d0179ebe$c66ba817@cox.net>

On 06/14/2014 04:53 AM, Duncan wrote:
> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
> excerpted:
> 
>> On 06/13/2014 01:24 AM, Dave Chinner wrote:
>>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>>>>
>>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>>>> actually pretty much equally bad without NOCOW set on the file.
>>>
>>> So maybe it's been fixed in systemd since the last time I looked.
>>> Yup:
>>>
>>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-
> file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>>>
>>> The reason it was changed? To "save a syscall per append", not to
>>> prevent fragmentation of the file, which was the problem everyone was
>>> complaining about...
>>
>> thanks for pointing that. However I am performing my tests on a fedora
>> 20 with systemd-208, which seems have this change
>>>
>>>> Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for
>>>> either 4 byte or 8 MiB file allocations is going to end up being the
>>>> same, forcing (repeated until full) rewrite of each 4 KiB block into
>>>> its own extent.
>>
>> I am reaching the conclusion that fallocate is not the problem. The
>> fallocate increase the filesize of about 8MB, which is enough for some
>> logging. So it is not called very often.
> 
> But... 
> 
> If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
> nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
> MiB and write that out.  So far so good as at that point the 8 MiB should 
> be a single extent.  But then, data gets written into 4 KiB blocks of 
> that 8 MiB one at a time, and because btrfs is COW, the new data in the 
> block must be written to a new location.
> 
> Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> block has been rewritten to a new location and is now an extent unto 
> itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> single 4 KiB block in size.

Several people pointed fallocate as the problem. But I don't understand the reason.
1) 8MB is a quite huge value, so fallocate is called (at worst) 1 time during the boot. Often never because the log are less than 8MB.
2) it is true that btrfs "rewrite" almost 2 times each 4kb page with fallocate. But the first time is a "big" write of 8MB; instead the second write would happen in any case. What I mean is that without the fallocate in any case journald would make small write.

To be honest, I fatigue to see the gain of having a fallocate on a COW filesystem... may be that I don't understand very well the fallocate() call.

> 
> =:^(
> 
> Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the 
> btrfs autodefrag mount option should at least catch that and rewrite 
> (again), this time sequentially.
> 
>> I have to investigate more what happens when the log are copied from
>> /run to /var/log/journal: this is when journald seems to slow all.
> 
> That's an interesting point.
> 
> At least in theory, during normal operation journald will write to 
> /var/log/journal, but there's a point during boot at which it flushes the 
> information accumulated during boot from the volatile /run location to 
> the non-volatile /var/log location.  /That/ write, at least, should be 
> sequential, since there will be > 4 KiB of journal accumulated that needs 
> to be transferred at once.  However, if it's being handled by the forced 
> pre-write fallocate described above, then that's not going to be the 
> case, as it'll then be a rewrite of already fallocated file blocks and 
> thus will get COWed exactly as I described above.
> 
> =:^(
> 
> 
>> I am prepared a PC which reboot continuously; I am collecting the time
>> required to finish the boot vs the fragmentation of the system.journal
>> file vs the number of boot. The results are dramatic: after 20 reboot,
>> the boot time increase of 20-30 seconds. Doing a defrag of
>> system.journal reduces the boot time to the original one, but after
>> another 20 reboot, the boot time still requires 20-30 seconds more....
>>
>> It is a slow PC, but I saw the same behavior also on a more modern pc
>> (i5 with 8GB).
>>
>> For both PC the HD is a mechanical one...
> 
> The problem's duplicable.  That's the first step toward a fix. =:^)
I Hope so
> 
>>> And that's now a btrfs problem.... :/
>>
>> Are you sure ?
> 
> As they say, "Whoosh!"
> 
[...] 
> Another alternative is that distros will start setting /var/log/journal 
> NOCOW in their setup scripts by default when it's btrfs, thus avoiding 
> the problem.  (Altho if they do automated snapshotting they'll also have 
> to set it as its own subvolume, to avoid the first-write-after-snapshot-
> is-COW problem.)  Well, that, and/or set autodefrag in the default mount 
> options.

Pay attention, that this remove also the checksum, which are very useful in a RAID configuration.

> 
> Meanwhile, there's some focus on making btrfs behave better with such 
> rewrite-pattern files, but while I think the problem can be made /some/ 
> better, hopefully enough that the defaults bother far fewer people in far 
> fewer cases, I expect it'll always be a bit of a sore spot because that's 
> just how the technology works, and as such, setting NOCOW for such files 
> and/or using autodefrag will continue to be recommended for an optimized 
> setup.
> 
> ---
> [1] "Properly" set NOCOW:  Btrfs doesn't guarantee the effectiveness of 
> setting NOCOW (chattr +C) unless the attribute is set while the file is 
> still zero size, effectively, at file creation.  The easiest way to do 
> that is to set NOCOW on the subdir that will contain the file, such that 
> when the file is created it inherits the NOCOW attribute automatically.
> 
> [2] File sizes upto 128 MiB ... and possibly upto 1 GiB.  Under 128 MiB 
> should be fine, over 1 GiB is known to cause issues, between the two is a 
> gray area that depends on the speed of the hardware and the incoming 
> write-stream.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

  reply	other threads:[~2014-06-14  7:48 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-12 11:13 R: Re: Slow startup of systemd-journal on BTRFS Goffredo Baroncelli <kreijack@libero.it>
2014-06-12 12:37 ` Duncan
2014-06-12 23:24   ` Dave Chinner
2014-06-13 22:19     ` Goffredo Baroncelli
2014-06-14  2:53       ` Duncan
2014-06-14  7:52         ` Goffredo Baroncelli [this message]
2014-06-15  5:43           ` Duncan
2014-06-15 22:39             ` [systemd-devel] " Lennart Poettering
2014-06-15 22:13           ` Lennart Poettering
2014-06-16  0:17             ` Russell Coker
2014-06-16  1:06               ` John Williams
2014-06-16  2:19                 ` Russell Coker
2014-06-16 10:14               ` Lennart Poettering
2014-06-16 10:35                 ` Russell Coker
2014-06-16 11:16                   ` Austin S Hemmelgarn
2014-06-16 11:56                 ` Andrey Borzenkov
2014-06-16 16:05                 ` Josef Bacik
2014-06-16 19:52                   ` Martin
2014-06-16 20:20                     ` Josef Bacik
2014-06-17  0:15                     ` Austin S Hemmelgarn
2014-06-17  1:13                     ` cwillu
2014-06-17 12:24                       ` Martin
2014-06-17 17:56                       ` Chris Murphy
2014-06-17 18:46                       ` Filipe Brandenburger
2014-06-17 19:42                         ` Goffredo Baroncelli
2014-06-17 21:12                   ` Lennart Poettering
2014-06-16 16:32             ` Goffredo Baroncelli
2014-06-16 18:47               ` Goffredo Baroncelli
2014-06-19  1:13             ` Dave Chinner
2014-06-14 10:59         ` Kai Krakow
2014-06-15  5:02           ` Duncan
2014-06-15 11:18             ` Kai Krakow
2014-06-15 21:45           ` Martin Steigerwald
2014-06-15 21:51             ` Hugo Mills
2014-06-15 22:43           ` [systemd-devel] " Lennart Poettering
2014-06-15 21:31         ` Martin Steigerwald
2014-06-15 21:37           ` Hugo Mills
2014-06-17  8:22           ` Duncan
  -- strict thread matches above, loose matches on Subject: below --
2014-06-11 21:28 Goffredo Baroncelli
2014-06-12  0:40 ` Chris Murphy
2014-06-12  1:18 ` Russell Coker
2014-06-12  4:39   ` Duncan
2014-06-12  1:21 ` Dave Chinner
2014-06-12  1:37   ` Dave Chinner
2014-06-12  2:32     ` Chris Murphy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=539BFF47.8060006@libero.it \
    --to=kreijack@libero.it \
    --cc=1i5t5.duncan@cox.net \
    --cc=kreijack@inwind.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=systemd-devel@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).