Subject: Re: btrfs filesystem keeps allocating new chunks for no apparent reason
To: linux-btrfs@vger.kernel.org
References: <572D0C8B.8010404@mendix.com>
 <89a684c7-364e-f409-5348-bc0077fd438c@cn.fujitsu.com>
 <5b642448-951e-5b5e-1343-0299a950089c@mendix.com>
 <51778c0f-2720-1c2d-aba2-e22e5f4d3a3a@mendix.com>
 <4532f6ee-2a6e-412a-7230-edb76735d55f@mendix.com>
 <07a7f59e-64e0-4d09-5d32-01bc933fe38d@gmail.com>
 <20170410144533.664fc304@jupiter.sol.kaishome.de>
 <5488ea5a-b41c-5987-e664-ec17cf2d5e01@gmail.com>
 <20170410184444.08ced097@jupiter.sol.local>
 <20170410185437.235b3b86@jupiter.sol.kaishome.de>
 <7ea65b63-d399-c049-d466-681c1df2d025@gmail.com>
 <20170410201842.216893be@jupiter.sol.kaishome.de>
From: "Austin S. Hemmelgarn"
Date: Mon, 10 Apr 2017 15:43:57 -0400
In-Reply-To: <20170410201842.216893be@jupiter.sol.kaishome.de>

On 2017-04-10 14:18, Kai Krakow wrote:
> On Mon, 10 Apr 2017 13:13:39 -0400,
> "Austin S. Hemmelgarn" wrote:
>
>> On 2017-04-10 12:54, Kai Krakow wrote:
>>> On Mon, 10 Apr 2017 18:44:44 +0200,
>>> Kai Krakow wrote:
>>>
>>>> On Mon, 10 Apr 2017 08:51:38 -0400,
>>>> "Austin S. Hemmelgarn" wrote:
>>>>
>> [...]
>> [...]
>>>> [...]
>> [...]
>> [...]
>>>>
>>>> Did you put it in /etc/fstab only for the rootfs? If yes, it
>>>> probably has no effect. You would need to give it as rootflags on
>>>> the kernel cmdline.
>>>
>>> I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
>>> and f2fs know the flag. Kernel 4.10.
>>>
>>> So probably you're seeing a placebo effect. If you put lazytime for
>>> rootfs just only into fstab, it won't have an effect because on
>>> initial mount this file cannot be opened (for obvious reasons), and
>>> on remount, btrfs seems to happily accept lazytime but it has no
>>> effect. It won't show up in /proc/mounts. Try using it in rootflags
>>> kernel cmdline and you should see that the kernel won't accept the
>>> flag lazytime.
>> The command-line also rejects a number of perfectly legitimate
>> arguments that BTRFS does understand too though, so that's not much
>> of a test.
>
> Which are those? I didn't encounter any...

I'm not sure there are any anymore, but I know that a handful (mostly
really uncommon ones) used to be rejected (and BTRFS is not alone in
this respect, some of the more esoteric ext4 options aren't accepted on
the kernel command-line either). I know that, at a minimum, alloc_start,
check_int, and inode_cache did not work from the kernel command-line at
some point in the past.
>
>> I've just finished some quick testing though, and it looks
>> like you're right, BTRFS does not support this, which means I now
>> need to figure out what the hell was causing the IOPS counters in
>> collectd to change in rough correlation with remounting (especially
>> since it appears to happen mostly independent of the options being
>> changed).
>
> I think that noatime (which I remember you also used?), lazytime, and
> relatime are mutually exclusive: they all handle the inode updates.
> Maybe that is the effect you see?

They're not exactly exclusive.
The lazytime option will prevent changes to the mtime or atime fields in
a file from forcing inode write-out for up to 24 hours (if the inode
would be written out for some other reason, such as a file-size change
or the inode being evicted from the cache, then the timestamps will be
written too), but it does not change the value of the timestamps. So if
you have lazytime enabled and use touch to update the mtime on an
otherwise idle file, the mtime will still be correct as far as userspace
is concerned, as long as you don't crash before the update hits the disk
(and userspace will only see the discrepancy _after_ the crash).

By comparison, relatime causes the atime not to be updated at all if it
has changed in the last 24 hours, and noatime completely prevents atime
updates. In both cases, the atime isn't correct at all in userspace as
far as POSIX is concerned.

So, you have the following combinations:
* strictatime, nolazytime: Both atime and mtime updates happen, and are
  flushed to disk (almost) immediately.
* relatime, nolazytime (the upstream default): atime updates happen only
  if the atime hasn't changed in 24 hours, mtime updates happen as
  normal, and both types of update are flushed to disk (almost)
  immediately.
* noatime, nolazytime (the default on some specific kernels; this is
  easy to patch, so a lot of people who already carry custom patches and
  don't use mutt patch it): atime updates never happen, mtime updates
  happen as normal and are flushed to disk (almost) immediately.
* strictatime, lazytime: Both atime and mtime updates happen, but the
  actual update may not hit the disk for up to 24 hours (this will let
  mutt work correctly as long as your system shuts down cleanly, but
  still improve performance noticeably on at least ext4).
* relatime, lazytime: atime updates happen only if the atime hasn't
  changed in 24 hours, mtime updates happen as normal, and both may not
  hit the disk for up to 24 hours.
* noatime, lazytime (what I'm trying to run): atime updates never
  happen, mtime updates happen as normal, but may not hit the disk for
  up to 24 hours.

In essence, lazytime only impacts inode writeback (deferring it under
special circumstances), while {no,rel,strict}atime impacts the actual
value of the time-stamps.
>
>> This is somewhat disappointing though, as supporting this would
>> probably help with the write-amplification issues inherent in COW
>> filesystems. --
>
> Well, relatime is mostly the same thus not perfectly resembling the
> POSIX standard. I think the only software that relies on atime is
> mutt...

This very much depends on what you're doing. If you have a WORM
workload, then yeah, it's pretty much the same. If however you have
something like a database workload where a specific set of files get
internally rewritten regularly, then it actually has a measurable
impact.

As a very specific example, I run collectd on my systems using RRD files
as data storage. An RRD file is essentially a really fancy circular
buffer, so it remains fixed size but gets a _lot_ of internal rewrites
(by the way, if anyone wants to test fragmentation behavior on BTRFS,
RRD files are a great way to do it). Because of how I have things set
up, each file gets a batch of data points every 1-2 minutes. This in
turn means that the mtime is updating every 1-2 minutes for each of the
1000+ RRD files. In this case, writing out the timestamps results in an
overhead of roughly 256 bytes per file, which is about 0.1% based on the
average file size of roughly 169k. If I use noatime on this filesystem,
then it has near zero impact because the average number of times per
hour that these files are read is near zero.
Turning on lazytime, however, results in mtime updates getting deferred
until the hourly forced fssync for this filesystem hits (this is
something I'm doing, not the OS). That reduces the overhead by a factor
of roughly 45 (the average number of writes per file per hour) to about
0.003%, which is a pretty serious difference.
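
For anyone who wants to redo the arithmetic with their own numbers, here
is a minimal Python sketch of the estimate above. The function and
constant names are made up for illustration, and it assumes that "169k"
means 169 KiB and that every timestamp write-out costs a flat 256 bytes,
so treat the results as rough figures rather than measurements.

```python
def timestamp_overhead(bytes_per_writeout, file_size_bytes, writeouts_per_data_write=1.0):
    """Timestamp write-out cost per data write, as a fraction of the file size."""
    return bytes_per_writeout * writeouts_per_data_write / file_size_bytes


# Figures taken from the text; the KiB interpretation is an assumption.
TIMESTAMP_COST = 256        # bytes written per timestamp write-out
FILE_SIZE = 169 * 1024      # average RRD file size, assuming 169 KiB
WRITES_PER_HOUR = 45        # average data writes per file per hour

# Without lazytime, every data write also forces a timestamp write-out.
eager = timestamp_overhead(TIMESTAMP_COST, FILE_SIZE)

# With lazytime plus an hourly forced sync, only one timestamp write-out
# happens per hour, shared across all of that hour's data writes.
lazy = timestamp_overhead(TIMESTAMP_COST, FILE_SIZE, 1 / WRITES_PER_HOUR)

print(f"without lazytime:           {eager:.4%} per write")
print(f"lazytime with hourly sync:  {lazy:.4%} per write")
print(f"reduction factor:           {eager / lazy:.0f}x")
```

Changing WRITES_PER_HOUR shows how the benefit scales: the more often a
file is rewritten between forced syncs, the more timestamp write-outs
get folded into one.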