From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mail-io0-f175.google.com ([209.85.223.175]:34416 "EHLO mail-io0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751295AbdDQRNs (ORCPT ); Mon, 17 Apr 2017 13:13:48 -0400
Received: by mail-io0-f175.google.com with SMTP id a103so160810450ioj.1 for ; Mon, 17 Apr 2017 10:13:48 -0700 (PDT)
Subject: Re: Btrfs/SSD
To: Chris Murphy 
References: 
Cc: Imran Geriskovan , Btrfs BTRFS 
From: "Austin S. Hemmelgarn" 
Message-ID: <8f046fa5-a458-9db8-b616-907afd34383b@gmail.com>
Date: Mon, 17 Apr 2017 13:13:39 -0400
MIME-Version: 1.0
In-Reply-To: 
Content-Type: text/plain; charset=utf-8; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID: 

On 2017-04-17 12:58, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn wrote:
>
>> Regarding BTRFS specifically:
>> * Given my recently newfound understanding of what the 'ssd' mount option
>> actually does, I'm inclined to recommend that people who are using high-end
>> SSDs _NOT_ use it, as it will heavily increase fragmentation and will likely
>> have near-zero impact on actual device lifetime (but may _hurt_
>> performance). It will still probably help with mid- and low-end SSDs.
>
> What is a high end SSD these days? Built-in NVMe?

One with a good FTL in the firmware. At minimum, the good Samsung EVO drives, the high-quality Intel ones, and the Crucial MX series, but probably some others. My choice of words here probably wasn't the best, though.

>> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
>> performance for BTRFS on SSDs, and appear to reduce the lifetime of the
>> SSD.
>
> Can you elaborate? It's an interesting problem: on a small scale, the
> systemd folks have journald set +C on /var/log/journal so that any new
> journals are nocow. There is an initial fallocate, but the write
> behavior is writing in the same place at the head and tail.
> But at the tail, the writes get pushed toward the middle. So the file is
> growing into its fallocated space from the tail. The header changes in
> the same location; it's an overwrite.

For a normal filesystem, or for BTRFS with nodatacow or NOCOW, the block gets rewritten in place. This means that cheap FTLs will rewrite that erase block in place (which won't hurt performance but will impact device lifetime), while good ones will rewrite into a free block somewhere else but may not free the original block for quite some time (which is bad for performance but slightly better for device lifetime).

When BTRFS does a COW operation on a block, however, it guarantees that the block moves. Because of this, the old location will either:
1. Be discarded by the FS itself if the 'discard' mount option is set.
2. Be caught by a scheduled call to 'fstrim'.
3. Lie dormant for at least a while.

The first case is ideal for most FTLs, because it lets them know immediately that the data isn't needed and the space can be reused. The second is close to ideal, but defers telling the FTL that the block is unused, which can actually be better on some SSDs (some have firmware that handles wear-leveling better in batches). The third is not ideal, but is still better than what happens with NOCOW or nodatacow set.

Overall, this boils down to the fact that most FTLs get slower if they can't wear-level the device properly, and in-place rewrites make it harder for them to do proper wear-leveling.

> So long as this file is not reflinked or snapshotted, filefrag shows a
> pile of mostly 4096-byte blocks, thousands of them. But as they're
> pretty much all contiguous, the file fragmentation (extent count) is
> usually never higher than 12. It meanders between 1 and 12 extents for
> its life.
>
> Except on the system using the ssd_spread mount option. That one has a
> journal file that is +C, is not being snapshotted, but has over 3000
> extents per filefrag and btrfs-progs/debugfs. Really weird.
Given how the 'ssd' mount option behaves and the frequency with which most systemd instances write to their journals, that's actually reasonably expected. We look for big chunks of free space to write into and then align to 2M regardless of the actual size of the write, which in turn means that files like the systemd journal, which see lots of (relatively speaking) small writes, will have far more extents than they should until you defragment them.

> Now, systemd aside, there are databases that behave this same way,
> where there's a small section constantly being overwritten, and one or
> more sections that grow the database file from within and at the end.
> If this is made cow, the file will absolutely fragment a ton, and
> especially if the changes are mostly 4KiB block sizes that are then
> fsync'd.
>
> It's almost like we need these things to not fsync at all, and just
> rely on the filesystem commit time...

Essentially yes, but that causes all kinds of other problems.
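For anyone wanting to experiment with the options discussed in this thread, here is a rough /etc/fstab sketch. The device names and mount points are placeholders, not from the thread, and which variant actually helps depends entirely on how good the drive's FTL is, per the discussion above:

```
# /etc/fstab sketch -- devices and mount points are placeholders.

# High-end SSD (good FTL): skip the 'ssd' allocator heuristics, and batch
# discards with a scheduled fstrim instead of the 'discard' mount option.
/dev/nvme0n1p2  /data  btrfs  defaults,nossd,noatime        0  0

# Mid/low-end SSD: keep 'ssd', and let the filesystem issue discards
# immediately so the FTL learns about freed blocks right away.
/dev/sda2       /data  btrfs  defaults,ssd,discard,noatime  0  0
```

The batched trim for the first case can then be done with a periodic 'fstrim /data' from cron or systemd's fstrim.timer; as noted above, some firmware handles wear-leveling better when discards arrive in batches.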