In-Reply-To: <8f046fa5-a458-9db8-b616-907afd34383b@gmail.com>
References: <8f046fa5-a458-9db8-b616-907afd34383b@gmail.com>
From: Chris Murphy
Date: Mon, 17 Apr 2017 12:34:17 -0600
Subject: Re: Btrfs/SSD
To: "Austin S. Hemmelgarn"
Cc: Chris Murphy, Imran Geriskovan, Btrfs BTRFS

On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn wrote:
>> What is a high end SSD these days? Built-in NVMe?
>
> One with a good FTL in the firmware. At minimum, the good Samsung EVO
> drives, the high quality Intel ones, and the Crucial MX series, but probably
> some others. My choice of words here probably wasn't the best though.

It's a confusing market that sorta defies figuring out what we've got.

I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
EVO+ SD Card in an Intel NUC. They use that same EVO branding on an
$11 SD Card.

And then there's the Samsung Electronics Co Ltd NVMe SSD Controller
SM951/PM951 in another laptop.

>> So long as this file is not reflinked or snapshot, filefrag shows a
>> pile of mostly 4096 byte blocks, thousands. But as they're pretty much
>> all continuous, the file fragmentation (extent count) is usually never
>> higher than 12. It meanders between 1 and 12 extents for its life.
>>
>> Except on the system using ssd_spread mount option. That one has a
>> journal file that is +C, is not being snapshot, but has over 3000
>> extents per filefrag and btrfs-progs/debugfs. Really weird.
>
> Given how the 'ssd' mount option behaves and the frequency that most systemd
> instances write to their journals, that's actually reasonably expected. We
> look for big chunks of free space to write into and then align to 2M
> regardless of the actual size of the write, which in turn means that files
> like the systemd journal which see lots of small (relatively speaking)
> writes will have way more extents than they should until you defragment
> them.

Nope. The first paragraph applies to the NVMe machine with the ssd
mount option. Few fragments.

The second paragraph applies to the SD Card machine with the ssd_spread
mount option. Many fragments.

These are different versions of systemd-journald, so I can't completely
rule out a difference in write behavior.

>> Now, systemd aside, there are databases that behave this same way
>> where there's a small section constantly being overwritten, and one or
>> more sections that grow the database file from within and at the end.
>> If this is made cow, the file will absolutely fragment a ton. And
>> especially if the changes are mostly 4KiB block sizes that then are
>> fsync'd.
>>
>> It's almost like we need these things to not fsync at all, and just
>> rely on the filesystem commit time...
>
> Essentially yes, but that causes all kinds of other problems.

Drat.

-- 
Chris Murphy
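
For anyone who wants to compare extent counts across machines the way
this thread does, here is a minimal Python sketch. It assumes filefrag
(from e2fsprogs) is installed and that its summary line has the form
"path: N extents found". The function names parse_extent_count and
extent_count are made up for illustration and are not part of any tool
mentioned above.

```python
import re
import subprocess

# Matches the summary line filefrag prints, e.g. "file: 12 extents found"
# or "file: 1 extent found". The exact wording is an assumption.
_EXTENTS_RE = re.compile(r":\s*(\d+) extents? found\s*$")


def parse_extent_count(filefrag_line: str) -> int:
    """Extract N from a 'path: N extent(s) found' summary line."""
    match = _EXTENTS_RE.search(filefrag_line)
    if match is None:
        raise ValueError(f"unrecognized filefrag output: {filefrag_line!r}")
    return int(match.group(1))


def extent_count(path: str) -> int:
    """Run filefrag on `path` and return the extent count it reports."""
    result = subprocess.run(
        ["filefrag", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if filefrag fails
    )
    # The summary is the last non-empty line of filefrag's output.
    lines = [line for line in result.stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("filefrag produced no output")
    return parse_extent_count(lines[-1])
```

Calling extent_count() on the journal file on each machine (possibly as
root, depending on file permissions) would give the two numbers being
compared here: a handful of extents on the ssd system versus thousands
on the ssd_spread one.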