Subject: Re: csum failed root raveled during balance
To: "Austin S. Hemmelgarn", linux-btrfs@vger.kernel.org
References: <5B047817.3040106@gmail.com> <5B052068.5000608@gmail.com>
 <5B053DF2.4030301@gmail.com> <34e2ea9c-dc43-af9e-6de1-566e411f7660@gmail.com>
 <9aef47e7-7415-2465-6380-daafcba82e45@gmail.com>
From: ein
Message-ID: <52ac7589-ff7f-79b3-3ce7-4c7e128d67a0@gmail.com>
Date: Tue, 29 May 2018 16:02:39 +0200
In-Reply-To: <9aef47e7-7415-2465-6380-daafcba82e45@gmail.com>

On 05/29/2018 02:12 PM, Austin S. Hemmelgarn wrote:
> On 2018-05-28 13:10, ein wrote:
>> On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote:
>>> On 2018-05-23 06:09, ein wrote:
>>>> On 05/23/2018 11:09 AM, Duncan wrote:
>>>>> ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:
>>>>>
>>>>>>> IMHO the best course of action would be to disable checksumming
>>>>>>> for your vm files.
>>>>>>
>>>>>> Do you mean the '-o nodatasum' mount flag? Is it possible to disable
>>>>>> checksumming for a single file by setting some magical chattr? Google
>>>>>> thinks it's not possible to disable csums for a single file.
>>>>>
>>>>> You can use nocow (-C), but of course that has other restrictions (like
>>>>> setting it on the files when they're zero-length, easiest done for
>>>>> existing data by setting it on the containing dir and copying files (no
>>>>> reflink) in) as well as the nocow effects, and nocow becomes cow1 after
>>>>> a snapshot (which locks the existing copy in place, so changes written
>>>>> to a block /must/ be written elsewhere, thus the cow1, aka cow the first
>>>>> time written after the snapshot but retain the nocow for repeated writes
>>>>> between snapshots).
>>>>>
>>>>> But if you're disabling checksumming anyway, nocow's likely the way
>>>>> to go.
>>>>
>>>> Disabling only checksumming may be a way to go - we live without it
>>>> every day. But nocow on the VM files defeats the whole purpose of using
>>>> BTRFS for me, even with the huge performance penalty - backup reasons -
>>>> I mean a few snapshots (20-30) plus send & receive.
>>>>
>>> Setting NOCOW on a file doesn't prevent it from being snapshotted, it
>>> just prevents COW operations from happening under most normal
>>> circumstances.  In essence, it prevents COW from happening except for
>>> writing right after the snapshot.  More specifically, the first write to
>>> a given block in a file set for NOCOW after taking a snapshot will
>>> trigger a _single_ COW operation for _only_ that block (unless you have
>>> autodefrag enabled too), after which that block will revert to not doing
>>> COW operations on write.  This way, you still get consistent and working
>>> snapshots, but you also don't take the performance hits from COW except
>>> right after taking a snapshot.
>>
>> Yeah, just after I posted it I found a Duncan post from 2015 explaining
>> it, thank you anyway.
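
For the record, my understanding of the procedure Duncan describes above
boils down to roughly the sketch below. The directory and file names are
only examples, not my actual layout:

  # NOCOW only takes effect for files created after the flag is set, so
  # set it on a fresh directory and copy the images in without reflinks.
  mkdir /var/lib/libvirt/images-nocow
  chattr +C /var/lib/libvirt/images-nocow
  cp --reflink=never /var/lib/libvirt/images/vm1.raw /var/lib/libvirt/images-nocow/
  lsattr /var/lib/libvirt/images-nocow/vm1.raw   # should list the 'C' attribute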

>> Is there anything we can do better in a random read/write VM workload to
>> speed BTRFS up, and why?
>>
>> My settings:
>>
>>   [...]
>>
>> /dev/mapper/raid10-images on /var/lib/libvirt type btrfs
>> (rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)
>>
>> md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
>>       468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
>>       bitmap: 0/4 pages [0KB], 65536KB chunk
>>
>> CPU: E3-1246 with VT-x, VT-d, HT, EPT, TSX-NI and AES-NI, on Debian's
>> kernel 4.15.0-0.bpo.2-amd64
>>
>> As far as I understand, compress and autodefrag impact performance
>> (latency) negatively, especially autodefrag. I also think that nodatacow
>> should speed things up, and that it's a must when using qemu on BTRFS.
>> Is it better to use virtio or virtio-scsi with TRIM support?
>>
> FWIW, I've been doing just fine without nodatacow, but I also use raw
> images contained in sparse files, and keep autodefrag off for the
> dedicated filesystem I put the images on.

So do I - RAW images created by qemu-img - but I am not sure whether
preallocation works as expected. The size of the disks in the filesystem
looks fine, though.

May I ask in what workloads? From my testing with VMs on BTRFS storage:

- file/web servers work perfectly on BTRFS.
- Windows (2012/2016) file servers with AD are perfect too, apart from the
  time required for Windows Update, but that service is... let's say not
  great to begin with.
- the database (Firebird) impact is huge; the guest filesystem is ext4 and
  the database performs slower under these conditions (4 SSDs in RAID10)
  than it did on a RAID1 of two 10k rpm SAS drives. I am still thinking
  about how to benchmark it properly. There is a lot of iowait in the
  host's kernel.

> Compression shouldn't have much in the way of negative impact unless
> you're also using transparent compression (or disk or file encryption)
> inside the VM (in fact, it may speed things up significantly depending on
> what filesystem is being used by the guest OS; the ext4 inode table in
> particular seems to compress very well).  If you are using `nodatacow`
> though, you can just turn compression off, as it's not going to be used
> anyway.  If you want to keep using compression, then I'd suggest using
> `compress-force` instead of `compress`, which makes BTRFS a bit more
> aggressive about trying to compress things, but makes the performance much
> more deterministic.  You may also want to look into using `zstd` instead
> of `lzo` for the compression; it gets better ratios most of the time, and
> usually performs better than `lzo` does.

Yeah, I do know the exact numbers, from the commit we both know:
https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git/commit/?h=next&id=5c1aab1dd5445ed8bdcdbb575abc1b0d7ee5b2e7

> Autodefrag should probably be off.  If you have nodatacow set (or just
> have all the files marked with the NOCOW attribute), then there's not
> really any point to having autodefrag on.  If like me you aren't turning
> off COW for data, it's still a good idea to have it off and just do batch
> defragmentation at a regularly scheduled time.
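
If I keep COW enabled, scheduled batch defragmentation along those lines
would look roughly like the sketch below; the path, extent target and
schedule are only examples, and defragmenting will unshare extents that are
currently shared with snapshots:

  #!/bin/sh
  # e.g. saved as /etc/cron.weekly/defrag-vm-images (example path)
  # Recursively defragment the VM image filesystem; -t sets the target
  # extent size, -v prints each file as it is processed.
  btrfs filesystem defragment -r -t 32M -v /var/lib/libvirt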

Well, at least I need to try nodatacow to check the impact.

> For the VM settings, everything looks fine to me (though if you have
> somewhat slow storage and aren't giving the VMs lots of memory to work
> with, doing write-through caching might be helpful).  I would probably be
> using virtio-scsi for the TRIM support, as with raw images you will get
> holes in the file where the TRIM command was issued, which can actually
> improve performance and does improve storage utilization (though doing
> batch trims instead of using the `discard` mount option is better for
> performance if you have Linux guests).

I don't consider these:

 4: Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q, SerialNo=S251NX0H822367V
21: Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q, SerialNo=S251NX0H822370A
38: Model=INTEL SSDSC2BP240G4, FwRev=L2010420, SerialNo=BTJR6141002D240AGN
55: Model=INTEL SSDSC2BP240G4, FwRev=L2010420, SerialNo=BTJR6063000F240AGN

...as slow in RAID 10 mode, because:

root@node0:~# time dd if=/dev/md1 of=/dev/null bs=4096 count=10M
10485760+0 records in
10485760+0 records out
42949672960 bytes (43 GB, 40 GiB) copied, 31.6336 s, 1.4 GB/s

real    0m31.636s
user    0m1.949s
sys     0m12.222s

root@node0:~# iostat -x 5 /dev/md1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.63    0.00    4.85    6.61    0.00   87.91

Device:  rrqm/s  wrqm/s      r/s   w/s       rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdc      306.80    0.00   672.00  0.00   333827.20    0.00   993.53     1.05   1.57    1.57    0.00   1.21  81.20
sdb      329.80    0.00   663.40  0.00   332640.00    0.00  1002.83     0.94   1.41    1.41    0.00   1.05  69.44
sdd      298.80    0.00   664.80  0.00   329110.40    0.00   990.10     1.00   1.50    1.50    0.00   1.22  80.96
sda      291.60    0.00   657.40  0.00   330297.60    0.00  1004.86     0.92   1.40    1.40    0.00   1.05  69.20
md1        0.00    0.00  3884.80  0.00  2693254.40    0.00  1386.56     0.00   0.00    0.00    0.00   0.00   0.00

It gave me well over 100k IOPS on the BTRFS filesystem, if I remember
correctly, in an fio random-workload benchmark (75% reads, 25% writes,
2 threads).

> You're using an MD RAID10 array.  This is generally the fastest option in
> terms of performance, but it also means you can't take advantage of BTRFS'
> self-repairing ability very well, and you may be wasting space and some
> performance (because you probably have the 'dup' profile set for
> metadata).  If it's an option, I'd suggest converting this to a BTRFS
> raid1 volume on top of two MD RAID0 volumes, which should either get the
> same performance, or slightly better performance, will avoid wasting space
> storing metadata, and will also let you take advantage of the self-repair
> functionality in BTRFS.

That's a good point.

> You should probably switch the `ssd` mount option to `nossd` (and then run
> a full recursive defrag on the volume, as this option affects the
> allocation policy, so the changes only take effect for new allocations).
> The SSD allocator can actually pretty significantly hurt performance in
> many cases, and has at best very limited benefits for device lifetimes
> (you'll maybe get another few months out of a device that will last for
> ten years without issue).  Make a point to test this though: because
> you're on a RAID array, this may actually be improving performance
> slightly.

Another good point. I am going to test the impact of the ssd parameter too;
I think that recreating the filesystem and copying the data back may be a
good idea.
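
If I recreate the filesystem anyway, the layout you suggest would look
roughly like the sketch below, combined with the compress-force=zstd and
nossd options you mentioned. The device names and pairing are only
illustrative (not my actual partitioning), and the sketch skips the LVM
layer that currently sits between md1 and BTRFS:

  # Two MD RAID0 arrays instead of one RAID10 (example member devices)
  mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/sda2 /dev/sdc1
  mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/sdb2 /dev/sdd1
  # BTRFS raid1 for both data and metadata across the two arrays, so BTRFS
  # keeps its self-repair ability and metadata is not duplicated on a
  # single device.
  mkfs.btrfs -d raid1 -m raid1 /dev/md2 /dev/md3
  mount -o noatime,nodiratime,compress-force=zstd,nossd /dev/md2 /var/lib/libvirt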

We should not care about the wearout: 20 users have been working on the
database every day (work week) for about a year, and:

Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BP240G4
233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0

and:

Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 PRO 256GB
177 Wear_Leveling_Count     0x0013   095   095   000    Pre-fail  Always       -       282

Which means about 50 more years for the Samsung 850 Pro and 100 years for
the Intel 730, which is interesting (btw. the start date is exactly the
same).

Thank you for sharing, Austin.

-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10