Subject: Re: Huge load on btrfs subvolume delete
From: "Austin S. Hemmelgarn"
To: Daniel Caillibaud, linux-btrfs@vger.kernel.org
Date: Mon, 15 Aug 2016 08:32:00 -0400
Message-ID: <95c98a1b-4484-c294-fcd6-001e73cd25d6@gmail.com>
In-Reply-To: <20160815123928.47bd2c03@asus17.lairdutemps.org>
References: <20160815123928.47bd2c03@asus17.lairdutemps.org>

On 2016-08-15 06:39, Daniel Caillibaud wrote:
> Hi,
>
> I'm a newbie with btrfs, and I have problems with high load after each btrfs subvolume delete.
>
> I use snapshots on LXC hosts under Debian Jessie with
> - kernel 4.6.0-0.bpo.1-amd64
> - btrfs-progs 4.6.1-1~bpo8
>
> For backups, each day, for each subvolume, I run
>
> btrfs subvolume snapshot -r $subvol $snap
> # then later
> ionice -c3 btrfs subvolume delete $snap
>
> but ionice doesn't seem to have any effect here, and after a few minutes the load grows
> quite high (30-40). I don't know how to make this deletion nicer with I/O.

Before I start suggesting possible solutions, it helps to explain what's actually happening here.

When you create a snapshot, BTRFS just scans down the tree for the subvolume in question and creates new references to everything in that subvolume in a separate tree. This is usually extremely fast, because all that needs to be done is updating metadata. When you delete a snapshot, however, BTRFS has to remove all of the snapshot's remaining references to the parent subvolume, and it also has to process any data that now differs from the parent subvolume for deletion, just like it would when deleting a file. As a result, the work to create a snapshot depends only on the complexity of the directory structure within the subvolume, while the work to delete it depends on both that and how much the snapshot has diverged from the parent subvolume.

The spike in load you're seeing is the filesystem handling all of that internal accounting in the background, and I'd be willing to bet that it varies based on how fast things are changing in the parent subvolume. Setting an idle I/O scheduling priority on the command that deletes the snapshot does nothing, because all that command does is tell the kernel to delete the snapshot; the actual deletion is handled inside the filesystem driver.

While it won't help with the spike in load, you probably want to add `--commit-after` to that subvolume deletion command. That will cause the spike to happen almost immediately, and the command won't return until the filesystem is finished with the accounting, so the load should be back to normal when it returns.
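In your script, that would look something like this (using the same $snap variable from your backup script; -c is the short form of --commit-after):

  # delete the snapshot, then wait for the transaction commit before returning
  btrfs subvolume delete --commit-after "$snap"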
> Is there a better way to do so?

While there isn't a better way that I know of, there are ways to reduce the impact by reducing how much you're backing up:

1. You almost certainly don't need to back up the logs, and if you do, they should probably be backed up independently of the rest of the system image. In most cases, logs just add size to a backup and have little value when you restore it. The simplest way to exclude them in your case is to make /var/log in the LXC containers a separate subvolume (see the sketch at the end of this mail). That excludes it from the snapshot used for the backup, which both speeds up the backup and reduces the amount of change relative to the parent that accumulates while the backup is running.

2. Assuming you're using a distribution compliant with the Filesystem Hierarchy Standard, there are a couple of directories you can safely exclude from all backups, simply because portable programs are designed to handle losing data from these directories gracefully. Such directories include /tmp, /var/tmp, and /var/cache, and they can be excluded the same way as /var/log.

3. Similar arguments apply to $HOME/.cache, which is essentially a per-user /var/cache. This is less likely to matter if you don't have individual users doing things on these systems.

4. Look for other similar areas you may be able to safely exclude. For example, I use Gentoo, and I build all my packages with external debugging symbols, which get stored in /usr/lib/debug. I only have this set up for convenience, so there's no point in backing it up; I can just rebuild the package to regenerate the debugging symbols if I ever need them after restoring from a backup. Similarly, I exclude any VCS repositories that I have copies of elsewhere, because I can just clone those copies again if I need them.

> Is it a bad idea to set ionice -c3 on the btrfs-transacti process which seems the one doing a
> lot of I/O ?

Yes. It's always a bad idea to mess with any scheduling properties other than CPU affinity for kernel threads (and even messing with CPU affinity is usually a bad idea). The btrfs-transaction kthread (the name gets cut off by the kernel's limit on task names) is a particularly bad one to touch, because it handles committing updates to the filesystem. Setting an idle scheduling priority on it would probably put you at severe risk of data loss, or cause your system to lock up.

> Actually my I/O priorities on the btrfs processes are
>
> ps x|awk '/[b]trfs/ {printf("%20s ", $NF); system("ionice -p" $1)}'
> [btrfs-worker] none: prio 4
> [btrfs-worker-hi] none: prio 4
> [btrfs-delalloc] none: prio 4
> [btrfs-flush_del] none: prio 4
> [btrfs-cache] none: prio 4
> [btrfs-submit] none: prio 4
> [btrfs-fixup] none: prio 4
> [btrfs-endio] none: prio 4
> [btrfs-endio-met] none: prio 4
> [btrfs-endio-met] none: prio 4
> [btrfs-endio-rai] none: prio 4
> [btrfs-endio-rep] none: prio 4
> [btrfs-rmw] none: prio 4
> [btrfs-endio-wri] none: prio 4
> [btrfs-freespace] none: prio 4
> [btrfs-delayed-m] none: prio 4
> [btrfs-readahead] none: prio 4
> [btrfs-qgroup-re] none: prio 4
> [btrfs-extent-re] none: prio 4
> [btrfs-cleaner] none: prio 0
> [btrfs-transacti] none: prio 0

Altogether, this is exactly what they should be on a normal kernel. Also, neat trick with awk to get that info; I'll have to remember that.
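P.S. Here's a rough sketch of how converting /var/log inside a container into its own subvolume could look, as mentioned in point 1 above. The $rootfs variable is just a placeholder for the container's root filesystem path; do this while the container is stopped so nothing is writing to the logs:

  # move the existing directory aside, create a subvolume in its place,
  # copy the contents back, then remove the old directory
  mv "$rootfs/var/log" "$rootfs/var/log.old"
  btrfs subvolume create "$rootfs/var/log"
  cp -a "$rootfs/var/log.old/." "$rootfs/var/log/"
  rm -rf "$rootfs/var/log.old"

Since snapshots don't descend into nested subvolumes, a read-only snapshot of the container's root subvolume will then no longer include /var/log.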