From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Issues with "no space left on device" maybe related to 3.13 and/or kvm disk image fragmentation
Date: Mon, 13 Jan 2014 07:21:26 +0000 (UTC) [thread overview]
Message-ID: <pan$7af36$542cb610$935a8f99$e4118763@cox.net> (raw)
In-Reply-To: 52D31FB5.1040906@kuther.net
Thomas Kuther posted on Mon, 13 Jan 2014 00:05:25 +0100 as excerpted:
>
[ Rearranged to standard quote/reply order so replies are in context.
Top-posting is irritating to try to reply to.]
> On 12.01.2014 at 21:24, Thomas Kuther wrote:
>>
>> I'm experiencing an interesting issue with the BTRFS filesystem on my
>> SSD drive. It first occured some time after the upgrade to kernel
>> 3.13-rc (-rc3 was my first 3.13-rc) but I'm not sure if it is related.
>>
>> The obvious symptoms are that services on my system started crashing
>> with "no space left on device" errors.
>>
>> └» mount |grep "/mnt/ssd"
>> /dev/sda2 on /mnt/ssd type btrfs
>> (rw,noatime,compress=lzo,ssd,noacl,space_cache)
>>
>> └» btrfs fi df /mnt/ssd
>> Data, single: total=113.11GiB, used=90.02GiB
>> System, DUP: total=64.00MiB, used=24.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=3.00GiB, used=2.46GiB
This shows only half the story, tho. You also need the output of btrfs fi
show /mnt/ssd. Btrfs fi show displays how much of the total available
space is chunk-allocated; btrfs fi df displays how much of the chunk-
allocation for each type is actually used. Only with both of them is the
picture complete enough to actually see what's going on.
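Concretely, the pair looks like this (your mount point; the numbers in the
output will of course differ):

```shell
# Total filesystem size vs. how much of it is chunk-allocated:
btrfs fi show /mnt/ssd
# How much of each allocated chunk type is actually used:
btrfs fi df /mnt/ssd
```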
>> I use snapper on two subvolumes of that BTRFS volume (/ and /home) -
>> each keeping 7 daily snapshots and up to 10 hourlys.
>>
>> When I saw those errors I started to delete most of the older
>> snapshots,
>> and the issue went away instantly, but this couldn't be a solution nor
>> a workaround.
>>
>> I do though have a "usual suspect" on that BTRFS volume. A KVM disk
>> image of a Win8 VM (I _need_ Adobe Lightroom)
>>
>> » lsattr /mnt/ssd/kvm-images/
>> ---------------C /mnt/ssd/kvm-images/Windows_8_Pro.qcow2
>>
>> So the image has CoW disabled. Now comes the interesting part:
>> I'm trying to copy off the image to my raid5 array (BTRFS ontop of a
>> mdraid 5 - absolutely no issues with that one), but the cp process
>> seems like it's stalled.
>>
>> After one hour the size of the destination copy is still 0 bytes. iotop
>> almost constantly show values like
>>
>> TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
>> 4636 be/4 tom 14.40 K/s 0.00 B/s 0.00 % 0.71 % cp
>> /mnt/ssd/kvm-images/Windows_8_Pro.qcow2 .
>>
>> It tries to read the file with some 14K/s and writes absolutely
>> nothing.
>>
>> Any idea what's going wrong here, or suggestions how to get that qcow
>> file copied off? I do have a backup, but honestly that one is quite
>> aged - so simply rm'ing it would be the very last thing I'd like to
>> try.
OK. There's a familiar known-troublesome pattern here that your
situation fits... with one difference that I had previously /thought/
would ameliorate the problem, but either you didn't catch the problem
soon enough, or the root issue is more complex than I at first understood
(quite possible, since while I'm a regular on the list and thus see the
common issues posted, I'm just a btrfs user/admin, not a dev, btrfs or
otherwise).
The base problem is that btrfs is normally a copy-on-write filesystem, and
frequently internally-rewritten files (as opposed to sequential-write,
append-only, or write-once-read-many files) are in general a COW
filesystem's worst case. The larger the file and the more frequently it's
partially rewritten, the worse it gets, since every small internal write
will COW the area being written to somewhere else, quickly fragmenting
large, routinely internally-written files such as VM images into hundreds
of thousands of extents! =:^(
In general, btrfs has two methods to help deal with that. For smaller
files the autodefrag mount option can help. For larger files autodefrag
can be a performance issue in itself due to write magnification (each
small internal write triggering a rewrite of the entire multi-gig file),
but there's the NOCOW extended attribute, which is what /has/ been
recommended for these things, as it's supposed to tell the filesystem to
do in-place rewrites instead of COW. That doesn't seem to have worked
for you, which is the interesting bit, but it's possible that's an
artifact of how it was handled. Additionally, there's the snapshot
aspect throwing further complexity into the works, as described below.
OK, so the file has NOCOW (the +C xattr) set, which is good. *BUT*,
when/how did you set it? On btrfs that can make all the difference!
The caveat with NOCOW on btrfs is that in order to be properly
effective, NOCOW must be set on the file when it's first created, before
there's actually any data in it. If the attribute is not set until
later, when the file is no longer zero-size, behavior isn't what one might
expect or desire -- simply stated, it doesn't work.
The simplest way to ensure that a file gets the NOCOW attribute set while
it's still empty is to set the attribute on the parent directory before
the file is created in the first place. Any newly created files will
then automatically inherit the directory's attribute, and thus will be
set NOCOW from the beginning.
A second method is to do it manually: first create the zero-length
file using touch, then set the NOCOW attribute using chattr +C, and
only /then/ copy the content into it. However, this is rather
difficult for files created by other processes, so the directory-
inheritance method is generally recommended as the simplest approach.
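As a sketch of both methods (paths are illustrative, assuming a directory
dedicated to VM images):

```shell
# Method 1: set NOCOW on the directory first; files created inside it
# inherit the +C attribute from the start.
mkdir -p /mnt/ssd/kvm-images
chattr +C /mnt/ssd/kvm-images

# Method 2: for an existing image, create the target empty, set +C while
# it is still zero-length, and only then copy the data in.
touch /mnt/ssd/kvm-images/new-image.qcow2
chattr +C /mnt/ssd/kvm-images/new-image.qcow2
cat /path/to/old-image.qcow2 > /mnt/ssd/kvm-images/new-image.qcow2

# Verify -- the C flag should appear in the attribute column:
lsattr /mnt/ssd/kvm-images/new-image.qcow2
```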
So now the question is, the file has NOCOW set as recommended, but was it
set before the file had content in it as required, or was NOCOW only set
later, on the existing file with its existing content, thus in practice
nullifying the effect of setting it at all?
Meanwhile, the other significant factor here is the snapshotting. In
VM-image cases *WITHOUT* the NOCOW xattr properly set, heavy snapshotting
of a filesystem with VM images is a known extreme worst case, with
*EXTREMELY* bad behavior that doesn't scale well at all: attempting to
work with such a file ties the filesystem up in huge knots, so that very
little forward progress can be made, period. We're talking days or even
weeks to do what /should/ have taken a few minutes, due to the *SEVERE*
scaling issues. They're working on the problem, but it's a tough one to
solve, and its scale only recently became apparent.
Actually, the current theory is that the recent changes to make defrag
snapshot-aware may have triggered the severe scaling issues we're seeing
now. Before that, the situation was bad, but apparently not horribly
terribly broken to the point of not working at all, as it is now.
But as I said, the previous recommendation has been to NOCOW the file to
prevent the problem from ever appearing in the first place.
Which you have apparently done, yet the problem is still there -- except
that we don't know yet whether you set NOCOW effectively, probably using
the inheritance method, or not. If you set it effectively, then the
problem is worse, MUCH worse, than thought, since the recommended
workaround doesn't actually work around it. But if you set it too late
to be effective, then the problem is simply another instance of the
already known issue.
As for how to manage the existing file, you seem to have figured that out
already, below...
>> PS: please reply-to-all, I'm not subscribed. Thanks.
OK. I'm doing so here, but please remind me in every reply.
FWIW, I read and respond to the list as a newsgroup using gmane.org's
list2news service and normally reply to the "newsgroup", which gets
forwarded to the list. So I'm not actually using a mail client but a
news client, and replying to both author and newsgroup/list isn't
particularly easy, nor do I do it often, so reminding with every reply
does help me remember.
> I did some more digging, and I think I have two maybe unrelated issues
> here.
>
> The "no space left on device" could be caused by the amount of metadata
> used. I defragmented the KVM image and other parts, ran a "balance start
> -dusage=5", and now it looks like
>
> └» btrfs fi df /
> Data, single: total=113.11GiB, used=88.83GiB
> System, DUP: total=64.00MiB, used=24.00KiB
> System, single: total=4.00MiB, used=0.00
> Metadata, DUP: total=3.00GiB, used=2.40GiB
Just as a hint, you can get rid of that extra system chunk (the empty
single one) by doing a balance with -s -f (system chunks only; the force
flag is required when balancing system chunks on their own, rather than
as part of a metadata balance). Since that's only a few KiB of actual
system data, it should go fast, and you won't have that second system
chunk in the display any more. =:^)
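Something like this, if memory serves (mount point from your setup; check
the btrfs-balance man page for your version, as the option syntax has
shifted between releases):

```shell
# Balance just the system chunks; -f (force) is required when filtering
# to system chunks alone, without also doing metadata.
btrfs balance start -f -s /mnt/ssd
# Afterward the empty "System, single" line should be gone:
btrfs fi df /mnt/ssd
```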
> The issue with copying/moving off the KVM image still remains. Using
> "cp" or "mv" hangs. Interestingly, what did work was using "qemu-img
> convert -O raw ..." so now I have a fresh backup at least. The VM works
> just fine with the original image file. I really wonder what goes wrong
> with cp and mv.
They're apparently getting caught up in that 100k-extents snapshot
scaling morass...
But *THANKS* for the qemu-img convert idea. I haven't set up any VMs here
so didn't know about that at all. At least now I can pass on something
that should actually let people get a backup to work with. =:^)
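For anyone else hitting this, the command shape would be roughly as follows
(source path from this thread; the destination path is purely illustrative):

```shell
# Read the qcow2 image through qemu-img instead of cp/mv, writing a raw
# copy to the backup filesystem:
qemu-img convert -O raw /mnt/ssd/kvm-images/Windows_8_Pro.qcow2 \
    /mnt/btrfs/backups/Windows_8_Pro.raw
```

Presumably qemu-img's own read pattern avoids whatever cp and mv are
getting stuck on.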
Meanwhile...
> And I stumbled over a third issue with my raid5 array:
> └» df -h|grep /mnt/btrfs
> /dev/md0 5,5T 3,4T 2,1T 63% /mnt/btrfs
> └» sudo btrfs fi df /mnt/btrfs/
> Data, single: total=3.33TiB, used=3.33TiB
> System, DUP: total=8.00MiB, used=388.00KiB
> System, single: total=4.00MiB, used=0.00
> Metadata, DUP: total=56.12GiB, used=5.14GiB
> Metadata, single: total=8.00MiB, used=0.00
Again, you can use balance to get rid of those unused single chunks.
They're currently an artifact from the creation of the filesystem due to
how mkfs.btrfs works at present, so I've started doing a balance
immediately after first mount to deal with them, before there's anything
on the filesystem so the balance goes real fast. =:^) 3+ TiB of data is
a little late for that, but you can balance metadata (and system) only,
at least.
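A sketch of that metadata-and-system-only balance (mount point as in your
df output above; again, verify the syntax against your btrfs-progs
version):

```shell
# Balance only the metadata chunks, leaving the 3+ TiB of data alone:
btrfs balance start -m /mnt/btrfs
# Then the system chunks (-f required when filtering to system alone):
btrfs balance start -f -s /mnt/btrfs
```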
> The array has been grown quite a while ago using "btrfs filesystem
> resize max", but "btrfs fi df" still shows the old data size. How could
> that happen?
As hinted at above, btrfs fi df <mntpnt> is only half the story,
displaying how much of currently allocated chunks are used and for what
(data/metadata/system/shared/etc). What it does *NOT* display is how
much of the total filesystem size is actually allocated in the first
place. That's where btrfs fi show <mntpnt> comes in. (Just btrfs fi
show, without the <mntpnt> parameter, works fine if you've only a single
btrfs or maybe a couple, but once you get a half dozen or so, adding the
<mntpnt>, just as you do for df, is useful to display just the one.)
Consider: On a single device btrfs, data is single mode by default, with
data chunks normally 1 GiB each, metadata is dup mode by default, with
metadata chunks normally 1/4 GiB (256 MiB), but due to dup mode, two of
them are allocated at a time, so half a GiB.
Given that, how do you represent unallocated space that could be
allocated as either data (single, takes the space of the size of the
data, or a bit less when compression is on) or metadata (dup, takes twice
as much space as the size of the actual metadata as there's two copies of
it), depending on what is needed?
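As a worked example of that arithmetic (the numbers are illustrative, not
from your filesystem):

```shell
# Hypothetical 120 GiB single-device btrfs: 113 GiB allocated to
# single-mode data chunks, 3 GiB of metadata stored twice due to dup mode.
device_gib=120
data_alloc_gib=113
meta_stored_gib=3
meta_alloc_gib=$((meta_stored_gib * 2))   # dup keeps two copies on disk
unalloc_gib=$((device_gib - data_alloc_gib - meta_alloc_gib))
echo "${unalloc_gib} GiB unallocated"     # could become either data or metadata
```

That last GiB is the ambiguous part: as data it holds roughly 1 GiB, as
dup metadata only half that, which is exactly why a single "free space"
number can't be given.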
Of course btrfs can be used on multiple devices in various raid modes as
well, complicating the picture further, particularly in the future when
each subvolume can have its own single/dup/raid policy applied so they're
not the same.
The way btrfs deals with this question is that btrfs fi show displays
allocated vs. total space (with the space that doesn't show up as
allocated obviously being... unallocated! =:^), while btrfs fi df,
displays the usage detail on only /allocated/ space.
Meanwhile, plain df (not btrfs df, just df) currently doesn't work
particularly well for btrfs, because the rules it uses to display used
vs. available space that work on most filesystems, don't really apply to
btrfs in the same way, and it doesn't know to apply different rules to
btrfs, or what they might be if it did. (There's an effort to teach df to
know about btrfs and similar filesystems, but it's early-stage ATM, as
there are some very real questions to settle first on exactly what a
sensible kernel API might look like, with the assumption being that if
the interface is designed correctly, other filesystems will be able to
make use of it in the future as well.)
> This is becomming a "collection of maybe unrelated BTRFS funny tales"
> thread... still I'd be happy on suggestions regarding any of the issues.
Some of this stuff, including discussion of the issues surrounding space
used and left, is covered on the btrfs wiki, here (bookmark it! =:^) :
https://btrfs.wiki.kernel.org
In particular, see FAQ items 4.4-4.10 (documentation, faq...) covering
space questions, but it's worth reading pretty much all the User level
(as opposed to developer) documentation.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman