public inbox for linux-btrfs@vger.kernel.org
* Re: Issues with "no space left on device" maybe related to 3.13
       [not found] <20140113002532.3975c806@ws>
@ 2014-01-13 10:29 ` Thomas Kuther
  2014-01-14  5:52   ` Duncan
  0 siblings, 1 reply; 3+ messages in thread
From: Thomas Kuther @ 2014-01-13 10:29 UTC (permalink / raw)
  To: linux-btrfs

Am 13.01.2014 08:25, schrieb Duncan:
> [This mail was also posted to gmane.comp.file-systems.btrfs.]
> 
> Thomas Kuther posted on Mon, 13 Jan 2014 00:05:25 +0100 as excerpted:
>>
> 
> [ Rearranged to standard quote/reply order so replies are in context.  
> Top-posting is irritating to try to reply to.]
Oops, sorry. It was too late at night for the second mail yesterday.

> 
>> Am 12.01.2014 21:24, schrieb Thomas Kuther:
>>>
>>> I'm experiencing an interesting issue with the BTRFS filesystem on my
>>> SSD drive. It first occurred some time after the upgrade to kernel
>>> 3.13-rc (-rc3 was my first 3.13-rc) but I'm not sure if it is
>>> related.
>>>
>>> The obvious symptoms are that services on my system started crashing
>>> with "no space left on device" errors.
>>>
>>> └» mount |grep "/mnt/ssd"
>>> /dev/sda2 on /mnt/ssd type btrfs
>>> (rw,noatime,compress=lzo,ssd,noacl,space_cache)
>>>
>>> └» btrfs fi df /mnt/ssd
>>> Data, single: total=113.11GiB, used=90.02GiB
>>> System, DUP: total=64.00MiB, used=24.00KiB
>>> System, single: total=4.00MiB, used=0.00
>>> Metadata, DUP: total=3.00GiB, used=2.46GiB
> 
> This shows only half the story, though.  You also need the output of
> btrfs fi show /mnt/ssd.  Btrfs fi show displays how much of the total
> available space is chunk-allocated; btrfs fi df displays how much of
> the chunk-allocation for each type is actually used.  Only with both
> of them is the picture complete enough to actually see what's going on.

└» sudo btrfs fi show /mnt/ssd
Label: none  uuid: 52bc94ba-b21a-400f-a80d-e75c4cd8a936
        Total devices 1 FS bytes used 93.22GiB
        devid    1 size 119.24GiB used 119.24GiB path /dev/sda2

Btrfs v3.12
└» sudo btrfs fi df /mnt/ssd
Data, single: total=113.11GiB, used=90.79GiB
System, DUP: total=64.00MiB, used=24.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=3.00GiB, used=2.43GiB

So, this looks like it's really full.

>>> I use snapper on two subvolumes of that BTRFS volume (/ and /home) -
>>> each keeping 7 daily snapshots and up to 10 hourlies.
>>>
>>> When I saw those errors I started to delete most of the older
>>> snapshots,
>>> and the issue went away instantly, but this couldn't be a solution
>>> nor a workaround.
>>>
>>> I do though have a "usual suspect" on that BTRFS volume. A KVM disk
>>> image of a Win8 VM (I _need_ Adobe Lightroom)
>>>
>>> » lsattr /mnt/ssd/kvm-images/
>>> ---------------C /mnt/ssd/kvm-images/Windows_8_Pro.qcow2
>>>
>>> So the image has CoW disabled. Now comes the interesting part:
>>> I'm trying to copy off the image to my raid5 array (BTRFS on top of a
>>> mdraid 5 - absolutely no issues with that one), but the cp process
>>> seems like it's stalled.
>>>
>>> After one hour the size of the destination copy is still 0 bytes.
>>> iotop almost constantly shows values like
>>>
>>>  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>>>  4636 be/4 tom        14.40 K/s    0.00 B/s  0.00 %  0.71 % cp
>>> /mnt/ssd/kvm-images/Windows_8_Pro.qcow2 .
>>>
>>> It tries to read the file with some 14K/s and writes absolutely
>>> nothing.
>>>
>>> Any idea what's going wrong here, or suggestions how to get that qcow
>>> file copied off? I do have a backup, but honestly that one is quite
>>> aged - so simply rm'ing it would be the very last thing I'd like to
>>> try.
> 
> OK.  There's a familiar known-troublesome pattern here that your 
> situation fits... with one difference that I had previously /thought/ 
> would ameliorate the problem, but either you didn't catch the problem 
> soon enough, or the root issue is more complex than I at first
> understood (quite possible, since while I'm a regular on the list and
> thus see the common issues posted, I'm just a btrfs user/admin, not a
> dev, btrfs or otherwise).
> 
> The base problem is that btrfs is normally a copy-on-write filesystem, 
> and frequently internally-rewritten (as opposed to sequential-write 
> append-only or write once, read many) files are in general a
> COW-filesystem's worst-case: the larger the file and the more frequently
> partially rewritten, the worse it is, since every small internal write 
> will COW the area being written elsewhere, quickly fragmenting large 
> routinely internal-written files such as VM images into hundreds of 
> thousands of extents!  =:^(
> 
> In general, btrfs has two methods to help deal with that.  For smaller 
> files the autodefrag mount option can help.  For larger files
> autodefrag can be a performance issue in itself due to write
> magnification (each small internal write triggering a rewrite of the
> entire multi-gig file), but there's the NOCOW extended-attribute, which
> is what /has/ been recommended for these things as it's supposed to
> tell the filesystem to do in-place rewrites instead of COW.  That
> doesn't seem to have worked for you, which is the interesting bit, but
> it's possible that's an artifact of how it was handled.  Additionally,
> there's the snapshot aspect throwing further complexity into the works,
> as described below.
> 
> OK, so the file has NOCOW (the +C xattribute) set, which is good.
> *BUT*, when/how did you set it?  On btrfs that can make all the
> difference!
> 
> The caveat with NOCOW on btrfs is that in order to be properly
> effective, NOCOW must be set on the file when it's first created,
> before there's actually any data in it.  If the attribute is not set
> until later, when the file is not zero-size, behavior isn't what one
> might expect or desire -- simply stated, it doesn't work.
> 
> The simplest way to ensure that a file gets the NOCOW attribute set
> while it's still empty is to set the attribute on the parent directory
> before the file is created in the first place.  Any newly created files
> will then automatically inherit the directory's attribute, and thus
> will be set NOCOW from the beginning.
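
For reference, that directory-inheritance approach looks roughly like
this (a sketch only; paths and filenames are illustrative, and chattr +C
only has an effect on filesystems that support it, such as btrfs):

```shell
# Set NOCOW on the directory first, then create files inside it so the
# attribute is inherited while each file is still empty.
mkdir -p /mnt/ssd/kvm-images
chattr +C /mnt/ssd/kvm-images
qemu-img create -f raw /mnt/ssd/kvm-images/new-vm.raw 40G
lsattr /mnt/ssd/kvm-images/new-vm.raw   # should show the 'C' flag
```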

I created the subvolume /mnt/ssd/kvm-images and set +C on it. Then I
moved the VM image in there. So the attribute for the file was inherited
from the parent directory at creation time, yes.

> 
> A second method is to do it manually by first creating the zero-length 
> file using touch, then setting the NOCOW attribute using chattr +C, and 
> only /then/ copying the content into it.  However, this is rather 
> difficult for files created by other processes, so the directory 
> inheritance method is generally recommended as the simplest approach.
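
Spelled out, that manual sequence is something like this (filenames
here are illustrative, not from the actual setup):

```shell
# Create the file empty, set NOCOW while it is still zero-length,
# and only then fill it with content.
touch /mnt/ssd/kvm-images/disk.raw
chattr +C /mnt/ssd/kvm-images/disk.raw
cat /path/to/old-disk.raw > /mnt/ssd/kvm-images/disk.raw
lsattr /mnt/ssd/kvm-images/disk.raw     # 'C' should be set
```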
> 
> So now the question is, the file has NOCOW set as recommended, but was
> it set before the file had content in it as required, or was NOCOW only
> set later, on the existing file with its existing content, thus in
> practice nullifying the effect of setting it at all?
> 
> Meanwhile, the other significant factor here is the snapshotting.  In
> VM-image cases *WITHOUT* the NOCOW xattr properly set, heavy
> snapshotting of a filesystem with VM images is a known extreme
> worst-case of the worst-cases, with *EXTREMELY* bad behavior
> characteristics that don't scale well at all, such that attempting to
> work with that file will tie up the filesystem in huge knots such that
> very little forward progress can be made, period.  We're talking days
> or even weeks to do what /should/ have taken a few minutes, due to the
> *SEVERE* scaling issues.  They're working on the problem, but it's a
> tough one to solve and its scale only recently became apparent.

I do not have any snapshots of that specific kvm-images subvolume for
those reasons. There are some snapshots of other subvolumes (/ and
/home) but only a handful dating back a few days.

> 
> Actually, the current theory is that the recent changes to make defrag 
> snapshot-aware may have triggered the severe scaling issues we're
> seeing now.  Before that, the situation was bad, but apparently not
> horribly terribly broken to the point of not working at all, as it is
> now.
> 
> But as I said, the previous recommendation has been to NOCOW the file
> to prevent the problem from ever appearing in the first place.
> 
> Which you have apparently done, and the problem is still there, except
> that we don't know yet whether you set NOCOW effectively, probably
> using the inheritance method, or not.  If you set it effectively, then
> the problem is worse, MUCH worse, than thought, since the recommended
> workaround doesn't actually work around it.  But if you set it too
> late to be effective, then the problem is simply another instance of
> the already known issue.

So it seems I hit the worst case.

> 
> As for how to manage the existing file, you seem to have figured that
> out already, below...
> 
>>> PS: please reply-to-all, I'm not subscribed. Thanks.
> 
> OK.  I'm doing so here, but please remind me in every reply.
> 
> FWIW, I read and respond to the list as a newsgroup using gmane.org's 
> list2news service and normally reply to the "newsgroup", which gets 
> forwarded to the list.  So I'm not actually using a mail client but a 
> news client, and replying to both author and newsgroup/list isn't 
> particularly easy, nor do I do it often, so reminding with every reply 
> does help me remember.

Hmm, using nntp is a good idea, actually.

> 
>> I did some more digging, and I think I have two maybe unrelated issues
>> here.
>>
>> The "no space left on device" could be caused by the amount of
>> metadata used. I defragmented the KVM image and other parts, ran a
>> "balance start -dusage=5", and now it looks like
>>
>> └» btrfs fi df /
>> Data, single: total=113.11GiB, used=88.83GiB
>> System, DUP: total=64.00MiB, used=24.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=3.00GiB, used=2.40GiB
> 
> Just as a hint, you can get rid of that extra system chunk (the empty
> single one) by doing a balance -sf (system, force; the force is
> necessary when balancing system chunks on their own, not as part of
> metadata).   Since that's only a few KiB of actual system data, it
> should go fast, and you won't have that second system chunk in the
> display any more. =:^)

OK, will do. Thanks!
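
For the archives, I take that to mean something like this (syntax as of
btrfs-progs v3.12 here; double-check btrfs balance --help on your
version):

```shell
# Balance only the system chunks; -f (force) is required when the
# balance is restricted to system chunks alone.
btrfs balance start -f -s /mnt/ssd

# Afterwards the empty "System, single" line should be gone:
btrfs fi df /mnt/ssd
```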

> 
>> The issue with copying/moving off the KVM image still remains. Using
>> "cp" or "mv" hangs. Interestingly, what did work was using "qemu-img
>> convert -O raw ..." so now I have a fresh backup at least. The VM
>> works just fine with the original image file. I really wonder what
>> goes wrong with cp and mv.
> 
> They're apparently getting caught up in that 100k-extents snapshot 
> scaling morass...

Even when the subvolume in question has no snapshots and never had any?

> 
> But *THANKS* for the qemu-img convert idea.  I haven't setup any VMs
> here so didn't know about that at all.  At least now I can pass on
> something that should actually let people get a backup to work with.
> =:^)
> 
> 
> Meanwhile...
> 
>> And I stumbled over a third issue with my raid5 array:
>> └» df -h|grep /mnt/btrfs
>> /dev/md0        5,5T    3,4T  2,1T   63% /mnt/btrfs
>> └» sudo btrfs fi df /mnt/btrfs/
>> Data, single: total=3.33TiB, used=3.33TiB
>> System, DUP: total=8.00MiB, used=388.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=56.12GiB, used=5.14GiB
>> Metadata, single: total=8.00MiB, used=0.00
> 
> Again, you can use balance to get rid of those unused single chunks.  
> They're currently an artifact from the creation of the filesystem due
> to how mkfs.btrfs works at present, so I've started doing a balance 
> immediately after first mount to deal with them, before there's
> anything on the filesystem so the balance goes real fast. =:^)  3+ TiB
> of data is a little late for that, but you can balance metadata (and
> system) only, at least.
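
(A sketch of what such a metadata- and system-only balance might look
like; this is my reading of the filter flags, so verify against your
btrfs-progs version:)

```shell
# Rewrite only the metadata and system chunks, leaving the ~3.3 TiB of
# data chunks untouched; -f is needed because -s is involved.
btrfs balance start -m -s -f /mnt/btrfs
```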
>  
>> The array has been grown quite a while ago using "btrfs filesystem
>> resize max", but "btrfs fi df" still shows the old data size. How
>> could that happen?
> 
> As hinted at above, btrfs fi df <mntpnt> is only half the story, 
> displaying how much of currently allocated chunks are used and for what 
> (data/metadata/system/shared/etc).  What it does *NOT* display is how 
> much of the total filesystem size is actually allocated in the first 
> place.  That's where btrfs fi show <mntpnt> comes in.  (Just btrfs fi 
> show, without the <mntpnt> parameter, works fine if you've only a
> single btrfs or maybe a couple, but once you get a half dozen or so,
> adding the <mntpnt> just as you do for df, is useful to just display
> the one.)
> 
> Consider: On a single device btrfs, data is single mode by default,
> with data chunks normally 1 GiB each, metadata is dup mode by default,
> with metadata chunks normally 1/4 GiB (256 MiB), but due to dup mode,
> two of them are allocated at a time, so half a GiB.
> 
> Given that, how do you represent unallocated space that could be 
> allocated as either data (single, takes the space of the size of the 
> data, or a bit less when compression is on) or metadata (dup, takes
> twice as much space as the size of the actual metadata as there's two
> copies of it), depending on what is needed?
> 
> Of course btrfs can be used on multiple devices in various raid modes
> as well, complicating the picture further, particularly in the future
> when each subvolume can have its own single/dup/raid policy applied so
> they're not the same.
> 
> The way btrfs deals with this question is that btrfs fi show displays 
> allocated vs. total space (with the space that doesn't show up as 
> allocated obviously being... unallocated! =:^), while btrfs fi df, 
> displays the usage detail on only /allocated/ space.

OK, now I got it.

└» sudo btrfs fi show /mnt/btrfs
Label: none  uuid: 939f2547-176a-4942-b8d6-8883fed68973
        Total devices 1 FS bytes used 3.34TiB
        devid    1 size 5.46TiB used 3.44TiB path /dev/md0

No issues on that array, just PEBKAC.

> 
> Meanwhile, plain df (not btrfs df, just df) currently doesn't work 
> particularly well for btrfs, because the rules it uses to display used 
> vs. available space that work on most filesystems, don't really apply
> to btrfs in the same way, and it doesn't know to apply different rules
> to btrfs or what they might be if it did.  (There's an effort to teach
> df to know about btrfs and similar filesystems, but it's early stage
> ATM, as there's some very real questions to settle on exactly what a
> sensible kernel API might look like for that, first, with the
> assumption being that if the interface is designed correctly, other
> filesystems will be able to make use of it in the future as well.)
> 
>> This is becoming a "collection of maybe unrelated BTRFS funny tales"
>> thread... still I'd be happy on suggestions regarding any of the
>> issues.
> 
> Some of this stuff, including discussion of the issues surrounding
> space used and left, is covered on the btrfs wiki, here (bookmark it!
> =:^) :
> 
> https://btrfs.wiki.kernel.org
> 
> In particular, see FAQ items 4.4-4.10 (documentation, faq...) covering 
> space questions, but it's worth reading pretty much all the User level 
> (as opposed to developer) documentation.
> 

Will do. The last time I went through the wiki was at least 2 or 3
years ago, I guess. And obviously I wasn't really aware of the
difference between btrfs fi show and df.

Thanks for your detailed input and the little slap on the back of the
head regarding df vs. show :-)

Regards,
Tom


* Re: Issues with "no space left on device" maybe related to 3.13
  2014-01-13 10:29 ` Issues with "no space left on device" maybe related to 3.13 Thomas Kuther
@ 2014-01-14  5:52   ` Duncan
  2014-01-14  8:23     ` Issues with Thomas Kuther
  0 siblings, 1 reply; 3+ messages in thread
From: Duncan @ 2014-01-14  5:52 UTC (permalink / raw)
  To: linux-btrfs

Thomas Kuther posted on Mon, 13 Jan 2014 11:29:38 +0100 as excerpted:

>> This shows only half the story, though.  You also need the output of
>> btrfs fi show /mnt/ssd.  Btrfs fi show displays how much of the total
>> available space is chunk-allocated; btrfs fi df displays how much of
>> the chunk-allocation for each type is actually used.  Only with both
>> of them is the picture complete enough to actually see what's going on.
> 
> └» sudo btrfs fi show /mnt/ssd
> Label: none  uuid: 52bc94ba-b21a-400f-a80d-e75c4cd8a936
>         Total devices 1 FS bytes used 93.22GiB
>         devid    1 size 119.24GiB used 119.24GiB path /dev/sda2
> 
> Btrfs v3.12
> └» sudo btrfs fi df /mnt/ssd
> Data, single: total=113.11GiB, used=90.79GiB
> System, DUP: total=64.00MiB, used=24.00KiB
> System, single: total=4.00MiB, used=0.00
> Metadata, DUP: total=3.00GiB, used=2.43GiB
> 
> So, this looks like it's really full.

Well, you have 100% space allocated, but not all that allocated space is 
actually used.  113+ gigs allocated for data, but only just under 91 gigs 
used, so ~22.3 gigs are allocated for data but not used.  Metadata's 
closer, particularly considering it's dup-mode so allocations happen 
two-at-a-time.  Metadata chunks are 256 MiB by default, *2 due to dup, so 512 MiB 
allocated at once.  That means you're within a single allocation-unit of 
full on metadata.

And since all space is allocated, when those existing metadata chunks 
fill up, as they presumably originally did to trigger this thread, 
there's nothing left to allocate so out-of-space!

Normally you'd do a data balance to consolidate data in the data chunks 
and return the now-freed chunks to the unallocated space pool, but 
you're going to have problems doing that ATM, for two reasons.  The 
likely easier one to work around is that all space is already allocated: 
balance works by allocating new chunks and copying data/metadata over 
from the old chunks, rewriting, defragging and consolidating as it goes, 
but there's no space left to allocate that new chunk in...

The usual solution to that is to temporarily btrfs device add another 
device with a few gigs available, do the rebalance with it providing the 
necessary new-chunk space, then btrfs device delete, to move the chunks 
on the temporary-add back to the main device so you can safely remove the 
temporary-add.  Ordinarily, even a loopback on tmpfs could be used to 
provide a few gigs, and that should be enough, but of course you can't 
reboot while the chunks are on that tmpfs-based loopback or you'll lose 
that data, and the below will likely trigger a live-lock and you'll 
pretty much HAVE to reboot, so having those chunks on tmpfs probably 
isn't such a good idea after all.  But a few gig thumbdrive should work, 
and should keep the data safe over a reboot, so that's probably what I'd 
recommend ATM.
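
Spelled out, the temporary-device dance is (device name illustrative;
the thumbdrive would show up as e.g. /dev/sdX):

```shell
# 1. Lend the filesystem a few gigs of unallocated space.
btrfs device add /dev/sdX /mnt/ssd

# 2. Rebalance; new chunks can now be allocated on the added device.
btrfs balance start -dusage=25 /mnt/ssd

# 3. Move any chunks off the temporary device again, then remove it.
btrfs device delete /dev/sdX /mnt/ssd
```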

The more worrisome problem is that nasty multi-extent morass of a VM 
image.  When the rebalance hits that, it'll live-lock just as an 
attempted defrag or the like does.  =:^(

But with a bit of luck and perhaps playing with the balance filters a 
bit, you may be able to get at least a few chunks rebalanced first, 
hopefully freeing up a gig or two to unallocated, thus getting you out of 
the worst of the bind and making that space available to metadata if it 
needs it.  And as long as you're not using a RAM-backed device as your 
temp-storage, that balance should be reasonably safe if you have to 
reboot due to live-lock in the middle of it.


For future reference, I'd suggest trying to keep at least enough 
unallocated space around for one more chunk each of data (1 GiB) and 
metadata (256 MiB *2 = 512 MiB), thus allowing a balance to allocate 
them and hopefully free more space when needed.  In practice that means 
doubling it to two chunks each (3 GiB total), and as soon as the second 
one gets allocated, doing a balance to hopefully free more room before 
your reserved chunk space gets allocated too.
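
A quick way to eyeball that unallocated figure (just a sketch; the awk
field positions assume the single-device, GiB-unit btrfs fi show output
pasted earlier in this thread):

```shell
# Prints GiB still unallocated: "size" minus "used" from the devid line
# of `btrfs fi show`.
unallocated_gib() {
    awk '/devid/ { gsub(/GiB/, "", $4); gsub(/GiB/, "", $6);
                   printf "%.2f\n", $4 - $6 }'
}

# Usage: btrfs fi show /mnt/ssd | unallocated_gib
```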


As for the subvolume/snapshots thing (discussion snipped), I don't 
actually use subvolumes here, preferring fully independent partitions so 
my eggs aren't all in one still-under-development-filesystem basket.  
And I don't use snapshots that much.  So I really haven't followed the 
subvolume stuff, and don't know how it interacts with the fragmented 
VM-image bug we're dealing with here at all.

So I honestly don't know whether it's still that VM-image file implicated 
here, or whether we need to look for something else as the subvolumes 
should keep that interference from happening.

Actually, I'm not sure the devs know yet on this one, since it's 
obviously a situation that's much worse than they anticipated, too, 
which means there's /some/ aspect of the interaction that they don't 
yet understand.


Were it my system, I'd probably do one of two things.  Either I'd try to 
get a dev actively working with me to trace/reproduce/solve the problem 
and thus eliminate it once and for all, or I'd take advantage of your 
qemu-img-convert idea to get a backup of the problem file, take (and 
test!!) a backup of everything else on the filesystem if I didn't have 
one already, and simply nuke the entire filesystem with a mkfs.btrfs, 
starting over fresh.  Currently that seems to be the only efficient way 
out of the live-lock-triggering file situation once you find yourself 
in it, unfortunately, since defrag and balance, as well as simply 
trying to copy the file elsewhere (using anything but your qemu-img 
trick), all trigger that live-lock once again. =:^(

Then if at all possible, put your VM image(s) on a dedicated filesystem, 
probably something other than btrfs since btrfs just seems broken for 
that usage ATM, and keep btrfs for the stuff it seems to actually work 
with ATM.

That's what I'd do.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Issues with
  2014-01-14  5:52   ` Duncan
@ 2014-01-14  8:23     ` Thomas Kuther
  0 siblings, 0 replies; 3+ messages in thread
From: Thomas Kuther @ 2014-01-14  8:23 UTC (permalink / raw)
  To: linux-btrfs

Duncan <1i5t5.duncan <at> cox.net> writes:

[...]
> Normally you'd do a data balance to consolidate data in the data chunks 
> and return the now freed chunks to the unallocated space pool, but you're 
> going to have problems doing that ATM, for two reasons.  The likely 
> easiest to work around is that since all space is allocated and balance 
> works by allocating new chunks and copying data/metadata from the old 
> chunks over, rewriting, defragging and consolidating as it goes, but 
> there's no space left to allocate that new one...
> 
> The usual solution to that is to temporarily btrfs device add another 
> device with a few gigs available, do the rebalance with it providing the 
> necessary new-chunk space, then btrfs device delete, to move the chunks 
> on the temporary-add back to the main device so you can safely remove the 
> temporary-add.  Ordinarily, even a loopback on tmpfs could be used to 
> provide a few gigs, and that should be enough, but of course you can't 
> reboot while the chunks are on that tmpfs-based loopback or you'll lose 
> that data, and the below will likely trigger a live-lock and you'll 
> pretty much HAVE to reboot, so having those chunks on tmpfs probably 
> isn't such a good idea after all.  But a few gig thumbdrive should work, 
> and should keep the data safe over a reboot, so that's probably what I'd 
> recommend ATM.


Thanks again for your input, Duncan.

What I did now was:
a) took another read on the matter first
b) verified the qemu-img'ed backup of the VM is working properly
c) deleted the original vm image and refreshed my system backups

At that point data was still fully allocated.

d) dropped that single system chunk as you suggested previously. 
Interestingly this gave me:

Label: none  uuid: 52bc94ba-b21a-400f-a80d-e75c4cd8a936
        Total devices 1 FS bytes used 43.92GiB
        devid    1 size 119.24GiB used 119.03GiB path /dev/sda2

...some free data chunks.

e) Ran an iteration with balance -dusage=5, -dusage=10, 15, 20, 25.
After 25 it looks like:

Label: none  uuid: 52bc94ba-b21a-400f-a80d-e75c4cd8a936
        Total devices 1 FS bytes used 43.92GiB
        devid    1 size 119.24GiB used 95.02GiB path /dev/sda2


So, your suggestion regarding d) saved me from having to recreate the FS or 
having to add a drive.

Now I need to make sure I get a notice when the allocation fills up too
much again.
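
Something like this is what I have in mind for a cron job (a sketch
only; the awk fields assume the single-device GiB output of btrfs fi
show quoted earlier, and the 90% threshold is arbitrary):

```shell
#!/bin/sh
# Warn when chunk allocation on a btrfs mount crosses a threshold,
# by parsing the "devid ... size X used Y" line of `btrfs fi show`.

alloc_pct() {
    awk '/devid/ { gsub(/GiB/, "", $4); gsub(/GiB/, "", $6);
                   printf "%d\n", ($6 / $4) * 100 }'
}

warn_if_full() {    # warn_if_full <mountpoint> <threshold-percent>
    pct=$(btrfs fi show "$1" | alloc_pct)
    if [ "$pct" -ge "$2" ]; then
        echo "WARNING: $1 is ${pct}% chunk-allocated"
    fi
}

# Example: warn_if_full /mnt/ssd 90   (wire into cron + mail as needed)
```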

Regards,
Tom


