btrfs filesystem failing with 'No space left on device' after 4 hours

public inbox for linux-btrfs@vger.kernel.org

* btrfs filesystem failing with 'No space left on device' after 4 hours
From: Michael Firth @ 2019-03-06 14:19 UTC
  To: linux-btrfs@vger.kernel.org

Hi,

I have a BTRFS filesystem that seems to have become very ill. After 4 hours of being mounted, it will fail with every write attempt saying "No space left on device".

Unmounting and remounting the filesystem clears the issue for another 4 hours.

From every check I have done, no messages are logged at the point of the failure to "dmesg" or any system log.

I'm over 99% sure there is not a space issue on the filesystem - it has over 100GB free, and I've run a full "balance" which has not changed the behaviour. A "scrub" on the filesystem hasn't reported any issues.

The output of the three (why on earth are there three?) disk space commands on the filesystem is:

--------------------------------------------------------------------------------------------------------------------------------------
$ sudo btrfs filesystem usage /home
Overall:
    Device size:                 450.00GiB
    Device allocated:            319.06GiB
    Device unallocated:          130.94GiB
    Device missing:                  0.00B
    Used:                        305.95GiB
    Free (estimated):            131.77GiB      (min: 66.30GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:299.00GiB, Used:298.16GiB
   /dev/mapper/VG-HomeVol    299.00GiB

Metadata,DUP: Size:10.00GiB, Used:3.89GiB
   /dev/mapper/VG-HomeVol     20.00GiB

System,DUP: Size:32.00MiB, Used:80.00KiB
   /dev/mapper/VG-HomeVol     64.00MiB

Unallocated:
   /dev/mapper/VG-HomeVol    130.94GiB

$ sudo btrfs filesystem df /home
Data, single: total=299.00GiB, used=298.16GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=10.00GiB, used=3.89GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

$ sudo btrfs filesystem show /home
Label: none  uuid: 550e6e7c-d669-4128-9b0d-b61ef4f3f1c1
        Total devices 1 FS bytes used 302.07GiB
        devid    1 size 450.00GiB used 319.06GiB path /dev/mapper/VG-HomeVol
--------------------------------------------------------------------------------------------------------------------------------------

From my understanding of this output, there don't seem to be any areas that are even close to full. And if it were a genuine full condition, even due to running out of metadata or something, then I wouldn't expect unmounting and remounting to clear the issue.
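
As a rough cross-check of that reading, the figures above can be recombined by hand. The sketch below hard-codes the numbers from the "btrfs filesystem usage" output (in GiB) and recomputes the allocated and unallocated totals plus how full the data and metadata chunks are. The function name usage_summary is made up for illustration and is not a btrfs command.

```shell
#!/bin/sh
# Cross-check of the 'btrfs filesystem usage' figures quoted above.
# All inputs are copied from that output, in GiB; usage_summary is a
# hypothetical helper name, not part of btrfs-progs.
usage_summary() {
    awk 'BEGIN {
        device    = 450.00                # Device size
        data_sz   = 299.00                # Data,single: Size
        data_used = 298.16                # Data,single: Used
        meta_sz   = 10.00                 # Metadata,DUP: Size (per copy)
        meta_used = 3.89                  # Metadata,DUP: Used (per copy)
        sys_disk  = 64 / 1024             # System,DUP: 64 MiB on disk

        # DUP keeps two copies, so metadata occupies twice its size on disk.
        allocated   = data_sz + 2 * meta_sz + sys_disk
        unallocated = device - allocated

        printf "allocated:     %.2f GiB\n", allocated
        printf "unallocated:   %.2f GiB\n", unallocated
        printf "data used:     %.1f%% of allocated data chunks\n", 100 * data_used / data_sz
        printf "metadata used: %.1f%% of allocated metadata chunks\n", 100 * meta_used / meta_sz
    }'
}

usage_summary
```

The two totals should land close to the reported 319.06GiB allocated and 130.94GiB unallocated. The data chunks themselves are nearly full, but with that much unallocated space new chunks can still be created, and the metadata chunks are well under half used.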

Is there any known issue that may cause this behaviour?

Is there any way to get more debugging from what is going on?

My initial thought was that it might be related to snapshots, as I was generating regular snapshots (for a 'previous versions' feature), and many of the failures were just after a snapshot was created. However, I have now disabled the snapshot creation and I am still seeing regular failures.

The system is running stock Debian 9 (Stretch). It was running their latest 4.9 kernel (Rev 4.9.144-3.1) when the problem first occurred. After two instances of the problem, I rolled back to their previous kernel (Rev 4.9.130-2), which the system had been running error free for several months, but the failures have continued.

I'm happy to get any other information that would be needed to debug this, if someone can point me to how to do it.

Currently my faith in BTRFS is approaching zero (it was knocked after a data loss in October, but had grown again). It has a lot of nice features, but (despite comments on the Wiki) really does not seem stable, at least not in Debian.

Thanks

Michael

* Re: btrfs filesystem failing with 'No space left on device' after 4 hours
From: Patrik Lundquist @ 2019-03-06 17:59 UTC
  To: Michael Firth; +Cc: linux-btrfs@vger.kernel.org

On Wed, 6 Mar 2019 at 16:53, Michael Firth <MFirth@nevion.com> wrote:
>
> Is there any way to get more debugging from what is going on?

Try mounting with enospc_debug.
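
A minimal sketch of that, assuming the /home mount point from the original report. enospc_debug is a Btrfs mount option and can be added with a remount; the has_enospc_debug helper is a hypothetical name used here only to confirm that the option is active by reading /proc/mounts.

```shell
#!/bin/sh
# Turn the option on without unmounting (run as root; /home is the mount
# point from the original report):
#   mount -o remount,enospc_debug /home

# Hypothetical helper: succeed if the given btrfs mount point lists
# enospc_debug among its options. The optional second argument lets a
# saved copy of /proc/mounts be checked instead of the live one.
has_enospc_debug() {
    awk -v mp="$1" '
        $2 == mp && $3 == "btrfs" && $4 ~ /(^|,)enospc_debug(,|$)/ { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}
```

For example, `has_enospc_debug /home && echo "enospc_debug is active"`. Because the failure appears only some hours after mounting, the option needs to be in place before the next failure for any extra output to reach the kernel log.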

> The system is running stock Debian 9 (Stretch). It was running their latest 4.9 kernel (Rev 4.9.144-3.1) when the problem first occurred. After two instances of the problem, I rolled back to their previous kernel (Rev 4.9.130-2), which the system had been running error free for several months, but the failures have continued.
>

4.9 is pretty old for Btrfs. I'd use the backported kernel, which is
currently 4.19.

* Re: btrfs filesystem failing with 'No space left on device' after 4 hours
From: Chris Murphy @ 2019-03-06 20:36 UTC
  To: Michael Firth; +Cc: linux-btrfs@vger.kernel.org

On Wed, Mar 6, 2019 at 7:29 AM Michael Firth <MFirth@nevion.com> wrote:
>
> Hi,
>
> I have a BTRFS filesystem that seems to have become very ill. After 4 hours of being mounted, it will fail with every write attempt saying "No space left on device".

What program/process is trying to write to the volume? Does even "touch
~/hello" fail with this message? What happens if you strace the
command? If you strace and output to a file, make sure you direct the
file to a file system OTHER than root and the file system you've had
this problem with (redirect it to a USB stick or to /tmp). But if the
problem happens even with touch, or when writing some zeros with dd, you
should be able to strace just to stdout and copy and paste the results
into a file.
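
Roughly, that could look like the sketch below. It assumes strace is installed and that /tmp is not on the affected filesystem; trace_touch and failed_with_enospc are made-up helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the traced syscalls that returned ENOSPC,
# prefixed with their line numbers in the strace log.
failed_with_enospc() {
    grep -n 'ENOSPC' "$1"
}

# Hypothetical helper: trace one small write and list the calls that
# failed with ENOSPC. The log defaults to /tmp so that it is not written
# to the filesystem under test.
trace_touch() {
    target=$1
    log=${2:-/tmp/touch.strace}
    strace -f -o "$log" touch "$target"
    failed_with_enospc "$log"
}
```

Something like `trace_touch "$HOME/hello"` while the filesystem is in the failing state would show which system call is actually returning the error.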

>
> Unmounting and remounting the filesystem clears the issue for another 4 hours.
>
> From every check I have done, no messages are logged at the point of the failure to "dmesg" or any system log.

The lack of a message doesn't sound like the usual enospc. If the file
system runs out of space, even if it's wrong and it's a bug, Btrfs
will log a warning or info message in dmesg.

>
> The output of the three (why on earth are there three?) disk space commands on the filesystem is:

The three come from different eras, and the legacy 'btrfs filesystem
df' and 'btrfs filesystem show' commands were kept around for script
support I assume. I personally find it ridiculous, but also I know
developers are busy with other important issues. I think there should
be one command for humans and when meaningful improvements are made,
the old way is flat out removed. And there should be a switch to
output machine readable raw spew for scripts and such. But whatever,
not up to me!

>
> From my understanding of this output, there don't seem to be any areas that are even close to full. And if it were a genuine full condition, even due to running out of metadata or something, then I wouldn't expect unmounting and remounting to clear the issue.

Yep, that does make it look kernel related. But there's a lot that
happens at umount (you can strace umount and see some of it!), and not
all of it implicates Btrfs as a possible cause. It could be something
else. The lack of Btrfs errors strongly suggests it's not directly
related to Btrfs. The program is getting some idea that there's no
space left, so we need to track down why it thinks this. Btrfs
doesn't think that, because when it does, it reports it to dmesg.

I don't know anything about Debian and its default kernel console
message logging level, but for some distros I sometimes see that
'dmesg -n 7' needs to be issued before reproducing a problem. Maybe in
your case a hint is just not being retained by dmesg? If you're
running systemd, an alternative is to get kernel messages for the
current boot from 'journalctl -k'; or also 'journalctl -k
--no-pager', or output with monotonic time using 'journalctl -k -o
short-monotonic > journal.txt', and so on.
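
Put together, that might look like the following sketch. save_kernel_log and btrfs_messages are hypothetical helper names, and the default output path under /tmp is an assumption chosen to keep the log off the affected filesystem.

```shell
#!/bin/sh
# Raise the console log level before reproducing the failure (needs root):
#   dmesg -n 7

# Hypothetical helper: save the kernel messages for the current boot,
# using journalctl when it is available and plain dmesg otherwise.
save_kernel_log() {
    out=${1:-/tmp/kernel-messages.txt}
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -k --no-pager -o short-monotonic > "$out"
    else
        dmesg > "$out"
    fi
}

# Hypothetical helper: show only the saved lines that mention Btrfs.
btrfs_messages() {
    grep -i 'btrfs' "$1"
}
```

Running `save_kernel_log` right after a failed write and then `btrfs_messages /tmp/kernel-messages.txt` would show whether Btrfs logged anything at that moment.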

> Is there any known issue that may cause this behaviour?

This list is for upstream development. You'll find a similar notion on
the ext4 and XFS lists: distro kernels are supported by distros, not
upstream. It's a function of almost pure luck if you get the attention
of a developer who knows something about a 2 year old kernel. And 4.9
is more than 2 years old from a Btrfs development perspective, closer
to three years. Current development is happening for kernel 5.2, while
bug fixes are going into 5.1. For practical purposes it's ordinary
to be asked to use a mainline or stable (5.0 or 4.20) kernel to see if
the problem still happens. If it does, then you've likely discovered
an unfixed bug. If it doesn't happen, you've discovered a fixed bug.
For various reasons it can be difficult to backport all bug fixes, so
maybe the fix is in a 4.19 Debian-built kernel; you'd have to test it.
But the way to limit the testing as much as possible is to go straight
to 5.0. If it happens there, you've almost certainly found a bug that's
not yet fixed.

But even before changing kernels in your case I suggest stracing the
simplest program that reproduces the error, like even touch or cp. We
need to have some idea why the program thinks there's no more space
left while the kernel isn't reporting it.

>
> Is there any way to get more debugging from what is going on?

dmesg -n 7
and reproduce with strace + some simple command; the simpler the
reproduction, the better.

>
> My initial thought was that it might be related to snapshots, as I was generating regular snapshots (for a 'previous versions' feature), and many of the failures were just after a snapshot was created. However, I have now disabled the snapshot creation and I am still seeing regular failures.

Could be one of the edge cases that were fixed in 4.12, but offhand I'd
guess those were backported to 4.9. But there have been other edge case
fixes for enospc since then. Note that every merge cycle for the
kernel, Btrfs sees ~1000-2000 commits. It's a lot of changes to keep
track of in someone's memory when it's literally tens of thousands of
changes since kernel 4.9.

-- 
Chris Murphy