Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11

Linux Btrfs filesystem development

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Martin Steigerwald <martin@lichtvoll.de>
Cc: linux-btrfs@vger.kernel.org,
	Timothy Pearson <tpearson@raptorengineering.com>
Subject: Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
Date: Sun, 9 Feb 2020 23:10:41 -0500
Message-ID: <20200210041041.GH13306@hungrycats.org>
In-Reply-To: <2202848.tjv8jjdcNr@merkaba>

On Sun, Feb 09, 2020 at 10:00:34AM +0100, Martin Steigerwald wrote:
> Zygo Blaxell - 09.02.20, 01:43:07 CET:
> > Up to that point, a few processes have been blocked for up to 5 hours,
> > but this is not unusual on a big filesystem given #1.  Usually
> > processes that read the filesystem (e.g. calling lstat) are not
> > blocked, unless they try to access a directory being modified by a
> > process that is blocked. lstat() being blocked is unusual.
> 
> This is really funny, cause what you consider not being unusual, I'd 
> consider a bug or at least a huge limitation.
> 
> But in a sense I never really got that processes can be stuck in 
> uninterruptible sleep on Linux or Unix *at all*. Such a situation 
> without giving a user at least the ability to end it by saying "I don't 
> care about the data that process is to write, let me remove it already" 
> for me is a major limitation to what appears to be kind of specific to 
> the UNIX architecture or at least the way the Linux virtual memory 
> manager is working.

> That written I may be completely ignorant of something very important 
> here and some may tell me it can't be any other way for this and that 
> reason. Currently I still think it can.

The process in uninterruptible sleep is waiting for the filesystem code to
finish whatever it's doing so the in-memory and on-disk structures end in
a consistent state.  If whatever it's doing is "waiting for a lock held by
some other thread doing an expensive thing", it can block for a long time.
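
To see this on a running system: a task in uninterruptible sleep shows
up with state "D" in /proc/<pid>/stat. Below is a minimal sketch in
Python, assuming a Linux /proc. The function names are made up for
illustration and are not from any existing tool. It parses the state
field and lists such tasks, along with /proc/<pid>/wchan where the
kernel exposes it. It only looks at thread group leaders, not at every
thread under /proc/<pid>/task.

```python
import os

def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the first '(' and the last ')'.
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state

def list_d_state(proc="/proc"):
    """Return [(pid, comm, wchan)] for tasks currently in state 'D'."""
    found = []
    try:
        entries = os.listdir(proc)
    except OSError:
        return found  # no /proc here (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited while we were looking, or unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(proc, entry, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"  # wchan may be absent or restricted
        found.append((pid, comm, wchan))
    return found

if __name__ == "__main__":
    for pid, comm, wchan in list_d_state():
        print(pid, comm, wchan)
```

A task that stays in this list across many runs, with the same wchan,
is most likely one of the long waiters described above.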

We can't simply abort the kernel thread here, which is why it's an
uninterruptible wait (*).  Generic interruption would need to unwind the
kernel stack all the way back to userspace, reverting all changes made
to the filesystem's internal data structures as we go, without tripping
over the need for some other lock in the process, and without introducing
horrible new regressions.
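
To make "reverting all changes as we go" concrete, here is a toy sketch
in Python. It has nothing to do with the actual btrfs code and every
name in it is hypothetical: a rename that touches two in-memory maps
and keeps an undo log, so that an interruption between steps can be
rolled back to a consistent state.

```python
class Interrupted(Exception):
    """Stands in for a signal arriving in the middle of an operation."""

def rename_entry(directory, index, old, new, interrupt_after=None):
    """Move `old` to `new` in two structures that must stay in sync.

    directory: dict mapping name -> inode number
    index:     dict mapping inode number -> name (reverse map)
    interrupt_after: step number after which to simulate an interruption
    """
    undo = []  # callables that revert each completed step, newest last
    try:
        # Step 1: remove the old name.
        inode = directory.pop(old)
        undo.append(lambda: directory.__setitem__(old, inode))
        if interrupt_after == 1:
            raise Interrupted()

        # Step 2: add the new name.
        directory[new] = inode
        undo.append(lambda: directory.pop(new))
        if interrupt_after == 2:
            raise Interrupted()

        # Step 3: update the reverse map; after this we are consistent.
        index[inode] = new
    except Interrupted:
        # Unwind: revert the completed steps in reverse order.
        while undo:
            undo.pop()()
        raise
```

Every step needs its own undo entry, and the undo path itself must never
block on a lock held by someone else. That is easy in a 30-line toy and
much harder across a whole filesystem.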

In theory we can interrupt any kernel thread at any time--that happens
naturally whenever there's a BUG() or power failure, for instance--but
the effect on all the other threads that might be running is pretty
painful.

If you add a level of indirection--e.g. run the btrfs code in a VM and
access it via a network or virtio client--then we can interrupt the
client, but the server ends up having to finish whatever operation the
client requested anyway, so the client just gets to immediately hang
waiting for the server on its next call.
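
A rough illustration of that last point, as a Python sketch with threads
standing in for the client and server. The Server class and demo()
function are hypothetical, and the sleep stands in for an expensive
operation that cannot be aborted. The client gives up waiting on a slow
request, then blocks on its very next request until the server has
finished the first one.

```python
import threading
import time

class Server:
    """Handles one request at a time, like code serialized on a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.completed = []

    def _work(self, name, seconds, started, done):
        with self._lock:            # requests are serialized
            started.set()
            time.sleep(seconds)     # the expensive, non-abortable part
            self.completed.append(name)
        done.set()

    def submit(self, name, seconds):
        """Start a request; return (started, done) events for it."""
        started, done = threading.Event(), threading.Event()
        threading.Thread(target=self._work,
                         args=(name, seconds, started, done),
                         daemon=True).start()
        return started, done

def demo():
    server = Server()
    started, slow_done = server.submit("slow", 0.5)
    started.wait()                                  # server is now busy
    interrupted = not slow_done.wait(timeout=0.05)  # client gives up early
    t0 = time.monotonic()
    _, fast_done = server.submit("fast", 0.0)
    fast_done.wait()                                # hangs behind "slow"
    waited = time.monotonic() - t0
    return interrupted, waited, server.completed

if __name__ == "__main__":
    print(demo())
```

The client's wait was interruptible, but the server still ran the slow
request to completion, and the trivial second request waited behind it.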

> And even if uninterruptible sleep can still happen cause it is really 
> necessary, five hours is at least about five hours minus probably a minute 
> or so too long.

Yes it would be nice if btrfs could avoid overcommitting itself so badly,
but that's a somewhat older and larger-scoped bug.

> Ciao,
> -- 
> Martin
> 
> 

(*) Well, we could, if all the filesystem code were written that way.
Patches welcome!

