From: Dave Wysochanski <dwysocha@redhat.com>
To: Frank Sorenson <sorenson@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, lvaz@redhat.com
Subject: Re: [PATCH 0/5] Add trace events for filesystem freeze/thaw events
Date: Thu, 03 Mar 2016 06:18:37 -0500
Message-ID: <1457003917.4523.23.camel@redhat.com>
In-Reply-To: <1456945319-16283-1-git-send-email-sorenson@redhat.com>

On Wed, 2016-03-02 at 13:01 -0600, Frank Sorenson wrote:
> Currently, the only visibility into filesystem freeze or
> thaw activity is when an error is returned from the actual
> call to freeze or thaw.  If the action itself hangs, there
> is no indication that the freeze or thaw was in-progress,
> short of collecting a vmcore.
> 
> There is also no record of what process froze a filesystem
> or when it happened, so if the process does not thaw it
> later, debugging the issue is difficult.
> 

Thanks for getting the ball rolling, Frank!

For some background (Frank can correct me if I'm wrong), I think the
main "use case" which spurred these patches is as follows.  Frank, I,
and others supporting enterprise customers on Linux very frequently
see vmcores come in with customers reporting "hung systems".  We
analyze the vmcore and see that the root filesystem, and often every
other local filesystem, is frozen.  In the vmcore we see many processes
blocked waiting on a filesystem to become unfrozen.  In short, as a
result of the filesystems being frozen, the system eventually grinds to
a halt and users wonder what happened.

So the discussion then turns to "Who froze the filesystem, and why
didn't they unfreeze it?"  One approach we have is to give the customer
a systemtap script which basically does what these patches do and shows
which process issued a 'freeze' call and then an 'unfreeze'.  As "simple"
as systemtap is, many customers run into problems installing the
requirements and/or they don't want to do it.  It also requires that they
reproduce the issue and give us another vmcore so we can see the output.
The discussion almost always ends with "Contact those responsible for
backup / snapshotting of your hung system, as this is most likely an
issue with snapshots / backup", but unfortunately we can't give many
more details than that since we don't have any log of what happened.

I thought about this some more and realized that if the customer
gives us a vmcore and it contains a frozen filesystem, all parties
involved have "already lost".  This usually indicates a couple of
things:
1. They didn't know what happened
2. There's a bug in freeze / thaw, or in the backup / snapshot agent,
which needs to be fixed

On problem #1, I wonder now if a 'freeze' operation should be treated
more like a "shutdown" where all users are notified this is about to
happen.  One downside of this is I can imagine this defeats the intent
of such "VM snapshots" or "transparent backups" where they probably
don't want people to notice a backup / snapshot is being taken.  On the
diagnostic side though, we definitely want to know what happened when
something goes wrong, so there's probably a conflict there.  The other
downside is such notification may not be possible since many of these
problems occur with 3rd party add-ons.  Then again, the leading
virtualization vendor has recently open sourced their agent so it may be
possible.

On problem #2, this patchset or something similar will only help us home
in faster on who is to blame; it won't fix anything, of course.

In closing, I want to point out that based on the above use case, a
simple one line printk in just the freeze path would help.  That is, if
we just knew which process last issued a freeze on a given filesystem,
and that information was always present in any system log / vmcore,
we'd be able to help customers in a more definitive way.
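
To make the "one line printk" idea concrete, here is a minimal sketch of
the kind of message it might produce.  This is a userspace mock-up, not
kernel code and not anything from the patchset: the function name
format_freeze_msg(), its parameters, and the message wording are all
hypothetical.  In the kernel, the equivalent line would presumably be
emitted from the freeze path using the superblock's identifier and the
calling task's name and pid.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Userspace mock-up of the proposed one-line log message.
 * format_freeze_msg() and its parameters are hypothetical names used
 * only for illustration; the real thing would be a printk() in the
 * kernel's freeze path.
 */
static int format_freeze_msg(char *buf, size_t len, const char *fs_id,
                             const char *comm, pid_t pid)
{
    /* Record which filesystem was frozen and which process asked for it. */
    return snprintf(buf, len, "VFS: filesystem %s frozen by %s[%d]",
                    fs_id, comm, (int)pid);
}
```

A line of this shape, found in the system log or in the log buffer of a
vmcore, would be enough to name the last process that froze a given
filesystem.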

All this said, I don't think Frank, I, or anyone else working on
such problems for customers wants to add slightly better diagnostics
at the cost of creating a maintenance nightmare of tracepoints or
unknown side effects that outweigh any benefit.

