cluster-devel.redhat.com archive mirror
From: Steven Whitehouse <swhiteho@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog
Date: Wed, 14 Jul 2021 09:53:26 +0100
Message-ID: <572c25da73c9d4d94945c7c8670c691a0f75be92.camel@redhat.com>
In-Reply-To: <be7487f1-ebfd-e7d5-545c-97bb315afd18@redhat.com>

Hi,

On Tue, 2021-07-13 at 15:03 -0500, Bob Peterson wrote:
> On 7/13/21 1:41 PM, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> > > This patch adds a new watchdog whose sole purpose is to complain
> > > when
> > > gfs2_log_flush operations are taking too long.
> > > 
> > This one is a bit confusing. It says that it is to check if the
> > log flush is taking too long, but it appears to set a timeout
> > based on the amount of dirty data that will be written back, so it
> > isn't really the log flush, but the writeback and log flush that
> > is being timed, I think?
> > 
> > It also looks like the timeout is entirely dependent upon the
> > number of dirty pages, and not on the log flush size. I also
> > wonder about the performance impact of traversing the list of
> > dirty pages. If that can be avoided it should make the
> > implementation rather more efficient.
> > 
> > Steve.
> 
> Well, perhaps my patch description was misleading. The watchdog is
> meant to time how long function gfs2_log_flush() holds the
> sd_log_flush_lock rwsem in write mode.

I think it needs looking at a bit more carefully. That lock is really
an implementation detail, and one that we expect will change in the not
too distant future as the log code improves.

As you say, the description is confusing, and the log messages even
more so, since they give a page count that refers to the ordered
writes and not to the log writes at all.

Also, we already have tools that should be able to diagnose this
issue (slow I/O), such as blktrace, although I know that is trickier
to catch after the fact. So I think we need to look at this again to
see if there is a better solution.
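
Just so we are clearly talking about the same thing, my mental model
of the watchdog is roughly the sketch below. It is purely
illustrative (the timer setup, the 20 second threshold and the
message text are made up, and none of it is the code from the patch);
the point is only that what gets bracketed is the sd_log_flush_lock
hold, rather than any particular piece of I/O:

/*
 * Purely illustrative sketch, not the actual patch: arm a timer
 * before taking sd_log_flush_lock, cancel it afterwards; if the
 * flush overruns, the timer callback complains.  Threshold and
 * message are invented.  Assumes GFS2's incore.h for struct gfs2_sbd.
 */
#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

static struct timer_list example_flush_watchdog;

static void example_flush_timeout(struct timer_list *t)
{
	pr_warn("gfs2: log flush has exceeded the watchdog threshold\n");
}

static void example_timed_log_flush(struct gfs2_sbd *sdp)
{
	timer_setup(&example_flush_watchdog, example_flush_timeout, 0);
	mod_timer(&example_flush_watchdog, jiffies + 20 * HZ);

	down_write(&sdp->sd_log_flush_lock);
	/* ... ordered data writeback and log writes happen here ... */
	up_write(&sdp->sd_log_flush_lock);

	del_timer_sync(&example_flush_watchdog);
}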


> That includes writing the ordered writes as well as the metadata.
> The metadata portion is almost always outweighed 100:1 (or more) by
> the ordered writes. The length of time it will take to do the
> ordered writes should be based on the number of dirty pages. I
> don't think running the ordered writes list will impact performance
> too badly, and that's one reason I chose to do it before we actually
> take the sd_log_flush_lock. It does, however, hold sd_ordered_lock
> during its count. Still, that is cheap compared to counting the
> actual pages or something, and modern CPUs can walk lists very
> quickly.
> 
What limits do we have on the ordered write list length? I seem to
remember we addressed that issue at some point in the past.
Generally, though, iterating over what might be quite a long list is
not a good plan from a performance perspective.
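
For reference, the sort of list walk being described might look
roughly like the sketch below. It is purely illustrative: I am
guessing at which quantity gets counted (per-inode nrpages here), the
field names follow my reading of the current code rather than the
patch, and the timeout scaling is made up:

/*
 * Illustrative sketch only.  Walk the ordered-writes list under
 * sd_ordered_lock and sum the cached page counts, then derive a
 * watchdog timeout from the total.  Which quantity the patch
 * actually counts, and how it scales the timeout, are guesses.
 */
static unsigned long example_count_ordered_pages(struct gfs2_sbd *sdp)
{
	struct gfs2_inode *ip;
	unsigned long pages = 0;

	spin_lock(&sdp->sd_ordered_lock);
	list_for_each_entry(ip, &sdp->sd_log_ordered, i_ordered)
		pages += ip->i_inode.i_mapping->nrpages;
	spin_unlock(&sdp->sd_ordered_lock);

	return pages;
}

static unsigned long example_flush_timeout_jiffies(unsigned long pages)
{
	/* made-up heuristic: a 10s floor plus 1s per 1000 dirty pages */
	return (10 + pages / 1000) * HZ;
}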

Steve.

> My initial version didn't count at all; it just used an arbitrary
> number of seconds any log flush _ought_ to take. However, Barry
> pointed out that older hardware can be slow when driven to extremes,
> and I didn't want false positives.
> 
> I also thought about keeping a running count whenever pages are
> dirtied, or when inodes are added to the ordered writes list, but
> that seemed like overkill. But it is a reasonable alternative.
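
(As an aside, the running count you mention could be as simple as the
sketch below; the names are invented and this is only an illustration
of the idea, not proposed code.)

/*
 * Sketch of the "keep a running count" alternative; everything here
 * is made up for illustration.
 */
#include <linux/atomic.h>

static atomic_long_t example_ordered_pages = ATOMIC_LONG_INIT(0);

/* called when pages join the ordered-writes accounting ... */
static inline void example_ordered_account(long nr_pages)
{
	atomic_long_add(nr_pages, &example_ordered_pages);
}

/* ... and when they are written back or removed */
static inline void example_ordered_unaccount(long nr_pages)
{
	atomic_long_sub(nr_pages, &example_ordered_pages);
}

/* flush time then reads the total in O(1) instead of walking a list */
static inline long example_ordered_total(void)
{
	return atomic_long_read(&example_ordered_pages);
}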
> 
> The timeout value is also somewhat arbitrary, but I'm open to
> suggestions.
> 
> In my case, faulty hardware caused log flushes to take a very long
> time, which caused many transactions and glocks to be blocked for a
> long time and eventually hit the 120-second kernel watchdog, which
> gives the impression glocks are being held a very long time (which
> they are) for some unknown reason.
> 
> This can manifest on many (often non-faulty) nodes, since glocks can
> be tied up indefinitely waiting for a process that holds one locked
> EX but now must wait until it can acquire the transaction lock,
> which is blocked on the log flush.
> 
> My goal was to make hardware problems (like faulty HBAs and fibre
> switches) NOT seem like cascading gfs2 file system problems or
> slowdowns.
> 
> These messages will hopefully prompt operations people to investigate
> the
> cause of the slowdown.
> 
> I tested this patch with faulty hardware, and it yielded messages
> like:
> 
> [ 2127.027527] gfs2: fsid=bobsrhel8:test.0: log flush pid 256206 took > 20 seconds to write 98 pages.
> [ 2348.979535] gfs2: fsid=bobsrhel8:test.0: log flush pid 256681 took > 1 seconds to write 1 pages.
> [ 3643.571505] gfs2: fsid=bobsrhel8:test.0: log flush pid 262385 took > 4 seconds to write 16 pages.
> 
> Regards,
> 
> Bob Peterson
> 
> 



Thread overview: 19+ messages
2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 01/10] gfs2: Fix glock recursion in freeze_go_xmote_bh Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 02/10] gfs2: Eliminate go_xmote_bh in favor of go_lock Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 03/10] gfs2: be more verbose replaying invalid rgrp blocks Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 04/10] gfs2: trivial clean up of gfs2_ail_error Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 05/10] gfs2: tiny cleanup in gfs2_log_reserve Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 06/10] gfs2: init system threads before freeze lock Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 07/10] gfs2: Don't release and reacquire local statfs bh Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog Bob Peterson
2021-07-13 18:41   ` Steven Whitehouse
2021-07-13 20:03     ` Bob Peterson
2021-07-14  8:53       ` Steven Whitehouse [this message]
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 09/10] gfs2: fix deadlock in gfs2_ail1_empty withdraw Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode Bob Peterson
2021-07-13 18:26   ` Steven Whitehouse
2021-07-13 19:34     ` Bob Peterson
2021-07-28  6:50       ` Andreas Gruenbacher
2021-07-28  8:57         ` Steven Whitehouse
2021-07-28 13:16           ` Jan Kara
