cluster-devel.redhat.com archive mirror
 help / color / mirror / Atom feed
From: Bob Peterson <rpeterso@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog
Date: Tue, 13 Jul 2021 15:03:02 -0500	[thread overview]
Message-ID: <be7487f1-ebfd-e7d5-545c-97bb315afd18@redhat.com> (raw)
In-Reply-To: <ac3ca605d90d29e8424abc56ac5f410c0cc5e0b0.camel@redhat.com>


On 7/13/21 1:41 PM, Steven Whitehouse wrote:
> Hi,
>
> On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
>> This patch adds a new watchdog whose sole purpose is to complain when
>> gfs2_log_flush operations are taking too long.
>>
> This one is a bit confusing. It says that it is to check if the log
> flush is taking too long, but it appears to set a timeout based on the
> amount of dirty data that will be written back, so it isn't really the
> log flush, but the writeback and log flush that is being timed I think?
>
> It also looks like the timeout is entirely dependent upon the number of
> dirty pages too, and not on the log flush size. I wonder about the
> performance impact of traversing the list of dirty pages too. If that
> can be avoided it should make the implementation rather more efficient,
>
> Steve.

Well, perhaps my patch description was misleading. The watchdog is meant
to time how long function gfs2_log_flush() holds the sd_log_flush_lock rwsem
in write mode. That includes writing the ordered writes as well as the
metadata. The metadata portion is almost always outweighed 100:1 (or more)
by the ordered writes. The length of time it will take to do the ordered 
writes
should be based on the number of dirty pages. I don't think running the
ordered writes list will impact performance too badly, and that's one reason
I chose to do it before we actually take the sd_log_flush_lock. It does, 
however,
hold the sd_ordered_lock lock during its count. Still, it's small 
compared to
counting the actual pages or something, and modern cpus can run lists very
quickly.

My initial version didn't count at all; it just used an arbitrary number of
seconds any log flush _ought_ to take. However, Barry pointed out that older
hardware can be slow when driven to extremes and I didn't want false
positives.

I also thought about keeping an interactive count whenever pages are
dirtied, or when inodes are added to the ordered writes list, but that 
seemed
like overkill. But it is a reasonable alternative.

The timeout value is also somewhat arbitrary, but I'm open to suggestions.

In my case, faulty hardware caused log flushes to take a very long time, 
which
caused many transactions and glocks to be blocked a long time and eventually
hit the 120-second kernel watchdog, which gives the impression glocks are
being held a very long time (which they are) for some unknown reason.

This can manifest on many (often non-faulty) nodes, since glocks can be 
tied up
indefinitely waiting for a process who has it locked EX but now must
wait until it can acquire the transaction lock, which is blocked on the 
log flush:
My goal was to make hardware problems (like faulty HBAs and fibre switches)
NOT seem like cascading gfs2 file system problems or slowdowns.

These messages will hopefully prompt operations people to investigate the
cause of the slowdown.

I tested this patch with faulty hardware, and it yielded messages like:

[ 2127.027527] gfs2: fsid=bobsrhel8:test.0: log flush pid 256206 took > 
20 seconds to write 98 pages.
[ 2348.979535] gfs2: fsid=bobsrhel8:test.0: log flush pid 256681 took > 
1 seconds to write 1 pages.
[ 3643.571505] gfs2: fsid=bobsrhel8:test.0: log flush pid 262385 took > 
4 seconds to write 16 pages.

Regards,

Bob Peterson




  reply	other threads:[~2021-07-13 20:03 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-13 18:09 [Cluster-devel] [GFS2 PATCH 00/10] gfs2: misc. patch collection Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 01/10] gfs2: Fix glock recursion in freeze_go_xmote_bh Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 02/10] gfs2: Eliminate go_xmote_bh in favor of go_lock Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 03/10] gfs2: be more verbose replaying invalid rgrp blocks Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 04/10] gfs2: trivial clean up of gfs2_ail_error Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 05/10] gfs2: tiny cleanup in gfs2_log_reserve Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 06/10] gfs2: init system threads before freeze lock Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 07/10] gfs2: Don't release and reacquire local statfs bh Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog Bob Peterson
2021-07-13 18:41   ` Steven Whitehouse
2021-07-13 20:03     ` Bob Peterson [this message]
2021-07-14  8:53       ` Steven Whitehouse
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 09/10] gfs2: fix deadlock in gfs2_ail1_empty withdraw Bob Peterson
2021-07-13 18:09 ` [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode Bob Peterson
2021-07-13 18:26   ` Steven Whitehouse
2021-07-13 19:34     ` Bob Peterson
2021-07-28  6:50       ` Andreas Gruenbacher
2021-07-28  8:57         ` Steven Whitehouse
2021-07-28 13:16           ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=be7487f1-ebfd-e7d5-545c-97bb315afd18@redhat.com \
    --to=rpeterso@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).