From: Dave Chinner <david@fromorbit.com>
To: Troels Hansen <th@casalogic.dk>
Cc: linux-xfs@vger.kernel.org
Subject: Re: Strange XFS problem
Date: Thu, 13 Sep 2018 14:19:44 +1000
Message-ID: <20180913041944.GA27618@dastard>
In-Reply-To: <1063031109.4602643.1536739675362.JavaMail.zimbra@casalogic.dk>
On Wed, Sep 12, 2018 at 10:07:55AM +0200, Troels Hansen wrote:
> Hi, we are facing an issue where we can't figure out whether it's XFS (software) related or actually hardware related, and we can't quite work out why it happens, though it doesn't seem hardware related.
>
> The issue is with a 102TB array on a Dell branded LSISAS 3508 (Perc H840).
> Running Ubuntu with a 4.15.0-32 kernel (Ubuntu branded), but we have also run a number of 4.4.0-x kernels with the same issues.
Smells of an IO overload problem from that.
> The XFS filesystem is on a very busy NFS server, and when the issue
> occurs we see strange issues with NFS, while the system seems
> healthy on the local server, but at the same time some programs
> are having problems accessing the fs.
>
> It occurs roughly every 14 days, and we have to restart the fs to get it fully working again.
What happens on your network every 14 days or so? Is there a rogue
client side backup or admin task running somewhere?
> Sometimes it refuses to unmount cleanly during shutdown, forcing us to fsck the fs on startup.
Unclean shutdown doesn't require fsck to be run; XFS replays its
journal when the filesystem is next mounted.
> It looks like it's hanging in xlog_grant_head_wait, but I don't know enough to determine what can make it hang there.
>
> Hoping someone in here could have a look and point me in the right direction.
>
> Below is a trace from the last crash we had:
Not a crash - it's a hung task warning.
> Sep 9 23:23:51 ged kernel: [1436769.178935] INFO: task mysqld:2847 blocked for more than 120 seconds.
> Sep 9 23:23:51 ged kernel: [1436769.178999] Not tainted 4.15.0-32-generic #35~16.04.1-Ubuntu
> Sep 9 23:23:51 ged kernel: [1436769.179047] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 9 23:23:51 ged kernel: [1436769.179105] mysqld D 0 2847 1 0x00000000
> Sep 9 23:23:51 ged kernel: [1436769.179111] Call Trace:
> Sep 9 23:23:51 ged kernel: [1436769.179123] __schedule+0x3d6/0x8b0
> Sep 9 23:23:51 ged kernel: [1436769.179127] schedule+0x36/0x80
> Sep 9 23:23:51 ged kernel: [1436769.179216] xlog_grant_head_wait+0xb8/0x1e0 [xfs]
> Sep 9 23:23:51 ged kernel: [1436769.179277] xlog_grant_head_check+0x94/0x100 [xfs]
> Sep 9 23:23:51 ged kernel: [1436769.179330] xfs_log_reserve+0xcb/0x1e0 [xfs]
> Sep 9 23:23:51 ged kernel: [1436769.179381] xfs_trans_reserve+0x169/0x1d0 [xfs]
> Sep 9 23:23:51 ged kernel: [1436769.179428] xfs_trans_alloc+0xbe/0x130 [xfs]
> Sep 9 23:23:51 ged kernel: [1436769.179478] xfs_vn_update_time+0x5d/0x160 [xfs]
> Sep 9 23:23:51 ged kernel: [1436769.179486] file_update_time+0xbe/0x110
> Sep 9 23:23:51 ged kernel: [1436769.179493] ? tcp_recvmsg+0x317/0xab0
> Sep 9 23:23:51 ged kernel: [1436769.179542] xfs_file_aio_write_checks+0x13a/0x180 [xfs]
> Sep 9 23:23:51 ged kernel: [1436769.179588] xfs_file_buffered_aio_write+0x89/0x2a0 [xfs]
> Sep 9 23:23:51 ged kernel: [1436769.179632] xfs_file_write_iter+0x103/0x150 [xfs]
> Sep 9 23:23:51 ged kernel: [1436769.179637] new_sync_write+0xe5/0x140
> Sep 9 23:23:51 ged kernel: [1436769.179641] __vfs_write+0x29/0x40
> Sep 9 23:23:51 ged kernel: [1436769.179645] vfs_write+0xb8/0x1b0
> Sep 9 23:23:51 ged kernel: [1436769.179649] SyS_pwrite64+0x95/0xb0
> Sep 9 23:23:51 ged kernel: [1436769.179655] do_syscall_64+0x73/0x130
> Sep 9 23:23:51 ged kernel: [1436769.179661] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
.....
Does this repeat every 120s?
These hung task warnings can happen if your workload has overloaded
your raid array and everything doing IO hangs while it catches up.
e.g. you have 6GB of random 4k writes in the controller NV cache and
it takes minutes for it to flush (because random 4k writes are slow)
and make room for new incoming IO....
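To put rough numbers on the example above, here is a back-of-the-envelope
sketch. The 6GB and 4k figures come from the example; the random-write IOPS
values are assumptions chosen only for illustration, not measurements of this
array, and flush_seconds() is a hypothetical helper.

```python
GIB = 1024 ** 3


def flush_seconds(cache_bytes, io_size_bytes, random_write_iops):
    """Estimate how long a write-back cache full of small random writes
    takes to drain at a given sustained random-write rate."""
    pending_ios = cache_bytes / io_size_bytes
    return pending_ios / random_write_iops


# 6GB of 4k writes, as in the example above. The IOPS values are assumed.
for iops in (2000, 5000, 20000):
    secs = flush_seconds(6 * GIB, 4096, iops)
    print(f"{iops:>6} IOPS: {secs:7.1f} s ({secs / 60:.1f} min)")
```

At any of these assumed rates except the fastest, the drain time is well past
the 120 second hung task timeout, which is why a full cache of random writes
can be enough on its own to trigger the warning.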
If the warnings don't repeat, then it means it was a temporary
overload. If the warnings repeat, but change processes and stack
traces, then it's a sustained overload condition. If exactly the same
warnings repeat and/or IO has stalled and doesn't restart, then we've
got some kind of hang occurring and we'll need to look into it
further.
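The three cases above can be sketched as a small log classifier. This is only
a rough illustration: classify() and HUNG are hypothetical names, the regex is
written against the warning line format shown in the trace above, and it
compares only task name and pid, not the stack traces, so a real triage still
needs a human to read the traces.

```python
import re

# Header line of a hung task warning, in the format shown in the trace above.
HUNG = re.compile(
    r"\[\s*(\d+\.\d+)\] INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)


def classify(log_lines, interval=120.0):
    """Classify hung task warnings found in kernel log lines.

    interval is the hung task timeout in seconds (120 in the log above).
    """
    events = []
    for line in log_lines:
        m = HUNG.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2), int(m.group(3))))
    if not events:
        return "no warnings"
    times = [t for t, _, _ in events]
    if max(times) - min(times) < interval:
        # Everything was reported in a single burst and did not come back.
        return "temporary overload"
    tasks = [(comm, pid) for _, comm, pid in events]
    if len(set(tasks)) < len(tasks):
        # The same task was reported again in a later interval.
        return "possible hang"
    # Warnings keep coming, but from different tasks each time.
    return "sustained overload"
```

For example, feeding it the output of `dmesg` split into lines gives a first
guess at which case applies before anyone digs into the individual traces.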
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com