From: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
To: d.aberger@profihost.ag, Dave Chinner <david@fromorbit.com>
Cc: Brian Foster <bfoster@redhat.com>,
linux-xfs@vger.kernel.org, n.fahldieck@profihost.ag
Subject: Re: XFS filesystem hang
Date: Sun, 10 Feb 2019 19:52:42 +0100
Message-ID: <dd73d3db-32e4-e357-a5db-0e1b41c8b4a0@profihost.ag>
In-Reply-To: <45ec229d-3554-e273-3704-daee5e1bfe54@profihost.ag>
Dear Dave,
we're still seeing the issue below. I verified that we also see it on:
1.) systems without any RAID or controller (pure qemu/kvm with a ceph rbd
backend)
2.) systems where no btrfs is running and no backup jobs are active
To me this looks like an XFS lockup. I found the same trace here:
https://access.redhat.com/solutions/2964341
but I do not have access to that subscriber-only content.
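If it helps, a blocked-task dump could be captured the next time the hang
occurs; a rough sketch (assuming the magic SysRq interface is available on
these machines):

  # enable sysrq if it isn't already
  echo 1 > /proc/sys/kernel/sysrq
  # write backtraces of all tasks stuck in uninterruptible (D) state to the kernel log
  echo w > /proc/sysrq-trigger
  dmesg > blocked-tasks.txt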
Greets,
Stefan
On 21.01.19 at 15:59, Daniel Aberger - Profihost AG wrote:
> On 19.01.19 at 01:19, Dave Chinner wrote:
>> On Fri, Jan 18, 2019 at 03:48:46PM +0100, Daniel Aberger - Profihost AG wrote:
>>> On 17.01.19 at 23:05, Dave Chinner wrote:
>>>> On Thu, Jan 17, 2019 at 02:50:23PM +0100, Daniel Aberger - Profihost AG wrote:
>>>>> * Kernel Version: Linux 4.12.0+139-ph #1 SMP Tue Jan 1 21:46:16 UTC 2019
>>>>> x86_64 GNU/Linux
>>>>
>>>> Is that an unmodified distro kernel or one you've patched and built
>>>> yourself?
>>>
>>> Unmodified regarding XFS and any subsystems related to XFS, as I was
>>> told.
>>
>> That doesn't answer my question - has the kernel been patched (and
>> what with) or is it a completely unmodified upstream kernel?
>>
>
> The kernel we were running was OpenSUSE SLE15 based on commit
> 6c5c7489089608d89b7ce310bca44812e2b0a4a5.
>
> https://github.com/openSUSE/kernel
>
>
>>>>> * /proc/meminfo, /proc/mounts, /proc/partitions and xfs_info can be
>>>>> found here: https://pastebin.com/cZiTrUDL
>>>>
>>>> Just notes as I browse it.
>>>> - lots of free memory.
>>>> - xfs_info: 1.3TB, 32 AGs, ~700MB log w/ sunit=64fsbs
>>>> sunit=64 fsbs, swidth=192fsbs (RAID?)
>>>> - mount options: noatime, sunit=512,swidth=1536, usrquota
>>>> - /dev/sda3 mounted on /
>>>> - /dev/sda3 also mounted on /home/tmp (bind mount of something?)
>>>>
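(Side note on the units, assuming the default 4096-byte filesystem block
size: the mount options express sunit/swidth in 512-byte sectors, so
64 fsbs * 8 = 512 and 192 fsbs * 8 = 1536, which matches the sunit/swidth
mount options noted above.)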
>>>>> * full dmesg output of problem mentioned in the first mail:
>>>>> https://pastebin.com/pLaz18L1
>>>>
>>>> No smoking gun.
>>>>
>>>>> * a couple of more dmesg outputs from the same system with similar
>>>>> behaviour:
>>>>> * https://pastebin.com/hWDbwcCr
>>>>> * https://pastebin.com/HAqs4yQc
>>>>
>>>> Ok, so mysqld seems to be the problem child here.
>>>>
>>>
>>> Our MySQL workload on this server is very small except at this time of
>>> day, when our local backup to /backup runs. The highest I/O happens
>>> during the night while the local backup is being written. The
>>> timestamps of these two outputs suggest that the "mysql dump" phase
>>> might just have started. Unfortunately we only keep the log of the
>>> last job, so I can't confirm that.
>>
>> Ok, so you've just started loading up the btrfs volume that is also
>> attached to the same raid controller, which does have raid caches
>> enabled....
>>
>> I wonder if that has anything to do with it?
>>
>
> Do you suggest changing any caching options?
>
>> Best would be to capture iostat output for both luns (as per the
>> FAQ) when the problem workload starts.
>>
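(For reference, a capture along these lines should do; the device names
sda/sdb are placeholders for the XFS and backup LUNs:

   # extended per-device stats in MB, with timestamps, every 5 seconds
   iostat -dxmt 5 sda sdb > iostat-backup-window.log
)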
>
> What I can give you so far are two Grafana I/O activity screenshots
> covering two of the dmesg outputs above.
>
> https://imgur.com/a/3lL776U
>
>
>>>> Which leads me to ask: what is your RAID cache setup - write-thru,
>>>> write-back, etc?
>>>>
>>>
>>> Our RAID6 cache configuration:
>>>
>>> Read-cache setting : Disabled
>>> Read-cache status : Off
>>> Write-cache setting : Disabled
>>> Write-cache status : Off
>>
>> Ok, so read caching is turned off, which means it likely won't even
>> be caching stripes between modifications. May not be very efficient,
>> but hard to say if it's the problem or not.
>>
>>> Full Configuration: https://pastebin.com/PdGatDY4
>>
>> Yeah, caching is enabled on the backup btrfs lun, so there may be
>> interaction issues. Is the backup device idle (or stalling) at the
>> same time that the XFS messages are being issued?
>
> In 2 out of 3 cases it happened while the backup job was running, which
> starts at 0:10 am and finishes roughly between 2:30 and 3:30 am on this
> particular machine. So it wasn't idle.
>
> The MySQL dumping phase takes about 20 to 25 minutes and happens at the
> end of the backup job.
>