Re: VMs getting into stuck states since kernel ~5.13

From: Dave Chinner <david@fromorbit.com>
To: Chris Murphy <lists@colorremedies.com>
Cc: xfs list <linux-xfs@vger.kernel.org>
Subject: Re: VMs getting into stuck states since kernel ~5.13
Date: Sat, 11 Dec 2021 08:20:33 +1100
Message-ID: <20211210212033.GP449541@dread.disaster.area>
In-Reply-To: <CAJCQCtR5NjF61B4g4KkjBgdmV8rK8tWLNxtVvNbm4gzm9kdrhg@mail.gmail.com>

On Fri, Dec 10, 2021 at 03:06:37PM -0500, Chris Murphy wrote:
> On Wed, Dec 8, 2021 at 4:33 PM Dave Chinner <david@fromorbit.com> wrote:
> > Looking at the traces, I'd say IO is really slow, but not stuck.
> > `iostat -dxm 5` output for a few minutes will tell you if IO is
> > actually making progress or not.
> 
> Pretty sure like everything else we run once the hang happens, iostat
> will just hang too. But I'll ask.

If that's the case, then I want to see the stack trace for the hung
iostat binary.
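
If iostat itself cannot be run, the same question (is IO completing, or are
requests just sitting in flight?) can be answered from two reads of
/proc/diskstats, which is where iostat gets its numbers. The following is a
rough sketch, not a replacement for the iostat output requested above. The
function names, the three labels and the sample lines are all made up for
illustration; only the field positions come from the kernel's documented
diskstats layout.

```python
# Sketch only: field positions follow the documented /proc/diskstats layout;
# the "progressing"/"stalled"/"idle" labels are made up for this example.

def parse_diskstats(text):
    """Map device name -> completed reads, completed writes, I/Os in flight."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        stats[fields[2]] = {
            "reads": int(fields[3]),       # reads completed
            "writes": int(fields[7]),      # writes completed
            "in_flight": int(fields[11]),  # I/Os currently in progress
        }
    return stats


def classify(before, after):
    """Compare two samples and label each device present in both."""
    labels = {}
    for dev, new in after.items():
        old = before.get(dev)
        if old is None:
            continue
        completed = (new["reads"] - old["reads"]) + (new["writes"] - old["writes"])
        if completed > 0:
            labels[dev] = "progressing"
        elif new["in_flight"] > 0:
            labels[dev] = "stalled"
        else:
            labels[dev] = "idle"
    return labels


# Hypothetical samples taken a few seconds apart (not data from this thread).
SAMPLE_T0 = """\
   8  0 sda 100 0 800 50 200 0 1600 70 3 120 120
   9  0 md0 500 0 4000 0 900 0 7200 0 12 0 0
"""
SAMPLE_T1 = """\
   8  0 sda 104 0 832 52 215 0 1720 75 2 125 125
   9  0 md0 500 0 4000 0 900 0 7200 0 12 0 0
"""

print(classify(parse_diskstats(SAMPLE_T0), parse_diskstats(SAMPLE_T1)))
```

On a live host the two samples would be the contents of /proc/diskstats read
a few seconds apart. A device that shows requests in flight but no
completions between samples is the interesting case.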

> > Can you please provide the hardware configuration for these machines
> > and iostat output before we go any further here?
> 
> Dell PERC H730P
> megaraid sas controller,

Does it have NVRAM? If so, how much, how is it configured, etc.

> to 10x 600G sas drives,

Configured as JBOD, or something else? What is the per-drive cache
configuration?

The link I put in the last email did ask for this sort of specific
information....

> BFQ ioscheduler, and
> the stack is:
> 
> partition->mdadm raid6, 512KiB chunk->dm-crypt->LVM->XFS

Ok, that's pretty suboptimal for VM hosts that typically see lots of
small random IO. 512kB chunk size RAID6 is about the worst
configuration you can have for a small random write workload.
Putting dmcrypt on top of that will make things even slower.

But it also means we need to be looking for bugs in both dmcrypt and
MD raid, and given that there are bad bios being built somewhere in
the stack, it's a good bet the problems are somewhere within these
two layers.

> meta-data=/dev/mapper/vg_guests-LogVol00 isize=512    agcount=180, agsize=1638272 blks

Where does the agcount=180 come from? Has this filesystem been grown
at some point in time?

>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1    bigtime=0 inobtcount=0
> data     =                       bsize=4096   blocks=294649856, imaxpct=25
>          =                       sunit=128    swidth=512 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=12800, version=2

Yeah, 50MB is a tiny log for a filesystem of this size.

>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

Ok, here's what I'm guessing was the original fs config:

# mkfs.xfs -N -d size=100G,su=512k,sw=4 foo
meta-data=foo                    isize=512    agcount=16, agsize=1638272 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=26212352, imaxpct=25
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
#

and then it was grown from 100GB to 1.5TB at some point later on in
its life. An actual 1.5TB fs should have a log that is somewhere
around 800MB in size.
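
For reference, the arithmetic behind the numbers above can be checked with a
short sketch. All input values are copied from the two xfs_info/mkfs.xfs
outputs in this mail; the variable and function names are just for
illustration. By this arithmetic the blocks= figure of the current
filesystem works out to roughly 1.1 TiB, so the "1.5TB" above is best read
as a round approximation.

```python
# Sketch only: inputs are copied from the xfs_info and mkfs.xfs -N output
# quoted in this mail; names are illustrative.
BLOCK_SIZE = 4096  # bsize, in bytes
MIB = 1024 ** 2
GIB = 1024 ** 3

current = {"agcount": 180, "agsize": 1638272, "blocks": 294649856}
guessed_original = {"agcount": 16, "agsize": 1638272, "blocks": 26212352}
log_blocks = 12800  # identical in both outputs


def ags_needed(fs):
    """Number of AGs needed to cover the data blocks (last AG may be short)."""
    return -(-fs["blocks"] // fs["agsize"])  # ceiling division


log_mib = log_blocks * BLOCK_SIZE / MIB
current_gib = current["blocks"] * BLOCK_SIZE / GIB
original_gib = guessed_original["blocks"] * BLOCK_SIZE / GIB

print(f"log size:          {log_mib:.0f} MiB")
print(f"current data size: {current_gib:.0f} GiB in {ags_needed(current)} AGs")
print(f"guessed original:  {original_gib:.2f} GiB in {ags_needed(guessed_original)} AGs")
print(f"growth factor:     {current['blocks'] / guessed_original['blocks']:.2f}x")
```

The AG size is fixed at mkfs time and growing the filesystem only adds more
AGs of the same size, which is why an unchanged agsize alongside agcount=180
points at a filesystem that started out much smaller. The log is not resized
by growing, so it stays at its original 12800 blocks.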

So some of the slowness could be because the log is too small,
causing much more frequent semi-random 4kB metadata writeback than
should be occurring. That should be somewhat temporary slowness (in
the order of minutes) but it's also worst case behaviour for the
RAID6 configuration of the storage.
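
To put rough numbers on "worst case" here, the sketch below counts device
I/Os for a small write versus a full-stripe write. It assumes the classic
RAID6 read-modify-write path (read old data, P and Q, then write all three
back); MD may choose a different strategy in practice, so treat the counts
as illustrative. The number of data disks is inferred from swidth/sunit in
the xfs_info output, which is an assumption about how the array looked at
mkfs time.

```python
# Sketch only: geometry values are taken from the xfs_info output in this
# mail; the I/O counts assume the classic RAID6 read-modify-write path.
BLOCK_SIZE = 4096      # bsize, in bytes
SUNIT_BLOCKS = 128     # sunit, in filesystem blocks
SWIDTH_BLOCKS = 512    # swidth, in filesystem blocks
PARITY_DISKS = 2       # RAID6 keeps two parity blocks (P and Q) per stripe

chunk_bytes = SUNIT_BLOCKS * BLOCK_SIZE
stripe_bytes = SWIDTH_BLOCKS * BLOCK_SIZE
data_disks = SWIDTH_BLOCKS // SUNIT_BLOCKS  # inferred, see lead-in


def rmw_ios(data_chunks_touched=1):
    """Device I/Os for a sub-stripe write done as read-modify-write."""
    reads = data_chunks_touched + PARITY_DISKS   # old data + old P + old Q
    writes = data_chunks_touched + PARITY_DISKS  # new data + new P + new Q
    return reads + writes


def full_stripe_ios():
    """Device I/Os for a full-stripe write: no reads, every member written."""
    return data_disks + PARITY_DISKS


print(f"chunk:  {chunk_bytes // 1024} KiB, stripe: {stripe_bytes // 1024} KiB "
      f"across {data_disks} data disks")
print(f"one 4 KiB metadata write: {rmw_ios()} device I/Os")
print(f"one full-stripe write:    {full_stripe_ios()} device I/Os")
```

Under these assumptions a single 4kB metadata write costs about as many
device I/Os as writing a whole stripe, and scattered 4kB writes rarely fill
a stripe, so almost all of them take the expensive path.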

Also, what mount options are in use?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

