VMs getting into stuck states since kernel ~5.13


* VMs getting into stuck states since kernel ~5.13
@ 2021-12-08 18:54 Chris Murphy
  2021-12-08 21:33 ` Dave Chinner
  2021-12-09 14:56 ` Arkadiusz Miśkiewicz
  0 siblings, 2 replies; 7+ messages in thread
From: Chris Murphy @ 2021-12-08 18:54 UTC
  To: xfs list

Hi,

I'm trying to help progress a kernel regression hitting Fedora
infrastructure in which dozens of VMs run concurrently to execute QA
testing. The problem doesn't happen immediately, but all the VMs get
stuck and then any new process also gets stuck, so extracting
information from the system has been difficult and there's not a lot
to go on, but this is what I've got so far.

Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
state where forking does not work correctly, breaking most things
https://bugzilla.redhat.com/show_bug.cgi?id=2009585

In that bug some items of interest ...

This megaraid_sas trace. The hang hasn't happened at this point
though, so it may not be related at all or it might be an instigator.
https://bugzilla.redhat.com/show_bug.cgi?id=2009585#c31

Once there is a hang, we have these traces from reducing the time for
the kernel to report blocked tasks. I'm told by the kvm/qemu folks
that most of the messages are pretty ordinary/expected locks. But the
XFS portions might give a clue what's going on?

5.15-rc7
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
5.15+
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939

So I can imagine the VMs are stuck because XFS is stuck. And XFS is
stuck because something in the block layer or megaraid driver is
stuck, but I don't know that for certain.

Thanks,

-- 
Chris Murphy


* Re: VMs getting into stuck states since kernel ~5.13
  2021-12-08 18:54 VMs getting into stuck states since kernel ~5.13 Chris Murphy
@ 2021-12-08 21:33 ` Dave Chinner
  2021-12-10 20:06   ` Chris Murphy
  2021-12-09 14:56 ` Arkadiusz Miśkiewicz
  1 sibling, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2021-12-08 21:33 UTC
  To: Chris Murphy; +Cc: xfs list

On Wed, Dec 08, 2021 at 01:54:02PM -0500, Chris Murphy wrote:
> Hi,
> 
> I'm trying to help progress a kernel regression hitting Fedora
> infrastructure in which dozens of VMs run concurrently to execute QA
> testing. The problem doesn't happen immediately, but all the VMs get
> stuck and then any new process also gets stuck, so extracting
> information from the system has been difficult and there's not a lot
> to go on, but this is what I've got so far.
> 
> Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
> state where forking does not work correctly, breaking most things
> https://bugzilla.redhat.com/show_bug.cgi?id=2009585
> 
> In that bug some items of interest ...
> 
> This megaraid_sas trace. The hang hasn't happened at this point
> though, so it may not be related at all or it might be an instigator.
> https://bugzilla.redhat.com/show_bug.cgi?id=2009585#c31

That's indicative of a bio handling bug somewhere in the storage
stack, likely the MD RAID layer...

> Once there is a hang, we have these traces from reducing the time for
> the kernel to report blocked tasks. I'm told by the kvm/qemu folks
> that most of the messages are pretty ordinary/expected locks. But the
> XFS portions might give a clue what's going on?
> 
> 5.15-rc7
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941

So you have processes waiting on both journal IO completion
(xlog_wait_on_iclog()) and data IO completion
(wait_on_page_writeback()).

> 5.15+
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939

And same here, except it is folio_wait_writeback() in this one.

They are all waiting for the storage to complete IOs.

> So I can imagine the VMs are stuck because XFS is stuck. And XFS is
> stuck because something in the block layer or megaraid driver is
> stuck, but I don't know that for certain.

Looking at the traces, I'd say IO is really slow, but not stuck.
`iostat -dxm 5` output for a few minutes will tell you if IO is
actually making progress or not.
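
As a rough illustration of what "making progress" means here, the
sketch below compares two samples of /proc/diskstats, the counters that
iostat itself reads. It is only a sketch: the function names are made
up for this example, and it assumes the standard field layout (reads
completed, writes completed and IOs in flight as the 4th, 8th and 12th
fields of each line).

```python
def parse_diskstats(text):
    """Map device name -> (reads completed, writes completed, IOs in flight)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        name = fields[2]
        stats[name] = (int(fields[3]), int(fields[7]), int(fields[11]))
    return stats


def io_progress(before, after):
    """Return device -> (IOs completed between samples, IOs still in flight)."""
    result = {}
    for name, (reads1, writes1, inflight) in after.items():
        if name in before:
            reads0, writes0, _ = before[name]
            result[name] = ((reads1 - reads0) + (writes1 - writes0), inflight)
    return result


def sample(path="/proc/diskstats"):
    """Take one sample of the kernel's per-device IO counters."""
    with open(path) as f:
        return parse_diskstats(f.read())
```

Taking `sample()` twice a few seconds apart and passing both results to
`io_progress()` gives a per-device pair. A device that reports IOs in
flight but zero completions across several intervals looks stuck rather
than slow.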

Can you please provide the hardware configuration for these machines
and iostat output before we go any further here?

https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: VMs getting into stuck states since kernel ~5.13
  2021-12-08 21:33 ` Dave Chinner
@ 2021-12-10 20:06   ` Chris Murphy
  2021-12-10 21:20     ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Murphy @ 2021-12-10 20:06 UTC
  To: Dave Chinner; +Cc: Chris Murphy, xfs list

On Wed, Dec 8, 2021 at 4:33 PM Dave Chinner <david@fromorbit.com> wrote:
> Looking at the traces, I'd say IO is really slow, but not stuck.
> `iostat -dxm 5` output for a few minutes will tell you if IO is
> actually making progress or not.

Pretty sure like everything else we run once the hang happens, iostat
will just hang too. But I'll ask.

>
> Can you please provide the hardware configuration for these machines
> and iostat output before we go any further here?

Dell PERC H730P
megaraid sas controller, to 10x 600G sas drives, BFQ ioscheduler, and
the stack is:

partition->mdadm raid6, 512KiB chunk->dm-crypt->LVM->XFS

meta-data=/dev/mapper/vg_guests-LogVol00 isize=512    agcount=180, agsize=1638272 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=294649856, imaxpct=25
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0



-- 
Chris Murphy


* Re: VMs getting into stuck states since kernel ~5.13
  2021-12-10 20:06   ` Chris Murphy
@ 2021-12-10 21:20     ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2021-12-10 21:20 UTC
  To: Chris Murphy; +Cc: xfs list

On Fri, Dec 10, 2021 at 03:06:37PM -0500, Chris Murphy wrote:
> On Wed, Dec 8, 2021 at 4:33 PM Dave Chinner <david@fromorbit.com> wrote:
> > Looking at the traces, I'd say IO is really slow, but not stuck.
> > `iostat -dxm 5` output for a few minutes will tell you if IO is
> > actually making progress or not.
> 
> Pretty sure like everything else we run once the hang happens, iostat
> will just hang too. But I'll ask.

If that's the case, then I want to see the stack trace for the hung
iostat binary.
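
One way to find which tasks to collect stacks for is to scan /proc for
processes in uninterruptible sleep (state D). The sketch below is an
illustration only: the helper names are invented for this example, it
looks at processes rather than individual threads, and it assumes
something can still be run on the host once the hang has started.

```python
import os


def task_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return comm, state


def blocked_tasks(proc="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                comm, state = task_state(f.read())
        except OSError:
            continue  # task exited while we were scanning
        if state == "D":
            found.append((int(entry), comm))
    return found
```

For each pid returned by `blocked_tasks()`, the kernel stack can then
usually be read as root from /proc/<pid>/stack, on kernels that expose
that file.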

> > Can you please provide the hardware configuration for these machines
> > and iostat output before we go any further here?
> 
> Dell PERC H730P
> megaraid sas controller,

Does it have NVRAM? If so, how much, how is it configured, etc.

> to 10x 600G sas drives,

Configured as JBOD, or something else? What is the per-drive cache
configuration?

The link I put in the last email did ask for this sort of specific
information....

> BFQ ioscheduler, and
> the stack is:
> 
> partition->mdadm raid6, 512KiB chunk->dm-crypt->LVM->XFS

Ok, that's pretty suboptimal for VM hosts that typically see lots of
small random IO. 512kB chunk size RAID6 is about the worst
configuration you can have for a small random write workload.
Putting dmcrypt on top of that will make things even slower.

But it also means we need to be looking for bugs in both dmcrypt and
MD raid, and given that there are bad bios being built somewhere in
the stack, it's a good bet the problems are somewhere within these
two layers.

> meta-data=/dev/mapper/vg_guests-LogVol00 isize=512    agcount=180, agsize=1638272 blks

Where does the agcount=180 come from? Has this filesystem been grown
at some point in time?

>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1    bigtime=0 inobtcount=0
> data     =                       bsize=4096   blocks=294649856, imaxpct=25
>          =                       sunit=128    swidth=512 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=12800, version=2

Yeah, 50MB is a tiny log for a filesystem of this size.

>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

Ok, here's what I'm guessing was the original fs config:

# mkfs.xfs -N -d size=100G,su=512k,sw=4 foo
meta-data=foo                    isize=512    agcount=16, agsize=1638272 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=26212352, imaxpct=25
         =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
#

and then it was grown from 100GB to 1.5TB at some point later on in
its life. An actual 1.5TB fs should have a log that is somewhere
around 800MB in size.
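
The arithmetic behind that guess can be checked directly from the
xfs_info fields quoted above. This is just a sketch: the helper names
are made up for the example, and every number is copied from the two
outputs in this thread.

```python
BLOCK_SIZE = 4096  # bsize from the xfs_info output


def mib(blocks, block_size=BLOCK_SIZE):
    """Convert a count of filesystem blocks to MiB."""
    return blocks * block_size / 2**20


def ag_count(data_blocks, agsize):
    """Number of allocation groups; the last one may be partial, so round up."""
    return -(-data_blocks // agsize)


log_mib = mib(12800)                    # log blocks=12800
ag_gib = mib(1638272) / 1024            # agsize=1638272 blks
orig_ags = ag_count(26212352, 1638272)  # data blocks of the guessed original fs
now_ags = ag_count(294649856, 1638272)  # data blocks of the current fs
stripe_units = 512 // 128               # swidth / sunit

print(f"log: {log_mib:.0f} MiB")
print(f"AG size: {ag_gib:.2f} GiB")
print(f"AGs: {orig_ags} originally, {now_ags} now")
print(f"stripe unit: {mib(128) * 1024:.0f} KiB x {stripe_units} data units")
```

The AG size and the log size are identical in both outputs and only
the AG count differs, which is the pattern a grown filesystem would
leave behind.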

So some of the slowness could be because the log is too small,
causing much more frequent semi-random 4kB metadata writeback than
should be occurring. That should be somewhat temporary slowness (in
the order of minutes) but it's also worst case behaviour for the
RAID6 configuration of the storage.

Also, what mount options are in use?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: VMs getting into stuck states since kernel ~5.13
  2021-12-08 18:54 VMs getting into stuck states since kernel ~5.13 Chris Murphy
  2021-12-08 21:33 ` Dave Chinner
@ 2021-12-09 14:56 ` Arkadiusz Miśkiewicz
  2021-12-10 19:56   ` Chris Murphy
  2021-12-12 14:21   ` Chris Murphy
  1 sibling, 2 replies; 7+ messages in thread
From: Arkadiusz Miśkiewicz @ 2021-12-09 14:56 UTC
  To: Chris Murphy, xfs list

On 08.12.2021 at 19:54, Chris Murphy wrote:
> Hi,
> 
> I'm trying to help progress a kernel regression hitting Fedora
> infrastructure in which dozens of VMs run concurrently to execute QA
> testing. The problem doesn't happen immediately, but all the VMs get
> stuck and then any new process also gets stuck, so extracting
> information from the system has been difficult and there's not a lot
> to go on, but this is what I've got so far.

Does qemu there have this fix?

https://github.com/qemu/qemu/commit/cc071629539dc1f303175a7e2d4ab854c0a8b20f

block: introduce max_hw_iov for use in scsi-generic
Linux limits the size of iovecs to 1024 (UIO_MAXIOV in the kernel
sources, IOV_MAX in POSIX).  Because of this, on some host adapters
requests with many iovecs are rejected with -EINVAL by the
io_submit() or readv()/writev() system calls.

In fact, the same limit applies to SG_IO as well.  To fix both the
EINVAL and the possible performance issues from using fewer iovecs
than allowed by Linux (some HBAs have max_segments as low as 128),
introduce a separate entry in BlockLimits to hold the max_segments
value from sysfs.  This new limit is used only for SG_IO and clamped
to bs->bl.max_iov anyway, just like max_hw_transfer is clamped to
bs->bl.max_transfer.
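
A minimal sketch of the clamping that commit message describes,
written in Python for illustration and not taken from QEMU's code. The
function names are hypothetical; the sysfs path is the queue limit the
kernel exports per block device.

```python
from pathlib import Path

IOV_MAX = 1024  # UIO_MAXIOV in the kernel sources


def read_max_segments(device, sysfs="/sys/block"):
    """Read the max_segments queue limit the kernel exports for a block device."""
    return int(Path(sysfs, device, "queue", "max_segments").read_text())


def clamp_hw_iov(max_segments, max_iov=IOV_MAX):
    """Clamp the hardware segment limit to the generic iovec limit."""
    return min(max_segments, max_iov)
```

For an HBA that reports max_segments of 128, `clamp_hw_iov(128)` stays
at 128, while a device reporting more than 1024 is held to 1024.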

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


* Re: VMs getting into stuck states since kernel ~5.13
  2021-12-09 14:56 ` Arkadiusz Miśkiewicz
@ 2021-12-10 19:56   ` Chris Murphy
  2021-12-12 14:21   ` Chris Murphy
  1 sibling, 0 replies; 7+ messages in thread
From: Chris Murphy @ 2021-12-10 19:56 UTC
  To: Arkadiusz Miśkiewicz; +Cc: Chris Murphy, xfs list

On Thu, Dec 9, 2021 at 9:56 AM Arkadiusz Miśkiewicz
<a.miskiewicz@gmail.com> wrote:
>
> On 08.12.2021 at 19:54, Chris Murphy wrote:
> > Hi,
> >
> > I'm trying to help progress a kernel regression hitting Fedora
> > infrastructure in which dozens of VMs run concurrently to execute QA
> > testing. The problem doesn't happen immediately, but all the VMs get
> > stuck and then any new process also gets stuck, so extracting
> > information from the system has been difficult and there's not a lot
> > to go on, but this is what I've got so far.
>
> Does qemu there have this fix?
>
> https://github.com/qemu/qemu/commit/cc071629539dc1f303175a7e2d4ab854c0a8b20f

I don't think so. The problem appeared in Fedora 34, which had
qemu-5.2.0, and now in Fedora 35, which has qemu-6.1.0. I'll see about
backporting that patch if it isn't already in 6.1.0-13.fc35.


-- 
Chris Murphy


* Re: VMs getting into stuck states since kernel ~5.13
  2021-12-09 14:56 ` Arkadiusz Miśkiewicz
  2021-12-10 19:56   ` Chris Murphy
@ 2021-12-12 14:21   ` Chris Murphy
  1 sibling, 0 replies; 7+ messages in thread
From: Chris Murphy @ 2021-12-12 14:21 UTC
  To: Arkadiusz Miśkiewicz; +Cc: Chris Murphy, xfs list

On Thu, Dec 9, 2021 at 9:56 AM Arkadiusz Miśkiewicz
<a.miskiewicz@gmail.com> wrote:
>
> On 08.12.2021 at 19:54, Chris Murphy wrote:
> > Hi,
> >
> > I'm trying to help progress a kernel regression hitting Fedora
> > infrastructure in which dozens of VMs run concurrently to execute QA
> > testing. The problem doesn't happen immediately, but all the VMs get
> > stuck and then any new process also gets stuck, so extracting
> > information from the system has been difficult and there's not a lot
> > to go on, but this is what I've got so far.
>
> Does qemu there have this fix?
>
> https://github.com/qemu/qemu/commit/cc071629539dc1f303175a7e2d4ab854c0a8b20f
>
> block: introduce max_hw_iov for use in scsi-generic

OK, this worker is now running a version of qemu with this fix
backported, and the problem still happens.

I'll try to get answers to Dave's other questions (I don't have direct
access to this system, so there's a suboptimal relay happening in the
conversation).


-- 
Chris Murphy

