[Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13


From: bugzilla-daemon@kernel.org
To: linux-xfs@vger.kernel.org
Subject: [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
Date: Thu, 02 Nov 2023 20:59:09 +0000
Message-ID: <bug-217572-201763-7aKPmPiF6l@https.bugzilla.kernel.org/>
In-Reply-To: <bug-217572-201763@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #21 from Dave Chinner (david@fromorbit.com) ---
On Thu, Nov 02, 2023 at 03:27:58PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217572
> 
> --- Comment #18 from Christian Theune (ct@flyingcircus.io) ---
> We've updated a while ago and our fleet is not seeing improved results.
> They've actually seemed to have gotten worse according to the number of
> alerts we've seen.

This is still an unreproducible, unfixed bug in upstream kernels.
There is no known reproducer, so actually triggering it and hence
performing RCA is extremely difficult at this point in time. We don't
really even know what workload triggers it.

> We've had a multitude of crashes in the last weeks with the following
> statistics:
> 
> 6.1.31 - 2 affected machines
> 6.1.35 - 1 affected machine
> 6.1.37 - 1 affected machine
> 6.1.51 - 5 affected machines
> 6.1.55 - 2 affected machines
> 6.1.57 - 2 affected machines

Do these machines have ECC memory?
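
One way to gather evidence for this question from userspace is to read the
EDAC error counters in sysfs. The following is a minimal sketch, assuming an
EDAC driver is loaded and exposes memory controllers under
/sys/devices/system/edac/mc; the helper name read_edac_counts is hypothetical.
An empty result only means no EDAC memory controller is registered, not that
the machine definitely lacks ECC.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return corrected/uncorrected error counts per EDAC memory controller.

    Assumption: the kernel's EDAC sysfs layout (mcN/ce_count, mcN/ue_count).
    A missing or unreadable counter is reported as None.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found (ECC absent or driver not loaded)")
    for name, entry in result.items():
        print(name, entry)
```

A non-zero ce_count or ue_count on an affected machine would be worth
mentioning in the bug report.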

> Here's the more detailed behaviour of one of the machines with 6.1.57.
> 
> $ uptime
>  16:10:23  up 13 days 19:00,  1 user,  load average: 3.21, 1.24, 0.57

Yeah, that's the problem - such a rare, one-off issue that we don't
really even know where to begin looking. :(

Given you seem to have a workload that occasionally triggers it,
could you try to craft a reproducer workload that does stuff similar
to your production workload and see if you can find out something
that makes this easier to trigger?

> $ uname -a
> Linux ts00 6.1.57 #1-NixOS SMP PREEMPT_DYNAMIC Tue Oct 10 20:00:46 UTC 2023
> x86_64 GNU/Linux
> 
> And here's the stall:
....
> [654042.645101]  <TASK>
> [654042.645353]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [654042.645956]  ? xas_descend+0x22/0x90
> [654042.646366]  xas_load+0x30/0x40
> [654042.646738]  filemap_get_read_batch+0x16e/0x250
> [654042.647253]  filemap_get_pages+0xa9/0x630
> [654042.647714]  filemap_read+0xd2/0x340
> [654042.648124]  ? __mod_memcg_lruvec_state+0x6e/0xd0
> [654042.648670]  xfs_file_buffered_read+0x4f/0xd0 [xfs]

This implies you are using memcg to constrain memory footprint of
the applications? Are these workloads running in memcgs that
experience random memcg OOM conditions? Or maybe the failure
correlates with global OOM conditions triggering memcg reclaim?
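
The memcg OOM question can be checked against the cgroup counters. The
following is a minimal sketch, assuming cgroup v2 is mounted at
/sys/fs/cgroup, where each cgroup with the memory controller enabled has a
memory.events file containing "oom" and "oom_kill" counters. The function
names are hypothetical, and the counters are cumulative over the cgroup's
lifetime, so they show that an OOM happened but not when.

```python
from pathlib import Path


def parse_memory_events(text):
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def cgroups_with_oom(root="/sys/fs/cgroup"):
    """Yield (cgroup path, counters) for every cgroup that recorded an OOM.

    Assumption: cgroup v2 unified hierarchy mounted at `root`.
    """
    for events in Path(root).rglob("memory.events"):
        try:
            counters = parse_memory_events(events.read_text())
        except OSError:
            continue  # cgroup vanished or is unreadable; skip it
        if counters.get("oom", 0) or counters.get("oom_kill", 0):
            yield str(events.parent), counters


if __name__ == "__main__":
    for path, counters in cgroups_with_oom():
        print(path, counters)
```

Comparing this output shortly after a stall with the timestamps of the
blocked-task warnings in the kernel log would show whether the two correlate.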

Cheers,

Dave.

