public inbox for linux-xfs@vger.kernel.org
From: bugzilla-daemon@kernel.org
To: linux-xfs@vger.kernel.org
Subject: [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
Date: Fri, 03 Nov 2023 12:52:34 +0000	[thread overview]
Message-ID: <bug-217572-201763-daAbZgHO91@https.bugzilla.kernel.org/> (raw)
In-Reply-To: <bug-217572-201763@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #22 from Christian Theune (ct@flyingcircus.io) ---
(In reply to Dave Chinner from comment #21)
> 
> This is still an unreproducible, unfixed bug in upstream kernels.
> There is no known reproducer, so actually triggering it and hence
> performing RCA is extremely difficult at this point in time. We don't
> really even know what workload triggers it.

It seems to be I/O-pressure related; we've seen it multiple times with various
PostgreSQL activities.

I've set aside time next week to analyze this further and to try to help
establish a reproducer.

> > We've had a multitude of crashes in the last weeks with the following
> > statistics:
> > 
> > 6.1.31 - 2 affected machines
> > 6.1.35 - 1 affected machine
> > 6.1.37 - 1 affected machine
> > 6.1.51 - 5 affected machines
> > 6.1.55 - 2 affected machines
> > 6.1.57 - 2 affected machines
> 
> Do these machines have ECC memory?

The physical hosts do. The affected systems are all Qemu/KVM virtual machines,
though.

> > Here's the more detailed behaviour of one of the machines with 6.1.57.
> > 
> > $ uptime
> >  16:10:23  up 13 days 19:00,  1 user,  load average: 3.21, 1.24, 0.57
> 
> Yeah, that's the problem - such a rare, one off issue that we don't
> really even know where to begin looking. :(
> 
> Given you seem to have a workload that occasionally triggers it,
> could you try to craft a reproducer workload that does stuff similar
> to your production workload and see if you can find out something
> that makes this easier to trigger?

Yup. I'm prioritizing this over the coming weeks.
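
For the record, here's a sketch of the kind of reproducer attempt I have in
mind. Nothing in it is a confirmed trigger: pgbench (which ships with
PostgreSQL), the scratch database name "bench", the memcg cap, scale and
duration are all placeholders I'd tune while experimenting. It combines the
two ingredients the reports seem to share: sustained PostgreSQL write load
and memory pressure.

```shell
# Sketch only, not a known trigger. DRY_RUN=1 (default) just prints
# each step instead of running it, so the outline can be reviewed first.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run pgbench -i -s 500 bench          # populate a scratch database
run systemd-run --scope -p MemoryMax=1G \
    pgbench -c 32 -j 4 -T 7200 bench # 2h of OLTP churn inside a tight memcg
```

Running it with DRY_RUN=0 on an XFS-backed test VM, then varying the
MemoryMax cap, is roughly how I'd try to narrow things down.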


> This implies you are using memcg to constrain memory footprint of
> the applications? Are these workloads running in memcgs that
> experience random memcg OOM conditions? Or maybe the failure
> correlates with global OOM conditions triggering memcg reclaim?

I'll have to read up on what memcg is and whether we're doing anything with it
on purpose. At the moment I think we're just getting whatever the kernel or
distro defaults provide in our baseline environment.

How do I notice a memcg OOM? I've always tried to correlate all kernel log
messages and haven't seen any tracebacks other than the ones I posted.

A global (so I guess a "regular") OOM hasn't been involved in any case so far.
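
Partially answering my own question after a first skim of the cgroup v2 docs:
per-cgroup OOMs show up as "oom"/"oom_kill" counters in that cgroup's
memory.events file (and memcg OOM kills log "Memory cgroup out of memory"
lines in dmesg). A small sketch with a canned sample so the numbers are
fixed; on a live box the path would be something like
/sys/fs/cgroup/system.slice/postgresql.service/memory.events:

```shell
# Sketch: how a memcg OOM would show up. The sample mimics the cgroup v2
# memory.events format; substitute the real file on a live system.
sample='low 0
high 42
max 7
oom 3
oom_kill 1'

ooms=$(printf '%s\n' "$sample" | awk '$1 == "oom" { print $2 }')
kills=$(printf '%s\n' "$sample" | awk '$1 == "oom_kill" { print $2 }')
echo "memcg OOM conditions: $ooms, OOM kills: $kills"
```

If those counters stay at zero on the affected machines, that would rule out
the memcg angle Dave raised.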

I can try digging deeper into system VM statistics. We're running
telegraf/prometheus and monitor a fairly exhaustive set of system metrics on
all machines. Is there anything specific I should look for?

> 
> Cheers,
> 
> Dave.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

