From: bugzilla-daemon@kernel.org
To: linux-xfs@vger.kernel.org
Subject: [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
Date: Fri, 03 Nov 2023 12:52:34 +0000 [thread overview]
Message-ID: <bug-217572-201763-daAbZgHO91@https.bugzilla.kernel.org/> (raw)
In-Reply-To: <bug-217572-201763@https.bugzilla.kernel.org/>
https://bugzilla.kernel.org/show_bug.cgi?id=217572
--- Comment #22 from Christian Theune (ct@flyingcircus.io) ---
(In reply to Dave Chinner from comment #21)
>
> This is still an unreproducible, unfixed bug in upstream kernels.
> There is no known reproducer, so actually triggering it and hence
> performing RCA is extremely difficult at this point in time. We don't
> really even know what workload triggers it.
It seems IO-pressure related; we've seen it multiple times with various
PostgreSQL activities.
I've set aside time next week to analyze this further and to try to help
establish a reproducer.
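For what it's worth, the IO-pressure hunch can be checked cheaply via PSI. This is just a sketch of what I'd look at, assuming the kernel was built with CONFIG_PSI (the default on recent distro kernels):

```shell
# Pressure Stall Information for block IO.
# "some": share of wall time at least one task was stalled waiting on IO.
# "full": share of wall time *all* non-idle tasks were stalled at once.
cat /proc/pressure/io
```

Sustained non-zero "full" percentages around the time of the first blocked-task warning would support the IO-pressure theory.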
> > We've had a multitude of crashes in the last weeks with the following
> > statistics:
> >
> > 6.1.31 - 2 affected machines
> > 6.1.35 - 1 affected machine
> > 6.1.37 - 1 affected machine
> > 6.1.51 - 5 affected machines
> > 6.1.55 - 2 affected machines
> > 6.1.57 - 2 affected machines
>
> Do these machines have ECC memory?
The physical hosts do. The affected systems are all Qemu/KVM virtual machines,
though.
> > Here's the more detailed behaviour of one of the machines with 6.1.57.
> >
> > $ uptime
> > 16:10:23 up 13 days 19:00, 1 user, load average: 3.21, 1.24, 0.57
>
> Yeah, that's the problem - such a rare, one off issue that we don't
> really even know where to begin looking. :(
>
> Given you seem to have a workload that occasionally triggers it,
> could you try to craft a reproducer workload that does stuff similar
> to your production workload and see if you can find out something
> that makes this easier to trigger?
Yup. I'm prioritizing this for the next few weeks.
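As a starting point, I imagine the reproducer would look something like a sustained write-heavy pgbench run sized past RAM; the scale factor and client counts below are guesses, not values from our production workload:

```shell
# Hypothetical reproducer sketch against an XFS-backed PostgreSQL instance.
# Sizing the dataset past RAM keeps the page cache and dirty writeback
# under constant pressure, which is when we have seen the hangs.
pgbench -i -s 500 bench            # initialize ~7.5 GB of data; adjust upward
pgbench -c 32 -j 8 -T 3600 bench   # 32 clients, 8 threads, one hour of load
```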
> This implies you are using memcg to constrain memory footprint of
> the applications? Are these workloads running in memcgs that
> experience random memcg OOM conditions? Or maybe the failure
> correlates with global OOM conditions triggering memcg reclaim?
I'll have to read up on what memcg is and whether we're doing anything with it
on purpose. At the moment I think this is just whatever we're getting from our
baseline environment with kernel or distro defaults.
How do I notice a memcg OOM? I've always tried to correlate all kernel log
messages and haven't seen any tracebacks other than the ones I posted.
Global (so I guess "regular") OOM hasn't been involved in any case so far.
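From what I can tell, memcg OOM kills are logged differently from global ones, so here's a sketch of what I plan to grep for, assuming cgroup v2 (the systemd default):

```shell
# A memcg OOM kill logs "Memory cgroup out of memory: Killed process ...",
# not the global "Out of memory:" line, so it's easy to miss when grepping.
dmesg | grep -i 'memory cgroup out of memory'

# cgroup v2 also keeps per-cgroup counters; a non-zero "oom"/"oom_kill"
# means that cgroup hit its memory limit at some point since boot.
grep -H -E '^oom' /sys/fs/cgroup/system.slice/*/memory.events 2>/dev/null
```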
I can try digging deeper into system VM statistics. We're running
telegraf/prometheus and monitor a relatively exhaustive set of system metrics
on all machines. Is there anything specific I could look for?
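On that note, a sketch of reclaim-related counters that seem worth graphing around a hang, on the working theory that the failures correlate with memory-pressure reclaim (these are standard /proc/vmstat fields, nothing specific to our setup):

```shell
# Scan vs. steal: pgscan_* climbing while pgsteal_* stays flat suggests
# reclaim is scanning hard but failing to actually free pages.
grep -E '^(pgscan_kswapd|pgscan_direct|pgsteal_kswapd|pgsteal_direct)' /proc/vmstat

# Direct-reclaim stalls and the writeback backlog are also worth watching.
grep -E '^(allocstall|nr_dirty |nr_writeback )' /proc/vmstat
```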
>
> Cheers,
>
> Dave.
--
You may reply to this email to add a comment.
You are receiving this mail because:
You are watching the assignee of the bug.