From: bugzilla-daemon@kernel.org
To: linux-xfs@vger.kernel.org
Subject: [Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13
Date: Fri, 03 Nov 2023 12:52:34 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #22 from Christian Theune (ct@flyingcircus.io) ---

(In reply to Dave Chinner from comment #21)
>
> This is still an unreproducible, unfixed bug in upstream kernels.
> There is no known reproducer, so actually triggering it and hence
> performing RCA is extremely difficult at this point in time. We don't
> really even know what workload triggers it.

It seems IO-pressure related and we've seen it multiple times with various
PostgreSQL activities. I've planned time next week to analyze this further
and to try to help establish a reproducer.
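Roughly the direction I have in mind for a reproducer -- only a sketch at
this point, not something we've confirmed triggers it; database name, mount
point, scale factor and runtimes are placeholders for whatever matches the
affected machines:

  # build a database well beyond RAM so checkpoints/autovacuum produce
  # sustained writeback against the XFS filesystem
  $ pgbench -i -s 5000 testdb

  # long-running mixed read/write load, similar to our production activity
  $ pgbench -c 32 -j 8 -T 86400 testdb

  # in parallel, add extra dirty-page/IO pressure from outside PostgreSQL
  $ fio --name=dirty --directory=/mnt/scratch --rw=randwrite --bs=8k \
        --size=8g --numjobs=4 --time_based --runtime=86400

The idea is simply to keep IO pressure high while PostgreSQL does its usual
checkpoint and autovacuum work, since that is the closest match to what the
affected machines were doing when they locked up.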
> > We've had a multitude of crashes in the last weeks with the following
> > statistics:
> >
> > 6.1.31 - 2 affected machines
> > 6.1.35 - 1 affected machine
> > 6.1.37 - 1 affected machine
> > 6.1.51 - 5 affected machines
> > 6.1.55 - 2 affected machines
> > 6.1.57 - 2 affected machines
>
> Do these machines have ECC memory?

The physical hosts do. The affected systems are all Qemu/KVM virtual
machines, though.

> > Here's the more detailed behaviour of one of the machines with 6.1.57.
> >
> > $ uptime
> > 16:10:23 up 13 days 19:00, 1 user, load average: 3.21, 1.24, 0.57
>
> Yeah, that's the problem - such a rare, one-off issue that we don't
> really even know where to begin looking. :(
>
> Given you seem to have a workload that occasionally triggers it,
> could you try to craft a reproducer workload that does stuff similar
> to your production workload and see if you can find out something
> that makes this easier to trigger?

Yup. I'm prioritizing this for the next few weeks.

> This implies you are using memcg to constrain memory footprint of
> the applications? Are these workloads running in memcgs that
> experience random memcg OOM conditions? Or maybe the failure
> correlates with global OOM conditions triggering memcg reclaim?

I'll have to read up on what memcg is and whether we're doing anything with
it on purpose. At the moment I think this is just whatever we're getting
from our baseline environment with kernel or distro defaults.

How do I notice a memcg OOM? I've always tried to correlate all kernel log
messages and haven't seen any other tracebacks than the ones I posted.
Global (so I guess a "regular") OOM wasn't involved in any case so far.

I can try digging deeper into system VM statistics. We're running
telegraf/prometheus and have a relatively exhaustive number of system
variables we're monitoring on all systems. Anything specific I could look
for? (A sketch of what I plan to check first is below, after the quote.)

>
> Cheers,
>
> Dave.
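PS, re my memcg question above -- what I plan to check first. This is only a
rough sketch, assuming cgroup v2 and that PostgreSQL runs under a systemd
slice; the exact cgroup path is a guess for our setup:

  # per-cgroup OOM counters; 'oom' and 'oom_kill' increment on memcg OOM events
  $ cat /sys/fs/cgroup/system.slice/postgresql.service/memory.events

  # memcg OOM kills are logged separately from a global OOM
  $ dmesg | grep -i "Memory cgroup out of memory"

  # pressure stall information for memory and IO, globally and per cgroup
  $ cat /proc/pressure/memory /proc/pressure/io
  $ cat /sys/fs/cgroup/system.slice/postgresql.service/memory.pressure

If anything there correlates with the blocked-task incidents, I'll report it
here.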