From: Benjamin Segall
To: Zhang Qiao
Cc: Tejun Heo, mingo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, Juri Lelli, Vincent Guittot, lizefan.x-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org, hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, lkml, vschneid-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, dietmar.eggemann-5wv7dgnIgG8@public.gmane.org, bristot-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, Steven Rostedt, mgorman-l3A5Bk7waGM@public.gmane.org
Subject: Re: [Question] The system may be stuck if there is a cpu cgroup cpu.cfs_quato_us is very low
Date: Fri, 01 Jul 2022 13:08:21 -0700
References: <5987be34-b527-4ff5-a17d-5f6f0dc94d6d@huawei.com>
In-Reply-To: (Zhang Qiao's message of "Fri, 1 Jul 2022 15:34:41 +0800")

Zhang Qiao writes:

> Hi, tejun
>
> Thanks for your reply.
>
> On 2022/6/27 16:32, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
>>> Because the task cgroup's cpu.cfs_quota_us is very small and
>>> test_fork's load is very heavy, test_fork may be throttled for a
>>> long time. The cgroup_threadgroup_rw_sem read lock is therefore held
>>> for a long time, and other processes get stuck waiting for the lock:
>>
>> Yeah, this is a known problem and can happen with other locks too. The
>> solution prolly is only throttling while in or when about to return to
>> userspace. There is one really important and wide-spread assumption in
>> the kernel:
>>
>> If things get blocked on some shared resource, whatever is holding
>> the resource ends up using more of the system to exit the critical
>> section faster and thus unblocks others ASAP. IOW, things running in
>> kernel are work-conserving.
>>
>> The cpu bw controller gives the userspace a rather easy way to break
>> this assumption and thus is rather fundamentally broken. This is
>> basically the same problem we had with the old cgroup freezer
>> implementation which trapped threads in random locations in the
>> kernel.
>>
>
> So, if we want to completely solve this problem, is the best way to
> change the cfs bw controller throttle mechanism? For example, throttle
> tasks in a safe location.

Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
serious reworking of how it works, because it would need to dequeue
tasks individually rather than doing the entire cfs_rq at a time (and
would require some effort to avoid pinging every throttling task to get
it into the kernel).
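
To make the difference concrete, here is a rough toy model (not kernel
code) comparing the two policies discussed above: stopping the task
wherever it happens to be when the quota runs out, versus letting it
finish the kernel critical section and throttling it only once it is
back in userspace. Everything in it is an assumption made for
illustration: the function name longest_lock_hold, the tick-based
clock, the single task that alternates between a lock-holding kernel
phase and a userspace phase, and the rule that runtime overdrawn inside
the kernel is paid back from later refills.

```python
def longest_lock_hold(quota, period, kernel_ticks, user_ticks,
                      total_ticks, defer_in_kernel):
    """Toy model: longest wall-clock span (in ticks) a lock stays held.

    One task alternates between a kernel phase (holds the lock for
    kernel_ticks of CPU time) and a user phase (user_ticks of CPU time).
    It receives `quota` ticks of runtime every `period` ticks.

    defer_in_kernel=False: the task stops as soon as runtime is gone,
    even while holding the lock.
    defer_in_kernel=True: the task may overdraw its runtime while it
    holds the lock and is throttled only outside the critical section;
    the overdraft is repaid from later refills (an assumption of this
    sketch, not a statement about any real implementation).
    """
    runtime = quota          # may go negative when overdrawn
    in_kernel = True         # the task starts by entering the kernel
    phase_left = kernel_ticks
    held_since = None        # tick at which the lock was taken
    longest = 0

    for now in range(total_ticks):
        if now > 0 and now % period == 0:
            runtime = min(runtime + quota, quota)   # periodic refill

        holding = held_since is not None
        if runtime <= 0 and not (defer_in_kernel and holding):
            continue                                # throttled this tick

        if in_kernel and not holding:
            held_since = now                        # lock acquired
        runtime -= 1
        phase_left -= 1

        if phase_left == 0:
            if in_kernel:
                # Critical section done: release and record the hold time.
                longest = max(longest, now + 1 - held_since)
                held_since = None
                in_kernel, phase_left = False, user_ticks
            else:
                in_kernel, phase_left = True, kernel_ticks
    return longest


if __name__ == "__main__":
    args = dict(quota=1, period=100, kernel_ticks=3, user_ticks=1,
                total_ticks=1000)
    print("throttle anywhere:      ", longest_lock_hold(defer_in_kernel=False, **args))
    print("throttle in userspace:  ", longest_lock_hold(defer_in_kernel=True, **args))
```

With a quota of 1 tick per 100-tick period and a 3-tick critical
section, the first policy keeps the lock held across two full throttled
periods, while the second holds it only for the 3 ticks the critical
section actually needs. The model ignores everything that makes the
real rework hard (per-task dequeueing, hierarchy, multiple CPUs).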