* [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low
@ 2022-06-27 6:50 Zhang Qiao
From: Zhang Qiao @ 2022-06-27 6:50 UTC
To: Tejun Heo, mingo-H+wXaHxf7aLQT0dZR+AlfA,
peterz-wEGCiKHe2LqWVfeAwA7xHQ, Juri Lelli, Vincent Guittot
Cc: lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes-druUgvl0LCNAfugRpC6u6w,
cgroups-u79uwXL29TY76Z2rM5mHXA, lkml,
vschneid-H+wXaHxf7aLQT0dZR+AlfA, dietmar.eggemann-5wv7dgnIgG8,
bristot-H+wXaHxf7aLQT0dZR+AlfA, bsegall-hpIqsD4AKlfQT0dZR+AlfA,
Steven Rostedt, mgorman-l3A5Bk7waGM
Hi all,
I'm debugging a problem.
The test case performs the following operations:
1) Create a test task cgroup and set cpu.cfs_quota_us=2000, cpu.cfs_period_us=100000.
2) Run 20 test_fork[1] processes in the test task cgroup.
3) Create 100 new containers:
for i in {1..100}; do docker run -itd --health-cmd="ls" --health-interval=1s ubuntu:latest bash; done
These operations are expected to succeed, with all 100 containers created. However, while the containers
are being created, the system gets stuck and container creation fails.
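For a sense of scale, the settings from step 1 can be turned into numbers. The following is a minimal sketch in C under a deliberately simplified single-CPU model, in which the group gets exactly quota_us of runtime in every period_us. The helper names are hypothetical, and the model ignores runtime slices, distribution across CPUs, and repayment of overrun, so real throttling times can be longer.

```c
#include <assert.h>

/* Hypothetical helpers; a simplified single-CPU model of CFS bandwidth. */

/* Share of one CPU the whole group may use, in percent. */
static double cpu_share_percent(long quota_us, long period_us)
{
    assert(quota_us >= 0 && period_us > 0);
    return 100.0 * (double)quota_us / (double)period_us;
}

/* Longest stall inside one period once the quota has been used up. */
static long max_stall_us(long quota_us, long period_us)
{
    assert(quota_us >= 0 && quota_us <= period_us);
    return period_us - quota_us;
}

/*
 * Rough wall-clock time for the group to obtain cpu_need_us of CPU time:
 * each fully consumed quota costs one whole period, and the remainder
 * runs at the start of the following period.
 */
static long wall_time_us(long cpu_need_us, long quota_us, long period_us)
{
    assert(cpu_need_us >= 0 && quota_us > 0 && period_us >= quota_us);
    long full_periods = cpu_need_us / quota_us;
    long rest_us = cpu_need_us % quota_us;

    if (rest_us == 0 && full_periods > 0)
        return (full_periods - 1) * period_us + quota_us;
    return full_periods * period_us + rest_us;
}
```

With cpu.cfs_quota_us=2000 and cpu.cfs_period_us=100000, this model gives the whole group 2% of one CPU, a stall of up to 98 ms per period, and about 201 ms of wall-clock time for 5 ms of CPU work. The 20 test_fork processes share that budget.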
After debugging this, I found that the test_fork process frequently sleeps in
freezer_fork()->mutex_lock()->might_sleep() while holding the cgroup_threadgroup_rw_sem lock, as follows:
copy_process():
    cgroup_can_fork()    ---> lock cgroup_threadgroup_rw_sem
    sched_cgroup_fork();
      ->task_fork_fair() {
          ->update_curr() {
              ->__account_cfs_rq_runtime() {
                    resched_curr();    ---> the quota is used up; TIF_NEED_RESCHED is set on current
              }
          }
      }
    cgroup_post_fork();
      ->freezer_fork()
          ->mutex_lock() {
              ->might_sleep()    ---> schedule(); the current task is throttled for a long time
          }
      ->cgroup_css_set_put_fork()    ---> unlock cgroup_threadgroup_rw_sem
Because the task cgroup's cpu.cfs_quota_us is very small and test_fork's load is very heavy, test_fork
may be throttled for a long time. The cgroup_threadgroup_rw_sem read lock is therefore held for a long
time, and other processes get stuck waiting for the lock:
1) a task forking a child waits at copy_process()->cgroup_can_fork();
2) an exiting task waits at exit_signals();
3) a task writing the cgroup.procs file waits at cgroup_file_write()->__cgroup1_procs_write();
...
Eventually the whole system can get stuck.
Does anyone know how to solve this, other than by changing cpu.cfs_quota_us?
[1] test_fork.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    pid_t pid;
    int count = 20;

    while (1) {
        /* Fork a burst of children that exit immediately. */
        for (int i = 0; i < count; i++) {
            if ((pid = fork()) < 0) {
                printf("fork error");
                return 1;
            } else if (pid == 0) {
                exit(0);
            }
        }
        /* Reap every child before starting the next burst. */
        for (int i = 0; i < count; i++) {
            wait(NULL);
        }
        sleep(1);
    }
    return 0;
}
Thanks a lot.
-Qiao
* Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low
@ 2022-06-27 8:32 Tejun Heo
From: Tejun Heo @ 2022-06-27 8:32 UTC
To: Zhang Qiao
Cc: mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
Juri Lelli, Vincent Guittot, lizefan.x-EC8Uxl6Npydl57MIdRCFDg,
hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml,
vschneid-H+wXaHxf7aLQT0dZR+AlfA, dietmar.eggemann-5wv7dgnIgG8,
bristot-H+wXaHxf7aLQT0dZR+AlfA, bsegall-hpIqsD4AKlfQT0dZR+AlfA,
Steven Rostedt, mgorman-l3A5Bk7waGM

Hello,

On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
> Because the task cgroup's cpu.cfs_quota_us is very small and
> test_fork's load is very heavy, the test_fork may be throttled long
> time, therefore, the cgroup_threadgroup_rw_sem read lock is held for
> a long time, other processes will get stuck waiting for the lock:

Yeah, this is a known problem and can happen with other locks too. The
solution prolly is only throttling while in or when about to return to
userspace. There is one really important and wide-spread assumption in
the kernel:

  If things get blocked on some shared resource, whatever is holding
  the resource ends up using more of the system to exit the critical
  section faster and thus unblocks others ASAP. IOW, things running in
  kernel are work-conserving.

The cpu bw controller gives the userspace a rather easy way to break
this assumption and thus is rather fundamentally broken. This is
basically the same problem we had with the old cgroup freezer
implementation which trapped threads in random locations in the
kernel.

So, right now, it's rather broken and can easily be used as a DoS
attack vector.

Thanks.

-- 
tejun
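The idea of throttling only on the way back to userspace can be illustrated with a toy model. The sketch below is plain C, not kernel code: the function lock_hold_ticks() and its tick-based model are assumptions made for illustration only. It compares how long a lock stays held when its owner is stopped as soon as the quota runs out with how long it stays held when the stop is deferred until the kernel section is finished.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model, not kernel code. Returns the wall-clock ticks a lock stays
 * held while its owner needs `need` ticks of CPU inside the kernel.
 *
 * The group is refilled with `quota` ticks at every multiple of `period`;
 * `left` is the runtime still available when the lock is taken at tick 0.
 * With defer_in_kernel set, the owner keeps running after the quota is
 * used up; `left` then goes negative, and that debt is carried into the
 * next refill instead of stalling the lock holder.
 */
static long lock_hold_ticks(long need, long left, long quota, long period,
                            bool defer_in_kernel)
{
    assert(need >= 0 && left >= 0 && quota > 0 && period >= quota);
    long now = 0;

    while (need > 0) {
        /* Period boundary: unused runtime is dropped, debt is kept. */
        if (now > 0 && now % period == 0)
            left = (left < 0 ? left : 0) + quota;

        /* Run for one tick if allowed, otherwise sit throttled. */
        if (left > 0 || defer_in_kernel) {
            left--;
            need--;
        }
        now++;
    }
    return now;
}
```

For example, with need=3, left=1, quota=2 and period=100, the lock is held for 102 ticks when throttling is immediate and for 3 ticks when it is deferred. In the deferred case the overrun would still have to be paid back once the task is in userspace.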
* Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low
@ 2022-07-01 7:34 Zhang Qiao
From: Zhang Qiao @ 2022-07-01 7:34 UTC
To: Tejun Heo
Cc: mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
Juri Lelli, Vincent Guittot, lizefan.x-EC8Uxl6Npydl57MIdRCFDg,
hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml,
vschneid-H+wXaHxf7aLQT0dZR+AlfA, dietmar.eggemann-5wv7dgnIgG8,
bristot-H+wXaHxf7aLQT0dZR+AlfA, bsegall-hpIqsD4AKlfQT0dZR+AlfA,
Steven Rostedt, mgorman-l3A5Bk7waGM

Hi, tejun

Thanks for your reply.

On 2022/6/27 16:32, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
>> Because the task cgroup's cpu.cfs_quota_us is very small and
>> test_fork's load is very heavy, the test_fork may be throttled long
>> time, therefore, the cgroup_threadgroup_rw_sem read lock is held for
>> a long time, other processes will get stuck waiting for the lock:
>
> Yeah, this is a known problem and can happen with other locks too. The
> solution prolly is only throttling while in or when about to return to
> userspace. There is one really important and wide-spread assumption in
> the kernel:
>
>   If things get blocked on some shared resource, whatever is holding
>   the resource ends up using more of the system to exit the critical
>   section faster and thus unblocks others ASAP. IOW, things running in
>   kernel are work-conserving.
>
> The cpu bw controller gives the userspace a rather easy way to break
> this assumption and thus is rather fundamentally broken. This is
> basically the same problem we had with the old cgroup freezer
> implementation which trapped threads in random locations in the
> kernel.
>

So, if we want to solve this problem completely, is the best way to
change the throttle mechanism of the cfs bw controller, for example by
throttling tasks only at a safe location?

Thanks.
Qiao

> So, right now, it's rather broken and can easily be used as a DoS
> attack vector.
>
> Thanks.
>
* Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low
@ 2022-07-01 20:08 Benjamin Segall
From: Benjamin Segall @ 2022-07-01 20:08 UTC
To: Zhang Qiao
Cc: Tejun Heo, mingo-H+wXaHxf7aLQT0dZR+AlfA,
peterz-wEGCiKHe2LqWVfeAwA7xHQ, Juri Lelli, Vincent Guittot,
lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes-druUgvl0LCNAfugRpC6u6w,
cgroups-u79uwXL29TY76Z2rM5mHXA, lkml,
vschneid-H+wXaHxf7aLQT0dZR+AlfA, dietmar.eggemann-5wv7dgnIgG8,
bristot-H+wXaHxf7aLQT0dZR+AlfA, Steven Rostedt, mgorman-l3A5Bk7waGM

Zhang Qiao <zhangqiao22-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> writes:

> Hi, tejun
>
> Thanks for your reply.
>
> On 2022/6/27 16:32, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
>>> Because the task cgroup's cpu.cfs_quota_us is very small and
>>> test_fork's load is very heavy, the test_fork may be throttled long
>>> time, therefore, the cgroup_threadgroup_rw_sem read lock is held for
>>> a long time, other processes will get stuck waiting for the lock:
>>
>> Yeah, this is a known problem and can happen with other locks too. The
>> solution prolly is only throttling while in or when about to return to
>> userspace. There is one really important and wide-spread assumption in
>> the kernel:
>>
>>   If things get blocked on some shared resource, whatever is holding
>>   the resource ends up using more of the system to exit the critical
>>   section faster and thus unblocks others ASAP. IOW, things running in
>>   kernel are work-conserving.
>>
>> The cpu bw controller gives the userspace a rather easy way to break
>> this assumption and thus is rather fundamentally broken. This is
>> basically the same problem we had with the old cgroup freezer
>> implementation which trapped threads in random locations in the
>> kernel.
>>
>
> so, if we want to completely solve this problem, is the best way to
> change the cfs bw controller throttle mechanism? for example, throttle
> tasks in a safe location.

Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
serious reworking of how it works, because it would need to dequeue
tasks individually rather than doing the entire cfs_rq at a time (and
would require some effort to avoid pinging every throttling task to get
it into the kernel).
* Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low
@ 2022-07-01 20:15 Tejun Heo
From: Tejun Heo @ 2022-07-01 20:15 UTC
To: Benjamin Segall
Cc: Zhang Qiao, mingo-H+wXaHxf7aLQT0dZR+AlfA,
peterz-wEGCiKHe2LqWVfeAwA7xHQ, Juri Lelli, Vincent Guittot,
lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes-druUgvl0LCNAfugRpC6u6w,
cgroups-u79uwXL29TY76Z2rM5mHXA, lkml,
vschneid-H+wXaHxf7aLQT0dZR+AlfA, dietmar.eggemann-5wv7dgnIgG8,
bristot-H+wXaHxf7aLQT0dZR+AlfA, Steven Rostedt, mgorman-l3A5Bk7waGM

On Fri, Jul 01, 2022 at 01:08:21PM -0700, Benjamin Segall wrote:
> Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
> serious reworking of how it works, because it would need to dequeue
> tasks individually rather than doing the entire cfs_rq at a time (and
> would require some effort to avoid pinging every throttling task to get
> it into the kernel).

Right, I don't have a good idea on evolving the current implementation
into something correct. As you pointed out, we need to account along
the sched_group tree but conditionally enforce on each thread.

Thanks.

-- 
tejun
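The split between accounting along the group tree and enforcing per thread can be sketched in a few lines. The following C fragment is only an illustration of that separation. struct bw_group and the bw_*() helpers are hypothetical names, not kernel interfaces, and the thread does not propose this (or any) concrete design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one group in the hierarchy. */
struct bw_group {
    struct bw_group *parent; /* NULL for the root */
    long quota_us;           /* runtime allowed per period; < 0 means unlimited */
    long used_us;            /* runtime charged in the current period */
};

/* Accounting: charge the runtime to the group and to all its ancestors. */
static void bw_charge(struct bw_group *g, long delta_us)
{
    assert(delta_us >= 0);
    for (; g != NULL; g = g->parent)
        g->used_us += delta_us;
}

/* True if the group or any of its ancestors has used up its quota. */
static bool bw_exhausted(const struct bw_group *g)
{
    for (; g != NULL; g = g->parent)
        if (g->quota_us >= 0 && g->used_us >= g->quota_us)
            return true;
    return false;
}

/* Enforcement: decided per thread, and never while it runs in the kernel. */
static bool bw_should_throttle(const struct bw_group *g, bool in_kernel)
{
    return bw_exhausted(g) && !in_kernel;
}
```

In this sketch a thread whose group is over quota keeps running while in_kernel is true, so it can leave its critical section, and is stopped only once it is outside the kernel.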
* Re: [Question] The system may get stuck if a cpu cgroup's cpu.cfs_quota_us is very low
@ 2022-07-07 6:59 Zhang Qiao
From: Zhang Qiao @ 2022-07-07 6:59 UTC
To: Tejun Heo, Benjamin Segall
Cc: mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
Juri Lelli, Vincent Guittot, lizefan.x-EC8Uxl6Npydl57MIdRCFDg,
hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA, lkml,
vschneid-H+wXaHxf7aLQT0dZR+AlfA, dietmar.eggemann-5wv7dgnIgG8,
bristot-H+wXaHxf7aLQT0dZR+AlfA, Steven Rostedt, mgorman-l3A5Bk7waGM

On 2022/7/2 4:15, Tejun Heo wrote:
> On Fri, Jul 01, 2022 at 01:08:21PM -0700, Benjamin Segall wrote:
>> Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
>> serious reworking of how it works, because it would need to dequeue
>> tasks individually rather than doing the entire cfs_rq at a time (and
>> would require some effort to avoid pinging every throttling task to get
>> it into the kernel).
>
> Right, I don't have a good idea on evolving the current implementation
> into something correct. As you pointed out, we need to account along
> the sched_group tree but conditionally enforce on each thread.
>
> Thanks.
>

Understood. Thanks for your detailed explanation.

Thanks.