[Question] The system may be stuck if there is a cpu cgroup cpu.cfs_quato

Linux cgroups development

* [Question] The system may be stuck if there is a cpu cgroup cpu.cfs_quato_us is very low
@ 2022-06-27  6:50 Zhang Qiao
From: Zhang Qiao @ 2022-06-27  6:50 UTC (permalink / raw)
  To: Tejun Heo, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, Juri Lelli, Vincent Guittot
  Cc: lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, lkml,
	vschneid-H+wXaHxf7aLQT0dZR+AlfA, dietmar.eggemann-5wv7dgnIgG8,
	bristot-H+wXaHxf7aLQT0dZR+AlfA, bsegall-hpIqsD4AKlfQT0dZR+AlfA,
	Steven Rostedt, mgorman-l3A5Bk7waGM

Hi all,

I'm debugging a problem.
The test case performs the following operations:
1) Create a test task cgroup and set cpu.cfs_quota_us=2000, cpu.cfs_period_us=100000.
2) Run 20 test_fork[1] processes in the test task cgroup.
3) Create 100 new containers:
   for i in {1..100}; do docker run -itd  --health-cmd="ls" --health-interval=1s ubuntu:latest  bash; done
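
To put the settings in step 1 in perspective: quota/period = 2000/100000, which is 2% of one
CPU for the whole cgroup. The sketch below only does that arithmetic. The function names are
made up for illustration, and the even split across processes is a simplifying assumption,
not something CFS guarantees.

```c
#include <assert.h>

/* CPU share granted by a quota/period pair, as a percentage of one CPU. */
double cfs_cpu_share_percent(long quota_us, long period_us)
{
    assert(quota_us >= 0 && period_us > 0);
    return 100.0 * (double)quota_us / (double)period_us;
}

/*
 * Runtime per task per period if the quota were split evenly.
 * Assumption for illustration only: real CFS does not promise an even split.
 */
long cfs_even_split_us(long quota_us, long ntasks)
{
    assert(quota_us >= 0 && ntasks > 0);
    return quota_us / ntasks;
}
```

With the values above, cfs_cpu_share_percent(2000, 100000) is 2.0 and
cfs_even_split_us(2000, 20) is 100, i.e. roughly 100 us of runtime per test_fork process
in each 100 ms period.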

These operations are expected to succeed, with all 100 containers created. However, while the containers
are being created, the system gets stuck and container creation fails.

After debugging this, I found that the test_fork processes frequently sleep in
freezer_fork()->mutex_lock()->might_sleep() while holding the cgroup_threadgroup_rw_sem lock, as follows:

copy_process():
	cgroup_can_fork()			---> takes cgroup_threadgroup_rw_sem (read)
	sched_cgroup_fork()
	  ->task_fork_fair()
	      ->update_curr()
		  ->__account_cfs_rq_runtime()
		      ->resched_curr()		---> the quota is used up; TIF_NEED_RESCHED is set on current
	cgroup_post_fork()
	  ->freezer_fork()
	      ->mutex_lock()
		  ->might_sleep()		---> calls schedule(); current is throttled for a long time
	  ->cgroup_css_set_put_fork()		---> releases cgroup_threadgroup_rw_sem


Because the task cgroup's cpu.cfs_quota_us is very small and test_fork's load is very heavy, test_fork
may be throttled for a long time. As a result, the cgroup_threadgroup_rw_sem read lock is held for a long
time, and other processes get stuck waiting for the lock:

1) a task forking a child waits at copy_process()->cgroup_can_fork();

2) an exiting task waits at exit_signals();

3) a task writing to the cgroup.procs file waits at cgroup_file_write()->__cgroup1_procs_write();
...

In the worst case, the whole system gets stuck.
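
A back-of-the-envelope model of why the waiters stall for so long: if a throttled lock holder
still needs some CPU time to reach the unlock, and it only receives a small slice of runtime
per period, the wait spans several whole periods. The sketch below is a simplification under
stated assumptions, not a description of the scheduler; the function names and the example
figure of 500 us of remaining work are hypothetical.

```c
#include <assert.h>

/* Number of quota refills needed to accumulate need_us of runtime. */
long periods_to_finish(long need_us, long runtime_per_period_us)
{
    assert(need_us >= 0 && runtime_per_period_us > 0);
    /* Ceiling division: a partial slice still costs a whole period. */
    return (need_us + runtime_per_period_us - 1) / runtime_per_period_us;
}

/*
 * Rough wall-clock estimate, assuming one refill arrives per period and
 * the holder gets runtime_per_period_us out of each refill.
 */
long stall_estimate_us(long need_us, long runtime_per_period_us, long period_us)
{
    assert(period_us > 0);
    return periods_to_finish(need_us, runtime_per_period_us) * period_us;
}
```

For example, if the holder gets about 100 us per 100 ms period (2000 us of quota shared by 20
processes, assuming an even split) and needs a hypothetical 500 us to release the lock,
stall_estimate_us(500, 100, 100000) gives 500000 us, so every waiter is blocked for about
half a second.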

Does anyone know how to solve this, other than by changing cpu.cfs_quota_us?


[1] test_fork.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid;
    int count = 20;

    /* Fork and reap `count` children, then sleep one second, forever. */
    while (1) {
        for (int i = 0; i < count; i++) {
            pid = fork();
            if (pid < 0) {
                perror("fork");
                return 1;
            } else if (pid == 0) {
                /* Child: exit immediately. */
                exit(0);
            }
        }

        /* Parent: wait for every child forked in this round. */
        for (int i = 0; i < count; i++) {
            wait(NULL);
        }
        sleep(1);
    }
    return 0;
}

Thanks a lot.
-Qiao

* Re: [Question] The system may be stuck if there is a cpu cgroup cpu.cfs_quato_us is very low
@ 2022-06-27  8:32   ` Tejun Heo
From: Tejun Heo @ 2022-06-27  8:32 UTC (permalink / raw)
  To: Zhang Qiao
  Cc: mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Juri Lelli, Vincent Guittot, lizefan.x-EC8Uxl6Npydl57MIdRCFDg,
	hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	lkml, vschneid-H+wXaHxf7aLQT0dZR+AlfA,
	dietmar.eggemann-5wv7dgnIgG8, bristot-H+wXaHxf7aLQT0dZR+AlfA,
	bsegall-hpIqsD4AKlfQT0dZR+AlfA, Steven Rostedt,
	mgorman-l3A5Bk7waGM

Hello,

On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
> Because the task cgroup's cpu.cfs_quota_us is very small and
> test_fork's load is very heavy, the test_fork may be throttled long
> time, therefore, the cgroup_threadgroup_rw_sem read lock is held for
> a long time, other processes will get stuck waiting for the lock:

Yeah, this is a known problem and can happen with other locks too. The
solution prolly is only throttling while in or when about to return to
userspace. There is one really important and wide-spread assumption in
the kernel:

  If things get blocked on some shared resource, whatever is holding
  the resource ends up using more of the system to exit the critical
  section faster and thus unblocks others ASAP. IOW, things running in
  kernel are work-conserving.

The cpu bw controller gives the userspace a rather easy way to break
this assumption and thus is rather fundamentally broken. This is
basically the same problem we had with the old cgroup freezer
implementation which trapped threads in random locations in the
kernel.

So, right now, it's rather broken and can easily be used as a DoS
attack vector.
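
A toy model of the idea above (throttle only while in, or when about to return to,
userspace) might look like the following. It is only a sketch and not kernel code:
struct toy_task and both functions are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task state; none of these fields exist in the kernel. */
struct toy_task {
    bool in_kernel;         /* currently executing kernel code */
    bool throttle_pending;  /* quota ran out while in the kernel */
    bool throttled;         /* actually taken off the CPU */
};

/* Called when the task's group runs out of quota. */
void toy_quota_exhausted(struct toy_task *t)
{
    assert(t != NULL);
    if (t->in_kernel)
        t->throttle_pending = true;  /* defer: it may hold kernel locks */
    else
        t->throttled = true;         /* safe: userspace holds no kernel locks */
}

/* Called on the return-to-userspace path. */
void toy_return_to_user(struct toy_task *t)
{
    assert(t != NULL);
    t->in_kernel = false;
    if (t->throttle_pending) {
        t->throttle_pending = false;
        t->throttled = true;
    }
}
```

In this model a task that exhausts its quota inside a critical section keeps running until it
leaves the kernel, so it releases whatever it holds before it is stopped.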

Thanks.

-- 
tejun


* Re: [Question] The system may be stuck if there is a cpu cgroup cpu.cfs_quato_us is very low
@ 2022-07-01  7:34       ` Zhang Qiao
From: Zhang Qiao @ 2022-07-01  7:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Juri Lelli, Vincent Guittot, lizefan.x-EC8Uxl6Npydl57MIdRCFDg,
	hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	lkml, vschneid-H+wXaHxf7aLQT0dZR+AlfA,
	dietmar.eggemann-5wv7dgnIgG8, bristot-H+wXaHxf7aLQT0dZR+AlfA,
	bsegall-hpIqsD4AKlfQT0dZR+AlfA, Steven Rostedt,
	mgorman-l3A5Bk7waGM


Hi, tejun

Thanks for your reply.

On 2022/6/27 16:32, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
>> Because the task cgroup's cpu.cfs_quota_us is very small and
>> test_fork's load is very heavy, the test_fork may be throttled long
>> time, therefore, the cgroup_threadgroup_rw_sem read lock is held for
>> a long time, other processes will get stuck waiting for the lock:
> 
> Yeah, this is a known problem and can happen with other locks too. The
> solution prolly is only throttling while in or when about to return to
> userspace. There is one really important and wide-spread assumption in
> the kernel:
> 
>   If things get blocked on some shared resource, whatever is holding
>   the resource ends up using more of the system to exit the critical
>   section faster and thus unblocks others ASAP. IOW, things running in
>   kernel are work-conserving.
> 
> The cpu bw controller gives the userspace a rather easy way to break
> this assumption and thus is rather fundamentally broken. This is
> basically the same problem we had with the old cgroup freezer
> implementation which trapped threads in random locations in the
> kernel.
> 

So, if we want to solve this problem completely, is the best way to
change the CFS bandwidth controller's throttling mechanism, for example
by throttling tasks only at a safe location?

Thanks.
    Qiao

> So, right now, it's rather broken and can easily be used as a DoS
> attack vector.
> 
> Thanks.
> 


* Re: [Question] The system may be stuck if there is a cpu cgroup cpu.cfs_quato_us is very low
@ 2022-07-01 20:08           ` Benjamin Segall
From: Benjamin Segall @ 2022-07-01 20:08 UTC (permalink / raw)
  To: Zhang Qiao
  Cc: Tejun Heo, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, Juri Lelli, Vincent Guittot,
	lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, lkml,
	vschneid-H+wXaHxf7aLQT0dZR+AlfA, dietmar.eggemann-5wv7dgnIgG8,
	bristot-H+wXaHxf7aLQT0dZR+AlfA, Steven Rostedt,
	mgorman-l3A5Bk7waGM

Zhang Qiao <zhangqiao22-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> writes:

> Hi, tejun
>
> Thanks for your reply.
>
> On 2022/6/27 16:32, Tejun Heo wrote:
>> Hello,
>> 
>> On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
>>> Because the task cgroup's cpu.cfs_quota_us is very small and
>>> test_fork's load is very heavy, the test_fork may be throttled long
>>> time, therefore, the cgroup_threadgroup_rw_sem read lock is held for
>>> a long time, other processes will get stuck waiting for the lock:
>> 
>> Yeah, this is a known problem and can happen with other locks too. The
>> solution prolly is only throttling while in or when about to return to
>> userspace. There is one really important and wide-spread assumption in
>> the kernel:
>> 
>>   If things get blocked on some shared resource, whatever is holding
>>   the resource ends up using more of the system to exit the critical
>>   section faster and thus unblocks others ASAP. IOW, things running in
>>   kernel are work-conserving.
>> 
>> The cpu bw controller gives the userspace a rather easy way to break
>> this assumption and thus is rather fundamentally broken. This is
>> basically the same problem we had with the old cgroup freezer
>> implementation which trapped threads in random locations in the
>> kernel.
>> 
>
> so, if we want to completely solve this problem, is the best way to
> change the cfs bw controller throttle mechanism? for example, throttle
> tasks in a safe location.

Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
serious reworking of how it works, because it would need to dequeue
tasks individually rather than doing the entire cfs_rq at a time (and
would require some effort to avoid pinging every throttling task to get
it into the kernel).


* Re: [Question] The system may be stuck if there is a cpu cgroup cpu.cfs_quato_us is very low
@ 2022-07-01 20:15               ` Tejun Heo
From: Tejun Heo @ 2022-07-01 20:15 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: Zhang Qiao, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, Juri Lelli, Vincent Guittot,
	lizefan.x-EC8Uxl6Npydl57MIdRCFDg, hannes-druUgvl0LCNAfugRpC6u6w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, lkml,
	vschneid-H+wXaHxf7aLQT0dZR+AlfA, dietmar.eggemann-5wv7dgnIgG8,
	bristot-H+wXaHxf7aLQT0dZR+AlfA, Steven Rostedt,
	mgorman-l3A5Bk7waGM

On Fri, Jul 01, 2022 at 01:08:21PM -0700, Benjamin Segall wrote:
> Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
> serious reworking of how it works, because it would need to dequeue
> tasks individually rather than doing the entire cfs_rq at a time (and
> would require some effort to avoid pinging every throttling task to get
> it into the kernel).

Right, I don't have a good idea on evolving the current implementation
into something correct. As you pointed out, we need to account along
the sched_group tree but conditionally enforce on each thread.
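
One way to picture "account along the tree, enforce per thread" is the toy model below. It
is a sketch under stated assumptions and not a proposal for the real implementation:
struct toy_group, its fields and the three functions are hypothetical, and quota refills
are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical group node; a negative quota means "unlimited". */
struct toy_group {
    struct toy_group *parent;
    long quota_us;
    long used_us;
};

/* Accounting: charge runtime to the group and every ancestor. */
void toy_charge(struct toy_group *g, long delta_us)
{
    assert(delta_us >= 0);
    for (; g != NULL; g = g->parent)
        g->used_us += delta_us;
}

/* True if the group or any ancestor has used up its quota. */
bool toy_over_quota(const struct toy_group *g)
{
    for (; g != NULL; g = g->parent)
        if (g->quota_us >= 0 && g->used_us >= g->quota_us)
            return true;
    return false;
}

/* Enforcement: decided per thread, and skipped while it runs in the kernel. */
bool toy_should_throttle(const struct toy_group *g, bool task_in_kernel)
{
    return toy_over_quota(g) && !task_in_kernel;
}
```

Here two threads in the same over-quota group can get different answers: the one in
userspace is stopped, while the one inside the kernel is allowed to finish first.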

Thanks.

-- 
tejun


* Re: [Question] The system may be stuck if there is a cpu cgroup cpu.cfs_quato_us is very low
@ 2022-07-07  6:59                   ` Zhang Qiao
From: Zhang Qiao @ 2022-07-07  6:59 UTC (permalink / raw)
  To: Tejun Heo, Benjamin Segall
  Cc: mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Juri Lelli, Vincent Guittot, lizefan.x-EC8Uxl6Npydl57MIdRCFDg,
	hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	lkml, vschneid-H+wXaHxf7aLQT0dZR+AlfA,
	dietmar.eggemann-5wv7dgnIgG8, bristot-H+wXaHxf7aLQT0dZR+AlfA,
	Steven Rostedt, mgorman-l3A5Bk7waGM



On 2022/7/2 4:15, Tejun Heo wrote:
> On Fri, Jul 01, 2022 at 01:08:21PM -0700, Benjamin Segall wrote:
>> Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
>> serious reworking of how it works, because it would need to dequeue
>> tasks individually rather than doing the entire cfs_rq at a time (and
>> would require some effort to avoid pinging every throttling task to get
>> it into the kernel).
> 
> Right, I don't have a good idea on evolving the current implementation
> into something correct. As you pointed out, we need to account along
> the sched_group tree but conditionally enforce on each thread.
> 
> Thanks.
> 

Understood. Thanks for your detailed explanation.

Thanks.
