From: Song Liu
Subject: Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
Date: Tue, 30 Apr 2019 16:54:18 +0000
Message-ID: <19AF6556-A2A2-435B-9358-CD22CF7BFD9F@fb.com>
References: <20190408214539.2705660-1-songliubraving@fb.com> <20190410115907.GE19434@e105550-lin.cambridge.arm.com>
To: Vincent Guittot
Cc: Morten Rasmussen, linux-kernel, cgroups@vger.kernel.org, mingo@redhat.com, peterz@infradead.org, tglx@linutronix.de, Kernel Team, viresh kumar

> On Apr 30, 2019, at 12:20 PM, Vincent Guittot wrote:
>
> Hi Song,
>
> On Tue, 30 Apr 2019 at 08:11, Song Liu wrote:
>>
>>> On Apr 29, 2019, at 8:24 AM, Vincent Guittot wrote:
>>>
>>> Hi Song,
>>>
>>> On Sun, 28 Apr 2019 at 21:47, Song Liu wrote:
>>>>
>>>> Hi Morten and Vincent,
>>>>
>>>>> On Apr 22, 2019, at 6:22 PM, Song Liu wrote:
>>>>>
>>>>> Hi Vincent,
>>>>>
>>>>>> On Apr 17, 2019, at 5:56 AM, Vincent Guittot wrote:
>>>>>>
>>>>>> On Wed, 10 Apr 2019 at 21:43, Song Liu wrote:
>>>>>>>
>>>>>>> Hi Morten,
>>>>>>>
>>>>>>>> On Apr 10, 2019, at 4:59 AM, Morten Rasmussen wrote:
>>>>>>>>
>>>>>>>> The bit that isn't clear to me is _why_ adding idle cycles helps your
>>>>>>>> workload. I'm not convinced that adding headroom gives any latency
>>>>>>>> improvements beyond watering down the impact of your side jobs. AFAIK,
>>>>>>>
>>>>>>> We think the latency improvements actually do come from watering down the
>>>>>>> impact of the side jobs. It is not just statistically improving average
>>>>>>> latency numbers; it also reduces the resource contention caused by the side
>>>>>>> workload. I don't know whether it comes from reduced contention for ALUs,
>>>>>>> memory bandwidth, CPU caches, or something else, but we saw reduced
>>>>>>> latencies when headroom is used.
>>>>>>>
>>>>>>>> the throttling mechanism effectively removes the throttled tasks from
>>>>>>>> the schedule according to a specific duty cycle. When the side job is
>>>>>>>> not throttled, the main workload experiences the same latency issues
>>>>>>>> as before, but by dynamically tuning the side job throttling you can
>>>>>>>> achieve a better average latency. Am I missing something?
>>>>>>>>
>>>>>>>> Have you looked at the distribution of main job latency and tried to
>>>>>>>> compare it between when throttling is active and not active?
>>>>>>>
>>>>>>> cfs_bandwidth adjusts the allowed runtime for each task_group every period
>>>>>>> (configurable, 100ms by default). The cpu.headroom logic applies gentle
>>>>>>> throttling, so that the side workload gets some runtime in every period.
>>>>>>> Therefore, if we look at a time window equal to or bigger than 100ms, we
>>>>>>> don't really see "throttling active time" vs. "throttling inactive time".
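As a back-of-the-envelope illustration of the gentle throttling described above (my own sketch, not the patchset's code; the function name and parameters are made up):

```python
# Hypothetical sketch of the arithmetic behind gentle throttling: size the
# side group's per-period runtime so that global idle time stays at or
# above the configured headroom. Not the kernel implementation.
def side_quota_us(period_us, ncpu, main_busy_frac, headroom_frac):
    """Max runtime (us) the side group may consume per cfs_bandwidth period."""
    total_us = period_us * ncpu
    budget = total_us * (1.0 - main_busy_frac - headroom_frac)
    return max(0, int(budget))

# One CPU, the default 100 ms period, main workload ~50% busy, 20% headroom:
# the side workload keeps up to 30 ms of runtime per period.
print(side_quota_us(100_000, 1, 0.5, 0.2))  # -> 30000
```

Because the side group keeps a non-zero quota in every period, rather than being throttled for whole periods at a time, a 100ms-or-larger window never shows distinct throttled/unthrottled phases.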
>>>>>>>
>>>>>>>>
>>>>>>>> I'm wondering if the headroom solution is really the right solution for
>>>>>>>> your use-case, or if what you are really after is something which is
>>>>>>>> lower priority than just setting the weight to 1. Something that
>>>>>>>
>>>>>>> The experiments show that cpu.weight does the proper work for priority: the
>>>>>>> main workload gets priority to use the CPU, while the side workload only
>>>>>>> fills the idle CPU. However, this is not sufficient, as the side workload
>>>>>>> creates big enough contention to impact the main workload.
>>>>>>>
>>>>>>>> (nearly) always gets pre-empted by your main job (SCHED_BATCH and
>>>>>>>> SCHED_IDLE might not be enough). If your main job consists
>>>>>>>> of lots of relatively short wake-ups, things like the min_granularity
>>>>>>>> could have a significant latency impact.
>>>>>>>
>>>>>>> cpu.headroom gives benefits in addition to optimizations on the pre-emption
>>>>>>> side. By maintaining some idle time, fewer pre-emption actions are
>>>>>>> necessary, so the main workload gets better latency.
>>>>>>
>>>>>> I agree with Morten's proposal. SCHED_IDLE should help your latency
>>>>>> problem, because the side job will be directly preempted, unlike a normal
>>>>>> cfs task, even one with the lowest priority.
>>>>>> In addition to min_granularity, sched_period also has an impact on the
>>>>>> time that a task has to wait before preempting the running task. Also,
>>>>>> some sched_features like GENTLE_FAIR_SLEEPERS can impact the
>>>>>> latency of a task.
>>>>>>
>>>>>> It would be nice to know if the latency problem comes from contention
>>>>>> for cache resources, or if it's mainly because your main load waits
>>>>>> before running on a CPU.
>>>>>>
>>>>>> Regards,
>>>>>> Vincent
>>>>>
>>>>> Thanks for these suggestions. Here are some more tests to show the impact
>>>>> of the scheduler knobs and cpu.headroom.
>>>>>
>>>>> side-load | cpu.headroom | side/cpu.weight | min_gran | cpu-idle | main/latency
>>>>> -------------------------------------------------------------------------------
>>>>> none      | 0            | n/a             | 1 ms     | 45.20%   | 1.00
>>>>> ffmpeg    | 0            | 1               | 10 ms    | 3.38%    | 1.46
>>>>> ffmpeg    | 0            | SCHED_IDLE      | 1 ms     | 5.69%    | 1.42
>>>>> ffmpeg    | 20%          | SCHED_IDLE      | 1 ms     | 19.00%   | 1.13
>>>>> ffmpeg    | 30%          | SCHED_IDLE      | 1 ms     | 27.60%   | 1.08
>>>>>
>>>>> In all these cases, the main workload is loaded with the same level of
>>>>> traffic (requests per second). Main workload latency numbers are normalized
>>>>> against the baseline (first row).
>>>>>
>>>>> For the baseline, the main workload runs without any side workload; the
>>>>> system has about 45.20% idle CPU.
>>>>>
>>>>> The next two rows compare the impact of the scheduling knobs cpu.weight and
>>>>> sched_min_granularity. With a cpu.weight of 1 and min_granularity of 10ms,
>>>>> we see a latency of 1.46; with SCHED_IDLE and min_granularity of 1ms, we
>>>>> see a latency of 1.42. So SCHED_IDLE and min_granularity help protect
>>>>> the main workload. However, they are not sufficient, as the latency overhead
>>>>> is high (>40%).
>>>>>
>>>>> The last two rows show the benefit of cpu.headroom. With 20% headroom,
>>>>> the latency is 1.13; with 30% headroom, the latency is 1.08.
>>>>>
>>>>> We can also see a clear correlation between latency and global idle CPU:
>>>>> more idle CPU yields lower latency.
>>>>>
>>>>> Overall, these results show that cpu.headroom provides an effective
>>>>> mechanism to control the latency impact of side workloads. Other knobs
>>>>> could also help the latency, but they are not as effective and flexible
>>>>> as cpu.headroom.
>>>>>
>>>>> Does this analysis address your concern?
>>>
>>> So, your results show that the sched_idle class doesn't provide the
>>> intended behavior, because it still delays the scheduling of sched_other
>>> tasks.
>>> In fact, the wakeup path of the scheduler doesn't make any
>>> difference between a cpu running a sched_other task and a cpu running a
>>> sched_idle task when looking for the idlest cpu, so it can create some
>>> contention between sched_other tasks even while another cpu runs only a
>>> sched_idle task.
>>
>> I don't think scheduling delay is the only (or dominating) factor in the
>> extra latency. Here are some data to show it.
>>
>> I measured the IPC (instructions per cycle) of the main workload under
>> different scenarios:
>>
>> side-load | cpu.headroom | side/cpu.weight | IPC
>> ----------------------------------------------------
>> none      | 0%           | N/A             | 0.66
>> ffmpeg    | 0%           | SCHED_IDLE      | 0.53
>> ffmpeg    | 20%          | SCHED_IDLE      | 0.58
>> ffmpeg    | 30%          | SCHED_IDLE      | 0.62
>>
>> These data show that the side workload has a negative impact on the
>> main workload's IPC, and that cpu.headroom helps reduce this impact.
>>
>> Therefore, while optimizations in the wakeup path should help the
>> latency, cpu.headroom would add _significant_ benefit on top of that.
>
> It seems normal that the side workload has a negative impact on IPC
> because of resource sharing, but your previous results showed a 42%
> latency regression with sched_idle, which can't be linked only to
> contention for resource access.

Agreed. I think both scheduling latency and resource contention
contribute noticeable latency overhead to the main workload. The
scheduler optimization by Viresh would help reduce the scheduling
latency, but it won't help the resource contention. Hopefully, with
optimizations in the scheduler, we can meet the latency target with a
smaller cpu.headroom. However, I don't think scheduler optimizations
will eliminate the need for cpu.headroom, as the resource contention
always exists, and its impact could be significant.

Do you have further concerns with this patchset?

Thanks,
Song

>>
>> Does this assessment make sense?
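Translating the IPC table above into the relative IPC loss of the main workload (just arithmetic on the numbers already quoted, not new measurements):

```python
# IPC of the main workload from the table above; the baseline has no side load.
baseline_ipc = 0.66
with_side = {"headroom 0%": 0.53, "headroom 20%": 0.58, "headroom 30%": 0.62}

# Relative IPC loss vs. baseline: the contention cost shrinks as headroom grows.
loss = {cfg: round(1 - ipc / baseline_ipc, 3) for cfg, ipc in with_side.items()}
print(loss)  # -> {'headroom 0%': 0.197, 'headroom 20%': 0.121, 'headroom 30%': 0.061}
```

So roughly a 20% IPC hit with no headroom, shrinking to about 6% at 30% headroom, which is consistent with resource contention (not scheduling delay) accounting for a large share of the regression.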
>>
>>> Viresh (cc'ed on this email) is working on improving such behavior at
>>> wakeup and has sent a patch related to the subject:
>>> https://lkml.org/lkml/2019/4/25/251
>>> I'm curious whether this would improve the results.
>>
>> I could try it with our workload next week (I am at LSF/MM this
>> week). Also, please keep in mind that this test sometimes takes
>> multiple days to set up and run.
>
> Yes, I understand. It would be good to have a simpler setup that
> reproduces the behavior of your setup, in order to do preliminary tests
> and analyse the behavior.
>
>>
>> Thanks,
>> Song
>>
>>>
>>> Regards,
>>> Vincent
>>>
>>>>>
>>>>> Thanks,
>>>>> Song
>>>>>
>>>>
>>>> Could you please share your comments and suggestions on this work? Did
>>>> the results address your questions/concerns?
>>>>
>>>> Thanks again,
>>>> Song
>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Song
>>>>>>>
>>>>>>>>
>>>>>>>> Morten
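P.S. For anyone reproducing the SCHED_IDLE rows above: the policy can be set from Python's os module bindings (a minimal, Linux-only sketch; the same can be done from the shell with `chrt --idle 0 <cmd>`):

```python
# Put the current task into SCHED_IDLE, so it runs only when no other
# runnable task wants the CPU. Linux-only; priority must be 0 for SCHED_IDLE.
import os

os.sched_setscheduler(0, os.SCHED_IDLE, os.sched_param(0))
assert os.sched_getscheduler(0) == os.SCHED_IDLE
print("current task is now SCHED_IDLE")
```

Note that switching a task *into* SCHED_IDLE needs no special privileges, since it only lowers the task's claim on the CPU.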