* [PATCH v1 0/4] cpuidle: teo: Fix and cleanups
@ 2025-01-10 12:51 Rafael J. Wysocki
2025-01-10 12:53 ` [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path Rafael J. Wysocki
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2025-01-10 12:51 UTC (permalink / raw)
To: Linux PM; +Cc: LKML, Daniel Lezcano, Christian Loehle, Artem Bityutskiy
Hi Everyone,
This addresses a relatively recently added inconsistency in behavior of the teo
governor on systems where idle state 0 is a "polling" state (patch [1/4]) and
makes some other changes that may be regarded as cleanups.
Please review.
Thanks!
^ permalink raw reply [flat|nested] 12+ messages in thread* [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path 2025-01-10 12:51 [PATCH v1 0/4] cpuidle: teo: Fix and cleanups Rafael J. Wysocki @ 2025-01-10 12:53 ` Rafael J. Wysocki 2025-01-10 13:16 ` Christian Loehle 2025-01-10 12:54 ` [PATCH v1 2/4] cpuidle: teo: Rearrange idle state lookup code Rafael J. Wysocki ` (2 subsequent siblings) 3 siblings, 1 reply; 12+ messages in thread From: Rafael J. Wysocki @ 2025-01-10 12:53 UTC (permalink / raw) To: Linux PM; +Cc: LKML, Daniel Lezcano, Christian Loehle, Artem Bityutskiy From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() call in some cases") the teo governor behaves a bit differently on systems where idle state 0 is a "polling" state (that is, it is not really an idle state, but a loop continuously executed by the CPU). Namely, on such systems it skips the tick_nohz_get_sleep_length() call if the target residency of the current candidate idle state is small enough. However, if state 0 itself was to be returned, it would be returned right away without calling tick_nohz_get_sleep_length() even on systems where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") that attempted to fix this problem. Unfortunately, commit 4b20b07ce72f has made the governor always call tick_nohz_get_sleep_length() when about to return state 0 early, even if that state is a "polling" one, which is inconsistent and defeats the purpose of commit 6da8f9ba5a87 in that case. Address this by adding a CPUIDLE_FLAG_POLLING check to the path where state 0 is returned early to prevent tick_nohz_get_sleep_length() from being called if it is a "polling" state. Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- drivers/cpuidle/governors/teo.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -422,7 +422,8 @@ first_suitable_idx = i; } } - if (!idx && prev_intercept_idx) { + if (!idx && prev_intercept_idx && + !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) { /* * We have to query the sleep length here otherwise we don't * know after wakeup if our guess was correct. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path 2025-01-10 12:53 ` [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path Rafael J. Wysocki @ 2025-01-10 13:16 ` Christian Loehle 2025-01-10 13:34 ` Rafael J. Wysocki 2025-01-23 16:54 ` Artem Bityutskiy 0 siblings, 2 replies; 12+ messages in thread From: Christian Loehle @ 2025-01-10 13:16 UTC (permalink / raw) To: Rafael J. Wysocki, Linux PM, Aboorva Devarajan Cc: LKML, Daniel Lezcano, Artem Bityutskiy On 1/10/25 12:53, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() > call in some cases") the teo governor behaves a bit differently on > systems where idle state 0 is a "polling" state (that is, it is not > really an idle state, but a loop continuously executed by the CPU). > Namely, on such systems it skips the tick_nohz_get_sleep_length() call > if the target residency of the current candidate idle state is small > enough. > > However, if state 0 itself was to be returned, it would be returned > right away without calling tick_nohz_get_sleep_length() even on systems > where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle: > teo: Don't count non-existent intercepts") that attempted to fix this > problem. > > Unfortunately, commit 4b20b07ce72f has made the governor always call > tick_nohz_get_sleep_length() when about to return state 0 early, even > if that state is a "polling" one, which is inconsistent and defeats > the purpose of commit 6da8f9ba5a87 in that case. > > Address this by adding a CPUIDLE_FLAG_POLLING check to the path where > state 0 is returned early to prevent tick_nohz_get_sleep_length() from > being called if it is a "polling" state. > > Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > drivers/cpuidle/governors/teo.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > --- a/drivers/cpuidle/governors/teo.c > +++ b/drivers/cpuidle/governors/teo.c > @@ -422,7 +422,8 @@ > first_suitable_idx = i; > } > } > - if (!idx && prev_intercept_idx) { > + if (!idx && prev_intercept_idx && > + !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) { > /* > * We have to query the sleep length here otherwise we don't > * know after wakeup if our guess was correct. > > > But then you do run into the issue of intercepts not being detected if state0 is the right choice, don't you? This would then enable intercept-detection only for <50% of the time, another option is to not allow intercepts selecting a polling state, but there were recent complaints about this exact behavior from Aboorva (+TO). They don't have a low-latency non-polling state. https://lore.kernel.org/lkml/20240809073120.250974-1-aboorvad@linux.ibm.com/ Regards, Christian ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path 2025-01-10 13:16 ` Christian Loehle @ 2025-01-10 13:34 ` Rafael J. Wysocki 2025-01-10 14:52 ` Christian Loehle 2025-01-23 16:54 ` Artem Bityutskiy 1 sibling, 1 reply; 12+ messages in thread From: Rafael J. Wysocki @ 2025-01-10 13:34 UTC (permalink / raw) To: Christian Loehle Cc: Rafael J. Wysocki, Linux PM, Aboorva Devarajan, LKML, Daniel Lezcano, Artem Bityutskiy On Fri, Jan 10, 2025 at 2:16 PM Christian Loehle <christian.loehle@arm.com> wrote: > > On 1/10/25 12:53, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() > > call in some cases") the teo governor behaves a bit differently on > > systems where idle state 0 is a "polling" state (that is, it is not > > really an idle state, but a loop continuously executed by the CPU). > > Namely, on such systems it skips the tick_nohz_get_sleep_length() call > > if the target residency of the current candidate idle state is small > > enough. > > > > However, if state 0 itself was to be returned, it would be returned > > right away without calling tick_nohz_get_sleep_length() even on systems > > where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle: > > teo: Don't count non-existent intercepts") that attempted to fix this > > problem. > > > > Unfortunately, commit 4b20b07ce72f has made the governor always call > > tick_nohz_get_sleep_length() when about to return state 0 early, even > > if that state is a "polling" one, which is inconsistent and defeats > > the purpose of commit 6da8f9ba5a87 in that case. > > > > Address this by adding a CPUIDLE_FLAG_POLLING check to the path where > > state 0 is returned early to prevent tick_nohz_get_sleep_length() from > > being called if it is a "polling" state. > > > > Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > --- > > drivers/cpuidle/governors/teo.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > --- a/drivers/cpuidle/governors/teo.c > > +++ b/drivers/cpuidle/governors/teo.c > > @@ -422,7 +422,8 @@ > > first_suitable_idx = i; > > } > > } > > - if (!idx && prev_intercept_idx) { > > + if (!idx && prev_intercept_idx && > > + !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) { > > /* > > * We have to query the sleep length here otherwise we don't > > * know after wakeup if our guess was correct. > > > > > > > > But then you do run into the issue of intercepts not being detected if > state0 is the right choice, don't you? That's true, but then on systems with a "polling" state 0 you still have this problem if the state returned early is not state 0. Say C1 on x86. The point here is that the behavior needs to be consistent, one way or another. The exact point of commit 6da8f9ba5a87 was to avoid calling tick_nohz_get_sleep_length() in some cases when the state to be returned is shallow enough and obviously that includes a "polling" state 0, possibly at the cost of being somewhat inaccurate in prediction. Then you're seeing this intercept accumulation for state 0 when there are only 2 states in the table (or all of the other states are much higher target residency than state 0). Commit 4b20b07ce72f effectively caused tick_nohz_get_sleep_length() to be called every time on systems without a "polling" state 0, which was fair enough, but it also affected the other systems, which wasn't. > This would then enable intercept-detection only for <50% of the time, > another option is to not allow intercepts selecting a polling state, but > there were recent complaints about this exact behavior from Aboorva (+TO). > They don't have a low-latency non-polling state. > > https://lore.kernel.org/lkml/20240809073120.250974-1-aboorvad@linux.ibm.com/ If they don't have a "polling" state 0, they won't be affected by this patch and after commit 4b20b07ce72f, they'll always call tick_nohz_get_sleep_length(), so the current governor behavior is generally unsuitable for them. I have an idea how to change it to be more accurate in prediction, but we'll see how it goes. Stay tuned. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path 2025-01-10 13:34 ` Rafael J. Wysocki @ 2025-01-10 14:52 ` Christian Loehle 2025-01-10 15:17 ` Rafael J. Wysocki 2025-01-10 17:23 ` Rafael J. Wysocki 0 siblings, 2 replies; 12+ messages in thread From: Christian Loehle @ 2025-01-10 14:52 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Rafael J. Wysocki, Linux PM, Aboorva Devarajan, LKML, Daniel Lezcano, Artem Bityutskiy On 1/10/25 13:34, Rafael J. Wysocki wrote: > On Fri, Jan 10, 2025 at 2:16 PM Christian Loehle > <christian.loehle@arm.com> wrote: >> >> On 1/10/25 12:53, Rafael J. Wysocki wrote: >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> >>> After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() >>> call in some cases") the teo governor behaves a bit differently on >>> systems where idle state 0 is a "polling" state (that is, it is not >>> really an idle state, but a loop continuously executed by the CPU). >>> Namely, on such systems it skips the tick_nohz_get_sleep_length() call >>> if the target residency of the current candidate idle state is small >>> enough. >>> >>> However, if state 0 itself was to be returned, it would be returned >>> right away without calling tick_nohz_get_sleep_length() even on systems >>> where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle: >>> teo: Don't count non-existent intercepts") that attempted to fix this >>> problem. >>> >>> Unfortunately, commit 4b20b07ce72f has made the governor always call >>> tick_nohz_get_sleep_length() when about to return state 0 early, even >>> if that state is a "polling" one, which is inconsistent and defeats >>> the purpose of commit 6da8f9ba5a87 in that case. >>> >>> Address this by adding a CPUIDLE_FLAG_POLLING check to the path where >>> state 0 is returned early to prevent tick_nohz_get_sleep_length() from >>> being called if it is a "polling" state. >>> >>> Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> --- >>> drivers/cpuidle/governors/teo.c | 3 ++- >>> 1 file changed, 2 insertions(+), 1 deletion(-) >>> >>> --- a/drivers/cpuidle/governors/teo.c >>> +++ b/drivers/cpuidle/governors/teo.c >>> @@ -422,7 +422,8 @@ >>> first_suitable_idx = i; >>> } >>> } >>> - if (!idx && prev_intercept_idx) { >>> + if (!idx && prev_intercept_idx && >>> + !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) { >>> /* >>> * We have to query the sleep length here otherwise we don't >>> * know after wakeup if our guess was correct. >>> >>> >>> >> >> But then you do run into the issue of intercepts not being detected if >> state0 is the right choice, don't you? > > That's true, but then on systems with a "polling" state 0 you still > have this problem if the state returned early is not state 0. Say C1 > on x86.> > The point here is that the behavior needs to be consistent, one way or another. Yes, gotcha. Why not be consistent 'in the other way' then? > > The exact point of commit 6da8f9ba5a87 was to avoid calling > tick_nohz_get_sleep_length() in some cases when the state to be > returned is shallow enough and obviously that includes a "polling" > state 0, possibly at the cost of being somewhat inaccurate in > prediction. Somewhat inaccurate meaning not making any prediction? cpu_data->sleep_length_ns = KTIME_MAX; How much is the harm for calling tick_nohz_get_sleep_length() when polling anyway? I know tick_nohz_get_sleep_length() is the majority of the usual cpuidle entry path, but for many scenarios where state0 is appropriate that should be pretty fast, no? > > Then you're seeing this intercept accumulation for state 0 when there > are only 2 states in the table (or all of the other states are much > higher target residency than state 0). > > Commit 4b20b07ce72f effectively caused tick_nohz_get_sleep_length() to > be called every time on systems without a "polling" state 0, which was > fair enough, but it also affected the other systems, which wasn't. > >> This would then enable intercept-detection only for <50% of the time, >> another option is to not allow intercepts selecting a polling state, but >> there were recent complaints about this exact behavior from Aboorva (+TO). >> They don't have a low-latency non-polling state. >> >> https://lore.kernel.org/lkml/20240809073120.250974-1-aboorvad@linux.ibm.com/ > > If they don't have a "polling" state 0, they won't be affected by this > patch and after commit 4b20b07ce72f, they'll always call > tick_nohz_get_sleep_length(), so the current governor behavior is > generally unsuitable for them. They do though. commit 5ddcc03a07ae ("powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state") So they have a polling 'snooze' and a relatively high latency (hundreds usecs) non-polling state and no deeper state. So if they don't query sleep length on snooze on a (1us)-interrupt-wakeup heavy workload they will get 50% state0 and 50% state1 (because intercepts recovered due to not querying sleep length). > > I have an idea how to change it to be more accurate in prediction, but > we'll see how it goes. Stay tuned. I will. Regards, Christian ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path 2025-01-10 14:52 ` Christian Loehle @ 2025-01-10 15:17 ` Rafael J. Wysocki 2025-01-10 17:23 ` Rafael J. Wysocki 1 sibling, 0 replies; 12+ messages in thread From: Rafael J. Wysocki @ 2025-01-10 15:17 UTC (permalink / raw) To: Christian Loehle Cc: Rafael J. Wysocki, Rafael J. Wysocki, Linux PM, Aboorva Devarajan, LKML, Daniel Lezcano, Artem Bityutskiy On Fri, Jan 10, 2025 at 3:52 PM Christian Loehle <christian.loehle@arm.com> wrote: > > On 1/10/25 13:34, Rafael J. Wysocki wrote: > > On Fri, Jan 10, 2025 at 2:16 PM Christian Loehle > > <christian.loehle@arm.com> wrote: > >> > >> On 1/10/25 12:53, Rafael J. Wysocki wrote: > >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >>> > >>> After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() > >>> call in some cases") the teo governor behaves a bit differently on > >>> systems where idle state 0 is a "polling" state (that is, it is not > >>> really an idle state, but a loop continuously executed by the CPU). > >>> Namely, on such systems it skips the tick_nohz_get_sleep_length() call > >>> if the target residency of the current candidate idle state is small > >>> enough. > >>> > >>> However, if state 0 itself was to be returned, it would be returned > >>> right away without calling tick_nohz_get_sleep_length() even on systems > >>> where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle: > >>> teo: Don't count non-existent intercepts") that attempted to fix this > >>> problem. > >>> > >>> Unfortunately, commit 4b20b07ce72f has made the governor always call > >>> tick_nohz_get_sleep_length() when about to return state 0 early, even > >>> if that state is a "polling" one, which is inconsistent and defeats > >>> the purpose of commit 6da8f9ba5a87 in that case. > >>> > >>> Address this by adding a CPUIDLE_FLAG_POLLING check to the path where > >>> state 0 is returned early to prevent tick_nohz_get_sleep_length() from > >>> being called if it is a "polling" state. > >>> > >>> Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") > >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >>> --- > >>> drivers/cpuidle/governors/teo.c | 3 ++- > >>> 1 file changed, 2 insertions(+), 1 deletion(-) > >>> > >>> --- a/drivers/cpuidle/governors/teo.c > >>> +++ b/drivers/cpuidle/governors/teo.c > >>> @@ -422,7 +422,8 @@ > >>> first_suitable_idx = i; > >>> } > >>> } > >>> - if (!idx && prev_intercept_idx) { > >>> + if (!idx && prev_intercept_idx && > >>> + !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) { > >>> /* > >>> * We have to query the sleep length here otherwise we don't > >>> * know after wakeup if our guess was correct. > >>> > >>> > >>> > >> > >> But then you do run into the issue of intercepts not being detected if > >> state0 is the right choice, don't you? > > > > That's true, but then on systems with a "polling" state 0 you still > > have this problem if the state returned early is not state 0. Say C1 > > on x86.> > > The point here is that the behavior needs to be consistent, one way or another. > > Yes, gotcha. Why not be consistent 'in the other way' then? > > > > > The exact point of commit 6da8f9ba5a87 was to avoid calling > > tick_nohz_get_sleep_length() in some cases when the state to be > > returned is shallow enough and obviously that includes a "polling" > > state 0, possibly at the cost of being somewhat inaccurate in > > prediction. > > Somewhat inaccurate meaning not making any prediction? > cpu_data->sleep_length_ns = KTIME_MAX; Yes, and the wakeup is going to be counted as an intercept which of course may be inaccurate, but how inaccurate it really is depends on the workload. If the workload doesn't contain short timers, it actually is as good as it gets. > How much is the harm for calling tick_nohz_get_sleep_length() when > polling anyway? On x86, a lot, especially on systems with many cores. This was the whole reason for doing commit 6da8f9ba5a87 in the first place. > I know tick_nohz_get_sleep_length() is the majority of the usual > cpuidle entry path, but for many scenarios where state0 is appropriate > that should be pretty fast, no? Not necessarily. For single-digit exit latency idle states the tick_nohz_get_sleep_length() evaluation time may exceed the state exit latency and is comparable to the target residency, so calling it pretty much doesn't make sense. > > > > Then you're seeing this intercept accumulation for state 0 when there > > are only 2 states in the table (or all of the other states are much > > higher target residency than state 0). > > > > Commit 4b20b07ce72f effectively caused tick_nohz_get_sleep_length() to > > be called every time on systems without a "polling" state 0, which was > > fair enough, but it also affected the other systems, which wasn't. > > > >> This would then enable intercept-detection only for <50% of the time, > >> another option is to not allow intercepts selecting a polling state, but > >> there were recent complaints about this exact behavior from Aboorva (+TO). > >> They don't have a low-latency non-polling state. > >> > >> https://lore.kernel.org/lkml/20240809073120.250974-1-aboorvad@linux.ibm.com/ > > > > If they don't have a "polling" state 0, they won't be affected by this > > patch and after commit 4b20b07ce72f, they'll always call > > tick_nohz_get_sleep_length(), so the current governor behavior is > > generally unsuitable for them. > > They do though. > commit 5ddcc03a07ae ("powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state") > So they have a polling 'snooze' and a relatively high latency (hundreds usecs) > non-polling state and no deeper state. > > So if they don't query sleep length on snooze on a (1us)-interrupt-wakeup heavy > workload they will get 50% state0 and 50% state1 (because intercepts recovered > due to not querying sleep length). Obviously. OK, I guess I'll just do the other cleanups without this change to start with. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path 2025-01-10 14:52 ` Christian Loehle 2025-01-10 15:17 ` Rafael J. Wysocki @ 2025-01-10 17:23 ` Rafael J. Wysocki 1 sibling, 0 replies; 12+ messages in thread From: Rafael J. Wysocki @ 2025-01-10 17:23 UTC (permalink / raw) To: Rafael J. Wysocki, Christian Loehle Cc: Linux PM, Aboorva Devarajan, LKML, Daniel Lezcano, Artem Bityutskiy On Friday, January 10, 2025 3:52:09 PM CET Christian Loehle wrote: > On 1/10/25 13:34, Rafael J. Wysocki wrote: > > On Fri, Jan 10, 2025 at 2:16 PM Christian Loehle > > <christian.loehle@arm.com> wrote: > >> > >> On 1/10/25 12:53, Rafael J. Wysocki wrote: > >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >>> > >>> After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length() > >>> call in some cases") the teo governor behaves a bit differently on > >>> systems where idle state 0 is a "polling" state (that is, it is not > >>> really an idle state, but a loop continuously executed by the CPU). > >>> Namely, on such systems it skips the tick_nohz_get_sleep_length() call > >>> if the target residency of the current candidate idle state is small > >>> enough. > >>> > >>> However, if state 0 itself was to be returned, it would be returned > >>> right away without calling tick_nohz_get_sleep_length() even on systems > >>> where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle: > >>> teo: Don't count non-existent intercepts") that attempted to fix this > >>> problem. > >>> > >>> Unfortunately, commit 4b20b07ce72f has made the governor always call > >>> tick_nohz_get_sleep_length() when about to return state 0 early, even > >>> if that state is a "polling" one, which is inconsistent and defeats > >>> the purpose of commit 6da8f9ba5a87 in that case. > >>> > >>> Address this by adding a CPUIDLE_FLAG_POLLING check to the path where > >>> state 0 is returned early to prevent tick_nohz_get_sleep_length() from > >>> being called if it is a "polling" state. > >>> > >>> Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") > >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >>> --- > >>> drivers/cpuidle/governors/teo.c | 3 ++- > >>> 1 file changed, 2 insertions(+), 1 deletion(-) > >>> > >>> --- a/drivers/cpuidle/governors/teo.c > >>> +++ b/drivers/cpuidle/governors/teo.c > >>> @@ -422,7 +422,8 @@ > >>> first_suitable_idx = i; > >>> } > >>> } > >>> - if (!idx && prev_intercept_idx) { > >>> + if (!idx && prev_intercept_idx && > >>> + !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) { > >>> /* > >>> * We have to query the sleep length here otherwise we don't > >>> * know after wakeup if our guess was correct. > >>> > >>> > >>> > >> > >> But then you do run into the issue of intercepts not being detected if > >> state0 is the right choice, don't you? > > > > That's true, but then on systems with a "polling" state 0 you still > > have this problem if the state returned early is not state 0. Say C1 > > on x86.> > > The point here is that the behavior needs to be consistent, one way or another. > > Yes, gotcha. Why not be consistent 'in the other way' then? > > > > > The exact point of commit 6da8f9ba5a87 was to avoid calling > > tick_nohz_get_sleep_length() in some cases when the state to be > > returned is shallow enough and obviously that includes a "polling" > > state 0, possibly at the cost of being somewhat inaccurate in > > prediction. > > Somewhat inaccurate meaning not making any prediction? > cpu_data->sleep_length_ns = KTIME_MAX; > > How much is the harm for calling tick_nohz_get_sleep_length() when > polling anyway? > I know tick_nohz_get_sleep_length() is the majority of the usual > cpuidle entry path, but for many scenarios where state0 is appropriate > that should be pretty fast, no? > > > > > Then you're seeing this intercept accumulation for state 0 when there > > are only 2 states in the table (or all of the other states are much > > higher target residency than state 0). > > > > Commit 4b20b07ce72f effectively caused tick_nohz_get_sleep_length() to > > be called every time on systems without a "polling" state 0, which was > > fair enough, but it also affected the other systems, which wasn't. > > > >> This would then enable intercept-detection only for <50% of the time, > >> another option is to not allow intercepts selecting a polling state, but > >> there were recent complaints about this exact behavior from Aboorva (+TO). > >> They don't have a low-latency non-polling state. > >> > >> https://lore.kernel.org/lkml/20240809073120.250974-1-aboorvad@linux.ibm.com/ > > > > If they don't have a "polling" state 0, they won't be affected by this > > patch and after commit 4b20b07ce72f, they'll always call > > tick_nohz_get_sleep_length(), so the current governor behavior is > > generally unsuitable for them. > > They do though. > commit 5ddcc03a07ae ("powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state") > So they have a polling 'snooze' and a relatively high latency (hundreds usecs) > non-polling state and no deeper state. > So if they don't query sleep length on snooze on a (1us)-interrupt-wakeup heavy > workload they will get 50% state0 and 50% state1 (because intercepts recovered > due to not querying sleep length). > > > > > I have an idea how to change it to be more accurate in prediction, but > > we'll see how it goes. Stay tuned. > > I will. So with the appended change applied, there is no difference between "hits" and "intercepts" any more, they all are just "events" now, so the sleep length value should not matter unless it is needed to refine the state selection. Of course, the whole governor can be simplified significantly on top of it. --- drivers/cpuidle/governors/teo.c | 98 +++++++++++++++++++--------------------- 1 file changed, 48 insertions(+), 50 deletions(-) --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -34,12 +34,17 @@ * * The computations carried out by this governor are based on using bins whose * boundaries are aligned with the target residency parameter values of the CPU - * idle states provided by the %CPUIdle driver in the ascending order. That is, - * the first bin spans from 0 up to, but not including, the target residency of - * the second idle state (idle state 1), the second bin spans from the target - * residency of idle state 1 up to, but not including, the target residency of - * idle state 2, the third bin spans from the target residency of idle state 2 - * up to, but not including, the target residency of idle state 3 and so on. + * idle states provided by the %CPUIdle driver in ascending order, and a given + * bin is used for counting events for which the measured idle duration is + * between its boundaries. Specifically, the first bin spans from 0 up to, but + * not including, the target residency of the second idle state (idle state 1), + * so it counts events for which the measured idle duration is greater than or + * equal to 0 and less than the target residency of state 1. The second bin + * spans from the target residency of idle state 1 up to, but not including, the + * target residency of idle state 2, the third bin spans from the target + * residency of idle state 2 up to, but not including, the target residency of + * idle state 3 and so on. + * * The last bin spans from the target residency of the deepest idle state * supplied by the driver to the scheduler tick period length or to infinity if * the tick period length is less than the targer residency of that state. In @@ -47,47 +52,31 @@ * duration between the tick period length and the target residency of the * deepest idle state. * - * Two metrics called "hits" and "intercepts" are associated with each bin. - * They are updated every time before selecting an idle state for the given CPU - * in accordance with what happened last time. - * - * The "hits" metric reflects the relative frequency of situations in which the - * sleep length and the idle duration measured after CPU wakeup fall into the - * same bin (that is, the CPU appears to wake up "on time" relative to the sleep - * length). In turn, the "intercepts" metric reflects the relative frequency of - * non-timer wakeup events for which the measured idle duration falls into a bin - * that corresponds to an idle state shallower than the one whose bin is fallen - * into by the sleep length (these events are also referred to as "intercepts" - * below). - * * In order to select an idle state for a CPU, the governor takes the following * steps (modulo the possible latency constraint that must be taken into account * too): * * 1. Find the deepest enabled CPU idle state (the candidate idle state) and - * compute 2 sums as follows: - * - * - The sum of the "hits" metric for all of the idle states shallower than - * the candidate one (it represents the cases in which the CPU was likely - * woken up by a timer). - * - * - The sum of the "intercepts" metric for all of the idle states shallower - * than the candidate one (it represents the cases in which the CPU was - * likely woken up by a non-timer wakeup source). - * - * 2. If the second sum computed in step 1 is greater than a half of the sum of - * both mertics for the candidate state bin and all subsequent bins(if any), - * a shallower idle state is likely to be more suitable, so look for it. + * compute a sum of events for all of the idle states shallower than the + * candidate one (it represents the cases in which the candidate idle state + * would have been to deep for the measured idle duration). + * + * 2. If the sum computed in step 1 is greater than 1/2 of the sum of all + * events, the candidate state would have been too deep for the measured idle + * duration in over 1/2 of all cases, so look for a shallower idle state to + * replace it. * * - Traverse the enabled idle states shallower than the candidate one in the * descending order. * - * - For each of them compute the sum of the "intercepts" metrics over all - * of the idle states between it and the candidate one (including the - * former and excluding the latter). - * - * - If this sum is greater than a half of the second sum computed in step 1, - * use the given idle state as the new candidate one. + * - In each step, compute the sum of events for the given state and all of + * the subsequent bins, including the events with the measured idle + * duration between the tick period length and the target residency of the + * deepest idle state (if the former is less than the latter). + * + * - If this sum is greater than 1/2 of the sum of all events. the given + * state would have been shallow enough for the measured idle duration in + * over 1/2 of all cases, so it can be used as the new candidate one. * * 3. If the current candidate state is state 0 or its target residency is short * enough, return it and prevent the scheduler tick from being stopped. @@ -288,6 +277,7 @@ unsigned int tick_intercept_sum = 0; unsigned int idx_intercept_sum = 0; unsigned int intercept_sum = 0; + unsigned int idx_event_sum = 0; unsigned int idx_hit_sum = 0; unsigned int hit_sum = 0; int constraint_idx = 0; @@ -361,15 +351,18 @@ tick_intercept_sum = intercept_sum + cpu_data->state_bins[drv->state_count-1].intercepts; - /* - * If the sum of the intercepts metric for all of the idle states - * shallower than the current candidate one (idx) is greater than the - * sum of the intercepts and hits metrics for the candidate state and - * all of the deeper states, a shallower idle state is likely to be a - * better choice. + * Compute the sum of both event metrics for all idle states shallower + * than the candidate one (idx). + */ + idx_event_sum = idx_intercept_sum + idx_hit_sum; + /* + * If this sum is greater than 1/2 of total events, the current + * candndate state would have been to deep for the measured idle + * duration in over 50% of recent cases, so look for a shallower one. */ - if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) { + if (2 * idx_event_sum > cpu_data->total) { + unsigned int event_sum = cpu_data->total - idx_event_sum; int first_suitable_idx = idx; /* @@ -380,14 +373,19 @@ * Take the possible duration limitation present if the tick * has been stopped already into account. */ - intercept_sum = 0; - for (i = idx - 1; i >= 0; i--) { struct teo_bin *bin = &cpu_data->state_bins[i]; - intercept_sum += bin->intercepts; - - if (2 * intercept_sum > idx_intercept_sum) { + /* + * Add both metrics for this state to the total sum of + * metrics for all of the subsequent bins. + */ + event_sum += bin->intercepts + bin->hits; + /* + * If this sum is greater than 1/2 of total events, + * replace the previous candidate state with this one. + */ + if (2 * event_sum > cpu_data->total) { /* * Use the current state unless it is too * shallow or disabled, in which case take the ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path 2025-01-10 13:16 ` Christian Loehle 2025-01-10 13:34 ` Rafael J. Wysocki @ 2025-01-23 16:54 ` Artem Bityutskiy 2025-01-23 18:12 ` Rafael J. Wysocki 1 sibling, 1 reply; 12+ messages in thread From: Artem Bityutskiy @ 2025-01-23 16:54 UTC (permalink / raw) To: Christian Loehle, Rafael J. Wysocki, Linux PM, Aboorva Devarajan Cc: LKML, Daniel Lezcano On Fri, 2025-01-10 at 13:16 +0000, Christian Loehle wrote: > This would then enable intercept-detection only for <50% of the time, > another option is to not allow intercepts selecting a polling state, but > there were recent complaints about this exact behavior from Aboorva (+TO). > They don't have a low-latency non-polling state. What folks think about the following idea. Idle governor algorithm essentially predicts the sleep time (step 1), based on that, the idle state gets selected (step 2). What if we add a sleep time factor and expose it via sysfs. The predicted sleep time would be multiplied by the factor (between steps 1 and 2). Default factor value is 1 (or 100%). If users want teo be more hesitant selecting deeper idle states, they can set it to 0.5 (or 50%) or some other value < 1. I users want teo to be more enthusiastic about selecting deeper idle states, they set the factor to a value > 1. We could expose it via sysfs and allow changing in some reasonable range. The idea is to let users adjust the level of idle governor (teo in this case) enthusiasm regarding deep C-state. Thanks, Artem. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path 2025-01-23 16:54 ` Artem Bityutskiy @ 2025-01-23 18:12 ` Rafael J. Wysocki 0 siblings, 0 replies; 12+ messages in thread From: Rafael J. Wysocki @ 2025-01-23 18:12 UTC (permalink / raw) To: dedekind1 Cc: Christian Loehle, Rafael J. Wysocki, Linux PM, Aboorva Devarajan, LKML, Daniel Lezcano On Thu, Jan 23, 2025 at 5:54 PM Artem Bityutskiy <dedekind1@gmail.com> wrote: > > On Fri, 2025-01-10 at 13:16 +0000, Christian Loehle wrote: > > This would then enable intercept-detection only for <50% of the time, > > another option is to not allow intercepts selecting a polling state, but > > there were recent complaints about this exact behavior from Aboorva (+TO). > > They don't have a low-latency non-polling state. > > What folks think about the following idea. > > Idle governor algorithm essentially predicts the sleep time (step 1), based on > that, the idle state gets selected (step 2). No, it's kind of the reverse nowadays. Step 1 is to preselect an idle state using the idle duration information from the recent past and step 2 is to refine the selection using the sleep time information (unless step 1 by itself is sufficient). In particular, teo doesn't attempt to predict the idle duration. It basically picks up the deepest idle state that wouldn't have been too deep in the majority of cases taken into account. > What if we add a sleep time factor and expose it via sysfs. The predicted sleep > time would be multiplied by the factor (between steps 1 and 2). > > Default factor value is 1 (or 100%). If users want teo be more hesitant > selecting deeper idle states, they can set it to 0.5 (or 50%) or some other > value < 1. I users want teo to be more enthusiastic about selecting deeper idle > states, they set the factor to a value > 1. > > We could expose it via sysfs and allow changing in some reasonable range. > > The idea is to let users adjust the level of idle governor (teo in this case) > enthusiasm regarding deep C-state. For teo, that value from user space would effectively work as an upper boundary for the target residency of idle states to select and there is not too much difference between this and the existing PM (CPU latency) QoS (limiting target residency is equivalent to limiting exit latency). If you want a "latency bias" knob, for teo it might be something related to what is regarded as a "sufficient majority". Right now it is hardcoded to "over 50%", but it may be a different fraction in principle ((e.g. 5/8, 2/3 etc.) and it may be tunable within some limits. ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v1 2/4] cpuidle: teo: Rearrange idle state lookup code 2025-01-10 12:51 [PATCH v1 0/4] cpuidle: teo: Fix and cleanups Rafael J. Wysocki 2025-01-10 12:53 ` [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path Rafael J. Wysocki @ 2025-01-10 12:54 ` Rafael J. Wysocki 2025-01-10 12:55 ` [PATCH v1 3/4] cpuidle: teo: Reorder candidate state index checks Rafael J. Wysocki 2025-01-10 12:56 ` [PATCH v1 4/4] cpuidle: teo: Combine candidate state index checks against 0 Rafael J. Wysocki 3 siblings, 0 replies; 12+ messages in thread From: Rafael J. Wysocki @ 2025-01-10 12:54 UTC (permalink / raw) To: Linux PM; +Cc: LKML, Daniel Lezcano, Christian Loehle, Artem Bityutskiy From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Rearrange code in the idle state lookup loop in teo_select() to make it somewhat easier to follow and update comments around it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- drivers/cpuidle/governors/teo.c | 34 ++++++++++++++++++++-------------- 1 file changed, 20 insertions(+), 14 deletions(-) --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -367,7 +367,7 @@ * If the sum of the intercepts metric for all of the idle states * shallower than the current candidate one (idx) is greater than the * sum of the intercepts and hits metrics for the candidate state and - * all of the deeper states a shallower idle state is likely to be a + * all of the deeper states, a shallower idle state is likely to be a * better choice. */ prev_intercept_idx = idx; @@ -396,30 +396,36 @@ * first enabled state that is deep enough. */ if (teo_state_ok(i, drv) && - !dev->states_usage[i].disable) + !dev->states_usage[i].disable) { idx = i; - else - idx = first_suitable_idx; - + break; + } + idx = first_suitable_idx; break; } if (dev->states_usage[i].disable) continue; - if (!teo_state_ok(i, drv)) { + if (teo_state_ok(i, drv)) { /* - * The current state is too shallow, but if an - * alternative candidate state has been found, - * it may still turn out to be a better choice. + * The current state is deep enough, but still + * there may be a better one. */ - if (first_suitable_idx != idx) - continue; - - break; + first_suitable_idx = i; + continue; } - first_suitable_idx = i; + /* + * The current state is too shallow, so if no suitable + * states other than the initial candidate have been + * found, give up (the remaining states to check are + * shallower still), but otherwise the first suitable + * state other than the initial candidate may turn out + * to be preferable. + */ + if (first_suitable_idx == idx) + break; } } if (!idx && prev_intercept_idx && ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v1 3/4] cpuidle: teo: Reorder candidate state index checks 2025-01-10 12:51 [PATCH v1 0/4] cpuidle: teo: Fix and cleanups Rafael J. Wysocki 2025-01-10 12:53 ` [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path Rafael J. Wysocki 2025-01-10 12:54 ` [PATCH v1 2/4] cpuidle: teo: Rearrange idle state lookup code Rafael J. Wysocki @ 2025-01-10 12:55 ` Rafael J. Wysocki 2025-01-10 12:56 ` [PATCH v1 4/4] cpuidle: teo: Combine candidate state index checks against 0 Rafael J. Wysocki 3 siblings, 0 replies; 12+ messages in thread From: Rafael J. Wysocki @ 2025-01-10 12:55 UTC (permalink / raw) To: Linux PM; +Cc: LKML, Daniel Lezcano, Christian Loehle, Artem Bityutskiy From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Since constraint_idx may be 0, the candidate state index may change to 0 after assinging constraint_idx to it, so first check if it is greater than constraint_idx (and update it if so) and then check it against 0. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- drivers/cpuidle/governors/teo.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -428,6 +428,14 @@ break; } } + + /* + * If there is a latency constraint, it may be necessary to select an + * idle state shallower than the current candidate one. + */ + if (idx > constraint_idx) + idx = constraint_idx; + if (!idx && prev_intercept_idx && !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) { /* @@ -440,13 +448,6 @@ } /* - * If there is a latency constraint, it may be necessary to select an - * idle state shallower than the current candidate one. - */ - if (idx > constraint_idx) - idx = constraint_idx; - - /* * Skip the timers check if state 0 is the current candidate one, * because an immediate non-timer wakeup is expected in that case. */ ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v1 4/4] cpuidle: teo: Combine candidate state index checks against 0 2025-01-10 12:51 [PATCH v1 0/4] cpuidle: teo: Fix and cleanups Rafael J. Wysocki ` (2 preceding siblings ...) 2025-01-10 12:55 ` [PATCH v1 3/4] cpuidle: teo: Reorder candidate state index checks Rafael J. Wysocki @ 2025-01-10 12:56 ` Rafael J. Wysocki 3 siblings, 0 replies; 12+ messages in thread From: Rafael J. Wysocki @ 2025-01-10 12:56 UTC (permalink / raw) To: Linux PM; +Cc: LKML, Daniel Lezcano, Christian Loehle, Artem Bityutskiy From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> There are two candidate state index checks against 0 in teo_select() that need not be separate, so combine them and update comments around them. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- drivers/cpuidle/governors/teo.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -436,25 +436,25 @@ if (idx > constraint_idx) idx = constraint_idx; - if (!idx && prev_intercept_idx && - !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) { + + if (!idx) { /* - * We have to query the sleep length here otherwise we don't - * know after wakeup if our guess was correct. + * Skip the timers check if state 0 is the current candidate + * one because an immediate non-timer wakeup is expected then. */ - duration_ns = tick_nohz_get_sleep_length(&delta_tick); - cpu_data->sleep_length_ns = duration_ns; + if (prev_intercept_idx && + !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) { + /* + * Query the sleep length to be able to count the wakeup + * as a hit if it is caused by a timer. + */ + duration_ns = tick_nohz_get_sleep_length(&delta_tick); + cpu_data->sleep_length_ns = duration_ns; + } goto out_tick; } /* - * Skip the timers check if state 0 is the current candidate one, - * because an immediate non-timer wakeup is expected in that case. - */ - if (!idx) - goto out_tick; - - /* * If state 0 is a polling one, check if the target residency of * the current candidate state is low enough and skip the timers * check in that case too. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-01-23 18:12 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-01-10 12:51 [PATCH v1 0/4] cpuidle: teo: Fix and cleanups Rafael J. Wysocki 2025-01-10 12:53 ` [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path Rafael J. Wysocki 2025-01-10 13:16 ` Christian Loehle 2025-01-10 13:34 ` Rafael J. Wysocki 2025-01-10 14:52 ` Christian Loehle 2025-01-10 15:17 ` Rafael J. Wysocki 2025-01-10 17:23 ` Rafael J. Wysocki 2025-01-23 16:54 ` Artem Bityutskiy 2025-01-23 18:12 ` Rafael J. Wysocki 2025-01-10 12:54 ` [PATCH v1 2/4] cpuidle: teo: Rearrange idle state lookup code Rafael J. Wysocki 2025-01-10 12:55 ` [PATCH v1 3/4] cpuidle: teo: Reorder candidate state index checks Rafael J. Wysocki 2025-01-10 12:56 ` [PATCH v1 4/4] cpuidle: teo: Combine candidate state index checks against 0 Rafael J. Wysocki
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox