* [PATCH 0/3] Sapphire Rapids C0.x idle states support
@ 2023-03-06 12:34 Artem Bityutskiy
2023-03-06 12:34 ` [PATCH 1/3] x86/mwait: Add support for idle via umwait Artem Bityutskiy
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Artem Bityutskiy @ 2023-03-06 12:34 UTC (permalink / raw)
To: x86, Linux PM Mailing List; +Cc: Artem Bityutskiy
From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
This patch-set enables C0.2 idle state support on Sapphire Rapids Xeon (later -
SPR). The new idle state is added between POLL and C1 states.
This patch-set is based on linux-pm/linux-next.
Motivation
----------
According to my measurements, SPR C0.2 idle state is 5-20% more
energy-efficient than Linux POLL state, and its latency is close to POLL
latency.
When a CPU rests in C0.2, saved power increases power budget of a busy CPU.
This way, C0.2 may improve turbo boost of a busy CPU and improve performance.
Latency and power impact
------------------------
I compared POLL to C0.2 using 'wult' tool (https://github.com/intel/wult).
Wult measures C-state exit latency and several other metrics, of which socket
level AC power and RAPL CPU package power are most interesting in this case.
I collected 1000000 datapoints with wult, and measured 4 configurations:
1. POLL + LFM
2. POLL + HFM
3. C0.2 + LFM
4. C0.2 + HFM
* In "POLL" experiments, all C-states except for POLL were disabled.
* In "C0.2" experiments, all C-states except for C0.2 were disabled.
* In "LFM" experiments, frequency of all CPUs was locked to 800MHz (low frequency
mode on my SPR system).
* In "HFM" experiments, frequency of all CPUs was locked to 2GHz (high
frequency mode on my SPR system).
Here are the measurement results. The numbers are percent change from POLL to
C0.2. The formula was:
% change of a value = [(POLL value - C0.2 value) / POLL value] * 100%
----|-----------|-----------|----------|-----------
| Median IR | 99th % IR | AC Power | RAPL power
----|-----------|-----------|----------|-----------
HFM | 22% | 11% | -13% | -19%
LFM | 23% | 6% | -5% | -18%
----------------------------|----------|-----------
* IR stands for interrupt latency. The table provides the median and 99th
percentile. Wult measures IR as TSC value in interrupt handler minus TSC value
of the moment when the interrupt fired.
* AC Power is the socket level AC power, measured with Yokogawa WT310 power
meter.
* RAPL power is the CPU package power, measured using the 'turbostat' tool.
Conclusion: C0.2 has higher latency comparing to POLL, but it comes with
substantial energy savings.
Cyclictest measurements
-----------------------
I used 'cyclictest' to measure average timer latency for timers armed 100us and
1.5ms ahead.
The executed commands:
# Arming timers 100us ahead.
cyclictest --default-system --nsecs -a 1 --loops 10000000 --distance 100
# Arming timers 1.5ms ahead.
cyclictest --default-system --nsecs -a 1 --loops 10000000 --distance 15000
The numbers below are percent change for the average timer latency.
----|-----------|-------
| 100us | 1.5ms
----|-----------|-------
HFM | 0.1% | -0.1%
LFM | -0.2% | -0.2%
Conclusion: C0.2 has very small latency impact, as per 'cyclictest'.
On wult vs cyclictest latency data.
- Latency measured with 'wult' is in range of [0.2,1] microseconds, depending
on configuration (CPU frequency, etc). The number includes HW latency and
some amount of SW overhead.
- Latency measured with 'cyclictest' is in tens of microseconds, and it is
dominated by the software overhead component: time run the HW interrupt, exit
idle, switch the context, reach the user-space handler, and so on.
This explains the significant difference in latency percent change between
'wult' and 'cyclictest'.
Hackbench measurements
----------------------
I ran the 'hackbench' benchmark using the following commands:
# 4 groups, 200 threads
hackbench -s 128 -l 100000000 -g4 -f 25 -P
# 8 groups, 400 threads.
hackbench -s 128 -l 100000000 -g8 -f 25 -P
My system has 224 CPUs, so the first command did not use all CPUs, the second
command used all of them. However, in both cases CPU power reached TDP.
I did not lock CPU frequency, so all frequencies were allowed on all CPUs. I
tested these 2 configurations for both 8 and 4 group runs (so total 4
experiments):
- POLL: only POLL state was enabled, all other C-states were disabled.
- POLL+C0.2: only POLL and C0.2 were enabled, all other C-states were disabled.
I ran hackbench 3 times and compared hackbench "score" (which is basically the
execution time) between POL' and POLL+C0.2 runs.
The below table contains hackbench score % change for hackbench 4 groups. C0.2
residency in these runs was about 20%.
------|--------------
Run # | Time decrease
------|--------------
1 | -6.5%
2 | -12.3%
3 | -7.1%
The below table contains hackbench score % change for hackbench 8 groups. C0.2
residency in these runs was about 10%.
------|---------------
Run # | Time decrease
------|---------------
1 | -1.6%
2 | -0.6%
3 | -0.9%
Conclusion: even with small C0.2 residency (~10%), hackbench shows some
performance improvement. With larger C0.2 residency the improvement is more
pronounced.
Q&A
---
1. Can C0.2 be disabled?
C0.2 can be disabled via sysfs and with the following kernel boot option:
intel_idle.states_off=2
2. Why C0.2, not C0.1?
I measured both C0.1 and C0.2, but did not notice a clear C0.1 advantage in
terms of latency, but did notice that C0.2 saves more power.
But if users want to try using C0.1 instead of C0.2, they can do this:
echo 0 > /sys/devices/system/cpu/umwait_control/enable_c02
This will make sure that C0.2 requests from 'intel_idle' are automatically
converted to C0.1 requests.
3. How did you verify that system enters C0.2?
I used 'perf' to read the corresponding PMU counters:
perf stat -e CPU_CLK_UNHALTED.C01,CPU_CLK_UNHALTED.C02,cycles -a sleep 1
Artem Bityutskiy (3):
x86/mwait: Add support for idle via umwait
x86/umwait: Increase tpause and umwait quanta
intel_idle: add C0.2 state for Sapphire Rapids Xeon
arch/x86/include/asm/mwait.h | 63 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/umwait.c | 4 +--
drivers/idle/intel_idle.c | 59 ++++++++++++++++++++++++++++++++-
3 files changed, 123 insertions(+), 3 deletions(-)
base-commit: 0a3f9a6b0265b64c02226fcabb5e9a958307913b
--
2.38.1
^ permalink raw reply [flat|nested] 13+ messages in thread* [PATCH 1/3] x86/mwait: Add support for idle via umwait
2023-03-06 12:34 [PATCH 0/3] Sapphire Rapids C0.x idle states support Artem Bityutskiy
@ 2023-03-06 12:34 ` Artem Bityutskiy
2023-03-06 14:55 ` Peter Zijlstra
2023-03-06 12:34 ` [PATCH 2/3] x86/umwait: Increase tpause and umwait quanta Artem Bityutskiy
2023-03-06 12:34 ` [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon Artem Bityutskiy
2 siblings, 1 reply; 13+ messages in thread
From: Artem Bityutskiy @ 2023-03-06 12:34 UTC (permalink / raw)
To: x86, Linux PM Mailing List; +Cc: Artem Bityutskiy
From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
On Intel platforms, C-states are requested using the 'monitor/mwait'
instructions pair, as implemented in 'mwait_idle_with_hints()'. This
mechanism allows for entering C1 and deeper C-states.
Sapphire Rapids Xeon supports new idle states - C0.1 and C0.2 (later C0.x).
These idle states have lower latency comparing to C1, and can be requested
with either 'tpause' and 'umwait' instructions.
Linux already uses the 'tpause' instruction in delay functions like
'udelay()'. This patch adds 'umwait' and 'umonitor' instructions support.
'umwait' and 'tpause' instructions are very similar - both send the CPU to
C0.x and have the same break out rules. But unlike 'tpause', 'umwait' works
together with 'umonitor' and exits the C0.x when the monitored memory
address is modified (similar idea as with 'monitor/mwait').
This patch implements the 'umwait_idle()' function, which works very
similarly to existing 'mwait_idle_with_hints()', but requests C0.x. The
intention is to use it from the 'intel_idle' driver.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
---
arch/x86/include/asm/mwait.h | 63 ++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)
diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 778df05f8539..a8612de3212a 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -141,4 +141,67 @@ static inline void __tpause(u32 ecx, u32 edx, u32 eax)
#endif
}
+#ifdef CONFIG_X86_64
+/*
+ * Monitor a memory address at 'rcx' using the 'umonitor' instruction.
+ */
+static inline void __umonitor(const void *rcx)
+{
+ /* "umonitor %rcx" */
+#ifdef CONFIG_AS_TPAUSE
+ asm volatile("umonitor %%rcx\n"
+ :
+ : "c"(rcx));
+#else
+ asm volatile(".byte 0xf3, 0x0f, 0xae, 0xf1\t\n"
+ :
+ : "c"(rcx));
+#endif
+}
+
+/*
+ * Same as '__tpause()', but uses the 'umwait' instruction. It is very
+ * similar to 'tpause', but also breaks out if the data at the address
+ * monitored with 'umonitor' is modified.
+ */
+static inline void __umwait(u32 ecx, u32 edx, u32 eax)
+{
+ /* "umwait %ecx, %edx, %eax;" */
+#ifdef CONFIG_AS_TPAUSE
+ asm volatile("umwait %%ecx\n"
+ :
+ : "c"(ecx), "d"(edx), "a"(eax));
+#else
+ asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf1\t\n"
+ :
+ : "c"(ecx), "d"(edx), "a"(eax));
+#endif
+}
+
+/*
+ * Enter C0.1 or C0.2 state and stay there until an event happens (an interrupt
+ * or the 'need_resched()'), or the deadline is reached. The deadline is the
+ * absolute TSC value to exit the idle state at. However, if deadline exceeds
+ * the global limit in the IA32_UMWAIT_CONTROL register, the global limit
+ * prevails, and the idle state is exited earlier than the deadline.
+ */
+static inline void umwait_idle(u64 deadline, u32 state)
+{
+ if (!current_set_polling_and_test()) {
+ u32 eax, edx;
+
+ eax = lower_32_bits(deadline);
+ edx = upper_32_bits(deadline);
+
+ __umonitor(¤t_thread_info()->flags);
+ if (!need_resched())
+ __umwait(state, edx, eax);
+ }
+ current_clr_polling();
+}
+#else
+#define umwait_idle(deadline, state) \
+ WARN_ONCE(1, "umwait CPU instruction is not supported")
+#endif /* CONFIG_X86_64 */
+
#endif /* _ASM_X86_MWAIT_H */
--
2.38.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH 1/3] x86/mwait: Add support for idle via umwait
2023-03-06 12:34 ` [PATCH 1/3] x86/mwait: Add support for idle via umwait Artem Bityutskiy
@ 2023-03-06 14:55 ` Peter Zijlstra
2023-03-07 11:55 ` Rafael J. Wysocki
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2023-03-06 14:55 UTC (permalink / raw)
To: Artem Bityutskiy; +Cc: x86, Linux PM Mailing List
On Mon, Mar 06, 2023 at 02:34:16PM +0200, Artem Bityutskiy wrote:
> From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
>
> On Intel platforms, C-states are requested using the 'monitor/mwait'
> instructions pair, as implemented in 'mwait_idle_with_hints()'. This
> mechanism allows for entering C1 and deeper C-states.
>
> Sapphire Rapids Xeon supports new idle states - C0.1 and C0.2 (later C0.x).
> These idle states have lower latency comparing to C1, and can be requested
> with either 'tpause' and 'umwait' instructions.
>
> Linux already uses the 'tpause' instruction in delay functions like
> 'udelay()'. This patch adds 'umwait' and 'umonitor' instructions support.
>
> 'umwait' and 'tpause' instructions are very similar - both send the CPU to
> C0.x and have the same break out rules. But unlike 'tpause', 'umwait' works
> together with 'umonitor' and exits the C0.x when the monitored memory
> address is modified (similar idea as with 'monitor/mwait').
>
> This patch implements the 'umwait_idle()' function, which works very
> similarly to existing 'mwait_idle_with_hints()', but requests C0.x. The
> intention is to use it from the 'intel_idle' driver.
Still wondering wth regular mwait can't access these new idle states.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/3] x86/mwait: Add support for idle via umwait
2023-03-06 14:55 ` Peter Zijlstra
@ 2023-03-07 11:55 ` Rafael J. Wysocki
2023-03-08 12:35 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Rafael J. Wysocki @ 2023-03-07 11:55 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Artem Bityutskiy, x86, Linux PM Mailing List
On Mon, Mar 6, 2023 at 3:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 06, 2023 at 02:34:16PM +0200, Artem Bityutskiy wrote:
> > From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> >
> > On Intel platforms, C-states are requested using the 'monitor/mwait'
> > instructions pair, as implemented in 'mwait_idle_with_hints()'. This
> > mechanism allows for entering C1 and deeper C-states.
> >
> > Sapphire Rapids Xeon supports new idle states - C0.1 and C0.2 (later C0.x).
> > These idle states have lower latency comparing to C1, and can be requested
> > with either 'tpause' and 'umwait' instructions.
> >
> > Linux already uses the 'tpause' instruction in delay functions like
> > 'udelay()'. This patch adds 'umwait' and 'umonitor' instructions support.
> >
> > 'umwait' and 'tpause' instructions are very similar - both send the CPU to
> > C0.x and have the same break out rules. But unlike 'tpause', 'umwait' works
> > together with 'umonitor' and exits the C0.x when the monitored memory
> > address is modified (similar idea as with 'monitor/mwait').
> >
> > This patch implements the 'umwait_idle()' function, which works very
> > similarly to existing 'mwait_idle_with_hints()', but requests C0.x. The
> > intention is to use it from the 'intel_idle' driver.
>
> Still wondering wth regular mwait can't access these new idle states.
But is this a question for Artem to answer?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/3] x86/mwait: Add support for idle via umwait
2023-03-07 11:55 ` Rafael J. Wysocki
@ 2023-03-08 12:35 ` Peter Zijlstra
0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2023-03-08 12:35 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Artem Bityutskiy, x86, Linux PM Mailing List
On Tue, Mar 07, 2023 at 12:55:45PM +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 6, 2023 at 3:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Mar 06, 2023 at 02:34:16PM +0200, Artem Bityutskiy wrote:
> > > From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> > >
> > > On Intel platforms, C-states are requested using the 'monitor/mwait'
> > > instructions pair, as implemented in 'mwait_idle_with_hints()'. This
> > > mechanism allows for entering C1 and deeper C-states.
> > >
> > > Sapphire Rapids Xeon supports new idle states - C0.1 and C0.2 (later C0.x).
> > > These idle states have lower latency comparing to C1, and can be requested
> > > with either 'tpause' and 'umwait' instructions.
> > >
> > > Linux already uses the 'tpause' instruction in delay functions like
> > > 'udelay()'. This patch adds 'umwait' and 'umonitor' instructions support.
> > >
> > > 'umwait' and 'tpause' instructions are very similar - both send the CPU to
> > > C0.x and have the same break out rules. But unlike 'tpause', 'umwait' works
> > > together with 'umonitor' and exits the C0.x when the monitored memory
> > > address is modified (similar idea as with 'monitor/mwait').
> > >
> > > This patch implements the 'umwait_idle()' function, which works very
> > > similarly to existing 'mwait_idle_with_hints()', but requests C0.x. The
> > > intention is to use it from the 'intel_idle' driver.
> >
> > Still wondering wth regular mwait can't access these new idle states.
>
> But is this a question for Artem to answer?
Maybe, maybe not, but I did want to call out this 'design' in public. It
is really weird IMO.
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/3] x86/umwait: Increase tpause and umwait quanta
2023-03-06 12:34 [PATCH 0/3] Sapphire Rapids C0.x idle states support Artem Bityutskiy
2023-03-06 12:34 ` [PATCH 1/3] x86/mwait: Add support for idle via umwait Artem Bityutskiy
@ 2023-03-06 12:34 ` Artem Bityutskiy
2023-03-06 12:34 ` [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon Artem Bityutskiy
2 siblings, 0 replies; 13+ messages in thread
From: Artem Bityutskiy @ 2023-03-06 12:34 UTC (permalink / raw)
To: x86, Linux PM Mailing List; +Cc: Artem Bityutskiy
From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
== Background ==
The 'umwait' and 'tpause' instructions both put the CPU in a low power
state while they wait. They amount of time they wait is influenced by
an explicit deadline value passed in a register and an implicit value
written to an shared MSR (MSR_IA32_UMWAIT_CONTROL).
Existing 'tpause' users (udelay()) can tolerate a wide range of
MSR_IA32_UMWAIT_CONTROL MSR values. The explicit deadline trumps the
MSR for short delays. Longer delays will see extra wakeups, but no
functional issues.
== Problem ==
Extra wakeups mean extra power. That translates into worse idle power
when 'umwait' gets used for idle.
== Solution ==
Increase MSR_IA32_UMWAIT_CONTROL by factor of 100 to decrease idle power
when using 'umwait'. Make 'tpause' rely on its explicit deadline more
often and reduce the number of wakeups and save power during long delays.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
---
arch/x86/kernel/cpu/umwait.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/umwait.c b/arch/x86/kernel/cpu/umwait.c
index ec8064c0ae03..17c23173da0f 100644
--- a/arch/x86/kernel/cpu/umwait.c
+++ b/arch/x86/kernel/cpu/umwait.c
@@ -14,9 +14,9 @@
/*
* Cache IA32_UMWAIT_CONTROL MSR. This is a systemwide control. By default,
- * umwait max time is 100000 in TSC-quanta and C0.2 is enabled
+ * umwait max time is 10,000,000 in TSC-quanta and C0.2 is enabled.
*/
-static u32 umwait_control_cached = UMWAIT_CTRL_VAL(100000, UMWAIT_C02_ENABLE);
+static u32 umwait_control_cached = UMWAIT_CTRL_VAL(10000000, UMWAIT_C02_ENABLE);
/*
* Cache the original IA32_UMWAIT_CONTROL MSR value which is configured by
--
2.38.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon
2023-03-06 12:34 [PATCH 0/3] Sapphire Rapids C0.x idle states support Artem Bityutskiy
2023-03-06 12:34 ` [PATCH 1/3] x86/mwait: Add support for idle via umwait Artem Bityutskiy
2023-03-06 12:34 ` [PATCH 2/3] x86/umwait: Increase tpause and umwait quanta Artem Bityutskiy
@ 2023-03-06 12:34 ` Artem Bityutskiy
2023-03-06 15:32 ` Peter Zijlstra
2 siblings, 1 reply; 13+ messages in thread
From: Artem Bityutskiy @ 2023-03-06 12:34 UTC (permalink / raw)
To: x86, Linux PM Mailing List; +Cc: Artem Bityutskiy
From: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Add Sapphire Rapids Xeon C0.2 state support. This state has a lower exit
latency comparing to C1, and saves energy comparing to POLL (in range of
5-20%).
This patch also improves performance (e.g., as measured by 'hackbench'),
because idle CPU power savings in C0.2 increase busy CPU power budget and
therefore, improve turbo boost of the busy CPU.
Suggested-by: Len Brown <len.brown@intel.com>
Suggested-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
---
drivers/idle/intel_idle.c | 59 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 58 insertions(+), 1 deletion(-)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 938c17f25d94..f7705a64d0e6 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -51,11 +51,13 @@
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/moduleparam.h>
+#include <linux/units.h>
#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/nospec-branch.h>
#include <asm/mwait.h>
#include <asm/msr.h>
+#include <asm/tsc.h>
#include <asm/fpu/api.h>
#define INTEL_IDLE_VERSION "0.5.1"
@@ -73,6 +75,8 @@ static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;
static unsigned long auto_demotion_disable_flags;
+static u64 umwait_limit;
+
static enum {
C1E_PROMOTION_PRESERVE,
C1E_PROMOTION_ENABLE,
@@ -225,6 +229,27 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
return 0;
}
+/**
+ * intel_idle_umwait_irq - Request C0.x using the 'umwait' instruction.
+ * @dev: cpuidle device of the target CPU.
+ * @drv: cpuidle driver (assumed to point to intel_idle_driver).
+ * @index: Target idle state index.
+ *
+ * Request C0.1 or C0.2 using 'umwait' instruction with interrupts enabled.
+ */
+static __cpuidle int intel_idle_umwait_irq(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv,
+ int index)
+{
+ u32 state = flg2MWAIT(drv->states[index].flags);
+
+ raw_local_irq_enable();
+ umwait_idle(rdtsc() + umwait_limit, state);
+ local_irq_disable();
+
+ return index;
+}
+
/*
* States are indexed by the cstate number,
* which is also the index into the MWAIT hint array.
@@ -968,6 +993,13 @@ static struct cpuidle_state adl_n_cstates[] __initdata = {
};
static struct cpuidle_state spr_cstates[] __initdata = {
+ {
+ .name = "C0.2",
+ .desc = "UMWAIT C0.2",
+ .flags = MWAIT2flg(TPAUSE_C02_STATE) | CPUIDLE_FLAG_IRQ_ENABLE,
+ .exit_latency_ns = 100,
+ .target_residency_ns = 100,
+ .enter = &intel_idle_umwait_irq, },
{
.name = "C1",
.desc = "MWAIT 0x00",
@@ -1894,7 +1926,8 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
/* Structure copy. */
drv->states[drv->state_count] = cpuidle_state_table[cstate];
- if ((cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_IRQ_ENABLE) || force_irq_on) {
+ if (cpuidle_state_table[cstate].enter == intel_idle &&
+ ((cpuidle_state_table[cstate].flags & CPUIDLE_FLAG_IRQ_ENABLE) || force_irq_on)) {
printk("intel_idle: forced intel_idle_irq for state %d\n", cstate);
drv->states[drv->state_count].enter = intel_idle_irq;
}
@@ -1926,6 +1959,29 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
}
}
+/**
+ * umwait_limit_init - initialize time limit value for 'umwait'.
+ * @drv: cpuidle driver structure to initialize.
+ *
+ * C0.1 and C0.2 (later C0.x) idle states are requested via the 'umwait'
+ * instruction. The 'umwait' instruction requires the "deadline" - the TSC
+ * counter value to break out of C0.x (unless it broke out because of an
+ * interrupt or some other event).
+ *
+ * The deadline is specified as an absolute TSC value, and it is calculated as
+ * current TSC value + 'umwait_limit'. This function initializes the
+ * 'umwait_limit' variable to count of cycles per tick. The motivation is:
+ * * the tick is not disabled for shallow states like C0.x so, so idle will
+ * not last longer than a tick anyway
+ * * limit idle time to give cpuidle a chance to re-evaluate its C-state
+ * selection decision and possibly select a deeper C-state.
+ */
+static void __init umwait_limit_init(void)
+{
+ umwait_limit = (u64)TICK_NSEC * tsc_khz;
+ do_div(umwait_limit, MICRO);
+}
+
/**
* intel_idle_cpuidle_driver_init - Create the list of available idle states.
* @drv: cpuidle driver structure to initialize.
@@ -1933,6 +1989,7 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
static void __init intel_idle_cpuidle_driver_init(struct cpuidle_driver *drv)
{
cpuidle_poll_state_init(drv);
+ umwait_limit_init();
if (disabled_states_mask & BIT(0))
drv->states[0].flags |= CPUIDLE_FLAG_OFF;
--
2.38.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon
2023-03-06 12:34 ` [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon Artem Bityutskiy
@ 2023-03-06 15:32 ` Peter Zijlstra
2023-03-07 12:39 ` Rafael J. Wysocki
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2023-03-06 15:32 UTC (permalink / raw)
To: Artem Bityutskiy; +Cc: x86, Linux PM Mailing List
On Mon, Mar 06, 2023 at 02:34:18PM +0200, Artem Bityutskiy wrote:
> @@ -225,6 +229,27 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
> return 0;
> }
>
> +/**
> + * intel_idle_umwait_irq - Request C0.x using the 'umwait' instruction.
> + * @dev: cpuidle device of the target CPU.
> + * @drv: cpuidle driver (assumed to point to intel_idle_driver).
> + * @index: Target idle state index.
> + *
> + * Request C0.1 or C0.2 using 'umwait' instruction with interrupts enabled.
> + */
> +static __cpuidle int intel_idle_umwait_irq(struct cpuidle_device *dev,
> + struct cpuidle_driver *drv,
> + int index)
> +{
> + u32 state = flg2MWAIT(drv->states[index].flags);
> +
> + raw_local_irq_enable();
> + umwait_idle(rdtsc() + umwait_limit, state);
> + local_irq_disable();
> +
> + return index;
> +}
Bad copy paste job there...
vmlinux.o: warning: objtool: intel_idle_umwait_irq+0x8c: call to trace_hardirqs_off() leaves .noinstr.text section
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon
2023-03-06 15:32 ` Peter Zijlstra
@ 2023-03-07 12:39 ` Rafael J. Wysocki
2023-03-08 12:32 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Rafael J. Wysocki @ 2023-03-07 12:39 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Artem Bityutskiy, x86, Linux PM Mailing List
On Mon, Mar 6, 2023 at 4:32 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 06, 2023 at 02:34:18PM +0200, Artem Bityutskiy wrote:
> > @@ -225,6 +229,27 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
> > return 0;
> > }
> >
> > +/**
> > + * intel_idle_umwait_irq - Request C0.x using the 'umwait' instruction.
> > + * @dev: cpuidle device of the target CPU.
> > + * @drv: cpuidle driver (assumed to point to intel_idle_driver).
> > + * @index: Target idle state index.
> > + *
> > + * Request C0.1 or C0.2 using 'umwait' instruction with interrupts enabled.
> > + */
> > +static __cpuidle int intel_idle_umwait_irq(struct cpuidle_device *dev,
> > + struct cpuidle_driver *drv,
> > + int index)
> > +{
> > + u32 state = flg2MWAIT(drv->states[index].flags);
> > +
> > + raw_local_irq_enable();
> > + umwait_idle(rdtsc() + umwait_limit, state);
> > + local_irq_disable();
> > +
> > + return index;
> > +}
>
> Bad copy paste job there...
>
> vmlinux.o: warning: objtool: intel_idle_umwait_irq+0x8c: call to trace_hardirqs_off() leaves .noinstr.text section
Well, it would be kind of nice to say that this is related to commit
6d9c7f51b1d9 ("cpuidle, intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE
*again*") that is present in 6.3-rc1.
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon
2023-03-07 12:39 ` Rafael J. Wysocki
@ 2023-03-08 12:32 ` Peter Zijlstra
2023-03-09 8:01 ` Artem Bityutskiy
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2023-03-08 12:32 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Artem Bityutskiy, x86, Linux PM Mailing List
On Tue, Mar 07, 2023 at 01:39:09PM +0100, Rafael J. Wysocki wrote:
> On Mon, Mar 6, 2023 at 4:32 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Mar 06, 2023 at 02:34:18PM +0200, Artem Bityutskiy wrote:
> > > @@ -225,6 +229,27 @@ static __cpuidle int intel_idle_s2idle(struct cpuidle_device *dev,
> > > return 0;
> > > }
> > >
> > > +/**
> > > + * intel_idle_umwait_irq - Request C0.x using the 'umwait' instruction.
> > > + * @dev: cpuidle device of the target CPU.
> > > + * @drv: cpuidle driver (assumed to point to intel_idle_driver).
> > > + * @index: Target idle state index.
> > > + *
> > > + * Request C0.1 or C0.2 using 'umwait' instruction with interrupts enabled.
> > > + */
> > > +static __cpuidle int intel_idle_umwait_irq(struct cpuidle_device *dev,
> > > + struct cpuidle_driver *drv,
> > > + int index)
> > > +{
> > > + u32 state = flg2MWAIT(drv->states[index].flags);
> > > +
> > > + raw_local_irq_enable();
> > > + umwait_idle(rdtsc() + umwait_limit, state);
> > > + local_irq_disable();
> > > +
> > > + return index;
> > > +}
> >
> > Bad copy paste job there...
> >
> > vmlinux.o: warning: objtool: intel_idle_umwait_irq+0x8c: call to trace_hardirqs_off() leaves .noinstr.text section
>
> Well, it would be kind of nice to say that this is related to commit
> 6d9c7f51b1d9 ("cpuidle, intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE
> *again*") that is present in 6.3-rc1.
Right, but he said the patches were based on -next, which would've had
that commit for a fair while too.
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon
2023-03-08 12:32 ` Peter Zijlstra
@ 2023-03-09 8:01 ` Artem Bityutskiy
2023-03-14 12:24 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Artem Bityutskiy @ 2023-03-09 8:01 UTC (permalink / raw)
To: Peter Zijlstra, Rafael J. Wysocki; +Cc: x86, Linux PM Mailing List
On Wed, 2023-03-08 at 13:32 +0100, Peter Zijlstra wrote:
> > Well, it would be kind of nice to say that this is related to commit
> > 6d9c7f51b1d9 ("cpuidle, intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE
> > *again*") that is present in 6.3-rc1.
>
> Right, but he said the patches were based on -next, which would've had
> that commit for a fair while too.
I can see what is the problem from the above mentioned commit. But I struggle to
reproduce it. I tried 'make W=1', and 'tools/objtool/objtool -n vmlinux.o'. I
have 'CONFIG_HAVE_NOINSTR_VALIDATION' too.
I also tries this:
1. Build kernel with this patch-set
2. tools/objtool/objtool -n vmlinux.o > ~/tmp/before.txt 2>&1
3. Make clean; Build kernel with a fix (local_irq_disable() ->
raw_local_irq_disable())
4. tools/objtool/objtool -n vmlinux.o > ~/tmp/after.txt 2>&1
5. diff -u ~/tmp/before.txt ~/tmp/after.txt
And no difference.
Could you please help by giving a hint how to verify?
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon
2023-03-09 8:01 ` Artem Bityutskiy
@ 2023-03-14 12:24 ` Peter Zijlstra
2023-03-17 8:42 ` Artem Bityutskiy
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2023-03-14 12:24 UTC (permalink / raw)
To: Artem Bityutskiy; +Cc: Rafael J. Wysocki, x86, Linux PM Mailing List
On Thu, Mar 09, 2023 at 10:01:57AM +0200, Artem Bityutskiy wrote:
> Could you please help by giving a hint how to verify?
I think you need both:
DEBUG_ENTRY and TRACE_IRQFLAGS
I can reproduce with:
make O=defconfig-build defconfig
cd defconfig-build
../scripts/config --enable INTEL_IDLE
../scripts/config --enable DEBUG_ENTRY
../scripts/config --enable IRQSOFF_TRACER
cd ..
make O=defconfig-build -j32 vmlinux
But typically it is useful to have a .config with lots of tracing and
debug (including lockdep and debug_entry) enabled for this stuff.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-03-17 8:44 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-06 12:34 [PATCH 0/3] Sapphire Rapids C0.x idle states support Artem Bityutskiy
2023-03-06 12:34 ` [PATCH 1/3] x86/mwait: Add support for idle via umwait Artem Bityutskiy
2023-03-06 14:55 ` Peter Zijlstra
2023-03-07 11:55 ` Rafael J. Wysocki
2023-03-08 12:35 ` Peter Zijlstra
2023-03-06 12:34 ` [PATCH 2/3] x86/umwait: Increase tpause and umwait quanta Artem Bityutskiy
2023-03-06 12:34 ` [PATCH 3/3] intel_idle: add C0.2 state for Sapphire Rapids Xeon Artem Bityutskiy
2023-03-06 15:32 ` Peter Zijlstra
2023-03-07 12:39 ` Rafael J. Wysocki
2023-03-08 12:32 ` Peter Zijlstra
2023-03-09 8:01 ` Artem Bityutskiy
2023-03-14 12:24 ` Peter Zijlstra
2023-03-17 8:42 ` Artem Bityutskiy
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox