* [RFC] perf_events: exclude_guest impact on time_enabled/time_running @ 2024-06-06 7:57 Stephane Eranian 2024-06-06 15:48 ` Liang, Kan 2024-06-10 20:54 ` Peter Zijlstra 0 siblings, 2 replies; 6+ messages in thread From: Stephane Eranian @ 2024-06-06 7:57 UTC (permalink / raw) To: Peter Zijlstra Cc: LKML, Ian Rogers, Liang, Kan, Andi Kleen, Ingo Molnar, Narayan, Ananth, Bangoria, Ravikumar, Namhyung Kim, Mingwei Zhang, Dapeng Mi, Zhang Xiong Hi Peter, In the context of the new vPMU passthru patch series, we have to look closer at the definition and implementation of the exclude_guest filter in the perf_event_attr structure. This filter has been in the kernel for many years. See patch: https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/ The presumed definition of the filter is that the user does not want the event to count while the processor is running in guest mode (i.e., inside the virtual machine guest OS or guest user code). The perf tool sets is by default on all core PMU events: $ perf stat -vv -e cycles sleep 0 ------------------------------------------------------------ perf_event_attr: size 112 sample_type IDENTIFIER read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING disabled 1 inherit 1 enable_on_exec 1 exclude_guest 1 ------------------------------------------------------------ In the kernel, the way this is treated differs between AMD and Intel because AMD does provide a hardware filter for guest vs. host in the PMU counters whereas Intel does not. For the latter, the kernel simply disables the event in the hardware counters, i.e., the event is not descheduled. Both approaches produce pretty much the same desired effect, the event is not counted while in guest mode. The issue I would like to raise has to do with the effects on time_enabled and time_running for exclude_guest=1 events. Given the event is not scheduled out while in guest mode, even though it is stopped, both time_enabled and time_running continue ticking while in guest mode. If a measurement is 10s long but only 5s are in non-guest mode, then time_enabled=10s, time_running=10s. The count represents 10s worth of non guest mode, of which only 5s were really actively monitoring, but the user has no way of determining this. If we look at vPMU passthru, the host event must have exclude_guest=1 to avoid going into an error state on context switch to the vCPU thread (with vPMU enabled). But this time, the event is scheduled out, that means that time_enabled keeps counting, but time_running stops. On context switch back in, the host event is scheduled again and time_running restarts ticking. For a 10s measurement, where 5s here in the guest, the event will come out with time_enabled=10s, time_running=5s, and the tool will scale it up because it thinks the event was multiplexed, when in fact it was not. This is not the intended outcome here. The tool should not scale the count, it was not multiplexed, it was descheduled because the filter forced it out. Note that if the event had been multiplexed while running on the host, then the scaling would be appropriate. In that case, I argue, time_running should be updated to cover the time the event was not running. That would bring us back to the case I was describing earlier. It boils down to the exact definition of exclude_guest and expected impact on time_enabled and time_running. Then, with or without vPMU passthru, we can fix the kernel to ensure a uniform behavior. What are your thoughts on this problem? ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC] perf_events: exclude_guest impact on time_enabled/time_running 2024-06-06 7:57 [RFC] perf_events: exclude_guest impact on time_enabled/time_running Stephane Eranian @ 2024-06-06 15:48 ` Liang, Kan 2024-06-06 17:08 ` Stephane Eranian 2024-06-10 20:54 ` Peter Zijlstra 1 sibling, 1 reply; 6+ messages in thread From: Liang, Kan @ 2024-06-06 15:48 UTC (permalink / raw) To: Stephane Eranian, Peter Zijlstra Cc: LKML, Ian Rogers, Liang, Kan, Andi Kleen, Ingo Molnar, Narayan, Ananth, Bangoria, Ravikumar, Namhyung Kim, Mingwei Zhang, Dapeng Mi, Zhang Xiong On 2024-06-06 3:57 a.m., Stephane Eranian wrote: > Hi Peter, > > In the context of the new vPMU passthru patch series, we have to look > closer at the definition and implementation of the exclude_guest > filter in the perf_event_attr structure. This filter has been in the > kernel for many years. See patch: > https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/ > > The presumed definition of the filter is that the user does not want > the event to count while the processor is running in guest mode (i.e., > inside the virtual machine guest OS or guest user code). > > The perf tool sets is by default on all core PMU events: > $ perf stat -vv -e cycles sleep 0 > ------------------------------------------------------------ > perf_event_attr: > size 112 > sample_type IDENTIFIER > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING > disabled 1 > inherit 1 > enable_on_exec 1 > exclude_guest 1 > ------------------------------------------------------------ > > In the kernel, the way this is treated differs between AMD and Intel because AMD > does provide a hardware filter for guest vs. host in the PMU counters > whereas Intel > does not. For the latter, the kernel simply disables the event in the > hardware counters, > i.e., the event is not descheduled. Both approaches produce pretty > much the same > desired effect, the event is not counted while in guest mode. > > The issue I would like to raise has to do with the effects on > time_enabled and time_running > for exclude_guest=1 events. > > Given the event is not scheduled out while in guest mode, even though > it is stopped, both time_enabled and time_running continue ticking > while in guest mode. If a measurement is 10s > long but only 5s are in non-guest mode, then time_enabled=10s, > time_running=10s. The count > represents 10s worth of non guest mode, of which only 5s were really > actively monitoring, but > the user has no way of determining this. For the latest design/implementation, only the exclude_guest=1 host event can be successfully created for the case. The end user should not expect that anything is collected in the guest mode. So both the time_enabled and the time_running will be 5s. > > If we look at vPMU passthru, the host event must have exclude_guest=1 > to avoid going into > an error state on context switch to the vCPU thread (with vPMU > enabled). That's the design in V1. There is no error state anymore in V2 and later as suggested by Sean. https://lore.kernel.org/all/ZhmIrQQVgblrhCZs@google.com/ The VM cannot be created if there are host events with exclude_guest=0. If a VM has been created, user cannot create an event with exclude_guest=0. So nothing will be moved to an error state on context switch to the vCPU thread. > But this time, > the event is scheduled out, that means that time_enabled keeps > counting, but time_running > stops. On context switch back in, the host event is scheduled again > and time_running restarts > ticking. For a 10s measurement, where 5s here in the guest, the event > will come out with > time_enabled=10s, time_running=5s, and the tool will scale it up > because it thinks the event > was multiplexed, when in fact it was not. This is not the intended > outcome here. The tool should > not scale the count, it was not multiplexed, it was descheduled > because the filter forced it out. > Note that if the event had been multiplexed while running on the host, > then the scaling would be > appropriate. The scaling will not happen, since both time_enabled and time_running should be the same when there are enough counter resources. > > In that case, I argue, time_running should be updated to cover the > time the event was not running. That would bring us back to the case I > was describing earlier. > > It boils down to the exact definition of exclude_guest and expected > impact on time_enabled > and time_running. Then, with or without vPMU passthru, we can fix the > kernel to ensure a > uniform behavior. I think the time_enabled should be the one that has a controversial definition. Should the time in the guest mode count as the enabled time for an host event that explicitly sets the exclude_guest=1? I think the answer is NO. So I implemented it in the code. > > What are your thoughts on this problem? > Peter, please share your thoughts. We want to make sure the design is on the right track. Thanks, Kan ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC] perf_events: exclude_guest impact on time_enabled/time_running 2024-06-06 15:48 ` Liang, Kan @ 2024-06-06 17:08 ` Stephane Eranian 2024-06-06 17:33 ` Liang, Kan 0 siblings, 1 reply; 6+ messages in thread From: Stephane Eranian @ 2024-06-06 17:08 UTC (permalink / raw) To: Liang, Kan Cc: Peter Zijlstra, LKML, Ian Rogers, Liang, Kan, Andi Kleen, Ingo Molnar, Narayan, Ananth, Bangoria, Ravikumar, Namhyung Kim, Mingwei Zhang, Dapeng Mi, Zhang Xiong On Thu, Jun 6, 2024 at 8:48 AM Liang, Kan <kan.liang@linux.intel.com> wrote: > > > > On 2024-06-06 3:57 a.m., Stephane Eranian wrote: > > Hi Peter, > > > > In the context of the new vPMU passthru patch series, we have to look > > closer at the definition and implementation of the exclude_guest > > filter in the perf_event_attr structure. This filter has been in the > > kernel for many years. See patch: > > https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/ > > > > The presumed definition of the filter is that the user does not want > > the event to count while the processor is running in guest mode (i.e., > > inside the virtual machine guest OS or guest user code). > > > > The perf tool sets is by default on all core PMU events: > > $ perf stat -vv -e cycles sleep 0 > > ------------------------------------------------------------ > > perf_event_attr: > > size 112 > > sample_type IDENTIFIER > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING > > disabled 1 > > inherit 1 > > enable_on_exec 1 > > exclude_guest 1 > > ------------------------------------------------------------ > > > > In the kernel, the way this is treated differs between AMD and Intel because AMD > > does provide a hardware filter for guest vs. host in the PMU counters > > whereas Intel > > does not. For the latter, the kernel simply disables the event in the > > hardware counters, > > i.e., the event is not descheduled. Both approaches produce pretty > > much the same > > desired effect, the event is not counted while in guest mode. > > > > The issue I would like to raise has to do with the effects on > > time_enabled and time_running > > for exclude_guest=1 events. > > > > Given the event is not scheduled out while in guest mode, even though > > it is stopped, both time_enabled and time_running continue ticking > > while in guest mode. If a measurement is 10s > > long but only 5s are in non-guest mode, then time_enabled=10s, > > time_running=10s. The count > > represents 10s worth of non guest mode, of which only 5s were really > > actively monitoring, but > > the user has no way of determining this. > > > For the latest design/implementation, only the exclude_guest=1 host > event can be successfully created for the case. > The end user should not expect that anything is collected in the guest > mode. So both the time_enabled and the time_running will be 5s. > > > > > If we look at vPMU passthru, the host event must have exclude_guest=1 > > to avoid going into > > an error state on context switch to the vCPU thread (with vPMU > > enabled). > > That's the design in V1. There is no error state anymore in V2 and later > as suggested by Sean. > https://lore.kernel.org/all/ZhmIrQQVgblrhCZs@google.com/ > > The VM cannot be created if there are host events with exclude_guest=0. > If a VM has been created, user cannot create an event with > exclude_guest=0. So nothing will be moved to an error state on context > switch to the vCPU thread. > Ok, that's new. > > But this time, > > the event is scheduled out, that means that time_enabled keeps > > counting, but time_running > > stops. On context switch back in, the host event is scheduled again > > and time_running restarts > > ticking. For a 10s measurement, where 5s here in the guest, the event > > will come out with > > time_enabled=10s, time_running=5s, and the tool will scale it up > > because it thinks the event > > was multiplexed, when in fact it was not. This is not the intended > > outcome here. The tool should > > not scale the count, it was not multiplexed, it was descheduled > > because the filter forced it out. > > Note that if the event had been multiplexed while running on the host, > > then the scaling would be > > appropriate. > > The scaling will not happen, since both time_enabled and time_running > should be the same when there are enough counter resources. > If an event with exclude_guest=1 is sched out (event_sched_out), time_enabled will keep running but time_running will stop. Because the code assumes it is stopped because of multiplexing. However, here this is not the case. The event is stopped because the CPU is entering an execution domain that is excluded for the event. Unless you've modified the event_sched_out() code or added code to patch up time_running I don't see how they could be equal. The alternative, as you suggest, is to stop time_enabled. But usually time_enabled is controlled by ENABLE/DISABLE which are different from event_sched_out() and event_sched_in(). > > > > In that case, I argue, time_running should be updated to cover the > > time the event was not running. That would bring us back to the case I > > was describing earlier. > > > > It boils down to the exact definition of exclude_guest and expected > > impact on time_enabled > > and time_running. Then, with or without vPMU passthru, we can fix the > > kernel to ensure a > > uniform behavior. > > I think the time_enabled should be the one that has a controversial > definition. > Should the time in the guest mode count as the enabled time for an host > event that explicitly sets the exclude_guest=1? > > I think the answer is NO. So I implemented it in the code. > > > > > What are your thoughts on this problem? > > > > Peter, please share your thoughts. We want to make sure the design is on > the right track. > > Thanks, > Kan ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC] perf_events: exclude_guest impact on time_enabled/time_running 2024-06-06 17:08 ` Stephane Eranian @ 2024-06-06 17:33 ` Liang, Kan 2024-06-10 17:39 ` Liang, Kan 0 siblings, 1 reply; 6+ messages in thread From: Liang, Kan @ 2024-06-06 17:33 UTC (permalink / raw) To: Stephane Eranian Cc: Peter Zijlstra, LKML, Ian Rogers, Liang, Kan, Andi Kleen, Ingo Molnar, Narayan, Ananth, Bangoria, Ravikumar, Namhyung Kim, Mingwei Zhang, Dapeng Mi, Zhang Xiong On 2024-06-06 1:08 p.m., Stephane Eranian wrote: > On Thu, Jun 6, 2024 at 8:48 AM Liang, Kan <kan.liang@linux.intel.com> wrote: >> >> >> >> On 2024-06-06 3:57 a.m., Stephane Eranian wrote: >>> Hi Peter, >>> >>> In the context of the new vPMU passthru patch series, we have to look >>> closer at the definition and implementation of the exclude_guest >>> filter in the perf_event_attr structure. This filter has been in the >>> kernel for many years. See patch: >>> https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/ >>> >>> The presumed definition of the filter is that the user does not want >>> the event to count while the processor is running in guest mode (i.e., >>> inside the virtual machine guest OS or guest user code). >>> >>> The perf tool sets is by default on all core PMU events: >>> $ perf stat -vv -e cycles sleep 0 >>> ------------------------------------------------------------ >>> perf_event_attr: >>> size 112 >>> sample_type IDENTIFIER >>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING >>> disabled 1 >>> inherit 1 >>> enable_on_exec 1 >>> exclude_guest 1 >>> ------------------------------------------------------------ >>> >>> In the kernel, the way this is treated differs between AMD and Intel because AMD >>> does provide a hardware filter for guest vs. host in the PMU counters >>> whereas Intel >>> does not. For the latter, the kernel simply disables the event in the >>> hardware counters, >>> i.e., the event is not descheduled. Both approaches produce pretty >>> much the same >>> desired effect, the event is not counted while in guest mode. >>> >>> The issue I would like to raise has to do with the effects on >>> time_enabled and time_running >>> for exclude_guest=1 events. >>> >>> Given the event is not scheduled out while in guest mode, even though >>> it is stopped, both time_enabled and time_running continue ticking >>> while in guest mode. If a measurement is 10s >>> long but only 5s are in non-guest mode, then time_enabled=10s, >>> time_running=10s. The count >>> represents 10s worth of non guest mode, of which only 5s were really >>> actively monitoring, but >>> the user has no way of determining this. >> >> >> For the latest design/implementation, only the exclude_guest=1 host >> event can be successfully created for the case. >> The end user should not expect that anything is collected in the guest >> mode. So both the time_enabled and the time_running will be 5s. >> >>> >>> If we look at vPMU passthru, the host event must have exclude_guest=1 >>> to avoid going into >>> an error state on context switch to the vCPU thread (with vPMU >>> enabled). >> >> That's the design in V1. There is no error state anymore in V2 and later >> as suggested by Sean. >> https://lore.kernel.org/all/ZhmIrQQVgblrhCZs@google.com/ >> >> The VM cannot be created if there are host events with exclude_guest=0. >> If a VM has been created, user cannot create an event with >> exclude_guest=0. So nothing will be moved to an error state on context >> switch to the vCPU thread. >> > Ok, that's new. > >>> But this time, >>> the event is scheduled out, that means that time_enabled keeps >>> counting, but time_running >>> stops. On context switch back in, the host event is scheduled again >>> and time_running restarts >>> ticking. For a 10s measurement, where 5s here in the guest, the event >>> will come out with >>> time_enabled=10s, time_running=5s, and the tool will scale it up >>> because it thinks the event >>> was multiplexed, when in fact it was not. This is not the intended >>> outcome here. The tool should >>> not scale the count, it was not multiplexed, it was descheduled >>> because the filter forced it out. >>> Note that if the event had been multiplexed while running on the host, >>> then the scaling would be >>> appropriate. >> >> The scaling will not happen, since both time_enabled and time_running >> should be the same when there are enough counter resources. >> > If an event with exclude_guest=1 is sched out (event_sched_out), time_enabled > will keep running but time_running will stop. Because the code assumes > it is stopped > because of multiplexing. However, here this is not the case. The event > is stopped because > the CPU is entering an execution domain that is excluded for the event. > Unless you've modified the event_sched_out() code or added code to > patch up time_running > I don't see how they could be equal. The alternative, as you suggest, > is to stop time_enabled. > But usually time_enabled is controlled by ENABLE/DISABLE which are > different from > event_sched_out() and event_sched_in(). The entire context are scheduled, ctx_sched_{in,out}. https://lore.kernel.org/all/20240507085807.GS40213@noisy.programming.kicks-ass.net/ The time in the guest mode doesn't be accumulated in the ctx->time, which will be used to calculate the event time later. The event_sched_in() never be failed (assuming enough PMU resources in the host after be back from the guest). The time_enabled and time_running should be equal. Thanks, Kan > > >>> >>> In that case, I argue, time_running should be updated to cover the >>> time the event was not running. That would bring us back to the case I >>> was describing earlier. >>> >>> It boils down to the exact definition of exclude_guest and expected >>> impact on time_enabled >>> and time_running. Then, with or without vPMU passthru, we can fix the >>> kernel to ensure a >>> uniform behavior. >> >> I think the time_enabled should be the one that has a controversial >> definition. >> Should the time in the guest mode count as the enabled time for an host >> event that explicitly sets the exclude_guest=1? >> >> I think the answer is NO. So I implemented it in the code. >> >>> >>> What are your thoughts on this problem? >>> >> >> Peter, please share your thoughts. We want to make sure the design is on >> the right track. >> >> Thanks, >> Kan ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC] perf_events: exclude_guest impact on time_enabled/time_running 2024-06-06 17:33 ` Liang, Kan @ 2024-06-10 17:39 ` Liang, Kan 0 siblings, 0 replies; 6+ messages in thread From: Liang, Kan @ 2024-06-10 17:39 UTC (permalink / raw) To: Stephane Eranian Cc: Peter Zijlstra, LKML, Ian Rogers, Liang, Kan, Andi Kleen, Ingo Molnar, Narayan, Ananth, Bangoria, Ravikumar, Namhyung Kim, Mingwei Zhang, Dapeng Mi, Zhang Xiong On 2024-06-06 1:33 p.m., Liang, Kan wrote: > > > On 2024-06-06 1:08 p.m., Stephane Eranian wrote: >> On Thu, Jun 6, 2024 at 8:48 AM Liang, Kan <kan.liang@linux.intel.com> wrote: >>> >>> >>> >>> On 2024-06-06 3:57 a.m., Stephane Eranian wrote: >>>> Hi Peter, >>>> >>>> In the context of the new vPMU passthru patch series, we have to look >>>> closer at the definition and implementation of the exclude_guest >>>> filter in the perf_event_attr structure. This filter has been in the >>>> kernel for many years. See patch: >>>> https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/ >>>> >>>> The presumed definition of the filter is that the user does not want >>>> the event to count while the processor is running in guest mode (i.e., >>>> inside the virtual machine guest OS or guest user code). >>>> >>>> The perf tool sets is by default on all core PMU events: >>>> $ perf stat -vv -e cycles sleep 0 >>>> ------------------------------------------------------------ >>>> perf_event_attr: >>>> size 112 >>>> sample_type IDENTIFIER >>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING >>>> disabled 1 >>>> inherit 1 >>>> enable_on_exec 1 >>>> exclude_guest 1 >>>> ------------------------------------------------------------ >>>> >>>> In the kernel, the way this is treated differs between AMD and Intel because AMD >>>> does provide a hardware filter for guest vs. host in the PMU counters >>>> whereas Intel >>>> does not. For the latter, the kernel simply disables the event in the >>>> hardware counters, >>>> i.e., the event is not descheduled. Both approaches produce pretty >>>> much the same >>>> desired effect, the event is not counted while in guest mode. >>>> >>>> The issue I would like to raise has to do with the effects on >>>> time_enabled and time_running >>>> for exclude_guest=1 events. >>>> >>>> Given the event is not scheduled out while in guest mode, even though >>>> it is stopped, both time_enabled and time_running continue ticking >>>> while in guest mode. If a measurement is 10s >>>> long but only 5s are in non-guest mode, then time_enabled=10s, >>>> time_running=10s. The count >>>> represents 10s worth of non guest mode, of which only 5s were really >>>> actively monitoring, but >>>> the user has no way of determining this. >>> >>> >>> For the latest design/implementation, only the exclude_guest=1 host >>> event can be successfully created for the case. >>> The end user should not expect that anything is collected in the guest >>> mode. So both the time_enabled and the time_running will be 5s. >>> >>>> >>>> If we look at vPMU passthru, the host event must have exclude_guest=1 >>>> to avoid going into >>>> an error state on context switch to the vCPU thread (with vPMU >>>> enabled). >>> >>> That's the design in V1. There is no error state anymore in V2 and later >>> as suggested by Sean. >>> https://lore.kernel.org/all/ZhmIrQQVgblrhCZs@google.com/ >>> >>> The VM cannot be created if there are host events with exclude_guest=0. >>> If a VM has been created, user cannot create an event with >>> exclude_guest=0. So nothing will be moved to an error state on context >>> switch to the vCPU thread. >>> >> Ok, that's new. >> >>>> But this time, >>>> the event is scheduled out, that means that time_enabled keeps >>>> counting, but time_running >>>> stops. On context switch back in, the host event is scheduled again >>>> and time_running restarts >>>> ticking. For a 10s measurement, where 5s here in the guest, the event >>>> will come out with >>>> time_enabled=10s, time_running=5s, and the tool will scale it up >>>> because it thinks the event >>>> was multiplexed, when in fact it was not. This is not the intended >>>> outcome here. The tool should >>>> not scale the count, it was not multiplexed, it was descheduled >>>> because the filter forced it out. >>>> Note that if the event had been multiplexed while running on the host, >>>> then the scaling would be >>>> appropriate. >>> >>> The scaling will not happen, since both time_enabled and time_running >>> should be the same when there are enough counter resources. >>> >> If an event with exclude_guest=1 is sched out (event_sched_out), time_enabled >> will keep running but time_running will stop. Because the code assumes >> it is stopped >> because of multiplexing. However, here this is not the case. The event >> is stopped because >> the CPU is entering an execution domain that is excluded for the event. >> Unless you've modified the event_sched_out() code or added code to >> patch up time_running >> I don't see how they could be equal. The alternative, as you suggest, >> is to stop time_enabled. >> But usually time_enabled is controlled by ENABLE/DISABLE which are >> different from >> event_sched_out() and event_sched_in(). > > The entire context are scheduled, ctx_sched_{in,out}. > https://lore.kernel.org/all/20240507085807.GS40213@noisy.programming.kicks-ass.net/ > > The time in the guest mode doesn't be accumulated in the ctx->time, > which will be used to calculate the event time later. > The ctx->time is shared among PMUs. The core PMU doesn't want the time in the guest mode, but other PMUs, e.g., uncore, still need it. So we do need to track the time in the guest mode. Something as below is planed to be implemented in the V3. The host ctx should keeps active in the guest mode as well. The event of core PMU will deduct the time of the guest mode. https://lore.kernel.org/linux-perf-users/902c40cc-6e0b-4b2f-826c-457f533a0a76@linux.intel.com/ For example, perf stat -e cpu/cycles/H,uncore_cha_0/event=0x1/ -C0 -vvv sleep 16 (when the perf command runs, a guest runs ~8s at the same CPU.) cpu/cycles/H: 55426106 7813838542 7813838542 uncore_cha_0/event=0x1/: 14302016384 16004903742 16004903742 Performance counter stats for 'CPU(s) 0': 55,426,106 cpu/cycles/H 14,302,016,384 uncore_cha_0/event=0x1/ 16.034896147 seconds time elapsed The time_enabled and time_running for cycles is 7813838542. For uncore_cha_0/event=0x1/, it's 16004903742. Thanks, Kan > The event_sched_in() never be failed (assuming enough PMU resources in > the host after be back from the guest). The time_enabled and > time_running should be equal. > > Thanks, > Kan >> >> >>>> >>>> In that case, I argue, time_running should be updated to cover the >>>> time the event was not running. That would bring us back to the case I >>>> was describing earlier. >>>> >>>> It boils down to the exact definition of exclude_guest and expected >>>> impact on time_enabled >>>> and time_running. Then, with or without vPMU passthru, we can fix the >>>> kernel to ensure a >>>> uniform behavior. >>> >>> I think the time_enabled should be the one that has a controversial >>> definition. >>> Should the time in the guest mode count as the enabled time for an host >>> event that explicitly sets the exclude_guest=1? >>> >>> I think the answer is NO. So I implemented it in the code. >>> >>>> >>>> What are your thoughts on this problem? >>>> >>> >>> Peter, please share your thoughts. We want to make sure the design is on >>> the right track. >>> >>> Thanks, >>> Kan > ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC] perf_events: exclude_guest impact on time_enabled/time_running 2024-06-06 7:57 [RFC] perf_events: exclude_guest impact on time_enabled/time_running Stephane Eranian 2024-06-06 15:48 ` Liang, Kan @ 2024-06-10 20:54 ` Peter Zijlstra 1 sibling, 0 replies; 6+ messages in thread From: Peter Zijlstra @ 2024-06-10 20:54 UTC (permalink / raw) To: Stephane Eranian Cc: LKML, Ian Rogers, Liang, Kan, Andi Kleen, Ingo Molnar, Narayan, Ananth, Bangoria, Ravikumar, Namhyung Kim, Mingwei Zhang, Dapeng Mi, Zhang Xiong On Thu, Jun 06, 2024 at 12:57:35AM -0700, Stephane Eranian wrote: > Hi Peter, > > In the context of the new vPMU passthru patch series, we have to look > closer at the definition and implementation of the exclude_guest > filter in the perf_event_attr structure. This filter has been in the > kernel for many years. See patch: > https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/ > > The presumed definition of the filter is that the user does not want > the event to count while the processor is running in guest mode (i.e., > inside the virtual machine guest OS or guest user code). > > The perf tool sets is by default on all core PMU events: > $ perf stat -vv -e cycles sleep 0 > ------------------------------------------------------------ > perf_event_attr: > size 112 > sample_type IDENTIFIER > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING > disabled 1 > inherit 1 > enable_on_exec 1 > exclude_guest 1 > ------------------------------------------------------------ > > In the kernel, the way this is treated differs between AMD and Intel > because AMD does provide a hardware filter for guest vs. host in the > PMU counters whereas Intel does not. For the latter, the kernel > simply disables the event in the hardware counters, i.e., the event is > not descheduled. Both approaches produce pretty much the same desired > effect, the event is not counted while in guest mode. > > The issue I would like to raise has to do with the effects on > time_enabled and time_running for exclude_guest=1 events. > > Given the event is not scheduled out while in guest mode, even though > it is stopped, both time_enabled and time_running continue ticking > while in guest mode. If a measurement is 10s long but only 5s are in > non-guest mode, then time_enabled=10s, time_running=10s. The count > represents 10s worth of non guest mode, of which only 5s were really > actively monitoring, but the user has no way of determining this. > > If we look at vPMU passthru, the host event must have exclude_guest=1 > to avoid going into an error state on context switch to the vCPU > thread (with vPMU enabled). But this time, the event is scheduled out, > that means that time_enabled keeps counting, but time_running stops. > On context switch back in, the host event is scheduled again and > time_running restarts ticking. For a 10s measurement, where 5s here in > the guest, the event will come out with time_enabled=10s, > time_running=5s, and the tool will scale it up because it thinks the > event was multiplexed, when in fact it was not. This is not the > intended outcome here. The tool should not scale the count, it was not > multiplexed, it was descheduled because the filter forced it out. > Note that if the event had been multiplexed while running on the host, > then the scaling would be appropriate. > > In that case, I argue, time_running should be updated to cover the > time the event was not running. That would bring us back to the case I > was describing earlier. > > It boils down to the exact definition of exclude_guest and expected > impact on time_enabled and time_running. Then, with or without vPMU > passthru, we can fix the kernel to ensure a uniform behavior. > > What are your thoughts on this problem? So with those patches having explicit scheduling points, we can actually do this time accounting accurately, so I don't see a reason to not do the right thing here. Hysterically this was left vague in order to be able to avoid the scheduling for these scenarios -- performance raisins etc. The thing is, if you push this to its limits, we should start time accounting for the ring selectors too, and that's going to be painful. ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2024-06-10 20:54 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-06-06 7:57 [RFC] perf_events: exclude_guest impact on time_enabled/time_running Stephane Eranian 2024-06-06 15:48 ` Liang, Kan 2024-06-06 17:08 ` Stephane Eranian 2024-06-06 17:33 ` Liang, Kan 2024-06-10 17:39 ` Liang, Kan 2024-06-10 20:54 ` Peter Zijlstra
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox