* vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs
From: Maxim Levitsky @ 2024-10-19  0:48 UTC
To: kvm; +Cc: linux-kernel, Sean Christopherson

Hi,

Our CI found another issue, this time with vmx_pmu_caps_test.

On 'Intel(R) Xeon(R) Gold 6328HL CPU' I see that all LBR MSRs (from/to and
TOS) are always read-only, even when LBR is disabled: once I disable the
feature in DEBUG_CTL, all LBR MSRs reset to 0, and you can't change their
values manually. Freeze LBRs on PMI does not seem to affect this behavior.

I don't know if this is how the hardware is supposed to work (Intel's manual
doesn't mention anything about this), or if it is something platform
specific, because this system was also found to have LBRs enabled
(IA32_DEBUGCTL.LBR == 1) after a fresh boot, as if the BIOS left them
enabled. I have no idea why.

The problem is that vmx_pmu_caps_test writes 0 to LBR_TOS via KVM_SET_MSRS,
and KVM passes this write through to the actual hardware MSR (this is
somewhat weird). Since the MSR is not writable and silently drops writes,
the test gets some arbitrary value when it reads the MSR back.

Any advice?

Best regards,
	Maxim Levitsky

* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs
From: Sean Christopherson @ 2024-10-28 15:55 UTC
To: Maxim Levitsky; +Cc: kvm, linux-kernel

On Fri, Oct 18, 2024, Maxim Levitsky wrote:
> The problem is that vmx_pmu_caps_test writes 0 to LBR_TOS via KVM_SET_MSRS,
> and KVM actually passes this write to actual hardware msr (this is somewhat
> weird),

When the "virtual" LBR event is active in host perf, the LBR MSRs are passed
through to the guest, and so KVM needs to propagate the guest values into hardware.

> and since the MSR is not writable and silently drops writes instead,
> once the test tries to read it, it gets some random value instead.

This just showed up in our testing too (delayed backport on our end).  I haven't
(yet) tried debugging our setup, but is there any chance Intel PT is interfering?

  33.3.1.2 Model Specific Capability Restrictions
  Some processor generations impose restrictions that prevent use of
  LBRs/BTS/BTM/LERs when software has enabled tracing with Intel Processor Trace.
  On these processors, when TraceEn is set, updates of LBR, BTS, BTM, LERs are
  suspended but the states of the corresponding IA32_DEBUGCTL control fields
  remained unchanged as if it were still enabled. When TraceEn is cleared, the
  LBR array is reset, and LBR/BTS/BTM/LERs updates will resume.
  Further, reads of these registers will return 0, and writes will be dropped.

  The list of MSRs whose updates/accesses are restricted follows.

    • MSR_LASTBRANCH_x_TO_IP, MSR_LASTBRANCH_x_FROM_IP, MSR_LBR_INFO_x, MSR_LASTBRANCH_TOS
    • MSR_LER_FROM_LIP, MSR_LER_TO_LIP
    • MSR_LBR_SELECT

  For processors with CPUID DisplayFamily_DisplayModel signatures of 06_3DH,
  06_47H, 06_4EH, 06_4FH, 06_56H, and 06_5EH, the use of Intel PT and LBRs are
  mutually exclusive.

If Intel PT is NOT responsible, i.e. the behavior really is due to DEBUG_CTL.LBR=0,
then I don't see how KVM can sanely virtualize LBRs.
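
For reference, the model list in the SDM excerpt above can be turned into a small lookup. This is only a sketch: pt_lbr_mutually_exclusive() and the table name are made up for illustration and are not kernel APIs. It also assumes the Xeons discussed in this thread report DisplayFamily_DisplayModel 06_55H, as Skylake-SP derived server parts generally do; check /proc/cpuinfo on the actual machine before relying on that.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * DisplayModel values (family 0x06) that the quoted SDM section lists as
 * having mutually exclusive Intel PT and LBRs.
 */
static const uint8_t pt_lbr_exclusive_models[] = {
	0x3D, 0x47, 0x4E, 0x4F, 0x56, 0x5E,
};

/* Hypothetical helper for illustration only, not a kernel function. */
static bool pt_lbr_mutually_exclusive(uint8_t family, uint8_t model)
{
	size_t i;
	size_t n = sizeof(pt_lbr_exclusive_models) /
		   sizeof(pt_lbr_exclusive_models[0]);

	if (family != 0x06)
		return false;

	for (i = 0; i < n; i++) {
		if (pt_lbr_exclusive_models[i] == model)
			return true;
	}
	return false;
}
```

Under the 06_55H assumption, pt_lbr_mutually_exclusive(0x06, 0x55) returns false, i.e. the listed restriction would not by itself explain the behavior on those parts.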

* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs
From: Maxim Levitsky @ 2024-11-03 23:32 UTC
To: Sean Christopherson; +Cc: kvm, linux-kernel

On Mon, 2024-10-28 at 08:55 -0700, Sean Christopherson wrote:
> On Fri, Oct 18, 2024, Maxim Levitsky wrote:
> > The problem is that vmx_pmu_caps_test writes 0 to LBR_TOS via KVM_SET_MSRS,
> > and KVM actually passes this write to actual hardware msr (this is somewhat
> > weird),
>
> When the "virtual" LBR event is active in host perf, the LBR MSRs are passed
> through to the guest, and so KVM needs to propagate the guest values into hardware.

Yes, but usually KVM_SET_MSRS doesn't touch hardware directly, even for
registers/MSRs that are passed through. Instead, the relevant values are
loaded when the guest vCPU is loaded and/or when the guest is entered.
I don't know the details though.

> This just showed up in our testing too (delayed backport on our end).  I haven't
> (yet) tried debugging our setup, but is there any chance Intel PT is interfering?
>
> ...
>
> If Intel PT is NOT responsible, i.e. the behavior really is due to DEBUG_CTL.LBR=0,
> then I don't see how KVM can sanely virtualize LBRs.

Hi!

I will check the influence of PT soon, but to me it looks like the hardware
implementation has changed. It is just too consistent:

When DEBUG_CTL.LBR=1, the LBRs do work. I see all the registers update,
although TOS seems to be stuck at one value most of the time; it does change
sometimes, and it is non-zero.

The FROM/TO MSRs show a healthy amount of updates.

Note that I read all MSRs using the 'rdmsr' userspace tool.

However, as soon as I disable DEBUG_CTL.LBR, all these MSRs reset to 0 and
can't be changed.

I'll check this on another Skylake based machine and see if I see the same thing.

Best regards,
	Maxim Levitsky

* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs
From: Maxim Levitsky @ 2024-11-22  3:35 UTC
To: Sean Christopherson; +Cc: kvm, linux-kernel

On Sun, 2024-11-03 at 18:32 -0500, Maxim Levitsky wrote:
> I will check PT influence soon, but to me it looks like the hardware implementation has changed.
> It is just too consistent:
>
> ...
>
> However as soon as I disable DEBUG_CTL.LBR, all these MSRs reset to 0, and can't be changed.
>
> I'll check this on another Skylake based machine and see if I see the same thing.

Hi,

I tested this on another Skylake based machine (Intel(R) Xeon(R) Silver 4214)
and I see the same behavior. LBR_TOS is read-only:

It is 0 when LBRs are disabled in DEBUG_CTL, and running (changes all the
time, as expected) when LBRs are enabled in DEBUG_CTL.

IA32_RTIT_CTL.TraceEn is disabled (MSR 0x570 is 0).

Also, on this machine the BIOS didn't leave LBRs running.

I guess we need to at least disable the check in the test, or at least
speak with someone from Intel to clarify what is going on.

What do you think?

Best regards,
	Maxim Levitsky
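
The checks described above come down to a few MSR bits. Below is a minimal sketch of decoding them from raw values, e.g. as printed by 'rdmsr'. The macro and function names are illustrative only and are not the kernel's definitions. The MSR addresses and bit positions are taken from the SDM: IA32_DEBUGCTL is MSR 0x1D9 with LBR in bit 0 and FREEZE_LBRS_ON_PMI in bit 11, and IA32_RTIT_CTL is MSR 0x570 with TraceEn in bit 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSR addresses, listed for reference when reading raw values. */
#define MSR_IA32_DEBUGCTL		0x1D9
#define MSR_IA32_RTIT_CTL		0x570

/* Illustrative bit names, not the kernel's macro names. */
#define DEBUGCTL_LBR			(1ULL << 0)
#define DEBUGCTL_FREEZE_LBRS_ON_PMI	(1ULL << 11)
#define RTIT_CTL_TRACEEN		(1ULL << 0)

/* True if last-branch recording is enabled in a raw IA32_DEBUGCTL value. */
static bool lbr_enabled(uint64_t debugctl)
{
	return debugctl & DEBUGCTL_LBR;
}

/* True if LBRs are set to freeze on a performance monitoring interrupt. */
static bool lbr_freeze_on_pmi(uint64_t debugctl)
{
	return debugctl & DEBUGCTL_FREEZE_LBRS_ON_PMI;
}

/* True if Intel PT tracing is enabled in a raw IA32_RTIT_CTL value. */
static bool pt_tracing_enabled(uint64_t rtit_ctl)
{
	return rtit_ctl & RTIT_CTL_TRACEEN;
}
```

With MSR 0x570 reading 0, as reported above, pt_tracing_enabled(0) is false, which matches the statement that TraceEn is disabled.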

* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs
From: Maxim Levitsky @ 2024-12-14  0:20 UTC
To: Sean Christopherson; +Cc: kvm, linux-kernel

On Thu, 2024-11-21 at 22:35 -0500, Maxim Levitsky wrote:
> I guess we need to at least disable the check in the unit test or at least
> speak with someone from Intel to clarify on what is going on.

Any update on this?

Best regards,
	Maxim Levitsky

* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs
From: Maxim Levitsky @ 2025-01-21 22:56 UTC
To: Sean Christopherson; +Cc: kvm, linux-kernel

On Fri, 2024-12-13 at 19:20 -0500, Maxim Levitsky wrote:
> Any update on this?

Hi,

I hate to sound like a broken record, but any update on this?

Best regards,
	Maxim Levitsky

* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs
From: Sean Christopherson @ 2025-01-22  1:02 UTC
To: Maxim Levitsky; +Cc: kvm, linux-kernel

On Sun, Nov 03, 2024, Maxim Levitsky wrote:
> On Mon, 2024-10-28 at 08:55 -0700, Sean Christopherson wrote:
> > On Fri, Oct 18, 2024, Maxim Levitsky wrote:
> > > Our CI found another issue, this time with vmx_pmu_caps_test.
> > >
> > > On 'Intel(R) Xeon(R) Gold 6328HL CPU' I see that all LBR msrs (from/to and
> > > TOS), are always read only - even when LBR is disabled - once I disable the
> > > feature in DEBUG_CTL, all LBR msrs reset to 0, and you can't change their
> > > value manually. Freeze LBRS on PMI seems not to affect this behavior.

...

> When DEBUG_CTL.LBR=1, the LBRs do work, I see all the registers update,
> although TOS does seem to be stuck at one value, but it does change
> sometimes, and it's non zero.
>
> The FROM/TO do show healthy amount of updates
>
> Note that I read all msrs using 'rdmsr' userspace tool.

I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work.  I
assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that
down to confirm (the code I see on inspection is gated on at least one perf
event using LBRs).  My guess is that there's a software bug somewhere in the
perf/KVM exchange.

I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM
to read/write all LBRs during initialization works with LBRs disabled.

---
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f72835e85b6d..c68a5a79c668 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7907,6 +7907,8 @@ static __init u64 vmx_get_perf_capabilities(void)
 {
 	u64 perf_cap = PMU_CAP_FW_WRITES;
 	u64 host_perf_cap = 0;
+	u64 debugctl, val;
+	int i;
 
 	if (!enable_pmu)
 		return 0;
@@ -7954,6 +7956,39 @@ static __init u64 vmx_get_perf_capabilities(void)
 		perf_cap &= ~PERF_CAP_PEBS_BASELINE;
 	}
 
+	if (!vmx_lbr_caps.nr) {
+		pr_warn("Uh, what?  No LBRs...\n");
+		goto out;
+	}
+
+	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+	if (debugctl & DEBUGCTLMSR_LBR) {
+		pr_warn("Huh, LBRs enabled at KVM load?  debugctl = %llx\n", debugctl);
+		wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl & ~DEBUGCTLMSR_LBR);
+	}
+
+	for (i = 0; i < vmx_lbr_caps.nr; i++) {
+		wrmsrl(vmx_lbr_caps.from + i, 0xbeef0000 + i);
+		wrmsrl(vmx_lbr_caps.to + i, 0xcafe0000 + i);
+	}
+
+	for (i = 0; i < vmx_lbr_caps.nr; i++) {
+		rdmsrl(vmx_lbr_caps.from + i, val);
+		if (val != 0xbeef0000 + i)
+			pr_warn("MSR 0x%x Expected %x, got %llx\n",
+				vmx_lbr_caps.from + i, 0xbeef0000 + i, val);
+		rdmsrl(vmx_lbr_caps.to + i, val);
+		if (val != 0xcafe0000 + i)
+			pr_warn("MSR 0x%x Expected %x, got %llx\n",
+				vmx_lbr_caps.to + i, 0xcafe0000 + i, val);
+	}
+
+	pr_warn("Done validating %u from/to LBRs\n", vmx_lbr_caps.nr);
+
+	if (debugctl & DEBUGCTLMSR_LBR)
+		wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+
+out:
 	return perf_cap;
 }
 
--

And given that perf explicitly disables LBRs (see __intel_pmu_lbr_disable())
before reading LBR MSRs (see intel_pmu_lbr_read()) when taking a snapshot, and
AFAIK no one has complained, I would be very surprised if this is hardware doing
something odd.

---
static noinline int
__intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries,
				  unsigned int cnt, unsigned long flags)
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

	intel_pmu_lbr_read();
	cnt = min_t(unsigned int, cnt, x86_pmu.lbr_nr);

	memcpy(entries, cpuc->lbr_entries, sizeof(struct perf_branch_entry) * cnt);
	intel_pmu_enable_all(0);
	local_irq_restore(flags);
	return cnt;
}

static int
intel_pmu_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
{
	unsigned long flags;

	/* must not have branches... */
	local_irq_save(flags);
	__intel_pmu_disable_all(false); /* we don't care about BTS */
	__intel_pmu_lbr_disable();
	/*            ... until here */
	return __intel_pmu_snapshot_branch_stack(entries, cnt, flags);
}
---
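
The write-then-read-back idea in the debug patch can be tried out in user space against a mock register file. This is a sketch under stated assumptions, not kernel code: struct mock_msrs, mock_wrmsr(), mock_rdmsr() and count_readback_mismatches() are invented names, and the drop_writes flag merely imitates an MSR that silently ignores writes, which is the behavior reported earlier in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_NR_LBR 8	/* arbitrary size for the mock, not a hardware value */

/* Mock register file standing in for the LBR FROM MSRs. */
struct mock_msrs {
	uint64_t val[MOCK_NR_LBR];
	bool drop_writes;	/* imitate an MSR that silently ignores writes */
};

static void mock_wrmsr(struct mock_msrs *m, unsigned int idx, uint64_t v)
{
	if (!m->drop_writes)
		m->val[idx] = v;
}

static uint64_t mock_rdmsr(const struct mock_msrs *m, unsigned int idx)
{
	return m->val[idx];
}

/*
 * Write a distinct pattern to every register, then read each one back and
 * count how many did not retain the written value.
 */
static unsigned int count_readback_mismatches(struct mock_msrs *m,
					      uint64_t pattern)
{
	unsigned int i, bad = 0;

	for (i = 0; i < MOCK_NR_LBR; i++)
		mock_wrmsr(m, i, pattern + i);

	for (i = 0; i < MOCK_NR_LBR; i++) {
		if (mock_rdmsr(m, i) != pattern + i)
			bad++;
	}
	return bad;
}
```

A writable mock reports zero mismatches, while one with drop_writes set reports a mismatch for every register; the latter is what a truly read-only LBR would look like to the loop in the patch.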

* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs
From: Maxim Levitsky @ 2025-01-22 16:36 UTC
To: Sean Christopherson; +Cc: kvm, linux-kernel

On Tue, 2025-01-21 at 17:02 -0800, Sean Christopherson wrote:
> I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work.  I
> assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that
> down to confirm (the code I see on inspection is gated on at least one perf
> event using LBRs).  My guess is that there's a software bug somewhere in the
> perf/KVM exchange.
>
> I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM
> to read/write all LBRs during initialization works with LBRs disabled.

Hi,

OK, this is a very good piece of the puzzle.

I didn't expect a context switch to interfere with this, because I thought
that the perf code won't touch LBRs if they are not in use.
The rdmsr/wrmsr programs don't do much except execute the instruction in kernel space.

Is it then possible that the fact that LBRs were left enabled by the BIOS is
the culprit?

This particular test never enables LBRs, nor does anything else in the system.
I'll do some more code digging and see if I find anything odd.

Thanks for the info,
Best regards,
	Maxim Levitsky
* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs 2025-01-22 16:36 ` Maxim Levitsky @ 2025-01-22 21:02 ` Sean Christopherson 2025-01-24 23:36 ` Maxim Levitsky 0 siblings, 1 reply; 12+ messages in thread From: Sean Christopherson @ 2025-01-22 21:02 UTC (permalink / raw) To: Maxim Levitsky; +Cc: kvm, linux-kernel On Wed, Jan 22, 2025, Maxim Levitsky wrote: > On Tue, 2025-01-21 at 17:02 -0800, Sean Christopherson wrote: > > On Sun, Nov 03, 2024, Maxim Levitsky wrote: > > > On Mon, 2024-10-28 at 08:55 -0700, Sean Christopherson wrote: > > > > On Fri, Oct 18, 2024, Maxim Levitsky wrote: > > > > > Our CI found another issue, this time with vmx_pmu_caps_test. > > > > > > > > > > On 'Intel(R) Xeon(R) Gold 6328HL CPU' I see that all LBR msrs (from/to and > > > > > TOS), are always read only - even when LBR is disabled - once I disable the > > > > > feature in DEBUG_CTL, all LBR msrs reset to 0, and you can't change their > > > > > value manually. Freeze LBRS on PMI seems not to affect this behavior. > > > > ... > > > > > When DEBUG_CTL.LBR=1, the LBRs do work, I see all the registers update, > > > although TOS does seem to be stuck at one value, but it does change > > > sometimes, and it's non zero. > > > > > > The FROM/TO do show healthy amount of updates > > > > > > Note that I read all msrs using 'rdmsr' userspace tool. > > > > I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work. I > > assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that > > down to confirm (the code I see on inspecition is gated on at least one perf > > event using LBRs). My guess is that there's a software bug somewhere in the > > perf/KVM exchange. > > > > I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM > > to read/write all LBRs during initialization works with LBRs disabled. > > Hi, > > OK, this is a very good piece of the puzzle. 
> > I didn't expect context switch to interfere with this because I thought that > perf code won't touch LBRs if they are not in use. > rdmsr/wrmsr programs don't do much except doing the instruction in the kernel space. > > Is it then possible that the the fact that LBRs were left enabled by BIOS is the > culprit of the problem? > > This particular test never enables LBRs, not anything in the system does this, Ugh, but it does. On writes to any LBR, including LBR_TOS, KVM creates a "virtual" LBR perf event. KVM then relies on perf to context switch LBR MSRs, i.e. relies on perf to load the guest's values into hardware. At least, I think that's what is supposed to happen. AFAIK, the perf-based LBR support has never been properly documented[*]. Anyways, my understanding of intel_pmu_handle_lbr_msrs_access() is that if the vCPU's LBR perf event is scheduled out or can't be created, the guest's value is effectively lost. Again, I don't know the "rules" for the LBR perf event, but it wouldn't surprise me if your CI fails because something in the host conflicts with KVM's LBR perf event. [*] https://lore.kernel.org/all/Y9RUOvJ5dkCU9J8C@google.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs 2025-01-22 21:02 ` Sean Christopherson @ 2025-01-24 23:36 ` Maxim Levitsky 2025-01-25 0:12 ` Sean Christopherson 0 siblings, 1 reply; 12+ messages in thread From: Maxim Levitsky @ 2025-01-24 23:36 UTC (permalink / raw) To: Sean Christopherson; +Cc: kvm, linux-kernel On Wed, 2025-01-22 at 13:02 -0800, Sean Christopherson wrote: > On Wed, Jan 22, 2025, Maxim Levitsky wrote: > > On Tue, 2025-01-21 at 17:02 -0800, Sean Christopherson wrote: > > > On Sun, Nov 03, 2024, Maxim Levitsky wrote: > > > > On Mon, 2024-10-28 at 08:55 -0700, Sean Christopherson wrote: > > > > > On Fri, Oct 18, 2024, Maxim Levitsky wrote: > > > > > > Our CI found another issue, this time with vmx_pmu_caps_test. > > > > > > > > > > > > On 'Intel(R) Xeon(R) Gold 6328HL CPU' I see that all LBR msrs (from/to and > > > > > > TOS), are always read only - even when LBR is disabled - once I disable the > > > > > > feature in DEBUG_CTL, all LBR msrs reset to 0, and you can't change their > > > > > > value manually. Freeze LBRS on PMI seems not to affect this behavior. > > > > > > ... > > > > > > > When DEBUG_CTL.LBR=1, the LBRs do work, I see all the registers update, > > > > although TOS does seem to be stuck at one value, but it does change > > > > sometimes, and it's non zero. > > > > > > > > The FROM/TO do show healthy amount of updates > > > > > > > > Note that I read all msrs using 'rdmsr' userspace tool. > > > > > > I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work. I > > > assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that > > > down to confirm (the code I see on inspecition is gated on at least one perf > > > event using LBRs). My guess is that there's a software bug somewhere in the > > > perf/KVM exchange. 
> > > > > > I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM > > > to read/write all LBRs during initialization works with LBRs disabled. Hi! I finally got to the bottom of this: First of all, your assumption that the kernel resets the LBR-related MSRs on context switch after the 'wrmsr' program finishes is wrong, because the kernel only does this if it *itself* enabled the LBR feature (that is, when something like 'perf' uses a perf counter with an LBR call stack). The kernel doesn't expect the writes done by the 'wrmsr' tool, so it doesn't do anything in this case. What is happening instead is something completely different: it turns out that, to shave something like 50 nanoseconds off the deep C-state entry/exit latency, some Intel CPUs don't preserve LBR stack values across these C-state entries. The kernel PMU code even has special code to work around this. So, right after the 'wrmsr' program finishes on an otherwise idle host, the CPU enters a low-power state and 'poof', the LBR state is gone. To see this for yourself, just disable C-states: # cpupower idle-set --disable-by-latency 0 And suddenly rdmsr/wrmsr reads/writes of the LBR stack start to work as expected. In particular, this also explains why I had no problems reading/writing LBR stack MSRs on some older CPUs. > > > > Hi, > > > > OK, this is a very good piece of the puzzle. > > > > I didn't expect context switch to interfere with this because I thought that > > perf code won't touch LBRs if they are not in use. > > rdmsr/wrmsr programs don't do much except doing the instruction in the kernel space. > > > > Is it then possible that the the fact that LBRs were left enabled by BIOS is the > > culprit of the problem? > > > > This particular test never enables LBRs, not anything in the system does this, > > Ugh, but it does. On writes to any LBR, including LBR_TOS, KVM creates a "virtual" > LBR perf event.
KVM then relies on perf to context switch LBR MSRs, i.e. relies > on perf to load the guest's values into hardware. At least, I think that's what > is supposed to happen. AFAIK, the perf-based LBR support has never been properly > document[*]. > > Anyways, my understanding of intel_pmu_handle_lbr_msrs_access() is that if the > vCPU's LBR perf event is scheduled out or can't be created, the guest's value is > effectively lost. Again, I don't know the "rules" for the LBR perf event, but > it wouldn't suprise me if your CI fails because something in the host conflicts > with KVM's LBR perf event. Actually you are partially wrong here too (although BIOS can be considered 'something on the host'). I was able to prove that the reason why the unit test fails *is* because BIOS left LBRs enabled: First of all, setting LBR bit manually in DEBUG_CTL does trigger this bug (I use a different machine now, which doesn't have the bios bug): # wrmsr -a 0x1d9 0x4001 # ./x86_64/vmx_pmu_caps_test Random seed: 0x6b8b4567 TAP version 13 1..6 # Starting 6 tests from 1 test cases. # RUN vmx_pmu_caps.guest_wrmsr_perf_capabilities ... # OK vmx_pmu_caps.guest_wrmsr_perf_capabilities ok 1 vmx_pmu_caps.guest_wrmsr_perf_capabilities # RUN vmx_pmu_caps.basic_perf_capabilities ... # OK vmx_pmu_caps.basic_perf_capabilities ok 2 vmx_pmu_caps.basic_perf_capabilities # RUN vmx_pmu_caps.fungible_perf_capabilities ... # OK vmx_pmu_caps.fungible_perf_capabilities ok 3 vmx_pmu_caps.fungible_perf_capabilities # RUN vmx_pmu_caps.immutable_perf_capabilities ... # OK vmx_pmu_caps.immutable_perf_capabilities ok 4 vmx_pmu_caps.immutable_perf_capabilities # RUN vmx_pmu_caps.lbr_perf_capabilities ... 
==== Test Assertion Failure ==== x86_64/vmx_pmu_caps_test.c:202: r == v pid=8415 tid=8415 errno=0 - Success 1 0x0000000000404301: __suite_lbr_perf_capabilities at vmx_pmu_caps_test.c:202 2 (inlined by) vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194 3 (inlined by) wrapper_vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194 4 0x000000000040511a: __run_test at kselftest_harness.h:1240 5 0x0000000000402b95: test_harness_run at kselftest_harness.h:1310 6 (inlined by) main at vmx_pmu_caps_test.c:246 7 0x00007f56ba2295cf: ?? ??:0 8 0x00007f56ba22967f: ?? ??:0 9 0x0000000000402e44: _start at ??:? Set MSR_LBR_TOS to '0x7', got back '0xc' # lbr_perf_capabilities: Test failed # FAIL vmx_pmu_caps.lbr_perf_capabilities not ok 5 vmx_pmu_caps.lbr_perf_capabilities # RUN vmx_pmu_caps.perf_capabilities_unsupported ... # OK vmx_pmu_caps.perf_capabilities_unsupported ok 6 vmx_pmu_caps.perf_capabilities_unsupported # FAILED: 5 / 6 tests passed. # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0 Secondly, I went over all places in the kernel that touch DEBUG_CTL, and all of them take care to preserve it and only set/clear specific bits. __intel_pmu_lbr_enable() and __intel_pmu_lbr_disable() are practically the only two places where the DEBUGCTLMSR_LBR bit is touched, and the test doesn't trigger them, most likely because the test uses the special 'INTEL_FIXED_VLBR_EVENT' perf event (see intel_pmu_create_guest_lbr_event), which is not enabled while in host mode. To double check this, I traced all writes to the DEBUG_CTL msr during this test, and the only write is done during the 'guest_wrmsr_perf_capabilities' subtest, by vmx_vcpu_run(), which just restores the value that the msr had prior to VM entry. So, why does the value that BIOS sets survive? Because, as I said, all code that touches DEBUG_CTL takes care to preserve all bits but the one being changed, LBRs are never enabled on the host, and even the guest entry preserves host DEBUG_CTL. Therefore the value written by BIOS survives.
So we end up with the test writing to LBR_TOS while LBRs are unexpectedly enabled and the hardware is still updating them, so it's not a surprise that when the test reads back the value written, it differs, and the test rightfully fails. Since we have seen this in CI, and you saw it too in your CI, I think this BIOS bug is not that rare, and so I suggest sticking 'wrmsrl(MSR_IA32_DEBUGCTLMSR, 0)' somewhere early in the kernel boot code, or at least clearing the DEBUGCTLMSR_LBR bit. I haven't found a very good place to put this, in a way that I can be sure that x86 maintainers won't reject it, so I am open to your suggestions. Best regards, Maxim Levitsky > > [*] https://lore.kernel.org/all/Y9RUOvJ5dkCU9J8C@google.com > ^ permalink raw reply [flat|nested] 12+ messages in thread
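To illustrate the read-modify-write argument above, here is a minimal userspace sketch (not kernel code) of why a stray LBR bit survives every well-behaved update of DEBUG_CTL. The helper names are hypothetical. LBR at bit 0 and BTF at bit 1 match the defines quoted in this thread; FREEZE_IN_SMM at bit 14 is from memory of msr-index.h and should be treated as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBUGCTL_LBR           (1ULL << 0)  /* last branch recording */
#define DEBUGCTL_BTF           (1ULL << 1)  /* single-step on branches */
#define DEBUGCTL_FREEZE_IN_SMM (1ULL << 14) /* assumed bit position */

/* Hypothetical helper: is LBR recording enabled in this DEBUG_CTL value? */
bool debugctl_lbr_enabled(uint64_t debugctl)
{
	return (debugctl & DEBUGCTL_LBR) != 0;
}

/*
 * Hypothetical helper modelling a careful read-modify-write: only the
 * requested bits change, every other bit of the current value is kept.
 */
uint64_t debugctl_update(uint64_t cur, uint64_t set, uint64_t clear)
{
	return (cur & ~clear) | set;
}
```

Starting from the 0x4001 value used in the 'wrmsr' reproduction above, an update that only sets BTF yields 0x4003 and debugctl_lbr_enabled() still returns true. Unless some code names the LBR bit explicitly, the BIOS-provided bit stays set.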
* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs 2025-01-24 23:36 ` Maxim Levitsky @ 2025-01-25 0:12 ` Sean Christopherson 2025-01-27 17:58 ` Maxim Levitsky 0 siblings, 1 reply; 12+ messages in thread From: Sean Christopherson @ 2025-01-25 0:12 UTC (permalink / raw) To: Maxim Levitsky; +Cc: kvm, linux-kernel On Fri, Jan 24, 2025, Maxim Levitsky wrote: > On Wed, 2025-01-22 at 13:02 -0800, Sean Christopherson wrote: > > > > > Note that I read all msrs using 'rdmsr' userspace tool. > > > > > > > > I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work. I > > > > assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that > > > > down to confirm (the code I see on inspecition is gated on at least one perf > > > > event using LBRs). My guess is that there's a software bug somewhere in the > > > > perf/KVM exchange. > > > > > > > > I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM > > > > to read/write all LBRs during initialization works with LBRs disabled. > > Hi! > > I finally got to the very bottom of this: > > First of all, your assumption that the kernel resets LBR related msrs on > context switch after 'wrmsr' program finishes execution is wrong, because the > kernel will only do this if it *itself* enables the LBR feature (that is when > something like 'perf', uses a perf counter with a lbr call stack). > > Writes that 'wrmsr' tool does are not something that kernel expects so it > doesn't do anything in this case. > > What is happening instead, is something completely different: Turns out that > to shave off something like 50 nanoseconds, off the deep C-state entry/exit > latency, some Intel CPU don't preserve LBR stack values over these C-state > entries. Ugh. > > Ugh, but it does. On writes to any LBR, including LBR_TOS, KVM creates a "virtual" > > LBR perf event. KVM then relies on perf to context switch LBR MSRs, i.e. 
relies > > on perf to load the guest's values into hardware. At least, I think that's what > > is supposed to happen. AFAIK, the perf-based LBR support has never been properly > > document[*]. > > > > Anyways, my understanding of intel_pmu_handle_lbr_msrs_access() is that if the > > vCPU's LBR perf event is scheduled out or can't be created, the guest's value is > > effectively lost. Again, I don't know the "rules" for the LBR perf event, but > > it wouldn't suprise me if your CI fails because something in the host conflicts > > with KVM's LBR perf event. > > Actually you are partially wrong here too (although BIOS can be considered > 'something on the host'). > > I was able to prove that the reason why the unit test fails *is* because BIOS > left LBRs enabled: > > First of all, setting LBR bit manually in DEBUG_CTL does trigger this bug > (I use a different machine now, which doesn't have the bios bug): ... > ==== Test Assertion Failure ==== > x86_64/vmx_pmu_caps_test.c:202: r == v > pid=8415 tid=8415 errno=0 - Success > 1 0x0000000000404301: __suite_lbr_perf_capabilities at vmx_pmu_caps_test.c:202 > 2 (inlined by) vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194 > 3 (inlined by) wrapper_vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194 > 4 0x000000000040511a: __run_test at kselftest_harness.h:1240 > 5 0x0000000000402b95: test_harness_run at kselftest_harness.h:1310 > 6 (inlined by) main at vmx_pmu_caps_test.c:246 > 7 0x00007f56ba2295cf: ?? ??:0 > 8 0x00007f56ba22967f: ?? ??:0 > 9 0x0000000000402e44: _start at ??:? > Set MSR_LBR_TOS to '0x7', got back '0xc' > # lbr_perf_capabilities: Test failed > # FAIL vmx_pmu_caps.lbr_perf_capabilities > not ok 5 vmx_pmu_caps.lbr_perf_capabilities > # RUN vmx_pmu_caps.perf_capabilities_unsupported ... > # OK vmx_pmu_caps.perf_capabilities_unsupported > ok 6 vmx_pmu_caps.perf_capabilities_unsupported > # FAILED: 5 / 6 tests passed. 
> # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0 > > Secondary I went over all places in the kernel and all of them take care to > preserve DEBUG_CTL and only set/clear specific bits. > > __intel_pmu_lbr_enable() and __intel_pmu_lbr_enable() are practically the > only two places where DEBUGCTLMSR_LBR bit is touched, and the test doesn't > trigger them. Most likely because the test uses special > 'INTEL_FIXED_VLBR_EVENT' perf event (see intel_pmu_create_guest_lbr_event) > which is not enabled while in host mode. > > To double check this I traced all writes to DEBUG_CTL msr during this test > and the only write is done during 'guest_wrmsr_perf_capabilities' subtest, by > vmx_vcpu_run() which just restores the value that the msr had prior to VM > entry. > > So, why the value that BIOS sets survives? Because as I said all code that > touches DEBUG_CTL takes care to preserve all bits but the bit which is > changed, LBRs are never enabled on the host, and even the guest entry > preserves host DEBUG_CTL. Therefore the value written by BIOS survives. Well that's rather insane. > So we end up with the test writing to LBR_TOS while LBRs are unexpectedly > enabled, so it's not a surprise that when the test reads back the value > written, it will differ, and the test will rightfully fail. > > Since we have seen this in CI, and you saw it too in your CI, Gah, that was bad reporting on my end. The failure we saw was something else entirely. > I think this BIOS bug is not that rare, and so I suggest to stick > 'wrmsrl(MSR_IA32_DEBUGCTLMSR, 0)' somewhere early in a kernel boot code or at > least clear the DEBUGCTLMSR_LBR bit. > > I haven't found a very good place to put this, in a way that I can be sure > that x86 maintainers won't reject it, so I am open to your suggestions. Compile tested only, but perf's CPU online path seems appropriate, especially since that path also explicitly clears LBRs. Ensuring LBRs are stopped before clearing them seems logical. 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 99c590da0ae2..6e898b832d75 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5030,8 +5030,12 @@ static void intel_pmu_cpu_starting(int cpu)

 	init_debug_store_on_cpu(cpu);
 	/*
-	 * Deal with CPUs that don't clear their LBRs on power-up.
+	 * Deal with CPUs that don't clear their LBRs on power-up, and with
+	 * BIOSes that leave LBRs enabled.
 	 */
+	if (!static_cpu_has(X86_FEATURE_ARCH_LBR) && x86_pmu.lbr_nr)
+		msr_clear_bit(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR_BIT);
+
 	intel_pmu_lbr_reset();

 	cpuc->lbr_sel = NULL;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..bb7dd85aa6f2 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -395,7 +395,8 @@
 #define MSR_IA32_PASID_VALID BIT_ULL(31)

 /* DEBUGCTLMSR bits (others vary by model): */
-#define DEBUGCTLMSR_LBR (1UL << 0) /* last branch recording */
+#define DEBUGCTLMSR_LBR_BIT 0
+#define DEBUGCTLMSR_LBR (1UL << DEBUGCTLMSR_LBR_BIT) /* last branch recording */
 #define DEBUGCTLMSR_BTF_SHIFT 1
 #define DEBUGCTLMSR_BTF (1UL << 1) /* single-step on branches */
 #define DEBUGCTLMSR_BUS_LOCK_DETECT (1UL << 2)

^ permalink raw reply related [flat|nested] 12+ messages in thread
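As a rough model of what the msr_clear_bit() call in the patch above is expected to do, here is a userspace sketch against a fake MSR. It assumes, from my reading of arch/x86/lib/msr.c, that the helper reads the MSR, skips the write when the bit is already clear, and reports whether anything changed. struct fake_msr and fake_msr_clear_bit() are made-up names, and the error value is simplified to -1.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for one CPU's IA32_DEBUGCTL. The write counter
 * makes the "no write if nothing changes" behaviour observable.
 */
struct fake_msr {
	uint64_t val;
	unsigned int writes;
};

/*
 * Model of a clear-bit helper: returns 1 if the bit was set and has been
 * cleared, 0 if it was already clear (no write issued), -1 for a bit
 * number outside a 64-bit MSR.
 */
int fake_msr_clear_bit(struct fake_msr *m, unsigned int bit)
{
	uint64_t new_val;

	if (bit > 63)
		return -1;

	new_val = m->val & ~(1ULL << bit);
	if (new_val == m->val)
		return 0;

	m->val = new_val;
	m->writes++;
	return 1;
}
```

On a CPU where BIOS left 0x4001 in DEBUG_CTL, the first call with bit 0 would turn the value into 0x4000 with a single write; on a clean CPU the call is a read only, which seems a reasonable cost for the CPU online path.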
* Re: vmx_pmu_caps_test fails on Skylake based CPUS due to read only LBRs 2025-01-25 0:12 ` Sean Christopherson @ 2025-01-27 17:58 ` Maxim Levitsky 0 siblings, 0 replies; 12+ messages in thread From: Maxim Levitsky @ 2025-01-27 17:58 UTC (permalink / raw) To: Sean Christopherson; +Cc: kvm, linux-kernel On Fri, 2025-01-24 at 16:12 -0800, Sean Christopherson wrote: > On Fri, Jan 24, 2025, Maxim Levitsky wrote: > > On Wed, 2025-01-22 at 13:02 -0800, Sean Christopherson wrote: > > > > > > Note that I read all msrs using 'rdmsr' userspace tool. > > > > > > > > > > I'm pretty sure debugging via 'rdmsr', i.e. /dev/msr, isn't going to work. I > > > > > assume perf is clobbering LBR MSRs on context switch, but I haven't tracked that > > > > > down to confirm (the code I see on inspecition is gated on at least one perf > > > > > event using LBRs). My guess is that there's a software bug somewhere in the > > > > > perf/KVM exchange. > > > > > > > > > > I confirmed that using 'rdmsr' and 'wrmsr' "loses" values, but that hacking KVM > > > > > to read/write all LBRs during initialization works with LBRs disabled. > > > > Hi! > > > > I finally got to the very bottom of this: > > > > First of all, your assumption that the kernel resets LBR related msrs on > > context switch after 'wrmsr' program finishes execution is wrong, because the > > kernel will only do this if it *itself* enables the LBR feature (that is when > > something like 'perf', uses a perf counter with a lbr call stack). > > > > Writes that 'wrmsr' tool does are not something that kernel expects so it > > doesn't do anything in this case. > > > > What is happening instead, is something completely different: Turns out that > > to shave off something like 50 nanoseconds, off the deep C-state entry/exit > > latency, some Intel CPU don't preserve LBR stack values over these C-state > > entries. > > Ugh. > > > > Ugh, but it does. On writes to any LBR, including LBR_TOS, KVM creates a "virtual" > > > LBR perf event. 
KVM then relies on perf to context switch LBR MSRs, i.e. relies > > > on perf to load the guest's values into hardware. At least, I think that's what > > > is supposed to happen. AFAIK, the perf-based LBR support has never been properly > > > document[*]. > > > > > > Anyways, my understanding of intel_pmu_handle_lbr_msrs_access() is that if the > > > vCPU's LBR perf event is scheduled out or can't be created, the guest's value is > > > effectively lost. Again, I don't know the "rules" for the LBR perf event, but > > > it wouldn't suprise me if your CI fails because something in the host conflicts > > > with KVM's LBR perf event. > > > > Actually you are partially wrong here too (although BIOS can be considered > > 'something on the host'). > > > > I was able to prove that the reason why the unit test fails *is* because BIOS > > left LBRs enabled: > > > > First of all, setting LBR bit manually in DEBUG_CTL does trigger this bug > > (I use a different machine now, which doesn't have the bios bug): > > ... > > > ==== Test Assertion Failure ==== > > x86_64/vmx_pmu_caps_test.c:202: r == v > > pid=8415 tid=8415 errno=0 - Success > > 1 0x0000000000404301: __suite_lbr_perf_capabilities at vmx_pmu_caps_test.c:202 > > 2 (inlined by) vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194 > > 3 (inlined by) wrapper_vmx_pmu_caps_lbr_perf_capabilities at vmx_pmu_caps_test.c:194 > > 4 0x000000000040511a: __run_test at kselftest_harness.h:1240 > > 5 0x0000000000402b95: test_harness_run at kselftest_harness.h:1310 > > 6 (inlined by) main at vmx_pmu_caps_test.c:246 > > 7 0x00007f56ba2295cf: ?? ??:0 > > 8 0x00007f56ba22967f: ?? ??:0 > > 9 0x0000000000402e44: _start at ??:? > > Set MSR_LBR_TOS to '0x7', got back '0xc' > > # lbr_perf_capabilities: Test failed > > # FAIL vmx_pmu_caps.lbr_perf_capabilities > > not ok 5 vmx_pmu_caps.lbr_perf_capabilities > > # RUN vmx_pmu_caps.perf_capabilities_unsupported ... 
> > # OK vmx_pmu_caps.perf_capabilities_unsupported > > ok 6 vmx_pmu_caps.perf_capabilities_unsupported > > # FAILED: 5 / 6 tests passed. > > # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0 > > > > Secondary I went over all places in the kernel and all of them take care to > > preserve DEBUG_CTL and only set/clear specific bits. > > > > __intel_pmu_lbr_enable() and __intel_pmu_lbr_enable() are practically the > > only two places where DEBUGCTLMSR_LBR bit is touched, and the test doesn't > > trigger them. Most likely because the test uses special > > 'INTEL_FIXED_VLBR_EVENT' perf event (see intel_pmu_create_guest_lbr_event) > > which is not enabled while in host mode. > > > > To double check this I traced all writes to DEBUG_CTL msr during this test > > and the only write is done during 'guest_wrmsr_perf_capabilities' subtest, by > > vmx_vcpu_run() which just restores the value that the msr had prior to VM > > entry. > > > > So, why the value that BIOS sets survives? Because as I said all code that > > touches DEBUG_CTL takes care to preserve all bits but the bit which is > > changed, LBRs are never enabled on the host, and even the guest entry > > preserves host DEBUG_CTL. Therefore the value written by BIOS survives. > > Well that's rather insane. > > > So we end up with the test writing to LBR_TOS while LBRs are unexpectedly > > enabled, so it's not a surprise that when the test reads back the value > > written, it will differ, and the test will rightfully fail. > > > > Since we have seen this in CI, and you saw it too in your CI, > > Gah, that was bad reporting on my end. The failure we saw was something else > entirely. > > > I think this BIOS bug is not that rare, and so I suggest to stick > > 'wrmsrl(MSR_IA32_DEBUGCTLMSR, 0)' somewhere early in a kernel boot code or at > > least clear the DEBUGCTLMSR_LBR bit. 
> > > > I haven't found a very good place to put this, in a way that I can be sure > > that x86 maintainers won't reject it, so I am open to your suggestions. > > Compile tested only, but perf's CPU online path seems appropriate, especially > since that path also explicitly clears LBRs. Ensuring LBRs are stopped before > clearing them seems logical. > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > index 99c590da0ae2..6e898b832d75 100644 > --- a/arch/x86/events/intel/core.c > +++ b/arch/x86/events/intel/core.c > @@ -5030,8 +5030,12 @@ static void intel_pmu_cpu_starting(int cpu) > > init_debug_store_on_cpu(cpu); > /* > - * Deal with CPUs that don't clear their LBRs on power-up. > + * Deal with CPUs that don't clear their LBRs on power-up, and with > + * BIOSes that leave LBRs enabled. > */ > + if (!static_cpu_has(X86_FEATURE_ARCH_LBR) && x86_pmu.lbr_nr) > + msr_clear_bit(MSR_IA32_DEBUGCTLMSR, DEBUGCTLMSR_LBR_BIT); > + > intel_pmu_lbr_reset(); > > cpuc->lbr_sel = NULL; > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h > index 3ae84c3b8e6d..bb7dd85aa6f2 100644 > --- a/arch/x86/include/asm/msr-index.h > +++ b/arch/x86/include/asm/msr-index.h > @@ -395,7 +395,8 @@ > #define MSR_IA32_PASID_VALID BIT_ULL(31) > > /* DEBUGCTLMSR bits (others vary by model): */ > -#define DEBUGCTLMSR_LBR (1UL << 0) /* last branch recording */ > +#define DEBUGCTLMSR_LBR_BIT 0 > +#define DEBUGCTLMSR_LBR (1UL << DEBUGCTLMSR_LBR_BIT) /* last branch recording */ > #define DEBUGCTLMSR_BTF_SHIFT 1 > #define DEBUGCTLMSR_BTF (1UL << 1) /* single-step on branches */ > #define DEBUGCTLMSR_BUS_LOCK_DETECT (1UL << 2) > I did some simulated test which sets the DEBUGCTLMSR_LBR early in the boot and this patch, and it worked just fine. I agree that intel_pmu_cpu_starting is the best place to put this workaround. 
You might consider refactoring the code that deals with LBR setup into a function, similar to init_debug_store_on_cpu, maybe something like init_lbrs_on_cpu, but I don't insist on that; this patch is fine as-is. If I get my hands on the machine where this originally failed, I'll test there, although most likely this is just a formality. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Best regards, Maxim Levitsky ^ permalink raw reply [flat|nested] 12+ messages in thread