* VPMU interrupt unreliability
From: Kyle Huey @ 2017-07-22 20:16 UTC
To: Andrew Cooper, Boris Ostrovsky, xen-devel; +Cc: Robert O'Callahan

Last year I reported[0] seeing occasional instability in performance counter values when running rr[1], which depends on completely deterministic counts of retired conditional branches of userspace programs.

I recently identified the cause of this problem. Xen's VPMU code contains a workaround for an alleged Nehalem bug that was added in 2010[2]. Supposedly, if a hardware performance counter reaches 0 exactly during a PMI, another PMI is generated, potentially causing an endless loop. The workaround is to set the counter to 1. In 2013 the original bug was believed to affect more than just Nehalem, and the workaround was enabled for all family 6 CPUs.[3] This workaround unfortunately disturbs the counter value in non-deterministic ways (since the value the counter has in the irq handler depends on interrupt latency), which is fatal to rr.

I've verified that the discrepancies we see in the counted values are entirely accounted for by the number of times the workaround is used in any given run. Furthermore, patching Xen not to use this workaround makes the discrepancies in the counts vanish. I've added code[4] to rr that reliably detects this problem from guest userspace.

Even with the workaround removed in Xen I see some additional issues (but not disturbed counter values) with the PMI, such as interrupts occasionally not being delivered to the guest. I haven't done much work to track these down, but my working theory is that interrupts that "skid" out of the guest that requested them and into Xen itself or perhaps even another guest are not being delivered.
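[Editorial note: the effect of the workaround described above can be modeled in a few lines. This is a hypothetical sketch, not Xen's actual code; it only illustrates why the quirk makes counts latency-dependent.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the workaround's effect (not Xen's code).  A
 * counter is programmed to -N so it overflows to 0 after N events.  By
 * the time the PMI handler reads it, interrupt latency may have let a
 * few more events through.  If the handler finds the counter at exactly
 * 0, the quirk writes 1 back, inventing a spurious event; if latency
 * already pushed it past 0, the quirk does nothing.  Whether the final
 * count is disturbed therefore depends on interrupt latency. */
static uint64_t counter_after_pmi(uint64_t value_at_pmi, int quirk_enabled)
{
    if (quirk_enabled && value_at_pmi == 0)
        return 1;               /* quirk fires: one event invented */
    return value_at_pmi;        /* counter left untouched */
}
```

This matches the observation above: the total discrepancy equals the number of times the quirk fires, and disabling it makes the counts exact.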
Our current plan is to stop depending on the PMI during rr's recording phase (which we use for timeslicing tracees primarily because it's convenient) to enable producing correct recordings in Xen guests. Accurate replay will not be possible under virtualization because of the PMI issues; that will require transferring the recording to another machine. But that will be sufficient to enable the use cases we care about (e.g. record an automated process on a cloud computing provider and have an engineer download and replay a failing recording later to debug it).

I can think of several possible ways to fix the overcount problem, including:
1. Restricting the workaround to apply only to older CPUs and not all family 6 Intel CPUs forever.
2. Intercepting MSR loads for counters that have the workaround applied and giving the guest the correct counter value.
3. Or perhaps even changing the workaround to disable the PMI on that counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works on the relevant hardware.

Since I don't have the relevant hardware to test changes to this workaround on, and rr can avoid these bugs through other means, I don't expect to work on this myself, but I wanted to apprise you of what we've learned.
- Kyle

[0] https://lists.xen.org/archives/html/xen-devel/2016-10/msg01288.html
[1] http://rr-project.org/
[2] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=44aa8e3c47fc02e401f5c382d89b97eef0cd2019;hp=ce4fd2d43e04db5e9b042344dd294cfa11e1f405;hb=3ed6a063d2a5f6197306b030e8c27c36d5f31aa1;hpb=566f83823996cf9c95f9a0562488f6b1215a1052
[3] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=15b2036c8db1e56d8865ee34c363e7f23aa75e33;hp=9f152b48c26dfeedb6f94189a5fe4a5f7a772d83;hb=75a92f551ade530ebab73a0c3d4934dfb28149b5;hpb=71fc4da1306cec55a42787310b01a1cb52489abc
[4] See https://github.com/mozilla/rr/blob/a5d23728cd7d01c6be0c79852af26c68160d4405/src/PerfCounters.cc#L313, which sets up a counter and then does some pointless math in a loop to reach exactly 500 conditional branches. Xen will report 501 branches because of this bug.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
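[Editorial note: the detection approach in [4] can be paraphrased as follows. This is a hedged sketch rather than rr's actual PerfCounters.cc: the loop does throwaway arithmetic so that its iteration count, and hence roughly its retired-conditional-branch count, is exact; on real hardware the measured value would come from a PMU counter, which this sketch does not program.]

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the detection loop described in [4] (hypothetical, not rr's
 * code).  Each pass through the loop retires one conditional branch (the
 * loop test), so running it under a counter armed for exactly 500
 * retired conditional branches should end with the counter at 500; on a
 * hypervisor with the buggy quirk, 501 is reported instead.  The math is
 * deliberately pointless; `sink` is volatile so the compiler cannot
 * delete the loop. */
static volatile uint64_t sink;

static uint64_t spin_exact_iterations(uint32_t iters)
{
    uint64_t acc = 0;
    for (uint32_t i = 0; i < iters; i++)  /* one conditional branch per pass */
        acc += (uint64_t)i * 7 + 3;       /* straight-line, branch-free body */
    sink = acc;
    return acc;
}
```

The exact per-iteration branch count depends on codegen, which is presumably why rr pins this down against a known compiler output; treat the loop body here as illustrative only.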
* Re: VPMU interrupt unreliability 2017-07-22 20:16 VPMU interrupt unreliability Kyle Huey @ 2017-07-24 14:08 ` Boris Ostrovsky 2017-07-24 14:26 ` Kyle Huey 2017-07-24 14:13 ` Andrew Cooper 1 sibling, 1 reply; 16+ messages in thread From: Boris Ostrovsky @ 2017-07-24 14:08 UTC (permalink / raw) To: Kyle Huey, Andrew Cooper, xen-devel Cc: Tian, Kevin, Robert O'Callahan, Jun Nakajima On 07/22/2017 04:16 PM, Kyle Huey wrote: > Last year I reported[0] seeing occasional instability in performance > counter values when running rr[1], which depends on completely > deterministic counts of retired conditional branches of userspace > programs. > > I recently identified the cause of this problem. Xen's VPMU code > contains a workaround for an alleged Nehalem bug that was added in > 2010[2]. Supposedly if a hardware performance counter reaches 0 > exactly during a PMI another PMI is generated potentially causing an > endless loop. The workaround is to set the counter to 1. In 2013 the > original bug was believed to affect more than just Nehalem and the > workaround was enabled for all family 6 CPUs.[3] This workaround > unfortunately disturbs the counter value in non-deterministic ways > (since the value the counter has in the irq handler depends on > interrupt latency), which is fatal to rr. > > I've verified that the discrepancies we see in the counted values are > entirely accounted for by the number of times the workaround is used > in any given run. Furthermore, patching Xen not to use this > workaround makes the discrepancies in the counts vanish. I've added > code[4] to rr that reliably detects this problem from guest userspace. > > Even with the workaround removed in Xen I see some additional issues > (but not disturbed counter values) with the PMI, such as interrupts > occasionally not being delivered to the guest. 
> I haven't done much work to track these down, but my working theory is that interrupts that "skid" out of the guest that requested them and into Xen itself or perhaps even another guest are not being delivered.
>
> Our current plan is to stop depending on the PMI during rr's recording phase (which we use for timeslicing tracees primarily because it's convenient) to enable producing correct recordings in Xen guests. Accurate replay will not be possible under virtualization because of the PMI issues; that will require transferring the recording to another machine. But that will be sufficient to enable the use cases we care about (e.g. record an automated process on a cloud computing provider and have an engineer download and replay a failing recording later to debug it).
>
> I can think of several possible ways to fix the overcount problem, including:
> 1. Restricting the workaround to apply only to older CPUs and not all family 6 Intel CPUs forever.

IIRC the question of which processors this workaround is applicable to was raised and Intel folks (copied here) couldn't find an answer.

One thing I noticed is that the workaround doesn't appear to be complete: it is only checking PMC0 status and not other counters (fixed or architectural). Of course, without knowing what the actual problem was it's hard to say whether this was intentional.

> 2. Intercepting MSR loads for counters that have the workaround applied and giving the guest the correct counter value.

We'd have to keep track of whether the counter has been reset (by the quirk) since the last MSR write.

> 3. Or perhaps even changing the workaround to disable the PMI on that counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works on the relevant hardware.

MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?

Thanks for looking into this. Would also be interesting to see/confirm how some interrupts are (possibly) lost.

-boris

> Since I don't have the relevant hardware to test changes to this workaround on and rr can avoid these bugs through other means I don't expect to work on this myself, but I wanted to apprise you of what we've learned.
>
> - Kyle
>
> [0] https://lists.xen.org/archives/html/xen-devel/2016-10/msg01288.html
> [1] http://rr-project.org/
> [2] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=44aa8e3c47fc02e401f5c382d89b97eef0cd2019;hp=ce4fd2d43e04db5e9b042344dd294cfa11e1f405;hb=3ed6a063d2a5f6197306b030e8c27c36d5f31aa1;hpb=566f83823996cf9c95f9a0562488f6b1215a1052
> [3] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=15b2036c8db1e56d8865ee34c363e7f23aa75e33;hp=9f152b48c26dfeedb6f94189a5fe4a5f7a772d83;hb=75a92f551ade530ebab73a0c3d4934dfb28149b5;hpb=71fc4da1306cec55a42787310b01a1cb52489abc
> [4] See https://github.com/mozilla/rr/blob/a5d23728cd7d01c6be0c79852af26c68160d4405/src/PerfCounters.cc#L313, which sets up a counter and then does some pointless math in a loop to reach exactly 500 conditional branches. Xen will report 501 branches because of this bug.
* Re: VPMU interrupt unreliability 2017-07-24 14:08 ` Boris Ostrovsky @ 2017-07-24 14:26 ` Kyle Huey 2017-07-24 15:07 ` Boris Ostrovsky 0 siblings, 1 reply; 16+ messages in thread From: Kyle Huey @ 2017-07-24 14:26 UTC (permalink / raw) To: Boris Ostrovsky Cc: Andrew Cooper, Tian, Kevin, Robert O'Callahan, Jun Nakajima, xen-devel On Mon, Jul 24, 2017 at 7:08 AM, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote: > On 07/22/2017 04:16 PM, Kyle Huey wrote: >> Last year I reported[0] seeing occasional instability in performance >> counter values when running rr[1], which depends on completely >> deterministic counts of retired conditional branches of userspace >> programs. >> >> I recently identified the cause of this problem. Xen's VPMU code >> contains a workaround for an alleged Nehalem bug that was added in >> 2010[2]. Supposedly if a hardware performance counter reaches 0 >> exactly during a PMI another PMI is generated potentially causing an >> endless loop. The workaround is to set the counter to 1. In 2013 the >> original bug was believed to affect more than just Nehalem and the >> workaround was enabled for all family 6 CPUs.[3] This workaround >> unfortunately disturbs the counter value in non-deterministic ways >> (since the value the counter has in the irq handler depends on >> interrupt latency), which is fatal to rr. >> >> I've verified that the discrepancies we see in the counted values are >> entirely accounted for by the number of times the workaround is used >> in any given run. Furthermore, patching Xen not to use this >> workaround makes the discrepancies in the counts vanish. I've added >> code[4] to rr that reliably detects this problem from guest userspace. >> >> Even with the workaround removed in Xen I see some additional issues >> (but not disturbed counter values) with the PMI, such as interrupts >> occasionally not being delivered to the guest. 
>> I haven't done much work to track these down, but my working theory is that interrupts that "skid" out of the guest that requested them and into Xen itself or perhaps even another guest are not being delivered.
>>
>> Our current plan is to stop depending on the PMI during rr's recording phase (which we use for timeslicing tracees primarily because it's convenient) to enable producing correct recordings in Xen guests. Accurate replay will not be possible under virtualization because of the PMI issues; that will require transferring the recording to another machine. But that will be sufficient to enable the use cases we care about (e.g. record an automated process on a cloud computing provider and have an engineer download and replay a failing recording later to debug it).
>>
>> I can think of several possible ways to fix the overcount problem, including:
>> 1. Restricting the workaround to apply only to older CPUs and not all family 6 Intel CPUs forever.
>
> IIRC the question of which processors this workaround is applicable to was raised and Intel folks (copied here) couldn't find an answer.
>
> One thing I noticed is that the workaround doesn't appear to be complete: it is only checking PMC0 status and not other counters (fixed or architectural). Of course, without knowing what the actual problem was it's hard to say whether this was intentional.

handle_pmc_quirk appears to loop through all the counters ...

>> 2. Intercepting MSR loads for counters that have the workaround applied and giving the guest the correct counter value.
>
> We'd have to keep track of whether the counter has been reset (by the quirk) since the last MSR write.

Yes.

>> 3. Or perhaps even changing the workaround to disable the PMI on that counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works on the relevant hardware.
>
> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?

I'm suggesting waiting until the *guest* writes to the (virtualized) GLOBAL_OVF_CTRL.

> Thanks for looking into this. Would also be interesting to see/confirm how some interrupts are (possibly) lost.

Indeed. Unfortunately it's not a high priority for me at the moment.

- Kyle
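[Editorial note: "wait for the guest's ack" (option 3 as clarified here) could look roughly like the following. This is an invented sketch; the struct, field, and function names are illustrative, not Xen's actual vpmu implementation, and whether the hardware behaves this way on the affected parts is exactly the open question.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of option 3: when the quirk condition would fire
 * for a counter, suppress further PMIs from it and remember that fact;
 * only a write by the *guest* to its virtualized GLOBAL_OVF_CTRL,
 * clearing that counter's overflow bit, re-arms the interrupt.  All
 * names are invented for illustration. */
struct vpmu_counter_state {
    bool pmi_masked;   /* set while waiting for the guest's ack */
};

static void on_quirk_condition(struct vpmu_counter_state *c)
{
    c->pmi_masked = true;                 /* break the potential PMI loop */
}

static void on_guest_ovf_ctrl_write(struct vpmu_counter_state *c,
                                    uint64_t val, unsigned int counter_idx)
{
    if (val & (1ULL << counter_idx))      /* guest acked this counter */
        c->pmi_masked = false;            /* PMIs may be delivered again */
}
```

Unlike the current quirk, this never rewrites the counter value, so the guest's count would stay exact.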
* Re: VPMU interrupt unreliability
From: Boris Ostrovsky @ 2017-07-24 15:07 UTC
To: Kyle Huey
Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan

>> One thing I noticed is that the workaround doesn't appear to be complete: it is only checking PMC0 status and not other counters (fixed or architectural). Of course, without knowing what the actual problem was it's hard to say whether this was intentional.
> handle_pmc_quirk appears to loop through all the counters ...

Right, I didn't notice that it is shifting the MSR_CORE_PERF_GLOBAL_STATUS value one bit at a time, so it is looking at all bits.

>>> 2. Intercepting MSR loads for counters that have the workaround applied and giving the guest the correct counter value.
>>
>> We'd have to keep track of whether the counter has been reset (by the quirk) since the last MSR write.
> Yes.

>>> 3. Or perhaps even changing the workaround to disable the PMI on that counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works on the relevant hardware.
>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
> I'm suggesting waiting until the *guest* writes to the (virtualized) GLOBAL_OVF_CTRL.

Wouldn't it be better to wait until the counter is reloaded?

-boris
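[Editorial note: the bit-by-bit scan of the overflow status MSR being described can be illustrated like this. This is a standalone sketch of the idea, not the actual handle_pmc_quirk code; the callback type and bit layout are simplified.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of scanning MSR_CORE_PERF_GLOBAL_STATUS one bit at a time, so
 * that every overflowed counter is noticed rather than just PMC0. */
typedef void (*overflow_handler)(unsigned int counter_idx);

static unsigned int for_each_overflow_bit(uint64_t global_status,
                                          unsigned int num_bits,
                                          overflow_handler handle)
{
    unsigned int i, fired = 0;
    for (i = 0; i < num_bits; i++, global_status >>= 1)
        if (global_status & 1) {          /* this counter overflowed */
            if (handle)
                handle(i);
            fired++;
        }
    return fired;
}
```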
* Re: VPMU interrupt unreliability 2017-07-24 15:07 ` Boris Ostrovsky @ 2017-07-24 16:54 ` Kyle Huey 2017-10-10 16:54 ` Kyle Huey 0 siblings, 1 reply; 16+ messages in thread From: Kyle Huey @ 2017-07-24 16:54 UTC (permalink / raw) To: Boris Ostrovsky Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote: > >>> One thing I noticed is that the workaround doesn't appear to be >>> complete: it is only checking PMC0 status and not other counters (fixed >>> or architectural). Of course, without knowing what the actual problem >>> was it's hard to say whether this was intentional. >> handle_pmc_quirk appears to loop through all the counters ... > > Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS > value one by one and so it is looking at all bits. > >> >>>> 2. Intercepting MSR loads for counters that have the workaround >>>> applied and giving the guest the correct counter value. >>> >>> We'd have to keep track of whether the counter has been reset (by the >>> quirk) since the last MSR write. >> Yes. >> >>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>> on the relevant hardware. >>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >> I'm suggesting waiting until the *guest* writes to the (virtualized) >> GLOBAL_OVF_CTRL. > > Wouldn't it be better to wait until the counter is reloaded? Maybe! I haven't thought through it a lot. It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it. 
- Kyle
* Re: VPMU interrupt unreliability 2017-07-24 16:54 ` Kyle Huey @ 2017-10-10 16:54 ` Kyle Huey 2017-10-11 14:09 ` Boris Ostrovsky 0 siblings, 1 reply; 16+ messages in thread From: Kyle Huey @ 2017-10-10 16:54 UTC (permalink / raw) To: Boris Ostrovsky Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote: > On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky > <boris.ostrovsky@oracle.com> wrote: >> >>>> One thing I noticed is that the workaround doesn't appear to be >>>> complete: it is only checking PMC0 status and not other counters (fixed >>>> or architectural). Of course, without knowing what the actual problem >>>> was it's hard to say whether this was intentional. >>> handle_pmc_quirk appears to loop through all the counters ... >> >> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS >> value one by one and so it is looking at all bits. >> >>> >>>>> 2. Intercepting MSR loads for counters that have the workaround >>>>> applied and giving the guest the correct counter value. >>>> >>>> We'd have to keep track of whether the counter has been reset (by the >>>> quirk) since the last MSR write. >>> Yes. >>> >>>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>>> on the relevant hardware. >>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >>> I'm suggesting waiting until the *guest* writes to the (virtualized) >>> GLOBAL_OVF_CTRL. >> >> Wouldn't it be better to wait until the counter is reloaded? > > Maybe! I haven't thought through it a lot. 
> It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it.
>
> - Kyle

I wonder if it would be reasonable to just remove the workaround entirely at some point. The set of people using 1) several year old hardware, 2) an up to date Xen, and 3) the off-by-default performance counters is probably rather small.

- Kyle
* Re: VPMU interrupt unreliability 2017-10-10 16:54 ` Kyle Huey @ 2017-10-11 14:09 ` Boris Ostrovsky 2017-10-19 15:09 ` Kyle Huey 0 siblings, 1 reply; 16+ messages in thread From: Boris Ostrovsky @ 2017-10-11 14:09 UTC (permalink / raw) To: Kyle Huey Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan On 10/10/2017 12:54 PM, Kyle Huey wrote: > On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote: >> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky >> <boris.ostrovsky@oracle.com> wrote: >>>>> One thing I noticed is that the workaround doesn't appear to be >>>>> complete: it is only checking PMC0 status and not other counters (fixed >>>>> or architectural). Of course, without knowing what the actual problem >>>>> was it's hard to say whether this was intentional. >>>> handle_pmc_quirk appears to loop through all the counters ... >>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS >>> value one by one and so it is looking at all bits. >>> >>>>>> 2. Intercepting MSR loads for counters that have the workaround >>>>>> applied and giving the guest the correct counter value. >>>>> We'd have to keep track of whether the counter has been reset (by the >>>>> quirk) since the last MSR write. >>>> Yes. >>>> >>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>>>> on the relevant hardware. >>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >>>> I'm suggesting waiting until the *guest* writes to the (virtualized) >>>> GLOBAL_OVF_CTRL. >>> Wouldn't it be better to wait until the counter is reloaded? >> Maybe! I haven't thought through it a lot. 
>> It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it.
>>
>> - Kyle

> I wonder if it would be reasonable to just remove the workaround entirely at some point. The set of people using 1) several year old hardware, 2) an up to date Xen, and 3) the off-by-default performance counters is probably rather small.

We'd probably want to only enable this for affected processors, not remove it outright. But the problem is that we still don't know for sure whether this issue affects NHM only, do we?

(https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html is the original message)

-boris
* Re: VPMU interrupt unreliability 2017-10-11 14:09 ` Boris Ostrovsky @ 2017-10-19 15:09 ` Kyle Huey 2017-10-19 15:40 ` Andrew Cooper 0 siblings, 1 reply; 16+ messages in thread From: Kyle Huey @ 2017-10-19 15:09 UTC (permalink / raw) To: Boris Ostrovsky Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote: > On 10/10/2017 12:54 PM, Kyle Huey wrote: >> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote: >>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky >>> <boris.ostrovsky@oracle.com> wrote: >>>>>> One thing I noticed is that the workaround doesn't appear to be >>>>>> complete: it is only checking PMC0 status and not other counters (fixed >>>>>> or architectural). Of course, without knowing what the actual problem >>>>>> was it's hard to say whether this was intentional. >>>>> handle_pmc_quirk appears to loop through all the counters ... >>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS >>>> value one by one and so it is looking at all bits. >>>> >>>>>>> 2. Intercepting MSR loads for counters that have the workaround >>>>>>> applied and giving the guest the correct counter value. >>>>>> We'd have to keep track of whether the counter has been reset (by the >>>>>> quirk) since the last MSR write. >>>>> Yes. >>>>> >>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>>>>> on the relevant hardware. >>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >>>>> I'm suggesting waiting until the *guest* writes to the (virtualized) >>>>> GLOBAL_OVF_CTRL. >>>> Wouldn't it be better to wait until the counter is reloaded? >>> Maybe! I haven't thought through it a lot. 
>>> It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it.
>>>
>>> - Kyle

>> I wonder if it would be reasonable to just remove the workaround entirely at some point. The set of people using 1) several year old hardware, 2) an up to date Xen, and 3) the off-by-default performance counters is probably rather small.

> We'd probably want to only enable this for affected processors, not remove it outright. But the problem is that we still don't know for sure whether this issue affects NHM only, do we?
>
> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html is the original message)

Yes, the basic problem is that we don't know where to draw the line.

- Kyle
* Re: VPMU interrupt unreliability 2017-10-19 15:09 ` Kyle Huey @ 2017-10-19 15:40 ` Andrew Cooper 2017-10-19 18:20 ` Meng Xu 0 siblings, 1 reply; 16+ messages in thread From: Andrew Cooper @ 2017-10-19 15:40 UTC (permalink / raw) To: Kyle Huey, Boris Ostrovsky Cc: Tian, Kevin, Dietmar Hahn, Robert O'Callahan, Jun Nakajima, xen-devel On 19/10/17 16:09, Kyle Huey wrote: > On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky > <boris.ostrovsky@oracle.com> wrote: >> On 10/10/2017 12:54 PM, Kyle Huey wrote: >>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote: >>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky >>>> <boris.ostrovsky@oracle.com> wrote: >>>>>>> One thing I noticed is that the workaround doesn't appear to be >>>>>>> complete: it is only checking PMC0 status and not other counters (fixed >>>>>>> or architectural). Of course, without knowing what the actual problem >>>>>>> was it's hard to say whether this was intentional. >>>>>> handle_pmc_quirk appears to loop through all the counters ... >>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS >>>>> value one by one and so it is looking at all bits. >>>>> >>>>>>>> 2. Intercepting MSR loads for counters that have the workaround >>>>>>>> applied and giving the guest the correct counter value. >>>>>>> We'd have to keep track of whether the counter has been reset (by the >>>>>>> quirk) since the last MSR write. >>>>>> Yes. >>>>>> >>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>>>>>> on the relevant hardware. >>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized) >>>>>> GLOBAL_OVF_CTRL. >>>>> Wouldn't it be better to wait until the counter is reloaded? >>>> Maybe! 
>>>> I haven't thought through it a lot. It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it.
>>>>
>>>> - Kyle

>>> I wonder if it would be reasonable to just remove the workaround entirely at some point. The set of people using 1) several year old hardware, 2) an up to date Xen, and 3) the off-by-default performance counters is probably rather small.

>> We'd probably want to only enable this for affected processors, not remove it outright. But the problem is that we still don't know for sure whether this issue affects NHM only, do we?
>>
>> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html is the original message)

> Yes, the basic problem is that we don't know where to draw the line.

vPMU is disabled by default for security reasons, and also broken, in a way which demonstrates that vPMU isn't getting much real-world use.

As far as I'm concerned, all options (including rm -rf and start from scratch) are acceptable, especially if this ends up giving us a better overall subsystem.

Do we know how other hypervisors work around this issue?

I'm tempted to suggest just ripping it straight out. NHM is ancient these days, and if someone does manage to get a repro, we stand a better chance of being able to debug it properly.

~Andrew
* Re: VPMU interrupt unreliability
  2017-10-19 15:40 ` Andrew Cooper
@ 2017-10-19 18:20 ` Meng Xu
  2017-10-19 18:24   ` Kyle Huey
  2017-10-20  7:07   ` Jan Beulich
  0 siblings, 2 replies; 16+ messages in thread

From: Meng Xu @ 2017-10-19 18:20 UTC (permalink / raw)
To: Andrew Cooper
Cc: Tian, Kevin, Dietmar Hahn, xen-devel@lists.xen.org, Kyle Huey,
    Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

On Thu, Oct 19, 2017 at 11:40 AM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> On 19/10/17 16:09, Kyle Huey wrote:
> > On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
> > <boris.ostrovsky@oracle.com> wrote:
> >> On 10/10/2017 12:54 PM, Kyle Huey wrote:
> >>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote:
> >>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
> >>>> <boris.ostrovsky@oracle.com> wrote:
> >>>>>>> One thing I noticed is that the workaround doesn't appear to be
> >>>>>>> complete: it is only checking PMC0 status and not other counters (fixed
> >>>>>>> or architectural). Of course, without knowing what the actual problem
> >>>>>>> was it's hard to say whether this was intentional.
> >>>>>> handle_pmc_quirk appears to loop through all the counters ...
> >>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
> >>>>> value one by one and so it is looking at all bits.
> >>>>>
> >>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
> >>>>>>>> applied and giving the guest the correct counter value.
> >>>>>>> We'd have to keep track of whether the counter has been reset (by the
> >>>>>>> quirk) since the last MSR write.
> >>>>>> Yes.
> >>>>>>
> >>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
> >>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
> >>>>>>>> on the relevant hardware.
> >>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
> >>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
> >>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
> >>>>>> GLOBAL_OVF_CTRL.
> >>>>> Wouldn't it be better to wait until the counter is reloaded?
> >>>> Maybe! I haven't thought through it a lot. It's still not clear to
> >>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
> >>>> interrupt in any way or whether it just resets the bits in
> >>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
> >>>> all that's required to reenable it.
> >>>>
> >>>> - Kyle
> >>> I wonder if it would be reasonable to just remove the workaround
> >>> entirely at some point. The set of people using 1) several year old
> >>> hardware, 2) an up to date Xen, and 3) the off-by-default performance
> >>> counters is probably rather small.
> >> We'd probably want to only enable this for affected processors, not
> >> remove it outright. But the problem is that we still don't know for sure
> >> whether this issue affects NHM only, do we?
> >>
> >> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
> >> is the original message)
> > Yes, the basic problem is that we don't know where to draw the line.
>
> vPMU is disabled by default for security reasons,

Is there any document about the possible attacks via the vPMU? The
documents I found (such as [1] and XSA-163) just briefly say that the
vPMU should be disabled due to security concerns.

[1] https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

> and also broken, in a
> way which demonstrates that vPMU isn't getting much real-world use.

I also noticed that AWS seems to support part of the vPMU
functionality, which Netflix used to optimize their applications'
performance, according to
http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html .

I guess the security issue should be solved by AWS? However, without
knowing how the attack could be conducted, I'm not sure how AWS avoids
the attack concern for the vPMU.

> As far as I'm concerned, all options (including rm -rf and start from
> scratch) are acceptable, especially if this ends up giving us a better
> overall subsystem.
>
> Do we know how other hypervisors work around this issue?

Maybe the solution of AWS is a choice? I'm not sure. I'm just thinking
aloud. :)

Thanks,

Meng

--
Meng Xu
Ph.D. Candidate in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread
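[Editorial note: the handle_pmc_quirk behaviour discussed above — shifting the MSR_CORE_PERF_GLOBAL_STATUS value one bit at a time and reloading every overflowed counter with 1 — can be sketched roughly as below. This is an illustrative model only: the counter count, the msr_write_counter() helper, and all names here are invented, not Xen's actual code.]

```c
#include <stdint.h>
#include <assert.h>

#define NUM_GP_COUNTERS 4

/* Stand-in for a wrmsr() to a general-purpose counter MSR. */
static uint64_t fake_counters[NUM_GP_COUNTERS];

static void msr_write_counter(unsigned int idx, uint64_t val)
{
    fake_counters[idx] = val;
}

/*
 * Walk the overflow-status bits one by one.  For every counter whose
 * overflow bit is set, reload it with 1 instead of leaving it at 0,
 * so the PMI cannot re-fire with the counter exactly at zero.
 * Returns a mask of the counters that were adjusted.
 */
uint64_t handle_overflow_quirk(uint64_t global_status)
{
    uint64_t adjusted = 0;
    for (unsigned int i = 0; i < NUM_GP_COUNTERS; i++, global_status >>= 1) {
        if (global_status & 1) {
            msr_write_counter(i, 1);   /* the "set the counter to 1" quirk */
            adjusted |= 1ULL << i;
        }
    }
    return adjusted;
}
```

This also makes the guest-visible problem easy to see: every time the quirk fires, the counter is silently moved to 1, so the count the guest reads afterwards depends on how far past zero the counter had drifted before the PMI was handled — i.e. on interrupt latency.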
* Re: VPMU interrupt unreliability
  2017-10-19 18:20 ` Meng Xu
@ 2017-10-19 18:24   ` Kyle Huey
  2017-10-19 18:38     ` Andrew Cooper
  1 sibling, 1 reply; 16+ messages in thread

From: Kyle Huey @ 2017-10-19 18:24 UTC (permalink / raw)
To: Meng Xu
Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel@lists.xen.org,
    Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

On Thu, Oct 19, 2017 at 11:20 AM, Meng Xu <xumengpanda@gmail.com> wrote:
> On Thu, Oct 19, 2017 at 11:40 AM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>>
>> On 19/10/17 16:09, Kyle Huey wrote:
>> > On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
>> > <boris.ostrovsky@oracle.com> wrote:
>> >> On 10/10/2017 12:54 PM, Kyle Huey wrote:
>> >>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote:
>> >>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
>> >>>> <boris.ostrovsky@oracle.com> wrote:
>> >>>>>>> One thing I noticed is that the workaround doesn't appear to be
>> >>>>>>> complete: it is only checking PMC0 status and not other counters (fixed
>> >>>>>>> or architectural). Of course, without knowing what the actual problem
>> >>>>>>> was it's hard to say whether this was intentional.
>> >>>>>> handle_pmc_quirk appears to loop through all the counters ...
>> >>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
>> >>>>> value one by one and so it is looking at all bits.
>> >>>>>
>> >>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
>> >>>>>>>> applied and giving the guest the correct counter value.
>> >>>>>>> We'd have to keep track of whether the counter has been reset (by the
>> >>>>>>> quirk) since the last MSR write.
>> >>>>>> Yes.
>> >>>>>>
>> >>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
>> >>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>> >>>>>>>> on the relevant hardware.
>> >>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
>> >>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
>> >>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
>> >>>>>> GLOBAL_OVF_CTRL.
>> >>>>> Wouldn't it be better to wait until the counter is reloaded?
>> >>>> Maybe! I haven't thought through it a lot. It's still not clear to
>> >>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
>> >>>> interrupt in any way or whether it just resets the bits in
>> >>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
>> >>>> all that's required to reenable it.
>> >>>>
>> >>>> - Kyle
>> >>> I wonder if it would be reasonable to just remove the workaround
>> >>> entirely at some point. The set of people using 1) several year old
>> >>> hardware, 2) an up to date Xen, and 3) the off-by-default performance
>> >>> counters is probably rather small.
>> >> We'd probably want to only enable this for affected processors, not
>> >> remove it outright. But the problem is that we still don't know for sure
>> >> whether this issue affects NHM only, do we?
>> >>
>> >> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
>> >> is the original message)
>> > Yes, the basic problem is that we don't know where to draw the line.
>>
>> vPMU is disabled by default for security reasons,
>
> Is there any document about the possible attacks via the vPMU? The
> documents I found (such as [1] and XSA-163) just briefly say that the
> vPMU should be disabled due to security concerns.
>
> [1] https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

Cross-guest information leaks, presumably.

>> and also broken, in a
>> way which demonstrates that vPMU isn't getting much real-world use.
>
> I also noticed that AWS seems to support part of the vPMU
> functionality, which Netflix used to optimize their applications'
> performance, according to
> http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html .
>
> I guess the security issue should be solved by AWS? However, without
> knowing how the attack could be conducted, I'm not sure how AWS avoids
> the attack concern for the vPMU.

AWS only allows you to use the vPMU if you have the entire physical
machine your VM is running on dedicated to yourself. Cross-guest
information leaks are not a big deal if the same tenant controls all
the guests.

>> As far as I'm concerned, all options (including rm -rf and start from
>> scratch) are acceptable, especially if this ends up giving us a better
>> overall subsystem.
>>
>> Do we know how other hypervisors work around this issue?
>
> Maybe the solution of AWS is a choice? I'm not sure. I'm just thinking
> aloud. :)
>
> Thanks,
>
> Meng
>
> --
> Meng Xu
> Ph.D. Candidate in Computer and Information Science
> University of Pennsylvania
> http://www.cis.upenn.edu/~mengxu/

- Kyle
* Re: VPMU interrupt unreliability
  2017-10-19 18:24 ` Kyle Huey
@ 2017-10-19 18:38 ` Andrew Cooper
  0 siblings, 0 replies; 16+ messages in thread

From: Andrew Cooper @ 2017-10-19 18:38 UTC (permalink / raw)
To: Kyle Huey, Meng Xu
Cc: Tian, Kevin, Dietmar Hahn, xen-devel@lists.xen.org,
    Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

On 19/10/17 19:24, Kyle Huey wrote:
> On Thu, Oct 19, 2017 at 11:20 AM, Meng Xu <xumengpanda@gmail.com> wrote:
>> On Thu, Oct 19, 2017 at 11:40 AM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> On 19/10/17 16:09, Kyle Huey wrote:
>>>> On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 10/10/2017 12:54 PM, Kyle Huey wrote:
>>>>>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote:
>>>>>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>> One thing I noticed is that the workaround doesn't appear to be
>>>>>>>>>> complete: it is only checking PMC0 status and not other counters (fixed
>>>>>>>>>> or architectural). Of course, without knowing what the actual problem
>>>>>>>>>> was it's hard to say whether this was intentional.
>>>>>>>>> handle_pmc_quirk appears to loop through all the counters ...
>>>>>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
>>>>>>>> value one by one and so it is looking at all bits.
>>>>>>>>
>>>>>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
>>>>>>>>>>> applied and giving the guest the correct counter value.
>>>>>>>>>> We'd have to keep track of whether the counter has been reset (by the
>>>>>>>>>> quirk) since the last MSR write.
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
>>>>>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>>>>>>>>>>> on the relevant hardware.
>>>>>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
>>>>>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
>>>>>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
>>>>>>>>> GLOBAL_OVF_CTRL.
>>>>>>>> Wouldn't it be better to wait until the counter is reloaded?
>>>>>>> Maybe! I haven't thought through it a lot. It's still not clear to
>>>>>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
>>>>>>> interrupt in any way or whether it just resets the bits in
>>>>>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
>>>>>>> all that's required to reenable it.
>>>>>>>
>>>>>>> - Kyle
>>>>>> I wonder if it would be reasonable to just remove the workaround
>>>>>> entirely at some point. The set of people using 1) several year old
>>>>>> hardware, 2) an up to date Xen, and 3) the off-by-default performance
>>>>>> counters is probably rather small.
>>>>> We'd probably want to only enable this for affected processors, not
>>>>> remove it outright. But the problem is that we still don't know for sure
>>>>> whether this issue affects NHM only, do we?
>>>>>
>>>>> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
>>>>> is the original message)
>>>> Yes, the basic problem is that we don't know where to draw the line.
>>> vPMU is disabled by default for security reasons,
>>
>> Is there any document about the possible attacks via the vPMU? The
>> documents I found (such as [1] and XSA-163) just briefly say that the
>> vPMU should be disabled due to security concerns.
>>
>> [1] https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html
> Cross-guest information leaks, presumably.

Plenty of "not context switching things properly". Off the top of my
head, there was also a straight DoS by blindly passing guest values
into an unchecked wrmsr(), and privilege escalation via letting the
guest choose where ds_store dumped its data.

~Andrew
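[Editorial note: the "unchecked wrmsr()" class of bug mentioned above is worth a concrete illustration. The usual defence is to validate a guest-supplied MSR value against a mask of the bits the guest is actually allowed to set before it ever reaches the hardware. The mask value and function name below are invented for illustration; this is not Xen's code.]

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/*
 * Hypothetical mask of the guest-writable bits of a perf event-select
 * MSR.  Bits outside this mask are reserved or hypervisor-controlled;
 * the value here is illustrative, not taken from any real datasheet.
 */
#define PERF_EVTSEL_WRITABLE_MASK 0x00000000ffcfffffULL

/*
 * Reject any guest value that tries to set a bit outside the writable
 * mask.  Only on success is *out filled in, ready to be passed to a
 * real wrmsr().  Writing the raw guest value without such a check is
 * exactly the "straight DoS" failure mode: a #GP on reserved bits, or
 * a machine-state corruption, triggered entirely from guest context.
 */
bool sanitize_guest_evtsel(uint64_t guest_val, uint64_t *out)
{
    if (guest_val & ~PERF_EVTSEL_WRITABLE_MASK)
        return false;
    *out = guest_val;
    return true;
}
```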
* Re: VPMU interrupt unreliability
  2017-10-19 18:20 ` Meng Xu
  2017-10-19 18:24   ` Kyle Huey
@ 2017-10-20  7:07 ` Jan Beulich
  2017-10-23  2:50   ` Meng Xu
  1 sibling, 1 reply; 16+ messages in thread

From: Jan Beulich @ 2017-10-20 7:07 UTC (permalink / raw)
To: Meng Xu
Cc: Kevin Tian, Andrew Cooper, Dietmar Hahn, xen-devel@lists.xen.org,
    Kyle Huey, Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

>>> On 19.10.17 at 20:20, <xumengpanda@gmail.com> wrote:
> Is there any document about the possible attack via the vPMU? The
> document I found (such as [1] and XSA-163) just briefly say that the
> vPMU should be disabled due to security concern.

Besides the other responses you've already got, I also recall there
being at least some CPU models that would live lock upon the debug
store being placed into virtual space not mapped by present pages.

Jan
* Re: VPMU interrupt unreliability
  2017-10-20  7:07 ` Jan Beulich
@ 2017-10-23  2:50 ` Meng Xu
  0 siblings, 0 replies; 16+ messages in thread

From: Meng Xu @ 2017-10-23 2:50 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Andrew Cooper, Dietmar Hahn, xen-devel@lists.xen.org,
    Kyle Huey, Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

On Fri, Oct 20, 2017 at 3:07 AM, Jan Beulich <JBeulich@suse.com> wrote:
>
> >>> On 19.10.17 at 20:20, <xumengpanda@gmail.com> wrote:
> > Is there any document about the possible attack via the vPMU? The
> > document I found (such as [1] and XSA-163) just briefly say that the
> > vPMU should be disabled due to security concern.
>
> Besides the other responses you've already got, I also recall there
> being at least some CPU models that would live lock upon the
> debug store being placed into virtual space not mapped by present
> pages.

Thank you very much for your explanation! :)

Best Regards,

Meng

-----------
Meng Xu
Ph.D. Candidate in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/
* Re: VPMU interrupt unreliability
  2017-07-22 20:16 VPMU interrupt unreliability Kyle Huey
  2017-07-24 14:08 ` Boris Ostrovsky
@ 2017-07-24 14:13 ` Andrew Cooper
  2017-07-24 14:32   ` Kyle Huey
  1 sibling, 1 reply; 16+ messages in thread

From: Andrew Cooper @ 2017-07-24 14:13 UTC (permalink / raw)
To: Kyle Huey, Boris Ostrovsky, xen-devel
Cc: Kevin Tian, Jan Beulich, Dietmar Hahn

On 22/07/17 21:16, Kyle Huey wrote:
> Last year I reported[0] seeing occasional instability in performance
> counter values when running rr[1], which depends on completely
> deterministic counts of retired conditional branches of userspace
> programs.
>
> I recently identified the cause of this problem. Xen's VPMU code
> contains a workaround for an alleged Nehalem bug that was added in
> 2010[2]. Supposedly if a hardware performance counter reaches 0
> exactly during a PMI another PMI is generated potentially causing an
> endless loop. The workaround is to set the counter to 1. In 2013 the
> original bug was believed to affect more than just Nehalem and the
> workaround was enabled for all family 6 CPUs.[3] This workaround
> unfortunately disturbs the counter value in non-deterministic ways
> (since the value the counter has in the irq handler depends on
> interrupt latency), which is fatal to rr.
>
> I've verified that the discrepancies we see in the counted values are
> entirely accounted for by the number of times the workaround is used
> in any given run. Furthermore, patching Xen not to use this
> workaround makes the discrepancies in the counts vanish. I've added
> code[4] to rr that reliably detects this problem from guest userspace.
>
> Even with the workaround removed in Xen I see some additional issues
> (but not disturbed counter values) with the PMI, such as interrupts
> occasionally not being delivered to the guest. I haven't done much
> work to track these down, but my working theory is that interrupts
> that "skid" out of the guest that requested them and into Xen itself
> or perhaps even another guest are not being delivered.
>
> Our current plan is to stop depending on the PMI during rr's recording
> phase (which we use for timeslicing tracees primarily because it's
> convenient) to enable producing correct recordings in Xen guests.
> Accurate replay will not be possible under virtualization because of
> the PMI issues; that will require transferring the recording to
> another machine. But that will be sufficient to enable the use cases
> we care about (e.g. record an automated process on a cloud computing
> provider and have an engineer download and replay a failing recording
> later to debug it).
>
> I can think of several possible ways to fix the overcount problem, including:
> 1. Restricting the workaround to apply only to older CPUs and not all
> family 6 Intel CPUs forever.
> 2. Intercepting MSR loads for counters that have the workaround
> applied and giving the guest the correct counter value.
> 3. Or perhaps even changing the workaround to disable the PMI on that
> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
> on the relevant hardware.
>
> Since I don't have the relevant hardware to test changes to this
> workaround on and rr can avoid these bugs through other means I don't
> expect to work on this myself, but I wanted to apprise you of what
> we've learned.

Thank you for this investigation and analysis.

I think the first action is to try and identify what this mysterious
erratum is. Despite the plethora of perf errata, the best I can find is
AAK135 "Multiple Performance Monitor Interrupts are Possible on Overflow
of IA32_FIXED_CTR2", which still doesn't obviously match the described
symptoms.

CC'ing Dietmar, who was the author of the original workaround. Do you
recall any other information which might be helpful in tracking this
down? I also don't see any similar workaround in the Linux event
infrastructure, which makes me wonder whether the observed behaviour was
a side effect of something else Xen-specific.

Having Xen perturb the counters behind a guest's back (in a way contrary
to architectural or errata behaviour) is obviously a bad thing, and we
should fix that. I do have access to hardware, but am lacking vPMU
expertise.

~Andrew
* Re: VPMU interrupt unreliability
  2017-07-24 14:13 ` Andrew Cooper
@ 2017-07-24 14:32 ` Kyle Huey
  0 siblings, 0 replies; 16+ messages in thread

From: Kyle Huey @ 2017-07-24 14:32 UTC (permalink / raw)
To: Andrew Cooper
Cc: Kevin Tian, Boris Ostrovsky, Jan Beulich, Dietmar Hahn, xen-devel

On Mon, Jul 24, 2017 at 7:13 AM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 22/07/17 21:16, Kyle Huey wrote:
>> Last year I reported[0] seeing occasional instability in performance
>> counter values when running rr[1], which depends on completely
>> deterministic counts of retired conditional branches of userspace
>> programs.
>>
>> I recently identified the cause of this problem. Xen's VPMU code
>> contains a workaround for an alleged Nehalem bug that was added in
>> 2010[2]. Supposedly if a hardware performance counter reaches 0
>> exactly during a PMI another PMI is generated potentially causing an
>> endless loop. The workaround is to set the counter to 1. In 2013 the
>> original bug was believed to affect more than just Nehalem and the
>> workaround was enabled for all family 6 CPUs.[3] This workaround
>> unfortunately disturbs the counter value in non-deterministic ways
>> (since the value the counter has in the irq handler depends on
>> interrupt latency), which is fatal to rr.
>>
>> I've verified that the discrepancies we see in the counted values are
>> entirely accounted for by the number of times the workaround is used
>> in any given run. Furthermore, patching Xen not to use this
>> workaround makes the discrepancies in the counts vanish. I've added
>> code[4] to rr that reliably detects this problem from guest userspace.
>>
>> Even with the workaround removed in Xen I see some additional issues
>> (but not disturbed counter values) with the PMI, such as interrupts
>> occasionally not being delivered to the guest. I haven't done much
>> work to track these down, but my working theory is that interrupts
>> that "skid" out of the guest that requested them and into Xen itself
>> or perhaps even another guest are not being delivered.
>>
>> Our current plan is to stop depending on the PMI during rr's recording
>> phase (which we use for timeslicing tracees primarily because it's
>> convenient) to enable producing correct recordings in Xen guests.
>> Accurate replay will not be possible under virtualization because of
>> the PMI issues; that will require transferring the recording to
>> another machine. But that will be sufficient to enable the use cases
>> we care about (e.g. record an automated process on a cloud computing
>> provider and have an engineer download and replay a failing recording
>> later to debug it).
>>
>> I can think of several possible ways to fix the overcount problem, including:
>> 1. Restricting the workaround to apply only to older CPUs and not all
>> family 6 Intel CPUs forever.
>> 2. Intercepting MSR loads for counters that have the workaround
>> applied and giving the guest the correct counter value.
>> 3. Or perhaps even changing the workaround to disable the PMI on that
>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>> on the relevant hardware.
>>
>> Since I don't have the relevant hardware to test changes to this
>> workaround on and rr can avoid these bugs through other means I don't
>> expect to work on this myself, but I wanted to apprise you of what
>> we've learned.
>
> Thank you for this investigation and analysis.
>
> I think the first action is to try and identify what this mysterious
> erratum is. Despite the plethora of perf errata, the best I can find is
> AAK135 "Multiple Performance Monitor Interrupts are Possible on Overflow
> of IA32_FIXED_CTR2", which still doesn't obviously match the described
> symptoms.

I think it may be BJ58 "Performance-Counter Overflow Indication May
Cause Undesired Behavior".

> CC'ing Dietmar, who was the author of the original workaround. Do you
> recall any other information which might be helpful in tracking this
> down? I also don't see any similar workaround in the Linux event
> infrastructure, which makes me wonder whether the observed behaviour was
> a side effect of something else Xen-specific.

Haitao Shan wrote "The issue causing interrupt loop is: It seems that
on NHM (at that time) when a PMI arrives at CPU, the counter has a
value of zero (instead of some other small value, say 3 or 5, seen on
Core 2 Duo). In this case, unmasking the PMI via APIC will trigger
immediately another PMI. This does not produce problem with native
kernel, since it typically programs the counter with another value (as
needed by making yet another sampling point) before unmasking. For
Xen, PMI handler cannot handle the counter immediately since it should
be handled by guests. It just records a virtual PMI to guests and
unmasks the PMI before return."

https://lists.xen.org/archives/html/xen-devel/2013-03/msg02615.html

> Having Xen perturb the counters behind a guest's back (in a way contrary
> to architectural or errata behaviour) is obviously a bad thing, and we
> should fix that. I do have access to hardware, but am lacking vPMU
> expertise.

- Kyle
end of thread, other threads: [~2017-10-23 2:50 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-22 20:16 VPMU interrupt unreliability Kyle Huey
2017-07-24 14:08 ` Boris Ostrovsky
2017-07-24 14:26 ` Kyle Huey
2017-07-24 15:07 ` Boris Ostrovsky
2017-07-24 16:54 ` Kyle Huey
2017-10-10 16:54 ` Kyle Huey
2017-10-11 14:09 ` Boris Ostrovsky
2017-10-19 15:09 ` Kyle Huey
2017-10-19 15:40 ` Andrew Cooper
2017-10-19 18:20 ` Meng Xu
2017-10-19 18:24 ` Kyle Huey
2017-10-19 18:38 ` Andrew Cooper
2017-10-20  7:07 ` Jan Beulich
2017-10-23  2:50 ` Meng Xu
2017-07-24 14:13 ` Andrew Cooper
2017-07-24 14:32 ` Kyle Huey