* VPMU interrupt unreliability
From: Kyle Huey @ 2017-07-22 20:16 UTC
To: Andrew Cooper, Boris Ostrovsky, xen-devel; +Cc: Robert O'Callahan

Last year I reported[0] seeing occasional instability in performance counter values when running rr[1], which depends on completely deterministic counts of retired conditional branches of userspace programs.

I recently identified the cause of this problem. Xen's VPMU code contains a workaround for an alleged Nehalem bug that was added in 2010[2]. Supposedly, if a hardware performance counter reaches 0 exactly during a PMI, another PMI is generated, potentially causing an endless loop. The workaround is to set the counter to 1. In 2013 the original bug was believed to affect more than just Nehalem, and the workaround was enabled for all family 6 CPUs.[3] This workaround unfortunately disturbs the counter value in non-deterministic ways (since the value the counter has in the irq handler depends on interrupt latency), which is fatal to rr.

I've verified that the discrepancies we see in the counted values are entirely accounted for by the number of times the workaround is used in any given run. Furthermore, patching Xen not to use this workaround makes the discrepancies in the counts vanish. I've added code[4] to rr that reliably detects this problem from guest userspace.

Even with the workaround removed in Xen I see some additional issues (but not disturbed counter values) with the PMI, such as interrupts occasionally not being delivered to the guest. I haven't done much work to track these down, but my working theory is that interrupts that "skid" out of the guest that requested them and into Xen itself or perhaps even another guest are not being delivered.
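[Editorial note: the effect of the workaround described above can be modeled in a few lines. This is a hypothetical sketch, not Xen's actual code; it only illustrates why the quirk makes counts latency-dependent.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the workaround's effect (not Xen's code).  A
 * counter is programmed to -N so it overflows to 0 after N events.  By
 * the time the PMI handler reads it, interrupt latency may have let a
 * few more events through.  If the handler finds the counter at exactly
 * 0, the quirk writes 1 back, inventing a spurious event; if latency
 * already pushed it past 0, the quirk does nothing.  Whether the final
 * count is disturbed therefore depends on interrupt latency. */
static uint64_t counter_after_pmi(uint64_t value_at_pmi, int quirk_enabled)
{
    if (quirk_enabled && value_at_pmi == 0)
        return 1;               /* quirk fires: one event invented */
    return value_at_pmi;        /* counter left untouched */
}
```

This matches the observation above: the total discrepancy equals the number of times the quirk fires, and disabling it makes the counts exact.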
Our current plan is to stop depending on the PMI during rr's recording phase (which we use for timeslicing tracees primarily because it's convenient) to enable producing correct recordings in Xen guests. Accurate replay will not be possible under virtualization because of the PMI issues; that will require transferring the recording to another machine. But that will be sufficient to enable the use cases we care about (e.g. record an automated process on a cloud computing provider and have an engineer download and replay a failing recording later to debug it).

I can think of several possible ways to fix the overcount problem, including:
1. Restricting the workaround to apply only to older CPUs and not all family 6 Intel CPUs forever.
2. Intercepting MSR loads for counters that have the workaround applied and giving the guest the correct counter value.
3. Or perhaps even changing the workaround to disable the PMI on that counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works on the relevant hardware.

Since I don't have the relevant hardware to test changes to this workaround on, and rr can avoid these bugs through other means, I don't expect to work on this myself, but I wanted to apprise you of what we've learned.
- Kyle

[0] https://lists.xen.org/archives/html/xen-devel/2016-10/msg01288.html
[1] http://rr-project.org/
[2] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=44aa8e3c47fc02e401f5c382d89b97eef0cd2019;hp=ce4fd2d43e04db5e9b042344dd294cfa11e1f405;hb=3ed6a063d2a5f6197306b030e8c27c36d5f31aa1;hpb=566f83823996cf9c95f9a0562488f6b1215a1052
[3] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=15b2036c8db1e56d8865ee34c363e7f23aa75e33;hp=9f152b48c26dfeedb6f94189a5fe4a5f7a772d83;hb=75a92f551ade530ebab73a0c3d4934dfb28149b5;hpb=71fc4da1306cec55a42787310b01a1cb52489abc
[4] See https://github.com/mozilla/rr/blob/a5d23728cd7d01c6be0c79852af26c68160d4405/src/PerfCounters.cc#L313, which sets up a counter and then does some pointless math in a loop to reach exactly 500 conditional branches. Xen will report 501 branches because of this bug.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
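[Editorial note: the detection approach in [4] can be paraphrased as follows. This is a hedged sketch rather than rr's actual PerfCounters.cc: the loop does throwaway arithmetic so that its iteration count, and hence roughly its retired-conditional-branch count, is exact; on real hardware the measured value would come from a PMU counter, which this sketch does not program.]

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the detection loop described in [4] (hypothetical, not rr's
 * code).  Each pass through the loop retires one conditional branch (the
 * loop test), so running it under a counter armed for exactly 500
 * retired conditional branches should end with the counter at 500; on a
 * hypervisor with the buggy quirk, 501 is reported instead.  The math is
 * deliberately pointless; `sink` is volatile so the compiler cannot
 * delete the loop. */
static volatile uint64_t sink;

static uint64_t spin_exact_iterations(uint32_t iters)
{
    uint64_t acc = 0;
    for (uint32_t i = 0; i < iters; i++)  /* one conditional branch per pass */
        acc += (uint64_t)i * 7 + 3;       /* straight-line, branch-free body */
    sink = acc;
    return acc;
}
```

The exact per-iteration branch count depends on codegen, which is presumably why rr pins this down against a known compiler output; treat the loop body here as illustrative only.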
* Re: VPMU interrupt unreliability 2017-07-22 20:16 VPMU interrupt unreliability Kyle Huey @ 2017-07-24 14:08 ` Boris Ostrovsky 2017-07-24 14:26 ` Kyle Huey 2017-07-24 14:13 ` Andrew Cooper 1 sibling, 1 reply; 16+ messages in thread From: Boris Ostrovsky @ 2017-07-24 14:08 UTC (permalink / raw) To: Kyle Huey, Andrew Cooper, xen-devel Cc: Tian, Kevin, Robert O'Callahan, Jun Nakajima On 07/22/2017 04:16 PM, Kyle Huey wrote: > Last year I reported[0] seeing occasional instability in performance > counter values when running rr[1], which depends on completely > deterministic counts of retired conditional branches of userspace > programs. > > I recently identified the cause of this problem. Xen's VPMU code > contains a workaround for an alleged Nehalem bug that was added in > 2010[2]. Supposedly if a hardware performance counter reaches 0 > exactly during a PMI another PMI is generated potentially causing an > endless loop. The workaround is to set the counter to 1. In 2013 the > original bug was believed to affect more than just Nehalem and the > workaround was enabled for all family 6 CPUs.[3] This workaround > unfortunately disturbs the counter value in non-deterministic ways > (since the value the counter has in the irq handler depends on > interrupt latency), which is fatal to rr. > > I've verified that the discrepancies we see in the counted values are > entirely accounted for by the number of times the workaround is used > in any given run. Furthermore, patching Xen not to use this > workaround makes the discrepancies in the counts vanish. I've added > code[4] to rr that reliably detects this problem from guest userspace. > > Even with the workaround removed in Xen I see some additional issues > (but not disturbed counter values) with the PMI, such as interrupts > occasionally not being delivered to the guest. 
> I haven't done much work to track these down, but my working theory is that interrupts that "skid" out of the guest that requested them and into Xen itself or perhaps even another guest are not being delivered.
>
> Our current plan is to stop depending on the PMI during rr's recording phase (which we use for timeslicing tracees primarily because it's convenient) to enable producing correct recordings in Xen guests. Accurate replay will not be possible under virtualization because of the PMI issues; that will require transferring the recording to another machine. But that will be sufficient to enable the use cases we care about (e.g. record an automated process on a cloud computing provider and have an engineer download and replay a failing recording later to debug it).
>
> I can think of several possible ways to fix the overcount problem, including:
> 1. Restricting the workaround to apply only to older CPUs and not all family 6 Intel CPUs forever.

IIRC the question of which processors this workaround is applicable to was raised and Intel folks (copied here) couldn't find an answer.

One thing I noticed is that the workaround doesn't appear to be complete: it is only checking PMC0 status and not other counters (fixed or architectural). Of course, without knowing what the actual problem was it's hard to say whether this was intentional.

> 2. Intercepting MSR loads for counters that have the workaround applied and giving the guest the correct counter value.

We'd have to keep track of whether the counter has been reset (by the quirk) since the last MSR write.

> 3. Or perhaps even changing the workaround to disable the PMI on that counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works on the relevant hardware.

MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?

Thanks for looking into this. Would also be interesting to see/confirm how some interrupts are (possibly) lost.

-boris

> Since I don't have the relevant hardware to test changes to this workaround on and rr can avoid these bugs through other means I don't expect to work on this myself, but I wanted to apprise you of what we've learned.
>
> - Kyle
>
> [0] https://lists.xen.org/archives/html/xen-devel/2016-10/msg01288.html
> [1] http://rr-project.org/
> [2] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=44aa8e3c47fc02e401f5c382d89b97eef0cd2019;hp=ce4fd2d43e04db5e9b042344dd294cfa11e1f405;hb=3ed6a063d2a5f6197306b030e8c27c36d5f31aa1;hpb=566f83823996cf9c95f9a0562488f6b1215a1052
> [3] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=15b2036c8db1e56d8865ee34c363e7f23aa75e33;hp=9f152b48c26dfeedb6f94189a5fe4a5f7a772d83;hb=75a92f551ade530ebab73a0c3d4934dfb28149b5;hpb=71fc4da1306cec55a42787310b01a1cb52489abc
> [4] See https://github.com/mozilla/rr/blob/a5d23728cd7d01c6be0c79852af26c68160d4405/src/PerfCounters.cc#L313, which sets up a counter and then does some pointless math in a loop to reach exactly 500 conditional branches. Xen will report 501 branches because of this bug.
* Re: VPMU interrupt unreliability 2017-07-24 14:08 ` Boris Ostrovsky @ 2017-07-24 14:26 ` Kyle Huey 2017-07-24 15:07 ` Boris Ostrovsky 0 siblings, 1 reply; 16+ messages in thread From: Kyle Huey @ 2017-07-24 14:26 UTC (permalink / raw) To: Boris Ostrovsky Cc: Andrew Cooper, Tian, Kevin, Robert O'Callahan, Jun Nakajima, xen-devel On Mon, Jul 24, 2017 at 7:08 AM, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote: > On 07/22/2017 04:16 PM, Kyle Huey wrote: >> Last year I reported[0] seeing occasional instability in performance >> counter values when running rr[1], which depends on completely >> deterministic counts of retired conditional branches of userspace >> programs. >> >> I recently identified the cause of this problem. Xen's VPMU code >> contains a workaround for an alleged Nehalem bug that was added in >> 2010[2]. Supposedly if a hardware performance counter reaches 0 >> exactly during a PMI another PMI is generated potentially causing an >> endless loop. The workaround is to set the counter to 1. In 2013 the >> original bug was believed to affect more than just Nehalem and the >> workaround was enabled for all family 6 CPUs.[3] This workaround >> unfortunately disturbs the counter value in non-deterministic ways >> (since the value the counter has in the irq handler depends on >> interrupt latency), which is fatal to rr. >> >> I've verified that the discrepancies we see in the counted values are >> entirely accounted for by the number of times the workaround is used >> in any given run. Furthermore, patching Xen not to use this >> workaround makes the discrepancies in the counts vanish. I've added >> code[4] to rr that reliably detects this problem from guest userspace. >> >> Even with the workaround removed in Xen I see some additional issues >> (but not disturbed counter values) with the PMI, such as interrupts >> occasionally not being delivered to the guest. 
>> I haven't done much work to track these down, but my working theory is that interrupts that "skid" out of the guest that requested them and into Xen itself or perhaps even another guest are not being delivered.
>>
>> Our current plan is to stop depending on the PMI during rr's recording phase (which we use for timeslicing tracees primarily because it's convenient) to enable producing correct recordings in Xen guests. Accurate replay will not be possible under virtualization because of the PMI issues; that will require transferring the recording to another machine. But that will be sufficient to enable the use cases we care about (e.g. record an automated process on a cloud computing provider and have an engineer download and replay a failing recording later to debug it).
>>
>> I can think of several possible ways to fix the overcount problem, including:
>> 1. Restricting the workaround to apply only to older CPUs and not all family 6 Intel CPUs forever.
>
> IIRC the question of which processors this workaround is applicable to was raised and Intel folks (copied here) couldn't find an answer.
>
> One thing I noticed is that the workaround doesn't appear to be complete: it is only checking PMC0 status and not other counters (fixed or architectural). Of course, without knowing what the actual problem was it's hard to say whether this was intentional.

handle_pmc_quirk appears to loop through all the counters ...

>> 2. Intercepting MSR loads for counters that have the workaround applied and giving the guest the correct counter value.
>
> We'd have to keep track of whether the counter has been reset (by the quirk) since the last MSR write.

Yes.

>> 3. Or perhaps even changing the workaround to disable the PMI on that counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works on the relevant hardware.
>
> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?

I'm suggesting waiting until the *guest* writes to the (virtualized) GLOBAL_OVF_CTRL.

> Thanks for looking into this. Would also be interesting to see/confirm how some interrupts are (possibly) lost.

Indeed. Unfortunately it's not a high priority for me at the moment.

- Kyle
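[Editorial note: "wait for the guest's ack" (option 3 as clarified here) could look roughly like the following. This is an invented sketch; the struct, field, and function names are illustrative, not Xen's actual vpmu implementation, and whether the hardware behaves this way on the affected parts is exactly the open question.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of option 3: when the quirk condition would fire
 * for a counter, suppress further PMIs from it and remember that fact;
 * only a write by the *guest* to its virtualized GLOBAL_OVF_CTRL,
 * clearing that counter's overflow bit, re-arms the interrupt.  All
 * names are invented for illustration. */
struct vpmu_counter_state {
    bool pmi_masked;   /* set while waiting for the guest's ack */
};

static void on_quirk_condition(struct vpmu_counter_state *c)
{
    c->pmi_masked = true;                 /* break the potential PMI loop */
}

static void on_guest_ovf_ctrl_write(struct vpmu_counter_state *c,
                                    uint64_t val, unsigned int counter_idx)
{
    if (val & (1ULL << counter_idx))      /* guest acked this counter */
        c->pmi_masked = false;            /* PMIs may be delivered again */
}
```

Unlike the current quirk, this never rewrites the counter value, so the guest's count would stay exact.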
* Re: VPMU interrupt unreliability
From: Boris Ostrovsky @ 2017-07-24 15:07 UTC
To: Kyle Huey
Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan

>> One thing I noticed is that the workaround doesn't appear to be complete: it is only checking PMC0 status and not other counters (fixed or architectural). Of course, without knowing what the actual problem was it's hard to say whether this was intentional.
> handle_pmc_quirk appears to loop through all the counters ...

Right, I didn't notice that it is shifting the MSR_CORE_PERF_GLOBAL_STATUS value one bit at a time, so it is looking at all bits.

>>> 2. Intercepting MSR loads for counters that have the workaround applied and giving the guest the correct counter value.
>>
>> We'd have to keep track of whether the counter has been reset (by the quirk) since the last MSR write.
> Yes.

>>> 3. Or perhaps even changing the workaround to disable the PMI on that counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works on the relevant hardware.
>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
> I'm suggesting waiting until the *guest* writes to the (virtualized) GLOBAL_OVF_CTRL.

Wouldn't it be better to wait until the counter is reloaded?

-boris
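[Editorial note: the bit-by-bit scan of the overflow status MSR being described can be illustrated like this. This is a standalone sketch of the idea, not the actual handle_pmc_quirk code; the callback type and bit layout are simplified.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of scanning MSR_CORE_PERF_GLOBAL_STATUS one bit at a time, so
 * that every overflowed counter is noticed rather than just PMC0. */
typedef void (*overflow_handler)(unsigned int counter_idx);

static unsigned int for_each_overflow_bit(uint64_t global_status,
                                          unsigned int num_bits,
                                          overflow_handler handle)
{
    unsigned int i, fired = 0;
    for (i = 0; i < num_bits; i++, global_status >>= 1)
        if (global_status & 1) {          /* this counter overflowed */
            if (handle)
                handle(i);
            fired++;
        }
    return fired;
}
```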
* Re: VPMU interrupt unreliability 2017-07-24 15:07 ` Boris Ostrovsky @ 2017-07-24 16:54 ` Kyle Huey 2017-10-10 16:54 ` Kyle Huey 0 siblings, 1 reply; 16+ messages in thread From: Kyle Huey @ 2017-07-24 16:54 UTC (permalink / raw) To: Boris Ostrovsky Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote: > >>> One thing I noticed is that the workaround doesn't appear to be >>> complete: it is only checking PMC0 status and not other counters (fixed >>> or architectural). Of course, without knowing what the actual problem >>> was it's hard to say whether this was intentional. >> handle_pmc_quirk appears to loop through all the counters ... > > Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS > value one by one and so it is looking at all bits. > >> >>>> 2. Intercepting MSR loads for counters that have the workaround >>>> applied and giving the guest the correct counter value. >>> >>> We'd have to keep track of whether the counter has been reset (by the >>> quirk) since the last MSR write. >> Yes. >> >>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>> on the relevant hardware. >>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >> I'm suggesting waiting until the *guest* writes to the (virtualized) >> GLOBAL_OVF_CTRL. > > Wouldn't it be better to wait until the counter is reloaded? Maybe! I haven't thought through it a lot. It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it. 
- Kyle
* Re: VPMU interrupt unreliability 2017-07-24 16:54 ` Kyle Huey @ 2017-10-10 16:54 ` Kyle Huey 2017-10-11 14:09 ` Boris Ostrovsky 0 siblings, 1 reply; 16+ messages in thread From: Kyle Huey @ 2017-10-10 16:54 UTC (permalink / raw) To: Boris Ostrovsky Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote: > On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky > <boris.ostrovsky@oracle.com> wrote: >> >>>> One thing I noticed is that the workaround doesn't appear to be >>>> complete: it is only checking PMC0 status and not other counters (fixed >>>> or architectural). Of course, without knowing what the actual problem >>>> was it's hard to say whether this was intentional. >>> handle_pmc_quirk appears to loop through all the counters ... >> >> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS >> value one by one and so it is looking at all bits. >> >>> >>>>> 2. Intercepting MSR loads for counters that have the workaround >>>>> applied and giving the guest the correct counter value. >>>> >>>> We'd have to keep track of whether the counter has been reset (by the >>>> quirk) since the last MSR write. >>> Yes. >>> >>>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>>> on the relevant hardware. >>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >>> I'm suggesting waiting until the *guest* writes to the (virtualized) >>> GLOBAL_OVF_CTRL. >> >> Wouldn't it be better to wait until the counter is reloaded? > > Maybe! I haven't thought through it a lot. 
> It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it.
>
> - Kyle

I wonder if it would be reasonable to just remove the workaround entirely at some point. The set of people using 1) several year old hardware, 2) an up to date Xen, and 3) the off-by-default performance counters is probably rather small.

- Kyle
* Re: VPMU interrupt unreliability 2017-10-10 16:54 ` Kyle Huey @ 2017-10-11 14:09 ` Boris Ostrovsky 2017-10-19 15:09 ` Kyle Huey 0 siblings, 1 reply; 16+ messages in thread From: Boris Ostrovsky @ 2017-10-11 14:09 UTC (permalink / raw) To: Kyle Huey Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan On 10/10/2017 12:54 PM, Kyle Huey wrote: > On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote: >> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky >> <boris.ostrovsky@oracle.com> wrote: >>>>> One thing I noticed is that the workaround doesn't appear to be >>>>> complete: it is only checking PMC0 status and not other counters (fixed >>>>> or architectural). Of course, without knowing what the actual problem >>>>> was it's hard to say whether this was intentional. >>>> handle_pmc_quirk appears to loop through all the counters ... >>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS >>> value one by one and so it is looking at all bits. >>> >>>>>> 2. Intercepting MSR loads for counters that have the workaround >>>>>> applied and giving the guest the correct counter value. >>>>> We'd have to keep track of whether the counter has been reset (by the >>>>> quirk) since the last MSR write. >>>> Yes. >>>> >>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>>>> on the relevant hardware. >>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >>>> I'm suggesting waiting until the *guest* writes to the (virtualized) >>>> GLOBAL_OVF_CTRL. >>> Wouldn't it be better to wait until the counter is reloaded? >> Maybe! I haven't thought through it a lot. 
>> It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it.
>>
>> - Kyle

> I wonder if it would be reasonable to just remove the workaround entirely at some point. The set of people using 1) several year old hardware, 2) an up to date Xen, and 3) the off-by-default performance counters is probably rather small.

We'd probably want to only enable this for affected processors, not remove it outright. But the problem is that we still don't know for sure whether this issue affects NHM only, do we?

(https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html is the original message)

-boris
* Re: VPMU interrupt unreliability 2017-10-11 14:09 ` Boris Ostrovsky @ 2017-10-19 15:09 ` Kyle Huey 2017-10-19 15:40 ` Andrew Cooper 0 siblings, 1 reply; 16+ messages in thread From: Kyle Huey @ 2017-10-19 15:09 UTC (permalink / raw) To: Boris Ostrovsky Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel, Jun Nakajima, Robert O'Callahan On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky <boris.ostrovsky@oracle.com> wrote: > On 10/10/2017 12:54 PM, Kyle Huey wrote: >> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote: >>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky >>> <boris.ostrovsky@oracle.com> wrote: >>>>>> One thing I noticed is that the workaround doesn't appear to be >>>>>> complete: it is only checking PMC0 status and not other counters (fixed >>>>>> or architectural). Of course, without knowing what the actual problem >>>>>> was it's hard to say whether this was intentional. >>>>> handle_pmc_quirk appears to loop through all the counters ... >>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS >>>> value one by one and so it is looking at all bits. >>>> >>>>>>> 2. Intercepting MSR loads for counters that have the workaround >>>>>>> applied and giving the guest the correct counter value. >>>>>> We'd have to keep track of whether the counter has been reset (by the >>>>>> quirk) since the last MSR write. >>>>> Yes. >>>>> >>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>>>>> on the relevant hardware. >>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >>>>> I'm suggesting waiting until the *guest* writes to the (virtualized) >>>>> GLOBAL_OVF_CTRL. >>>> Wouldn't it be better to wait until the counter is reloaded? >>> Maybe! I haven't thought through it a lot. 
>>> It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it.
>>>
>>> - Kyle

>> I wonder if it would be reasonable to just remove the workaround entirely at some point. The set of people using 1) several year old hardware, 2) an up to date Xen, and 3) the off-by-default performance counters is probably rather small.

> We'd probably want to only enable this for affected processors, not remove it outright. But the problem is that we still don't know for sure whether this issue affects NHM only, do we?
>
> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html is the original message)

Yes, the basic problem is that we don't know where to draw the line.

- Kyle
* Re: VPMU interrupt unreliability 2017-10-19 15:09 ` Kyle Huey @ 2017-10-19 15:40 ` Andrew Cooper 2017-10-19 18:20 ` Meng Xu 0 siblings, 1 reply; 16+ messages in thread From: Andrew Cooper @ 2017-10-19 15:40 UTC (permalink / raw) To: Kyle Huey, Boris Ostrovsky Cc: Tian, Kevin, Dietmar Hahn, Robert O'Callahan, Jun Nakajima, xen-devel On 19/10/17 16:09, Kyle Huey wrote: > On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky > <boris.ostrovsky@oracle.com> wrote: >> On 10/10/2017 12:54 PM, Kyle Huey wrote: >>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote: >>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky >>>> <boris.ostrovsky@oracle.com> wrote: >>>>>>> One thing I noticed is that the workaround doesn't appear to be >>>>>>> complete: it is only checking PMC0 status and not other counters (fixed >>>>>>> or architectural). Of course, without knowing what the actual problem >>>>>>> was it's hard to say whether this was intentional. >>>>>> handle_pmc_quirk appears to loop through all the counters ... >>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS >>>>> value one by one and so it is looking at all bits. >>>>> >>>>>>>> 2. Intercepting MSR loads for counters that have the workaround >>>>>>>> applied and giving the guest the correct counter value. >>>>>>> We'd have to keep track of whether the counter has been reset (by the >>>>>>> quirk) since the last MSR write. >>>>>> Yes. >>>>>> >>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that >>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works >>>>>>>> on the relevant hardware. >>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk >>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we? >>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized) >>>>>> GLOBAL_OVF_CTRL. >>>>> Wouldn't it be better to wait until the counter is reloaded? >>>> Maybe! 
>>>> I haven't thought through it a lot. It's still not clear to me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the interrupt in any way or whether it just resets the bits in MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is all that's required to reenable it.
>>>>
>>>> - Kyle

>>> I wonder if it would be reasonable to just remove the workaround entirely at some point. The set of people using 1) several year old hardware, 2) an up to date Xen, and 3) the off-by-default performance counters is probably rather small.

>> We'd probably want to only enable this for affected processors, not remove it outright. But the problem is that we still don't know for sure whether this issue affects NHM only, do we?
>>
>> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html is the original message)

> Yes, the basic problem is that we don't know where to draw the line.

vPMU is disabled by default for security reasons, and also broken, in a way which demonstrates that vPMU isn't getting much real-world use.

As far as I'm concerned, all options (including rm -rf and start from scratch) are acceptable, especially if this ends up giving us a better overall subsystem.

Do we know how other hypervisors work around this issue?

I'm tempted to suggest just ripping it straight out. NHM is ancient these days, and if someone does manage to get a repro, we stand a better chance of being able to debug it properly.

~Andrew
* Re: VPMU interrupt unreliability
  2017-10-19 15:40 ` Andrew Cooper
@ 2017-10-19 18:20 ` Meng Xu
  2017-10-19 18:24   ` Kyle Huey
  2017-10-20  7:07   ` Jan Beulich
  0 siblings, 2 replies; 16+ messages in thread

From: Meng Xu @ 2017-10-19 18:20 UTC (permalink / raw)
To: Andrew Cooper
Cc: Tian, Kevin, Dietmar Hahn, xen-devel@lists.xen.org, Kyle Huey,
    Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

On Thu, Oct 19, 2017 at 11:40 AM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
>
> On 19/10/17 16:09, Kyle Huey wrote:
> > On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
> > <boris.ostrovsky@oracle.com> wrote:
> >> On 10/10/2017 12:54 PM, Kyle Huey wrote:
> >>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote:
> >>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
> >>>> <boris.ostrovsky@oracle.com> wrote:
> >>>>>>> One thing I noticed is that the workaround doesn't appear to be
> >>>>>>> complete: it is only checking PMC0 status and not other counters (fixed
> >>>>>>> or architectural). Of course, without knowing what the actual problem
> >>>>>>> was it's hard to say whether this was intentional.
> >>>>>> handle_pmc_quirk appears to loop through all the counters ...
> >>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
> >>>>> value one by one and so it is looking at all bits.
> >>>>>
> >>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
> >>>>>>>> applied and giving the guest the correct counter value.
> >>>>>>> We'd have to keep track of whether the counter has been reset (by the
> >>>>>>> quirk) since the last MSR write.
> >>>>>> Yes.
> >>>>>>
> >>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
> >>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
> >>>>>>>> on the relevant hardware.
> >>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
> >>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
> >>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
> >>>>>> GLOBAL_OVF_CTRL.
> >>>>> Wouldn't it be better to wait until the counter is reloaded?
> >>>> Maybe! I haven't thought through it a lot. It's still not clear to
> >>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
> >>>> interrupt in any way or whether it just resets the bits in
> >>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
> >>>> all that's required to reenable it.
> >>>>
> >>>> - Kyle
> >>> I wonder if it would be reasonable to just remove the workaround
> >>> entirely at some point. The set of people using 1) several year old
> >>> hardware, 2) an up to date Xen, and 3) the off-by-default performance
> >>> counters is probably rather small.
> >> We'd probably want to only enable this for affected processors, not
> >> remove it outright. But the problem is that we still don't know for sure
> >> whether this issue affects NHM only, do we?
> >>
> >> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
> >> is the original message)
> > Yes, the basic problem is that we don't know where to draw the line.
>
> vPMU is disabled by default for security reasons,

Is there any document about the possible attacks via the vPMU? The
documents I found (such as [1] and XSA-163) just briefly say that the
vPMU should be disabled due to security concerns.

[1] https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

> and also broken, in a
> way which demonstrates that vPMU isn't getting much real-world use.

I also noticed that AWS seems to support part of the vPMU
functionality, which Netflix used to optimize their applications'
performance, according to
http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html .

I guess the security issue should be solved by AWS? However, without
knowing how the attack could be conducted, I'm not sure how AWS avoids
the attack concern for the vPMU.

> As far as I'm concerned, all options (including rm -rf and start from
> scratch) are acceptable, especially if this ends up giving us a better
> overall subsystem.
>
> Do we know how other hypervisors work around this issue?

Maybe the solution of AWS is a choice? I'm not sure. I'm just thinking
aloud. :)

Thanks,

Meng

--
Meng Xu
Ph.D. Candidate in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 16+ messages in thread
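[Editorial note: the handle_pmc_quirk behaviour discussed above — shifting the MSR_CORE_PERF_GLOBAL_STATUS value one bit at a time and reloading every overflowed counter with 1 — can be sketched roughly as below. This is an illustrative model only: the counter count, the msr_write_counter() helper, and all names here are invented, not Xen's actual code.]

```c
#include <stdint.h>
#include <assert.h>

#define NUM_GP_COUNTERS 4

/* Stand-in for a wrmsr() to a general-purpose counter MSR. */
static uint64_t fake_counters[NUM_GP_COUNTERS];

static void msr_write_counter(unsigned int idx, uint64_t val)
{
    fake_counters[idx] = val;
}

/*
 * Walk the overflow-status bits one by one.  For every counter whose
 * overflow bit is set, reload it with 1 instead of leaving it at 0,
 * so the PMI cannot re-fire with the counter exactly at zero.
 * Returns a mask of the counters that were adjusted.
 */
uint64_t handle_overflow_quirk(uint64_t global_status)
{
    uint64_t adjusted = 0;
    for (unsigned int i = 0; i < NUM_GP_COUNTERS; i++, global_status >>= 1) {
        if (global_status & 1) {
            msr_write_counter(i, 1);   /* the "set the counter to 1" quirk */
            adjusted |= 1ULL << i;
        }
    }
    return adjusted;
}
```

This also makes the guest-visible problem easy to see: every time the quirk fires, the counter is silently moved to 1, so the count the guest reads afterwards depends on how far past zero the counter had drifted before the PMI was handled — i.e. on interrupt latency.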
* Re: VPMU interrupt unreliability
  2017-10-19 18:20 ` Meng Xu
@ 2017-10-19 18:24   ` Kyle Huey
  2017-10-19 18:38     ` Andrew Cooper
  1 sibling, 1 reply; 16+ messages in thread

From: Kyle Huey @ 2017-10-19 18:24 UTC (permalink / raw)
To: Meng Xu
Cc: Tian, Kevin, Andrew Cooper, Dietmar Hahn, xen-devel@lists.xen.org,
    Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

On Thu, Oct 19, 2017 at 11:20 AM, Meng Xu <xumengpanda@gmail.com> wrote:
> On Thu, Oct 19, 2017 at 11:40 AM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>>
>> On 19/10/17 16:09, Kyle Huey wrote:
>> > On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
>> > <boris.ostrovsky@oracle.com> wrote:
>> >> On 10/10/2017 12:54 PM, Kyle Huey wrote:
>> >>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote:
>> >>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
>> >>>> <boris.ostrovsky@oracle.com> wrote:
>> >>>>>>> One thing I noticed is that the workaround doesn't appear to be
>> >>>>>>> complete: it is only checking PMC0 status and not other counters (fixed
>> >>>>>>> or architectural). Of course, without knowing what the actual problem
>> >>>>>>> was it's hard to say whether this was intentional.
>> >>>>>> handle_pmc_quirk appears to loop through all the counters ...
>> >>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
>> >>>>> value one by one and so it is looking at all bits.
>> >>>>>
>> >>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
>> >>>>>>>> applied and giving the guest the correct counter value.
>> >>>>>>> We'd have to keep track of whether the counter has been reset (by the
>> >>>>>>> quirk) since the last MSR write.
>> >>>>>> Yes.
>> >>>>>>
>> >>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
>> >>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>> >>>>>>>> on the relevant hardware.
>> >>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
>> >>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
>> >>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
>> >>>>>> GLOBAL_OVF_CTRL.
>> >>>>> Wouldn't it be better to wait until the counter is reloaded?
>> >>>> Maybe! I haven't thought through it a lot. It's still not clear to
>> >>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
>> >>>> interrupt in any way or whether it just resets the bits in
>> >>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
>> >>>> all that's required to reenable it.
>> >>>>
>> >>>> - Kyle
>> >>> I wonder if it would be reasonable to just remove the workaround
>> >>> entirely at some point. The set of people using 1) several year old
>> >>> hardware, 2) an up to date Xen, and 3) the off-by-default performance
>> >>> counters is probably rather small.
>> >> We'd probably want to only enable this for affected processors, not
>> >> remove it outright. But the problem is that we still don't know for sure
>> >> whether this issue affects NHM only, do we?
>> >>
>> >> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
>> >> is the original message)
>> > Yes, the basic problem is that we don't know where to draw the line.
>>
>> vPMU is disabled by default for security reasons,
>
> Is there any document about the possible attacks via the vPMU? The
> documents I found (such as [1] and XSA-163) just briefly say that the
> vPMU should be disabled due to security concerns.
>
> [1] https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

Cross-guest information leaks, presumably.

>> and also broken, in a
>> way which demonstrates that vPMU isn't getting much real-world use.
>
> I also noticed that AWS seems to support part of the vPMU
> functionality, which Netflix used to optimize their applications'
> performance, according to
> http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html .
>
> I guess the security issue should be solved by AWS? However, without
> knowing how the attack could be conducted, I'm not sure how AWS avoids
> the attack concern for the vPMU.

AWS only allows you to use the vPMU if you have the entire physical
machine your VM is running on dedicated to yourself. Cross-guest
information leaks are not a big deal if the same tenant controls all
the guests.

>> As far as I'm concerned, all options (including rm -rf and start from
>> scratch) are acceptable, especially if this ends up giving us a better
>> overall subsystem.
>>
>> Do we know how other hypervisors work around this issue?
>
> Maybe the solution of AWS is a choice? I'm not sure. I'm just thinking
> aloud. :)
>
> Thanks,
>
> Meng
>
> --
> Meng Xu
> Ph.D. Candidate in Computer and Information Science
> University of Pennsylvania
> http://www.cis.upenn.edu/~mengxu/

- Kyle
* Re: VPMU interrupt unreliability
  2017-10-19 18:24 ` Kyle Huey
@ 2017-10-19 18:38 ` Andrew Cooper
  0 siblings, 0 replies; 16+ messages in thread

From: Andrew Cooper @ 2017-10-19 18:38 UTC (permalink / raw)
To: Kyle Huey, Meng Xu
Cc: Tian, Kevin, Dietmar Hahn, xen-devel@lists.xen.org,
    Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

On 19/10/17 19:24, Kyle Huey wrote:
> On Thu, Oct 19, 2017 at 11:20 AM, Meng Xu <xumengpanda@gmail.com> wrote:
>> On Thu, Oct 19, 2017 at 11:40 AM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> On 19/10/17 16:09, Kyle Huey wrote:
>>>> On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>> On 10/10/2017 12:54 PM, Kyle Huey wrote:
>>>>>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@kylehuey.com> wrote:
>>>>>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>>>>>> One thing I noticed is that the workaround doesn't appear to be
>>>>>>>>>> complete: it is only checking PMC0 status and not other counters (fixed
>>>>>>>>>> or architectural). Of course, without knowing what the actual problem
>>>>>>>>>> was it's hard to say whether this was intentional.
>>>>>>>>> handle_pmc_quirk appears to loop through all the counters ...
>>>>>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
>>>>>>>> value one by one and so it is looking at all bits.
>>>>>>>>
>>>>>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
>>>>>>>>>>> applied and giving the guest the correct counter value.
>>>>>>>>>> We'd have to keep track of whether the counter has been reset (by the
>>>>>>>>>> quirk) since the last MSR write.
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
>>>>>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>>>>>>>>>>> on the relevant hardware.
>>>>>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
>>>>>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
>>>>>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
>>>>>>>>> GLOBAL_OVF_CTRL.
>>>>>>>> Wouldn't it be better to wait until the counter is reloaded?
>>>>>>> Maybe! I haven't thought through it a lot. It's still not clear to
>>>>>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
>>>>>>> interrupt in any way or whether it just resets the bits in
>>>>>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
>>>>>>> all that's required to reenable it.
>>>>>>>
>>>>>>> - Kyle
>>>>>> I wonder if it would be reasonable to just remove the workaround
>>>>>> entirely at some point. The set of people using 1) several year old
>>>>>> hardware, 2) an up to date Xen, and 3) the off-by-default performance
>>>>>> counters is probably rather small.
>>>>> We'd probably want to only enable this for affected processors, not
>>>>> remove it outright. But the problem is that we still don't know for sure
>>>>> whether this issue affects NHM only, do we?
>>>>>
>>>>> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
>>>>> is the original message)
>>>> Yes, the basic problem is that we don't know where to draw the line.
>>> vPMU is disabled by default for security reasons,
>>
>> Is there any document about the possible attacks via the vPMU? The
>> documents I found (such as [1] and XSA-163) just briefly say that the
>> vPMU should be disabled due to security concerns.
>>
>> [1] https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html
> Cross-guest information leaks, presumably.

Plenty of "not context switching things properly". Off the top of my
head, there was also a straight DoS by blindly passing guest values
into an unchecked wrmsr(), and privilege escalation via letting the
guest choose where ds_store dumped its data.

~Andrew
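[Editorial note: the "unchecked wrmsr()" class of bug mentioned above is worth a concrete illustration. The usual defence is to validate a guest-supplied MSR value against a mask of the bits the guest is actually allowed to set before it ever reaches the hardware. The mask value and function name below are invented for illustration; this is not Xen's code.]

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/*
 * Hypothetical mask of the guest-writable bits of a perf event-select
 * MSR.  Bits outside this mask are reserved or hypervisor-controlled;
 * the value here is illustrative, not taken from any real datasheet.
 */
#define PERF_EVTSEL_WRITABLE_MASK 0x00000000ffcfffffULL

/*
 * Reject any guest value that tries to set a bit outside the writable
 * mask.  Only on success is *out filled in, ready to be passed to a
 * real wrmsr().  Writing the raw guest value without such a check is
 * exactly the "straight DoS" failure mode: a #GP on reserved bits, or
 * a machine-state corruption, triggered entirely from guest context.
 */
bool sanitize_guest_evtsel(uint64_t guest_val, uint64_t *out)
{
    if (guest_val & ~PERF_EVTSEL_WRITABLE_MASK)
        return false;
    *out = guest_val;
    return true;
}
```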
* Re: VPMU interrupt unreliability
  2017-10-19 18:20 ` Meng Xu
  2017-10-19 18:24   ` Kyle Huey
@ 2017-10-20  7:07 ` Jan Beulich
  2017-10-23  2:50   ` Meng Xu
  1 sibling, 1 reply; 16+ messages in thread

From: Jan Beulich @ 2017-10-20 7:07 UTC (permalink / raw)
To: Meng Xu
Cc: Kevin Tian, Andrew Cooper, Dietmar Hahn, xen-devel@lists.xen.org,
    Kyle Huey, Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

>>> On 19.10.17 at 20:20, <xumengpanda@gmail.com> wrote:
> Is there any document about the possible attack via the vPMU? The
> document I found (such as [1] and XSA-163) just briefly say that the
> vPMU should be disabled due to security concern.

Besides the other responses you've already got, I also recall there
being at least some CPU models that would live lock upon the debug
store being placed into virtual space not mapped by present pages.

Jan
* Re: VPMU interrupt unreliability
  2017-10-20  7:07 ` Jan Beulich
@ 2017-10-23  2:50 ` Meng Xu
  0 siblings, 0 replies; 16+ messages in thread

From: Meng Xu @ 2017-10-23 2:50 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Andrew Cooper, Dietmar Hahn, xen-devel@lists.xen.org,
    Kyle Huey, Jun Nakajima, Boris Ostrovsky, Robert O'Callahan

On Fri, Oct 20, 2017 at 3:07 AM, Jan Beulich <JBeulich@suse.com> wrote:
>
> >>> On 19.10.17 at 20:20, <xumengpanda@gmail.com> wrote:
> > Is there any document about the possible attack via the vPMU? The
> > document I found (such as [1] and XSA-163) just briefly say that the
> > vPMU should be disabled due to security concern.
>
> Besides the other responses you've already got, I also recall there
> being at least some CPU models that would live lock upon the
> debug store being placed into virtual space not mapped by present
> pages.

Thank you very much for your explanation! :)

Best Regards,

Meng

-----------
Meng Xu
Ph.D. Candidate in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/
* Re: VPMU interrupt unreliability
  2017-07-22 20:16 VPMU interrupt unreliability Kyle Huey
  2017-07-24 14:08 ` Boris Ostrovsky
@ 2017-07-24 14:13 ` Andrew Cooper
  2017-07-24 14:32   ` Kyle Huey
  1 sibling, 1 reply; 16+ messages in thread

From: Andrew Cooper @ 2017-07-24 14:13 UTC (permalink / raw)
To: Kyle Huey, Boris Ostrovsky, xen-devel
Cc: Kevin Tian, Jan Beulich, Dietmar Hahn

On 22/07/17 21:16, Kyle Huey wrote:
> Last year I reported[0] seeing occasional instability in performance
> counter values when running rr[1], which depends on completely
> deterministic counts of retired conditional branches of userspace
> programs.
>
> I recently identified the cause of this problem. Xen's VPMU code
> contains a workaround for an alleged Nehalem bug that was added in
> 2010[2]. Supposedly if a hardware performance counter reaches 0
> exactly during a PMI another PMI is generated potentially causing an
> endless loop. The workaround is to set the counter to 1. In 2013 the
> original bug was believed to affect more than just Nehalem and the
> workaround was enabled for all family 6 CPUs.[3] This workaround
> unfortunately disturbs the counter value in non-deterministic ways
> (since the value the counter has in the irq handler depends on
> interrupt latency), which is fatal to rr.
>
> I've verified that the discrepancies we see in the counted values are
> entirely accounted for by the number of times the workaround is used
> in any given run. Furthermore, patching Xen not to use this
> workaround makes the discrepancies in the counts vanish. I've added
> code[4] to rr that reliably detects this problem from guest userspace.
>
> Even with the workaround removed in Xen I see some additional issues
> (but not disturbed counter values) with the PMI, such as interrupts
> occasionally not being delivered to the guest. I haven't done much
> work to track these down, but my working theory is that interrupts
> that "skid" out of the guest that requested them and into Xen itself
> or perhaps even another guest are not being delivered.
>
> Our current plan is to stop depending on the PMI during rr's recording
> phase (which we use for timeslicing tracees primarily because it's
> convenient) to enable producing correct recordings in Xen guests.
> Accurate replay will not be possible under virtualization because of
> the PMI issues; that will require transferring the recording to
> another machine. But that will be sufficient to enable the use cases
> we care about (e.g. record an automated process on a cloud computing
> provider and have an engineer download and replay a failing recording
> later to debug it).
>
> I can think of several possible ways to fix the overcount problem, including:
> 1. Restricting the workaround to apply only to older CPUs and not all
> family 6 Intel CPUs forever.
> 2. Intercepting MSR loads for counters that have the workaround
> applied and giving the guest the correct counter value.
> 3. Or perhaps even changing the workaround to disable the PMI on that
> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
> on the relevant hardware.
>
> Since I don't have the relevant hardware to test changes to this
> workaround on and rr can avoid these bugs through other means I don't
> expect to work on this myself, but I wanted to apprise you of what
> we've learned.

Thank you for this investigation and analysis.

I think the first action is to try and identify what this mysterious
erratum is. Despite the plethora of perf errata, the best I can find is
AAK135 "Multiple Performance Monitor Interrupts are Possible on Overflow
of IA32_FIXED_CTR2", which still doesn't obviously match the described
symptoms.

CC'ing Dietmar, who was the author of the original workaround. Do you
recall any other information which might be helpful in tracking this
down? I also don't see any similar workaround in the Linux event
infrastructure, which makes me wonder whether the observed behaviour was
a side effect of something else Xen-specific.

Having Xen perturb the counters behind a guest's back (in a way contrary
to architectural or errata behaviour) is obviously a bad thing, and we
should fix that. I do have access to hardware, but am lacking vPMU
expertise.

~Andrew
* Re: VPMU interrupt unreliability
  2017-07-24 14:13 ` Andrew Cooper
@ 2017-07-24 14:32 ` Kyle Huey
  0 siblings, 0 replies; 16+ messages in thread

From: Kyle Huey @ 2017-07-24 14:32 UTC (permalink / raw)
To: Andrew Cooper
Cc: Kevin Tian, Boris Ostrovsky, Jan Beulich, Dietmar Hahn, xen-devel

On Mon, Jul 24, 2017 at 7:13 AM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 22/07/17 21:16, Kyle Huey wrote:
>> Last year I reported[0] seeing occasional instability in performance
>> counter values when running rr[1], which depends on completely
>> deterministic counts of retired conditional branches of userspace
>> programs.
>>
>> I recently identified the cause of this problem. Xen's VPMU code
>> contains a workaround for an alleged Nehalem bug that was added in
>> 2010[2]. Supposedly if a hardware performance counter reaches 0
>> exactly during a PMI another PMI is generated potentially causing an
>> endless loop. The workaround is to set the counter to 1. In 2013 the
>> original bug was believed to affect more than just Nehalem and the
>> workaround was enabled for all family 6 CPUs.[3] This workaround
>> unfortunately disturbs the counter value in non-deterministic ways
>> (since the value the counter has in the irq handler depends on
>> interrupt latency), which is fatal to rr.
>>
>> I've verified that the discrepancies we see in the counted values are
>> entirely accounted for by the number of times the workaround is used
>> in any given run. Furthermore, patching Xen not to use this
>> workaround makes the discrepancies in the counts vanish. I've added
>> code[4] to rr that reliably detects this problem from guest userspace.
>>
>> Even with the workaround removed in Xen I see some additional issues
>> (but not disturbed counter values) with the PMI, such as interrupts
>> occasionally not being delivered to the guest. I haven't done much
>> work to track these down, but my working theory is that interrupts
>> that "skid" out of the guest that requested them and into Xen itself
>> or perhaps even another guest are not being delivered.
>>
>> Our current plan is to stop depending on the PMI during rr's recording
>> phase (which we use for timeslicing tracees primarily because it's
>> convenient) to enable producing correct recordings in Xen guests.
>> Accurate replay will not be possible under virtualization because of
>> the PMI issues; that will require transferring the recording to
>> another machine. But that will be sufficient to enable the use cases
>> we care about (e.g. record an automated process on a cloud computing
>> provider and have an engineer download and replay a failing recording
>> later to debug it).
>>
>> I can think of several possible ways to fix the overcount problem, including:
>> 1. Restricting the workaround to apply only to older CPUs and not all
>> family 6 Intel CPUs forever.
>> 2. Intercepting MSR loads for counters that have the workaround
>> applied and giving the guest the correct counter value.
>> 3. Or perhaps even changing the workaround to disable the PMI on that
>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>> on the relevant hardware.
>>
>> Since I don't have the relevant hardware to test changes to this
>> workaround on and rr can avoid these bugs through other means I don't
>> expect to work on this myself, but I wanted to apprise you of what
>> we've learned.
>
> Thank you for this investigation and analysis.
>
> I think the first action is to try and identify what this mysterious
> erratum is. Despite the plethora of perf errata, the best I can find is
> AAK135 "Multiple Performance Monitor Interrupts are Possible on Overflow
> of IA32_FIXED_CTR2", which still doesn't obviously match the described
> symptoms.

I think it may be BJ58 "Performance-Counter Overflow Indication May
Cause Undesired Behavior".

> CC'ing Dietmar, who was the author of the original workaround. Do you
> recall any other information which might be helpful in tracking this
> down? I also don't see any similar workaround in the Linux event
> infrastructure, which makes me wonder whether the observed behaviour was
> a side effect of something else Xen-specific.

Haitao Shan wrote "The issue causing interrupt loop is: It seems that
on NHM (at that time) when a PMI arrives at CPU, the counter has a
value of zero (instead of some other small value, say 3 or 5, seen on
Core 2 Duo). In this case, unmasking the PMI via APIC will trigger
immediately another PMI. This does not produce problem with native
kernel, since it typically programs the counter with another value (as
needed by making yet another sampling point) before unmasking. For
Xen, PMI handler cannot handle the counter immediately since it should
be handled by guests. It just records a virtual PMI to guests and
unmasks the PMI before return."

https://lists.xen.org/archives/html/xen-devel/2013-03/msg02615.html

> Having Xen perturb the counters behind a guest's back (in a way contrary
> to architectural or errata behaviour) is obviously a bad thing, and we
> should fix that. I do have access to hardware, but am lacking vPMU
> expertise.

- Kyle
end of thread, other threads: [~2017-10-23 2:50 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-22 20:16 VPMU interrupt unreliability Kyle Huey
2017-07-24 14:08 ` Boris Ostrovsky
2017-07-24 14:26 ` Kyle Huey
2017-07-24 15:07 ` Boris Ostrovsky
2017-07-24 16:54 ` Kyle Huey
2017-10-10 16:54 ` Kyle Huey
2017-10-11 14:09 ` Boris Ostrovsky
2017-10-19 15:09 ` Kyle Huey
2017-10-19 15:40 ` Andrew Cooper
2017-10-19 18:20 ` Meng Xu
2017-10-19 18:24 ` Kyle Huey
2017-10-19 18:38 ` Andrew Cooper
2017-10-20  7:07 ` Jan Beulich
2017-10-23  2:50 ` Meng Xu
2017-07-24 14:13 ` Andrew Cooper
2017-07-24 14:32 ` Kyle Huey