xen-devel.lists.xenproject.org archive mirror
From: Kyle Huey <me@kylehuey.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Kevin Tian <kevin.tian@intel.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	Jan Beulich <JBeulich@suse.com>,
	Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>,
	xen-devel@lists.xen.org
Subject: Re: VPMU interrupt unreliability
Date: Mon, 24 Jul 2017 07:32:47 -0700
Message-ID: <CAP045ArrN6srDVBr604G3VwKbJGXvFSvsUvuVfSBO2zD7v6opg@mail.gmail.com>
In-Reply-To: <9b74cd6e-d22a-0b8c-3219-7cb9410ea3a8@citrix.com>

On Mon, Jul 24, 2017 at 7:13 AM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 22/07/17 21:16, Kyle Huey wrote:
>> Last year I reported[0] seeing occasional instability in performance
>> counter values when running rr[1], which depends on completely
>> deterministic counts of retired conditional branches of userspace
>> programs.
>>
>> I recently identified the cause of this problem.  Xen's VPMU code
>> contains a workaround for an alleged Nehalem bug that was added in
>> 2010[2].  Supposedly if a hardware performance counter reaches 0
>> exactly during a PMI another PMI is generated potentially causing an
>> endless loop.  The workaround is to set the counter to 1.  In 2013 the
>> original bug was believed to affect more than just Nehalem and the
>> workaround was enabled for all family 6 CPUs.[3]  This workaround
>> unfortunately disturbs the counter value in non-deterministic ways
>> (since the value the counter has in the irq handler depends on
>> interrupt latency), which is fatal to rr.
>>
>> I've verified that the discrepancies we see in the counted values are
>> entirely accounted for by the number of times the workaround is used
>> in any given run.  Furthermore, patching Xen not to use this
>> workaround makes the discrepancies in the counts vanish.  I've added
>> code[4] to rr that reliably detects this problem from guest userspace.
>>
>> Even with the workaround removed in Xen I see some additional issues
>> (but not disturbed counter values) with the PMI, such as interrupts
>> occasionally not being delivered to the guest.  I haven't done much
>> work to track these down, but my working theory is that interrupts
>> that "skid" out of the guest that requested them and into Xen itself
>> or perhaps even another guest are not being delivered.
>>
>> Our current plan is to stop depending on the PMI during rr's recording
>> phase (which we use for timeslicing tracees primarily because it's
>> convenient) to enable producing correct recordings in Xen guests.
>> Accurate replay will not be possible under virtualization because of
>> the PMI issues; that will require transferring the recording to
>> another machine.  But that will be sufficient to enable the use cases
>> we care about (e.g. record an automated process on a cloud computing
>> provider and have an engineer download and replay a failing recording
>> later to debug it).
>>
>> I can think of several possible ways to fix the overcount problem, including:
>> 1. Restricting the workaround to apply only to older CPUs and not all
>> family 6 Intel CPUs forever.
>> 2. Intercepting MSR loads for counters that have the workaround
>> applied and giving the guest the correct counter value.
>> 3. Or perhaps even changing the workaround to disable the PMI on that
>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>> on the relevant hardware.
>>
>> Since I don't have the relevant hardware to test changes to this
>> workaround on and rr can avoid these bugs through other means I don't
>> expect to work on this myself, but I wanted to apprise you of what
>> we've learned.
>
> Thank you for this investigation and analysis.
>
> I think the first action is to try and identify what this mysterious
> erratum is.  Despite the plethora of perf errata, the best I can find is
> AAK135 "Multiple Performance Monitor Interrupts are Possible on Overflow
> of IA32_FIXED_CTR2" which still doesn't obviously match the described
> symptoms.

I think it may be BJ58 "Performance-Counter Overflow Indication May
Cause Undesired Behavior".

> CC'ing Dietmar who was the author of the original workaround.  Do you
> recall any other information which might be helpful in tracking this
> down?  I also don't see any similar workaround in the Linux event
> infrastructure, which makes me wonder whether the observed behaviour was
> a side effect of something else Xen specific.

Haitao Shan wrote:

"The issue causing interrupt loop is: It seems that on NHM (at that
time) when a PMI arrives at CPU, the counter has a value to zero
(instead of some other small value, say 3 or 5, seen on Core 2 Duo).
In this case, unmasking the PMI via APIC will trigger immediately
another PMI. This does not produce problem with native kernel, since
it typically programs the counter with another value (as needed by
making yet another sampling point) before unmasking. For Xen, PMI
handler cannot handle the counter immediately since it should be
handled by guests. It just records a virtual PMI to guests and unmasks
the PMI before return."

https://lists.xen.org/archives/html/xen-devel/2013-03/msg02615.html
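
To make the effect on rr concrete, here is a toy Python model of the
workaround. It is only a sketch under stated assumptions, not Xen's code:
the function name simulate, the skid distribution, and the sampling period
are all made up for illustration. The only behaviour taken from the thread
is that the handler rewrites a counter that reads exactly 0 to 1, and that
the value seen in the handler depends on interrupt latency.

```python
import random


def simulate(num_overflows, period, quirk_enabled, seed):
    """Toy model of a counter that raises a PMI every `period` events.

    Returns (true_events, observed_events, quirk_hits).
    Hypothetical helper for illustration; not derived from Xen source.
    """
    rng = random.Random(seed)
    true_events = 0      # events that really retired
    observed_events = 0  # events the guest computes from the counter
    quirk_hits = 0       # how often the handler rewrote 0 -> 1

    for _ in range(num_overflows):
        # Assumption: number of extra events retired between the overflow
        # and the handler reading the counter. It varies with latency.
        skid = rng.choice([0, 0, 1, 3, 5])

        # The counter wrapped to 0 at overflow and kept counting.
        counter = skid

        # The workaround: a counter that reads exactly 0 is set to 1.
        if quirk_enabled and counter == 0:
            counter = 1
            quirk_hits += 1

        true_events += period + skid
        observed_events += period + counter

    return true_events, observed_events, quirk_hits


if __name__ == "__main__":
    for seed in (1, 2):
        true_n, seen_n, hits = simulate(1000, 500000, True, seed)
        print(seed, seen_n - true_n, hits)
```

In this model the overcount always equals the number of times the
workaround fired, and that number changes with the latency pattern (here,
the seed), which matches what I measured. With the workaround disabled the
observed and true counts agree.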

> Having Xen perturb the counters behind a guest's back (in a way contrary
> to architectural or errata behaviour) is obviously a bad thing, and we
> should fix that.  I do have access to hardware, but am lacking vPMU
> expertise.

- Kyle
