From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
To: Kyle Huey <me@kylehuey.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
xen-devel@lists.xen.org
Cc: "Tian, Kevin" <kevin.tian@intel.com>,
Robert O'Callahan <robert@ocallahan.org>,
Jun Nakajima <jun.nakajima@intel.com>
Subject: Re: VPMU interrupt unreliability
Date: Mon, 24 Jul 2017 10:08:17 -0400
Message-ID: <23dd26f5-d217-dc99-6e3c-02ff46bf2f7a@oracle.com>
In-Reply-To: <CAP045Arh6NMGkv=Khguyc+40gaN1fCO3T1MPvNOnThMT9uPSbQ@mail.gmail.com>
On 07/22/2017 04:16 PM, Kyle Huey wrote:
> Last year I reported[0] seeing occasional instability in performance
> counter values when running rr[1], which depends on completely
> deterministic counts of retired conditional branches of userspace
> programs.
>
> I recently identified the cause of this problem. Xen's VPMU code
> contains a workaround, added in 2010[2], for an alleged Nehalem bug.
> Supposedly, if a hardware performance counter reaches 0 exactly during
> a PMI, another PMI is generated, potentially causing an endless loop.
> The workaround is to set the counter to 1. In 2013 the original bug
> was believed to affect more than just Nehalem, and the workaround was
> enabled for all family 6 CPUs.[3] This workaround unfortunately
> disturbs the counter value in non-deterministic ways (since the value
> the counter holds in the irq handler depends on interrupt latency),
> which is fatal to rr.
>
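For anyone who hasn't read vpmu_core2.c, the behaviour described above
boils down to roughly the following C sketch. This is a paraphrase, not
the actual Xen source; the rdmsrl/wrmsrl helpers and MSR constants are
the usual x86 ones:

    /*
     * In the PMI handler: if PMC0's overflow bit is set and the counter
     * now reads exactly 0, rewrite it to 1 to avoid the alleged
     * endless-PMI loop.  This is what makes the reported count one too
     * high whenever the quirk fires.
     */
    uint64_t status, cnt;

    rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
    if ( status & (1ULL << 0) )              /* overflow bit for PMC0 */
    {
        rdmsrl(MSR_IA32_PERFCTR0, cnt);
        if ( cnt == 0 )                      /* hit 0 exactly during the PMI */
            wrmsrl(MSR_IA32_PERFCTR0, 1);    /* the quirk: bump to 1 */
    }
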
> I've verified that the discrepancies we see in the counted values are
> entirely accounted for by the number of times the workaround is used
> in any given run. Furthermore, patching Xen not to use this
> workaround makes the discrepancies in the counts vanish. I've added
> code[4] to rr that reliably detects this problem from guest userspace.
>
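The detection in [4] can be approximated from guest userspace with
perf_event_open(2), roughly as below. This is a simplified sketch, not
rr's actual code: the raw event encoding is assumed here to be the one
rr's table uses for Intel (see [4]), and the sketch glosses over the
conditional branches executed outside the measured loop, which rr's
real check accounts for much more carefully.

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x5101c4;     /* retired conditional branches (Intel),
                                       assumed encoding; see [4] */
        attr.sample_period = 500;   /* ask for a PMI at event 500 */
        attr.exclude_kernel = 1;

        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        /* Run a loop whose conditional-branch count we control. */
        volatile int sink = 0;
        for (int i = 0; i < 500; i++)
            sink += i;

        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        close(fd);

        /* On an affected Xen host the quirk bumps the counter to 1 inside
           the PMI handler, so the value read back is one too high. */
        printf("counted %llu conditional branches\n",
               (unsigned long long)count);
        return 0;
    }
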
> Even with the workaround removed in Xen I see some additional issues
> (but not disturbed counter values) with the PMI, such as interrupts
> occasionally not being delivered to the guest. I haven't done much
> work to track these down, but my working theory is that interrupts
> that "skid" out of the guest that requested them and into Xen itself
> or perhaps even another guest are not being delivered.
>
> Our current plan is to stop depending on the PMI during rr's recording
> phase (which we use for timeslicing tracees primarily because it's
> convenient) to enable producing correct recordings in Xen guests.
> Accurate replay will not be possible under virtualization because of
> the PMI issues; replaying will require transferring the recording to
> another machine. But this will be sufficient to enable the use cases
> we care about (e.g. record an automated process on a cloud computing
> provider and have an engineer download and replay a failing recording
> later to debug it).
>
> I can think of several possible ways to fix the overcount problem, including:
> 1. Restricting the workaround to apply only to older CPUs and not all
> family 6 Intel CPUs forever.
IIRC the question of which processors this workaround is applicable to
was raised, and the Intel folks (copied here) couldn't find an answer.

One thing I noticed is that the workaround doesn't appear to be
complete: it only checks PMC0's overflow status, not that of the other
counters (the remaining general-purpose counters or the fixed ones).
Of course, without knowing what the actual problem was, it's hard to
say whether this was intentional.
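For comparison, a version of the check that covered every counter would
presumably have to walk all of the overflow bits in GLOBAL_STATUS,
along these lines. This is only a hedged sketch, not a proposed patch;
quirk_reset_counter(), nr_gp_counters and nr_fixed_counters are
hypothetical names:

    /* Inside the same PMI handler, with 'status' already read from
       MSR_CORE_PERF_GLOBAL_STATUS. */
    unsigned int i;

    /* General-purpose counters: overflow bits 0 .. nr_gp_counters-1. */
    for ( i = 0; i < nr_gp_counters; i++ )
        if ( status & (1ULL << i) )
            quirk_reset_counter(MSR_IA32_PERFCTR0 + i);

    /* Fixed-function counters: overflow bits 32 .. 32+nr_fixed-1. */
    for ( i = 0; i < nr_fixed_counters; i++ )
        if ( status & (1ULL << (32 + i)) )
            quirk_reset_counter(MSR_CORE_PERF_FIXED_CTR0 + i);
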
> 2. Intercepting MSR loads for counters that have the workaround
> applied and giving the guest the correct counter value.
We'd have to keep track of whether the counter has been reset (by the
quirk) since the last MSR write.
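In code, option 2 might look something like the sketch below on the
vPMU's counter-read path. All names here -- the quirk_fixup[] flags and
the read hook itself -- are hypothetical, and the real Xen entry points
may differ:

    /* Set when the quirk rewrites a counter from 0 to 1; cleared
       whenever the guest writes that counter MSR. */
    static bool quirk_fixup[8];   /* one flag per general-purpose counter */

    static int vpmu_counter_rdmsr(unsigned int idx, uint64_t *val)
    {
        rdmsrl(MSR_IA32_PERFCTR0 + idx, *val);
        if ( quirk_fixup[idx] )
            *val -= 1;     /* hide the quirk's extra count from the guest */
        return 0;
    }
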
> 3. Or perhaps even changing the workaround to disable the PMI on that
> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
> on the relevant hardware.
MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
runs (in core2_vpmu_do_interrupt()), so we already do this, don't we?
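If option 3 is instead read as "mask the interrupt rather than rewrite
the counter", the idea would be roughly the following. The names and
flow are hypothetical, and whether masking actually sidesteps the
original erratum is exactly the open hardware question:

    /* In the host PMI handler, instead of the counter rewrite: clear
       the INT bit (bit 20) in the counter's event-select MSR so it
       cannot raise further PMIs, and remember that we did so. */
    rdmsrl(MSR_P6_EVNTSEL0 + idx, evtsel);
    wrmsrl(MSR_P6_EVNTSEL0 + idx, evtsel & ~(1ULL << 20));
    pmi_masked[idx] = true;

    /* Later, when the guest writes MSR_CORE_PERF_GLOBAL_OVF_CTRL to
       ack the overflow, restore the INT bit. */
    if ( pmi_masked[idx] )
    {
        rdmsrl(MSR_P6_EVNTSEL0 + idx, evtsel);
        wrmsrl(MSR_P6_EVNTSEL0 + idx, evtsel | (1ULL << 20));
        pmi_masked[idx] = false;
    }
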
Thanks for looking into this. It would also be interesting to
see/confirm how some interrupts are (possibly) being lost.
-boris
>
> Since I don't have the relevant hardware to test changes to this
> workaround on, and rr can avoid these bugs through other means, I
> don't expect to work on this myself, but I wanted to apprise you of
> what we've learned.
>
> - Kyle
>
> [0] https://lists.xen.org/archives/html/xen-devel/2016-10/msg01288.html
> [1] http://rr-project.org/
> [2] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=44aa8e3c47fc02e401f5c382d89b97eef0cd2019;hp=ce4fd2d43e04db5e9b042344dd294cfa11e1f405;hb=3ed6a063d2a5f6197306b030e8c27c36d5f31aa1;hpb=566f83823996cf9c95f9a0562488f6b1215a1052
> [3] https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=15b2036c8db1e56d8865ee34c363e7f23aa75e33;hp=9f152b48c26dfeedb6f94189a5fe4a5f7a772d83;hb=75a92f551ade530ebab73a0c3d4934dfb28149b5;hpb=71fc4da1306cec55a42787310b01a1cb52489abc
> [4] See https://github.com/mozilla/rr/blob/a5d23728cd7d01c6be0c79852af26c68160d4405/src/PerfCounters.cc#L313,
> which sets up a counter and then does some pointless math in a loop to
> reach exactly 500 conditional branches. Xen will report 501 branches
> because of this bug.