From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Jan Beulich <JBeulich@suse.com>, Tim Deegan <tim@xen.org>
Cc: xen-devel <xen-devel@lists.xenproject.org>, Keir Fraser <keir@xen.org>
Subject: Re: [PATCH v5][XSA-97] x86/paging: make log-dirty operations preemptible
Date: Mon, 15 Sep 2014 08:50:11 +0100 [thread overview]
Message-ID: <54169A33.7020305@citrix.com> (raw)
In-Reply-To: <541300BC02000078000347D8@mail.emea.novell.com>
On 12/09/2014 13:18, Jan Beulich wrote:
>>>> On 11.09.14 at 20:27, <andrew.cooper3@citrix.com> wrote:
>> On inspection of the libxc logs, I am feeling quite glad I left this
>> debugging message in:
>>
>> xenguest-75-save[11876]: xc: detail: Bitmap contained more entries than
>> expected...
>> xenguest-83-save[32123]: xc: detail: Bitmap contained more entries than
>> expected...
>> xenguest-84-save[471]: xc: detail: Bitmap contained more entries than
>> expected...
>> xenguest-88-save[3823]: xc: detail: Bitmap contained more entries than
>> expected...
>> xenguest-89-save[4656]: xc: detail: Bitmap contained more entries than
>> expected...
>> xenguest-95-save[9379]: xc: detail: Bitmap contained more entries than
>> expected...
>> xenguest-98-save[11784]: xc: detail: Bitmap contained more entries than
>> expected...
>>
>> This means that periodically, a XEN_DOMCTL_SHADOW_OP_{CLEAN,PEEK}
>> hypercall hands us back a bitmap with more set bits than the
>> stats.dirty_count it returns at the same time.
> Which has quite a simple explanation, at least when (as I assume is
> the case) XEN_DOMCTL_SHADOW_OP_CLEAN is being used: we clear
> d->arch.paging.log_dirty.{dirty,fault}_count on each step of the
> continuation instead of just on the final one. That code needs moving
> down.
Agreed.
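The effect can be demonstrated in isolation. The following stand-alone sketch (all names invented; this is not Xen code) models a preemptible CLEAN operation that scans the dirty bitmap one chunk per continuation step, and shows why resetting the running count on every step under-reports the total:

```c
/*
 * Stand-alone sketch, not Xen code: all names here are invented for
 * illustration.  A preemptible CLEAN scans the dirty map one chunk per
 * continuation step; clearing the running count on each step means only
 * the final chunk's dirty pages are reported to the caller.
 */
#include <assert.h>
#include <stddef.h>

#define PAGES 64
#define CHUNK 16                /* pages scanned per continuation step */

/* Count and clear the dirty flags in one chunk of the map. */
static unsigned clean_step(unsigned char *map, size_t start)
{
    unsigned found = 0;

    for ( size_t i = start; i < start + CHUNK; i++ )
        if ( map[i] )
        {
            found++;
            map[i] = 0;
        }

    return found;
}

/* Buggy: the count is cleared on each step, so only the final chunk's
 * dirty pages survive into the value handed back. */
static unsigned clean_buggy(unsigned char *map)
{
    unsigned count = 0;

    for ( size_t s = 0; s < PAGES; s += CHUNK )
    {
        count = 0;              /* wrong: belongs on the final step only */
        count += clean_step(map, s);
    }

    return count;
}

/* Fixed: the count accumulates across steps and is consumed once. */
static unsigned clean_fixed(unsigned char *map)
{
    unsigned count = 0;

    for ( size_t s = 0; s < PAGES; s += CHUNK )
        count += clean_step(map, s);

    return count;
}
```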
>
> This, however, points at a broader problem: The necessary dropping
> of the paging lock between continuation steps implies that subsequent
> setting of bits in the map would need to distinguish between ranges
> where the bitmap already got copied and those still outstanding when
> it comes to updating dirty_count. I.e. no matter whether we leave
> the current copying/clearing where it is or move it down, the counts
> wouldn't be precise.
Having a "logdirty dying" or "logdirty shutdown" state could compensate
for this by no longer having paging_log_dirty() return true while the
continuations are in progress.
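A minimal sketch of such a tri-state (names invented here, not proposed patch code): while a disable operation's continuation is still in flight the mode is neither fully on nor fully off, and the predicate used by the fault paths already reports false, so nothing updates the counts against the half-torn-down state.

```c
/*
 * Sketch of the suggested "dying" state; all names are invented for
 * illustration.  While the disable continuation is in progress, the
 * log-dirty predicate already reports false.
 */
#include <assert.h>
#include <stdbool.h>

enum logdirty_state {
    LOGDIRTY_OFF,
    LOGDIRTY_ON,
    LOGDIRTY_DYING,             /* disable continuation in progress */
};

static inline bool paging_log_dirty(enum logdirty_state s)
{
    return s == LOGDIRTY_ON;
}
```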
>
> Now both from your description and from looking at current code
> I conclude that this is with your new migration code, since current
> code uses both counts only for printing messages. Which in turn
> raises the question whether use of these counts, getting exported
> just as statistics, for anything other than statistics is actually
> appropriate. Otoh, if your new migration code doesn't use the
> counts for non-statistical, non-logging purposes I still can't see why
> things go wrong for you but not for me or osstest.
It is indeed migration v2, which is necessary in XenServer given our
recent switch from a 32bit dom0 to 64bit. The counts are only used for
logging and debugging purposes; all movement of pages is based on the
bits in the bitmap alone. In particular, the dirty count is used as the
basis of the statistics for the current iteration of migration. While
getting it wrong is not the end of the world, it would certainly be
preferable for the count to be accurate.
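Since the bitmap alone is authoritative, the saver-side sanity check behind the "Bitmap contained more entries than expected" message can be sketched as follows (names invented; this is not the migration v2 code): count the set bits actually handed back and compare against the dirty_count returned alongside.

```c
/*
 * Minimal sketch, not the migration v2 code: names are invented.  More
 * set bits in the returned bitmap than the reported dirty_count means
 * the hypervisor's statistics are out of sync with the map.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (8 * sizeof(unsigned long))

/* Population count over the first nr_bits of the bitmap. */
static size_t bitmap_weight(const unsigned long *bitmap, size_t nr_bits)
{
    size_t set = 0;

    for ( size_t i = 0; i < nr_bits; i++ )
        if ( bitmap[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)) )
            set++;

    return set;
}

/* True when the reported dirty_count is consistent with the bitmap. */
static bool stats_consistent(const unsigned long *bitmap, size_t nr_bits,
                             unsigned long dirty_count)
{
    return bitmap_weight(bitmap, nr_bits) <= dirty_count;
}
```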
As for the memory corruption, XenRT usually tests pairs of VMs at a time
(32 and 64bit variants) and all operations as back-to-back as possible.
Therefore, it is highly likely that a continued operation on one domain
intersects with other paging operations on another.
The results (now the tests have run to completion) are 10 runs each way:
10 passes without this patch, and 10 failures in similar ways with it,
spread across a randomly selected set of hardware.
We see failures with HVM VMs as well, although there are no errors at
all from Xen or the toolstack components. Symptoms range from BSODs to
simply wedging with no apparent cause (I have not had a live repro to
investigate).
~Andrew
Thread overview: 8+ messages
2014-09-05 10:47 [PATCH v5][XSA-97] x86/paging: make log-dirty operations preemptible Jan Beulich
2014-09-11 18:27 ` Andrew Cooper
2014-09-12 12:18 ` Jan Beulich
2014-09-15 7:50 ` Andrew Cooper [this message]
2014-09-15 12:54 ` Jan Beulich
2014-09-15 13:56 ` Andrew Cooper
2014-09-15 14:20 ` Jan Beulich
2014-09-15 14:37 ` Andrew Cooper