From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Re: PML (Page Modification Logging) design for Xen
Date: Fri, 13 Feb 2015 10:57:48 +0000
Message-ID: <54DDD8AC.1040102@citrix.com>
References: <54DB129D.3060102@linux.intel.com>
	<54DB4294.1080406@citrix.com>	<54DC1249.60507@linux.intel.com>	<AADFC41AFE54684AB9EE6CBC0274A5D12618F6DD@SHSMSX101.ccr.corp.intel.com>
	<54DCB43A.7080609@citrix.com> <54DD5D3D.7000601@linux.intel.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============2334140116181706830=="
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <54DD5D3D.7000601@linux.intel.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Kai Huang <kai.huang@linux.intel.com>, "Tian, Kevin" <kevin.tian@intel.com>, "jbeulich@suse.com" <jbeulich@suse.com>, "tim@xen.org" <tim@xen.org>, "keir@xen.org" <keir@xen.org>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

--===============2334140116181706830==
Content-Type: multipart/alternative;
	boundary="------------040302080808020700050000"

--------------040302080808020700050000
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: quoted-printable

On 13/02/15 02:11, Kai Huang wrote:
>
> On 02/12/2015 10:10 PM, Andrew Cooper wrote:
>> On 12/02/15 06:54, Tian, Kevin wrote:
>>>>> which presumably
>>>>> means that the PML buffer flush needs to be aware of which gfns are
>>>>> mapped by superpages to be able to correctly set a block of bits in the
>>>>> logdirty bitmap.
>>>>>
>>>> Unfortunately PML itself can't tell us if the logged GPA comes from
>>>> superpage or not, but even in PML we still need to split superpages to
>>>> 4K page, just like traditional write protection approach does. I think
>>>> this is because live migration should be based on 4K page granularity.
>>>> Marking all 512 bits of a 2M page to be dirty by a single write doesn't
>>>> make sense in both write protection and PML cases.
>>>>
>>> agree. extending one write to superpage enlarges dirty set unnecessary.
>>> since spec doesn't say superpage logging is not supported, I'd think a
>>> 4k-aligned entry being logged if within superpage.
>> The spec states that an gfn is appended to the log strictly on the
>> transition of the D bit from 0 to 1.
>>
>> In the case of a 2M superpage, there is a single D bit for the entire 2M
>> range.
>>
>>
>> The plausible (working) scenarios I can see are:
>>
>> 1) superpages are not supported (not indicated by the whitepaper).
> A better description would be -- PML doesn't check if it's superpage,
> it just operates with D-bit, no matter what page size.
>> 2) a single entry will be written which must be taken to cover the
>> entire 2M range.
>> 3) an individual entry is written for every access.
> Below is the reply from our hardware guy related to PML on superpage.
> It should have answered accurately.
>
> "As noted in Section 1.3, logging occurs whenever the CPU would set an
> EPT D bit.
>
> It does not matter whether the D bit is in an EPT PTE (4KB page), EPT
> PDE (2MB page), or EPT PDPTE (1GB page).
>
> In all cases, the GPA written to the PML log will be the address of
> the write that causes the D bit in question to be updated, with bits
> 11:0 cleared.
>
> This means that, in the case in which the D bit is in an EPT PDE or an
> EPT PDPTE, the log entry will communicate which 4KB region within the
> larger page was being written.
>
> Once the D bit is set in one of these entries, a subsequent write to
> the larger page will not generate a log entry, even if that write is
> to a different 4KB region within the larger page.  This is because log
> entries are created only when a D bit is being set and a write will
> not cause a D bit to be set if the page's D bit is already set.
>
> The log entries do not communicate the level of the EPT
> paging-structure entry in which the D bit was set (i.e., it does not
> communicate the page size). "

Thanks for the clarification.

The result of this behaviour is that the PML flush logic is going to
have to look up each gfn and check whether it is mapped by a superpage,
which will add a sizeable overhead.
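
To make the cost concrete, here is a rough, self-contained sketch of
what such a flush would have to do.  Everything in it is a toy stand-in
invented for illustration (toy_mapping_order(), toy_pml_flush(), the
flat bitmap and the 8MB guest size); none of it is real Xen code, and a
real implementation would need a genuine p2m lookup per logged entry.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define PAGE_SHIFT    12
#define NR_GFNS       2048   /* toy guest: 8MB of 4K frames (assumption) */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Toy logdirty bitmap: one bit per 4K gfn. */
static unsigned long dirty_bitmap[NR_GFNS / BITS_PER_WORD];

/*
 * Hypothetical lookup: order of the EPT mapping covering gfn
 * (0 = 4K, 9 = 2M, 18 = 1G).  In this toy guest only the first 2M is
 * mapped by a superpage.
 */
static unsigned int toy_mapping_order(uint64_t gfn)
{
    return gfn < 512 ? 9 : 0;
}

static void mark_dirty(uint64_t gfn)
{
    dirty_bitmap[gfn / BITS_PER_WORD] |= 1UL << (gfn % BITS_PER_WORD);
}

static int is_dirty(uint64_t gfn)
{
    return (dirty_bitmap[gfn / BITS_PER_WORD] >> (gfn % BITS_PER_WORD)) & 1;
}

/*
 * Walk nr logged GPAs.  The log does not say which page size set the D
 * bit, so every entry needs a lookup, and a hit inside a superpage has
 * to dirty the whole naturally aligned range because later writes to
 * that superpage are not logged.  Returns the number of 4K pages marked.
 */
static unsigned long toy_pml_flush(const uint64_t *log, unsigned int nr)
{
    unsigned long marked = 0;

    for (unsigned int i = 0; i < nr; i++) {
        uint64_t gfn = log[i] >> PAGE_SHIFT;
        unsigned int order = toy_mapping_order(gfn);

        assert(order == 0 || order == 9 || order == 18);

        uint64_t start = gfn & ~((1ULL << order) - 1);
        uint64_t end = start + (1ULL << order);

        for (uint64_t g = start; g < end && g < NR_GFNS; g++, marked++)
            mark_dirty(g);
    }

    return marked;
}
```

With a log of { 0x123000, 0x300000 }, the first entry lands in the toy
superpage and dirties all 512 of its 4K pages, while the second dirties
exactly one.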

It is also not conducive to minimising the data transmitted in the
migration stream.


One future option might be to shatter all the EPT superpages when
logdirty is enabled.  This would be ok for a domain which is being
migrated away, but would be suboptimal for snapshot operations; Xen
currently has no ability to coalesce pages back into superpages.  It
also interacts poorly with HAP vram tracking which enables logdirty mode
itself.

~Andrew

--------------040302080808020700050000
Content-Type: text/html; charset="windows-1252"
Content-Length: 4762
Content-Transfer-Encoding: quoted-printable

<html>
  <head>
    <meta content=3D"text/html; charset=3Dwindows-1252"
      http-equiv=3D"Content-Type">
  </head>
  <body text=3D"#000000" bgcolor=3D"#FFFFFF">
    <div class=3D"moz-cite-prefix">On 13/02/15 02:11, Kai Huang wrote:<br>
    </div>
    <blockquote cite=3D"mid:54DD5D3D.7000601@linux.intel.com" type=3D"cite">
      <meta http-equiv=3D"Content-Type" content=3D"text/html;
        charset=3Dwindows-1252">
      <br>
      <div class=3D"moz-cite-prefix">On 02/12/2015 10:10 PM, Andrew Cooper
        wrote:<br>
      </div>
      <blockquote cite=3D"mid:54DCB43A.7080609@citrix.com" type=3D"cite">
        <pre wrap=3D"">On 12/02/15 06:54, Tian, Kevin wrote:
</pre>
        <blockquote type=3D"cite">
          <blockquote type=3D"cite">
            <blockquote type=3D"cite">
              <pre wrap=3D"">which presumably
means that the PML buffer flush needs to be aware of which gfns are
mapped by superpages to be able to correctly set a block of bits in the
logdirty bitmap.

</pre>
            </blockquote>
            <pre wrap=3D"">Unfortunately PML itself can't tell us if the logged GPA comes from
superpage or not, but even in PML we still need to split superpages to
4K page, just like traditional write protection approach does. I think
this is because live migration should be based on 4K page granularity.
Marking all 512 bits of a 2M page to be dirty by a single write doesn't
make sense in both write protection and PML cases.

</pre>
          </blockquote>
          <pre wrap=3D"">agree. extending one write to superpage enlarges dirty set unnecessary.
since spec doesn't say superpage logging is not supported, I'd think a
4k-aligned entry being logged if within superpage.
</pre>
        </blockquote>
        <pre wrap=3D"">The spec states that an gfn is appended to the log strictly on the
transition of the D bit from 0 to 1.

In the case of a 2M superpage, there is a single D bit for the entire 2M
range.


The plausible (working) scenarios I can see are:

1) superpages are not supported (not indicated by the whitepaper).</pre>
      </blockquote>
      A better description would be -- PML doesn't check if it's
      superpage, it just operates with D-bit, no matter what page size.<br>
      <blockquote cite=3D"mid:54DCB43A.7080609@citrix.com" type=3D"cite">
        <pre wrap=3D"">2) a single entry will be written which must be taken to cover the
entire 2M range.
3) an individual entry is written for every access.</pre>
      </blockquote>
      Below is the reply from our hardware guy related to PML on
      superpage. It should have answered accurately.<br>
      <br>
      "As noted in Section 1.3, logging occurs whenever the CPU would
      set an EPT D bit.<br>
      <br>
      It does not matter whether the D bit is in an EPT PTE (4KB page),
      EPT PDE (2MB page), or EPT PDPTE (1GB page).<br>
      <br>
      In all cases, the GPA written to the PML log will be the address
      of the write that causes the D bit in question to be updated, with
      bits 11:0 cleared.<br>
      <br>
      This means that, in the case in which the D bit is in an EPT PDE
      or an EPT PDPTE, the log entry will communicate which 4KB region
      within the larger page was being written.<br>
      <br>
      Once the D bit is set in one of these entries, a subsequent write
      to the larger page will not generate a log entry, even if that
      write is to a different 4KB region within the larger page.=A0 This
      is because log entries are created only when a D bit is being set
      and a write will not cause a D bit to be set if the page's D bit
      is already set.<br>
      <br>
      The log entries do not communicate the level of the EPT
      paging-structure entry in which the D bit was set (i.e., it does
      not communicate the page size). "<br>
    </blockquote>
    <br>
    Thanks for the clarification.<br>
    <br>
    The result of this behaviour is that the PML flush logic is going to
    have to look up each gfn and check whether it is mapped by a
    superpage, which will add a sizeable overhead.<br>
    <br>
    It is also not conducive to minimising the data transmitted in the
    migration stream.<br>
    <br>
    <br>
    One future option might be to shatter all the EPT superpages when
    logdirty is enabled.=A0 This would be ok for a domain which is being
    migrated away, but would be suboptimal for snapshot operations; Xen
    currently has no ability to coalesce pages back into superpages.=A0 It
    also interacts poorly with HAP vram tracking which enables logdirty
    mode itself.<br>
    <br>
    ~Andrew<br>
  </body>
</html>

--------------040302080808020700050000--


--===============2334140116181706830==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

--===============2334140116181706830==--