public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Pavel Emelyanov <xemul@parallels.com>
To: David Vrabel <david.vrabel@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Cyrill Gorcunov <gorcunov@gmail.com>,
	"H. Peter Anvin" <hpa@zytor.com>, Jan Beulich <JBeulich@suse.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	<Xen-devel@lists.xen.org>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Ingo Molnar <mingo@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use
Date: Thu, 22 Aug 2013 14:16:07 +0400	[thread overview]
Message-ID: <5215E4E7.1020808@parallels.com> (raw)
In-Reply-To: <5215DAC0.8080806@citrix.com>

On 08/22/2013 01:32 PM, David Vrabel wrote:
> On 22/08/13 00:04, Linus Torvalds wrote:
>> On Wed, Aug 21, 2013 at 12:03 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>>>
>>> I personally don't see bug here because
>>>
>>>  - this swapped page soft dirty bit is set for non-present entries only,
>>>    never for present ones, just at moment we form swap pte entry
>>>
>>>  - i don't find any code which would test for this bit directly without
>>>    is_swap_pte call
>>
>> Ok, having gone through the places that use swp_*soft_dirty(), I have
>> to agree. Afaik, it's only ever used on a swap-entry that has (by
>> definition) the P bit clear. So with or without Xen, I don't see how
>> it can make any difference.
>>
>> David/Konrad - did you actually see any issues, or was this just from
>> (mis)reading the code?
> 
> There are no Xen related bugs in the code, we were misreading it.
> 
> It was my call to raise this as a regression without a repro and clearly
> this was the wrong decision.
> 
> However, having looked at the soft dirty implementation and specifically
> the userspace ABI I think that it is far to closely coupled to the
> current implementation.  I think this will constrain future development
> of the feature should userspace require a more efficient ABI than
> scanning all of /proc/<pid>/pagemaps.
> 
> Minimal downtime during 'live' checkpointing of a running task needs the
> checkpointer to find and write out dirty pages faster than the task can
> dirty them.

Absolutely, but in "find and write" the "write" component is likely to take
the majority of time -- we can scan PTEs of a mapping MUCH faster, than
transmitting those over even 10Gbit link.

We actually see this IRL -- in CRIU there's an atomic test, that checks
mappings get dumped and restored properly. One of sub-tests is one 512Mb mapping.
With it total dump time _minus_ memory dump time (which includes not only pagemap
file scan, but also files, registers, process tree, sessions, etc.) is fractions 
of one second, while only the memory dump part's time is several seconds.

That said, the super-fast API for getting "what has changed" is not as tempting
to have as faster network/disk.

What is _more_ time consuming in iterative migration in our case is the need to
re-scan the whole /proc tree to get which processes had died and appeared, mess 
with /proc/pid/fd finding out what files were (re-)opened/closed/changed, talking
to sock_diag subsystem for sockets information and alike. However, we haven't yet
done careful analysis for what the slowest part is, but pagemap scans is definitely
not.

> This seems less likely to be possible if every iteration
> all PTEs have to be scanned by the checkpointer instead of (e.g.,)
> accessing a separate list of dirtied pages.

But we don't scan all the x64 virtual address space's PTEs, instead we first
analyze the /proc/pid/maps and scan only PTEs sitting in private mappings.

> David
> .
> 

Thanks,
Pavel

  reply	other threads:[~2013-08-22 10:16 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-21 13:48 Regression: x86/mm: new _PTE_SWP_SOFT_DIRTY bit conflicts with existing use David Vrabel
2013-08-21 13:53 ` konrad wilk
2013-08-21 14:11 ` H. Peter Anvin
2013-08-21 14:19   ` Cyrill Gorcunov
2013-08-21 14:22     ` H. Peter Anvin
2013-08-21 14:29       ` Cyrill Gorcunov
2013-08-21 16:30     ` Linus Torvalds
2013-08-21 16:42       ` Cyrill Gorcunov
2013-08-21 23:05       ` Cyrill Gorcunov
2013-08-21 23:42         ` Andi Kleen
2013-08-22  5:49           ` Cyrill Gorcunov
2013-08-22  6:37             ` Minchan Kim
2013-08-22 13:12               ` Cyrill Gorcunov
2013-08-27 22:04       ` Benjamin Herrenschmidt
2013-08-21 14:12 ` Cyrill Gorcunov
2013-08-21 14:22   ` H. Peter Anvin
2013-08-21 14:53   ` Jan Beulich
2013-08-21 14:58     ` H. Peter Anvin
2013-08-21 15:42     ` Cyrill Gorcunov
2013-08-21 16:03       ` Jan Beulich
2013-08-21 16:19         ` Cyrill Gorcunov
2013-08-21 16:56           ` David Vrabel
2013-08-21 17:25             ` Cyrill Gorcunov
2013-08-21 18:17               ` Cyrill Gorcunov
2013-08-21 18:50                 ` H. Peter Anvin
2013-08-21 19:03                   ` Cyrill Gorcunov
2013-08-21 19:07                     ` Andy Lutomirski
2013-08-21 19:20                       ` Cyrill Gorcunov
2013-08-21 19:21                       ` Pavel Emelyanov
2013-08-21 23:04                     ` Linus Torvalds
2013-08-22  0:51                       ` Dave Jones
2013-08-22  5:44                         ` Cyrill Gorcunov
2013-08-22  6:41                         ` Pavel Emelyanov
2013-08-22  7:47                       ` Jan Beulich
2013-08-22  9:32                       ` David Vrabel
2013-08-22 10:16                         ` Pavel Emelyanov [this message]
2013-08-22  6:56           ` Jan Beulich
2013-08-22  7:03             ` Cyrill Gorcunov
2013-08-22  7:27               ` Jan Beulich
2013-08-22 11:27                 ` Cyrill Gorcunov
2013-08-22 11:33                   ` Jan Beulich
2013-08-22 12:18                     ` Pavel Emelyanov
2013-08-21 17:28 ` Andy Lutomirski
2013-08-22  7:54   ` Jan Beulich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5215E4E7.1020808@parallels.com \
    --to=xemul@parallels.com \
    --cc=JBeulich@suse.com \
    --cc=Xen-devel@lists.xen.org \
    --cc=akpm@linux-foundation.org \
    --cc=boris.ostrovsky@oracle.com \
    --cc=david.vrabel@citrix.com \
    --cc=gorcunov@gmail.com \
    --cc=hpa@zytor.com \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@amacapital.net \
    --cc=mingo@redhat.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox