From: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
To: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>,
	qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Markus Armbruster <armbru@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>, Den Lunev <den@openvz.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Alexander Duyck <alexander.duyck@gmail.com>
Subject: Re: [PATCH v13 0/5] UFFD write-tracking migration/snapshots
Date: Fri, 19 Feb 2021 09:57:37 +0300
Message-ID: <add9a7f7-9e02-5024-4bfd-2597a8920ec5@virtuozzo.com>
In-Reply-To: <20210216233545.GD91264@xz-x1>

On 17.02.2021 02:35, Peter Xu wrote:
> Hi, Andrey,
>
> On Sat, Feb 13, 2021 at 12:34:07PM +0300, Andrey Gruzdev wrote:
>> On 12.02.2021 19:11, Peter Xu wrote:
>>> On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote:
>>>> On 12.02.21 04:06, Peter Xu wrote:
>>>>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote:
>>>>>> The issue is when the discard happens before starting the snapshot. Write-protection won't work, and the zeroed content won't be retained in the snapshot.
>>>>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1.
>>>>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the
>>>> Yes, some distros seem to enable init_on_alloc instead. Looking at the
>>>> commit that introduced it, 6471384af2a6 ("mm: security: introduce init_on_alloc=1
>>>> and init_on_free=1 boot options"), there are security use cases, and it might
>>>> become important with memory tagging.
>>>>
>>>> Note that in Linux there was also an option to poison pages with 0, available
>>>> in some kernels that supported free page reporting; it was removed via
>>>> f289041ed4cf ("mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO").
>>>>
>>>> Its use cases were told to use init_on_free instead.
>> I think we're talking about init_on_free()/init_on_alloc() on the guest side, right?
> Right.  IIUC it's the init_on_free() that matters.
>
> We'll have no issue if init_on_alloc=1 && init_on_free=0, since in that case
> every page will be zeroed at allocation time, before it is returned to the
> caller.  Then we're safe, I think.
>
>> I still can't see how this relates to the host's unpopulated pages.
>> Look at it from the hardware side: is untouched SDRAM somehow required to contain zeroes? No.
>> These 'trash' pages in the migration stream are like never-written physical memory pages: they are
>> not really needed in the snapshot, but they do no harm, just as there is no harm in a never-written
>> physical page being full of garbage.
>>
>> Do these 'trash' pages in the snapshot contain sensitive information that the same VM is not allowed to access?
>> I think not. Otherwise, we need a good example of how it could potentially be exploited.
>>
>> The only issue that I see is madvise(MADV_DONTNEED) on RAM blocks during snapshotting. Free page reporting
>> or the memory balloon is secondary - the point is that a UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED)
>> on the hypervisor side, no matter which guest functionality induces it.
> I think the problem is that with init_on_free=1, the kernel will assume that
> all the pages that got freed have been zeroed beforehand, so it considers it
> a waste of time to zero them again when the page is reused/reallocated.  As a
> reference, see the kernel's prep_new_page(), where there's:
>
> 	if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags))
> 		kernel_init_free_pages(page, 1 << order);
>
> In this case I believe free_pages_prezeroed() will return true, so we don't
> even need to check want_init_on_alloc() at all.  Note that this covers all the
> cases where the kernel allocates with __GFP_ZERO: it means that even when the
> guest kernel calls alloc_page(__GFP_ZERO), it may get a page with random data
> after the live snapshot is loaded.  So it's not about any hardware; it's an
> optimization in the guest kernel instead.  It is actually reasonable and
> efficient, since if we *know* that a page is already zeroed then we shouldn't
> bother zeroing it again.  However, it brings us a bit of trouble with live
> snapshots: the current solution might not work for all guest OS configurations.

So I'd like to clarify that the guest's init_on_alloc/init_on_free cannot cause any problems if we don't
consider virtio-balloon here. If these pages have been zeroed inside the guest, they have certainly been
populated on the host, and UFFD_WP should work as expected.
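
Just to illustrate what the series relies on, here is a minimal sketch of the write-protect
flow (error handling dropped, function names mine, simplified compared to the actual QEMU
helpers): register an already-populated range for WP faults, arm the protection, and on each
write fault save the page content before clearing the protection so the vCPU can continue:

    /* Minimal userfaultfd write-protect sketch; simplified vs. the QEMU
     * helpers, names are mine.  Assumes a kernel with UFFD WP support
     * and an already-populated mapping. */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int wp_start(void *addr, uint64_t len)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        /* Register the (already populated) RAM range for WP faults. */
        struct uffdio_register reg = {
            .range = { .start = (uintptr_t)addr, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* Arm write protection on the whole range. */
        struct uffdio_writeprotect wp = {
            .range = { .start = (uintptr_t)addr, .len = len },
            .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
        return uffd;
    }

    /* After saving the faulting page, drop WP so the vCPU can continue. */
    static void wp_resolve(int uffd, uint64_t fault_addr, uint64_t pg_size)
    {
        struct uffdio_writeprotect un = {
            .range = { .start = fault_addr & ~(pg_size - 1), .len = pg_size },
            .mode  = 0, /* clear write protection */
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &un);
    }

The key point for this discussion is that write protection works on populated pages: a page
the guest has zeroed (and thus populated) is tracked correctly, while a range discarded on
the host is, as discussed above, where the tracking breaks.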

>>>>> impact should be small, I think.  I thought about it, but indeed I didn't see a
>>>>> good way to fix this without fixing the zero page copy for live snapshot.
>>>> We should really document this (unexpected) behavior of snapshotting.
>>>> Otherwise, the next feature will come along that relies on discarded pages
>>>> remaining zeroed (I even have one in mind ;) ) and will forget to disable
>>>> snapshots.
>>> Agreed.  I'll see whether Andrey has any idea for working around this, or
>>> any further comments.  Or I can draft a patch to document this next week
>>> (unless Andrey beats me to it :).
>>>
>> It's really better to document this specific behaviour, but also to clarify that the saved state
>> remains consistent and secure - of course, if you agree with my arguments.
> Yes, no rush on anything yet; let's reach a consensus on understanding the
> problem before trying to document anything.  If you agree with the above, feel
> free to draft a documentation update patch if you'd like; it could also contain
> the code to prohibit virtio-balloon page reporting while a live snapshot is in
> progress.
>
> IMHO if the above issue exists, it would indeed be better to also implement
> MISSING tracking with UFFDIO_ZEROCOPY - keeping trash pages in the live
> snapshot is still suboptimal, since for a guest with a high dirty rate the
> number of trash pages could be huge.  It would definitely be more challenging
> than only doing wr-protect, so we may need multi-threading at least, but it
> would be a pity if live snapshots randomly failed for some guests, even if
> it's rare.  The thing is that the host cannot easily detect that guest config,
> so we can't selectively block those users from taking snapshots, even though
> in the end the snapshot could be silently broken.
>
> Thanks,
>
Agreed, let's reach consensus before doing anything. For concurrent RAM block discards it seems
clear enough that they should be disabled for the duration of the snapshot, like it's done while
incoming postcopy is active.
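
Something along these lines is what I have in mind - a sketch only, reusing the
ram_block_discard_disable() helper that incoming postcopy already relies on (the two
snapshot hooks below are made-up names, not actual QEMU code):

    /* Sketch: inhibit RAM block discards while a background snapshot runs. */
    #include "qemu/osdep.h"
    #include "exec/memory.h"
    #include "qapi/error.h"

    static int snapshot_inhibit_discards(Error **errp)
    {
        /* Fails if a discard-requiring user (e.g. virtio-balloon free page
         * reporting) is already active, and makes such users fail to start
         * while the snapshot is in progress. */
        if (ram_block_discard_disable(true)) {
            error_setg(errp, "RAM discards active, cannot start snapshot");
            return -EBUSY;
        }
        return 0;
    }

    static void snapshot_allow_discards(void)
    {
        ram_block_discard_disable(false);
    }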

For the discards that happen before the snapshot is started, I need to dig into the Linux and
QEMU virtio-balloon code more to get a clear picture. Anyhow, I'm quite sure that adding a global
MISSING handler for snapshotting is too heavy and not really needed.
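
Just to show the weight involved: a global MISSING handler would mean registering every RAM
block with UFFDIO_REGISTER_MODE_MISSING and running at least one fault-serving thread roughly
like the sketch below (my approximation, not a proposal), answering every first-touch fault
with UFFDIO_ZEROPAGE while the snapshot runs:

    /* Sketch of the extra machinery MISSING tracking would need: a thread
     * that reads fault events and fills each missing page with zeroes.
     * Registration (UFFDIO_REGISTER_MODE_MISSING) is omitted. */
    #include <linux/userfaultfd.h>
    #include <poll.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    static void *uffd_missing_thread(void *opaque)
    {
        int uffd = *(int *)opaque;
        const uint64_t pg = (uint64_t)sysconf(_SC_PAGESIZE);

        for (;;) {
            struct pollfd pfd = { .fd = uffd, .events = POLLIN };
            if (poll(&pfd, 1, -1) <= 0) {
                break;
            }

            struct uffd_msg msg;
            if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
                msg.event != UFFD_EVENT_PAGEFAULT) {
                continue;
            }

            /* Resolve the fault by mapping a zero page at that address. */
            struct uffdio_zeropage zp = {
                .range = {
                    .start = msg.arg.pagefault.address & ~(pg - 1),
                    .len   = pg,
                },
            };
            ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
        }
        return NULL;
    }

That puts a user fault on every first guest access to an unpopulated page - exactly the
overhead I'd prefer to avoid.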

Thanks,

-- 
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                 virtuozzo.com

