From: Peter Xu <peterx@redhat.com>
To: William Roche <william.roche@oracle.com>
Cc: qemu-devel@nongnu.org, qemu-arm@nongnu.org,
	lizhijian@fujitsu.com, pbonzini@redhat.com, quintela@redhat.com,
	leobras@redhat.com, joao.m.martins@oracle.com,
	lidongchen@tencent.com
Subject: Re: [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error
Date: Wed, 8 Nov 2023 16:49:25 -0500
Message-ID: <ZUwCZdZj-vZD1NJC@x1n>
In-Reply-To: <20231106220319.456765-1-william.roche@oracle.com>

On Mon, Nov 06, 2023 at 10:03:17PM +0000, William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> 
> Note about ARM specificities:
> This code has a small part impacting more specifically ARM machines,
> that's the reason why I added qemu-arm@nongnu.org -- see description.
> 
> 
> A Qemu VM can survive a memory error, as qemu can relay the error to the
> VM kernel which could also deal with it -- poisoning/off-lining the impacted
> page.
> This situation creates a hole in the VM memory address space that the VM kernel
> knows about (an unreadable page or set of pages).
> 
> But the migration of this VM (live migration through the network or
> pseudo-migration with the creation of a state file) will crash Qemu when
> it sequentially reads the memory address space and stumbles on the
> existing hole.
> 
> In order to thoroughly correct this problem, the poison information should
> follow the migration, which presents several difficulties:
> - poisoning a page on the destination machine to replicate the source
>   poison requires CAP_SYS_ADMIN privileges, and the qemu process may not
>   always run as a root process
> - the destination kernel needs to be configured with CONFIG_MEMORY_FAILURE
> - the poison information would require a memory transfer protocol
>   enhancement to provide this information
> (The current patches don't provide any of that)
> 
> But if we rely on the fact that a running VM kernel correctly deals
> with the memory poison it is informed about (marking the poisoned page
> as inaccessible), we can count on the VM kernel to make sure that
> poisoned pages are not used, even after a migration.
> In this case, I suggest treating the poisoned pages as if they were
> zero-pages for the migration copy.
> This fix also works with underlying large pages, taking into account the
> RAMBlock segment "page-size".
> 
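
A minimal sketch of the zero-page idea quoted above, in C. It is only an
illustration and is not taken from the patches: every name here
(sketch_block, sketch_classify_page, the poison offset list, the fixed
4 KiB target page size) is hypothetical. It assumes qemu keeps a list of
poisoned offsets per block, and that poison covers the whole backing page
of the block (the "page-size" mentioned above), which may be larger than a
target page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_TARGET_PAGE_SIZE 4096u  /* hypothetical migration unit */

/* Hypothetical stand-in for one RAM block being migrated. */
struct sketch_block {
    const unsigned char *host;     /* host memory backing the block */
    size_t used_length;            /* bytes in use */
    size_t page_size;              /* backing page size (4 KiB, 2 MiB, ...) */
    const size_t *poison_offsets;  /* offsets where an error was reported */
    size_t npoison;
};

enum sketch_kind { SKETCH_ZERO_PAGE, SKETCH_DATA_PAGE };

/* A target page is poisoned if it shares a backing page with a poisoned offset. */
static bool sketch_offset_is_poisoned(const struct sketch_block *b, size_t offset)
{
    for (size_t i = 0; i < b->npoison; i++) {
        if (b->poison_offsets[i] / b->page_size == offset / b->page_size) {
            return true;
        }
    }
    return false;
}

static bool sketch_buffer_is_zero(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Decide how to send the target page at 'offset'; poisoned memory is never read. */
static enum sketch_kind sketch_classify_page(const struct sketch_block *b,
                                             size_t offset)
{
    assert(offset + SKETCH_TARGET_PAGE_SIZE <= b->used_length);
    if (sketch_offset_is_poisoned(b, offset)) {
        return SKETCH_ZERO_PAGE;   /* report as zero without touching it */
    }
    return sketch_buffer_is_zero(b->host + offset, SKETCH_TARGET_PAGE_SIZE)
               ? SKETCH_ZERO_PAGE : SKETCH_DATA_PAGE;
}
```

The point of the sketch is the ordering: the poison check comes before any
read of the page, so the sequential scan never touches the hole.
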
> Now, it leaves a case that we have to deal with: if a memory error is
> reported to qemu but not injected into the running kernel...
> As the migration will go from a poisoned page to an all-zero page, if
> the VM kernel doesn't prevent the access to this page, a memory read
> that would generate a BUS_MCEERR_AR error on the source platform, could
> be reading zeros on the destination. This is a memory corruption.
> 
> So we have to ensure that all poisoned pages we set to zero are known by
> the running kernel. But we have a problem with platforms where BUS_MCEERR_AO
> errors are ignored, which means that qemu knows about the poison but the VM
> doesn't. For the moment this is only the case for ARM, but it could later
> also be the case for AMD VMs.
> See https://lore.kernel.org/all/20230912211824.90952-3-john.allen@amd.com/
> 
> In order to avoid this possible silent data corruption, we should
> prevent the migration when we know that a poisoned page is ignored by the VM.
> 
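
A rough sketch of the blocking rule quoted above, in C. This is an
assumption-laden illustration, not the code from patch 2: the struct and
function names are hypothetical, and it simply assumes qemu can tell, for
each reported memory error, whether the error was relayed to the guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for memory errors reported to qemu. */
struct sketch_poison_state {
    size_t recorded;  /* poisoned pages qemu knows about */
    size_t injected;  /* of those, pages the guest was told about */
};

/* Record one memory error and whether it was relayed to the guest. */
static void sketch_record_poison(struct sketch_poison_state *s,
                                 bool relayed_to_guest)
{
    assert(s != NULL);
    s->recorded++;
    if (relayed_to_guest) {
        s->injected++;
    }
}

/*
 * Migration is allowed only if the guest knows about every poisoned page.
 * Otherwise a page the guest still considers valid would arrive as zeros.
 */
static bool sketch_migration_allowed(const struct sketch_poison_state *s)
{
    assert(s != NULL);
    return s->injected == s->recorded;
}
```

With this rule, an error that is relayed to the guest leaves migration
allowed, while a single ignored error (the BUS_MCEERR_AO case described
above) blocks it.
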
> This is, in my view, the smallest fix we need to avoid qemu crashes
> on migration after a handled memory error, without introducing a possible
> corruption situation.
> 
> This fix is scripts/checkpatch.pl clean.
> Unit test: Migration blocking successfully tested on ARM -- injected AO error
> blocks it. On x86 the same type of error being relayed doesn't block.
> 
> v2:
>   - adding compressed transfer handling of poisoned pages
> 
> v3:
>   - Included the Reviewed-by and Tested-by information on first patch
>   - added a TODO comment above control_save_page()
>     mentioning Zhijian's feedback about RDMA migration failure.
> 
> v4:
>   - adding a patch to deal with unknown poison tracking (impacting ARM)
>     (not using migrate_add_blocker as this is not devices related and
>     we want to avoid the interaction with --only-migratable mechanism)
> 
> v5:
>   - Updating the code to the latest version
>   - adding qemu-arm@nongnu.org for a complementary review
> 
> 
> William Roche (2):
>   migration: skip poisoned memory pages on "ram saving" phase
>   migration: prevent migration when a poisoned page is unknown from the
>     VM

I hope one of the arch-specific maintainers can take a quick look at patch 2.

One thing to mention: unfortunately, waiting on patch 2 means we'll miss
this release. Actually it is already missed; soft freeze was yesterday [1].
So it will likely need to wait for 9.0.

[1] https://wiki.qemu.org/Planning/8.2

-- 
Peter Xu

