virtualization.lists.linux-foundation.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: zhenwei pi <pizhenwei@bytedance.com>,
	akpm@linux-foundation.org, naoya.horiguchi@nec.com,
	mst@redhat.com
Cc: qemu-devel@nongnu.org, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
	pbonzini@redhat.com
Subject: Re: [PATCH 0/3] recover hardware corrupted page by virtio balloon
Date: Tue, 24 May 2022 20:59:09 +0200	[thread overview]
Message-ID: <68a4a96b-9c66-6509-e75d-b1bea6cd55d1@redhat.com> (raw)
In-Reply-To: <20220520070648.1794132-1-pizhenwei@bytedance.com>

On 20.05.22 09:06, zhenwei pi wrote:
> Hi,
> 
> I'm trying to recover hardware corrupted page by virtio balloon, the
> workflow of this feature like this:
> 
> Guest              5.MF -> 6.RVQ FE    10.Unpoison page
>                     /           \            /
> -------------------+-------------+----------+-----------
>                    |             |          |
>                 4.MCE        7.RVQ BE   9.RVQ Event
>  QEMU             /               \       /
>              3.SIGBUS              8.Remap
>                 /
> ----------------+------------------------------------
>                 |
>             +--2.MF
>  Host       /
>        1.HW error
> 
> 1, HardWare page error occurs randomly.
> 2, host side handles corrupted page by Memory Failure mechanism, sends
>    SIGBUS to the user process if early-kill is enabled.
> 3, QEMU handles SIGBUS, if the address belongs to guest RAM, then:
> 4, QEMU tries to inject MCE into guest.
> 5, guest handles memory failure again.
> 
> 1-5 is already supported for a long time, the next steps are supported
> in this patch(also related driver patch):
> 
> 6, guest balloon driver gets noticed of the corrupted PFN, and sends
>    request to host side by Recover VQ FrontEnd.
> 7, QEMU handles request from Recover VQ BackEnd, then:
> 8, QEMU remaps the corrupted HVA fo fix the memory failure, then:
> 9, QEMU acks the guest side the result by Recover VQ.
> 10, guest unpoisons the page if the corrupted page gets recoverd
>     successfully.
> 
> Test:
> This patch set can be tested with QEMU(also in developing):
> https://github.com/pizhenwei/qemu/tree/balloon-recover
> 
> Emulate MCE by QEMU(guest RAM normal page only, hugepage is not supported):
> virsh qemu-monitor-command vm --hmp mce 0 9 0xbd000000000000c0 0xd 0x61646678 0x8c
> 
> The guest works fine(on Intel Platinum 8260):
>  mce: [Hardware Error]: Machine check events logged
>  Memory failure: 0x61646: recovery action for dirty LRU page: Recovered
>  virtio_balloon virtio5: recovered pfn 0x61646
>  Unpoison: Unpoisoned page 0x61646 by virtio-balloon
>  MCE: Killing stress:24502 due to hardware memory corruption fault at 7f5be2e5a010
> 
> And the 'HardwareCorrupted' in /proc/meminfo also shows 0 kB.
> 
> About the protocol of virtio balloon recover VQ, it's undefined and in
> developing currently:
> - 'struct virtio_balloon_recover' defines the structure which is used to
>   exchange message between guest and host.
> - '__le32 corrupted_pages' in struct virtio_balloon_config is used in the next
>   step:
>   1, a VM uses RAM of 2M huge page, once a MCE occurs, the 2M becomes
>      unaccessible. Reporting 512 * 4K 'corrupted_pages' to the guest, the guest
>      has a chance to isolate the 512 pages ahead of time.
> 
>   2, after migrating to another host, the corrupted pages are actually recovered,
>      once the guest gets the 'corrupted_pages' with 0, then the guest could
>      unpoison all the poisoned pages which are recorded in the balloon driver.
> 

Hi,

I'm still on vacation this week, I'll try to have a look when I'm back
(and flushed out my overflowing inbox :D).


-- 
Thanks,

David / dhildenb

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

  parent reply	other threads:[~2022-05-24 18:59 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-20  7:06 [PATCH 0/3] recover hardware corrupted page by virtio balloon zhenwei pi
2022-05-20  7:06 ` [PATCH 1/3] memory-failure: Introduce memory failure notifier zhenwei pi
2022-05-20  7:06 ` [PATCH 2/3] mm/memory-failure.c: support reset PTE during unpoison zhenwei pi
     [not found]   ` <20220530050234.GA1036127@hori.linux.bs1.fc.nec.co.jp>
2022-05-30  5:46     ` zhenwei pi
2022-05-30  6:50   ` David Hildenbrand
2022-05-20  7:06 ` [PATCH 3/3] virtio_balloon: Introduce memory recover zhenwei pi
2022-05-20 12:48   ` kernel test robot
2022-05-20 13:39   ` kernel test robot
2022-05-20 15:28   ` kernel test robot
     [not found]   ` <Yo0zmP28FqpivlxF@google.com>
2022-05-24 23:32     ` zhenwei pi
2022-05-30  7:53       ` David Hildenbrand
2022-05-26 19:18   ` Michael S. Tsirkin
2022-05-27  2:22     ` zhenwei pi
2022-05-30  7:48   ` David Hildenbrand
2022-05-30 12:47     ` zhenwei pi
2022-05-24 18:59 ` David Hildenbrand [this message]
2022-05-27  3:47 ` [PATCH 0/3] recover hardware corrupted page by virtio balloon zhenwei pi
     [not found] <CAPcxDJ5pduUyMA0rf+-aTjK_2eBvig05UTiTptX1nVkWE-_g8w@mail.gmail.com>
2022-05-26 18:37 ` Peter Xu
2022-05-27  6:32   ` zhenwei pi
2022-05-30  7:41     ` David Hildenbrand
2022-05-30 11:33       ` zhenwei pi
2022-05-30 15:49         ` Peter Xu
     [not found]           ` <CAPcxDJ5UMfpys8KyLQVnkV9BPO1vaubxbhc7f4XC_TdNO7jr7g@mail.gmail.com>
2022-06-01  2:17             ` zhenwei pi
2022-06-01  7:59               ` David Hildenbrand
2022-06-02  9:28                 ` zhenwei pi
2022-06-02  9:40                   ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=68a4a96b-9c66-6509-e75d-b1bea6cd55d1@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mst@redhat.com \
    --cc=naoya.horiguchi@nec.com \
    --cc=pbonzini@redhat.com \
    --cc=pizhenwei@bytedance.com \
    --cc=qemu-devel@nongnu.org \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).