> On Mon, Nov 06, 2023 at 10:38:14PM +0100, William Roche wrote:
>> But it implies a lot of other changes:
>> - The source has to flag the error pages to indicate a poison (new
>>   flag in the exchange protocol)
>> - The destination has to be able to deal with the new protocol
>
> IIUC these two can simply be implemented by migrating
> hwpoison_page_list over to the destination. You need a compat bit for
> this, so that the list is ignored on old machine types, because old
> QEMUs will not recognize this vmsd.
>
> QEMU should even support migrating a list object in a VMSD; feel free
> to have a look at VMSTATE_QLIST_V().
This is another area that I'll need to learn about.
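A minimal sketch of the compat-bit idea, written as standalone C rather than QEMU code: the list is only put on the wire when a flag is set, and old machine types would leave that flag off. The names poison_state, migrate_poison_list and put_poison_list are hypothetical and only serve the illustration. In QEMU itself the gating would more likely live in a subsection's .needed callback, with the list described through VMSTATE_QLIST_V().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the state that owns hwpoison_page_list. */
struct poison_state {
    bool migrate_poison_list;   /* hypothetical compat bit: false on old machine types */
    const uint64_t *pages;      /* addresses of the poisoned pages */
    size_t npages;
};

/* Store a 64-bit value in big-endian byte order. */
void put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

/*
 * Write the list as a count followed by the addresses.
 * Returns the number of bytes written; 0 means the section was skipped
 * (compat bit off) or did not fit in the buffer.
 */
size_t put_poison_list(const struct poison_state *s, uint8_t *buf, size_t cap)
{
    size_t need = 8 * (s->npages + 1);

    if (!s->migrate_poison_list || need > cap) {
        return 0;
    }
    put_be64(buf, (uint64_t)s->npages);
    for (size_t i = 0; i < s->npages; i++) {
        put_be64(buf + 8 * (i + 1), s->pages[i]);
    }
    return need;
}
```

With the flag off nothing is emitted, so an old destination never sees a section it cannot parse.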
>> - The destination has to be able to mark the pages as poisoned
>>   (authorized to use userfaultfd)
>
> Note: userfaultfd is actually available without any privilege if you
> only need UFFDIO_POISON, as long as the uffd is opened (either via the
> syscall or /dev/userfaultfd) with UFFD_USER_MODE_ONLY.
>
> A trick is to register in WP mode (UFFDIO_REGISTER_MODE_WP) rather
> than MISSING mode, because with a user-mode-only uffd a kernel access
> to a missing page would cause SIGBUS, and then inject whatever poison
> we want. As long as UFFDIO_WRITEPROTECT is never invoked, WP mode does
> nothing (unlike MISSING).
>
>> - So both source and destination have to be upgraded (of course qemu
>>   but also an appropriate kernel version providing UFFDIO_POISON on
>>   the destination)
>
> True. Unfortunately this is not avoidable.
>
>> - we may need to be able to negotiate a fall back solution
>> - an indication of the method to use could belong to the migration
>>   capabilities and parameters
>
> For the above two points: it's a common issue with migration
> compatibility. As long as you provide the above VMSD to migrate
> hwpoison_page_list and mark all old QEMU machine types as skipping it,
> it should just work. You can have a closer look at anything in
> hw_compat_* as an example.
Yes, I'll do that.
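The userfaultfd note above (user-mode-only descriptor, WP registration, then UFFDIO_POISON) might look roughly like the sketch below. It is untested against QEMU and assumes Linux headers recent enough to define UFFDIO_POISON; otherwise it compiles to a stub returning -ENOSYS. The function name uffd_poison_range is hypothetical, and addr and len are assumed to be page aligned.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(UFFDIO_POISON) && defined(UFFD_FEATURE_POISON) && defined(UFFD_USER_MODE_ONLY)
/* Close the descriptor and report the errno of the step that failed. */
static int close_with_errno(int fd)
{
    int err = errno;

    close(fd);
    return -err;
}
#endif

/*
 * Mark [addr, addr + len) as poisoned.
 * Returns the open uffd on success, or a negative errno value. The caller
 * keeps the descriptor open: this sketch does not check whether the poison
 * markers survive closing it.
 */
int uffd_poison_range(void *addr, size_t len)
{
    if (addr == NULL || len == 0) {
        return -EINVAL;
    }
#if defined(UFFDIO_POISON) && defined(UFFD_FEATURE_POISON) && defined(UFFD_USER_MODE_ONLY)
    /* User-mode-only descriptor: no privilege needed. */
    int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
    if (uffd < 0) {
        return -errno;
    }

    /* Handshake; fails if the running kernel lacks poison support. */
    struct uffdio_api api;
    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = UFFD_FEATURE_POISON;
    if (ioctl(uffd, UFFDIO_API, &api) < 0) {
        return close_with_errno(uffd);
    }

    /* Register in WP mode only; UFFDIO_WRITEPROTECT is never issued. */
    struct uffdio_register reg;
    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uintptr_t)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_WP;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        return close_with_errno(uffd);
    }

    /* Install the poison markers; later accesses should raise SIGBUS. */
    struct uffdio_poison poison;
    memset(&poison, 0, sizeof(poison));
    poison.range.start = (uintptr_t)addr;
    poison.range.len = len;
    if (ioctl(uffd, UFFDIO_POISON, &poison) < 0) {
        return close_with_errno(uffd);
    }
    return uffd;
#else
    return -ENOSYS;
#endif
}
```

A negative return (for example from an older kernel or a sandbox that blocks the syscall) is the signal a caller could use to pick a fallback.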
>> - etc...
>
> I think you did summarize mostly all the points I can think of; is
> there really anything more? :)
Probably some work to select the poison migration method: for example,
whether or not to allow a migration that transforms poison into zeros as
a fallback when poison migration with UFFDIO_POISON can't be used.
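The method selection described here could be sketched as a small decision function. Everything in it is hypothetical (the enum, the struct and the field names); it only illustrates the three outcomes: migrate poison as poison, fall back to zero pages, or refuse the migration.

```c
#include <stdbool.h>

enum poison_method {
    POISON_MIGRATE_AS_POISON,   /* destination injects poison with UFFDIO_POISON */
    POISON_MIGRATE_AS_ZERO,     /* fallback: poisoned pages arrive as zero pages */
    POISON_MIGRATE_REFUSE,      /* neither is acceptable: do not migrate */
};

/* Hypothetical inputs, e.g. taken from migration capabilities/parameters. */
struct poison_policy {
    bool dst_has_uffdio_poison; /* destination QEMU and kernel both support it */
    bool allow_zero_fallback;   /* user accepts turning poison into zeros */
};

enum poison_method choose_poison_method(const struct poison_policy *p)
{
    if (p->dst_has_uffdio_poison) {
        return POISON_MIGRATE_AS_POISON;
    }
    if (p->allow_zero_fallback) {
        return POISON_MIGRATE_AS_ZERO;
    }
    return POISON_MIGRATE_REFUSE;
}
```

Preferring real poison whenever the destination supports it keeps the zero fallback as an explicit opt-in.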
> It'll be great if you can, or plan to, fix that for good.
Thanks for the offer ;)
I'd really like to implement that, but I currently have another pressing
issue to work on. I should be back on this topic within a few months.
I'm now waiting for some feedback from the ARM architecture reviewer(s).
Thanks a lot for all your suggestions.