From: Steven Sistare <steven.sistare@oracle.com>
To: "Denis V. Lunev" <den@virtuozzo.com>,
Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>,
qemu-devel@nongnu.org
Cc: "William Roche" <william.roche@oracle.com>,
"Gerd Hoffmann" <kraxel@redhat.com>,
"Daniel P. Berrangé" <berrange@redhat.com>
Subject: Re: [BUG, RFC] cpr-transfer: qxl guest driver crashes after migration
Date: Thu, 6 Mar 2025 11:13:22 -0500
Message-ID: <849106e8-1e4d-4a0b-8c79-6988b4cd8b0b@oracle.com>
In-Reply-To: <24677eb1-dbe6-4e0a-980b-9c38d4decde8@virtuozzo.com>
On 3/6/2025 10:52 AM, Denis V. Lunev wrote:
> On 3/6/25 16:16, Andrey Drobyshev wrote:
>> On 3/5/25 11:19 PM, Steven Sistare wrote:
>>> On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
>>>> On 3/4/25 9:05 PM, Steven Sistare wrote:
>>>>> On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
>>>>>> On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
>>>>>>> On 2/28/25 8:20 PM, Steven Sistare wrote:
>>>>>>>> On 2/28/2025 1:13 PM, Steven Sistare wrote:
>>>>>>>>> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> We've been experimenting with cpr-transfer migration mode recently
>>>>>>>>>> and
>>>>>>>>>> have discovered the following issue with the guest QXL driver:
>>>>>>>>>>
>>>>>>>>>> Run migration source:
>>>>>>>>>>> EMULATOR=/path/to/emulator
>>>>>>>>>>> ROOTFS=/path/to/image
>>>>>>>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
>>>>>>>>>>>
>>>>>>>>>>> $EMULATOR -enable-kvm \
>>>>>>>>>>> -machine q35 \
>>>>>>>>>>> -cpu host -smp 2 -m 2G \
>>>>>>>>>>> -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on \
>>>>>>>>>>> -machine memory-backend=ram0 \
>>>>>>>>>>> -machine aux-ram-share=on \
>>>>>>>>>>> -drive file=$ROOTFS,media=disk,if=virtio \
>>>>>>>>>>> -qmp unix:$QMPSOCK,server=on,wait=off \
>>>>>>>>>>> -nographic \
>>>>>>>>>>> -device qxl-vga
>>>>>>>>>> Run migration target:
>>>>>>>>>>> EMULATOR=/path/to/emulator
>>>>>>>>>>> ROOTFS=/path/to/image
>>>>>>>>>>> QMPSOCK=/var/run/alma8qmp-dst.sock
>>>>>>>>>>> $EMULATOR -enable-kvm \
>>>>>>>>>>> -machine q35 \
>>>>>>>>>>> -cpu host -smp 2 -m 2G \
>>>>>>>>>>> -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on \
>>>>>>>>>>> -machine memory-backend=ram0 \
>>>>>>>>>>> -machine aux-ram-share=on \
>>>>>>>>>>> -drive file=$ROOTFS,media=disk,if=virtio \
>>>>>>>>>>> -qmp unix:$QMPSOCK,server=on,wait=off \
>>>>>>>>>>> -nographic \
>>>>>>>>>>> -device qxl-vga \
>>>>>>>>>>> -incoming tcp:0:44444 \
>>>>>>>>>>> -incoming '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
>>>>>>>>>>
>>>>>>>>>> Launch the migration:
>>>>>>>>>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
>>>>>>>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
>>>>>>>>>>>
>>>>>>>>>>> $QMPSHELL -p $QMPSOCK <<EOF
>>>>>>>>>>> migrate-set-parameters mode=cpr-transfer
>>>>>>>>>>> migrate channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
>>>>>>>>>>> EOF
>>>>>>>>>> Then, after a while, the QXL guest driver on the target crashes,
>>>>>>>>>> spewing the following messages:
>>>>>>>>>>> [ 73.962002] [TTM] Buffer eviction failed
>>>>>>>>>>> [ 73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
>>>>>>>>>>> [ 73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
>>>>>>>>>> That seems to be a known kernel QXL driver bug:
>>>>>>>>>>
>>>>>>>>>> https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
>>>>>>>>>> https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
>>>>>>>>>>
>>>>>>>>>> (the latter discussion contains a reproducer script which speeds up
>>>>>>>>>> the crash in the guest):
>>>>>>>>>>> #!/bin/bash
>>>>>>>>>>>
>>>>>>>>>>> chvt 3
>>>>>>>>>>>
>>>>>>>>>>> for j in $(seq 80); do
>>>>>>>>>>>     echo "$(date) starting round $j"
>>>>>>>>>>>     if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
>>>>>>>>>>>         echo "bug was reproduced after $j tries"
>>>>>>>>>>>         exit 1
>>>>>>>>>>>     fi
>>>>>>>>>>>     for i in $(seq 100); do
>>>>>>>>>>>         dmesg > /dev/tty3
>>>>>>>>>>>     done
>>>>>>>>>>> done
>>>>>>>>>>>
>>>>>>>>>>> echo "bug could not be reproduced"
>>>>>>>>>>> exit 0
>>>>>>>>>> The bug itself seems to remain unfixed, as I was able to reproduce
>>>>>>>>>> it with a Fedora 41 guest as well as an AlmaLinux 8 guest. However,
>>>>>>>>>> our cpr-transfer code also seems to be buggy, as it triggers the
>>>>>>>>>> crash: without cpr-transfer migration, the above reproducer doesn't
>>>>>>>>>> lead to a crash on the source VM.
>>>>>>>>>>
>>>>>>>>>> I suspect that, as cpr-transfer doesn't migrate the guest
>>>>>>>>>> memory, but
>>>>>>>>>> rather passes it through the memory backend object, our code might
>>>>>>>>>> somehow corrupt the VRAM. However, I wasn't able to trace the
>>>>>>>>>> corruption so far.
>>>>>>>>>>
>>>>>>>>>> Could somebody help the investigation and take a look into
>>>>>>>>>> this? Any
>>>>>>>>>> suggestions would be appreciated. Thanks!
>>>>>>>>> Possibly some memory region created by qxl is not being preserved.
>>>>>>>>> Try adding these traces to see what is preserved:
>>>>>>>>>
>>>>>>>>> -trace enable='*cpr*'
>>>>>>>>> -trace enable='*ram_alloc*'
>>>>>>>> Also try adding this patch to see if it flags any ram blocks as not
>>>>>>>> compatible with cpr. A message is printed at migration start time.
>>>>>>>> https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-steven.sistare@oracle.com/
>>>>>>>>
>>>>>>>> - Steve
>>>>>>>>
>>>>>>> With the traces enabled + the "migration: ram block cpr blockers"
>>>>>>> patch applied:
>>>>>>>
>>>>>>> Source:
>>>>>>>> cpr_find_fd pc.bios, id 0 returns -1
>>>>>>>> cpr_save_fd pc.bios, id 0, fd 22
>>>>>>>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host 0x7fec18e00000
>>>>>>>> cpr_find_fd pc.rom, id 0 returns -1
>>>>>>>> cpr_save_fd pc.rom, id 0, fd 23
>>>>>>>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host 0x7fec18c00000
>>>>>>>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
>>>>>>>> cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
>>>>>>>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 24 host 0x7fec18a00000
>>>>>>>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
>>>>>>>> cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
>>>>>>>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 25 host 0x7feb77e00000
>>>>>>>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
>>>>>>>> cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
>>>>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 27 host 0x7fec18800000
>>>>>>>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
>>>>>>>> cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
>>>>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 28 host 0x7feb73c00000
>>>>>>>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
>>>>>>>> cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
>>>>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 34 host 0x7fec18600000
>>>>>>>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
>>>>>>>> cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
>>>>>>>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 35 host 0x7fec18200000
>>>>>>>> cpr_find_fd /rom@etc/table-loader, id 0 returns -1
>>>>>>>> cpr_save_fd /rom@etc/table-loader, id 0, fd 36
>>>>>>>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 36 host 0x7feb8b600000
>>>>>>>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
>>>>>>>> cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
>>>>>>>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 37 host 0x7feb8b400000
>>>>>>>>
>>>>>>>> cpr_state_save cpr-transfer mode
>>>>>>>> cpr_transfer_output /var/run/alma8cpr-dst.sock
>>>>>>> Target:
>>>>>>>> cpr_transfer_input /var/run/alma8cpr-dst.sock
>>>>>>>> cpr_state_load cpr-transfer mode
>>>>>>>> cpr_find_fd pc.bios, id 0 returns 20
>>>>>>>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host 0x7fcdc9800000
>>>>>>>> cpr_find_fd pc.rom, id 0 returns 19
>>>>>>>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host 0x7fcdc9600000
>>>>>>>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
>>>>>>>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 18 host 0x7fcdc9400000
>>>>>>>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
>>>>>>>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 17 host 0x7fcd27e00000
>>>>>>>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
>>>>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 16 host 0x7fcdc9200000
>>>>>>>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
>>>>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 15 host 0x7fcd23c00000
>>>>>>>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
>>>>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 14 host 0x7fcdc8800000
>>>>>>>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
>>>>>>>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 13 host 0x7fcdc8400000
>>>>>>>> cpr_find_fd /rom@etc/table-loader, id 0 returns 11
>>>>>>>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 11 host 0x7fcdc8200000
>>>>>>>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
>>>>>>>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 10 host 0x7fcd3be00000
>>>>>>> Looks like both vga.vram and qxl.vram are being preserved (with the
>>>>>>> same addresses), and no incompatible ram blocks are found during
>>>>>>> migration.
>>>>>> Sorry, the addresses are not the same, of course. However, the
>>>>>> corresponding ram blocks do seem to be preserved and initialized.
>>>>> So far, I have not reproduced the guest driver failure.
>>>>>
>>>>> However, I have isolated places where new QEMU improperly writes to
>>>>> the qxl memory regions prior to starting the guest, by mmap'ing them
>>>>> readonly after cpr:
>>>>>
>>>>> qemu_ram_alloc_internal()
>>>>>     if (reused && (strstr(name, "qxl") || strstr(name, "vga")))
>>>>>         ram_flags |= RAM_READONLY;
>>>>>     new_block = qemu_ram_alloc_from_fd(...)
>>>>>
>>>>> I have attached a draft fix; try it and let me know.
>>>>> My console window looks fine before and after cpr, using
>>>>> -vnc $hostip:0 -vga qxl
>>>>>
>>>>> - Steve
>>>> Regarding the reproducer: when I launch the buggy version with the same
>>>> options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
>>>> my VNC client silently hangs on the target after a while. Could it
>>>> happen on your setup as well?
>>> cpr does not preserve the vnc connection and session. To test, I specify
>>> port 0 for the source VM and port 1 for the dest. When the src vnc goes
>>> dormant the dest vnc becomes active.
>>>
>> Sure, I meant that VNC on the dest (on port 1) works for a while after
>> the migration and then hangs, apparently after the guest QXL crash.
>>
>>>> Could you try launching the VM with "-nographic -device qxl-vga"? That
>>>> way the VM's serial console is given to you directly in the shell, so
>>>> when the qxl driver crashes you're still able to inspect the kernel
>>>> messages.
>>> I have been running like that, but have not reproduced the qxl driver
>>> crash, and I suspect my guest image+kernel is too old.
>> Yes, that's probably the case. But the crash occurs on my Fedora 41
>> guest with the 6.11.5-300.fc41.x86_64 kernel, so newer kernels seem to
>> be buggy.
>>
>>
>>> However, once I realized the issue was post-cpr modification of qxl
>>> memory, I switched my attention to the fix.
>>>
>>>> As for your patch, I can report that it doesn't resolve the issue as it
>>>> is. But I was able to track down another possible memory corruption
>>>> using your approach with readonly mmap'ing:
>>>>
>>>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>>>> #0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
>>>>> 412         d->ram->magic = cpu_to_le32(QXL_RAM_MAGIC);
>>>>> [Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
>>>>> (gdb) bt
>>>>> #0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
>>>>> #1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70, errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
>>>>> #2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70, errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
>>>>> #3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70, errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
>>>>> #4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70, value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
>>>>> #5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70, v=0x5638996f3770, name=0x56389759b141 "realized", opaque=0x5638987893d0, errp=0x7ffd3c2b84e0) at ../qom/object.c:2374
>>>>> #6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70, name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0) at ../qom/object.c:1449
>>>>> #7  0x00005638970f8586 in object_property_set_qobject (obj=0x5638996e0e70, name=0x56389759b141 "realized", value=0x5638996df900, errp=0x7ffd3c2b84e0) at ../qom/qom-qobject.c:28
>>>>> #8  0x00005638970f3d8d in object_property_set_bool (obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true, errp=0x7ffd3c2b84e0) at ../qom/object.c:1519
>>>>> #9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70, bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
>>>>> #10 0x0000563896dba675 in qdev_device_add_from_qdict (opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at ../system/qdev-monitor.c:714
>>>>> #11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
>>>>> #12 0x0000563896dc48f1 in device_init_func (opaque=0x0, opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/vl.c:1207
>>>>> #13 0x000056389737a6cc in qemu_opts_foreach (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca <device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>) at ../util/qemu-option.c:1135
>>>>> #14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/vl.c:2745
>>>>> #15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40 <error_fatal>) at ../system/vl.c:2806
>>>>> #16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948) at ../system/vl.c:3838
>>>>> #17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at ../system/main.c:72
>>>> So the attached adjusted version of your patch does seem to help. At
>>>> least I can't reproduce the crash on my setup.
>>> Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram
>>> are definitely harmful. Try V2 of the patch, attached, which skips the
>>> lines of init_qxl_ram that modify guest memory.
>>>
>> Thanks, your v2 patch does seem to prevent the crash. Would you re-send
>> it to the list as a proper fix?
Yes. Was waiting for your confirmation.
>>>> I'm wondering, could it be useful to explicitly mark all the reused
>>>> memory regions readonly upon cpr-transfer, and then make them writable
>>>> again after the migration is done? That way we will be segfaulting
>>>> early on instead of debugging tricky memory corruptions.
>>> It's a useful debugging technique, but changing protection on a large
>>> memory region can be too expensive for production due to TLB shootdowns.
>>>
>>> Also, there are cases where writes are performed but the value is
>>> guaranteed to be the same:
>>>     qxl_post_load()
>>>         qxl_set_mode()
>>>             d->rom->mode = cpu_to_le32(modenr);
>>> The value is the same because mode and shadow_rom.mode were passed in
>>> vmstate from old qemu.
>>>
>> There are also cases where a device's ROM might be re-initialized. E.g.
>> this segfault occurs upon further exploration of RO-mapped RAM blocks:
>>
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
>>> 664 rep movsb
>>> [Current thread is 1 (Thread 0x7f6e7d08b480 (LWP 310379))]
>>> (gdb) bt
>>> #0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
>>> #1  0x000055aa1d030ecd in rom_set_mr (rom=0x55aa200ba380, owner=0x55aa2019ac10, name=0x7fffb8272bc0 "/rom@etc/acpi/tables", ro=true) at ../hw/core/loader.c:1032
>>> #2  0x000055aa1d031577 in rom_add_blob (name=0x55aa1da51f13 "etc/acpi/tables", blob=0x55aa208a1070, len=131072, max_len=2097152, addr=18446744073709551615, fw_file_name=0x55aa1da51f13 "etc/acpi/tables", fw_callback=0x55aa1d441f59 <acpi_build_update>, callback_opaque=0x55aa20ff0010, as=0x0, read_only=true) at ../hw/core/loader.c:1147
>>> #3  0x000055aa1cfd788d in acpi_add_rom_blob (update=0x55aa1d441f59 <acpi_build_update>, opaque=0x55aa20ff0010, blob=0x55aa1fc9aa00, name=0x55aa1da51f13 "etc/acpi/tables") at ../hw/acpi/utils.c:46
>>> #4 0x000055aa1d44213f in acpi_setup () at ../hw/i386/acpi-build.c:2720
>>> #5 0x000055aa1d434199 in pc_machine_done (notifier=0x55aa1ff15050, data=0x0) at ../hw/i386/pc.c:638
>>> #6 0x000055aa1d876845 in notifier_list_notify (list=0x55aa1ea25c10 <machine_init_done_notifiers>, data=0x0) at ../util/notify.c:39
>>> #7 0x000055aa1d039ee5 in qdev_machine_creation_done () at ../hw/core/machine.c:1749
>>> #8 0x000055aa1d2c7b3e in qemu_machine_creation_done (errp=0x55aa1ea5cc40 <error_fatal>) at ../system/vl.c:2779
>>> #9 0x000055aa1d2c7c7d in qmp_x_exit_preconfig (errp=0x55aa1ea5cc40 <error_fatal>) at ../system/vl.c:2807
>>> #10 0x000055aa1d2ca64f in qemu_init (argc=35, argv=0x7fffb82730e8) at ../system/vl.c:3838
>>> #11 0x000055aa1d79638c in main (argc=35, argv=0x7fffb82730e8) at ../system/main.c:72
>> I'm not sure whether the ACPI tables ROM in particular is rewritten with
>> the same content, but there might be cases where a ROM is read from the
>> file system upon initialization. That is undesirable, as the guest kernel
>> certainly won't be too happy about a sudden change of the device's ROM
>> content.
>>
>> So the issue we're dealing with here is any unwanted memory-related
>> device initialization upon cpr.
>>
>> For now the only thing that comes to my mind is to make a test where we
>> put as many devices as we can into a VM, make ram blocks RO upon cpr
>> (and remap them as RW later after migration is done, if needed), and
>> catch any unwanted memory violations. As Den suggested, we might
>> consider adding that behaviour as a separate non-default option (or a
>> "migrate" command flag specific to cpr-transfer), which would only be
>> used in testing.
I'll look into adding an option, but there may be too many false positives,
such as the qxl_set_mode case above. And the maintainers may object to me
eliminating the false positives by adding more CPR_IN tests, due to gratuitous
(from their POV) ugliness.
But I will use the technique to look for more write violations.
>> Andrey
> No way. ACPI with the source must be used in the same way as BIOSes
> and optional ROMs.
Yup, it's a bug. Will fix.
- Steve