* Memory corruption after resume from hibernate with Arm GICv3 ITS
@ 2025-07-23 10:04 David Woodhouse
From: David Woodhouse @ 2025-07-23 10:04 UTC (permalink / raw)
To: Rafael J. Wysocki, Pavel Machek, linux-pm, Marc Zyngier,
linux-arm-kernel, Saidi, Ali, oliver.upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
linux-kernel, Heyne, Maximilian, Alexander Graf, Stamatis, Ilias
We have seen guests crashing when, after they resume from hibernate,
the hypervisor serializes their state for live update (LU) or live
migration (LM).
The Arm Generic Interrupt Controller is a complicated beast, and it
does scattershot DMA to little tables all across the guest's address
space, without even living behind an IOMMU.
Rather than simply turning it off overall, the guest has to explicitly
tear down *every* one of the individual tables which were previously
configured, in order to ensure that the memory is no longer used.
KVM's implementation of the virtual GIC only uses this guest memory
when asked to serialize its state. Instead of passing the information
up to userspace as most KVM devices do for serialization, KVM
*only* supports scribbling it to guest memory.
So when the transition from the boot kernel to the resumed kernel
leaves the vGIC pointing at the *wrong* addresses, a subsequent LU/LM
of that guest triggers memory corruption: KVM writes its state to a
guest address that the now-running kernel did *not* expect.
I tried this, just to get some more information:
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -720,7 +720,7 @@ static struct its_collection *its_build_mapd_cmd(struct its_node *its,
 	its_encode_valid(cmd, desc->its_mapd_cmd.valid);
 
 	its_fixup_cmd(cmd);
-
+	printk("%s dev 0x%x valid %d addr 0x%lx\n", __func__, desc->its_mapd_cmd.dev->device_id, desc->its_mapd_cmd.valid, itt_addr);
 	return NULL;
 }
 
@@ -4996,10 +4996,15 @@ static int its_save_disable(void)
 	struct its_node *its;
 	int err = 0;
 
+	printk("%s\n", __func__);
 	raw_spin_lock(&its_lock);
 	list_for_each_entry(its, &its_nodes, entry) {
+		struct its_device *its_dev;
 		void __iomem *base;
 
+		list_for_each_entry(its_dev, &its->its_device_list, entry) {
+			its_send_mapd(its_dev, 0);
+		}
 		base = its->base;
 		its->ctlr_save = readl_relaxed(base + GITS_CTLR);
 		err = its_force_quiescent(base);
@@ -5032,8 +5037,10 @@ static void its_restore_enable(void)
 	struct its_node *its;
 	int ret;
 
+	printk("%s\n", __func__);
 	raw_spin_lock(&its_lock);
 	list_for_each_entry(its, &its_nodes, entry) {
+		struct its_device *its_dev;
 		void __iomem *base;
 		int i;
 
@@ -5083,6 +5090,10 @@ static void its_restore_enable(void)
 		if (its->collections[smp_processor_id()].col_id <
 		    GITS_TYPER_HCC(gic_read_typer(base + GITS_TYPER)))
 			its_cpu_init_collection(its);
+
+		list_for_each_entry(its_dev, &its->its_device_list, entry) {
+			its_send_mapd(its_dev, 1);
+		}
 	}
 	raw_spin_unlock(&its_lock);
 }
Running on a suitable host with qemu, I reproduce with
# echo reboot > /sys/power/disk
# echo disk > /sys/power/state
Example qemu command line:
qemu-system-aarch64 -serial mon:stdio -M virt,gic-version=host -cpu max -enable-kvm -drive file=~/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2,id=nvm,if=none,snapshot=off,format=qcow2 -device nvme,drive=nvm,serial=1 -m 8g -nographic -nic user,model=virtio -kernel vmlinuz-6.16.0-rc7-dirty -initrd initramfs-6.16.0-rc7-dirty.img -append 'root=UUID=6c7b9058-d040-4047-a892-d2f1c7dee687 ro rootflags=subvol=root no_timer_check console=tty1 console=ttyAMA0,115200n8 systemd.firstboot=off rootflags=subvol=root no_console_suspend=1 resume_offset=366703 resume=/dev/nvme0n1p3' -trace gicv3_its\*
As the kernel boots up for the first time, it sends a normal MAPD command:
[ 1.292956] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
On hibernation, my newly added code unmaps and then *remaps* the same:
[root@localhost ~]# echo disk > /sys/power/state
[ 42.118573] PM: hibernation: hibernation entry
[ 42.134574] Filesystems sync: 0.015 seconds
[ 42.134899] Freezing user space processes
[ 42.135566] Freezing user space processes completed (elapsed 0.000 seconds)
[ 42.136040] OOM killer disabled.
[ 42.136307] PM: hibernation: Preallocating image memory
[ 42.371141] PM: hibernation: Allocated 297401 pages for snapshot
[ 42.371163] PM: hibernation: Allocated 1189604 kbytes in 0.23 seconds (5172.19 MB/s)
[ 42.371170] Freezing remaining freezable tasks
[ 42.373465] Freezing remaining freezable tasks completed (elapsed 0.002 seconds)
[ 42.378350] Disabling non-boot CPUs ...
[ 42.378363] its_save_disable
[ 42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
[ 42.378363] PM: hibernation: Creating image:
[ 42.378363] PM: hibernation: Need to copy 153098 pages
[ 42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)
[ 42.378363] its_restore_enable
[ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
[ 42.383601] nvme nvme0: 1/0/0 default/read/poll queues
[ 42.384411] nvme nvme0: Ignoring bogus Namespace Identifiers
[ 42.384924] hibernate: Hibernating on CPU 0 [mpidr:0x0]
[ 42.387742] PM: Using 1 thread(s) for lzo compression
[ 42.387748] PM: Compressing and saving image data (115654 pages)...
[ 42.387757] PM: Image saving progress: 0%
[ 43.485794] PM: Image saving progress: 10%
[ 44.739662] PM: Image saving progress: 20%
[ 46.617453] PM: Image saving progress: 30%
[ 48.437644] PM: Image saving progress: 40%
[ 49.857855] PM: Image saving progress: 50%
[ 52.156928] PM: Image saving progress: 60%
[ 53.344810] PM: Image saving progress: 70%
[ 54.472998] PM: Image saving progress: 80%
[ 55.083950] PM: Image saving progress: 90%
[ 56.406480] PM: Image saving progress: 100%
[ 56.407088] PM: Image saving done
[ 56.407100] PM: hibernation: Wrote 462616 kbytes in 14.01 seconds (33.02 MB/s)
[ 56.407106] PM: Image size after compression: 148041 kbytes
[ 56.408210] PM: S|
[ 56.642393] Flash device refused suspend due to active operation (state 20)
[ 56.642871] Flash device refused suspend due to active operation (state 20)
[ 56.643432] reboot: Restarting system
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd4f1]
Then the *boot* kernel comes up, does its own MAPD using a slightly different address:
[ 1.270652] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f009000
... and then transfers control to the hibernated kernel, which again
tries to unmap and remap the ITT at its original address due to my
suspend/resume hack (which is clearly hooking the wrong thing, but is
at least giving us useful information):
Starting systemd-hibernate-resume.service - Resume from hibernation...
[ 1.391340] PM: hibernation: resume from hibernation
[ 1.391861] random: crng reseeded on system resumption
[ 1.391927] Freezing user space processes
[ 1.392984] Freezing user space processes completed (elapsed 0.001 seconds)
[ 1.393473] OOM killer disabled.
[ 1.393486] Freezing remaining freezable tasks
[ 1.395012] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 1.400817] PM: Using 1 thread(s) for lzo decompression
[ 1.400832] PM: Loading and decompressing image data (115654 pages)...
[ 1.400836] hibernate: Hibernated on CPU 0 [mpidr:0x0]
[ 1.438621] PM: Image loading progress: 0%
[ 1.554623] PM: Image loading progress: 10%
[ 1.594714] PM: Image loading progress: 20%
[ 1.639317] PM: Image loading progress: 30%
[ 1.683055] PM: Image loading progress: 40%
[ 1.720726] PM: Image loading progress: 50%
[ 1.768878] PM: Image loading progress: 60%
[ 1.800203] PM: Image loading progress: 70%
[ 1.822833] PM: Image loading progress: 80%
[ 1.840985] PM: Image loading progress: 90%
[ 1.871253] PM: Image loading progress: 100%
[ 1.871611] PM: Image loading done
[ 1.871617] PM: hibernation: Read 462616 kbytes in 0.47 seconds (984.28 MB/s)
[ 42.378350] Disabling non-boot CPUs ...
[ 42.378363] its_save_disable
[ 42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
[ 42.378363] PM: hibernation: Creating image:
[ 42.378363] PM: hibernation: Need to copy 153098 pages
[ 42.378363] hibernate: Restored 0 MTE pages
[ 42.378363] its_restore_enable
[ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
[ 42.417445] OOM killer enabled.
[ 42.417455] Restarting tasks: Starting
[ 42.419915] nvme nvme0: 1/0/0 default/read/poll queues
[ 42.420407] Restarting tasks: Done
[ 42.420781] PM: hibernation: hibernation exit
[ 42.421149] nvme nvme0: Ignoring bogus Namespace Identifiers
* Re: Memory corruption after resume from hibernate with Arm GICv3 ITS
From: David Woodhouse @ 2025-07-24 9:25 UTC (permalink / raw)
To: Rafael J. Wysocki, Pavel Machek, linux-pm, Marc Zyngier,
linux-arm-kernel, Saidi, Ali, oliver.upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
linux-kernel, Heyne, Maximilian, Alexander Graf, Stamatis, Ilias
On Wed, 2025-07-23 at 12:04 +0200, David Woodhouse wrote:
> [snip full quote of the original message]
Rafael points out that the resumed kernel isn't doing the unmap/remap
again; it's merely printing the *same* messages again from the printk
buffer.
Before writing the hibernate image, the kernel calls the suspend op:
[ 42.378350] Disabling non-boot CPUs ...
[ 42.378363] its_save_disable
[ 42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
[ 42.378363] PM: hibernation: Creating image:
Those messages are stored in the printk buffer in the image. Then the
hibernating kernel calls the resume op, and writes the image:
[ 42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)
[ 42.378363] its_restore_enable
[ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
[ 42.383601] nvme nvme0: 1/0/0 default/read/poll queues
[ 42.384411] nvme nvme0: Ignoring bogus Namespace Identifiers
[ 42.384924] hibernate: Hibernating on CPU 0 [mpidr:0x0]
[ 42.387742] PM: Using 1 thread(s) for lzo compression
[ 42.387748] PM: Compressing and saving image data (115654 pages)...
[ 42.387757] PM: Image saving progress: 0%
[ 43.485794] PM: Image saving progress: 10%
...
Then the boot kernel comes up and maps an ITT:
[ 1.270652] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f009000
The boot kernel never seems to *unmap* that because the suspend method
doesn't get called before resuming the image.
On resume, the previous kernel flushes the messages which were in its
printk buffer to the serial port again, and then prints these *new*
messages...
[ 42.378363] hibernate: Restored 0 MTE pages
[ 42.378363] its_restore_enable
[ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
[ 42.417445] OOM killer enabled.
[ 42.417455] Restarting tasks: Starting
So the hibernated kernel seems to be doing the right thing in both
suspend and resume phases but it looks like the *boot* kernel doesn't
call the suspend method before transitioning; is that intentional? I
think we *should* unmap all the ITTs from the boot kernel.
At least for the vGIC, when the hibernated image resumes it will
*change* the mapping for every device that it knows about, but there's
a *possibility* that the boot kernel might have set up one that the
hibernated kernel didn't know about (if a new PCI device exists now?).
And I'm not sure what the real hardware will do if it gets a subsequent
MAPD without the previous one being unmapped.
* Re: Memory corruption after resume from hibernate with Arm GICv3 ITS
From: Rafael J. Wysocki @ 2025-07-24 9:51 UTC (permalink / raw)
To: David Woodhouse
Cc: Rafael J. Wysocki, Pavel Machek, linux-pm, Marc Zyngier,
linux-arm-kernel, Saidi, Ali, oliver.upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
linux-kernel, Heyne, Maximilian, Alexander Graf, Stamatis, Ilias
On Thu, Jul 24, 2025 at 11:26 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Wed, 2025-07-23 at 12:04 +0200, David Woodhouse wrote:
> > [snip full quote of the original message]
>
> Rafael points out that the resumed kernel isn't doing the unmap/remap
> again; it's merely printing the *same* messages again from the printk
> buffer.
>
> Before writing the hibernate image, the kernel calls the suspend op:
>
> [ 42.378350] Disabling non-boot CPUs ...
> [ 42.378363] its_save_disable
> [ 42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
> [ 42.378363] PM: hibernation: Creating image:
>
> Those messages are stored in the printk buffer in the image. Then the
> hibernating kernel calls the resume op, and writes the image:
>
> [ 42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)
> [ 42.378363] its_restore_enable
> [ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> [ 42.383601] nvme nvme0: 1/0/0 default/read/poll queues
> [ 42.384411] nvme nvme0: Ignoring bogus Namespace Identifiers
> [ 42.384924] hibernate: Hibernating on CPU 0 [mpidr:0x0]
> [ 42.387742] PM: Using 1 thread(s) for lzo compression
> [ 42.387748] PM: Compressing and saving image data (115654 pages)...
> [ 42.387757] PM: Image saving progress: 0%
> [ 43.485794] PM: Image saving progress: 10%
> ...
>
> Then the boot kernel comes up and maps an ITT:
>
> [ 1.270652] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f009000
>
> The boot kernel never seems to *unmap* that because the suspend method
> doesn't get called before resuming the image.
>
> On resume, the previous kernel flushes the messages which were in its
> printk buffer to the serial port again, and then prints these *new*
> messages...
>
> [ 42.378363] hibernate: Restored 0 MTE pages
> [ 42.378363] its_restore_enable
> [ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> [ 42.417445] OOM killer enabled.
> [ 42.417455] Restarting tasks: Starting
>
> So the hibernated kernel seems to be doing the right thing in both
> suspend and resume phases but it looks like the *boot* kernel doesn't
> call the suspend method before transitioning;
No, it does this, but the messages are missing from the log.
The last message you see from the boot/restore kernel is about loading
the image; a lot of stuff happens afterwards.
This message:
[ 1.871617] PM: hibernation: Read 462616 kbytes in 0.47 seconds (984.28 MB/s)
is printed by load_compressed_image() which gets called by
swsusp_read(), which is invoked by load_image_and_restore().
It is successful, so hibernation_restore() gets called and it does
quite a bit of work, including calling resume_target_kernel(), which
among other things calls syscore_suspend(), from where your messages
should be printed if I'm not mistaken.
I have no idea why those messages don't get into the log (that would
happen if your boot kernel were different from the image kernel and it
didn't actually print them).
> is that intentional? I think we *should* unmap all the ITTs from the boot kernel.
Yes, it's better to unmap them, even though ->
> At least for the vGIC, when the hibernated image resumes it will
> *change* the mapping for every device that it knows about, but there's
> a *possibility* that the boot kernel might have set up one that the
> hibernated kernel didn't know about (if a new PCI device exists now?).
-> HW configuration is not supposed to change across hibernation/restore.
> And I'm not sure what the real hardware will do if it gets a subsequent
> MAPD without the previous one being unmapped.
* Re: Memory corruption after resume from hibernate with Arm GICv3 ITS
2025-07-24 9:51 ` Rafael J. Wysocki
@ 2025-07-24 13:48 ` David Woodhouse
0 siblings, 0 replies; 4+ messages in thread
From: David Woodhouse @ 2025-07-24 13:48 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Pavel Machek, linux-pm, Marc Zyngier, linux-arm-kernel,
Saidi, Ali, oliver.upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, linux-kernel,
Heyne, Maximilian, Alexander Graf, Stamatis, Ilias
On Thu, 2025-07-24 at 11:51 +0200, Rafael J. Wysocki wrote:
>
> > So the hibernated kernel seems to be doing the right thing in both
> > suspend and resume phases but it looks like the *boot* kernel doesn't
> > call the suspend method before transitioning;
>
> No, it does this, but the messages are missing from the log.
>
> The last message you see from the boot/restore kernel is about loading
> the image; a lot of stuff happens afterwards.
>
> This message:
>
> [ 1.871617] PM: hibernation: Read 462616 kbytes in 0.47 seconds (984.28 MB/s)
>
> is printed by load_compressed_image() which gets called by
> swsusp_read(), which is invoked by load_image_and_restore().
>
> It is successful, so hibernation_restore() gets called and it does
> quite a bit of work, including calling resume_target_kernel(), which
> among other things calls syscore_suspend(), from where your messages
> should be printed if I'm not mistaken.
>
> I have no idea why those messages don't get into the log (that would
> happen if your boot kernel were different from the image kernel and it
> didn't actually print them).
This is serial console output (stdout from 'qemu -serial mon:stdio'). I
guess the missing messages were in the printk buffer of the boot kernel
but just didn't get flushed? I added some --trace arguments to qemu to
see what's actually happening.
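For reference, the trace events seen below can be enabled with something like the following (the machine options are illustrative, not the exact command line used; the event names are taken from the traces themselves):

```shell
# Build up the -trace arguments for the GICv3 ITS events of interest.
TRACE_ARGS="-trace gicv3_its_process_command -trace gicv3_its_cmd_mapd -trace gicv3_its_dte_write"
echo qemu-system-aarch64 -M virt,gic-version=3 -serial mon:stdio $TRACE_ARGS
```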
So when resuming, the boot looks like this:
gicv3_its_process_command GICv3 ITS: processing command at offset 0x4: 0x8
gicv3_its_cmd_mapd GICv3 ITS: command MAPD DeviceID 0x10 Size 0x6 ITT_addr 0x10f0a20 V 1
gicv3_its_dte_write GICv3 ITS: Device Table write for DeviceID 0x10: valid 1 size 0x6 ITTaddr 0x10f0a20
[ 27.440351] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f0a2000
And then the transition goes:
[ 47.668973] PM: Image loading progress: 90%
[ 48.030462] PM: Image loading progress: 100%
[ 48.031307] PM: Image loading done
[ 48.031773] PM: hibernation: Read 460728 kbytes in 13.11 seconds (35.14 MB/s)
gicv3_its_cmd_mapd GICv3 ITS: command MAPD DeviceID 0x10 Size 0x6 ITT_addr 0x10f0a20 V 0
gicv3_its_dte_write GICv3 ITS: Device Table write for DeviceID 0x10: valid 0 size 0x6 ITTaddr 0x10f0a20
gicv3_its_cmd_mapd GICv3 ITS: command MAPD DeviceID 0x10 Size 0x6 ITT_addr 0x10e3130 V 1
gicv3_its_dte_write GICv3 ITS: Device Table write for DeviceID 0x10: valid 1 size 0x6 ITTaddr 0x10e3130
[ 178.261284] Disabling non-boot CPUs ...
[ 178.261674] its_save_disable
[ 178.261674] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10e313000
[ 178.261674] PM: hibernation: Creating image:
[ 178.261674] PM: hibernation: Need to copy 152532 pages
[ 178.261674] hibernate: Restored 0 MTE pages
[ 178.261674] its_restore_enable
[ 178.261674] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10e313000
[ 178.831481] OOM killer enabled.
[ 178.831614] Restarting tasks: Starting
So we don't see the *printk* from the boot kernel, as you said. But it
*is* unmapping from the old address (MAPD, ITT_addr 0x10f0a20, Valid 0)
before the resumed kernel does the map at the address *it* was using
(MAPD, ITT_addr 0x10e3130, Valid 1). Looking just at the MAPD traces:
gicv3_its_cmd_mapd GICv3 ITS: command MAPD DeviceID 0x10 Size 0x6 ITT_addr 0x10e3130 V 1 ← Original clean boot
gicv3_its_cmd_mapd GICv3 ITS: command MAPD DeviceID 0x10 Size 0x6 ITT_addr 0x10e3130 V 0 ← Prior to generating hibernate image
gicv3_its_cmd_mapd GICv3 ITS: command MAPD DeviceID 0x10 Size 0x6 ITT_addr 0x10e3130 V 1 ← Before *writing* hibernate image and powering down (actually reboot in this case)
gicv3_its_cmd_mapd GICv3 ITS: command MAPD DeviceID 0x10 Size 0x6 ITT_addr 0x10f0a20 V 1 ← Boot kernel starting up prior to resume
gicv3_its_cmd_mapd GICv3 ITS: command MAPD DeviceID 0x10 Size 0x6 ITT_addr 0x10f0a20 V 0 ← Boot kernel unmapping when we don't see its printk
gicv3_its_cmd_mapd GICv3 ITS: command MAPD DeviceID 0x10 Size 0x6 ITT_addr 0x10e3130 V 1 ← Hibernated kernel remapping the ITT
So it looks like my test patch is doing the right thing, at least for
hibernation? I'm not sure about kexec?
There are also *other* tables where the GIC scribbles on memory, for
pending interrupts for KVM guests (vLPI pending tables). We've had
problems with those too¹, causing machines to crash on kexec because
the GIC scribbles on pages which are *actually* now the new kernel's
text. I'm not sure if we should try to come up with a unified solution
for that or deal with them separately... the solution there seems to
involve iterating ∀ kvm ∀ vCPU so I suspect it does need to live in
KVM.
¹ https://lore.kernel.org/all/20250623132714.965474-2-dwmw2@infradead.org/