* Migration failure when running nested VMs
From: Jintack Lim @ 2019-09-20 19:01 UTC
To: QEMU Devel Mailing List

Hi,

I'm seeing VM live migration failure when a VM is running a nested VM.
I'm using latest Linux kernel (v5.3) and QEMU (v4.1.0). I also tried
v5.2, but the result was the same. Kernel versions in L1 and L2 VM are
v4.18, but I don't think that matters.

The symptom is that L2 VM kernel crashes in different places after
migration but the call stack is mostly related to memory management
like [1] and [2]. The kernel crash happens almost all the time. While
L2 VM gets kernel panic, L1 VM runs fine after the migration. Both L1
and L2 VM were doing nothing during migration.

I found a few clues about this issue.
1) It happens with a relatively large memory for L1 (24G), but it does
not with a smaller size (3G).

2) Dead migration worked; when I ran "stop" command in the qemu
monitor for L1 first and did migration, migration worked always. It
also worked when I only stopped L2 VM and kept L1 live during the
migration.

With those two clues, I guess maybe some dirty pages made by L2 are
not transferred to the destination correctly, but I'm not really sure.

3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
Intel(R) Xeon(R) CPU E5-2630 v3 CPU.

This makes me confused because I thought migrating nested state
doesn't depend on the underlying hardware.. Anyways, L1-only migration
with the large memory size (24G) works on both CPUs without any
problem.

I would appreciate any comments/suggestions to fix this problem.

Thanks,
Jintack


[1]https://paste.ubuntu.com/p/XGDKH45yt4/
[2]https://paste.ubuntu.com/p/CpbVTXJCyc/
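The stop-then-migrate test described in clue 2 can also be driven over QMP instead of being typed into the monitor. The following is only a minimal sketch and is not taken from the thread: it assumes QEMU was started with a QMP UNIX socket, and the helper names `dead_migration_commands` and `run_qmp` are made up for illustration. `qmp_capabilities`, `stop` and `migrate` are standard QMP commands.

```python
import json
import socket


def dead_migration_commands(dest_uri):
    """Build the QMP messages for a stop-then-migrate ("dead") migration."""
    return [
        {"execute": "qmp_capabilities"},  # leave capabilities-negotiation mode
        {"execute": "stop"},              # pause the guest, like "stop" in the monitor
        {"execute": "migrate", "arguments": {"uri": dest_uri}},
    ]


def run_qmp(sock_path, commands):
    """Send each command over a QMP UNIX socket and collect the replies.

    sock_path is an assumption: whatever path was given to -qmp unix:...
    """
    replies = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        stream = sock.makefile("rw", encoding="utf-8")
        json.loads(stream.readline())     # greeting banner sent by QEMU
        for cmd in commands:
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
            while True:
                msg = json.loads(stream.readline())
                # Asynchronous events carry an "event" key; skip them.
                if "return" in msg or "error" in msg:
                    replies.append(msg)
                    break
    return replies
```

For example, `run_qmp("/var/run/qmp", dead_migration_commands("tcp:192.0.2.10:4444"))` would pause the guest and start the migration; the destination URI here is a placeholder. `migrate` returns as soon as the transfer has started, so completion still has to be checked separately, for instance with `query-migrate`.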
* Re: Migration failure when running nested VMs
From: Dr. David Alan Gilbert @ 2019-09-23 10:42 UTC
To: Jintack Lim, pbonzini; +Cc: QEMU Devel Mailing List

* Jintack Lim (incredible.tack@gmail.com) wrote:
> Hi,

Copying in Paolo, since he recently did work to fix nested migration -
it was expected to be broken until pretty recently; but 4.1.0 qemu on
5.3 kernel is pretty new, so I think I'd expected it to work.

> I'm seeing VM live migration failure when a VM is running a nested VM.
> I'm using latest Linux kernel (v5.3) and QEMU (v4.1.0). I also tried
> v5.2, but the result was the same. Kernel versions in L1 and L2 VM are
> v4.18, but I don't think that matters.
>
> The symptom is that L2 VM kernel crashes in different places after
> migration but the call stack is mostly related to memory management
> like [1] and [2]. The kernel crash happens almost all the time. While
> L2 VM gets kernel panic, L1 VM runs fine after the migration. Both L1
> and L2 VM were doing nothing during migration.
>
> I found a few clues about this issue.
> 1) It happens with a relatively large memory for L1 (24G), but it does
> not with a smaller size (3G).
>
> 2) Dead migration worked; when I ran "stop" command in the qemu
> monitor for L1 first and did migration, migration worked always. It
> also worked when I only stopped L2 VM and kept L1 live during the
> migration.
>
> With those two clues, I guess maybe some dirty pages made by L2 are
> not transferred to the destination correctly, but I'm not really sure.
>
> 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
>
> This makes me confused because I thought migrating nested state
> doesn't depend on the underlying hardware.. Anyways, L1-only migration
> with the large memory size (24G) works on both CPUs without any
> problem.
>
> I would appreciate any comments/suggestions to fix this problem.

Can you share the qemu command lines you're using for both L1 and L2
please ?
Are there any dmesg entries around the time of the migration on either
the hosts or the L1 VMs?
What guest OS are you running in L1 and L2?

Dave

> Thanks,
> Jintack
>
>
> [1]https://paste.ubuntu.com/p/XGDKH45yt4/
> [2]https://paste.ubuntu.com/p/CpbVTXJCyc/
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Migration failure when running nested VMs
From: Paolo Bonzini @ 2019-09-23 11:48 UTC
To: Dr. David Alan Gilbert, Jintack Lim; +Cc: QEMU Devel Mailing List

On 23/09/19 12:42, Dr. David Alan Gilbert wrote:
>
> With those two clues, I guess maybe some dirty pages made by L2 are
> not transferred to the destination correctly, but I'm not really sure.
>
> 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> Intel(R) Xeon(R) CPU E5-2630 v3 CPU.

Hmm, try disabling pml (kvm_intel.pml=0). This would be the main
difference, memory-management wise, between those two machines.

Paolo
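Before and after trying this, it may help to confirm what the host's kvm_intel module actually reports for PML. A minimal sketch, assuming a Linux host where kvm_intel exposes its `pml` parameter under sysfs (boolean module parameters read back as `Y` or `N`); the function name `pml_state` is hypothetical and not part of any tool mentioned in the thread:

```python
from pathlib import Path

# Assumed sysfs location of the kvm_intel "pml" module parameter.
PML_PARAM = Path("/sys/module/kvm_intel/parameters/pml")


def pml_state(param_path=PML_PARAM):
    """Return True if PML is reported enabled, False if disabled,
    or None if the state cannot be determined."""
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        return None  # e.g. kvm_intel is not loaded on this host
    if value in ("Y", "1"):
        return True
    if value in ("N", "0"):
        return False
    return None
```

`kvm_intel.pml=0` is the kernel command-line form of the parameter. If kvm_intel is built as a module, passing `pml=0` when loading it (for example `modprobe kvm_intel pml=0`, with no VMs running) should have the same effect, after which `pml_state()` would be expected to return `False`.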
* Re: Migration failure when running nested VMs
From: Jintack Lim @ 2019-09-23 18:32 UTC
To: Paolo Bonzini; +Cc: Dr. David Alan Gilbert, QEMU Devel Mailing List

On Mon, Sep 23, 2019 at 4:48 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 23/09/19 12:42, Dr. David Alan Gilbert wrote:
> >
> > With those two clues, I guess maybe some dirty pages made by L2 are
> > not transferred to the destination correctly, but I'm not really sure.
> >
> > 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> > Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
>
> Hmm, try disabling pml (kvm_intel.pml=0). This would be the main
> difference, memory-management wise, between those two machines.
>

Thank you, Paolo.

This makes migration work successfully over 20 times in a row on
Intel(R) Xeon(R) Silver 4114 CPU where migration failed almost always
without disabling pml.

I guess there's a problem in KVM pml code? I'm fine with disabling
pml. But if you have patches to fix the issue, I'm willing to test it
on the CPU.

Thanks,
Jintack

> Paolo
* Re: Migration failure when running nested VMs
From: Paolo Bonzini @ 2019-09-24 0:19 UTC
To: Jintack Lim; +Cc: Dr. David Alan Gilbert, QEMU Devel Mailing List

On 23/09/19 20:32, Jintack Lim wrote:
> On Mon, Sep 23, 2019 at 4:48 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 23/09/19 12:42, Dr. David Alan Gilbert wrote:
>>>
>>> With those two clues, I guess maybe some dirty pages made by L2 are
>>> not transferred to the destination correctly, but I'm not really sure.
>>>
>>> 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
>>> Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
>>
>> Hmm, try disabling pml (kvm_intel.pml=0). This would be the main
>> difference, memory-management wise, between those two machines.
>>
>
> Thank you, Paolo.
>
> This makes migration work successfully over 20 times in a row on
> Intel(R) Xeon(R) Silver 4114 CPU where migration failed almost always
> without disabling pml.
>
> I guess there's a problem in KVM pml code? I'm fine with disabling
> pml. But if you have patches to fix the issue, I'm willing to test it
> on the CPU.

Yes, it's a known bug in the PML code (that I thought was not an issue
for migration, but I was wrong). I'll try to get you a patch this week.

Paolo
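The thread does not say what the PML bug is. The toy model below only illustrates why a guest write that never reaches the dirty log would fit the reported symptoms: live migration leaves the destination with stale memory, while migrating a stopped guest works. It is a deliberately simplified sketch of pre-copy migration, not QEMU's or KVM's implementation, and `precopy_migrate` and its parameters are invented for this example.

```python
def precopy_migrate(source, guest_writes, logged):
    """Toy pre-copy migration over a list of page contents.

    source       -- page contents on the source host (mutated by guest writes)
    guest_writes -- (page_index, new_value) pairs applied while the bulk copy runs
    logged       -- predicate: does a write to this page reach the dirty log?
    Returns the page contents as seen by the destination.
    """
    dest = list(source)               # pass 1: bulk copy of every page
    dirty = set()
    for page, value in guest_writes:  # guest keeps running during pass 1
        source[page] = value
        if logged(page):
            dirty.add(page)           # only logged writes are remembered
    for page in dirty:                # final pass: resend pages the log reported
        dest[page] = source[page]
    return dest
```

With `logged` always true, the destination matches the source. If a write to one page is not logged, that page arrives stale. With no guest writes at all (the stopped-guest case), the bulk copy alone is already consistent.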
* Re: Migration failure when running nested VMs
From: Jintack Lim @ 2019-09-23 18:32 UTC
To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, QEMU Devel Mailing List

On Mon, Sep 23, 2019 at 3:42 AM Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>
> * Jintack Lim (incredible.tack@gmail.com) wrote:
> > Hi,
>
> Copying in Paolo, since he recently did work to fix nested migration -
> it was expected to be broken until pretty recently; but 4.1.0 qemu on
> 5.3 kernel is pretty new, so I think I'd expected it to work.
>

Thank you, Dave.
What Paolo proposed makes migration work!

> > I'm seeing VM live migration failure when a VM is running a nested VM.
> > I'm using latest Linux kernel (v5.3) and QEMU (v4.1.0). I also tried
> > v5.2, but the result was the same. Kernel versions in L1 and L2 VM are
> > v4.18, but I don't think that matters.
> >
> > The symptom is that L2 VM kernel crashes in different places after
> > migration but the call stack is mostly related to memory management
> > like [1] and [2]. The kernel crash happens almost all the time. While
> > L2 VM gets kernel panic, L1 VM runs fine after the migration. Both L1
> > and L2 VM were doing nothing during migration.
> >
> > I found a few clues about this issue.
> > 1) It happens with a relatively large memory for L1 (24G), but it does
> > not with a smaller size (3G).
> >
> > 2) Dead migration worked; when I ran "stop" command in the qemu
> > monitor for L1 first and did migration, migration worked always. It
> > also worked when I only stopped L2 VM and kept L1 live during the
> > migration.
> >
> > With those two clues, I guess maybe some dirty pages made by L2 are
> > not transferred to the destination correctly, but I'm not really sure.
> >
> > 3) It happens on Intel(R) Xeon(R) Silver 4114 CPU, but it does not on
> > Intel(R) Xeon(R) CPU E5-2630 v3 CPU.
> >
> > This makes me confused because I thought migrating nested state
> > doesn't depend on the underlying hardware.. Anyways, L1-only migration
> > with the large memory size (24G) works on both CPUs without any
> > problem.
> >
> > I would appreciate any comments/suggestions to fix this problem.
>
> Can you share the qemu command lines you're using for both L1 and L2
> please ?

Sure. I use the same QEMU command line for L1 and L2 except for cpu
and memory allocation. This is the one for running L1, and I use
smaller cpu and memory size for L2.

./qemu/x86_64-softmmu/qemu-system-x86_64 -smp 6 -m 24G \
  -M q35,accel=kvm -cpu host \
  -drive if=none,file=/vm_nfs/guest0.img,id=vda,cache=none,format=raw \
  -device virtio-blk-pci,drive=vda \
  --nographic \
  -qmp unix:/var/run/qmp,server,wait \
  -serial mon:stdio \
  -netdev user,id=net0,hostfwd=tcp::2222-:22 \
  -device virtio-net-pci,netdev=net0,mac=de:ad:be:ef:f2:12 \
  -netdev tap,id=net1,vhost=on,helper=/srv/vm/qemu/qemu-bridge-helper \
  -device virtio-net-pci,netdev=net1,disable-modern=off,disable-legacy=on,mac=de:ad:be:ef:f2:11 \
  -monitor telnet:127.0.0.1:4444,server,nowait

> Are there any dmesg entries around the time of the migration on either
> the hosts or the L1 VMs?

No, I didn't see anything special in L0 or L1 kernel log.

> What guest OS are you running in L1 and L2?
>

I'm using Linux v4.18 both in L1 and L2.

Thanks,
Jintack

> Dave
>
> > Thanks,
> > Jintack
> >
> >
> > [1]https://paste.ubuntu.com/p/XGDKH45yt4/
> > [2]https://paste.ubuntu.com/p/CpbVTXJCyc/
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK