* [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
@ 2015-01-08 13:36 Gerhard Wiesinger
  2015-01-08 17:24 ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-08 13:36 UTC (permalink / raw)
  To: qemu-devel

Hello,

After upgrading my KVM environment from Fedora 20 to Fedora 21 up2date (hosts and guests, Intel CPU) I've the following problem:
1.) On the database VM PostgreSQL e.g. 2 processes hang with 100% CPU
2.) On the monitoring VM Munin/RRDtool also hangs with 100% CPU

Killing of processes is not possible, only reboot helps.

But I nailed it down to the following:
yum install strace
strace -y -p339
Process 339 attached
# no system calls here
^CProcess 339 detached

Going deeper:
yum --enablerepo fedora-debuginfo,updates-debuginfo install gdb postgresql-debuginfo
gdb postgres 339
(gdb) bt
#0  0x00007fffffbf8ff8 in gettimeofday ()
#1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
(gdb) frame 1
#1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
1274            gettimeofday(&tp, NULL);
(same on second process here)

Quitting and reattaching gdb also hangs here, so gettimeofday takes 100% CPU and never ends!

Therefore I guess this is a problem either in the Linux kernel or in QEMU/KVM.

It might be the case that something changed in timer handling or some default changed here.

VMs are time synchronized with NTP, so it might also be that NTP clock adjustments make the call hang here.

Workaround when it happens: reboot ....

Any further ideas?

Thank you.

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 13:36 [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever Gerhard Wiesinger
@ 2015-01-08 17:24 ` Paolo Bonzini
  2015-01-08 18:12   ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-08 17:24 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel

On 08/01/2015 14:36, Gerhard Wiesinger wrote:
> Quitting and reattaching gdb also hangs here, so gettimeofday takes 100%
> CPU and never ends!
>
> Therefore I guess this is a problem either in the Linux kernel or in
> QEMU/KVM.

What kernel are you running on (and were you running on)?

Can you try F20 host and F21 guest or vice versa?

Paolo

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 17:24 ` Paolo Bonzini
@ 2015-01-08 18:12   ` Gerhard Wiesinger
  2015-01-08 18:22     ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-08 18:12 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 08.01.2015 18:24, Paolo Bonzini wrote:
>
> On 08/01/2015 14:36, Gerhard Wiesinger wrote:
>> Quitting and reattaching gdb also hangs here, so gettimeofday takes 100%
>> CPU and never ends!
>>
>> Therefore I guess this is a problem either in the Linux kernel or in
>> QEMU/KVM.
> What kernel are you running on (and were you running on)?
>
> Can you try F20 host and F21 guest or vice versa?

Hello Paolo,

Always latest available stable versions (kernel/qemu-kvm):
F20: 3.17.6-200.fc20.x86_64 on guest/host, qemu-kvm-1.6.2-10.fc20.x86_64 on host
F21: 3.17.7-300.fc21.x86_64 on guest/host, qemu-kvm-2.1.2-7.fc21.x86_64 on host
(I also had 3.17.6-300.fc21.x86_64 and it happened there, too.)

The catch is: it only happens after some time (e.g. hours to days).
It is a production environment running around 10 VMs and I want to avoid many experiments there.

Since kernels were the same on FC20/F21 and qemu/kvm changed from 1.6.2 to 2.1.2 I guess the topic seems to be there. Also newer gcc might be a topic.

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/kernel/time/timekeeping.c?id=refs/tags/v3.17.7#n695

I guess it hangs in the do/while loop here:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/kernel/time/timekeeping.c?id=refs/tags/v3.17.7#n493

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 18:12   ` Gerhard Wiesinger
@ 2015-01-08 18:22     ` Paolo Bonzini
  2015-01-08 22:28       ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-08 18:22 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel

On 08/01/2015 19:12, Gerhard Wiesinger wrote:
> Since kernels were the same on FC20/F21 and qemu/kvm changed from
> 1.6.2 to 2.1.2 I guess the topic seems to be there. Also newer gcc
> might be a topic.

Indeed. Can you try the 2.2.0 qemu-kvm release, available in the
fedora-virt-preview repository?

http://fedoraproject.org/wiki/Virtualization_Preview_Repository

Thanks,

Paolo

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 18:22     ` Paolo Bonzini
@ 2015-01-08 22:28       ` Gerhard Wiesinger
  2015-01-08 22:42         ` Paolo Bonzini
  2015-01-12 11:41         ` Gerhard Wiesinger
  0 siblings, 2 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-08 22:28 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 08.01.2015 19:22, Paolo Bonzini wrote:
>
> On 08/01/2015 19:12, Gerhard Wiesinger wrote:
>> Since kernels were the same on FC20/F21 and qemu/kvm changed from
>> 1.6.2 to 2.1.2 I guess the topic seems to be there. Also newer gcc
>> might be a topic.
> Indeed. Can you try the 2.2.0 qemu-kvm release, available in the
> fedora-virt-preview repository?
>
> http://fedoraproject.org/wiki/Virtualization_Preview_Repository

Updated to 2.2.0 qemu-kvm release, worked seamlessly so far for all VMs.

I'll keep you up to date in the next days whether it happens again or not.

BTW: Has something changed in the time code area between 1.6.2 and 2.1.2?

Thank you so far.

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 22:28       ` Gerhard Wiesinger
@ 2015-01-08 22:42         ` Paolo Bonzini
  2015-01-12 11:41         ` Gerhard Wiesinger
  1 sibling, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-08 22:42 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel

On 08/01/2015 23:28, Gerhard Wiesinger wrote:
>>
>
> Updated to 2.2.0 qemu-kvm release, worked seamlessly so far for all VMs.
>
> I'll keep you up to date in the next days whether it happens again or not.
>
> BTW: Has something changed in the time code area between 1.6.2 and 2.1.2?

Most of this is actually in the kernel, but sometimes even
reading/writing the status out of the kernel can trigger bugs.

2.2 has more robust live migration of the kvm pv clock, but fixing that
also uncovered a few other bugs and it's possible they were latent in 2.1.

Crossing fingers.

Paolo

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-08 22:28       ` Gerhard Wiesinger
  2015-01-08 22:42         ` Paolo Bonzini
@ 2015-01-12 11:41         ` Gerhard Wiesinger
  2015-01-12 11:46           ` Paolo Bonzini
  2015-01-13 20:13           ` Gerhard Wiesinger
  1 sibling, 2 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-12 11:41 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 08.01.2015 23:28, Gerhard Wiesinger wrote:
> On 08.01.2015 19:22, Paolo Bonzini wrote:
>>
>> Indeed. Can you try the 2.2.0 qemu-kvm release, available in the
>> fedora-virt-preview repository?
>>
>> http://fedoraproject.org/wiki/Virtualization_Preview_Repository
>
> Updated to 2.2.0 qemu-kvm release, worked seamlessly so far for all VMs.
>
> I'll keep you up to date in the next days whether it happens again or
> not.

With the qemu-kvm 2.2.0 release from the above repository the 100% usage didn't happen so far (although I had to reboot after a kernel update).

So it looks like qemu-kvm 2.1.x has major bugs regarding timer handling. Any backporting planned?

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-12 11:41         ` Gerhard Wiesinger
@ 2015-01-12 11:46           ` Paolo Bonzini
  2015-01-12 11:53             ` Gerhard Wiesinger
  2015-01-13 20:13           ` Gerhard Wiesinger
  1 sibling, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-12 11:46 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel

On 12/01/2015 12:41, Gerhard Wiesinger wrote:
>>
>> Updated to 2.2.0 qemu-kvm release, worked seamlessly so far for all VMs.
>>
>> I'll keep you up to date in the next days whether it happens again or
>> not.
>
> With the qemu-kvm 2.2.0 release from the above repository the 100% usage
> didn't happen so far (although I had to reboot after a kernel update).
>
> So it looks like qemu-kvm 2.1.x has major bugs regarding timer handling.
> Any backporting planned?

That's difficult without bisection pointing out where the bugs were
fixed. 2.1.3 is scheduled real soon now and it will be the last release
from the 2.1.x branch.

Paolo

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-12 11:46           ` Paolo Bonzini
@ 2015-01-12 11:53             ` Gerhard Wiesinger
  2015-01-12 11:54               ` Paolo Bonzini
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-12 11:53 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 12.01.2015 12:46, Paolo Bonzini wrote:
>
> On 12/01/2015 12:41, Gerhard Wiesinger wrote:
>>> Updated to 2.2.0 qemu-kvm release, worked seamlessly so far for all VMs.
>>>
>>> I'll keep you up to date in the next days whether it happens again or
>>> not.
>> With the qemu-kvm 2.2.0 release from the above repository the 100% usage
>> didn't happen so far (although I had to reboot after a kernel update).
>>
>> So it looks like qemu-kvm 2.1.x has major bugs regarding timer handling.
>> Any backporting planned?
> That's difficult without bisection pointing out where the bugs were
> fixed. 2.1.3 is scheduled real soon now and it will be the last release
> from the 2.1.x branch.

So no security updates planned for 2.1.x afterwards?

For Fedora: Is an update to 2.2.x planned for the regular updates repo?
Will updates for 2.2.x also be released in the preview repo?
http://fedoraproject.org/wiki/Virtualization_Preview_Repository

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-12 11:53             ` Gerhard Wiesinger
@ 2015-01-12 11:54               ` Paolo Bonzini
  0 siblings, 0 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-12 11:54 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel

On 12/01/2015 12:53, Gerhard Wiesinger wrote:
>>>
>> That's difficult without bisection pointing out where the bugs were
>> fixed. 2.1.3 is scheduled real soon now and it will be the last release
>> from the 2.1.x branch.
>
> So no security updates planned for 2.1.x afterwards?

Distros usually take care of them, except for really egregious
vulnerabilities.

Paolo

> For Fedora: Is an update to 2.2.x planned for the regular updates repo?
> Will updates for 2.2.x also be released in the preview repo?
> http://fedoraproject.org/wiki/Virtualization_Preview_Repository

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-12 11:41         ` Gerhard Wiesinger
  2015-01-12 11:46           ` Paolo Bonzini
@ 2015-01-13 20:13           ` Gerhard Wiesinger
  2015-01-13 20:48             ` Paolo Bonzini
  2015-01-14 17:47             ` Juan Quintela
  1 sibling, 2 replies; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-13 20:13 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel

On 12.01.2015 12:41, Gerhard Wiesinger wrote:
> On 08.01.2015 23:28, Gerhard Wiesinger wrote:
>>
>> I'll keep you up to date in the next days whether it happens again or
>> not.
>
> With the qemu-kvm 2.2.0 release from the above repository the 100% usage
> didn't happen so far (although I had to reboot after a kernel update).

It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL is running:
(gdb) bt
#0  0x00007fff9a1feff4 in gettimeofday ()
#1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274

What we know:
OK : F20: 3.17.6-200.fc20.x86_64 on guest/host, qemu-kvm-1.6.2-10.fc20.x86_64 on host
NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host, qemu-kvm-2.1.2-7.fc21.x86_64 on host
NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host, qemu-kvm-2.2.0-1.fc21.x86_64 on host

No one else can reproduce or has similar problems?
Any further ideas?

BTW: I'm running ntp in the following manner:
internet <=> ntp server in VM <=> ntp client on KVM host (firewall runs in KVM)

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 20:13           ` Gerhard Wiesinger
@ 2015-01-13 20:48             ` Paolo Bonzini
  2015-01-13 21:08               ` Cole Robinson
  2015-01-13 21:14               ` Gerhard Wiesinger
  2015-01-14 17:47             ` Juan Quintela
  1 sibling, 2 replies; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-13 20:48 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel, Cole Robinson, virt

On 13/01/2015 21:13, Gerhard Wiesinger wrote:
>
> It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL
> is running:
> (gdb) bt
> #0  0x00007fff9a1feff4 in gettimeofday ()
> #1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>
> What we know:
> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
> qemu-kvm-1.6.2-10.fc20.x86_64 on host
> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.1.2-7.fc21.x86_64 on host
> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>
> No one else can reproduce or has similar problems?
> Any further ideas?

Hmm, too bad. :(

Any chance you can try with 1.7 and 2.0 releases? Cole, perhaps you
can help build some RPMs for Gerhard?

Paolo

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-01-13 20:48 ` Paolo Bonzini @ 2015-01-13 21:08 ` Cole Robinson 2015-01-13 21:14 ` Gerhard Wiesinger 1 sibling, 0 replies; 35+ messages in thread From: Cole Robinson @ 2015-01-13 21:08 UTC (permalink / raw) To: Paolo Bonzini, Gerhard Wiesinger, qemu-devel, virt On 01/13/2015 03:48 PM, Paolo Bonzini wrote: > > > On 13/01/2015 21:13, Gerhard Wiesinger wrote: >> >> It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL >> is running: >> (gdb) bt >> #0 0x00007fff9a1feff4 in gettimeofday () >> #1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274 >> >> What we know: >> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host, >> qemu-kvm-1.6.2-10.fc20.x86_64 on host >> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host, >> qemu-kvm-2.1.2-7.fc21.x86_64 on host >> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host, >> qemu-kvm-2.2.0-1.fc21.x86_64 on host >> >> No one less can reproduce or has similar problems? >> Any further ideas? > > Hmm, too bad. :( > > Any chance you can try with 1.7 and 2.0 releases? Cole, perhaps you > help building some RPMS for Gerhard? > > Paolo > There's F21 qemu-2.0 builds here: http://koji.fedoraproject.org/koji/buildinfo?buildID=538383 And 1.7 builds here: http://koji.fedoraproject.org/koji/buildinfo?buildID=499026 There might be dependency issues trying to install on latest f21, if they were built against older libraries. Worth a shot though. Besides that something custom will need to be built, though at that point it's probably easier to build qemu from source - Cole ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 20:48             ` Paolo Bonzini
  2015-01-13 21:08               ` Cole Robinson
@ 2015-01-13 21:14               ` Gerhard Wiesinger
  2015-01-13 21:16                 ` Paolo Bonzini
  1 sibling, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-13 21:14 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, Cole Robinson, virt

On 13.01.2015 21:48, Paolo Bonzini wrote:
>
> On 13/01/2015 21:13, Gerhard Wiesinger wrote:
>> It happens also with qemu-kvm 2.2.0 on another VM where also PostgreSQL
>> is running:
>> (gdb) bt
>> #0  0x00007fff9a1feff4 in gettimeofday ()
>> #1  0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>>
>> What we know:
>> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
>> qemu-kvm-1.6.2-10.fc20.x86_64 on host
>> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.1.2-7.fc21.x86_64 on host
>> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
>> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>>
>> No one else can reproduce or has similar problems?
>> Any further ideas?
> Hmm, too bad. :(
>
> Any chance you can try with 1.7 and 2.0 releases? Cole, perhaps you
> can help build some RPMs for Gerhard?
>

Yes, I can try. RPMs would be fine.

Don't know if it is related, but on another VM I was getting the following 75s after reboot:
[   78.857006] INFO: rcu_sched self-detected stall on CPU { 0} (t=60000 jiffies g=1966 c=1965 q=0)
[  258.860006] INFO: rcu_sched self-detected stall on CPU { 0} (t=240003 jiffies g=1966 c=1965 q=0)

I also had a look at the kernel code again:
http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
499         do {
500                 seq = read_seqcount_begin(&tk_core.seq);
501
502                 ts->tv_sec = tk->xtime_sec;
503                 nsecs = timekeeping_get_ns(&tk->tkr);
504
505         } while (read_seqcount_retry(&tk_core.seq, seq));

So it looks like the seqcount always changes and therefore it loops forever here (as far as I dug down, this is the only loop here).

Might be something wrong with the memory barriers in recent qemu-kvm releases?

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
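[Editorial sketch] The retry loop quoted in the message above can be modelled in user space. What follows is a simplified, single-threaded illustration of a sequence-counter read, not the kernel's implementation: `struct seq_time`, `seq_time_write` and `seq_time_read` are hypothetical names, the atomics use default ordering rather than the kernel's barriers, and the `max_tries` bound exists only so the example terminates where an unbounded loop would spin.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of data guarded by a sequence counter. */
struct seq_time {
    atomic_uint seq;   /* even: data stable, odd: writer in progress */
    int64_t sec;
    int64_t nsec;
};

/* Writer: make the counter odd, update the data, make it even again. */
static void seq_time_write(struct seq_time *st, int64_t sec, int64_t nsec)
{
    atomic_fetch_add(&st->seq, 1);
    st->sec = sec;
    st->nsec = nsec;
    atomic_fetch_add(&st->seq, 1);
}

/*
 * Reader: take a snapshot and accept it only if the counter was even
 * before the read and unchanged after it. Returns false if no
 * consistent snapshot was obtained within max_tries attempts.
 */
static bool seq_time_read(struct seq_time *st, int64_t *sec, int64_t *nsec,
                          unsigned max_tries)
{
    for (unsigned i = 0; i < max_tries; i++) {
        unsigned start = atomic_load(&st->seq);
        if (start & 1u)
            continue;                 /* writer active: retry */
        int64_t s = st->sec;
        int64_t n = st->nsec;
        if (atomic_load(&st->seq) == start) {
            *sec = s;
            *nsec = n;
            return true;              /* consistent snapshot */
        }
    }
    return false;                     /* counter never settled */
}
```

If the counter stays odd or keeps changing, `seq_time_read()` uses up `max_tries` and gives up; a loop without that bound would keep a CPU at 100% in the same situation. That would be consistent with the symptom reported here, but at this point in the thread it is a hypothesis, not a confirmed cause.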
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 21:14               ` Gerhard Wiesinger
@ 2015-01-13 21:16                 ` Paolo Bonzini
  2015-01-13 22:01                   ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Paolo Bonzini @ 2015-01-13 21:16 UTC (permalink / raw)
  To: Gerhard Wiesinger, qemu-devel, Cole Robinson, virt

On 13/01/2015 22:14, Gerhard Wiesinger wrote:
>
> I also had a look at the kernel code again:
> http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
> 499         do {
> 500                 seq = read_seqcount_begin(&tk_core.seq);
> 501
> 502                 ts->tv_sec = tk->xtime_sec;
> 503                 nsecs = timekeeping_get_ns(&tk->tkr);
> 504
> 505         } while (read_seqcount_retry(&tk_core.seq, seq));
>
> So it looks like the seqcount always changes and therefore it loops
> forever here (as far as I dug down, this is the only loop here).
>
> Might be something wrong with the memory barriers in recent qemu-kvm
> releases?

No, that's not possible. Unless you pause/resume or migrate the VM, all
of the handling of kvmclock is entirely in the kernel.

Paolo

^ permalink raw reply [flat|nested] 35+ messages in thread
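[Editorial sketch] Since the reply above places kvmclock handling in the kernel, one quick check inside a guest is which clocksource the kernel is currently using. The helper below is a minimal sketch: `read_clocksource` is a hypothetical name, the sysfs path is the standard Linux location for the active clocksource, and a KVM guest using the paravirtual clock is expected to report `kvm-clock` there (an assumption about this setup; the thread does not show the value).

```c
#include <stdio.h>
#include <string.h>

/* Standard sysfs location of the active clocksource on Linux. */
#define CLOCKSOURCE_PATH \
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"

/*
 * Read the first line of `path` into `buf` (at most len - 1 characters)
 * and strip the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
int read_clocksource(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (len == 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* drop the newline, if any */
    return 0;
}
```

A caller would pass `CLOCKSOURCE_PATH` and a small buffer, then print or compare the result. Comparing the value on an affected and an unaffected guest could help tell whether the hang is tied to one clocksource.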
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-13 21:16                 ` Paolo Bonzini
@ 2015-01-13 22:01                   ` Gerhard Wiesinger
  2015-01-14  0:59                     ` Laine Stump
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-13 22:01 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, Cole Robinson, virt

On 13.01.2015 22:16, Paolo Bonzini wrote:
>
> On 13/01/2015 22:14, Gerhard Wiesinger wrote:
>> I also had a look at the kernel code again:
>> http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
>> 499         do {
>> 500                 seq = read_seqcount_begin(&tk_core.seq);
>> 501
>> 502                 ts->tv_sec = tk->xtime_sec;
>> 503                 nsecs = timekeeping_get_ns(&tk->tkr);
>> 504
>> 505         } while (read_seqcount_retry(&tk_core.seq, seq));
>>
>> So it looks like the seqcount always changes and therefore it loops
>> forever here (as far as I dug down, this is the only loop here).
>>
>> Might be something wrong with the memory barriers in recent qemu-kvm
>> releases?
> No, that's not possible. Unless you pause/resume or migrate the VM, all
> of the handling of kvmclock is entirely in the kernel.

Any other possible explanation of the problem?

Had a look at the diff (I guess the right file at least in qemu tree):
# no critical changes IMHO here
git diff -u v1.6.2..v2.1.2 ./hw/i386/kvm/clock.c

Trying to reproduce with a loop:
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  struct timeval tv;
  int i = 0;
  for (;;)
  {
    gettimeofday(&tv, 0);
    ++i;
    if (i >= 10000000)
    {
      i = 0;
      printf("%i\n", (int)tv.tv_sec);
    }
  }
  return 0;
}

As I was writing "First tests seem to run well, so no quick win ....", I could reproduce it with a stall after 318s :-)
(gdb) bt
#0  0x00007fff6d9fefff in gettimeofday ()
#1  0x00000000004005ad in main (argc=1, argv=0x7fff6d9b28b8) at gettimeofdayloop.c:10

So at least we have a test case that reproduces the problem quickly.

So we are digging into my second finding of a major bug in qemu-kvm :-)

Can someone try, too?

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
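[Editorial sketch] The reproducer above has to be watched by hand. A variant that reports a stall on its own could run the loop in a child process and watch a heartbeat from the parent, so that detection does not depend on the stuck process handling signals (the first message notes that the hung processes could not be killed). This is a sketch under assumptions: `probe_gettimeofday`, `beat_every` and `stall_ms` are hypothetical names, and the parent's `poll()` timeout is assumed to keep working in an affected guest, which the thread does not confirm.

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that calls gettimeofday() `calls` times and writes one
 * heartbeat byte to a pipe every `beat_every` calls.
 * Returns 0 if the child finished, 1 if no heartbeat arrived within
 * `stall_ms` milliseconds, -1 on a setup or child error.
 */
int probe_gettimeofday(unsigned long calls, unsigned long beat_every,
                       int stall_ms)
{
    int fds[2];
    int result = -1;
    int status = 0;
    pid_t pid;

    if (beat_every == 0 || pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                       /* child: hammer gettimeofday() */
        struct timeval tv;
        close(fds[0]);
        for (unsigned long i = 1; i <= calls; i++) {
            gettimeofday(&tv, NULL);
            if (i % beat_every == 0 && write(fds[1], "", 1) != 1)
                _exit(1);
        }
        _exit(0);
    }

    close(fds[1]);                        /* parent: watch the heartbeat */
    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    for (;;) {
        char beats[64];
        ssize_t n;
        int ready = poll(&pfd, 1, stall_ms);

        if (ready == 0) { result = 1; break; }        /* silence: stalled */
        if (ready < 0) {
            if (errno == EINTR)
                continue;
            break;
        }
        n = read(fds[0], beats, sizeof beats);
        if (n == 0) { result = 0; break; }            /* EOF: child exited */
        if (n < 0)
            break;
    }
    close(fds[0]);

    if (result != 0) {
        /* Never block on a child that may be stuck and unkillable. */
        kill(pid, SIGKILL);
        waitpid(pid, &status, WNOHANG);
    } else if (waitpid(pid, &status, 0) != pid ||
               !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        result = -1;
    }
    return result;
}
```

On a healthy system a call such as `probe_gettimeofday(200000UL, 10000UL, 5000)` should return 0 almost immediately. To hunt for the hang, `calls` would have to be large enough to cover the minutes of run time reported above.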
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-01-13 22:01 ` Gerhard Wiesinger @ 2015-01-14 0:59 ` Laine Stump 2015-01-14 9:15 ` Gerhard Wiesinger 0 siblings, 1 reply; 35+ messages in thread From: Laine Stump @ 2015-01-14 0:59 UTC (permalink / raw) To: Gerhard Wiesinger, Paolo Bonzini, qemu-devel, Cole Robinson, virt On 01/13/2015 05:01 PM, Gerhard Wiesinger wrote: > On 13.01.2015 22:16, Paolo Bonzini wrote: >> >> On 13/01/2015 22:14, Gerhard Wiesinger wrote: >>> I also had a look at the kernel code again: >>> http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493 >>> >>> 499 do { >>> 500 seq = read_seqcount_begin(&tk_core.seq); >>> 501 >>> 502 ts->tv_sec = tk->xtime_sec; >>> 503 nsecs = timekeeping_get_ns(&tk->tkr); >>> 504 >>> 505 } while (read_seqcount_retry(&tk_core.seq, seq)); >>> >>> So it looks like that the seqcount always changes and therefore loops >>> forever here (as far as I digged it down this is the only loop here). >>> >>> Might be something wrong with the memory barriers in recent qemu-kvm >>> releases? >> No, that's not possible. Unless you pause/resume or migrate the VM, all >> of the handling of kvmclock is entirely in the kernel. > > Any other possible explaination of the problem? 
> > Had a look at the diff (I guess the right file at least in qemu tree): > # no critical changes IHMO here > git diff -u v1.6.2..v2.1.2 ./hw/i386/kvm/clock.c > > Trying to reproduce with a loop: > #include <sys/time.h> > #include <stdio.h> > > int main(int argc, char* argv[]) > { > struct timeval tv; > int i = 0; > for (;;) > { > gettimeofday(&tv, 0); > ++i; > if (i >= 10000000) > { > i = 0; > printf("%i\n", (int)tv.tv_sec); > } > } > return 0; > } > > As I wrote this: "First tests seem to run well, so no quick win ....", > I could reproduce it with a stall in 318s :-) > (gdb) bt > #0 0x00007fff6d9fefff in gettimeofday () > #1 0x00000000004005ad in main (argc=1, argv=0x7fff6d9b28b8) at > gettimeofdayloop.c:10 > > So we have at least a testcase which is quickly to reproduce. > > So we are digging down my second findings about a major bug in > qemu-kvm :-) > > Can someone try, too? > > Ciao, > Gerhard > > Take a look at the following kernel bug. It specifically deals with a hang in gettimeofday() in a KVM guest: https://bugzilla.redhat.com/show_bug.cgi?id=1178975 There is a link to a patched kernel you can try; it fixed my problems (I was repeatedly getting hangs in python-urlgrabber during yum updates on F21). ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
  2015-01-14  0:59                     ` Laine Stump
@ 2015-01-14  9:15                       ` Gerhard Wiesinger
  2015-02-15  8:18                         ` Gerhard Wiesinger
  0 siblings, 1 reply; 35+ messages in thread
From: Gerhard Wiesinger @ 2015-01-14 9:15 UTC (permalink / raw)
  To: Laine Stump, Paolo Bonzini, qemu-devel, Cole Robinson, virt

On 14.01.2015 01:59, Laine Stump wrote:
>
> Take a look at the following kernel bug. It specifically deals with a
> hang in gettimeofday() in a KVM guest:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>
> There is a link to a patched kernel you can try; it fixed my problems (I
> was repeatedly getting hangs in python-urlgrabber during yum updates on
> F21).
>

Looks to be fixed; I commented in:
https://bugzilla.redhat.com/show_bug.cgi?id=1178975
Installed the kernels from:
http://koji.fedoraproject.org/koji/taskinfo?taskID=8575247

Time to release a new official kernel :-)

Thanks for the comments.

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-01-14 9:15 ` Gerhard Wiesinger @ 2015-02-15 8:18 ` Gerhard Wiesinger 2015-02-16 14:18 ` Paolo Bonzini 0 siblings, 1 reply; 35+ messages in thread From: Gerhard Wiesinger @ 2015-02-15 8:18 UTC (permalink / raw) To: Laine Stump, Paolo Bonzini, qemu-devel, Cole Robinson, virt On 14.01.2015 10:15, Gerhard Wiesinger wrote: > On 14.01.2015 01:59, Laine Stump wrote: >> >> Take a look at the following kernel bug. It specifically deals with a >> hang in gettimeofday() in a KVM guest: >> >> https://bugzilla.redhat.com/show_bug.cgi?id=1178975 >> >> There is a link to a patched kernel you can try; it fixed my problems (I >> was repeatedly getting hangs in python-urlgrabber during yum updates on >> F21). >> > > Looks to be fixed, commented in: > https://bugzilla.redhat.com/show_bug.cgi?id=1178975 > Installed kernels. > http://koji.fedoraproject.org/koji/taskinfo?taskID=8575247 > There seems to be another problem. My PostgreSQL server VM still crashes (2 times). 1.) First time crash: no reaction at all in KVM console, no network access 2.) 
Second time "crash": - KVM console 2 characters could be entered, then console was dead forever - ping works - ssh not - host: 100% cpu on 2 cores strace -y -p on kvm host (loops through this): read(7<anon_inode:[eventfd]>, 0x7fffb862bf20, 16) = -1 EAGAIN (Resource temporarily unavailable) ioctl(11<anon_inode:kvm-vm>, KVM_GET_DIRTY_LOG, 0x7fffb862bd30) = 0 ioctl(11<anon_inode:kvm-vm>, KVM_GET_DIRTY_LOG, 0x7fffb862bd30) = 0 write(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8 ppoll([{fd=98<socket:[4660317]>, events=POLLIN|POLLERR|POLLHUP}, {fd=89<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=88<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=87<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=86<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=85<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=84<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=83<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=82<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=81<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=80<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=79<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=78<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=77<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=76<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=75<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=74<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=73<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=72<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=71<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=70<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=69<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=68<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=67<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, 
{fd=66<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=65<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=64<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=63<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=62<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=61<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=60<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, {fd=59<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, ...], 71, {0, 124112228}, NULL, 8) = 1 ([...], left {0, 124090614})
write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
read(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 512) = 8
write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8

Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP
qemu-kvm-2.2.0-5.fc21.x86_64

Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
https://bugzilla.redhat.com/show_bug.cgi?id=1178975
is fixed (it didn't occur in 30min with the test program posted at
https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28, which previously reproduced it within 2min; still running)

So I guess there is another problem in the kernel with volatile and gcc optimizations (or maybe in qemu-KVM)

Any ideas? Any other bug reports like this?

Thank you.

Ciao,
Gerhard

--
http://www.wiesinger.com/

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-02-15 8:18 ` Gerhard Wiesinger @ 2015-02-16 14:18 ` Paolo Bonzini 2015-02-16 15:09 ` Gerhard Wiesinger 0 siblings, 1 reply; 35+ messages in thread From: Paolo Bonzini @ 2015-02-16 14:18 UTC (permalink / raw) To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt On 15/02/2015 09:18, Gerhard Wiesinger wrote: > > ppoll([{fd=98<socket:[4660317]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=89<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=88<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=87<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=86<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=85<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=84<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=83<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=82<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=81<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=80<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=79<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=78<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=77<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=76<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=75<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=74<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=73<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=72<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=71<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=70<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=69<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=68<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=67<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=66<anon_inode:[eventfd]>, 
events=POLLIN|POLLERR|POLLHUP}, > {fd=65<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=64<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=63<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=62<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=61<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=60<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, > {fd=59<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, ...], 71, > {0, 124112228}, NULL, 8) = 1 ([...], left {0, 124090614}) > write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8 > read(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 512) = 8 > write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8 Can you grab some random backtraces ("thread apply all bt full") with gdb? What is the libvirt XML or qemu command line? Paolo > Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP > qemu-kvm-2.2.0-5.fc21.x86_64 > > Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM > https://bugzilla.redhat.com/show_bug.cgi?id=1178975 > is fixed (didn't occour with the test program posted at > https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min, > happened before reproduceable in 2min, still running) > > So I guess there is another problem in the kernel with volatile and gcc > optimizations (or maybe in qemu-KVM) No, this doesn't look like volatile. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-02-16 14:18 ` Paolo Bonzini @ 2015-02-16 15:09 ` Gerhard Wiesinger 2015-02-16 15:29 ` Paolo Bonzini 0 siblings, 1 reply; 35+ messages in thread From: Gerhard Wiesinger @ 2015-02-16 15:09 UTC (permalink / raw) To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt On 16.02.2015 15:18, Paolo Bonzini wrote: > > On 15/02/2015 09:18, Gerhard Wiesinger wrote: >> >> Can you grab some random backtraces ("thread apply all bt full") with gdb? Very low load on the machine, so I guess most will be sleeping and hard to catch non sleeping ones. See below For the records: gdb -p 14139 -ex 'thread apply all bt full' -batch >> >> What is the libvirt XML or qemu command line? See below. >> Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP >> qemu-kvm-2.2.0-5.fc21.x86_64 >> >> Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM >> https://bugzilla.redhat.com/show_bug.cgi?id=1178975 >> is fixed (didn't occour with the test program posted at >> https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min, >> happened before reproduceable in 2min, still running) >> >> So I guess there is another problem in the kernel with volatile and gcc >> optimizations (or maybe in qemu-KVM) > No, this doesn't look like volatile. But why 100% on 2 cores? There were also recent bugfixes with optimizations in gcc-4.9.2-6.fc21 maybe we hit one of these. 
- update from the 4.9 branch - PRs c++/54442, c++/64514, c++/64521, c++/64901, c/57653, c/61553, c/64766, c/64778, c/64824, c/64868, debug/64511, debug/64663, fortran/56867, fortran/57023, fortran/60922, fortran/62044, fortran/63733, fortran/64230, fortran/64528, fortran/64771, ipa/63970, ipa/64068, ipa/64559, libstdc++/64476, libstdc++/64584, libstdc++/64585, libstdc++/64646, libstdc++/64649, libstdc++/64680, middle-end/63704, middle-end/64391, middle-end/64421, middle-end/64734, rtl-optimization/61058, rtl-optimization/63637, rtl-optimization/64286, rtl-optimization/64557, target/61413, target/63424, target/64358, target/64479, target/64505, target/64513, target/64580, target/64795, target/64882, target/64938, target/64979, testsuite/64712, tree-optimization/64563 Ciao, Gerhard /usr/bin/qemu-system-x86_64 -machine accel=kvm -name myvm -S -machine pc-0.15,accel=kvm,usb=off -m 384 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid b40e77d3-cd86-4d59-9ee4-5756ec88bf99 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/myvm.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device lsi,id=scsi0,bus=pci.0,addr=0x6 -device ahci,id=ahci0,bus=pci.0,addr=0x9 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/var/lib/libvirt/images/myvm.img,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:aa:bb:cc,bus=pci.0,addr=0x8 -netdev tap,fd=35,id=hostnet1,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:aa:bb:cd,bus=pci.0,addr=0x7 -chardev 
pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:9 -k de -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on <domain type='kvm' id='12'> <name>myvm</name> <uuid>b40e77d3-cd86-4d59-9ee4-5756ec88bf99</uuid> <description>Fedora 21</description> <memory unit='KiB'>393216</memory> <currentMemory unit='KiB'>393216</currentMemory> <vcpu placement='static'>2</vcpu> <resource> <partition>/machine</partition> </resource> <os> <type arch='x86_64' machine='pc-0.15'>hvm</type> <bootmenu enable='yes'/> </os> <features> <acpi/> <apic/> <pae/> </features> <clock offset='utc'/> <on_poweroff>destroy</on_poweroff> <on_reboot>restart</on_reboot> <on_crash>restart</on_crash> <devices> <emulator>/usr/bin/qemu-kvm</emulator> <disk type='block' device='cdrom'> <driver name='qemu' type='raw'/> <backingStore/> <target dev='hdc' bus='ide'/> <readonly/> <alias name='ide0-1-0'/> <address type='drive' controller='0' bus='1' target='0' unit='0'/> </disk> <disk type='file' device='disk'> <driver name='qemu' type='qcow2'/> <source file='/var/lib/libvirt/images/myvm.img'/> <backingStore/> <target dev='vda' bus='virtio'/> <boot order='1'/> <alias name='virtio-disk0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </disk> <controller type='ide' index='0'> <alias name='ide0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/> </controller> <controller type='usb' index='0'> <alias name='usb0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> </controller> <controller type='virtio-serial' index='0'> <alias name='virtio-serial0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </controller> <controller type='scsi' index='0'> <alias name='scsi0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </controller> <controller 
type='sata' index='0'> <alias name='sata0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/> </controller> <controller type='pci' index='0' model='pci-root'> <alias name='pci.0'/> </controller> <interface type='network'> <mac address='52:54:00:aa:bb:cc'/> <source network='DMZ' bridge='virbr2'/> <target dev='vnet16'/> <model type='virtio'/> <alias name='net0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/> </interface> <interface type='bridge'> <mac address='52:54:00:aa:bb:cd'/> <source bridge='br0'/> <target dev='vnet17'/> <model type='virtio'/> <alias name='net1'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </interface> <serial type='pty'> <source path='/dev/pts/9'/> <target port='0'/> <alias name='serial0'/> </serial> <console type='pty' tty='/dev/pts/9'> <source path='/dev/pts/9'/> <target type='serial' port='0'/> <alias name='serial0'/> </console> <input type='tablet' bus='usb'> <alias name='input0'/> </input> <input type='mouse' bus='ps2'/> <input type='keyboard' bus='ps2'/> <graphics type='vnc' port='5909' autoport='yes' listen='127.0.0.1' keymap='de'> <listen type='address' address='127.0.0.1'/> </graphics> <video> <model type='vga' vram='16384' heads='1'/> <alias name='video0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </video> <memballoon model='virtio'> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </memballoon> </devices> </domain> Thread 11 (Thread 0x7ffe8bdfe700 (LWP 14147)): #0 0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6 #1 0x00007ffea8c37c35 in kvm_vcpu_ioctl () #2 0x00007ffea8c37cec in kvm_cpu_exec () #3 0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn () #4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 10 (Thread 0x7ffe8b5fd700 (LWP 14148)): #0 0x00007ffe9dca8977 in ioctl () at 
/lib64/libc.so.6 #1 0x00007ffea8c37c35 in kvm_vcpu_ioctl () #2 0x00007ffea8c37cec in kvm_cpu_exec () #3 0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn () #4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 9 (Thread 0x7ffe893ff700 (LWP 14158)): #0 0x00007ffea771e590 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0 #1 0x00007ffea8eccd79 in qemu_cond_wait () #2 0x00007ffea8e50dd3 in vnc_worker_thread_loop () #3 0x00007ffea8e511b8 in vnc_worker_thread () #4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 8 (Thread 0x7ffe4effd700 (LWP 18334)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 7 (Thread 0x7ffe8231f700 (LWP 18335)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 6 (Thread 0x7ffe908c7700 (LWP 22620)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 5 (Thread 0x7ffe2d7fa700 (LWP 12264)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 4 (Thread 0x7ffe83622700 (LWP 
17892)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 3 (Thread 0x7ffe4f7fe700 (LWP 17893)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 2 (Thread 0x7ffe82b20700 (LWP 17894)): ---Type <return> to continue, or q <return> to quit--- #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 1 (Thread 0x7ffea8aa5a80 (LWP 14139)): #0 0x00007ffe9dca72c1 in ppoll () at /lib64/libc.so.6 #1 0x00007ffea8e6289c in qemu_poll_ns () #2 0x00007ffea8e62034 in main_loop_wait () #3 0x00007ffea8bfdcdd in main () [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". 
0x00007ffe9dca72c1 in ppoll () from /lib64/libc.so.6 Thread 8 (Thread 0x7ffe8bdfe700 (LWP 14147)): #0 0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6 #1 0x00007ffea8c37c35 in kvm_vcpu_ioctl () #2 0x00007ffea8c37cec in kvm_cpu_exec () #3 0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn () #4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 7 (Thread 0x7ffe8b5fd700 (LWP 14148)): #0 0x00007ffe9dca8977 in ioctl () at /lib64/libc.so.6 #1 0x00007ffea8c37c35 in kvm_vcpu_ioctl () #2 0x00007ffea8c37cec in kvm_cpu_exec () #3 0x00007ffea8c25b02 in qemu_kvm_cpu_thread_fn () #4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 6 (Thread 0x7ffe893ff700 (LWP 14158)): #0 0x00007ffea771e590 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0 #1 0x00007ffea8eccd79 in qemu_cond_wait () #2 0x00007ffea8e50dd3 in vnc_worker_thread_loop () #3 0x00007ffea8e511b8 in vnc_worker_thread () #4 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #5 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 5 (Thread 0x7ffdeeffd700 (LWP 31356)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 4 (Thread 0x7ffdef7fe700 (LWP 32680)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 3 (Thread 0x7ffe0ffff700 (LWP 32681)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in 
worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 2 (Thread 0x7ffe8131d700 (LWP 32682)): #0 0x00007ffea77207f0 in sem_timedwait () at /lib64/libpthread.so.0 #1 0x00007ffea8eccec7 in qemu_sem_timedwait () #2 0x00007ffea8e5523c in worker_thread () #3 0x00007ffea771952a in start_thread () at /lib64/libpthread.so.0 #4 0x00007ffe9dcb279d in clone () at /lib64/libc.so.6 Thread 1 (Thread 0x7ffea8aa5a80 (LWP 14139)): #0 0x00007ffe9dca72c1 in ppoll () at /lib64/libc.so.6 #1 0x00007ffea8e6289c in qemu_poll_ns () #2 0x00007ffea8e62034 in main_loop_wait () #3 0x00007ffea8bfdcdd in main () ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-02-16 15:09 ` Gerhard Wiesinger @ 2015-02-16 15:29 ` Paolo Bonzini 2015-03-01 10:36 ` Gerhard Wiesinger 0 siblings, 1 reply; 35+ messages in thread From: Paolo Bonzini @ 2015-02-16 15:29 UTC (permalink / raw) To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt

On 16/02/2015 16:09, Gerhard Wiesinger wrote:
> On 16.02.2015 15:18, Paolo Bonzini wrote:
>>
>> On 15/02/2015 09:18, Gerhard Wiesinger wrote:
>>>
>>> Can you grab some random backtraces ("thread apply all bt full") with
>>> gdb?
>
> Very low load on the machine, so I guess most will be sleeping and hard
> to catch non sleeping ones. See below

This likely means that the threads at 100% CPU are not the I/O (event loop)
threads, but the VCPU threads.

>>> Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP
>>> qemu-kvm-2.2.0-5.fc21.x86_64
>>>
>>> Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975
>>> is fixed (didn't occur with the test program posted at
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 in 30min,
>>> happened before reproducibly in 2min, still running)
>>>
>>> So I guess there is another problem in the kernel with volatile and gcc
>>> optimizations (or maybe in qemu-KVM)
>> No, this doesn't look like volatile.
>
> But why 100% on 2 cores?

Because you have 2 VCPUs, I guess.

Paolo

^ permalink raw reply [flat|nested] 35+ messages in thread
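The guess that the busy threads are the VCPU threads can be checked on the host by comparing per-thread CPU time over a short interval. What follows is only a minimal Python sketch under stated assumptions, not something posted in the thread: it assumes a Linux host with /proc/<pid>/task/<tid>/stat available, and the helper names parse_stat, snapshot, busiest and report are made up for illustration. As far as I know, time a VCPU thread spends running guest code is accounted as user time there, so a spinning guest should show up as a large delta on the threads whose backtrace ends in kvm_cpu_exec().

```python
import os
import time


def parse_stat(line):
    """Return (tid, comm, utime, stime) from one /proc/<pid>/task/<tid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(head), comm, int(fields[11]), int(fields[12])


def snapshot(pid):
    """Map tid -> (comm, utime + stime in clock ticks) for every thread of pid."""
    ticks = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/stat") as f:
                t, comm, utime, stime = parse_stat(f.read())
        except FileNotFoundError:
            continue  # thread exited between listdir() and open()
        ticks[t] = (comm, utime + stime)
    return ticks


def busiest(before, after):
    """Return (delta_ticks, tid, comm) tuples, busiest thread first."""
    deltas = [
        (total - before[tid][1], tid, comm)
        for tid, (comm, total) in after.items()
        if tid in before
    ]
    return sorted(deltas, reverse=True)


def report(pid, interval=5.0):
    """Sample the process twice and print CPU ticks consumed per thread."""
    first = snapshot(pid)
    time.sleep(interval)
    for delta, tid, comm in busiest(first, snapshot(pid)):
        print(f"{tid:>7} {comm:<20} {delta} ticks")
```

Calling report() with the PID of the qemu process (the one passed to gdb -p) would list the thread IDs to compare against the LWP numbers in the gdb output.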
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-02-16 15:29 ` Paolo Bonzini @ 2015-03-01 10:36 ` Gerhard Wiesinger 2015-03-02 9:26 ` Paolo Bonzini 0 siblings, 1 reply; 35+ messages in thread From: Gerhard Wiesinger @ 2015-03-01 10:36 UTC (permalink / raw) To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 16.02.2015 16:29, Paolo Bonzini wrote:
>
> On 16/02/2015 16:09, Gerhard Wiesinger wrote:
>> On 16.02.2015 15:18, Paolo Bonzini wrote:
>>> On 15/02/2015 09:18, Gerhard Wiesinger wrote:
>>>> Can you grab some random backtraces ("thread apply all bt full") with
>>>> gdb?
>> Very low load on the machine, so I guess most will be sleeping and hard
>> to catch non sleeping ones. See below
> This likely means that the 100% threads are not the I/O (event loop)
> threads, but the VCPU threads.

OK, it happened again, details below. Any further ideas from the stack traces?

So far it happened only on the PostgreSQL database VM. Kernel is alive
(ping works well). ssh is not working.
console window: after entering one character at login prompt, then crashed:
[1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or sacrifice child
[1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB, anon-rss: 162712kB, file-rss: 220kB

VM uptime is ~1 day, 2 cores at 100% CPU.
VM is very stripped down, nevertheless it should have enough memory (and
also swap). Looks like it crashed at the nightly backup (pg_dumpall),
command is: ssh -x ${REMOTE_USER}@${REMOTE_HOST} "pg_dumpall" | bzip2 -9 > ${DEST_SQL}

free
              total        used        free      shared  buff/cache   available
Mem:         372264       28536      214780       20884      128948      303360
Swap:       1081340           0     1081340

Ciao,
Gerhard

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007feee62412c1 in ppoll () from /lib64/libc.so.6

Thread 4 (Thread 0x7feed3fff700 (LWP 4636)):
#0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
#1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
#2 0x00007feef11d1cec in kvm_cpu_exec ()
#3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5 0x00007feee624c79d in clone () at /lib64/libc.so.6

Thread 3 (Thread 0x7feed37fe700 (LWP 4637)):
#0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
#1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
#2 0x00007feef11d1cec in kvm_cpu_exec ()
#3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5 0x00007feee624c79d in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7feed1bff700 (LWP 4653)):
#0 0x00007feeefcb8590 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1 0x00007feef1466d79 in qemu_cond_wait ()
#2 0x00007feef13eadd3 in vnc_worker_thread_loop ()
#3 0x00007feef13eb1b8 in vnc_worker_thread ()
#4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5 0x00007feee624c79d in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7feef103fa80 (LWP 4563)):
#0 0x00007feee62412c1 in ppoll () at /lib64/libc.so.6
#1 0x00007feef13fc89c in qemu_poll_ns ()
#2 0x00007feef13fc034 in main_loop_wait ()
#3 0x00007feef1197cdd in main ()

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-03-01 10:36 ` Gerhard Wiesinger @ 2015-03-02 9:26 ` Paolo Bonzini 2015-03-02 15:52 ` Gerhard Wiesinger 0 siblings, 1 reply; 35+ messages in thread From: Paolo Bonzini @ 2015-03-02 9:26 UTC (permalink / raw) To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt

On 01/03/2015 11:36, Gerhard Wiesinger wrote:
> So far it happened only on the PostgreSQL database VM. Kernel is alive
> (ping works well). ssh is not working.
> console window: after entering one character at login prompt, then crashed:
> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
> sacrifice child
> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
> anon-rss: 162712kB, file-rss: 220kB

Can you get a vmcore or at least sysrq-t output?

Paolo

> VM uptime is ~1 day, 2 cores at 100% CPU.
> VM is very stripped down, nevertheless it should have enough memory (and
> also swap). Looks like it crashed at the nightly backup (pg_dumpall),
> command is: ssh -x ${REMOTE_USER}@${REMOTE_HOST} "pg_dumpall" | bzip2 -9 > ${DEST_SQL}
> free
>               total        used        free      shared  buff/cache   available
> Mem:         372264       28536      214780       20884      128948      303360
> Swap:       1081340           0     1081340
>
> Ciao,
> Gerhard
>
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 0x00007feee62412c1 in ppoll () from /lib64/libc.so.6
>
> Thread 4 (Thread 0x7feed3fff700 (LWP 4636)):
> #0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
> #1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
> #2 0x00007feef11d1cec in kvm_cpu_exec ()
> #3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
> #4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
> #5 0x00007feee624c79d in clone () at /lib64/libc.so.6
>
> Thread 3 (Thread 0x7feed37fe700 (LWP 4637)):
> #0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
> #1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
> #2 0x00007feef11d1cec in kvm_cpu_exec ()
> #3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
> #4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
> #5 0x00007feee624c79d in clone () at /lib64/libc.so.6
>
> Thread 2 (Thread 0x7feed1bff700 (LWP 4653)):
> #0 0x00007feeefcb8590 in pthread_cond_wait@@GLIBC_2.3.2 () at
> /lib64/libpthread.so.0
> #1 0x00007feef1466d79 in qemu_cond_wait ()
> #2 0x00007feef13eadd3 in vnc_worker_thread_loop ()
> #3 0x00007feef13eb1b8 in vnc_worker_thread ()
> #4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
> #5 0x00007feee624c79d in clone () at /lib64/libc.so.6
>
> Thread 1 (Thread 0x7feef103fa80 (LWP 4563)):
> #0 0x00007feee62412c1 in ppoll () at /lib64/libc.so.6
> #1 0x00007feef13fc89c in qemu_poll_ns ()
> #2 0x00007feef13fc034 in main_loop_wait ()
> #3 0x00007feef1197cdd in main ()
>
>

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-03-02 9:26 ` Paolo Bonzini @ 2015-03-02 15:52 ` Gerhard Wiesinger 2015-03-02 17:15 ` Gerhard Wiesinger 0 siblings, 1 reply; 35+ messages in thread From: Gerhard Wiesinger @ 2015-03-02 15:52 UTC (permalink / raw) To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 02.03.2015 10:26, Paolo Bonzini wrote:
>
> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>> So far it happened only on the PostgreSQL database VM. Kernel is alive
>> (ping works well). ssh is not working.
>> console window: after entering one character at login prompt, then crashed:
>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
>> sacrifice child
>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>> anon-rss: 162712kB, file-rss: 220kB
> Can you get a vmcore or at least sysrq-t output?

Yes, next time it happens I can analyze it.

I think there are 2 problems:
1.) OOM (Out of Memory) problem with the low memory settings and kernel settings (see below)
2.) Instability problem which might have a dependency on 1.)

What I've done so far (thanks to Andrey Korolyov for ideas and help):
a.) Updated machine type from pc-0.15 to pc-i440fx-2.2
virsh dumpxml database | grep "<type"
<type arch='x86_64' machine='pc-0.15'>hvm</type>

virsh edit database
virsh dumpxml database | grep "<type"
<type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>

SMBIOS is updated therefore from 2.4 to 2.8:
dmesg|grep -i SMBIOS
[ 0.000000] SMBIOS 2.8 present.
b.) Switched to tsc clock, kernel parameters: clocksource=tsc nohz=off highres=off
c.) Changed overcommit to 1
echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
d.) Tried 1 VCPU instead of 2
e.) Installed 512MB vRAM instead of 384MB
f.) Prepared for sysrq and vmcore
echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
sysctl -w kernel.sysrq=1
virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
virsh dump domain-name /tmp/dumpfile
g.) Further ideas, not yet done: disable memory ballooning by blacklisting the balloon driver or removing it from the virsh xml config

Summary:
1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM problem, no crash
2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, no crash

So the OOM problem seems to be solved (at least it didn't happen so far) by installing 512MB RAM and setting vm.overcommit_memory = 1 (I guess just setting overcommit would be fine, too). Instability didn't occur so far. If I can't reproduce it, I'll revert the settings.

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
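When several combinations of clocksource and overcommit settings are being tried, it can help to record what the guest is actually running with at the time of a hang. The following is a minimal Python sketch, assuming it runs inside a Linux guest; the function names read_settings and label are hypothetical and the root parameter exists only so the code can be exercised against a scratch directory. The three paths are the standard procfs/sysfs locations for the active clocksource, vm.overcommit_memory and kernel.sysrq. I believe the KVM paravirtual clock is reported there as kvm-clock rather than kvm_clock.

```python
from pathlib import Path

# Standard Linux locations, written relative to the filesystem root.
CLOCKSOURCE = "sys/devices/system/clocksource/clocksource0/current_clocksource"
OVERCOMMIT = "proc/sys/vm/overcommit_memory"
SYSRQ = "proc/sys/kernel/sysrq"


def read_settings(root="/"):
    """Return the active clocksource, overcommit mode and sysrq mask under root."""
    base = Path(root)
    return {
        "clocksource": (base / CLOCKSOURCE).read_text().strip(),
        "overcommit_memory": int((base / OVERCOMMIT).read_text()),
        "sysrq": int((base / SYSRQ).read_text()),
    }


def label(settings):
    """Format the settings in the style of the summary lines in the message."""
    return (
        "{clocksource}, vm.overcommit_memory = {overcommit_memory}, "
        "kernel.sysrq = {sysrq}"
    ).format(**settings)
```

Printing label(read_settings()) before each nightly backup would make it unambiguous which configuration a given crash belongs to.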
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-03-02 15:52 ` Gerhard Wiesinger @ 2015-03-02 17:15 ` Gerhard Wiesinger 2015-03-03 9:12 ` Gerhard Wiesinger 0 siblings, 1 reply; 35+ messages in thread From: Gerhard Wiesinger @ 2015-03-02 17:15 UTC (permalink / raw) To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 02.03.2015 16:52, Gerhard Wiesinger wrote:
> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>
>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>> So far it happened only on the PostgreSQL database VM. Kernel is alive
>>> (ping works well). ssh is not working.
>>> console window: after entering one character at login prompt, then
>>> crashed:
>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
>>> sacrifice child
>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>> anon-rss: 162712kB, file-rss: 220kB
>> Can you get a vmcore or at least sysrq-t output?
>
> Yes, next time it happens I can analyze it.
>
> I think there are 2 problems:
> 1.) OOM (Out of Memory) problem with the low memory settings and
> kernel settings (see below)
> 2.) Instability problem which might have a dependency on 1.)
>
> What I've done so far (thanks to Andrey Korolyov for ideas and help):
> a.) Updated machine type from pc-0.15 to pc-i440fx-2.2
> virsh dumpxml database | grep "<type"
> <type arch='x86_64' machine='pc-0.15'>hvm</type>
>
> virsh edit database
> virsh dumpxml database | grep "<type"
> <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>
> SMBIOS is updated therefore from 2.4 to 2.8:
> dmesg|grep -i SMBIOS
> [ 0.000000] SMBIOS 2.8 present.
> b.) Switched to tsc clock, kernel parameters: clocksource=tsc nohz=off
> highres=off
> c.) Changed overcommit to 1
> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
> d.) Tried 1 VCPU instead of 2
> e.) Installed 512MB vRAM instead of 384MB
> f.) Prepared for sysrq and vmcore
> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
> sysctl -w kernel.sysrq=1
> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
> virsh dump domain-name /tmp/dumpfile
> g.) Further ideas, not yet done: disable memory ballooning by
> blacklisting the balloon driver or removing it from the virsh xml config
>
> Summary:
> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM problem,
> no crash
> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem,
> no crash

3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem, no crash

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-03-02 17:15 ` Gerhard Wiesinger @ 2015-03-03 9:12 ` Gerhard Wiesinger 2015-03-03 10:12 ` Paolo Bonzini ` (2 more replies) 0 siblings, 3 replies; 35+ messages in thread From: Gerhard Wiesinger @ 2015-03-03 9:12 UTC (permalink / raw) To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 02.03.2015 18:15, Gerhard Wiesinger wrote:
> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>
>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>> So far it happened only on the PostgreSQL database VM. Kernel is alive
>>>> (ping works well). ssh is not working.
>>>> console window: after entering one character at login prompt, then
>>>> crashed:
>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
>>>> sacrifice child
>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>> anon-rss: 162712kB, file-rss: 220kB
>>> Can you get a vmcore or at least sysrq-t output?
>>
>> Yes, next time it happens I can analyze it.
>>
>> I think there are 2 problems:
>> 1.) OOM (Out of Memory) problem with the low memory settings and
>> kernel settings (see below)
>> 2.) Instability problem which might have a dependency on 1.)
>>
>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>> a.) Updated machine type from pc-0.15 to pc-i440fx-2.2
>> virsh dumpxml database | grep "<type"
>> <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>
>> virsh edit database
>> virsh dumpxml database | grep "<type"
>> <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>
>> SMBIOS is updated therefore from 2.4 to 2.8:
>> dmesg|grep -i SMBIOS
>> [ 0.000000] SMBIOS 2.8 present.
>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc
>> nohz=off highres=off
>> c.) Changed overcommit to 1
>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>> d.) Tried 1 VCPU instead of 2
>> e.) Installed 512MB vRAM instead of 384MB
>> f.) Prepared for sysrq and vmcore
>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>> sysctl -w kernel.sysrq=1
>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>> virsh dump domain-name /tmp/dumpfile
>> g.) Further ideas, not yet done: disable memory ballooning by
>> blacklisting the balloon driver or removing it from the virsh xml config
>>
>> Summary:
>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM
>> problem, no crash
>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>> problem, no crash
>
> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM problem,
> no crash

3b.) Still happened again at the nightly backup with the same configuration as in 3.): 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no reaction, BUT CRASHED again

SYSRQ: no reaction of the VM
virsh send-key vm KEY_LEFTALT KEY_SYSRQ KEY_T

virsh dump vm file.core
error: Failed to core dump domain vm to file.core
error: internal error: unable to execute QEMU command 'migrate': State blocked by non-migratable device '0000:00:09.0/ich9_ahci'

Removed the SATA controller, dump should work in the future.

Any further ideas?

Ciao,
Gerhard

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-03-03 9:12 ` Gerhard Wiesinger @ 2015-03-03 10:12 ` Paolo Bonzini 2015-03-03 11:51 ` Gonglei 2015-03-03 12:28 ` Gerhard Wiesinger 2 siblings, 0 replies; 35+ messages in thread From: Paolo Bonzini @ 2015-03-03 10:12 UTC (permalink / raw) To: Gerhard Wiesinger, Laine Stump, qemu-devel, Cole Robinson, virt

On 03/03/2015 10:12, Gerhard Wiesinger wrote:
>
> virsh dump vm file.core
> error: Failed to core dump domain vm to file.core
> error: internal error: unable to execute QEMU command 'migrate': State
> blocked by non-migratable device '0000:00:09.0/ich9_ahci'
>
> Removed the SATA controller, dump should work in the future.

Thanks.

Paolo

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever 2015-03-03 9:12 ` Gerhard Wiesinger 2015-03-03 10:12 ` Paolo Bonzini @ 2015-03-03 11:51 ` Gonglei 2015-03-03 12:28 ` Gerhard Wiesinger 2 siblings, 0 replies; 35+ messages in thread From: Gonglei @ 2015-03-03 11:51 UTC (permalink / raw) To: Gerhard Wiesinger, Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 2015/3/3 17:12, Gerhard Wiesinger wrote:

> 3b.) Still happened again at the nightly backup with the same configuration as in 3.): 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no reaction, BUT CRASHED again
>
> SYSRQ: no reaction of the VM
> virsh send-key vm KEY_LEFTALT KEY_SYSRQ KEY_T
>
> virsh dump vm file.core
> error: Failed to core dump domain vm to file.core
> error: internal error: unable to execute QEMU command 'migrate': State blocked by non-migratable device '0000:00:09.0/ich9_ahci'
>

You can use the '--memory-only' argument to dump only the VM's memory, e.g.:
virsh dump vm file.core --memory-only

> Removed the SATA controller, dump should work in the future.
>
> Any further ideas?

Regards,
-Gonglei

^ permalink raw reply [flat|nested] 35+ messages in thread
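For anyone scripting the dump so that it can be taken quickly while the guest is still hung, the suggestion above could be wrapped roughly as follows. This is only a sketch: the helper names dump_command and dump_guest are invented for illustration, and "vm" / "file.core" are simply the placeholders used in the message. The only virsh syntax relied on is "virsh dump <domain> <file> --memory-only" as quoted in the thread. My understanding, which is an assumption and not stated in the thread, is that a memory-only dump does not go through the 'migrate' command and is therefore not blocked by non-migratable devices such as ich9_ahci.

```python
import subprocess


def dump_command(domain, core_path, memory_only=True):
    """Build the virsh argument list for dumping a guest to core_path."""
    cmd = ["virsh", "dump", domain, core_path]
    if memory_only:
        # Flag as suggested in the thread: dump guest memory only.
        cmd.append("--memory-only")
    return cmd


def dump_guest(domain, core_path, memory_only=True):
    """Run virsh and raise CalledProcessError if the dump fails."""
    subprocess.run(dump_command(domain, core_path, memory_only), check=True)
```

Passing the arguments as a list avoids shell quoting problems if the domain name or path contains spaces.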
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
From: Gerhard Wiesinger @ 2015-03-03 12:28 UTC
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 03.03.2015 10:12, Gerhard Wiesinger wrote:
> On 02.03.2015 18:15, Gerhard Wiesinger wrote:
>> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>>
>>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>>> So far it happened only on the PostgreSQL database VM. Kernel is alive
>>>>> (ping works well). ssh is not working.
>>>>> console window: after entering one character at login prompt, then
>>>>> crashed:
>>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score
>>>>> 112 or sacrifice child
>>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
>>>>> anon-rss: 162712kB, file-rss: 220kB
>>>> Can you get a vmcore or at least sysrq-t output?
>>>
>>> Yes, next time it happens I can analyze it.
>>>
>>> I think there are 2 problems:
>>> 1.) OOM (Out of Memory) problem with the low memory settings and
>>> kernel settings (see below)
>>> 2.) Instability problem which might depend on 1.)
>>>
>>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>>> a.) Updated machine type from pc-0.15 to pc-i440fx-2.2
>>> virsh dumpxml database | grep "<type"
>>> <type arch='x86_64' machine='pc-0.15'>hvm</type>
>>>
>>> virsh edit database
>>> virsh dumpxml database | grep "<type"
>>> <type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
>>>
>>> SMBIOS is therefore updated from 2.4 to 2.8:
>>> dmesg|grep -i SMBIOS
>>> [ 0.000000] SMBIOS 2.8 present.
>>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc
>>> nohz=off highres=off
>>> c.) Changed overcommit to 1
>>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>>> d.) Tried 1 VCPU instead of 2
>>> e.) Installed 512MB vRAM instead of 384MB
>>> f.) Prepared for sysrq and vmcore
>>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>>> sysctl -w kernel.sysrq=1
>>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>>> virsh dump domain-name /tmp/dumpfile
>>> g.) Further ideas, not yet done: disable memory ballooning by
>>> blacklisting the balloon driver or removing it from the virsh xml config
>>>
>>> Summary:
>>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM
>>> problem, no crash
>>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>>> problem, no crash
>>
>> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>> problem, no crash
>
> 3b.) Still happened again at the nightly backup with same
> configuration as in 3.) configuration 384MB, kvm_clock, 2VCPU,
> vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no
> reaction, BUT CRASHED again
>

3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1,
pc-i440fx-2.2: OOM problem, no crash

postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Free swap = 905924kB
Total swap = 1081340kB
Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB, file-rss:0kB

An OOM should not occur here, according to:
https://www.kernel.org/doc/gorman/html/understand/understand016.html
    Is there enough swap space left (nr_swap_pages > 0)? If yes, not OOM

Why does an OOM condition occur? Looks like a bug in the kernel?
Any ideas?

Ciao,
Gerhard
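The swap question raised in this message can be answered from the figures the kernel prints with the OOM report. The following is only a minimal sketch, assuming the report text (from dmesg or the console) is piped in on stdin; the function name swap_free_pct is made up for this illustration and is not part of any tool mentioned in the thread.

```shell
# Hypothetical helper: print the percentage of swap that was still free
# according to the "Free swap" / "Total swap" lines of an OOM report.
swap_free_pct() {
    awk '
        # Strip everything but digits from the matched fragment.
        function num(s) { gsub(/[^0-9]/, "", s); return s + 0 }
        match($0, /Free swap *= *[0-9]+/)  { free  = num(substr($0, RSTART, RLENGTH)) }
        match($0, /Total swap *= *[0-9]+/) { total = num(substr($0, RSTART, RLENGTH)) }
        END {
            if (total > 0) {
                printf "%d\n", int(free * 100 / total)
            } else {
                print "no swap figures found" > "/dev/stderr"
                exit 1
            }
        }
    '
}
```

Fed the two swap lines from the report above (905924kB free of 1081340kB total), e.g. via `dmesg | swap_free_pct`, it prints 83: most of the swap was still unused when pg_dump was killed.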
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
From: Gerhard Wiesinger @ 2015-03-03 13:18 UTC
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 03.03.2015 13:28, Gerhard Wiesinger wrote:
> [...]
> 3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1,
> pc-i440fx-2.2: OOM problem, no crash
>
> postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
> Free swap = 905924kB
> Total swap = 1081340kB
> Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
> Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB,
> file-rss:0kB
>
> An OOM should not occur:
> https://www.kernel.org/doc/gorman/html/understand/understand016.html
> Is there enough swap space left (nr_swap_pages > 0)? If yes, not OOM
>
> Why does an OOM condition occur? Looks like a bug in the kernel?
> Any ideas?

# Allocating ~800MB (805306368 bytes = 768 MiB), killed by OOM killer
./mallocsleep 805306368
Killed

Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice child
Killed process 27160 (mallocsleep) total-vm:790588kB, anon-rss:214948kB, file-rss:0kB

free -m
              total        used        free      shared  buff/cache   available
Mem:            363          23         252          23          87         295
Swap:          1055         134         921

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1392
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1392
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

# Machine becomes unresponsive and stalls for seconds, but swap usage never reaches the 1055MB swap size (+ 384MB RAM)
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free buff cache   si     so      bi     bo    in     cs us sy  id wa st
 0  0 136472 241196 1400 98544    4     57    1724     67   211    261  2  3  91  2  2
 0  0 136472 241228 1400 98540    0      0       0      0    30     48  0  0 100  0  0
 0  0 136472 241228 1408 98532    0      0       0     52    53     51  0  0  89 11  0
 0  0 136472 241224 1408 98540    0      0       0    112    44     92  0  0 100  0  0
 0  0 136472 241224 1408 98540    0      0       0      0    24     32  0  0 100  0  0
 0  0 136472 241352 1408 98540    0      0       0      0    31     44  0  1 100  0  0
 0  0 136472 241328 1408 98540    0      0       0     36    97    142  0  1  99  0  0
 0  0 136472 241364 1408 98540    0      0       0      0    22     30  0  0 100  0  0
 0  0 136472 241376 1416 98532    0      0       0     80    52     45  0  0  92  8  1
 1  0 136472   9236 1416 98548    0      0       8      0   762     55 11 23  66  0  0
 2  7 270496   3804  140 61172 1144 412268   15028 412340 92805 301836  1 49   1 27 22
 1 12 620320   4788  140 35240 1240 114864   96860 114976 46242  96395  1 26   0 61 12
 3 18 661436   4788  144 35568  508      0  167884      0  5605   8097  5 76   0 16  4
 3  4 661220   4288  144 34256  252      0  273684      0  7454   9777  3 71   0 19  7
 5 20 661024   4532  144 34772  320      0  238288      0  9452  12395  3 78   0 13  6
 6 19 660596   4592  144 35884  320      0  233160      8 12401  16798  5 67   0 12 15
 3 20 677268   4296  140 36816 2180  18200  444328  18332 19382  36234  8 67   0 11 14
 3 25 677208   4792  136 36044   68      0  524340     12 20637  26558  3 74   0 15  8
 2 21 687880   4964  136 38200  260  10784  311152  10884 17707  28941  4 78   0 12  5
 3 21 693808   4380  176 36860  136   6024  388932   6096 14576  22372  3 84   0  6  7
 3 27 693740   4432  152 38288   56  20736  419592  20744 23212  31219  4 87   0  7  2
 3 23 713696   4384  152 38172  796      0  481420     96 16498  27177  8 87   0  4  1
 3 27 713360   4116  152 38372 1844      0 1308552    296 25074  33901  5 85   0  9  1
 3 29 714628   4416  180 41992  256   2556  501832   2704 56498  76293  3 91   0  5  1
 3 29 714572   3860  172 41076  156      0  920736    152 12131  17339  5 94   0  0  0
 4 28 714396   5108  152 40124  212  10924  567648  11148 41901  56712  4 90   0  4  2
 3 30 725216   4060  136 40604  124      0  286384    156 21992  35505  5 91   0  2  3
 8 12 148836 230388  320 70888 5356      0   24304     52  9977  15084 17 75   0  5  3
 0  0 146692 271900  416 76680 2200      0    6592      0  1561   3198 10 10  78  2  1
 0  0 146584 271900  416 76892  152      0     184      0    75    139  0  0 100  0  1
 0  0 146488 271396  552 76980  128      0     264     36   124    230  0  1  98  1  0
 0  0 146372 271076  680 77196  124      0     252      8    79    167  0  0 100  0  0
 0  0 146312 270948  688 77332   64      0      64     80    61    102  0  0  97  3  1

What's wrong here?
Kernel Bug?

Ciao,
Gerhard

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>

/* Parse a non-negative byte count (decimal, octal or hex).
 * Returns 1 on success, 0 on malformed or out-of-range input. */
static int getsize(const char *s, size_t *retvalue)
{
    char *eptr;
    long long value;

    errno = 0;
    value = strtoll(s, &eptr, 0);
    if (eptr == s || *eptr != '\0' || errno == ERANGE)
        return 0;
    if (value < 0 || (unsigned long long)value > (size_t)-1)
        return 0;
    *retvalue = (size_t)value;
    return 1;
}

int main(int argc, char *argv[])
{
    unsigned int *p;
    size_t size = 16 * 1024 * 1024;   /* default: 16 MiB */
    size_t i;

    if (argc > 1 && !getsize(argv[1], &size)) {
        fprintf(stderr, "Wrong memsize!\n");
        exit(1);
    }

    p = malloc(size);
    if (p == NULL) {
        perror("malloc");
        exit(1);
    }

    /* Touch every word so the pages are really backed by RAM or swap. */
    for (i = 0; i < size / sizeof(*p); i++)
        p[i] = 0;

    sleep(3600);
    free(p);
    return 0;
}
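To put a number on the observation that swap never fills up during the test, the vmstat samples can be reduced to two figures: the highest swpd and the lowest free value seen. This is a small sketch, assuming the output of `vmstat 1` was saved to a file or is piped in on stdin; the function name vmstat_peaks and its output format are invented for this illustration.

```shell
# Hypothetical helper: summarise `vmstat 1` output read from stdin.
# Column 3 is swpd (swap in use, kB), column 4 is free (idle memory, kB).
vmstat_peaks() {
    awk '
        # Data rows start with a number; the two header lines do not.
        $1 ~ /^[0-9]+$/ && NF >= 16 {
            swpd = $3 + 0
            free = $4 + 0
            if (!seen || swpd > peak) peak = swpd
            if (!seen || free < low)  low  = free
            seen = 1
        }
        END {
            if (!seen) exit 1
            printf "peak_swpd_kb=%d min_free_kb=%d\n", peak, low
        }
    '
}
```

One way to use it is `vmstat 1 | tee vmstat.log` while mallocsleep runs, followed by `vmstat_peaks < vmstat.log`. For the samples posted above, the peak swpd is 725216 kB (about 708 MiB of the 1055 MiB swap) and the lowest free value is 3804 kB, so the OOM killer fired with roughly a third of the swap still unused.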
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
From: Gerhard Wiesinger @ 2015-03-03 20:50 UTC
To: Paolo Bonzini, Laine Stump, qemu-devel, Cole Robinson, virt

On 03.03.2015 14:18, Gerhard Wiesinger wrote:
> [...]
> # Allocating 800MB, killed by OOM killer
> ./mallocsleep 805306368
> Killed
> [...]
> What's wrong here?
> Kernel Bug?
>

This all reminds me of the post here:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html

    Last month, these outages began to happen more regularly but also very
    randomly. The symptoms were quite similar:
    - CPU spiked to 100% utilization.
    - Disk I/O spiked.
    - Server became completely inaccessible via SSH, etc.
    - Logs show the Linux Out Of Memory (OOM) killer killing user processes
      that have hit their cgroup's memory limit shortly before the server froze.
    - Host memory was not under pressure - it was close to fully utilized
      (which is normal) but there was a lot of unused swap.

Ciao,
Gerhard
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
From: Juan Quintela @ 2015-01-14 17:47 UTC
To: Gerhard Wiesinger; +Cc: Paolo Bonzini, qemu-devel

Gerhard Wiesinger <lists@wiesinger.com> wrote:
> On 12.01.2015 12:41, Gerhard Wiesinger wrote:
>> On 08.01.2015 23:28, Gerhard Wiesinger wrote:
>>>
>>> I'll keep you up to date in the next days whether it happens again
>>> or not.
>>
>> With qemu-kvm 2.2.0 release from the above repository the 100% usage
>> didn't happen so far (although I had to reboot after kernel update).
>
> It happens also with qemu-kvm 2.2.0 on another VM where also
> PostgreSQL is running:
> (gdb) bt
> #0 0x00007fff9a1feff4 in gettimeofday ()
> #1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
>
> What we know:
> OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
> qemu-kvm-1.6.2-10.fc20.x86_64 on host
> NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.1.2-7.fc21.x86_64 on host
> NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
> qemu-kvm-2.2.0-1.fc21.x86_64 on host
>
> No one else can reproduce or has similar problems?

I have similar problems, F21 guest on F21 host.
https://bugzilla.redhat.com/show_bug.cgi?id=1174664

For me, it is firefox that from time to time loops in vclock_gettime().
At that point the DSO has got corrupted (kvmclock MSRs), and as you
say, the only solution is a reboot. Could you look at the bug, and see if
the areas pointed to by the MSR also got corrupted? I tried using
kernel-debug with debugging for memory allocation (Marcelo's request). I
see patterns like 0x5a5a5a5a, so clearly there is some corruption, but I
haven't been able to find _what_ is making that happen.

> Any further ideas?

Nope :-(

> BTW: I'm running ntp in the following manner: internet <=> ntp server
> in VM <=> ntp client on KVM host (firewall runs in KVM)

I am using ntp on the host against the Internet, but the guest doesn't use
ntp (I do an ntpdate if I see that the guest has drifted too much, and it
is a rare event. Right now I don't remember having fixed it lately).

Later, Juan.
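Since both reports involve guests that read the time through kvmclock, one quick data point to collect from an affected guest is which clocksource it is actually using. The sketch below reads the standard Linux sysfs clocksource files; the function name clocksource_report and the optional directory argument (there only so the function can be pointed at a test fixture) are assumptions made for this illustration. It is also assumed here that a KVM guest reports the clocksource as kvm-clock rather than the kvm_clock spelling used in this thread.

```shell
# Hypothetical helper: show the clocksource currently in use and the
# alternatives the kernel offers. An optional first argument overrides
# the sysfs directory, which is mainly useful for testing.
clocksource_report() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    if [ ! -r "$dir/current_clocksource" ]; then
        echo "cannot read $dir/current_clocksource" >&2
        return 1
    fi
    printf 'current=%s\n' "$(cat "$dir/current_clocksource")"
    if [ -r "$dir/available_clocksource" ]; then
        printf 'available=%s\n' "$(cat "$dir/available_clocksource")"
    fi
    return 0
}
```

Running `clocksource_report` inside the guest before and after adding clocksource=tsc to the kernel command line would confirm whether the switch tried earlier in the thread really took effect.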
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
From: Juan Quintela @ 2015-01-14 17:52 UTC
To: Gerhard Wiesinger; +Cc: Paolo Bonzini, qemu-devel

Juan Quintela <quintela@redhat.com> wrote:
> Gerhard Wiesinger <lists@wiesinger.com> wrote:
>> No one else can reproduce or has similar problems?
>
> I have similar problems, F21 guest on F21 host.
> https://bugzilla.redhat.com/show_bug.cgi?id=1174664
> [...]

I forgot to say in the previous mail: I am using 2 vCPUs. With a single
vCPU I have been unable to trigger this bug.

Later, Juan.
* Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
From: Gerhard Wiesinger @ 2015-01-14 22:19 UTC
To: quintela; +Cc: Paolo Bonzini, qemu-devel

On 14.01.2015 18:52, Juan Quintela wrote:
> Juan Quintela <quintela@redhat.com> wrote:
> I forgot to say in the previous mail: I am using 2 vCPUs. With a single
> vCPU I have been unable to trigger this bug.

There is already a fix available: a new patched kernel for the guest. See
the bugzilla entry and my posts in this thread.

Ciao,
Gerhard