From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:43006)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1YSmj4-00069M-Jx
	for qemu-devel@nongnu.org; Tue, 03 Mar 2015 08:19:08 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1YSmiz-0005wy-Kn
	for qemu-devel@nongnu.org; Tue, 03 Mar 2015 08:19:06 -0500
Received: from vps01.wiesinger.com ([46.36.37.179]:33175)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1YSmiz-0005wX-9Z
	for qemu-devel@nongnu.org; Tue, 03 Mar 2015 08:19:01 -0500
Message-ID: <54F5B4BC.5060302@wiesinger.com>
Date: Tue, 03 Mar 2015 14:18:52 +0100
From: Gerhard Wiesinger
MIME-Version: 1.0
References: <54AE87C1.2060907@wiesinger.com> <54AEBD43.2060705@redhat.com>
	<54AEC877.9080600@wiesinger.com> <54AECAF3.3060909@redhat.com>
	<54AF047D.8010009@wiesinger.com> <54B3B2F5.1090405@wiesinger.com>
	<54B57C51.7090002@wiesinger.com> <54B584AB.4090303@redhat.com>
	<54B58AC0.5080805@wiesinger.com> <54B58B18.9060205@redhat.com>
	<54B595C7.3080101@wiesinger.com> <54B5BF5F.9000805@redhat.com>
	<54B633CE.3040901@wiesinger.com> <54E05659.9050701@wiesinger.com>
	<54E1FC2B.3030805@redhat.com> <54E20812.4090006@wiesinger.com>
	<54E20CD5.3050909@redhat.com> <54F2EBA5.4050907@wiesinger.com>
	<54F42CC7.20504@redhat.com> <54F48734.7020800@wiesinger.com>
	<54F49A95.20300@wiesinger.com> <54F57B17.50100@wiesinger.com>
	<54F5A8F9.1060207@wiesinger.com>
In-Reply-To: <54F5A8F9.1060207@wiesinger.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
To: Paolo Bonzini, Laine Stump, qemu-devel@nongnu.org, Cole Robinson, virt@lists.fedoraproject.org

On 03.03.2015 13:28, Gerhard Wiesinger wrote:
> On 03.03.2015 10:12, Gerhard Wiesinger wrote:
>> On 02.03.2015 18:15, Gerhard Wiesinger wrote:
>>> On 02.03.2015 16:52, Gerhard Wiesinger wrote:
>>>> On 02.03.2015 10:26, Paolo Bonzini wrote:
>>>>>
>>>>> On 01/03/2015 11:36, Gerhard Wiesinger wrote:
>>>>>> So far it has happened only on the PostgreSQL database VM. Kernel is
>>>>>> alive (ping works well). ssh is not working.
>>>>>> console window: after entering one character at login prompt,
>>>>>> then crashed:
>>>>>> [1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or sacrifice child
>>>>>> [1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB, anon-rss: 162712kB, file-rss: 220kB
>>>>> Can you get a vmcore or at least sysrq-t output?
>>>>
>>>> Yes, next time it happens I can analyze it.
>>>>
>>>> I think there are 2 problems:
>>>> 1.) OOM (Out of Memory) problem with the low memory settings and
>>>> kernel settings (see below)
>>>> 2.) Instability problem which might depend on 1.)
>>>>
>>>> What I've done so far (thanks to Andrey Korolyov for ideas and help):
>>>> a.) Updated machine type from pc-0.15 to pc-i440fx-2.2
>>>> virsh dumpxml database | grep ">>> hvm
>>>>
>>>> virsh edit database
>>>> virsh dumpxml database | grep ">>> hvm
>>>>
>>>> SMBIOS is therefore updated from 2.4 to 2.8:
>>>> dmesg|grep -i SMBIOS
>>>> [ 0.000000] SMBIOS 2.8 present.
>>>> b.) Switched to tsc clock, kernel parameters: clocksource=tsc
>>>> nohz=off highres=off
>>>> c.) Changed overcommit to 1
>>>> echo "vm.overcommit_memory = 1" > /etc/sysctl.d/overcommit.conf
>>>> d.) Tried 1 VCPU instead of 2
>>>> e.) Installed 512MB vRAM instead of 384MB
>>>> f.) Prepared for sysrq and vmcore
>>>> echo "kernel.sysrq = 1" > /etc/sysctl.d/sysrq.conf
>>>> sysctl -w kernel.sysrq=1
>>>> virsh send-key database KEY_LEFTALT KEY_SYSRQ KEY_T
>>>> virsh dump domain-name /tmp/dumpfile
>>>> g.) Further ideas, not yet done: disable memory ballooning by
>>>> blacklisting the balloon driver or removing it from the virsh xml config
>>>>
>>>> Summary:
>>>> 1.) 512MB, tsc timer, 1VCPU, vm.overcommit_memory = 1: no OOM
>>>> problem, no crash
>>>> 2.) 512MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>>>> problem, no crash
>>>
>>> 3.) 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1: no OOM
>>> problem, no crash
>>
>> 3b.) It happened again at the nightly backup with the same
>> configuration as in 3.): 384MB, kvm_clock, 2VCPU,
>> vm.overcommit_memory = 1, pc-i440fx-2.2: no OOM problem, ping ok, no
>> reaction, BUT CRASHED again
>>
>
> 3c.) configuration 384MB, kvm_clock, 2VCPU, vm.overcommit_memory = 1,
> pc-i440fx-2.2: OOM problem, no crash
>
> postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
> Free swap = 905924kB
> Total swap = 1081340kB
> Out of memory: Kill process 19312 (pg_dump) score 142 or sacrifice child
> Killed process 19312 (pg_dump) total-vm:384516kB, anon-rss:119260kB, file-rss:0kB
>
> An OOM should not occur:
> https://www.kernel.org/doc/gorman/html/understand/understand016.html
> Is there enough swap space left (nr_swap_pages > 0) ? If yes, not OOM
>
> Why does an OOM condition occur? Looks like a bug in the kernel?
> Any ideas?
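
The swap condition quoted above (nr_swap_pages > 0) can be approximated from user space by looking at SwapFree in /proc/meminfo. The following is only a minimal sketch: the helper names meminfo_kb() and swap_left() are made up for illustration and are not part of mallocsleep or of any kernel interface. It mirrors the quoted condition and makes no claim about what the running kernel's OOM killer actually evaluates.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (illustration only): look up one field such as
 * "SwapFree" in text formatted like /proc/meminfo and return its value
 * in kB. Returns -1 if the key is missing or its line cannot be parsed. */
long meminfo_kb(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);
    size_t keylen = strlen(key);
    const char *line = text;

    while (*line != '\0') {
        /* A matching line looks like "SwapFree:   905924 kB". */
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ':') {
            long value;
            if (sscanf(line + keylen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        /* Advance to the start of the next line, if there is one. */
        const char *nl = strchr(line, '\n');
        if (nl == NULL)
            break;
        line = nl + 1;
    }
    return -1;
}

/* Hypothetical helper: read /proc/meminfo (Linux only) and report whether
 * any swap is free. Returns 1 if SwapFree > 0, 0 if it is 0, and -1 if
 * the file cannot be read or parsed. */
int swap_left(void)
{
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    long free_kb = meminfo_kb(buf, "SwapFree");
    if (free_kb < 0)
        return -1;
    return free_kb > 0;
}
```

Calling swap_left() in a loop while the backup runs would show whether SwapFree ever approaches zero before the OOM killer fires. With the "Free swap = 905924kB" reported in the log above, it would be expected to return 1.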

# Allocating 800MB, killed by OOM killer
./mallocsleep 805306368
Killed

Out of memory: Kill process 27160 (mallocsleep) score 525 or sacrifice child
Killed process 27160 (mallocsleep) total-vm:790588kB, anon-rss:214948kB, file-rss:0kB

free -m
              total        used        free      shared  buff/cache   available
Mem:            363          23         252          23          87         295
Swap:          1055         134         921

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1392
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1392
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

# Machine becomes unresponsive and stalls for seconds, but never uses more than the 1055MB swap size (+ 384MB RAM)
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free buff cache   si     so      bi     bo    in     cs us sy  id wa st
 0  0 136472 241196 1400 98544    4     57    1724     67   211    261  2  3  91  2  2
 0  0 136472 241228 1400 98540    0      0       0      0    30     48  0  0 100  0  0
 0  0 136472 241228 1408 98532    0      0       0     52    53     51  0  0  89 11  0
 0  0 136472 241224 1408 98540    0      0       0    112    44     92  0  0 100  0  0
 0  0 136472 241224 1408 98540    0      0       0      0    24     32  0  0 100  0  0
 0  0 136472 241352 1408 98540    0      0       0      0    31     44  0  1 100  0  0
 0  0 136472 241328 1408 98540    0      0       0     36    97    142  0  1  99  0  0
 0  0 136472 241364 1408 98540    0      0       0      0    22     30  0  0 100  0  0
 0  0 136472 241376 1416 98532    0      0       0     80    52     45  0  0  92  8  1
 1  0 136472   9236 1416 98548    0      0       8      0   762     55 11 23  66  0  0
 2  7 270496   3804  140 61172 1144 412268   15028 412340 92805 301836  1 49   1 27 22
 1 12 620320   4788  140 35240 1240 114864   96860 114976 46242  96395  1 26   0 61 12
 3 18 661436   4788  144 35568  508      0  167884      0  5605   8097  5 76   0 16  4
 3  4 661220   4288  144 34256  252      0  273684      0  7454   9777  3 71   0 19  7
 5 20 661024   4532  144 34772  320      0  238288      0  9452  12395  3 78   0 13  6
 6 19 660596   4592  144 35884  320      0  233160      8 12401  16798  5 67   0 12 15
 3 20 677268   4296  140 36816 2180  18200  444328  18332 19382  36234  8 67   0 11 14
 3 25 677208   4792  136 36044   68      0  524340     12 20637  26558  3 74   0 15  8
 2 21 687880   4964  136 38200  260  10784  311152  10884 17707  28941  4 78   0 12  5
 3 21 693808   4380  176 36860  136   6024  388932   6096 14576  22372  3 84   0  6  7
 3 27 693740   4432  152 38288   56  20736  419592  20744 23212  31219  4 87   0  7  2
 3 23 713696   4384  152 38172  796      0  481420     96 16498  27177  8 87   0  4  1
 3 27 713360   4116  152 38372 1844      0 1308552    296 25074  33901  5 85   0  9  1
 3 29 714628   4416  180 41992  256   2556  501832   2704 56498  76293  3 91   0  5  1
 3 29 714572   3860  172 41076  156      0  920736    152 12131  17339  5 94   0  0  0
 4 28 714396   5108  152 40124  212  10924  567648  11148 41901  56712  4 90   0  4  2
 3 30 725216   4060  136 40604  124      0  286384    156 21992  35505  5 91   0  2  3
 8 12 148836 230388  320 70888 5356      0   24304     52  9977  15084 17 75   0  5  3
 0  0 146692 271900  416 76680 2200      0    6592      0  1561   3198 10 10  78  2  1
 0  0 146584 271900  416 76892  152      0     184      0    75    139  0  0 100  0  1
 0  0 146488 271396  552 76980  128      0     264     36   124    230  0  1  98  1  0
 0  0 146372 271076  680 77196  124      0     252      8    79    167  0  0 100  0  0
 0  0 146312 270948  688 77332   64      0      64     80    61    102  0  0  97  3  1

What's wrong here? Kernel Bug?

Ciao,
Gerhard

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef unsigned int BOOL;
typedef char* PCHAR;
typedef unsigned int DWORD;

#define FALSE 0
#define TRUE 1

/* Parse s as a non-negative integer (decimal, hex or octal) into *retvalue. */
BOOL getlong(PCHAR s,DWORD* retvalue)
{
  char *eptr;
  long value;

  value=strtol(s,&eptr,0);
  if ((eptr==s)||(*eptr!='\0')) return FALSE;
  if (value<0) return FALSE;
  *retvalue=value;
  return TRUE;
}

int main(int argc,char* argv[])
{
  unsigned int* p;
  unsigned int size=16*1024*1024;   /* default: 16MB */
  unsigned int size_of=sizeof(unsigned int);
  unsigned int i;

  if (argc>1)
  {
    if (!getlong(argv[1],&size))
    {
      printf("Wrong memsize!\n");
      exit(1);
    }
  }

  p=malloc(size);
  if (p==NULL)
  {
    printf("malloc failed!\n");
    exit(1);
  }

  /* Touch every word so the pages are really allocated, not just reserved. */
  for(i=0;i<(size/size_of);i++) p[i]=0;

  sleep(3600);
  free(p);
  return 0;
}
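
As a back-of-the-envelope check on the numbers above: 805306368 bytes is 768 MiB, while the free -m output shows 295 MiB available and 921 MiB of free swap, 1216 MiB in total. A minimal sketch of that arithmetic follows. The function name fits_in_ram_plus_swap() is hypothetical, and it assumes the free -m snapshot is representative of the state just before the allocation, which the output alone does not prove.

```c
#include <assert.h>

/* Hypothetical helper (illustration only): does a request of
 * request_bytes fit into available RAM plus free swap, both given in
 * MiB as printed by "free -m"? Returns 1 if it fits, 0 otherwise.
 * This is plain arithmetic, not a model of the kernel's OOM logic. */
int fits_in_ram_plus_swap(unsigned long long request_bytes,
                          unsigned long long available_mib,
                          unsigned long long swap_free_mib)
{
    const unsigned long long mib = 1024ULL * 1024ULL;
    unsigned long long capacity = (available_mib + swap_free_mib) * mib;

    /* Guard against overflow in the multiplication above. */
    assert(capacity / mib == available_mib + swap_free_mib);

    return request_bytes <= capacity;
}
```

With the values from this mail, fits_in_ram_plus_swap(805306368ULL, 295, 921) returns 1, so on paper the request fits into RAM plus swap. That is why the OOM kill looks suspicious, although the sketch ignores kernel reserves, other processes and how quickly pages can be swapped out.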