* KVM lock contention on 48 core AMD machine
@ 2011-03-18 12:02 Ben Nagy
2011-03-18 12:30 ` Joerg Roedel
2011-03-18 12:44 ` Stefan Hajnoczi
0 siblings, 2 replies; 32+ messages in thread
From: Ben Nagy @ 2011-03-18 12:02 UTC (permalink / raw)
To: kvm
Hi,
We've been trying to debug a problem when bringing up VMs on a 48 core
AMD machine (4 x Opteron 6168). After some investigation and some
helpful comments from #kvm, it appears that we hit a serious lock
contention issue at a certain point. We have enabled lockdep debugging
(had to increase MAX_LOCK_DEPTH in sched.h to 144!) and have some
output, but I'm not all that sure how to progress from here in
troubleshooting the issue.
Linux eax 2.6.38-7-vmhost #35 SMP Thu Mar 17 13:25:10 SGT 2011 x86_64
x86_64 x86_64 GNU/Linux
(vmhost is a custom flavour which has the lock debugging stuff
enabled, the base distro is Ubuntu Natty alpha3)
QEMU emulator version 0.14.0 (qemu-kvm-0.14.0), Copyright (c)
2003-2008 Fabrice Bellard
CPUs - 4 physical 12 core CPUs.
model name : AMD Opteron(tm) Processor 6168
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl
nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm
cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch
osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save pausefilter
RAM: 96GB
KVM commandline (using libvirt):
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
QEMU_AUDIO_DRV=none /usr/local/bin/kvm-snapshot -S -M pc-0.14
-enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name fb-0
-uuid de59229b-eb06-9ecc-758e-d20bc5ddc291 -nodefconfig -nodefaults
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/fb-0.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=readline -rtc base=localtime
-no-acpi -boot cd -drive
file=/mnt/big/bigfiles/kvm_disks/eax/fb-0.ovl,if=none,id=drive-ide0-0-0,format=qcow2
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
-drive if=none,media=cdrom,id=drive-ide0-0-1,readonly=on,format=raw
-device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1
-netdev tap,fd=17,id=hostnet0 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d9:09:ef,bus=pci.0,addr=0x3
-usb -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -k en-us -vga
cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4
kvm-snapshot is just a script that runs /usr/bin/kvm "$@" -snapshot
The VMs are .ovl files which link to a single qcow2 disk, which is
hosted on an iscsi volume, with ocfs2 as the filesystem. However, we
reproduced the problem running all the VMs locally, so that seems to
indicate that it's not an IB issue.
Basically, once somewhere between 30 and 40 machines have finished
booting, the system cpu utilisation climbs to 99.9%. VMs are
unresponsive, but the host itself is still responsive.
Here's some output from perf top while the system is locky:
 263832.00 46.3% delay_tsc                [kernel.kallsyms]
 231491.00 40.7% __ticket_spin_trylock    [kernel.kallsyms]
  14609.00  2.6% native_read_tsc          [kernel.kallsyms]
   9414.00  1.7% do_raw_spin_lock         [kernel.kallsyms]
   8041.00  1.4% local_clock              [kernel.kallsyms]
   6081.00  1.1% native_safe_halt         [kernel.kallsyms]
   3901.00  0.7% __lock_acquire.clone.18  [kernel.kallsyms]
   3665.00  0.6% do_raw_spin_unlock       [kernel.kallsyms]
   3042.00  0.5% __delay                  [kernel.kallsyms]
   2484.00  0.4% lock_contended           [kernel.kallsyms]
   2484.00  0.4% sched_clock_cpu          [kernel.kallsyms]
   1906.00  0.3% sched_clock_local        [kernel.kallsyms]
   1419.00  0.2% lock_acquire             [kernel.kallsyms]
   1332.00  0.2% lock_release             [kernel.kallsyms]
    987.00  0.2% tg_load_down             [kernel.kallsyms]
    895.00  0.2% _raw_spin_lock_irqsave   [kernel.kallsyms]
    686.00  0.1% find_busiest_group       [kernel.kallsyms]
I have been looking at the top contended locks from when the system is
idle, with some VMs running and when it's in the locky condition
http://paste.ubuntu.com/582025/ - idle
http://paste.ubuntu.com/582007/ - some VMs
http://paste.ubuntu.com/582019/ - locky
(output is from: grep : /proc/lock_stat | head -30)
The main things that I see are fidvid_mutex and idr_lock#3. The
fidvid_mutex seems like it might be related to the high % spent in
delay_tsc from perf top...
Anyway, if someone could give me some suggestions for things to try,
or more information that might help... anything really. :)
Thanks a lot,
ben
^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: KVM lock contention on 48 core AMD machine
  2011-03-18 12:02 KVM lock contention on 48 core AMD machine Ben Nagy
@ 2011-03-18 12:30 ` Joerg Roedel
  2011-03-19  4:45   ` Ben Nagy
  2011-03-18 12:44 ` Stefan Hajnoczi
  1 sibling, 1 reply; 32+ messages in thread
From: Joerg Roedel @ 2011-03-18 12:30 UTC (permalink / raw)
  To: Ben Nagy; +Cc: kvm

Hi Ben,

On Fri, Mar 18, 2011 at 07:02:40PM +0700, Ben Nagy wrote:
> Here's some output from perf top while the system is locky:
>  263832.00 46.3% delay_tsc                [kernel.kallsyms]
>  231491.00 40.7% __ticket_spin_trylock    [kernel.kallsyms]
>   14609.00  2.6% native_read_tsc          [kernel.kallsyms]
>    9414.00  1.7% do_raw_spin_lock         [kernel.kallsyms]
>    8041.00  1.4% local_clock              [kernel.kallsyms]
>    6081.00  1.1% native_safe_halt         [kernel.kallsyms]
>    3901.00  0.7% __lock_acquire.clone.18  [kernel.kallsyms]
>    3665.00  0.6% do_raw_spin_unlock       [kernel.kallsyms]
>    3042.00  0.5% __delay                  [kernel.kallsyms]
>    2484.00  0.4% lock_contended           [kernel.kallsyms]
>    2484.00  0.4% sched_clock_cpu          [kernel.kallsyms]
>    1906.00  0.3% sched_clock_local        [kernel.kallsyms]
>    1419.00  0.2% lock_acquire             [kernel.kallsyms]
>    1332.00  0.2% lock_release             [kernel.kallsyms]
>     987.00  0.2% tg_load_down             [kernel.kallsyms]
>     895.00  0.2% _raw_spin_lock_irqsave   [kernel.kallsyms]
>     686.00  0.1% find_busiest_group       [kernel.kallsyms]

Can you try to run

	# perf record -a -g

for a while when your VMs are up and unresponsive? This will monitor
the whole system and collect stack traces. When you have done so,
please run

	# perf report > locks.txt

and upload the locks.txt file somewhere. The result might give us some
clue where the high lock contention comes from.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-18 12:30 ` Joerg Roedel
@ 2011-03-19  4:45   ` Ben Nagy
  2011-03-21  9:50     ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Nagy @ 2011-03-19  4:45 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: kvm

On Fri, Mar 18, 2011 at 7:30 PM, Joerg Roedel <joro@8bytes.org> wrote:
> Hi Ben,
>
> Can you try to run
>
>	# perf record -a -g
>
> for a while when your VMs are up and unresponsive?

Hi Joerg,

Thanks for the response. Here's that output:

http://paste.ubuntu.com/582345/

Looks like it's spending all its time in kvm code, but there are some
symbols that are unresolved. Is the report still useful?

Cheers,

ben

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-19  4:45 ` Ben Nagy
@ 2011-03-21  9:50   ` Avi Kivity
  2011-03-21 11:43     ` Ben Nagy
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2011-03-21  9:50 UTC (permalink / raw)
  To: ben; +Cc: Joerg Roedel, kvm

On 03/19/2011 06:45 AM, Ben Nagy wrote:
> On Fri, Mar 18, 2011 at 7:30 PM, Joerg Roedel<joro@8bytes.org> wrote:
> > Hi Ben,
> >
> > Can you try to run
> >
> >	# perf record -a -g
> >
> > for a while when your VMs are up and unresponsive?
>
> Hi Joerg,
>
> Thanks for the response. Here's that output:
>
> http://paste.ubuntu.com/582345/
>
> Looks like it's spending all its time in kvm code, but there are some
> symbols that are unresolved. Is the report still useful?

It's actually not kvm code, but timer code. Please build qemu with
./configure --disable-strip and redo.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-21  9:50 ` Avi Kivity
@ 2011-03-21 11:43   ` Ben Nagy
  2011-03-21 13:41     ` Ben Nagy
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Nagy @ 2011-03-21 11:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Joerg Roedel, kvm

On Mon, Mar 21, 2011 at 3:35 PM, Avi Kivity <avi@redhat.com> wrote:
> It's actually not kvm code, but timer code. Please build qemu with
> ./configure --disable-strip and redo.

Bit of a status update:

Removing the usb tablet device helped a LOT - raised the number of
machines we could bring up before the lock % got too high

Removing the -no-acpi option helped quite a lot - lowered the
delay_tsc calls to a 'normal' number

I am currently building a new XP template guest with the ACPI HAL,
using /usepmtimer, and disabling AMD PowerNow in the host BIOS, but
it's a lot of changes so it's taking a while to roll out and test.

On the IO side we moved to .raw disk images backing the files, and I
installed the latest virtio NIC and block drivers in the XP guest.
hdparm -t in the XP guest (running off the .raw over the iscsi) gets
437MB/s read, but I haven't tested it yet under contention. We did
verify (again) that the performance issues remain more or less the
same off the local SSD as off the infiniband / iSCSI.

Would anyone care to give me any tips regarding the best way to share
a single template disk for many guests based off that template?
Currently we plan to create a .raw template, keep it on a separate
physical device on the SAN and build qed files using that file as the
backing, then run the guests off those in snapshot mode (it's fine
that the changes get discarded, for our app).

Since the underlying configs have changed so much, I will set up, run
all the tests again and then followup if things are still locky.

Many thanks,

ben

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-21 11:43 ` Ben Nagy
@ 2011-03-21 13:41   ` Ben Nagy
  2011-03-21 13:53     ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Nagy @ 2011-03-21 13:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Joerg Roedel, kvm

On Mon, Mar 21, 2011 at 5:28 PM, Ben Nagy <ben@iagu.net> wrote:
> Bit of a status update:
[...]
> Since the underlying configs have changed so much, I will set up, run
> all the tests again and then followup if things are still locky.

Which they are. I will need to get the git sources in order to build
with disable-strip as Avi suggested, which I can start doing tomorrow,
but in the meantime, I brought up 96 guests.

Command:

00:02:47 /usr/bin/kvm -S -M pc-0.14 -enable-kvm -m 1024
-smp 1,sockets=1,cores=1,threads=1 -name fbn-95
-uuid 2c7bfda3-525f-9a4f-901c-7e139dfc236f -nographic -nodefconfig -nodefaults
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/fbn-95.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=readline -rtc base=localtime
-boot cd
-drive file=/mnt/big/bigfiles/kvm_disks/eax/fbn-95.ovl,if=none,id=drive-virtio-disk0,boot=on,format=qed
-device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0
-drive if=none,media=cdrom,id=drive-ide0-0-1,readonly=on,format=raw
-device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1
-netdev tap,fd=16,id=hostnet0
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:01:41:1c,bus=pci.0,addr=0x3
-vga std -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -snapshot

perf report output (huge, sorry)
http://paste.ubuntu.com/583311/

perf top output
http://paste.ubuntu.com/583313/

lock_stat contents
http://paste.ubuntu.com/583314/

The guests are xpsp3, ACPI Uniprocessor HAL, no /usepmtimer switch
anymore (standard boot.ini), fully updated, with version 1.1.16 of the
redhat virtio drivers for NIC and block

Major changes I made
- disabled usb
- changed from cirrus to std-vga
- changed .ovl format to qed
- changed backing disk format to raw

The machines now boot much MUCH faster, I can bring up 96 machines in
a few minutes instead of hours, but the system is at around 43% system
time, doing nothing, and each guest reports ~25% CPU from top.

If anybody can see anything from the stripped perf report output that
would be great, or if anyone has any other suggestions as to things I
can try to debug, I'm all ears. :)

My existing plans for further investigation are to build with
disable-strip and retest, and also to test with similarly configured
linux guests.

Thanks a lot,

ben

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-21 13:41 ` Ben Nagy
@ 2011-03-21 13:53   ` Avi Kivity
  [not found]     ` <AANLkTikWQS281kTtJ32-qo5U+w_BAak7qUwVhUQgOxxv@mail.gmail.com>
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2011-03-21 13:53 UTC (permalink / raw)
  To: ben; +Cc: Joerg Roedel, kvm

On 03/21/2011 03:41 PM, Ben Nagy wrote:
> perf report output (huge, sorry)
> http://paste.ubuntu.com/583311/

In the future, please post the binary perf.dat.

Anyway, I see

    39.87%    kvm  [kernel.kallsyms]  [k] delay_tsc
              |
              --- delay_tsc
                 |
                 |--98.20%-- __delay
                 |           do_raw_spin_lock
                 |           |
                 |           |--90.51%-- _raw_spin_lock_irqsave
                 |           |           |
                 |           |           |--99.78%-- __lock_timer
                 |           |           |           |
                 |           |           |           |--62.85%-- sys_timer_gettime
                 |           |           |           |           system_call_fastpath
                 |           |           |           |           |
                 |           |           |           |           |--1.59%-- 0x7ff6c20d26fc
                 |           |           |           |           |
                 |           |           |           |           |--1.55%-- 0x7f16d89866fc

Please disable all spinlock debugging and try again.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread
[parent not found: <AANLkTikWQS281kTtJ32-qo5U+w_BAak7qUwVhUQgOxxv@mail.gmail.com>]
* Re: KVM lock contention on 48 core AMD machine
  [not found] ` <AANLkTikWQS281kTtJ32-qo5U+w_BAak7qUwVhUQgOxxv@mail.gmail.com>
@ 2011-03-21 15:50   ` Avi Kivity
  2011-03-21 16:16     ` Ben Nagy
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2011-03-21 15:50 UTC (permalink / raw)
  To: ben; +Cc: KVM list

(restored kvm@vger)

On 03/21/2011 05:44 PM, Ben Nagy wrote:
> On Mon, Mar 21, 2011 at 7:38 PM, Avi Kivity<avi@redhat.com> wrote:
> > In the future, please post the binary perf.dat.
>
> Hi Avi,
>
> Thanks for the help!
>
> How do I do that?
>
> I'll try again on the non lock debug kernel and let you know.

'make nconfig' and go to the kernel hacking section.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-21 15:50 ` Avi Kivity
@ 2011-03-21 16:16   ` Ben Nagy
  2011-03-21 16:33     ` Avi Kivity
  2011-03-21 16:54     ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Ben Nagy @ 2011-03-21 16:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: KVM list

>> On Mon, Mar 21, 2011 at 7:38 PM, Avi Kivity<avi@redhat.com> wrote:
>> > In the future, please post the binary perf.dat.
>>
>> Hi Avi,
>>
>> How do I do that?
>
> 'make nconfig' and go to the kernel hacking section.

Imprecise question, sorry - I meant how do I get the perf.dat, not how
do I disable the debugging.

On the non-debug kernel: Linux eax 2.6.38-7-server #36-Ubuntu SMP Fri
Mar 18 23:36:13 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

 150512.00 67.8% __ticket_spin_lock          [kernel.kallsyms]
  11126.00  5.0% memcmp_pages                [kernel.kallsyms]
   9563.00  4.3% native_safe_halt            [kernel.kallsyms]
   8965.00  4.0% svm_vcpu_run                /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm-amd.ko
   6489.00  2.9% tg_load_down                [kernel.kallsyms]
   4676.00  2.1% kvm_get_cs_db_l_bits        /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko
   1931.00  0.9% load_balance_fair           [kernel.kallsyms]
   1917.00  0.9% ktime_get                   [kernel.kallsyms]
   1624.00  0.7% walk_tg_tree.clone.129      [kernel.kallsyms]
   1542.00  0.7% find_busiest_group          [kernel.kallsyms]
   1326.00  0.6% find_next_bit               [kernel.kallsyms]
    673.00  0.3% lock_hrtimer_base.clone.25  [kernel.kallsyms]
    624.00  0.3% copy_user_generic_string    [kernel.kallsyms]

top now says:

top - 00:11:35 up 22 min,  4 users,  load average: 0.11, 6.15, 7.78
Tasks: 491 total,   3 running, 488 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.9%us, 15.4%sy,  0.0%ni, 83.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  99068660k total, 70831760k used, 28236900k free,    10036k buffers
Swap:  2438140k total,  2173652k used,   264488k free,  3396144k cached

With average 'resting cpu' per idle guest 8%, 96 guests running.

Is this as good as I am going to get? It seems like I can't really
debug that lock contention without blowing stuff up because of the
load of the lock debugging...

Don't know if I mentioned this before, but those guests are each
pinned to a cpu (cpu guestnum%48) with taskset.

Cheers,

ben

^ permalink raw reply	[flat|nested] 32+ messages in thread
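[A minimal C sketch, for illustration, of the pinning Ben describes -
the equivalent of `taskset -pc $((guestnum % 48)) <pid>`. The thread
only says taskset was used; this program and its arguments are
hypothetical.]

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <guestnum> <pid>\n", argv[0]);
            return 1;
        }
        int cpu = atoi(argv[1]) % 48;       /* 48 cores in the host */
        pid_t pid = (pid_t)atoi(argv[2]);
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* a non-zero pid pins another process, as taskset -p does */
        if (sched_setaffinity(pid, sizeof(set), &set)) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned pid %d to cpu %d\n", (int)pid, cpu);
        return 0;
    }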
* Re: KVM lock contention on 48 core AMD machine
  2011-03-21 16:16 ` Ben Nagy
@ 2011-03-21 16:33   ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2011-03-21 16:33 UTC (permalink / raw)
  To: ben; +Cc: KVM list

On 03/21/2011 06:16 PM, Ben Nagy wrote:
> >> On Mon, Mar 21, 2011 at 7:38 PM, Avi Kivity<avi@redhat.com> wrote:
> >> > In the future, please post the binary perf.dat.
> >>
> >> Hi Avi,
> >>
> >> How do I do that?
> >
> > 'make nconfig' and go to the kernel hacking section.
>
> Imprecise question, sorry - I meant how do I get the perf.dat, not how
> do I disable the debugging.

'perf record' dumps perf.dat in the current directory.

> On the non-debug kernel: Linux eax 2.6.38-7-server #36-Ubuntu SMP Fri
> Mar 18 23:36:13 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
>
>  150512.00 67.8% __ticket_spin_lock          [kernel.kallsyms]
>   11126.00  5.0% memcmp_pages                [kernel.kallsyms]
>    9563.00  4.3% native_safe_halt            [kernel.kallsyms]
>    8965.00  4.0% svm_vcpu_run                /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm-amd.ko
>    6489.00  2.9% tg_load_down                [kernel.kallsyms]
>    4676.00  2.1% kvm_get_cs_db_l_bits        /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko
>    1931.00  0.9% load_balance_fair           [kernel.kallsyms]
>    1917.00  0.9% ktime_get                   [kernel.kallsyms]
>    1624.00  0.7% walk_tg_tree.clone.129      [kernel.kallsyms]
>    1542.00  0.7% find_busiest_group          [kernel.kallsyms]
>    1326.00  0.6% find_next_bit               [kernel.kallsyms]
>     673.00  0.3% lock_hrtimer_base.clone.25  [kernel.kallsyms]
>     624.00  0.3% copy_user_generic_string    [kernel.kallsyms]
>
> top now says:
>
> top - 00:11:35 up 22 min,  4 users,  load average: 0.11, 6.15, 7.78
> Tasks: 491 total,   3 running, 488 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.9%us, 15.4%sy,  0.0%ni, 83.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  99068660k total, 70831760k used, 28236900k free,    10036k buffers
> Swap:  2438140k total,  2173652k used,   264488k free,  3396144k cached
>
> With average 'resting cpu' per idle guest 8%, 96 guests running.
>
> Is this as good as I am going to get? It seems like I can't really
> debug that lock contention without blowing stuff up because of the
> load of the lock debugging...

Post perf.dat for the non-debug kernel, and we'll see. You're probably
hitting some Linux scaling limit, and these are continuously expanded
as big hardware becomes more common.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-21 16:16 ` Ben Nagy
  2011-03-21 16:33   ` Avi Kivity
@ 2011-03-21 16:54   ` Eric Dumazet
  2011-03-21 17:02     ` Avi Kivity
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-03-21 16:54 UTC (permalink / raw)
  To: ben; +Cc: Avi Kivity, KVM list

On Monday 21 March 2011 at 22:01 +0545, Ben Nagy wrote:
> >> On Mon, Mar 21, 2011 at 7:38 PM, Avi Kivity<avi@redhat.com> wrote:
> >> > In the future, please post the binary perf.dat.
> >>
> >> Hi Avi,
> >>
> >> How do I do that?
> >
> > 'make nconfig' and go to the kernel hacking section.
>
> Imprecise question, sorry - I meant how do I get the perf.dat, not how
> do I disable the debugging.
>
> On the non-debug kernel: Linux eax 2.6.38-7-server #36-Ubuntu SMP Fri
> Mar 18 23:36:13 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
>
>  150512.00 67.8% __ticket_spin_lock          [kernel.kallsyms]
>   11126.00  5.0% memcmp_pages                [kernel.kallsyms]
>    9563.00  4.3% native_safe_halt            [kernel.kallsyms]
>    8965.00  4.0% svm_vcpu_run                /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm-amd.ko
>    6489.00  2.9% tg_load_down                [kernel.kallsyms]
>    4676.00  2.1% kvm_get_cs_db_l_bits        /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko
>    1931.00  0.9% load_balance_fair           [kernel.kallsyms]
>    1917.00  0.9% ktime_get                   [kernel.kallsyms]
>    1624.00  0.7% walk_tg_tree.clone.129      [kernel.kallsyms]
>    1542.00  0.7% find_busiest_group          [kernel.kallsyms]
>    1326.00  0.6% find_next_bit               [kernel.kallsyms]
>     673.00  0.3% lock_hrtimer_base.clone.25  [kernel.kallsyms]
>     624.00  0.3% copy_user_generic_string    [kernel.kallsyms]
>
> top now says:
>
> top - 00:11:35 up 22 min,  4 users,  load average: 0.11, 6.15, 7.78
> Tasks: 491 total,   3 running, 488 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.9%us, 15.4%sy,  0.0%ni, 83.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  99068660k total, 70831760k used, 28236900k free,    10036k buffers
> Swap:  2438140k total,  2173652k used,   264488k free,  3396144k cached
>
> With average 'resting cpu' per idle guest 8%, 96 guests running.
>
> Is this as good as I am going to get? It seems like I can't really
> debug that lock contention without blowing stuff up because of the
> load of the lock debugging...
>
> Don't know if I mentioned this before, but those guests are each
> pinned to a cpu (cpu guestnum%48) with taskset.
>
> Cheers,

It seems you hit idr_lock contention (used in kernel/posix-timers.c)

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-21 16:54 ` Eric Dumazet
@ 2011-03-21 17:02   ` Avi Kivity
  2011-03-21 17:12     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2011-03-21 17:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: ben, KVM list

On 03/21/2011 06:54 PM, Eric Dumazet wrote:
> On Monday 21 March 2011 at 22:01 +0545, Ben Nagy wrote:
> > >> On Mon, Mar 21, 2011 at 7:38 PM, Avi Kivity<avi@redhat.com> wrote:
> > >> > In the future, please post the binary perf.dat.
> > >>
> > >> Hi Avi,
> > >>
> > >> How do I do that?
> > >
> > > 'make nconfig' and go to the kernel hacking section.
> >
> > Imprecise question, sorry - I meant how do I get the perf.dat, not how
> > do I disable the debugging.
> >
> > On the non-debug kernel: Linux eax 2.6.38-7-server #36-Ubuntu SMP Fri
> > Mar 18 23:36:13 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
> >
> >  150512.00 67.8% __ticket_spin_lock          [kernel.kallsyms]
> >   11126.00  5.0% memcmp_pages                [kernel.kallsyms]
> >    9563.00  4.3% native_safe_halt            [kernel.kallsyms]
> >    8965.00  4.0% svm_vcpu_run                /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm-amd.ko
> >    6489.00  2.9% tg_load_down                [kernel.kallsyms]
> >    4676.00  2.1% kvm_get_cs_db_l_bits        /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko
> >    1931.00  0.9% load_balance_fair           [kernel.kallsyms]
> >    1917.00  0.9% ktime_get                   [kernel.kallsyms]
> >    1624.00  0.7% walk_tg_tree.clone.129      [kernel.kallsyms]
> >    1542.00  0.7% find_busiest_group          [kernel.kallsyms]
> >    1326.00  0.6% find_next_bit               [kernel.kallsyms]
> >     673.00  0.3% lock_hrtimer_base.clone.25  [kernel.kallsyms]
> >     624.00  0.3% copy_user_generic_string    [kernel.kallsyms]
> >
> > top now says:
> >
> > top - 00:11:35 up 22 min,  4 users,  load average: 0.11, 6.15, 7.78
> > Tasks: 491 total,   3 running, 488 sleeping,   0 stopped,   0 zombie
> > Cpu(s):  0.9%us, 15.4%sy,  0.0%ni, 83.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > Mem:  99068660k total, 70831760k used, 28236900k free,    10036k buffers
> > Swap:  2438140k total,  2173652k used,   264488k free,  3396144k cached
> >
> > With average 'resting cpu' per idle guest 8%, 96 guests running.
> >
> > Is this as good as I am going to get? It seems like I can't really
> > debug that lock contention without blowing stuff up because of the
> > load of the lock debugging...
> >
> > Don't know if I mentioned this before, but those guests are each
> > pinned to a cpu (cpu guestnum%48) with taskset.
> >
> > Cheers,
>
> It seems you hit idr_lock contention (used in kernel/posix-timers.c)

Any ideas on how to fix it? We could pre-allocate IDs and batch them in
per-cpu caches, but it seems like a lot of work.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-21 17:02 ` Avi Kivity
@ 2011-03-21 17:12   ` Eric Dumazet
  2011-03-21 18:12     ` Ben Nagy
                        ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Eric Dumazet @ 2011-03-21 17:12 UTC (permalink / raw)
  To: Avi Kivity; +Cc: ben, KVM list

On Monday 21 March 2011 at 19:02 +0200, Avi Kivity wrote:

> Any ideas on how to fix it? We could pre-allocate IDs and batch them in
> per-cpu caches, but it seems like a lot of work.
>

Hmm, I don't know what syscalls kvm does, but even a timer_gettime() has
to lock this idr_lock.

Sounds crazy, and should be fixed in kernel code, not kvm ;)

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: KVM lock contention on 48 core AMD machine
  2011-03-21 17:12 ` Eric Dumazet
@ 2011-03-21 18:12   ` Ben Nagy
  2011-03-21 22:27     ` [RFC] posix-timers: RCU conversion Eric Dumazet
  2011-03-21 18:14   ` KVM lock contention on 48 core AMD machine Avi Kivity
  2011-03-21 18:48   ` Michael Tokarev
  2 siblings, 1 reply; 32+ messages in thread
From: Ben Nagy @ 2011-03-21 18:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Avi Kivity, KVM list

On Mon, Mar 21, 2011 at 10:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Monday 21 March 2011 at 19:02 +0200, Avi Kivity wrote:
>
>> Any ideas on how to fix it? We could pre-allocate IDs and batch them in
>> per-cpu caches, but it seems like a lot of work.
>>
>
> Hmm, I don't know what syscalls kvm does, but even a timer_gettime() has
> to lock this idr_lock.
>
> Sounds crazy, and should be fixed in kernel code, not kvm ;)

Hi,

I'll need to work out a way I can make the perf.data available
(~100M), but here's the summary

http://paste.ubuntu.com/583425/

And here's the summary of the summary

# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ..........................................
#
    71.86%      kvm  [kernel.kallsyms]  [k] __ticket_spin_lock
                |
                --- __ticket_spin_lock
                   |
                   |--54.19%-- default_spin_lock_flags
                   |           _raw_spin_lock_irqsave
                   |           |
                   |           --54.14%-- __lock_timer
                   |                      |
                   |                      |--31.92%-- sys_timer_gettime
                   |                      |           system_call_fastpath
                   |                      |
                   |                      --22.22%-- sys_timer_settime
                   |                                 system_call_fastpath
                   |
                   |--15.66%-- _raw_spin_lock
[...]

Which I guess seems to match what Eric just said.

I'll post a link to the full data tomorrow. Many thanks for the help
so far. If it's a kernel scaling limit then I guess we just wait until
the kernel gets better. :S I'll check it out with a linux guest
tomorrow anyway.

Cheers,

ben

^ permalink raw reply	[flat|nested] 32+ messages in thread
* [RFC] posix-timers: RCU conversion
  2011-03-21 18:12 ` Ben Nagy
@ 2011-03-21 22:27   ` Eric Dumazet
  2011-03-22  7:09     ` [PATCH] " Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-03-21 22:27 UTC (permalink / raw)
  To: ben
  Cc: Avi Kivity, KVM list, Thomas Gleixner, linux-kernel, John Stultz,
	Richard Cochran

On Monday 21 March 2011 at 23:57 +0545, Ben Nagy wrote:
> On Mon, Mar 21, 2011 at 10:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Monday 21 March 2011 at 19:02 +0200, Avi Kivity wrote:
> >
> >> Any ideas on how to fix it? We could pre-allocate IDs and batch them in
> >> per-cpu caches, but it seems like a lot of work.
> >>
> >
> > Hmm, I don't know what syscalls kvm does, but even a timer_gettime() has to
> > lock this idr_lock.
> >
> > Sounds crazy, and should be fixed in kernel code, not kvm ;)
>
> Hi,
>
> I'll need to work out a way I can make the perf.data available
> (~100M), but here's the summary
>
> http://paste.ubuntu.com/583425/
>
> And here's the summary of the summary
>
> # Overhead  Command  Shared Object      Symbol
> # ........  .......  .................  ..........................................
> #
>     71.86%      kvm  [kernel.kallsyms]  [k] __ticket_spin_lock
>                 |
>                 --- __ticket_spin_lock
>                    |
>                    |--54.19%-- default_spin_lock_flags
>                    |           _raw_spin_lock_irqsave
>                    |           |
>                    |           --54.14%-- __lock_timer
>                    |                      |
>                    |                      |--31.92%-- sys_timer_gettime
>                    |                      |           system_call_fastpath
>                    |                      |
>                    |                      --22.22%-- sys_timer_settime
>                    |                                 system_call_fastpath
>                    |
>                    |--15.66%-- _raw_spin_lock
> [...]
>
> Which I guess seems to match what Eric just said.
>
> I'll post a link to the full data tomorrow. Many thanks for the help
> so far. If it's a kernel scaling limit then I guess we just wait until
> the kernel gets better. :S I'll check it out with a linux guest
> tomorrow anyway.

Here is a quick & dirty patch to lower idr_lock use. You could try it
eventually ;)

Thanks

[RFC] posix-timers: RCU conversion

Ben Nagy reported a scalability problem with KVM/QEMU that hit very
hard a single spinlock (idr_lock) in posix-timers code.

Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu
time used in ticket_spin_lock.

Switching to RCU should be quite easy, IDR being already RCU ready.

idr_lock should be locked only before an insert/delete, not a find.

Reported-by: Ben Nagy <ben@iagu.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Richard Cochran <richard.cochran@omicron.at>
---
 include/linux/posix-timers.h |    1 +
 kernel/posix-timers.c        |   26 ++++++++++++++------------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d51243a..5dc27ca 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -81,6 +81,7 @@ struct k_itimer {
 			unsigned long expires;
 		} mmtimer;
 	} it;
+	struct rcu_head rcu;
 };
 
 struct k_clock {
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 4c01249..e2a823a 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -491,6 +491,11 @@ static struct k_itimer * alloc_posix_timer(void)
 	return tmr;
 }
 
+static void k_itimer_rcu_free(struct rcu_head *head)
+{
+	kmem_cache_free(posix_timers_cache, container_of(head, struct k_itimer, rcu));
+}
+
 #define IT_ID_SET	1
 #define IT_ID_NOT_SET	0
 static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
@@ -499,11 +504,12 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 		unsigned long flags;
 		spin_lock_irqsave(&idr_lock, flags);
 		idr_remove(&posix_timers_id, tmr->it_id);
+		tmr->it_id = -1;
 		spin_unlock_irqrestore(&idr_lock, flags);
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	kmem_cache_free(posix_timers_cache, tmr);
+	call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static struct k_clock *clockid_to_kclock(const clockid_t id)
@@ -631,22 +637,18 @@ out:
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 {
 	struct k_itimer *timr;
-	/*
-	 * Watch out here. We do a irqsave on the idr_lock and pass the
-	 * flags part over to the timer lock. Must not let interrupts in
-	 * while we are moving the lock.
-	 */
-	spin_lock_irqsave(&idr_lock, *flags);
+
+	rcu_read_lock();
 	timr = idr_find(&posix_timers_id, (int)timer_id);
 	if (timr) {
-		spin_lock(&timr->it_lock);
-		if (timr->it_signal == current->signal) {
-			spin_unlock(&idr_lock);
+		spin_lock_irqsave(&timr->it_lock, *flags);
+		if (timr->it_id == (int)timer_id && timr->it_signal == current->signal) {
+			rcu_read_unlock();
 			return timr;
 		}
-		spin_unlock(&timr->it_lock);
+		spin_unlock_irqrestore(&timr->it_lock, *flags);
 	}
-	spin_unlock_irqrestore(&idr_lock, *flags);
+	rcu_read_unlock();
 
 	return NULL;
 }

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* [PATCH] posix-timers: RCU conversion
  2011-03-21 22:27 ` [RFC] posix-timers: RCU conversion Eric Dumazet
@ 2011-03-22  7:09   ` Eric Dumazet
  2011-03-22  8:59     ` Ben Nagy
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-03-22  7:09 UTC (permalink / raw)
  To: ben, Thomas Gleixner
  Cc: Avi Kivity, KVM list, linux-kernel, John Stultz, Richard Cochran

Ben Nagy reported a scalability problem with KVM/QEMU that hit very
hard a single spinlock (idr_lock) in posix-timers code, on his 48 core
machine.

Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu
time used in ticket_spin_lock, from lock_timer

Ref: http://www.spinics.net/lists/kvm/msg51526.html

Switching to RCU is quite easy, IDR being already RCU ready. idr_lock
should be locked only for an insert/delete, not a lookup.

Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime().

Before :

real	1m18.669s
user	0m1.346s
sys	1m17.180s

After :

real	0m3.296s
user	0m1.366s
sys	0m1.926s

Reported-by: Ben Nagy <ben@iagu.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Richard Cochran <richard.cochran@omicron.at>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/posix-timers.h |    1 +
 kernel/posix-timers.c        |   25 ++++++++++++++-----------
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d51243a..5dc27ca 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -81,6 +81,7 @@ struct k_itimer {
 			unsigned long expires;
 		} mmtimer;
 	} it;
+	struct rcu_head rcu;
 };
 
 struct k_clock {
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 4c01249..acb9be9 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -491,6 +491,13 @@ static struct k_itimer * alloc_posix_timer(void)
 	return tmr;
 }
 
+static void k_itimer_rcu_free(struct rcu_head *head)
+{
+	struct k_itimer *tmr = container_of(head, struct k_itimer, rcu);
+
+	kmem_cache_free(posix_timers_cache, tmr);
+}
+
 #define IT_ID_SET	1
 #define IT_ID_NOT_SET	0
 static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
@@ -503,7 +510,7 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	kmem_cache_free(posix_timers_cache, tmr);
+	call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static struct k_clock *clockid_to_kclock(const clockid_t id)
@@ -631,22 +638,18 @@ out:
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 {
 	struct k_itimer *timr;
-	/*
-	 * Watch out here. We do a irqsave on the idr_lock and pass the
-	 * flags part over to the timer lock. Must not let interrupts in
-	 * while we are moving the lock.
-	 */
-	spin_lock_irqsave(&idr_lock, *flags);
+
+	rcu_read_lock();
 	timr = idr_find(&posix_timers_id, (int)timer_id);
 	if (timr) {
-		spin_lock(&timr->it_lock);
+		spin_lock_irqsave(&timr->it_lock, *flags);
 		if (timr->it_signal == current->signal) {
-			spin_unlock(&idr_lock);
+			rcu_read_unlock();
 			return timr;
 		}
-		spin_unlock(&timr->it_lock);
+		spin_unlock_irqrestore(&timr->it_lock, *flags);
 	}
-	spin_unlock_irqrestore(&idr_lock, *flags);
+	rcu_read_unlock();
 
 	return NULL;
 }

^ permalink raw reply related	[flat|nested] 32+ messages in thread
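[A minimal sketch of the kind of microbenchmark Eric describes: N
forked processes each create one POSIX timer and then call
timer_gettime() in a tight loop, which goes through __lock_timer() and,
before this patch, the global idr_lock. Eric's exact harness is not
shown in the thread, so the process and iteration counts below are
illustrative. Build with `gcc -O2 bench.c -lrt` and compare
`time ./bench` before and after the patch.]

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    #define NPROC 16            /* one process per cpu, as in Eric's 2x4x2 test */
    #define ITERS 10000000L

    static void hammer(void)
    {
        timer_t tid;
        struct sigevent sev = { .sigev_notify = SIGEV_NONE };
        struct itimerspec its;

        if (timer_create(CLOCK_MONOTONIC, &sev, &tid)) {
            perror("timer_create");
            exit(1);
        }
        for (long i = 0; i < ITERS; i++)
            timer_gettime(tid, &its);   /* each call goes through __lock_timer() */
        timer_delete(tid);
    }

    int main(void)
    {
        for (int i = 0; i < NPROC; i++)
            if (fork() == 0) {
                hammer();
                _exit(0);
            }
        while (wait(NULL) > 0)
            ;                           /* reap all children */
        return 0;
    }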
* Re: [PATCH] posix-timers: RCU conversion
  2011-03-22  7:09 ` [PATCH] " Eric Dumazet
@ 2011-03-22  8:59   ` Ben Nagy
  2011-03-22 10:35     ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Nagy @ 2011-03-22  8:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, Avi Kivity, KVM list, linux-kernel, John Stultz,
	Richard Cochran

On Tue, Mar 22, 2011 at 12:54 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard
> a single spinlock (idr_lock) in posix-timers code, on his 48 core
> machine.

Hi all,

Thanks a lot for all the help so far. We've tested with Eric's patch.

First up, here's our version of the patch for the current ubuntu
kernel from git:
http://paste.ubuntu.com/583668/

Here's top with 96 idle guests running:

top - 16:47:53 up 1:09, 3 users, load average: 0.00, 0.01, 0.05
Tasks: 499 total,   3 running, 496 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.9%us,  3.2%sy,  0.0%ni, 95.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  99068656k total, 13121096k used, 85947560k free,    22192k buffers
Swap:  2438140k total,        0k used,  2438140k free,  3597860k cached

(much better!)

Start of perf top:

-------------------------------------------------------------------------------
   PerfTop:  10318 irqs/sec  kernel:97.4%  exact:  0.0% [1000Hz cycles],  (all, 48 CPUs)
-------------------------------------------------------------------------------

             samples  pcnt function              DSO
             _______ _____ _____________________ _______________________________________________________

            95444.00 59.3% __ticket_spin_lock    [kernel.kallsyms]
            12937.00  8.0% native_safe_halt      [kernel.kallsyms]
             6149.00  3.8% kvm_get_cs_db_l_bits  /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko
             5105.00  3.2% tg_load_down          [kernel.kallsyms]
             5088.00  3.2% svm_vcpu_run          /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm-amd.ko
             4807.00  3.0% kvm_set_pfn_dirty     /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko
             2855.00  1.8% ktime_get             [kernel.kallsyms]
             1535.00  1.0% find_busiest_group    [kernel.kallsyms]
             1386.00  0.9% find_next_bit         [kernel.kallsyms]

Start of perf report -g

    55.26%      kvm  [kernel.kallsyms]  [k] __ticket_spin_lock
                |
                --- __ticket_spin_lock
                   |
                   |--94.68%-- _raw_spin_lock
                   |          |
                   |          |--97.55%-- double_rq_lock
                   |          |          load_balance
                   |          |          idle_balance
                   |          |          schedule
                   |          |          |
                   |          |          |--60.56%-- schedule_hrtimeout_range_clock
                   |          |          |          schedule_hrtimeout_range
                   |          |          |          poll_schedule_timeout
                   |          |          |          do_select
                   |          |          |          core_sys_select
                   |          |          |          sys_select
                   |          |          |          system_call_fastpath

Here is the perf.data from the unpatched (non debug) kernel
http://www.coseinc.com/woigbfwr32/perf.data

Here is the perf.data from the patched (non debug) kernel
http://www.coseinc.com/woigbfwr32/perf_patched.data

I think we're certainly in 'it's going to be useable' territory now,
but any further improvements or patches to test would of course be
gratefully received!

Next step from my end is to test the guests under load, unless there
are any other suggestions.

I'm extremely impressed by the speed and professionalism of the
response to this problem, both from those on #kvm and the widening
circle of those on this email thread. Many thanks!

Cheers,

ben

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] posix-timers: RCU conversion
  2011-03-22  8:59 ` Ben Nagy
@ 2011-03-22 10:35   ` Avi Kivity
  2011-04-04  3:30     ` Ben Nagy
                        ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Avi Kivity @ 2011-03-22 10:35 UTC (permalink / raw)
  To: ben
  Cc: Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel, John Stultz,
	Richard Cochran, Peter Zijlstra

On 03/22/2011 10:59 AM, Ben Nagy wrote:
> On Tue, Mar 22, 2011 at 12:54 PM, Eric Dumazet<eric.dumazet@gmail.com> wrote:
> > Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard
> > a single spinlock (idr_lock) in posix-timers code, on his 48 core
> > machine.
>
> Hi all,
>
> Thanks a lot for all the help so far. We've tested with Eric's patch.
>
> First up, here's our version of the patch for the current ubuntu
> kernel from git:
> http://paste.ubuntu.com/583668/
>
> Here's top with 96 idle guests running:
>
> top - 16:47:53 up 1:09, 3 users, load average: 0.00, 0.01, 0.05
> Tasks: 499 total,   3 running, 496 sleeping,   0 stopped,   0 zombie
> Cpu(s):  1.9%us,  3.2%sy,  0.0%ni, 95.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  99068656k total, 13121096k used, 85947560k free,    22192k buffers
> Swap:  2438140k total,        0k used,  2438140k free,  3597860k cached
>
> (much better!)
>
> Start of perf top:
>
>             95444.00 59.3% __ticket_spin_lock    [kernel.kallsyms]
>             12937.00  8.0% native_safe_halt      [kernel.kallsyms]
>              6149.00  3.8% kvm_get_cs_db_l_bits  /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko
>              5105.00  3.2% tg_load_down          [kernel.kallsyms]
>              5088.00  3.2% svm_vcpu_run          /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm-amd.ko
>              4807.00  3.0% kvm_set_pfn_dirty     /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko
>              2855.00  1.8% ktime_get             [kernel.kallsyms]
>              1535.00  1.0% find_busiest_group    [kernel.kallsyms]
>              1386.00  0.9% find_next_bit         [kernel.kallsyms]
>
> Start of perf report -g
>
>     55.26%      kvm  [kernel.kallsyms]  [k] __ticket_spin_lock
>                 |
>                 --- __ticket_spin_lock
>                    |
>                    |--94.68%-- _raw_spin_lock
>                    |          |
>                    |          |--97.55%-- double_rq_lock
>                    |          |          load_balance
>                    |          |          idle_balance
>                    |          |          schedule
>                    |          |          |
>                    |          |          |--60.56%-- schedule_hrtimeout_range_clock
>                    |          |          |          schedule_hrtimeout_range
>                    |          |          |          poll_schedule_timeout
>                    |          |          |          do_select
>                    |          |          |          core_sys_select
>                    |          |          |          sys_select
>                    |          |          |          system_call_fastpath

Looks like the posix-timer issue is completely gone, to be replaced by
the load balancer.

Copying peterz.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] posix-timers: RCU conversion
  2011-03-22 10:35 ` Avi Kivity
@ 2011-04-04  3:30   ` Ben Nagy
  2011-04-04  7:18     ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Ben Nagy @ 2011-04-04  3:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel, John Stultz,
	Richard Cochran, Peter Zijlstra

On Tue, Mar 22, 2011 at 4:20 PM, Avi Kivity <avi@redhat.com> wrote:
[...]
> Looks like the posix-timer issue is completely gone, to be replaced by the
> load balancer.
>
> Copying peterz.

Hi all,

I feel bad about such a big cc list, but I don't know who can be left out :/

Still got the performance issue with the load balancer.

Additionally, when trying to test with 48 VMs under load, they
consistently all lock up after a few minutes, which we traced as far
as ocfs2 dying. We're investigating it as an ocfs2 issue, and we'll
try and replicate on ext4 / nfs, but I just thought it might be
somehow related.

Apr  3 18:27:35 eax kernel: [16029.399507] ------------[ cut here ]------------
Apr  3 18:27:35 eax kernel: [16029.401408] kernel BUG at /home/fuzzadmin/src/natty/source/fs/jbd2/journal.c:1610!
Apr  3 18:27:35 eax kernel: [16029.404541] invalid opcode: 0000 [#1] SMP
Apr  3 18:27:35 eax kernel: [16029.406289] last sysfs file: /sys/devices/system/cpu/cpu47/cache/index2/shared_cpu_map
Apr  3 18:27:35 eax kernel: [16029.409453] CPU 36
Apr  3 18:27:35 eax kernel: [16029.409453] Modules linked in: ocfs2 quota_tree ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp w83627ehf hwmon_vid ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ipmi_si ipmi_msghandler ib_srp scsi_transport_srp scsi_tgt vesafb ib_ipoib ib_iser ib_umad iscsi_tcp rdma_ucm libiscsi_tcp rdma_cm libiscsi ib_cm iw_cm scsi_transport_iscsi ib_addr ib_sa ib_uverbs mlx4_ib ib_mad ib_core vhost_net amd64_edac_mod psmouse edac_core sp5100_tco ghes joydev serio_raw hed edac_mce_amd k10temp kvm_amd i2c_piix4 kvm lp parport usb_storage usbhid hid uas igb ahci pata_atiixp libahci dca mlx4_core
Apr  3 18:27:35 eax kernel: [16029.409453]
Apr  3 18:27:35 eax kernel: [16029.409453] Pid: 1746, comm: ocfs2cmt Not tainted 2.6.38-7-server #39 Supermicro H8QG6/H8QG6
Apr  3 18:27:35 eax kernel: [16029.409453] RIP: 0010:[<ffffffff8124905a>]  [<ffffffff8124905a>] jbd2_journal_flush+0x17a/0x190
Apr  3 18:27:35 eax kernel: [16029.409453] RSP: 0018:ffff880404be3dc0  EFLAGS: 00010286
Apr  3 18:27:35 eax kernel: [16029.409453] RAX: 0000000000000029 RBX: ffff88181593c000 RCX: 000000000000001e
Apr  3 18:27:35 eax kernel: [16029.409453] RDX: 00000000fffffffb RSI: ffff880404be3cd0 RDI: ffff88181593c024
Apr  3 18:27:35 eax kernel: [16029.409453] RBP: ffff880404be3df0 R08: ffff880404be2000 R09: 0000000000000000
Apr  3 18:27:35 eax kernel: [16029.409453] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000e5d
Apr  3 18:27:35 eax kernel: [16029.409453] R13: ffff88181593c39c R14: ffff88181593c024 R15: 0000000000000000
Apr  3 18:27:35 eax kernel: [16029.409453] FS:  00007f73affb7700(0000) GS:ffff881627c00000(0000) knlGS:0000000000000000
Apr  3 18:27:35 eax kernel: [16029.736049] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr  3 18:27:35 eax kernel: [16029.736049] CR2: 0000000002f72000 CR3: 00000002f0d31000 CR4: 00000000000006e0
Apr  3 18:27:35 eax kernel: [16029.736049] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
Apr  3 18:27:35 eax kernel: [16029.736049] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr  3 18:27:35 eax kernel: [16029.736049] Process ocfs2cmt (pid: 1746, threadinfo ffff880404be2000, task ffff8803f72396e0)
Apr  3 18:27:35 eax kernel: [16029.736049] Stack:
Apr  3 18:27:35 eax kernel: [16029.736049]  0000000000000001 ffff8806155eb800 ffff8806155eb838 ffff8803f72396e0
Apr  3 18:27:35 eax kernel: [16029.736049]  ffff8806160cb000 ffff8806160cb160 ffff880404be3e40 ffffffffa042af12
Apr  3 18:27:35 eax kernel: [16029.736049]  0000000000000000 ffff8806155eb860 0000000000000286 ffff8806155eb828
Apr  3 18:27:35 eax kernel: [16029.736049] Call Trace:
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffffa042af12>] ocfs2_commit_cache+0xc2/0x330 [ocfs2]
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffffa042b1e1>] ocfs2_commit_thread+0x61/0x210 [ocfs2]
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff81087870>] ? autoremove_wake_function+0x0/0x40
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffffa042b180>] ? ocfs2_commit_thread+0x0/0x210 [ocfs2]
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff81087126>] kthread+0x96/0xa0
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff81087090>] ? kthread+0x0/0xa0
Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
Apr  3 18:27:35 eax kernel: [16029.736049] Code: c0 5b 41 5c 41 5d 41 5e 41 5f c9 c3 0f 1f 44 00 00 4c 8b 63 58 4d 85 e4 0f 85 d2 fe ff ff f0 81 43 24 00 00 00 01 e9 da fe ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00
Apr  3 18:27:35 eax kernel: [16029.736049] RIP  [<ffffffff8124905a>] jbd2_journal_flush+0x17a/0x190
Apr  3 18:27:35 eax kernel: [16031.320082] ------------[ cut here ]------------

Thanks for any feedback...

Cheers,

ben

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] posix-timers: RCU conversion
  2011-04-04  3:30 ` Ben Nagy
@ 2011-04-04  7:18   ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2011-04-04  7:18 UTC (permalink / raw)
  To: ben
  Cc: Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel, John Stultz,
	Richard Cochran, Peter Zijlstra

On 04/04/2011 06:30 AM, Ben Nagy wrote:
> On Tue, Mar 22, 2011 at 4:20 PM, Avi Kivity<avi@redhat.com> wrote:
> [...]
> > Looks like the posix-timer issue is completely gone, to be replaced by the
> > load balancer.
> >
> > Copying peterz.
>
> Hi all,
>
> I feel bad about such a big cc list, but I don't know who can be left out :/
>
> Still got the performance issue with the load balancer.
>
> Additionally, when trying to test with 48 VMs under load, they
> consistently all lock up after a few minutes, which we traced as far
> as ocfs2 dying. We're investigating it as an ocfs2 issue, and we'll
> try and replicate on ext4 / nfs, but I just thought it might be
> somehow related.
>
> Apr  3 18:27:35 eax kernel: [16029.399507] ------------[ cut here ]------------
> Apr  3 18:27:35 eax kernel: [16029.401408] kernel BUG at /home/fuzzadmin/src/natty/source/fs/jbd2/journal.c:1610!
> Apr  3 18:27:35 eax kernel: [16029.404541] invalid opcode: 0000 [#1] SMP
> Apr  3 18:27:35 eax kernel: [16029.406289] last sysfs file: /sys/devices/system/cpu/cpu47/cache/index2/shared_cpu_map
> Apr  3 18:27:35 eax kernel: [16029.409453] CPU 36
> Apr  3 18:27:35 eax kernel: [16029.409453] Modules linked in: ocfs2 quota_tree ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp w83627ehf hwmon_vid ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ipmi_si ipmi_msghandler ib_srp scsi_transport_srp scsi_tgt vesafb ib_ipoib ib_iser ib_umad iscsi_tcp rdma_ucm libiscsi_tcp rdma_cm libiscsi ib_cm iw_cm scsi_transport_iscsi ib_addr ib_sa ib_uverbs mlx4_ib ib_mad ib_core vhost_net amd64_edac_mod psmouse edac_core sp5100_tco ghes joydev serio_raw hed edac_mce_amd k10temp kvm_amd i2c_piix4 kvm lp parport usb_storage usbhid hid uas igb ahci pata_atiixp libahci dca mlx4_core
> Apr  3 18:27:35 eax kernel: [16029.409453]
> Apr  3 18:27:35 eax kernel: [16029.409453] Pid: 1746, comm: ocfs2cmt Not tainted 2.6.38-7-server #39 Supermicro H8QG6/H8QG6
> Apr  3 18:27:35 eax kernel: [16029.409453] RIP: 0010:[<ffffffff8124905a>]  [<ffffffff8124905a>] jbd2_journal_flush+0x17a/0x190
> Apr  3 18:27:35 eax kernel: [16029.409453] RSP: 0018:ffff880404be3dc0  EFLAGS: 00010286
> Apr  3 18:27:35 eax kernel: [16029.409453] RAX: 0000000000000029 RBX: ffff88181593c000 RCX: 000000000000001e
> Apr  3 18:27:35 eax kernel: [16029.409453] RDX: 00000000fffffffb RSI: ffff880404be3cd0 RDI: ffff88181593c024
> Apr  3 18:27:35 eax kernel: [16029.409453] RBP: ffff880404be3df0 R08: ffff880404be2000 R09: 0000000000000000
> Apr  3 18:27:35 eax kernel: [16029.409453] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000e5d
> Apr  3 18:27:35 eax kernel: [16029.409453] R13: ffff88181593c39c R14: ffff88181593c024 R15: 0000000000000000
> Apr  3 18:27:35 eax kernel: [16029.409453] FS:  00007f73affb7700(0000) GS:ffff881627c00000(0000) knlGS:0000000000000000
> Apr  3 18:27:35 eax kernel: [16029.736049] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Apr  3 18:27:35 eax kernel: [16029.736049] CR2: 0000000002f72000 CR3: 00000002f0d31000 CR4: 00000000000006e0
> Apr  3 18:27:35 eax kernel: [16029.736049] DR0: 00000000000000a0 DR1: 0000000000000000 DR2: 0000000000000003
> Apr  3 18:27:35 eax kernel: [16029.736049] DR3: 00000000000000b0 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr  3 18:27:35 eax kernel: [16029.736049] Process ocfs2cmt (pid: 1746, threadinfo ffff880404be2000, task ffff8803f72396e0)
> Apr  3 18:27:35 eax kernel: [16029.736049] Stack:
> Apr  3 18:27:35 eax kernel: [16029.736049]  0000000000000001 ffff8806155eb800 ffff8806155eb838 ffff8803f72396e0
> Apr  3 18:27:35 eax kernel: [16029.736049]  ffff8806160cb000 ffff8806160cb160 ffff880404be3e40 ffffffffa042af12
> Apr  3 18:27:35 eax kernel: [16029.736049]  0000000000000000 ffff8806155eb860 0000000000000286 ffff8806155eb828
> Apr  3 18:27:35 eax kernel: [16029.736049] Call Trace:
> Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffffa042af12>] ocfs2_commit_cache+0xc2/0x330 [ocfs2]
> Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffffa042b1e1>] ocfs2_commit_thread+0x61/0x210 [ocfs2]
> Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff81087870>] ? autoremove_wake_function+0x0/0x40
> Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffffa042b180>] ? ocfs2_commit_thread+0x0/0x210 [ocfs2]
> Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff81087126>] kthread+0x96/0xa0
> Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
> Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff81087090>] ? kthread+0x0/0xa0
> Apr  3 18:27:35 eax kernel: [16029.736049]  [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
> Apr  3 18:27:35 eax kernel: [16029.736049] Code: c0 5b 41 5c 41 5d 41 5e 41 5f c9 c3 0f 1f 44 00 00 4c 8b 63 58 4d 85 e4 0f 85 d2 fe ff ff f0 81 43 24 00 00 00 01 e9 da fe ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00
> Apr  3 18:27:35 eax kernel: [16029.736049] RIP  [<ffffffff8124905a>] jbd2_journal_flush+0x17a/0x190
> Apr  3 18:27:35 eax kernel: [16031.320082] ------------[ cut here ]------------

Please report this to the ocfs2 developers.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] posix-timers: RCU conversion
  2011-03-22 10:35 ` Avi Kivity
  2011-04-04  3:30   ` Ben Nagy
@ 2011-04-05  7:49   ` Peter Zijlstra
  2011-04-05  8:16     ` Avi Kivity
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2011-04-05  7:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ben, Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel,
	John Stultz, Richard Cochran

On Tue, 2011-03-22 at 12:35 +0200, Avi Kivity wrote:
> Looks like the posix-timer issue is completely gone, to be replaced by
> the load balancer.

-ENOINFO, no kernel version, no setup, no workload, no nothing.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] posix-timers: RCU conversion
  2011-04-05  7:49 ` Peter Zijlstra
@ 2011-04-05  8:16   ` Avi Kivity
  0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2011-04-05  8:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: ben, Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel,
	John Stultz, Richard Cochran

On 04/05/2011 10:49 AM, Peter Zijlstra wrote:
> On Tue, 2011-03-22 at 12:35 +0200, Avi Kivity wrote:
> > Looks like the posix-timer issue is completely gone, to be replaced by
> > the load balancer.
>
> -ENOINFO, no kernel version, no setup, no workload, no nothing.

http://www.spinics.net/lists/kvm/msg51526.html

plus patch in

http://www.spinics.net/lists/kvm/msg51595.html

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] posix-timers: RCU conversion
  2011-03-22 10:35 ` Avi Kivity
  2011-04-04  3:30   ` Ben Nagy
  2011-04-05  7:49   ` Peter Zijlstra
@ 2011-04-05  8:48   ` Peter Zijlstra
  2011-04-05  8:56     ` Avi Kivity
  2011-04-05  8:56     ` Mike Galbraith
  2 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2011-04-05  8:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ben, Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel,
	John Stultz, Richard Cochran, Mike Galbraith

On Tue, 2011-03-22 at 12:35 +0200, Avi Kivity wrote:
> > Here's top with 96 idle guests running:

On some hacked up 2.6.38 kernel...

> > Start of perf report -g
> >
> >     55.26%      kvm  [kernel.kallsyms]  [k] __ticket_spin_lock
> >                 |
> >                 --- __ticket_spin_lock
> >                    |
> >                    |--94.68%-- _raw_spin_lock
> >                    |          |
> >                    |          |--97.55%-- double_rq_lock
> >                    |          |          load_balance
> >                    |          |          idle_balance
> >                    |          |          schedule
> >                    |          |          |
> >                    |          |          |--60.56%-- schedule_hrtimeout_range_clock
> >                    |          |          |          schedule_hrtimeout_range
> >                    |          |          |          poll_schedule_timeout
> >                    |          |          |          do_select
> >                    |          |          |          core_sys_select
> >                    |          |          |          sys_select
> >                    |          |          |          system_call_fastpath

Looks like your workload and idle balancing don't much like each other.

What I think is happening is that all your 'idle' qemu thingies keep
waking up frequently and because you've got like twice the number of
qemu instances as you've got cpus there's a fair chance you'll have a
cpu with a pending task while another one goes idle.

(Why does qemu keep waking if it's idle? broken NOHZ?)

So idle balancing is called when the cpu goes idle (context switch to
the idle thread) and tries to steal a pending task from another cpu;
clearly it keeps finding these tasks, otherwise it wouldn't try to
take that lock.

Mike, you built in some idle balance throttle logic, but that seems
defeated here (possibly because it keeps finding pending tasks to
migrate? still need morning juice).

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] posix-timers: RCU conversion
  2011-04-05  8:48 ` Peter Zijlstra
@ 2011-04-05  8:56   ` Avi Kivity
  2011-04-05  9:03     ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Avi Kivity @ 2011-04-05  8:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: ben, Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel,
	John Stultz, Richard Cochran, Mike Galbraith

On 04/05/2011 11:48 AM, Peter Zijlstra wrote:
> What I think is happening is that all your 'idle' qemu thingies keep
> waking up frequently and because you've got like twice the number of
> qemu instances as you've got cpus there's a fair chance you'll have a
> cpu with a pending task while another one goes idle.

Those qemus have two threads each (one vcpu thread, one io thread).

> (Why does qemu keep waking if it's idle? broken NOHZ?)

Could be waking up due to guest wakeups, or qemu internal wakeups
(display refresh) or due to guest timer sources which are masked away
in the guest (if that's the case we should optimize it away).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [PATCH] posix-timers: RCU conversion
2011-04-05 8:56 ` Avi Kivity
@ 2011-04-05 9:03 ` Peter Zijlstra
2011-04-05 9:08 ` Avi Kivity
2011-04-05 9:50 ` Ben Nagy
0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2011-04-05 9:03 UTC (permalink / raw)
To: Avi Kivity
Cc: ben, Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel,
John Stultz, Richard Cochran, Mike Galbraith

On Tue, 2011-04-05 at 11:56 +0300, Avi Kivity wrote:
>
> Could be waking up due to guest wakeups, or qemu internal wakeups
> (display refresh), or due to guest timer sources which are masked away
> in the guest (if that's the case we should optimize it away).

Right, so I guess we're all clutching at straws here :-)

Ben, how usable is that system when it's in that state? Could you run a
function trace, or a trace with all kvm and sched trace-events enabled?
* Re: [PATCH] posix-timers: RCU conversion
2011-04-05 9:03 ` Peter Zijlstra
@ 2011-04-05 9:08 ` Avi Kivity
1 sibling, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2011-04-05 9:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: ben, Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel,
John Stultz, Richard Cochran, Mike Galbraith

On 04/05/2011 12:03 PM, Peter Zijlstra wrote:
> On Tue, 2011-04-05 at 11:56 +0300, Avi Kivity wrote:
> >
> > Could be waking up due to guest wakeups, or qemu internal wakeups
> > (display refresh), or due to guest timer sources which are masked away
> > in the guest (if that's the case we should optimize it away).
>
> Right, so I guess we're all clutching at straws here :-)
>
> Ben, how usable is that system when it's in that state? Could you run a
> function trace, or a trace with all kvm and sched trace-events enabled?

Ben, to do that, look up http://www.linux-kvm.org/page/Tracing; the
command for combined sched/kvm tracing is

# trace-cmd record -b 20000 -e kvm -e sched

--
error compiling committee.c: too many arguments to function
* Re: [PATCH] posix-timers: RCU conversion
2011-04-05 9:03 ` Peter Zijlstra
2011-04-05 9:08 ` Avi Kivity
@ 2011-04-05 9:50 ` Ben Nagy
1 sibling, 0 replies; 32+ messages in thread
From: Ben Nagy @ 2011-04-05 9:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Avi Kivity, Eric Dumazet, Thomas Gleixner, KVM list, linux-kernel,
John Stultz, Richard Cochran, Mike Galbraith

On Tue, Apr 5, 2011 at 2:48 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-04-05 at 11:56 +0300, Avi Kivity wrote:
>>
>> Could be waking up due to guest wakeups, or qemu internal wakeups
>> (display refresh), or due to guest timer sources which are masked away
>> in the guest (if that's the case we should optimize it away).
>
> Right, so I guess we're all clutching at straws here :-)
>
> Ben, how usable is that system when it's in that state? Could you run a
> function trace, or a trace with all kvm and sched trace-events enabled?

I'm just rebuilding the storage on the network to work around an ocfs2
kernel oops (trying nfs/rdma), so I can't test anything just at the
moment.

I ran some tests under load with the local ext4 ssd and, weirdly,
everything looked to be just fine - the bulk of the system time was in
svm_vcpu_run, which is as it should be, I guess - but that was with only
60 loaded guests.

I'll be able to repeat the same workload test tomorrow, and I'll see how
the perf top stuff looks. I should also be able to repeat the '96 idle
guests' test and see if it's the same - if so we'll do that tracing.

My kernel's a moving target at the moment, sorry; we're tracking the
natty git (with Eric's rcu patch merged in).

Thanks, all,

ben
* Re: [PATCH] posix-timers: RCU conversion
2011-04-05 8:48 ` Peter Zijlstra
2011-04-05 8:56 ` Avi Kivity
@ 2011-04-05 8:56 ` Mike Galbraith
1 sibling, 0 replies; 32+ messages in thread
From: Mike Galbraith @ 2011-04-05 8:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Avi Kivity, ben, Eric Dumazet, Thomas Gleixner, KVM list,
linux-kernel, John Stultz, Richard Cochran

On Tue, 2011-04-05 at 10:48 +0200, Peter Zijlstra wrote:
> On Tue, 2011-03-22 at 12:35 +0200, Avi Kivity wrote:
> >> Here's top with 96 idle guests running:
>
> On some hacked up 2.6.38 kernel...
>
> > > Start of perf report -g
> > > 55.26% kvm [kernel.kallsyms] [k] __ticket_spin_lock
> > > |
> > > --- __ticket_spin_lock
> > >    |
> > >    |--94.68%-- _raw_spin_lock
> > >    |          |
> > >    |          |--97.55%-- double_rq_lock
> > >    |          |           load_balance
> > >    |          |           idle_balance
> > >    |          |           schedule
> > >    |          |           |
> > >    |          |           |--60.56%-- schedule_hrtimeout_range_clock
> > >    |          |           |           schedule_hrtimeout_range
> > >    |          |           |           poll_schedule_timeout
> > >    |          |           |           do_select
> > >    |          |           |           core_sys_select
> > >    |          |           |           sys_select
> > >    |          |           |           system_call_fastpath
>
> Looks like your workload and idle balancing don't much like each other.
>
> What I think is happening is that all your 'idle' qemu thingies keep
> waking up frequently, and because you've got roughly twice as many qemu
> instances as you've got cpus there's a fair chance you'll have a cpu
> with a pending task while another one goes idle.
>
> (Why does qemu keep waking if it's idle? Broken NOHZ?)
>
> So idle balancing is called when the cpu goes idle (context switch to
> the idle thread) and tries to steal a pending task from another cpu;
> clearly it keeps finding these tasks, otherwise it wouldn't try to take
> that lock.
>
> Mike, you built in some idle balance throttle logic, but that seems
> defeated here (possibly because it keeps finding pending tasks to
> migrate? still need morning juice).

Hm. Maybe someone set sysctl_sched_migration_cost to 0; that, or
wakeups aren't arriving all that frequently, is all that pops to mind
atm.

-Mike
* Re: KVM lock contention on 48 core AMD machine
2011-03-21 17:12 ` Eric Dumazet
2011-03-21 18:12 ` Ben Nagy
@ 2011-03-21 18:14 ` Avi Kivity
2011-03-21 18:48 ` Michael Tokarev
2 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2011-03-21 18:14 UTC (permalink / raw)
To: Eric Dumazet; +Cc: ben, KVM list

On 03/21/2011 07:12 PM, Eric Dumazet wrote:
> On Monday, 2011-03-21 at 19:02 +0200, Avi Kivity wrote:
>
> > Any ideas on how to fix it? We could pre-allocate IDs and batch them in
> > per-cpu caches, but it seems like a lot of work.
>
> Hmm, I don't know what syscalls kvm does, but even a timer_gettime() has
> to take this idr_lock.

It's qemu playing with posix timers, nothing to do with kvm per se.

> Sounds crazy, and should be fixed in kernel code, not kvm ;)

Oh yes, I couldn't fix it in kvm if I wanted to (and I don't want to).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
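[The contention being described looks roughly like the first function
below: every timer syscall resolves its timer id through one global idr
protected by idr_lock, so hundreds of guests calling timer_gettime()
all serialize on the same cache line. The RCU conversion makes that
read side lockless, as in the second function. Both are simplified
sketches - irq flags, refcounting, and validity checks are omitted, and
freeing must also be RCU-deferred so a concurrent lookup never touches
dead memory - see kernel/posix-timers.c and the actual patch for the
real lock_timer().]

static struct k_itimer *lock_timer_before(timer_t timer_id)
{
	struct k_itimer *timr;

	spin_lock(&idr_lock);	/* global lock: the contention point */
	timr = idr_find(&posix_timers_id, (int)timer_id);
	if (timr)
		spin_lock(&timr->it_lock);	/* per-timer lock */
	spin_unlock(&idr_lock);
	return timr;
}

static struct k_itimer *lock_timer_after(timer_t timer_id)
{
	struct k_itimer *timr;

	rcu_read_lock();	/* lockless readers scale with cpu count */
	timr = idr_find(&posix_timers_id, (int)timer_id);
	if (timr)
		spin_lock(&timr->it_lock);
	rcu_read_unlock();
	return timr;
}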
* Re: KVM lock contention on 48 core AMD machine
2011-03-21 17:12 ` Eric Dumazet
2011-03-21 18:12 ` Ben Nagy
2011-03-21 18:14 ` KVM lock contention on 48 core AMD machine Avi Kivity
@ 2011-03-21 18:48 ` Michael Tokarev
2011-03-21 18:53 ` Avi Kivity
2 siblings, 1 reply; 32+ messages in thread
From: Michael Tokarev @ 2011-03-21 18:48 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Avi Kivity, ben, KVM list

21.03.2011 20:12, Eric Dumazet wrote:
> On Monday, 2011-03-21 at 19:02 +0200, Avi Kivity wrote:
>
>> Any ideas on how to fix it? We could pre-allocate IDs and batch them in
>> per-cpu caches, but it seems like a lot of work.
>>
>
> Hmm, I don't know what syscalls kvm does, but even a timer_gettime() has
> to take this idr_lock.

I wonder why my testcase - 160 kvm guests on a 2-core AMD AthlonII CPU -
shows no contention of that sort, not at all... Yes, sure, all these
(linux) guests are idling at the login: prompt, but I'd expect at least
some gettimeofday() calls and some load average; everything stays close
to zero.

/mjt
* Re: KVM lock contention on 48 core AMD machine
2011-03-21 18:48 ` Michael Tokarev
@ 2011-03-21 18:53 ` Avi Kivity
0 siblings, 0 replies; 32+ messages in thread
From: Avi Kivity @ 2011-03-21 18:53 UTC (permalink / raw)
To: Michael Tokarev; +Cc: Eric Dumazet, ben, KVM list

On 03/21/2011 08:48 PM, Michael Tokarev wrote:
> 21.03.2011 20:12, Eric Dumazet wrote:
> > On Monday, 2011-03-21 at 19:02 +0200, Avi Kivity wrote:
> >
> >> Any ideas on how to fix it? We could pre-allocate IDs and batch them in
> >> per-cpu caches, but it seems like a lot of work.
> >>
> >
> > Hmm, I don't know what syscalls kvm does, but even a timer_gettime() has
> > to take this idr_lock.
>
> I wonder why my testcase - 160 kvm guests on a 2-core AMD AthlonII CPU -
> shows no contention of that sort, not at all... Yes, sure, all these
> (linux) guests are idling at the login: prompt, but I'd expect at least
> some gettimeofday() calls and some load average; everything stays close
> to zero.

For this workload, spinlock contention only happens on large machines;
it (mostly) doesn't matter how many guests you have on a small host.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
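[The scaling point follows from the lock itself. __ticket_spin_lock,
which dominates the profiles earlier in this thread, hands each waiter
a ticket and serves them in order, so the time a waiter spins grows
with the number of contending cpus: two cores rarely queue, 48 cores
queue constantly. A toy C11 rendering of the idea - a sketch only, not
the kernel's x86 implementation:]

#include <stdatomic.h>
#include <stdio.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != me)
		;	/* spin; the real lock inserts pause/backoff */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);
}

int main(void)
{
	struct ticket_lock l = { 0, 0 };

	ticket_lock(&l);
	ticket_unlock(&l);
	printf("ok\n");
	return 0;
}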
* Re: KVM lock contention on 48 core AMD machine
2011-03-18 12:02 KVM lock contention on 48 core AMD machine Ben Nagy
2011-03-18 12:30 ` Joerg Roedel
@ 2011-03-18 12:44 ` Stefan Hajnoczi
1 sibling, 0 replies; 32+ messages in thread
From: Stefan Hajnoczi @ 2011-03-18 12:44 UTC (permalink / raw)
To: ben; +Cc: kvm

On Fri, Mar 18, 2011 at 12:02 PM, Ben Nagy <ben@iagu.net> wrote:
> KVM commandline (using libvirt):
> LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin
> QEMU_AUDIO_DRV=none /usr/local/bin/kvm-snapshot -S -M pc-0.14
> -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name fb-0
> -uuid de59229b-eb06-9ecc-758e-d20bc5ddc291 -nodefconfig -nodefaults
> -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/fb-0.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=readline -rtc base=localtime
> -no-acpi -boot cd -drive
> file=/mnt/big/bigfiles/kvm_disks/eax/fb-0.ovl,if=none,id=drive-ide0-0-0,format=qcow2
> -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
> -drive if=none,media=cdrom,id=drive-ide0-0-1,readonly=on,format=raw
> -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1
> -netdev tap,fd=17,id=hostnet0 -device
> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d9:09:ef,bus=pci.0,addr=0x3
> -usb -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -k en-us -vga
> cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4

Please try without -usb -device usb-tablet,id=input0. That is known to
cause increased CPU utilization.

I notice that idr_lock is either Infiniband or POSIX timers related:

drivers/infiniband/core/sa_query.c
kernel/posix-timers.c

-usb sets up a 1000 Hz timer for each VM.

Stefan
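[To make the 1000 Hz point concrete: that is one expiry per millisecond
per VM, and the timer_* syscalls that drive it go through the
posix-timers idr lookup discussed elsewhere in this thread. Below is a
minimal standalone demo of such a timer - a hypothetical illustration,
not qemu's code; on older glibc, link with -lrt.]

#include <signal.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	timer_t tid;
	struct sigevent sev = {
		.sigev_notify = SIGEV_SIGNAL,
		.sigev_signo  = SIGALRM,
	};
	struct itimerspec its = {
		.it_interval = { .tv_sec = 0, .tv_nsec = 1000000 },	/* 1 ms */
		.it_value    = { .tv_sec = 0, .tv_nsec = 1000000 },
	};
	sigset_t set;
	int sig, ticks = 0;

	/* Block SIGALRM so expirations can be collected with sigwait(). */
	sigemptyset(&set);
	sigaddset(&set, SIGALRM);
	sigprocmask(SIG_BLOCK, &set, NULL);

	if (timer_create(CLOCK_MONOTONIC, &sev, &tid) != 0) {
		perror("timer_create");
		return 1;
	}
	timer_settime(tid, 0, &its, NULL);	/* start the 1000 Hz tick */

	while (ticks < 1000) {	/* roughly one second of wakeups */
		sigwait(&set, &sig);
		ticks++;
	}
	printf("%d expirations\n", ticks);
	timer_delete(tid);
	return 0;
}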
Thread overview: 32+ messages
2011-03-18 12:02 KVM lock contention on 48 core AMD machine Ben Nagy
2011-03-18 12:30 ` Joerg Roedel
2011-03-19 4:45 ` Ben Nagy
2011-03-21 9:50 ` Avi Kivity
2011-03-21 11:43 ` Ben Nagy
2011-03-21 13:41 ` Ben Nagy
2011-03-21 13:53 ` Avi Kivity
[not found] ` <AANLkTikWQS281kTtJ32-qo5U+w_BAak7qUwVhUQgOxxv@mail.gmail.com>
2011-03-21 15:50 ` Avi Kivity
2011-03-21 16:16 ` Ben Nagy
2011-03-21 16:33 ` Avi Kivity
2011-03-21 16:54 ` Eric Dumazet
2011-03-21 17:02 ` Avi Kivity
2011-03-21 17:12 ` Eric Dumazet
2011-03-21 18:12 ` Ben Nagy
2011-03-21 22:27 ` [RFC] posix-timers: RCU conversion Eric Dumazet
2011-03-22 7:09 ` [PATCH] " Eric Dumazet
2011-03-22 8:59 ` Ben Nagy
2011-03-22 10:35 ` Avi Kivity
2011-04-04 3:30 ` Ben Nagy
2011-04-04 7:18 ` Avi Kivity
2011-04-05 7:49 ` Peter Zijlstra
2011-04-05 8:16 ` Avi Kivity
2011-04-05 8:48 ` Peter Zijlstra
2011-04-05 8:56 ` Avi Kivity
2011-04-05 9:03 ` Peter Zijlstra
2011-04-05 9:08 ` Avi Kivity
2011-04-05 9:50 ` Ben Nagy
2011-04-05 8:56 ` Mike Galbraith
2011-03-21 18:14 ` KVM lock contention on 48 core AMD machine Avi Kivity
2011-03-21 18:48 ` Michael Tokarev
2011-03-21 18:53 ` Avi Kivity
2011-03-18 12:44 ` Stefan Hajnoczi