From: Troy Benjegerdes <hozer@hozed.org>
To: Richard Davies <richard@arachsys.com>
Cc: Avi Kivity <avi@redhat.com>, kvm@vger.kernel.org, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Windows slow boot: contractor wanted
Date: Tue, 21 Aug 2012 10:39:46 -0500 [thread overview]
Message-ID: <20120821153946.GO7722@us.grid.coop> (raw)
In-Reply-To: <20120821152107.GA16363@alpha.arachsys.com>
Do you have any way to determine what CPU groups the different VMs
are running on?
If you end up in an overcommit situation where half the 'virtual'
cpus are on one AMD socket, and the other half are on a different
AMD socket, then you'll be thrashing the hypertransport link.
At Cray we were very carefull to never overcommit runnable processes
to CPUS, and generally locked processes to a single cpu.
Have a read of
http://berrange.com/posts/2010/02/12/controlling-guest-cpu-numa-affinity-in-libvirt-with-qemu-kvm-xen/
I'm going to speculate that when things don't work very well you end up with
memory from a booting guest scattered across many different NUMA nodes/cpus,
and then it really won't matter how good the spin loop/scheduler code is
because you are bound by the additional latency and bandwidth limitations of
running on one socekt and accessing half the memory that's resident on a
different socket.
On Tue, Aug 21, 2012 at 04:21:07PM +0100, Richard Davies wrote:
> Avi Kivity wrote:
> > Richard Davies wrote:
> > > We're running host kernel 3.5.1 and qemu-kvm 1.1.1.
> > >
> > > I hadn't though about it, but I agree this is related to cpu overcommit. The
> > > slow boots are intermittent (and infrequent) with cpu overcommit whereas I
> > > don't think it occurs without cpu overcommit.
> > >
> > > In addition, if there is a slow boot ongoing, and you kill some other VMs to
> > > reduce cpu overcommit then this will sometimes speed it up.
> > >
> > > I guess the question is why even with overcommit most boots are fine, but
> > > some small fraction then go slow?
> >
> > Could be a bug. The scheduler and the spin-loop handling code fight
> > each other instead of working well.
> >
> > Please provide snapshots of 'perf top' while a slow boot is in progress.
>
> Below are two 'perf top' snapshots during a slow boot, which appear to me to
> support your idea of a spin-lock problem.
>
> There are a lot more "unprocessable samples recorded" messages at the end of
> each snapshot which I haven't included. I think these may be from the guest
> OS - the kernel is listed, and qemu-kvm itself is listed on some other
> traces which I did, although not these.
>
> Richard.
>
>
>
> PerfTop: 62249 irqs/sec kernel:96.9% exact: 0.0% [4000Hz cycles], (all, 16 CPUs)
> --------------------------------------------------------------------------------------------------------------------------------
>
> 35.80% [kernel] [k] _raw_spin_lock_irqsave
> 21.64% [kernel] [k] isolate_freepages_block
> 5.91% [kernel] [k] yield_to
> 4.95% [kernel] [k] _raw_spin_lock
> 3.37% [kernel] [k] kvm_vcpu_on_spin
> 2.74% [kernel] [k] add_preempt_count
> 2.45% [kernel] [k] _raw_spin_unlock
> 2.33% [kernel] [k] sub_preempt_count
> 2.18% [kernel] [k] svm_vcpu_run
> 2.17% [kernel] [k] kvm_vcpu_yield_to
> 1.89% [kernel] [k] memcmp
> 1.50% [kernel] [k] get_pid_task
> 1.26% [kernel] [k] kvm_arch_vcpu_ioctl_run
> 1.16% [kernel] [k] pid_task
> 0.70% [kernel] [k] rcu_note_context_switch
> 0.70% [kernel] [k] trace_hardirqs_on
> 0.52% [kernel] [k] __rcu_read_unlock
> 0.51% [kernel] [k] trace_preempt_on
> 0.47% [kernel] [k] __srcu_read_lock
> 0.43% [kernel] [k] get_parent_ip
> 0.42% [kernel] [k] get_pageblock_flags_group
> 0.38% [kernel] [k] in_lock_functions
> 0.34% [kernel] [k] trace_preempt_off
> 0.34% [kernel] [k] trace_hardirqs_off
> 0.29% [kernel] [k] clear_page_c
> 0.23% [kernel] [k] __srcu_read_unlock
> 0.20% [kernel] [k] __rcu_read_lock
> 0.14% [kernel] [k] handle_exit
> 0.11% libc-2.10.1.so [.] strcmp
> 0.11% [kernel] [k] _raw_spin_unlock_irqrestore
> 0.11% [kernel] [k] _raw_spin_lock_irq
> 0.11% [kernel] [k] find_highest_vector
> 0.09% [kernel] [k] ktime_get
> 0.08% [kernel] [k] copy_page_c
> 0.08% [kernel] [k] pause_interception
> 0.08% [kernel] [k] kmem_cache_alloc
> 0.08% [kernel] [k] resched_task
> 0.08% perf [.] dso__find_symbol
> 0.06% [kernel] [k] compaction_alloc
> 0.06% libc-2.10.1.so [.] 0x0000000000076dab
> 0.06% [kernel] [k] read_tsc
> 0.06% perf [.] add_hist_entry
> 0.05% [kernel] [k] svm_read_l1_tsc
> 0.05% [kernel] [k] native_read_tsc
> 0.05% perf [.] sort__dso_cmp
> 0.05% [kernel] [k] copy_user_generic_string
> 0.05% [kernel] [k] ktime_get_update_offsets
> 0.04% [kernel] [k] kvm_check_async_pf_completion
> 0.04% [kernel] [k] __schedule
> 0.04% [kernel] [k] __rcu_pending
> 0.04% [kernel] [k] svm_complete_interrupts
> 0.04% [kernel] [k] perf_pmu_disable
> 0.04% [kernel] [k] isolate_migratepages_range
> 0.04% [kernel] [k] sched_clock_cpu
> 0.04% [kernel] [k] kvm_cpu_has_pending_timer
> 0.04% [kernel] [k] apic_timer_interrupt
> 0.04% [vdso] [.] 0x00007fff2e1ff607
> 0.04% [kernel] [k] apic_update_ppr
> 0.04% [kernel] [k] do_select
> 0.04% [kernel] [k] svm_scale_tsc
> 0.04% [kernel] [k] system_call_after_swapgs
> 0.03% [kernel] [k] kvm_lapic_get_cr8
> 0.03% perf [.] sort__sym_cmp
> 0.03% [kernel] [k] find_next_bit
> 0.03% [kernel] [k] kvm_set_cr8
> 0.03% [kernel] [k] rcu_check_callbacks
> 9750 unprocessable samples recorded.9751 unprocessable samples recorded.9752 unprocessable samples recorded.9753 unprocessable samples recorded.9754 unprocessable samples recorded.9755 unprocessable samples recorded.9756 unprocessable samples recorded.9757 u nprocessable samples recorded.9758 unprocessable samples recorded.9759 unprocessable samples recorded.9760 unprocessable samples recorded.9761 unprocessable samples recorded.9762 unprocessable samples recorded.9763 unprocessable samples recorded.
>
>
>
> PerfTop: 61584 irqs/sec kernel:97.4% exact: 0.0% [4000Hz cycles], (all, 16 CPUs)
> --------------------------------------------------------------------------------------------------------------------------------
>
> 36.73% [kernel] [k] _raw_spin_lock_irqsave
> 19.00% [kernel] [k] isolate_freepages_block
> 5.80% [kernel] [k] yield_to
> 5.23% [kernel] [k] _raw_spin_lock
> 3.97% [kernel] [k] kvm_vcpu_on_spin
> 2.98% [kernel] [k] add_preempt_count
> 2.45% [kernel] [k] sub_preempt_count
> 2.37% [kernel] [k] _raw_spin_unlock
> 2.22% [kernel] [k] svm_vcpu_run
> 2.19% [kernel] [k] kvm_vcpu_yield_to
> 1.90% [kernel] [k] memcmp
> 1.54% [kernel] [k] get_pid_task
> 1.39% [kernel] [k] kvm_arch_vcpu_ioctl_run
> 1.30% [kernel] [k] pid_task
> 0.75% [kernel] [k] rcu_note_context_switch
> 0.74% [kernel] [k] trace_hardirqs_on
> 0.58% [kernel] [k] __rcu_read_unlock
> 0.55% [kernel] [k] trace_preempt_on
> 0.47% [kernel] [k] __srcu_read_lock
> 0.44% [kernel] [k] get_parent_ip
> 0.41% [kernel] [k] clear_page_c
> 0.40% [kernel] [k] get_pageblock_flags_group
> 0.39% [kernel] [k] in_lock_functions
> 0.36% [kernel] [k] trace_preempt_off
> 0.35% [kernel] [k] trace_hardirqs_off
> 0.23% [kernel] [k] __srcu_read_unlock
> 0.20% [kernel] [k] __rcu_read_lock
> 0.15% [kernel] [k] _raw_spin_lock_irq
> 0.14% [kernel] [k] handle_exit
> 0.12% [kernel] [k] find_highest_vector
> 0.11% [kernel] [k] resched_task
> 0.10% libc-2.10.1.so [.] strcmp
> 0.09% [kernel] [k] _raw_spin_unlock_irqrestore
> 0.09% [kernel] [k] ktime_get
> 0.08% [kernel] [k] pause_interception
> 0.08% [kernel] [k] copy_page_c
> 0.07% [kernel] [k] __schedule
> 0.07% [kernel] [k] compact_zone
> 0.07% perf [.] dso__find_symbol
> 0.06% perf [.] add_hist_entry
> 0.06% [kernel] [k] read_tsc
> 0.06% [kernel] [k] svm_read_l1_tsc
> 0.05% [kernel] [k] native_read_tsc
> 0.05% [kernel] [k] ktime_get_update_offsets
> 0.05% [kernel] [k] compaction_alloc
> 0.05% libc-2.10.1.so [.] 0x0000000000073ae0
> 0.05% [kernel] [k] kmem_cache_alloc
> 0.05% [kernel] [k] svm_complete_interrupts
> 0.05% [kernel] [k] kvm_check_async_pf_completion
> 0.05% [kernel] [k] apic_timer_interrupt
> 0.05% perf [.] sort__dso_cmp
> 0.05% [kernel] [k] kvm_cpu_has_pending_timer
> 0.04% [kernel] [k] svm_scale_tsc
> 0.04% [kernel] [k] isolate_migratepages_range
> 0.04% [kernel] [k] sched_clock_cpu
> 0.04% [kernel] [k] __rcu_pending
> 0.04% [kernel] [k] apic_update_ppr
> 0.04% [kernel] [k] do_select
> 0.04% [kernel] [k] perf_pmu_disable
> 0.04% [kernel] [k] kvm_set_cr8
> 0.04% [kernel] [k] update_curr
> 0.04% [kernel] [k] reschedule_interrupt
> 0.03% [kernel] [k] kvm_lapic_get_cr8
> 0.03% libc-2.10.1.so [.] strstr
> 0.03% [kernel] [k] apic_has_pending_timer
> 0.03% perf [.] sort__sym_cmp
> 4963 unprocessable samples recorded.4964 unprocessable samples recorded.4965 unprocessable samples recorded.4966 unprocessable samples recorded.4967 unprocessable samples recorded.4968 unprocessable samples recorded.4969 unprocessable samples recorded.4970 unprocessable samples recorded.4971 unprocessable samples recorded.4972 unprocessable samples recorded.4973 unprocessable samples recorded.4974 unprocessable samples recorded.4975 unprocessable samples recorded.
>
next prev parent reply other threads:[~2012-08-21 15:40 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-08-16 10:47 [Qemu-devel] Windows slow boot: contractor wanted Richard Davies
2012-08-16 11:39 ` Avi Kivity
2012-08-17 12:36 ` Richard Davies
2012-08-17 13:02 ` Robert Vineyard
2012-08-18 14:44 ` Richard Davies
2012-08-19 5:02 ` Brian Jackson
2012-08-20 8:16 ` Richard Davies
2012-08-19 8:40 ` Avi Kivity
2012-08-19 8:51 ` Richard Davies
2012-08-19 14:04 ` Avi Kivity
2012-08-20 13:56 ` Richard Davies
2012-08-21 9:00 ` Avi Kivity
2012-08-21 15:21 ` Richard Davies
2012-08-21 15:39 ` Troy Benjegerdes [this message]
2012-08-22 9:08 ` Avi Kivity
2012-08-22 12:40 ` Richard Davies
2012-08-22 12:44 ` Avi Kivity
2012-08-22 14:41 ` Richard Davies
2012-08-22 14:53 ` Avi Kivity
2012-08-22 15:26 ` Richard Davies
2012-08-22 17:22 ` Troy Benjegerdes
2012-08-25 17:51 ` Richard Davies
2012-08-22 15:21 ` Rik van Riel
2012-08-22 15:34 ` Richard Davies
2012-08-25 17:45 ` Richard Davies
2012-08-25 18:11 ` Rik van Riel
2012-08-26 10:58 ` Richard Davies
2012-09-06 9:20 ` Richard Davies
2012-09-12 10:56 ` [Qemu-devel] Windows VM slow boot Richard Davies
2012-09-12 12:25 ` Mel Gorman
2012-09-12 16:46 ` Richard Davies
2012-09-13 9:50 ` Mel Gorman
2012-09-13 19:47 ` [Qemu-devel] [PATCH 1/2] Revert "mm: have order > 0 compaction start near a pageblock with free pages" Rik van Riel
2012-09-13 19:48 ` [Qemu-devel] [PATCH 2/2] make the compaction "skip ahead" logic robust Rik van Riel
2012-09-13 19:54 ` [Qemu-devel] [PATCH -v2 " Rik van Riel
2012-09-15 15:55 ` Richard Davies
2012-09-16 19:12 ` Richard Davies
2012-09-17 12:26 ` Mel Gorman
2012-09-18 8:14 ` Richard Davies
2012-09-18 11:21 ` Mel Gorman
2012-09-18 17:58 ` Richard Davies
2012-09-17 13:50 ` Rik van Riel
2012-09-17 14:07 ` Mel Gorman
2012-08-16 14:10 ` [Qemu-devel] Windows slow boot: contractor wanted Benoît Canet
2012-08-16 15:53 ` Troy Benjegerdes
2012-09-18 15:12 ` [Qemu-devel] Windows slow boot Michael Tokarev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120821153946.GO7722@us.grid.coop \
--to=hozer@hozed.org \
--cc=avi@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=qemu-devel@nongnu.org \
--cc=richard@arachsys.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).