From: "Longpeng (Mike)" <longpeng2@huawei.com>
To: Roman Kagan <rkagan@virtuozzo.com>,
Eduardo Habkost <ehabkost@redhat.com>
Cc: "Denis V. Lunev" <den@virtuozzo.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
Denis Plotnikov <dplotnikov@virtuozzo.com>,
pbonzini@redhat.com, rth@twiddle.net, qemu-devel@nongnu.org,
Gonglei <arei.gonglei@huawei.com>,
zhaoshenglong <zhaoshenglong@huawei.com>
Subject: Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Date: Thu, 30 Nov 2017 17:26:44 +0800
Message-ID: <5A1FCED4.8030600@huawei.com>
In-Reply-To: <20171129133524.GA2279@rkaganb.sw.ru>
On 2017/11/29 21:35, Roman Kagan wrote:
> On Wed, Nov 29, 2017 at 07:58:19PM +0800, Longpeng (Mike) wrote:
>> On 2017/11/29 18:41, Eduardo Habkost wrote:
>>> On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
>>>> On 2017/11/29 5:13, Eduardo Habkost wrote:
>>>>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
>>>>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
>>>>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
>>>>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
>>>>>>>> introduced exposing an L3 cache to the guest and enabled it by default.
>>>>>>>>
>>>>>>>> The motivation behind it was that in the Linux scheduler, when waking up
>>>>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
>>>>>>>> directly, without sending a reschedule IPI. Reduction in the IPI count
>>>>>>>> led to performance gain.
>>>>>>>>
>>>>>>>> However, this isn't the whole story. Once the task is on the target
>>>>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
>>>>>>>> it the idle task putting the CPU to sleep or just another running task.
>>>>>>>> For that a reschedule IPI will have to be issued, too. Only when that
>>>>>>>> other CPU is running a normal task for too little time, the fairness
>>>>>>>> constraints will prevent the preemption and thus the IPI.
>>>>>>>>
>>>>
>>>> Agree. :)
>>>>
>>>> Our test VM at that time was a SUSE 11 guest with idle=poll, and now I realize
>>>> that SUSE 11 has a bug in its scheduler.
>>>>
>>>> For RHEL 7.3 or the upstream kernel, ttwu_queue_remote() issues a RES IPI only
>>>> if rq->idle is not polling:
>>>> '''
>>>> static void ttwu_queue_remote(struct task_struct *p, int cpu)
>>>> {
>>>> 	struct rq *rq = cpu_rq(cpu);
>>>>
>>>> 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>>>> 		if (!set_nr_if_polling(rq->idle))
>>>> 			smp_send_reschedule(cpu);
>>>> 		else
>>>> 			trace_sched_wake_idle_without_ipi(cpu);
>>>> 	}
>>>> }
>>>> '''
>>>>
>>>> But SUSE 11 does not check; it sends a RES IPI unconditionally.
>>>
>>> So, does that mean no Linux guest benefits from the l3-cache=on
>>> default except SuSE 11 guests?
>>>
>>
>> Not only that, there is another scenario:
>>
>> static void ttwu_queue(...)
>> {
>> 	if (...two cpus NOT sharing L3-cache) {
>> 		...
>> 		ttwu_queue_remote(p, cpu, wake_flags);
>> 		return;
>> 	}
>> 	...
>> 	ttwu_do_activate(rq, p, wake_flags, &rf); <--*Here*
>> 	...
>> }
>>
>> In ttwu_do_activate() there is also some chance, albeit a small one, that no
>> RES IPI is sent even if the target cpu isn't in the idle polling state.
>
> Well it isn't so low actually, what you need is to keep the cpus busy
> switching tasks. In that case it's not uncommon that the task being
> woken up on a remote cpu has accumulated more vruntime than the task
> already running on that cpu; in that case the new task won't preempt the
> current task and the IPI won't be issued. E.g. on a RHEL 7.4 guest we
> saw:
>
I get it, thanks.
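
The vruntime argument above can be sketched roughly as follows. This is a
simplified model of the CFS wakeup-preemption check, not the kernel source; the
function name, its parameters, and the single granularity value are assumptions
made for illustration. When it returns false, the current task keeps running
and no reschedule IPI is needed.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model of the CFS wakeup-preemption test. The names and the
 * single `gran` value are illustrative assumptions, not kernel code.
 */
bool woken_task_preempts(uint64_t curr_vruntime,
                         uint64_t woken_vruntime,
                         uint64_t gran)
{
    /* A signed difference tolerates vruntime wraparound. */
    int64_t vdiff = (int64_t)(curr_vruntime - woken_vruntime);

    /* The woken task has accumulated at least as much vruntime: no preemption. */
    if (vdiff <= 0)
        return false;

    /* Preempt only when the current task is ahead by more than the granularity. */
    return (uint64_t)vdiff > gran;
}
```

With cpus kept busy switching tasks, the first branch is taken often, which is
why the IPI count drops in the messaging benchmark.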
>>>>>>>> This boils down to the improvement being only achievable in workloads
>>>>>>>> with many actively switching tasks. We had no access to the
>>>>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
>>>>>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
>>>>>>>> on 1 socket, 8 cores vCPU topology, we see indeed:
>>>>>>>>
>>>>>>>> l3-cache    #res IPI /s    #time / 10000 loops
>>>>>>>> off         560K           1.8 sec
>>>>>>>> on          40K            0.9 sec
>
> The workload where it bites is mostly idle guest, with chains of
> dependent wakeups, i.e. with little parallelism:
>
>>>>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>>>>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>>>>>>>> interactions and therefore excessive halts and IPIs. E.g. "perf bench
>>>>>>>> sched pipe -i 100000" gives
>>>>>>>>
>>>>>>>> l3-cache    #res IPI /s    #HLT /s    #time /100000 loops
>>>>>>>> off         200 (no K)     230        0.2 sec
>>>>>>>> on          400K           330K       0.5 sec
>>>>>>>>
>>>>
>>>> I guess this issue could be resolved by disabling SD_WAKE_AFFINE.
>
> Actually, it's SD_WAKE_AFFINE that's being effectively defeated by this
> l3-cache, because the scheduler thinks that the cpus that share the
> last-level cache are close enough that a dependent task can be woken up
> on a sibling cpu.
>
In this case (sched pipe), without L3-cache, the dependent task is mostly woken
up on the original cpu. If the two tasks run on the same cpu, the dependent task
is woken up without a RES IPI. The related code is:
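'''
void resched_curr(struct rq *rq)
{
	struct task_struct *curr = rq->curr;
	int cpu;
	...
	cpu = cpu_of(rq);

	/* Target is the local cpu: set the flag, no IPI needed. */
	if (cpu == smp_processor_id()) {
		set_tsk_need_resched(curr);
		set_preempt_need_resched();
		return;
	}
	...
}
'''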
Do I understand this correctly? If not, I hope you can point out what's wrong :)
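
To make the paths discussed in this thread easier to compare, here is a minimal
user-space sketch of when a wakeup ends up sending a RES IPI. It is only a model
of the ttwu_queue() / ttwu_queue_remote() / resched_curr() logic quoted above,
not kernel code; struct wake_target, its fields, and wakeup_sends_res_ipi() are
hypothetical names introduced for illustration.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a wakeup target; not a kernel structure. */
struct wake_target {
    int  cpu;           /* cpu chosen for the woken task */
    bool shares_llc;    /* waker and target cpus share a last-level cache */
    bool idle_polling;  /* target runs the idle task in polling mode */
    bool must_preempt;  /* woken task should preempt the target's current task */
};

/* Returns true if, in this model, the wakeup sends a reschedule IPI. */
bool wakeup_sends_res_ipi(int waker_cpu, const struct wake_target *t)
{
    /* A cpu always shares its cache with itself. */
    bool shares = t->shares_llc || t->cpu == waker_cpu;

    /* ttwu_queue_remote() path: IPI unless the idle task is polling. */
    if (!shares)
        return !t->idle_polling;

    /* ttwu_do_activate() path: nothing to send if no preemption is needed. */
    if (!t->must_preempt)
        return false;

    /* resched_curr() on the local cpu only sets the need_resched flag. */
    if (t->cpu == waker_cpu)
        return false;

    /* Remote target: IPI unless it is polling in idle. */
    return !t->idle_polling;
}
```

When t->cpu equals waker_cpu the function returns false whatever the other
fields are, which corresponds to the sched pipe case without L3-cache.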
>>>> As Gonglei said:
>>>> 1. the L3 cache relates to the user experience.
>>>
>>> This is true, in a way: I have seen a fair share of user reports
>>> where they incorrectly blame the L3 cache absence or the L3 cache
>>> size for performance problems.
>>>
>>>> 2. glibc gets the cache info directly from CPUID, and this affects memory
>>>> performance.
>>>
>>> I'm interested in numbers that demonstrate that.
>
> Me too. I vaguely remember debugging a memcpy degradation in the guest
> (on the Parallels proprietary hypervisor), that turned out to be due to a
> combination of l3 cache size and the cpu topology exposed to the guest,
> which caused glibc to choose an inadequate buffer size.
>
We faced the same problem several months ago.

I did some simple tests at noon. The numbers look better without L3-cache,
except for 'perf bench sched messaging'.

VM: 1 socket, 8 cores, 3.10.0 guest
Hardware: Intel(R) Xeon(R) CPU E7-8890 v2 @ 2.80GHz

Stream (100 runs):
l3     Copy      Scale     Add       Triad
--------------------------------------------
off    8025.8    8019.5    8363.1    8589.9
on     8016.7    7999.9    8344.2    8568.9

perf bench sched messaging (100 runs):
l3     Total-time
-----------------
off    0.0238
on     0.0178

perf bench sched pipe (100 runs):
l3     Total-time
-----------------
off    0.3190
on     1.2688

We are very busy at the end of each month, so my tests may be insufficient; I'm
sorry for that.

According to the numbers above, I think it's worth turning off L3-cache by
default.
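
A rough way to read the tables above is the on/off ratio per benchmark. The
sketch below just redoes that arithmetic with the figures copied from the
tables; the function names are made up for illustration, and it assumes the
Stream figures are rates (higher is better) while the perf figures are times
(lower is better).

```c
#include <stdio.h>

/* Ratio of the l3-cache=on result to the l3-cache=off result. */
double on_off_ratio(double off, double on)
{
    return on / off;
}

/* Prints one summary line per benchmark, using the figures from the tables. */
void print_l3_summary(void)
{
    /* Assumed to be a rate: a ratio below 1 means "on" is slower. */
    printf("stream copy:     %.4f\n", on_off_ratio(8025.8, 8016.7));
    /* Times: a ratio below 1 means "on" is faster, above 1 slower. */
    printf("sched messaging: %.2f\n", on_off_ratio(0.0238, 0.0178));
    printf("sched pipe:      %.2f\n", on_off_ratio(0.3190, 1.2688));
}
```

Under those assumptions, l3-cache=on takes about a quarter less time in sched
messaging, roughly four times longer in sched pipe, and is within about 0.1% of
l3-cache=off on Stream copy.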
>> Sorry I have no numbers in hand currently :(
>>
>> I'll do some tests these days, please give me some time.
>
> We'll try to get some data on this, too.
>
>>>> What's more, the L3 cache relates to the sched_domain, which is important to
>>>> the (load) balancer when the system is busy.
>>>>
>>>> All this doesn't mean the patch is insignificant, I just think we should do
>>>> more research before deciding. I'll do some tests, thanks. :)
>>>
>>> Yes, we need more data. But if we find out that there are no
>>> cases where the l3-cache=on default actually improves
>>> performance, I will be willing to apply this patch.
>>>
>>
>> That's a good thing if we find the truth, it's free. :)
>>
>> OTOH, I think we should notice that: Linux is designed on real hardware, maybe
>> there're some other problems if QEMU lacks some related features. If we search
>> 'cpus_share_cache' in the Linux kernel, we can see that it's also used by Block
>> Layer.
>>
>>> IMO, the long term solution is to make Linux guests not misbehave
>>> when we stop lying about the L3 cache. Maybe we could provide a
>>> "IPIs are expensive, please avoid them" hint in the KVM CPUID
>>> leaf?
>
> We already have it, it's the hypervisor bit ;) Seriously, I'm unaware
> of hypervisors where IPIs aren't expensive.
>
>> Maybe more PV features could be explored.
>
> One problem with this is that PV features are hard to get into other
> guest OSes or existing Linux guests.
>
Some cloud providers (e.g. Amazon, Alibaba...) provide a customized guest that
could include more PV features to reach maximum performance.
> Roman.
>
>
--
Regards,
Longpeng(Mike)