From: "Longpeng (Mike)" <longpeng2@huawei.com>
To: Eduardo Habkost <ehabkost@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Richard Henderson <rth@twiddle.net>,
	huangpeng <peter.huangpeng@huawei.com>,
	zhaoshenglong <zhaoshenglong@huawei.com>,
	Gonglei <arei.gonglei@huawei.com>,
	QEMU-DEV <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] [RFC] target-i386: present virtual L3 cache info for vcpus
Date: Wed, 31 Aug 2016 08:59:41 +0800	[thread overview]
Message-ID: <57C62BFD.7070008@huawei.com> (raw)
In-Reply-To: <20160830142501.GE3731@thinpad.lan.raisama.net>

Hi Eduardo,

On 2016/8/30 22:25, Eduardo Habkost wrote:

> On Mon, Aug 29, 2016 at 09:17:02AM +0800, Longpeng (Mike) wrote:
>> This patch presents virtual L3 cache info for virtual cpus.
> 
> Just changing the L3 cache size in the CPUID code will make
> guests see a different cache topology after upgrading QEMU (even
> on live migration). If you want to change the default, you need
> to keep the old values on old machine-types.
> 
> In other words, we need to make the cache size configurable, and
> set compat_props on PC_COMPAT_2_7.
> 

Thanks for your suggestions, I will fix it later.
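
For the machine-type compatibility, something like the following is what I
have in mind -- just a sketch, assuming a new "l3-cache" X86CPU property
(the property name is hypothetical) that stays off on 2.7 and older
machine-types:

	/* Sketch only: an entry like this added to PC_COMPAT_2_7 in
	 * include/hw/i386/pc.h would keep the old no-L3 CPUID values on
	 * 2.7 and older machine-types ("l3-cache" is a hypothetical
	 * X86CPU property name).
	 */
	{
	    .driver   = TYPE_X86_CPU,
	    .property = "l3-cache",
	    .value    = "off",
	},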

> Other comments below:
> 
>>
>> Some software algorithms rely on the hardware's cache info. For example, in
>> the x86 Linux kernel, when cpu1 wants to wake up a task on cpu2, cpu1 triggers
>> a resched IPI and tells cpu2 to do the wakeup itself if they don't share a
>> low-level cache; conversely, cpu1 accesses cpu2's runqueue directly if they
>> share the LLC. The relevant Linux kernel code is below:
>>
>> 	static void ttwu_queue(struct task_struct *p, int cpu)
>> 	{
>> 		struct rq *rq = cpu_rq(cpu);
>> 		......
>> 		if (... && !cpus_share_cache(smp_processor_id(), cpu)) {
>> 			......
>> 			ttwu_queue_remote(p, cpu); /* will trigger RES IPI */
>> 			return;
>> 		}
>> 		......
>> 		ttwu_do_activate(rq, p, 0); /* access target's rq directly */
>> 		......
>> 	}
>>
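For context: cpus_share_cache() above boils down to comparing the per-CPU
LLC IDs that the guest scheduler derives from the CPUID cache topology; a
simplified sketch of the helper in kernel/sched/core.c:

	bool cpus_share_cache(int this_cpu, int that_cpu)
	{
		/* same last-level-cache domain ID => wake up locally, no IPI */
		return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
	}
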
>> On real hardware, the CPUs on the same socket share the L3 cache, so one CPU
>> won't trigger resched IPIs when waking up a task on another. But QEMU doesn't
>> present virtual L3 cache info to the VM, so the Linux guest will trigger lots
>> of RES IPIs under some workloads even when the virtual CPUs belong to the same
>> virtual socket.
>>
>> For KVM, this degrades performance, because the IPIs sent by the guest cause
>> lots of VM exits.
>>
>> The workload is a SAP HANA test suite. We ran it for one round (about 40
>> minutes) and counted the RES IPIs triggered in the guest (SUSE 11 SP3) during
>> that period:
>>
>>         No-L3           With-L3 (with this patch)
>> cpu0:	363890		44582
>> cpu1:	373405		43109
>> cpu2:	340783		43797
>> cpu3:	333854		43409
>> cpu4:	327170		40038
>> cpu5:	325491		39922
>> cpu6:	319129		42391
>> cpu7:	306480		41035
>> cpu8:	161139		32188
>> cpu9:	164649		31024
>> cpu10:	149823		30398
>> cpu11:	149823		32455
>> cpu12:	164830		35143
>> cpu13:	172269		35805
>> cpu14:	179979		33898
>> cpu15:	194505		32754
>> avg:	268963.6	40129.8
>>
>> The VM's topology is "1 socket, 8 cores, 2 threads".
>> After presenting virtual L3 cache info to the VM, the number of RES IPIs in
>> the guest drops by about 85%.
> 
> What happens to overall system performance if the VCPU threads
> actually run on separate sockets in the host?

On separate sockets, the worst case is accessing remote memory, but I think the
overhead of the VM exits weighs more heavily. I will run some tests and include
the data in the next version.

> 
> Other questions about the code below:
> 
>>
>> Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
>> ---
>>  target-i386/cpu.c | 34 +++++++++++++++++++++++++++-------
>>  1 file changed, 27 insertions(+), 7 deletions(-)
>>
>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>> index 6a1afab..5a5fd06 100644
>> --- a/target-i386/cpu.c
>> +++ b/target-i386/cpu.c
>> @@ -57,6 +57,7 @@
>>  #define CPUID_2_L1D_32KB_8WAY_64B 0x2c
>>  #define CPUID_2_L1I_32KB_8WAY_64B 0x30
>>  #define CPUID_2_L2_2MB_8WAY_64B   0x7d
>> +#define CPUID_2_L3_12MB_24WAY_64B 0xea
>>
>>
>>  /* CPUID Leaf 4 constants: */
>> @@ -131,11 +132,15 @@
>>  #define L2_LINES_PER_TAG       1
>>  #define L2_SIZE_KB_AMD       512
>>
>> -/* No L3 cache: */
>> -#define L3_SIZE_KB             0 /* disabled */
>> -#define L3_ASSOCIATIVITY       0 /* disabled */
>> -#define L3_LINES_PER_TAG       0 /* disabled */
>> -#define L3_LINE_SIZE           0 /* disabled */
>> +/* Level 3 unified cache: */
>> +#define L3_LINE_SIZE          64
>> +#define L3_ASSOCIATIVITY      24
>> +#define L3_SETS             8192
>> +#define L3_PARTITIONS          1
>> +#define L3_DESCRIPTOR CPUID_2_L3_12MB_24WAY_64B
>> +/*FIXME: CPUID leaf 0x80000006 is inconsistent with leaves 2 & 4 */
> 
> Why are you intentionally introducing a bug?

Please forgive my foolishness; I will fix it.

By the way, the same legacy bugs exist in the L1/L2 code. Do they need to be
fixed as well?
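
To illustrate with the sizes already encoded above (CPUID leaf 4 encodes
size = ways * partitions * line_size * sets, per the Intel SDM):

	/* New L3 via leaf 4:             24 * 1 * 64 * 8192 = 12 MiB
	 * New L3 via leaf 0x80000006:    L3_SIZE_KB_AMD     =  1 MiB
	 * Existing L2 via leaf 2:        descriptor 0x7d    =  2 MiB
	 * Existing L2 via leaf 0x80000006: L2_SIZE_KB_AMD   = 512 KiB
	 */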

> 
>> +#define L3_LINES_PER_TAG       1
>> +#define L3_SIZE_KB_AMD      1024
>>
>>  /* TLB definitions: */
>>
>> @@ -2328,7 +2333,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index,
>> uint32_t count,
> 
> I believe Thunderbird line wrapping broke the patch. Git can't
> apply it.
> 
>>          }
>>          *eax = 1; /* Number of CPUID[EAX=2] calls required */
>>          *ebx = 0;
>> -        *ecx = 0;
>> +        *ecx = (L3_DESCRIPTOR);
>>          *edx = (L1D_DESCRIPTOR << 16) | \
>>                 (L1I_DESCRIPTOR <<  8) | \
>>                 (L2_DESCRIPTOR);
>> @@ -2374,6 +2379,21 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index,
>> uint32_t count,
>>                  *ecx = L2_SETS - 1;
>>                  *edx = CPUID_4_NO_INVD_SHARING;
>>                  break;
>> +            case 3: /* L3 cache info */
>> +                *eax |= CPUID_4_TYPE_UNIFIED | \
>> +                        CPUID_4_LEVEL(3) | \
>> +                        CPUID_4_SELF_INIT_LEVEL;
>> +                /*
>> +                * According to qemu's APICIDs generating rule, this can make
>> +                * sure vcpus on the same vsocket get the same llc_id.
>> +                */
>> +                *eax |= (cs->nr_cores * cs->nr_threads - 1) << 14;
> 
> The above comment doesn't seem to be true.
> 
> For example: if nr_cores=9,nr_threads=3, then:
> 
> SMT_Mask_Width = Log2 ( RoundToNearestPof2(CPUID.1:EBX[23:16]) / ((CPUID.(EAX=4, ECX=0):EAX[31:26] ) + 1))
>                = Log2 ( RoundToNearestPof2(nr_cores * nr_threads) / ((nr_cores - 1) + 1 ))
>                = Log2 ( RoundToNearestPof2(27) / 9 )
>                = Log2 ( 32 / 9 ) = Log2 ( 3.56 ) = 2 (rounding up)
> CoreOnly_Mask_Width = Log2(1 + (CPUID.(EAX=4, ECX=0):EAX[31:26]))
>                     = Log2(1 + (nr_cores - 1)) = Log2(nr_cores)
>                     = Log2(9) = 4 (rounding up)
> CorePlus_Mask_Width = CoreOnly_Mask_Width + SMT_Mask_Width
>                     = 4 + 2 = 6
> 
> But:
> 
> Cache_Mask_Width[3] = Log2(RoundToNearestPof2(1 + CPUID.(EAX=4, ECX=n):EAX[25:14]))
>                     = Log2(RoundToNearestPof2(1 + CPUID.(EAX=4, ECX=3):EAX[25:14]))
>                     = Log2(RoundToNearestPof2(1 + (nr_cores * nr_threads - 1)))
>                     = Log2(RoundToNearestPof2(nr_cores * nr_threads))
>                     = Log2(RoundToNearestPof2(27))
>                     = Log2(32) = 5
> 
> That means the VCPU at package 1, core 8, thread 2 (APIC ID
> 1100010b) has Package_ID=1 but its L3 Cache_ID would be 3.
> 

Nice! I will fix it.
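
Something like the following is what I have in mind for the fix -- just a
sketch, using the apicid_pkg_offset() helper from include/hw/i386/topology.h
(a log2 ceiling over cores and threads) so that EAX[25:14] covers the full
CorePlus mask width even for non-power-of-two core/thread counts:

	/* Sketch of a possible fix: derive EAX[25:14] from the APIC ID
	 * package offset, so Cache_Mask_Width[3] equals CorePlus_Mask_Width
	 * and every VCPU in a package gets the same L3 Cache_ID.
	 */
	unsigned pkg_offset = apicid_pkg_offset(cs->nr_cores, cs->nr_threads);
	*eax |= ((1 << pkg_offset) - 1) << 14;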

> 
>> +                *ebx = (L3_LINE_SIZE - 1) | \
>> +                       ((L3_PARTITIONS - 1) << 12) | \
>> +                       ((L3_ASSOCIATIVITY - 1) << 22);
>> +                *ecx = L3_SETS - 1;
>> +                *edx = CPUID_4_INCLUSIVE | CPUID_4_COMPLEX_IDX;
>> +                break;
>>              default: /* end of info */
>>                  *eax = 0;
>>                  *ebx = 0;
>> @@ -2585,7 +2605,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index,
>> uint32_t count,
>>          *ecx = (L2_SIZE_KB_AMD << 16) | \
>>                 (AMD_ENC_ASSOC(L2_ASSOCIATIVITY) << 12) | \
>>                 (L2_LINES_PER_TAG << 8) | (L2_LINE_SIZE);
>> -        *edx = ((L3_SIZE_KB/512) << 18) | \
>> +        *edx = ((L3_SIZE_KB_AMD / 512) << 18) | \
>>                 (AMD_ENC_ASSOC(L3_ASSOCIATIVITY) << 12) | \
>>                 (L3_LINES_PER_TAG << 8) | (L3_LINE_SIZE);
>>          break;
>> -- 
>> 1.8.3.1
>>
> 

Thanks!
