From mboxrd@z Thu Jan  1 00:00:00 1970
From: George Dunlap <George.Dunlap@eu.citrix.com>
Subject: Re: Performance Monitoring Counter(PMC) Problem
Date: Tue, 23 Nov 2010 14:56:35 +0000
Message-ID: <4CEBD623.8010009@eu.citrix.com>
References: <AANLkTim=q8D0KvsabtBb2O2_7mmwnzktGq2VRjYZ8dkJ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <AANLkTim=q8D0KvsabtBb2O2_7mmwnzktGq2VRjYZ8dkJ@mail.gmail.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: alex <leiye.xen@gmail.com>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, Keir Fraser <keir@xen.org>
List-Id: xen-devel@lists.xenproject.org

So, as I understand it, you expect the 6.5M cache misses for cases 1, 2, 
4, but don't understand the much lower results for 3 and 5?  And you 
suspect that the numbers you're getting aren't valid, but are the result 
of some mismanagement?
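
For reference, the 6.5M figure follows from the buffer size and line size 
quoted in the original post. A small sketch of that arithmetic, assuming 
64-byte cache lines and 4KiB pages (the function names are made up for 
illustration):

```c
#include <stdint.h>

/* Assumptions taken from the original post: 64-byte lines, 4 KiB pages. */
#define CACHE_LINE_BYTES 64u
#define PAGE_BYTES       4096u

/* Cache lines touched when every byte of a buffer is written once. */
uint64_t expected_line_misses(uint64_t buf_bytes)
{
    return buf_bytes / CACHE_LINE_BYTES;
}

/* Pages touched (first-touch page faults) for the same buffer. */
uint64_t expected_page_faults(uint64_t buf_bytes)
{
    return buf_bytes / PAGE_BYTES;
}

/* Amount of data implied by an observed miss count. */
uint64_t bytes_from_misses(uint64_t misses)
{
    return misses * CACHE_LINE_BYTES;
}
```

By the same arithmetic, the 2949120 average reported for case 3 
corresponds to about 180MB worth of cache lines out of the 400MB buffer.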

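For one, your {start,stop}PMC() functions don't check whether the 
domain ID is less than MAX_DOMAIN_NUMBER-1.  Any domain ID of 8 or above 
indexes past the end of perfcounter[], and domain ID 7 shares its slot 
with the idle domain.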
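
A minimal sketch of such a check, pulled out into a helper so both 
functions can share it. The helper name pmc_slot() is hypothetical, and 
the value of IDLE_DOMAIN_ID is an assumption defined here only so the 
snippet builds on its own; in the hypervisor you would use the existing 
definition.

```c
#define MAX_DOMAIN_NUMBER 8
/* Assumption: stand-in for the hypervisor's own IDLE_DOMAIN_ID. */
#define IDLE_DOMAIN_ID 0x7FFFu

/*
 * Map a domain ID to its index in perfcounter[].
 * Returns -1 for domains that have no slot, so the caller can skip
 * the save/restore instead of touching memory outside the array.
 */
int pmc_slot(unsigned int domain_id)
{
    if (domain_id == IDLE_DOMAIN_ID)
        return MAX_DOMAIN_NUMBER - 1;   /* last element is reserved for idle */
    if (domain_id < MAX_DOMAIN_NUMBER - 1)
        return (int)domain_id;          /* elements 0..6: dom0 and guests */
    return -1;                          /* no slot for this domain */
}
```

startPMC() and stopPMC() would then return early when pmc_slot() gives 
-1.  Keep in mind that domain IDs are not reused when a guest is 
destroyed and recreated, so IDs above 6 can show up even with only six 
guests running.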

You're aware also that context_switch() isn't called when switching from a

It might be worth adding traces for the values you're reading from and 
writing to the registers, and using xenalyze to see if you notice 
anything strange.

The xenalyze source can be found here:
  http://xenbits.xensource.com/ext/xenalyze.hg

You'd have to make xenalyze understand your performance trace record, 
but it will understand most everything else.
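
As a rough sketch of what such a trace record could carry: the layout 
and every name below are hypothetical, not something Xen or xenalyze 
defines.  It assumes the payload is built from 32-bit words, so the 
64-bit counter value is split in two.

```c
#include <stdint.h>

/* Hypothetical record; none of these names come from Xen or xenalyze. */
struct pmc_trace {
    uint16_t domain_id;  /* domain being switched in or out */
    uint8_t  pcpu_id;    /* physical CPU doing the switch */
    uint8_t  is_start;   /* 1 = value written in startPMC, 0 = value read in stopPMC */
    uint64_t counter;    /* raw counter value */
};

/* Pack the record into three 32-bit payload words. */
void pmc_trace_pack(const struct pmc_trace *t, uint32_t w[3])
{
    w[0] = (uint32_t)t->domain_id
         | ((uint32_t)t->pcpu_id << 16)
         | ((uint32_t)(t->is_start & 1u) << 24);
    w[1] = (uint32_t)(t->counter & 0xFFFFFFFFu);  /* low half */
    w[2] = (uint32_t)(t->counter >> 32);          /* high half */
}

/* Inverse of pmc_trace_pack(), as the analysis side would need. */
void pmc_trace_unpack(const uint32_t w[3], struct pmc_trace *t)
{
    t->domain_id = (uint16_t)(w[0] & 0xFFFFu);
    t->pcpu_id   = (uint8_t)((w[0] >> 16) & 0xFFu);
    t->is_start  = (uint8_t)((w[0] >> 24) & 1u);
    t->counter   = ((uint64_t)w[2] << 32) | w[1];
}
```

Logging one record in stopPMC() and one in startPMC() would let you 
check that the value restored for a domain matches the value last saved 
for it.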

  -George

On 21/11/10 03:28, alex wrote:
> Hi all,
> I am running 64-bit xen-3.4.2 on an AMD Phenom II quad-core processor, on
> which each core has a separate performance counter. My research uses the
> performance counter to track interesting events (say, L3 cache misses) for
> each VM. Thus, I developed software multiplexing to support PMC tracking
> for individual PVOPS VMs. Each domain has its own logical performance
> counter. Every time a domain is scheduled, the performance counter for that
> domain is reloaded and started. When a domain (VCPU) is de-scheduled
> after 30ms, the performance counter is stopped and stored to the logical
> counter.  I modified context_switch (in arch/x86/domain.c) by adding
> the statements below:
> #define MAX_DOMAIN_NUMBER 8
> /* element 0 is reserved for dom0, element 7 is reserved for IDLE_DOMAIN_ID */
> volatile uint64_t perfcounter[MAX_DOMAIN_NUMBER] = { 0, 0, 0, 0, 0, 0, 0, 0 };
> // multiplexing the performance counter for more than 4 VMs
> void startPMC(unsigned int pcpu_id, unsigned int domain_id)
> {
>      uint32_t eax, edx;
>      /* reload performance counter for next dom */
>      if(domain_id == IDLE_DOMAIN_ID) {
>            wrmsrl(MSR_K7_PERFCTR0, perfcounter[7]);
>      }
>      else {
>           wrmsrl(MSR_K7_PERFCTR0, perfcounter[domain_id]);
>      }
>      edx = 0x4;
>      eax = 0xE1 | (0x3 << 16) | (0x1 << (12 + pcpu_id)) | (0x7 << 8) |
> (0x1 << 22); // L3 cache misses for accesses from a core(cpu)
>      wrmsr(MSR_K7_EVNTSEL0, eax, edx);
> }
> // multiplexing the performance counter for more than 4 VMs
> void stopPMC(unsigned int pcpu_id, unsigned int domain_id)
> {
>        uint32_t eax, edx;
>        edx = 0x4;
>        eax = (0xE1 | (0x3 << 16) | (0x1 << (12 + pcpu_id)) | (0x7 << 8))
> & ~(0x1 << 22); // L3 cache misses for accesses from core(cpu)
>        wrmsr(MSR_K7_EVNTSEL0, eax, edx);
>       /* save current performance counter */
>      if(domain_id == IDLE_DOMAIN_ID) {
>            rdmsrl(MSR_K7_PERFCTR0, perfcounter[7]);
>      }
>      else {
>           rdmsrl(MSR_K7_PERFCTR0, perfcounter[domain_id]);
>      }
> }
> void context_switch(struct vcpu *prev, struct vcpu *next)
> {
>     unsigned int cpu = smp_processor_id();
>     ......
>      stopPMC(cpu, prev->domain->domain_id);
>      startPMC(cpu, next->domain->domain_id);
>     .......
> }
> In my experiment, I run 6 pvops DomUs and each pv domain has one VCPU
> and 1GB allocated memory. Dom0 has four VCPUs, and 1GB memory. The
> testing code is very simple and listed below.
> #include <stdlib.h>
>
> int main(void)
> {
>    int i;
>    char *buf;
>    /* write every byte of a 400MB buffer once */
>    buf = (char *)malloc(400 * 1024 * 1024);
>    for(i=0; i<400*1024*1024;i++)
>      buf[i] = 1;
>    return 0;
> }
> I ran the test code only in DomUs and found mismatched results
> in different scenarios.
> Results:
> 1) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1.....  and run the
> testing code in a DomU whose vcpu is also pinned, the performance
> counter I checked is around 6553600 (L3 cache misses). That means DomU
> accessed about 400MB data. (The cache line size is 64B).
> 2) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1.....  but don't
> pin DomU's vcpu, the performance counter is a little less than 6553600
> (L3 cache misses).
> 3) If I don't pin Dom0 vcpus, and still run testing code in a DomU, the
> performance counter I got is much less than 6553600 (L3 cache misses),
> and the average value is 2949120.
> Next I ran the test code in 2 DomUs, and also found mismatched
> results.
> 4) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1.....  and pin each
> DomU vcpu to different pcpu, the result is around 6553600 (L3 cache misses).
> 5) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1.....  and pin each
> DomU vcpu to the same pcpu, the result is much less than 6553600 (L3
> cache misses).
> To validate that each VM accesses the expected amount of data, I used a
> "page-fault approach", which is another way to track memory accesses, and
> found that each VM has about 102400 page faults, which is equivalent to 400MB.
> Since there is no page sharing between VMs, each VM should access the
> same 400MB of data. I am not sure what is wrong with the
> performance counter multiplexing.  So can anyone give me some suggestions?
> Thank you,
> Lei
> University of Arizona