From mboxrd@z Thu Jan 1 00:00:00 1970
From: George Dunlap
Subject: Re: Performance Monitoring Counter(PMC) Problem
Date: Tue, 23 Nov 2010 14:56:35 +0000
Message-ID: <4CEBD623.8010009@eu.citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: alex
Cc: "xen-devel@lists.xensource.com", Keir Fraser
List-Id: xen-devel@lists.xenproject.org

So, as I understand it, you expect the 6.5M cache misses for cases 1, 2,
and 4, but don't understand the much lower results for 3 and 5? And you
suspect that the numbers you're getting aren't valid, but are the result
of some mismanagement?

For one, you don't check in your {start,stop}PMC() functions whether the
domain ID is less than MAX_DOMAIN_NUMBER-1.

You're aware also that context_switch() isn't called when switching from a

It might be worth adding traces for the values you're reading from and
writing to the registers, and using xenalyze to see if you notice
anything strange. The xenalyze source can be found here:

http://xenbits.xensource.com/ext/xenalyze.hg

You'd have to make xenalyze understand your performance trace record,
but it will understand most everything else.

 -George

On 21/11/10 03:28, alex wrote:
> Hi all,
> I am running 64bit xen-3.4.2 on AMD Phenom II Quad-core processor, on
> which each core has separate performance counter. My research is to use
> performance counter to track interesting events (say, L3 cache miss) for
> each VM. Thus, I developed software multiplexing to support PMC tracking
> for individual PVOPS-VMs. Each domain has its own logical performance
> counter. Every time it will be scheduled, performance counter for that
> domain is reloaded and started.
> When a domain (VCPU) is de-scheduled
> after 30ms, the performance counter is stopped and stored to logical
> counter. I modified context_switch (in arch/x86/domain.c) by adding
> statements below:
>
> #define MAX_DOMAIN_NUMBER 8
>
> // element 0 is reserved for dom0, element 7 is reserved for IDLE_DOMAIN_ID.
> volatile uint64_t perfcounter[MAX_DOMAIN_NUMBER] = { 0, 0, 0, 0, 0, 0, 0, 0 };
>
> // multiplexing the performance counter for more than 4 VMs
> void startPMC(unsigned int pcpu_id, unsigned int domain_id)
> {
>     uint32_t eax, edx;
>
>     /* reload performance counter for next dom */
>     if(domain_id == IDLE_DOMAIN_ID) {
>         wrmsrl(MSR_K7_PERFCTR0, perfcounter[7]);
>     }
>     else {
>         wrmsrl(MSR_K7_PERFCTR0, perfcounter[domain_id]);
>     }
>
>     edx = 0x4;
>     // L3 cache misses for accesses from a core(cpu)
>     eax = 0xE1 | (0x3 << 16) | (0x1 << (12 + pcpu_id)) | (0x7 << 8) |
>           (0x1 << 22);
>     wrmsr(MSR_K7_EVNTSEL0, eax, edx);
> }
>
> // multiplexing the performance counter for more than 4 VMs
> void stopPMC(unsigned int pcpu_id, unsigned int domain_id)
> {
>     uint32_t eax, edx;
>
>     edx = 0x4;
>     // L3 cache misses for accesses from core(cpu)
>     eax = (0xE1 | (0x3 << 16) | (0x1 << (12 + pcpu_id)) | (0x7 << 8))
>           & ~(0x1 << 22);
>     wrmsr(MSR_K7_EVNTSEL0, eax, edx);
>
>     /* save current performance counter */
>     if(domain_id == IDLE_DOMAIN_ID) {
>         rdmsrl(MSR_K7_PERFCTR0, perfcounter[7]);
>     }
>     else {
>         rdmsrl(MSR_K7_PERFCTR0, perfcounter[domain_id]);
>     }
> }
>
> void context_switch(struct vcpu *prev, struct vcpu *next)
> {
>     unsigned int cpu = smp_processor_id();
>     ......
>     stopPMC(cpu, prev->domain->domain_id);
>     startPMC(cpu, next->domain->domain_id);
>     .......
> }
>
> In my experiment, I run 6 pvops DomUs and each pv domain has one VCPU
> and 1GB allocated memory. Dom0 has four VCPUs, and 1GB memory. The
> testing code is very simple and listed below.
>
> int main(void)
> {
>     int i;
>     char *buf;
>
>     buf = (char *)malloc(400 * 1024 * 1024);
>     for(i=0; i<400*1024*1024;i++)
>         buf[i] = 1;
>     return 0;
> }
>
> I run the testing code only in DomUs and found the mismatching results
> in different scenarios.
>
> Results:
> 1) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1..... and run the
> testing code in a DomU whose vcpu is also pinned, the performance
> counter I checked is around 6553600 (L3 cache misses). That means DomU
> accessed about 400MB data. (The cache line size is 64B).
> 2) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1..... but don't
> pin DomU's vcpu, the performance counter is a little less than 6553600
> (L3 cache misses).
> 3) If I don't pin Dom0 vcpus, and still run testing code in a DomU, the
> performance counter I got is much less than 6553600 (L3 cache misses),
> and the average value is 2949120.
>
> Now I move to run testing code in 2 DomUs, and also found mismatching
> results.
> 4) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1..... and pin each
> DomU vcpu to different pcpu, the result is around 6553600 (L3 cache misses).
> 5) If I pin Dom0 vcpu 0 to pcpu 0, vcpu 1 to pcpu 1..... and pin each
> DomU vcpu to the same pcpu, the result is much less than 6553600 (L3
> cache misses).
>
> To validate each VM accesses the specified data, I used "page-fault
> approach" which is another way to track memory accesses and find out
> each VM has about 102400 page faults, which is equivalent to 400MB.
> Since there is no paging sharing between VMs, each VM should access the
> same amount of 400MB data. I am not sure what is wrong with the
> performance counter multiplexing. So can anyone give me some suggestions?
>
> Thank you,
> Lei
> University of Arizona
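
For reference, here is a minimal sketch of the two checks discussed in this thread: a bounds-checked lookup for the perfcounter[] slot (the missing domain ID check mentioned in the reply), and the arithmetic behind the expected counts quoted above. The names pmc_slot_for_domain(), expected_units() and SKETCH_IDLE_DOMAIN_ID are hypothetical and not part of Xen; the idle ID value is an assumed stand-in, and a real patch would use the hypervisor's own IDLE_DOMAIN_ID definition.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DOMAIN_NUMBER 8
/* Last slot is reserved for the idle domain, as in the quoted code. */
#define IDLE_SLOT (MAX_DOMAIN_NUMBER - 1)
/* Assumption: stand-in for the hypervisor's IDLE_DOMAIN_ID constant. */
#define SKETCH_IDLE_DOMAIN_ID 0x7FFFu

/*
 * Map a domain ID to its perfcounter[] slot.
 * Returns -1 when the domain has no slot, so the caller can skip the
 * MSR save/restore instead of indexing past the end of the array.
 */
int pmc_slot_for_domain(unsigned int domain_id)
{
    if (domain_id == SKETCH_IDLE_DOMAIN_ID)
        return IDLE_SLOT;
    if (domain_id < IDLE_SLOT)   /* i.e. less than MAX_DOMAIN_NUMBER-1 */
        return (int)domain_id;
    return -1;
}

/*
 * Number of fixed-size units (cache lines, pages) covered by a buffer
 * that is written once from start to end.
 */
uint64_t expected_units(uint64_t bytes, uint64_t unit_size)
{
    assert(unit_size != 0);
    return bytes / unit_size;
}
```

With a 400MB buffer, expected_units() gives 6553600 for 64-byte cache lines and 102400 for 4KB pages, matching the figures in the quoted mail. Domain IDs are not reused when guests are destroyed and recreated, so an ID of 7 or above (other than the idle domain) would return -1 here rather than alias the idle slot or run off the array.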