* New MPI benchmark performance results (update)
@ 2005-05-03 9:11 xuehai zhang
2005-05-03 9:28 ` Steven Hand
2005-05-03 20:24 ` Nivedita Singhvi
0 siblings, 2 replies; 10+ messages in thread
From: xuehai zhang @ 2005-05-03 9:11 UTC (permalink / raw)
To: Xen-devel
Hi all,
In the following post I sent in early April
(http://lists.xensource.com/archives/html/xen-devel/2005-04/msg00091.html), I reported a
performance gap when running the PMB SendRecv benchmark on native Linux and on domU. Now I've prepared
a webpage comparing the performance of 8 PMB benchmarks under 4 scenarios (native Linux, dom0, domU with
SMP, and domU without SMP) at http://people.cs.uchicago.edu/~hai/vm1/vcluster/PMB/.
In the graphs presented on the webpage, we take the results of native Linux as the reference and
normalize the other 3 scenarios to it. We observe a general pattern: dom0 usually performs better
than domU with SMP, which in turn performs better than domU without SMP (here better performance
means lower latency and higher throughput). However, we also notice a very big performance gap
between domU (w/o SMP) and native Linux (or dom0, because dom0 generally performs very similarly to
native Linux). Some distinct examples are: 8-node SendRecv latency (max domU/linux score ~ 18),
8-node Allgather latency (max domU/linux score ~ 17), and 8-node Alltoall latency (max domU/linux
> 60). The performance difference in the last example is HUGE and we could not think of a reasonable
explanation why transferring a 512B message is so much different from other sizes. We would
appreciate it if you could provide your insight into such a big performance problem in these benchmarks.
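For readers who want to reproduce the normalization described above, a minimal sketch follows. It assumes each scenario's results are available as a mapping from message size (bytes) to latency; the function names and the sample numbers are hypothetical placeholders, not the measured results from the webpage.

```python
def normalize(native, scenario):
    """Return scenario/native ratios keyed by message size (bytes).

    Sizes missing from either data set, or with a non-positive
    reference value, are skipped.
    """
    ratios = {}
    for size, ref in native.items():
        if size in scenario and ref > 0:
            ratios[size] = scenario[size] / ref
    return ratios


def worst_point(ratios):
    """Return (message_size, ratio) for the largest ratio."""
    size = max(ratios, key=ratios.get)
    return size, ratios[size]


# Hypothetical latencies in microseconds, NOT the measured results.
native = {128: 100.0, 512: 120.0, 2048: 200.0}
domu = {128: 300.0, 512: 7200.0, 2048: 500.0}

ratios = normalize(native, domu)
print(ratios)
print(worst_point(ratios))  # (512, 60.0)
```

With latency data, a ratio above 1 means the scenario is slower than native Linux, so the largest ratio marks the anomalous message size.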
BTW, all the benchmarking is based on the original Xen code. That is, we didn't modify
net_rx_action in netback to kick the frontend after every packet as suggested by Ian in the following
post (http://lists.xensource.com/archives/html/xen-devel/2005-04/msg00180.html).
Please let me know if you have any questions about the configuration of the benchmarking
experiments. I am looking forward to your insightful explanations.
Thanks.
Xuehai
* Re: New MPI benchmark performance results (update)
2005-05-03 9:11 New MPI benchmark performance results (update) xuehai zhang
@ 2005-05-03 9:28 ` Steven Hand
2005-05-03 16:36 ` xuehai zhang
2005-05-03 20:24 ` Nivedita Singhvi
1 sibling, 1 reply; 10+ messages in thread
From: Steven Hand @ 2005-05-03 9:28 UTC (permalink / raw)
To: xuehai zhang; +Cc: Xen-devel
> Please let me know if you have any questions about the configuration
> of the benchmarking experiments. I am looking forward to your
> insightful explanations.
Erm, what version of Xen are you using for these? I notice that the
dom0 kernel seems to be using 2.4.28 which is not current in any of
the trees. Since you're using SMP guests, I'm guessing this is some
old version of xen-unstable?
Your results are kinda interesting but I think you'd probably be
better off trying to compare like with like so that we can isolate
the performance issues due to Xen/XenLinux, i.e.
- use the same kernel (or ported kernel) in each case;
- use the same amount of memory in each case.
Otherwise you end up comparing 2.4 to 2.6, or 128MB/360MB/512MB, ...
Also you should probably use the current unstable tree since there
have been a number of performance fixes.
cheers,
S.
* Re: New MPI benchmark performance results (update)
2005-05-03 9:28 ` Steven Hand
@ 2005-05-03 16:36 ` xuehai zhang
2005-05-03 16:13 ` Mark Williamson
0 siblings, 1 reply; 10+ messages in thread
From: xuehai zhang @ 2005-05-03 16:36 UTC (permalink / raw)
To: Steven Hand; +Cc: Xen-devel
Steven,
Thanks for the response.
>>Please let me know if you have any questions about the configuration
>>of the benchmarking experiments. I am looking forward to your
>>insightful explanations.
>
>
> Erm, what version of Xen are you using for these? I notice that the
> dom0 kernel seems to be using 2.4.28 which is not current in any of
> the trees. Since you're using SMP guests, I'm guessing this is some
> old version of xen-unstable?
The Xen version is 2.0 for all the experiments. I am not sure whether the SMP mentioned in my email is
the same as the "SMP guests" you mentioned. To clarify, by "domU with SMP" I mean Xen is booted
with SMP support (no "nosmp" option), with dom0 pinned to the 1st CPU and domU pinned to the 2nd CPU;
by "domU with no SMP" I mean Xen is booted without SMP support (with the "nosmp" option) and
both dom0 and domU use the same single CPU.
> Your results are kinda interesting but I think you'd probably be
> better off trying to compare like with like so that we can isolate
> the performance issues due to Xen/XenLinux, i.e.
I agree with your suggestion.
> - use the same kernel (or ported kernel) in each case;
I will use a 2.6 kernel for both dom0 and domU. For native Linux, the current kernel version is 2.4,
and I will have to convince the cluster administrator to upgrade it to 2.6 for a fair comparison, as
you point out.
> - use the same amount of memory in each case.
It is hard to use the same amount of memory, especially for domU, because dom0 will occupy
part of the 512MB physical memory. BTW, we think memory is unlikely to be a key factor in the
performance because the maximum message size is 4MB, we only test up to an 8-node cluster (8
processes), and the memory will not be overallocated.
> Otherwise you end up comparing 2.4 to 2.6, or 128MB/360MB/512MB, ...
>
> Also you should probably use the current unstable tree since there
> have been a number of performance fixes.
I will grab the current unstable tree and rerun the experiments by integrating the above
configuration improvements. I will send a new result update when I finish.
Thanks again for the help.
Xuehai
> cheers,
>
> S.
>
>
* Re: New MPI benchmark performance results (update)
2005-05-03 16:36 ` xuehai zhang
@ 2005-05-03 16:13 ` Mark Williamson
2005-05-03 16:58 ` xuehai zhang
0 siblings, 1 reply; 10+ messages in thread
From: Mark Williamson @ 2005-05-03 16:13 UTC (permalink / raw)
To: xen-devel; +Cc: Steven Hand, xuehai zhang
> I will grab the current unstable tree and rerun the experiments by
> integrating the above configuration improvements. I will send a new result
> update when I finish.
Any bug fixes should also have gone into the -testing tree, which is part of
the 2.0 series.
Cheers,
Mark
* Re: New MPI benchmark performance results (update)
2005-05-03 16:13 ` Mark Williamson
@ 2005-05-03 16:58 ` xuehai zhang
0 siblings, 0 replies; 10+ messages in thread
From: xuehai zhang @ 2005-05-03 16:58 UTC (permalink / raw)
To: Mark Williamson; +Cc: xen-devel
Mark Williamson wrote:
>>I will grab the current unstable tree and rerun the experiments by
>>integrating the above configuration improvements. I will send a new result
>>update when I finish.
>
>
> Any bug fixes should also have gone into the -testing tree, which is part of
> the 2.0 series.
Mark,
Thanks for the reminder. I will try the -testing tree instead of the -unstable tree then. BTW, could
you please provide me with some information about the current status of the Atropos scheduler? I sent
out an email earlier asking about it but have had no response so far. Thanks again.
Xuehai
* Re: New MPI benchmark performance results (update)
2005-05-03 9:11 New MPI benchmark performance results (update) xuehai zhang
2005-05-03 9:28 ` Steven Hand
@ 2005-05-03 20:24 ` Nivedita Singhvi
2005-05-03 22:05 ` xuehai zhang
1 sibling, 1 reply; 10+ messages in thread
From: Nivedita Singhvi @ 2005-05-03 20:24 UTC (permalink / raw)
To: xuehai zhang; +Cc: Xen-devel
xuehai zhang wrote:
> Hi all,
>
> In the following post I sent in early April
> (http://lists.xensource.com/archives/html/xen-devel/2005-04/msg00091.html),
Hi, thanks for sharing the data - it was interesting.
I tried to find additional data on the benchmarks using
the link you have for the user manual but it gave me
a 404 Error. It wasn't clear whether your benchmarks
use TCP or UDP or possibly raw sockets?
As has been pointed out by several people, running the
2.6 kernel and comparing apples to apples as much as possible
would help.
Is there any chance you kept some of the system statistics
and settings (netstat -s, sysctl -a info)?
Did you tune the settings for the system at all?
> Alltoall latency (max domU/linux > 60). The performance difference in
> the last example is HUGE and we could not think about a reasonable
> explanation why transferring 512B message size is so much different
> than other sizes. We appreciate if you can provide your insight to such
> a big performance problem in these benchmarks.
You have an anomalous point on most of the results - and again,
knowing what kind of traffic this is would really help.
thanks,
Nivedita
* Re: New MPI benchmark performance results (update)
2005-05-03 20:24 ` Nivedita Singhvi
@ 2005-05-03 22:05 ` xuehai zhang
0 siblings, 0 replies; 10+ messages in thread
From: xuehai zhang @ 2005-05-03 22:05 UTC (permalink / raw)
To: Nivedita Singhvi; +Cc: Xen-devel
Hi Nivedita,
Thanks for the response and the suggestion!
>> Hi all,
>>
>> In the following post I sent in early April
>> (http://lists.xensource.com/archives/html/xen-devel/2005-04/msg00091.html),
>
>
>
> Hi, thanks for sharing the data - it was interesting.
> I tried to find additional data on the benchmarks using
> the link you have for the user manual but it gave me
> a 404 Error.
I corrected the link error and now you can access the user manual through the link.
> It wasn't clear whether your benchmarks
> use TCP or UDP or possibly raw sockets?
I've read through the PMB user manual and it doesn't mention the communication protocol it uses.
However, several references state that "typically TCP/IP is the protocol used over Ethernet networks
for MPI communications".
> As has been pointed out by several people, running the
> 2.6 kernel and comparing apples to apples as much as possible
> would help.
I fully agree, and I am currently trying to rerun the experiments using the same kernel
version for both dom0 and domU (and maybe native Linux too).
>
> Is there any chance you kept some of the system statistics
> and settings (netstat -s, sysctl -a info)?
I did not collect them while running the benchmarks, but I will try to log them when I rerun the
experiments.
> Did you tune the settings for the system at all?
No, I did not do any specific things to tune the system.
>> Alltoall latency (max domU/linux > 60). The performance difference in
>> the last example is HUGE and we could not think about a reasonable
>> explanation why transferring 512B message size is so much different
>> than other sizes. We appreciate if you can provide your insight to
>> such a big performance problem in these benchmarks.
>
>
> You have an anomalous point on most of the results - and again,
> knowing what kind of traffic this is would really help.
I will try to dig into the source code and find out.
Thanks again for the help.
Xuehai
>
> thanks,
> Nivedita
* RE: New MPI benchmark performance results (update)
@ 2005-05-03 13:56 Ian Pratt
2005-05-03 16:48 ` xuehai zhang
0 siblings, 1 reply; 10+ messages in thread
From: Ian Pratt @ 2005-05-03 13:56 UTC (permalink / raw)
To: xuehai zhang, Xen-devel
>
> In the graphs presented on the webpage, we take the results
> of native Linux as the reference and normalize the other 3
> scenarios to it. We observe a general pattern that usually
> dom0 has a better performance than domU with SMP than domU
> without SMP (here better performance means low latency and
> high throughput). However, we also notice very big
> performance gap between domU (w/o SMP) and native linux (or
> dom0 because generally dom0 has a very similar performance as
> native linux). Some distinct examples are: 8-node SendRecv
> latency (max domU/linux score ~ 18), 8-node Allgather latency
> (max domU/linux score ~ 17), and 8-node Alltoall latency (max
> domU/linux > 60). The performance difference in the last
> example is HUGE and we could not think about a reasonable
> explanation why transferring 512B message size is so much
> different than other sizes. We appreciate if you can provide
> your insight to such a big performance problem in these benchmarks.
I still don't quite understand your experimental setup. What version of
Xen are you using? How many CPUs does each node have? How many domU's do
you run on a single node?
As regards the anomalous result for 512B AlltoAll performance, the best
way to track this down would be to use xen-oprofile. Is it reliably
repeatable? Really bad results are usually due to packets being dropped
somewhere -- there hasn't been a whole lot of effort put into UDP
performance because so few applications use it.
Ian
* Re: New MPI benchmark performance results (update)
2005-05-03 13:56 Ian Pratt
@ 2005-05-03 16:48 ` xuehai zhang
0 siblings, 0 replies; 10+ messages in thread
From: xuehai zhang @ 2005-05-03 16:48 UTC (permalink / raw)
To: Ian Pratt; +Cc: Xen-devel
Ian,
Thanks for the response.
>>In the graphs presented on the webpage, we take the results
>>of native Linux as the reference and normalize the other 3
>>scenarios to it. We observe a general pattern that usually
>>dom0 has a better performance than domU with SMP than domU
>>without SMP (here better performance means low latency and
>>high throughput). However, we also notice very big
>>performance gap between domU (w/o SMP) and native linux (or
>>dom0 because generally dom0 has a very similar performance as
>>native linux). Some distinct examples are: 8-node SendRecv
>>latency (max domU/linux score ~ 18), 8-node Allgather latency
>>(max domU/linux score ~ 17), and 8-node Alltoall latency (max
>>domU/linux > 60). The performance difference in the last
>>example is HUGE and we could not think about a reasonable
>>explanation why transferring 512B message size is so much
>>different than other sizes. We appreciate if you can provide
>>your insight to such a big performance problem in these benchmarks.
>
>
> I still don't quite understand your experimental setup. What version of
> Xen are you using? How many CPUs does each node have? How many domU's do
> you run on a single node?
The Xen version is 2.0. Each node has 2 CPUs. By "domU with SMP" in the previous email I mean
Xen is booted with SMP support (no "nosmp" option), with dom0 pinned to the 1st CPU and domU pinned
to the 2nd CPU; by "domU with no SMP" I mean Xen is booted without SMP support (with the "nosmp"
option) and both dom0 and domU use the same single CPU. There is only 1 domU running on a single
node for each experiment.
> As regards the anomalous result for 512B AlltoAll performance, the best
> way to track this down would be to use xen-oprofile.
I am not very familiar with xen-oprofile. I notice there are some discussions about it on the mailing
list. I wonder if there are any other documents that I can refer to. Thanks.
> Is it reliably repeatable?
Yes, the anomaly is repeatable. Each reported data point in the graph is the average of 10
runs of the same experiment at different times.
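As a side note on checking repeatability, averaging alone hides run-to-run variation. A minimal sketch of summarizing the 10 runs with a mean and a spread measure is given below; the function name and the sample latencies are hypothetical and do not come from the experiments.

```python
import statistics


def summarize(runs):
    """Return mean, sample standard deviation and coefficient of variation."""
    mean = statistics.mean(runs)
    # stdev needs at least two samples; treat a single run as zero spread.
    stdev = statistics.stdev(runs) if len(runs) > 1 else 0.0
    cv = stdev / mean if mean else 0.0
    return {"mean": mean, "stdev": stdev, "cv": cv}


# Hypothetical latencies (microseconds) from 10 runs, NOT measured data.
runs = [7100.0, 7250.0, 7180.0, 7300.0, 7050.0,
        7220.0, 7150.0, 7280.0, 7120.0, 7190.0]
print(summarize(runs))
```

A small coefficient of variation would suggest the anomaly is stable across runs rather than the product of one or two outliers.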
> Really bad results are usually due to packets being dropped
> somewhere -- there hasn't been a whole lot of effort put into UDP
> performance because so few applications use it.
To clarify: are you suggesting that a benchmark like AlltoAll might use UDP rather than TCP as its
transport protocol?
Thanks again for the help.
Xuehai
* RE: New MPI benchmark performance results (update)
@ 2005-05-03 19:09 Santos, Jose Renato G (Jose Renato Santos)
0 siblings, 0 replies; 10+ messages in thread
From: Santos, Jose Renato G (Jose Renato Santos) @ 2005-05-03 19:09 UTC (permalink / raw)
To: xuehai zhang, Ian Pratt; +Cc: Xen-devel
[-- Attachment #1: Type: text/plain, Size: 789 bytes --]
> I am not very familiar with xen-oprofile. I notice there are
> some discussions about it in the mailing
> list. I wonder if there is any other documents that I can
> refer to. Thanks.
>
Please see http://xenoprof.sourceforge.net for a description
of xenoprof and for downloading patches. (You will need 3
patches: one for xen, one for linux and one for oprofile.)
You need to be familiar with oprofile to use xenoprof. Please check
http://oprofile.sourceforge.net/ for more info on oprofile.
Xenoprof is currently available only for Xen 2.0.5.
I am working on getting it to xen unstable but there
is a problem with NMI handling which has not been solved yet.
I have also attached a text file that gives an overview
of xenoprof.
Renato
[-- Attachment #2: xenoprof.txt --]
[-- Type: text/plain, Size: 9206 bytes --]
XENOPROF - Performance profiling in Xen
=========================================
User Guide
============
Version: 1.0
Date: April 8, 2005
Copyright (C) 2005 Hewlett-Packard Co. (http://xenoprof.sourceforge.net)
(Aravind Menon, Jose Renato Santos, Yoshio Turner, G.(John) Janakiraman)
1. Overview
===========
This file provides an overview of Xenoprof, a system-wide statistical
profiling toolkit implemented for the Xen virtual machine environment.
The Xenoprof toolkit supports system-wide coordinated profiling in a
Xen environment to obtain the distribution of hardware events such as
clock cycles, instruction execution, TLB and cache misses, etc. Xenoprof
allows profiling of concurrently executing virtual machines (which
includes the operating system and applications running in each virtual
machine) and the Xen VMM itself. Xenoprof provides profiling data at the
fine granularity of individual processes and routines executing in either
the virtual machine or in the Xen VMM.
Xenoprof was developed at HP Labs by modifying and extending the
original OProfile code for linux (http://oprofile.sourceforge.net).
We assume the reader is familiar with OProfile and its tools. If you
are not familiar with OProfile we suggest that you read the OProfile
user manual at http://oprofile.sourceforge.net/docs before using Xenoprof.
System wide profiling in Xen requires the cooperation of 3 software
components at different levels of the software stack.
a) Xenoprof:
Extensions to the Xen hypervisor to support system-wide statistical
profiling. Xenoprof programs hardware performance counters to
generate sampling interrupts at regular event count intervals, and
handles the Non Maskable Interrupts (NMI) generated by the
performance counters at overflow. The NMI handler samples the
program counter (PC) at the time of interrupt and stores the PC
value in a per domain sample buffer. Domains interact with
Xenoprof using a specific hypercall. This hypercall enables
domains to define the hardware performance events to be sampled and
their parameters (e.g., overflow interval), as well as to control
the start and end of profiling. Domains are notified of new PC
samples in their respective sample buffers using the virtual
interrupt mechanism provided by Xen (e.g., event notification).
b) OProfile kernel module:
This module is responsible for interpreting the PC samples received
from Xenoprof and mapping the PC sample to the appropriate routine
in user, kernel or hypervisor level. The original OProfile kernel
module for linux was modified to use the Xenoprof interface instead
of accessing the hardware counters directly.
The OProfile module is organized in two main components: a low
level driver, specific to a particular CPU model, and a generic
module that is independent of the specific CPU model and implements
the higher level profiling functions. To enable OProfile to be
used with Xenoprof, a new low level driver specific to Xen was
created. This driver accesses the hypervisor through the exposed
Xenoprof interface, while the high level generic module was kept
almost unmodified, except for minor changes necessary to interpret
performance events associated with the hypervisor.
c) OProfile user level daemon and tools:
The user level daemon is responsible for collecting the performance
event samples from the kernel module and storing them in files for
later processing and reporting. The user level tools implement
commands that enable the user to start and stop a profiling
session, selecting the appropriate performance events and
parameters as well as to generate reports. In order to be used in
a Xen environment these tools were slightly modified. In
particular, new command line options were added to the opcontrol
command as described below.
2. Profiling multiple domains
=============================
A profiling session may profile one, a subset, or all domains running
in a particular physical machine. In every profiling session one of
the domains takes the role of the initiator, which is responsible for
configuring, starting and stopping the session. Other domains can be
included in the session as active participants or passive
participants. Active participants are domains which have an active
OProfile kernel module that can map a PC sample to the appropriate
routine in user, kernel or hypervisor level, given that the CPU was
executing that domain when the PC was sampled. Passive participants
do not need to be executing an OProfile kernel module. For these
domains performance profiling is done at a coarser granularity with PC
samples being assigned to the domain as a whole, instead of to specific
routines. Passive domains are useful when profiling systems running
domains with operating systems that do not support the OProfile kernel
module or equivalent. Note that the initiator must always be an
active domain. The initiator will process the PC samples of all
passive domains.
A performance event (generated when one of the hardware performance
counters overflows) is delivered to the appropriate domain, depending
on the type of domain running at the time of the event. If the
running domain is an active domain the PC sample is delivered to that
domain. If the running domain is a passive domain, the PC sample is
delivered to the initiator. If the running domain is not included in
the profiling session, the PC sample is discarded.
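The delivery rule in the paragraph above can be sketched as a small routing function. This is only an illustration of the rule as described in this guide, not Xenoprof's actual implementation; the names `Role` and `route_sample` and the domain ids are hypothetical.

```python
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


def route_sample(running_domain, roles, initiator):
    """Return the id of the domain that receives the PC sample.

    roles maps domain id -> Role for every domain in the session.
    Returns None when the sample is discarded.
    """
    role = roles.get(running_domain)
    if role is Role.ACTIVE:
        return running_domain   # active domains get their own samples
    if role is Role.PASSIVE:
        return initiator        # initiator processes passive samples
    return None                 # domain is not part of the session


# Hypothetical session: domain 0 is the initiator.
roles = {0: Role.ACTIVE, 2: Role.ACTIVE, 3: Role.PASSIVE}
print(route_sample(2, roles, initiator=0))  # 2
print(route_sample(3, roles, initiator=0))  # 0
print(route_sample(7, roles, initiator=0))  # None
```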
3. Extensions to OProfile user level commands
=============================================
A few command line options were added to OProfile command "opcontrol"
for use in Xen environments. The new command line options are:
a) --xen=<xen_image_file>
This option is used to specify the xen image (e.g. xen-syms). This
is used to resolve PC samples collected when executing the
hypervisor.
b) --active-domains=<list>
(where <list> is a list of comma separated domain ids)
This option is used in the initiator domain to specify the list of
active domains to be profiled. The initiator domain is always
considered an active domain, so including its id in the list is
optional.
For example: --active-domains=2,5,6 indicates that domains 2, 5 and
6 are active domains. Assuming that domain 0 was the initiator the
previous specification would be equivalent to
--active-domains=0,2,5,6.
c) --passive-domains=<list>
This option is used to specify the list of passive domains.
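The implicit inclusion of the initiator in the active list can be illustrated with a short sketch. The helper names are hypothetical and the code only builds the option string; it does not invoke opcontrol.

```python
def effective_active_domains(initiator, listed):
    """Return the sorted active set, always including the initiator."""
    return sorted(set(listed) | {initiator})


def active_domains_option(domains):
    """Format a list of domain ids as an --active-domains option."""
    return "--active-domains=" + ",".join(str(d) for d in domains)


# With domain 0 as initiator, listing 2,5,6 is equivalent to 0,2,5,6.
domains = effective_active_domains(0, [2, 5, 6])
print(domains)                         # [0, 2, 5, 6]
print(active_domains_option(domains))  # --active-domains=0,2,5,6
```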
Besides opcontrol no other OProfile commands were modified for use in
Xen environments.
Full system profiling reports can be easily obtained by concatenating
the individual reports of each active domain, using the regular
opreport command in each active domain. New tools that combine
multiple reports into a single system-wide report can be implemented in
the future.
4. Multi-domain profiling
=========================
In order to start and stop a profiling session across multiple domains,
a set of OProfile commands must be executed in those domains in
a coordinated way. A typical sequence of commands for starting and
stopping profiling is listed below.
A) Sequence of commands to start profiling:
   1) On the initiator domain
      > opcontrol --reset
        (clear out any previous data of current session)
      > opcontrol --start-daemon
           [--active-domains=<active_list>]
           [--passive-domains=<passive_list>] ...
        (start OProfile daemon and specify the set of active and
         passive domains in the session)
   2) On each active domain
      > opcontrol --reset
      > opcontrol --start
        (indicates domain is ready to process performance events)
   3) On initiator
      > opcontrol --start
        (Multi-domain profiling session starts)
        (This is only successful if all active domains are ready)
B) Sequence of commands to stop profiling
   1) On each active domain
      > opcontrol --stop
   2) On initiator domain
      > opcontrol --stop
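The ordering above can be sketched as a small planner that lists which command to run in which domain. It only produces the plan and does not execute anything. The function names are hypothetical, and the sketch assumes that "each active domain" in step 2 means the active domains other than the initiator.

```python
def start_plan(initiator, active, passive):
    """Return ordered (domain_id, command) steps to start a session."""
    others = [d for d in active if d != initiator]
    daemon = ["opcontrol", "--start-daemon"]
    if others:
        daemon.append("--active-domains=" + ",".join(map(str, others)))
    if passive:
        daemon.append("--passive-domains=" + ",".join(map(str, passive)))
    # Step 1: the initiator resets and starts the daemon.
    steps = [(initiator, "opcontrol --reset"), (initiator, " ".join(daemon))]
    # Step 2: every other active domain signals that it is ready.
    for d in others:
        steps.append((d, "opcontrol --reset"))
        steps.append((d, "opcontrol --start"))
    # Step 3: the initiator starts the multi-domain session last.
    steps.append((initiator, "opcontrol --start"))
    return steps


def stop_plan(initiator, active):
    """Return ordered (domain_id, command) steps to stop a session."""
    others = [d for d in active if d != initiator]
    return [(d, "opcontrol --stop") for d in others] + [
        (initiator, "opcontrol --stop")
    ]


# Hypothetical session: initiator 0, active domains 2 and 5, passive 3.
for domain, cmd in start_plan(0, [2, 5], [3]) + stop_plan(0, [2, 5]):
    print(f"dom{domain}: {cmd}")
```

Keeping the initiator's final `opcontrol --start` last mirrors the note that the session only starts successfully once all active domains are ready.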
5. Currently supported configurations
=====================================
a) Xen versions: Xen 2.0.3 to 2.0.5
b) Processor architecture: x86
c) Processor models: Pentium 4, Pentium III
d) Active Domains: Uniprocessor Linux 2.6 (no SMP, no Linux 2.4)
e) Passive Domains: No restriction
6. Patch files
==============
In order to run OProfile in Xen environments three patches are needed:
a) xenoprof-1.0-xen-2.0.5.patch
Patch for Xen hypervisor.
b) xenoprof-1.0-linux-2.6.10.patch
Patch for Linux 2.6.10 (Apply to linux-sparse tree in Xen source tree)
c) xenoprof-1.0-oprofile-0.8.1.patch
Patch for OProfile 0.8.1