Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

From: George Dunlap <george.dunlap@citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Cc: Juergen Gross <jgross@suse.com>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>,
	"Luis R. Rodriguez" <mcgrof@do-not-panic.com>,
	David Vrabel <david.vrabel@citrix.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
Date: Thu, 27 Aug 2015 11:24:22 +0100
Message-ID: <55DEE556.3010802@citrix.com>
In-Reply-To: <1439913332.4239.134.camel@citrix.com>

On 08/18/2015 04:55 PM, Dario Faggioli wrote:
> Hey everyone,
> 
> So, as a followup of what we were discussing in this thread:
> 
>  [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>  http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
> 
> I started looking in more detail at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while this thing I'm proposing here is completely
> independent from them both.
> 
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are built.
> 
> The fact is that, unless we use 1:1 pinning that is immutable across
> the guest's whole lifetime, scheduling domains should not be
> constructed, in Linux, by looking at *any* topology information,
> because that just does not make any sense when vCPUs move around.
> 
> Let me state this again (hoping to make myself as clear as possible): no
> matter how good a shape we get CPUID support into, and no matter how
> beautifully and consistently it interacts with vNUMA, licensing
> requirements and whatever else, it will always be possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew its load balancing logic toward either of those
> two situations, as neither of them can be considered correct (since
> nothing is!).
> 
> For now, this only covers the PV case. The HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen
> there as well.
> 
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> vCPUs of the guest.
> 
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
> 
> This means that when, in the future, we fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have any
> unexpected side effects on scheduling domains.
> 
> Information about how the scheduling domains are being constructed
> during boot is available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
> 
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* directory
> in /proc/sys/kernel/sched_domain/ has only one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
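
The check described above (one 'domain0' subdirectory per cpu* directory)
can be scripted. What follows is only a minimal sketch, assuming the
cpu*/domainN layout under /proc/sys/kernel/sched_domain described in the
quoted text. The helper names domain_levels() and is_flat() are
hypothetical and not part of the patch, and the root path is a parameter
because that directory may not be present on every kernel.

```python
from pathlib import Path


def domain_levels(root="/proc/sys/kernel/sched_domain"):
    """Map each cpu* directory under root to its sorted domain* subdirectories.

    Returns an empty dict if root does not exist (hypothetical helper).
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    levels = {}
    for cpu_dir in sorted(base.glob("cpu*")):
        if cpu_dir.is_dir():
            levels[cpu_dir.name] = sorted(
                d.name for d in cpu_dir.glob("domain*") if d.is_dir()
            )
    return levels


def is_flat(levels):
    """True if at least one CPU was found and each has exactly ['domain0']."""
    return bool(levels) and all(doms == ["domain0"] for doms in levels.values())


if __name__ == "__main__":
    found = domain_levels()
    for cpu, doms in found.items():
        print(cpu, doms)
    print("flat hierarchy:", is_flat(found))
```

On a guest where the hierarchy has been flattened, is_flat() would be
expected to return True; any additional 'domain1', 'domain2', ... entries
make it return False.
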
> 
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on
> 16, 24 and 48 pCPU hosts. I've run the benchmarks in Dom0 only, for
> now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAIU).
> 
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> numbers of build jobs (in the case of the Xen build) or of parallel
> copies of the benchmarks (in the case of UnixBench).
> 
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones. There are a couple of cases
> where we regress, but always only slightly so, especially when compared
> to the magnitude of some of the improvements that we get.
> 
> Also bear in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
> we move things to DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
> 
> RESULTS
> =======
> To get a quick idea of how a benchmark went, look at the '%
> improvement' row of each table.
> 
> I'll put these results online, in a Google Docs spreadsheet or something
> like that, to make them easier to read, as soon as possible.
> 
> *** Intel(R) Xeon(R) E5620 @ 2.40GHz                                                                                                                    
> *** pCPUs      16        DOM0 vCPUS  16
> *** RAM        12285 MB  DOM0 Memory 9955 MB
> *** NUMA nodes 2         
> =======================================================================================================================================
> MAKE XEN (lower == better)                                                                                                                            
> =======================================================================================================================================
> # of build jobs                     -j1                   -j6                   -j8                   -j16**                -j24                
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               153.72     152.41      35.33      34.93       30.7      30.33      26.79      25.97      26.88      26.21
>                               153.81     152.76      35.37      34.99      30.81      30.36      26.83      26.08         27      26.24
>                               153.93     152.79      35.37      35.25      30.92      30.39      26.83      26.13      27.01      26.28
>                               153.94     152.94      35.39      35.28      31.05      30.43       26.9      26.14      27.01      26.44
>                               153.98     153.06      35.45      35.31      31.17       30.5      26.95      26.18      27.02      26.55
>                               154.01     153.23       35.5      35.35       31.2      30.59      26.98       26.2      27.05      26.61
>                               154.04     153.34      35.56      35.42      31.45      30.76      27.12      26.21      27.06      26.78
>                               154.16      153.5      37.79      35.58      31.68      30.83      27.16      26.23      27.16      26.78
>                               154.18     153.71      37.98      35.61      33.73       30.9      27.49      26.32      27.16       26.8
>                               154.9      154.67      38.03      37.64      34.69      31.69      29.82      26.38       27.2      28.63
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                        154.067    153.241     36.177     35.536      31.74     30.678     27.287     26.184     27.055     26.732
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     0.325      0.631      1.215      0.771      1.352      0.410      0.914      0.116      0.095      0.704
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                            0.536                 1.772                 3.346                 4.042                 1.194
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies                            1 parallel            6 parallel            8 parallel            16 parallel**         24 parallel
> vanilla/patched                          vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables       2302.2     2302.1    13157.8    12262.4    15691.5    15860.1    18927.7    19078.5    18654.3    18855.6
> Double-Precision Whetstone                  620.2      620.2     3481.2     3566.9     4669.2     4551.5     7610.1     7614.3    11558.9    11561.3
> Execl Throughput                            184.3      186.7      884.6      905.3     1168.4     1213.6     2134.6     2210.2     2250.9       2265
> File Copy 1024 bufsize 2000 maxblocks       780.8      783.3     1243.7     1255.5     1250.6     1215.7     1080.9     1094.2     1069.8     1062.5
> File Copy 256 bufsize 500 maxblocks         479.8      482.8      781.8      803.6      806.4        781      682.9      707.7      698.2      694.6
> File Copy 4096 bufsize 8000 maxblocks      1617.6     1593.5     2739.7     2943.4     2818.3     2957.8     2389.6     2412.6     2371.6     2423.8
> Pipe Throughput                             363.9      361.6     2068.6     2065.6       2622     2633.5     4053.3     4085.9     4064.7     4076.7
> Pipe-based Context Switching                 70.6      207.2      369.1     1126.8      623.9     1431.3     1970.4     2082.9     1963.8       2077
> Process Creation                            103.1        135        503      677.6      618.7      855.4       1138     1113.7     1195.6       1199
> Shell Scripts (1 concurrent)                723.2      765.3     4406.4     4334.4     5045.4     5002.5     5861.9     5844.2     5958.8     5916.1
> Shell Scripts (8 concurrent)               2243.7     2715.3     5694.7     5663.6     5694.7     5657.8     5637.1     5600.5     5582.9     5543.6
> System Call Overhead                          330      330.1     1669.2     1672.4     2028.6     1996.6     2920.5     2947.1     2923.9     2952.5
> System Benchmarks Index Score               496.8      567.5     1861.9       2106     2220.3     2441.3     2972.5     3007.9     3103.4     3125.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                       14.231                13.110                 9.954                 1.191                 0.706
> ====================================================================================================================================================
> 
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      24        DOM0 vCPUS  16
> *** RAM        36851 MB  DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs                     -j1                   -j8                   -j12                   -j24**               -j32
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               119.49     119.47      23.37      23.29      20.12      19.85      17.99       17.9      17.82       17.8
>                               119.59     119.64      23.52      23.31      20.16      19.99      18.19      18.05      18.23      17.89
>                               119.59     119.65      23.53      23.35      20.19      20.08      18.26      18.09      18.35      17.91
>                               119.72     119.75      23.63      23.41       20.2      20.14      18.54       18.1       18.4      17.95
>                               119.95     119.86      23.68      23.42      20.24      20.19      18.57      18.15      18.44      18.03
>                               119.97      119.9      23.72      23.51      20.38      20.31      18.61      18.21      18.49      18.03
>                               119.97     119.91      25.03      23.53      20.38      20.42      18.75      18.28      18.51      18.08
>                               120.01     119.98      25.05      23.93      20.39      21.69      19.99      18.49      18.52       18.6
>                               120.24     119.99      25.12      24.19      21.67      21.76      20.08      19.74      19.73      19.62
>                               120.66     121.22      25.16      25.36      21.94      21.85      20.26       20.3      19.92      19.81
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                        119.919    119.937     24.181      23.73     20.567     20.628     18.924     18.531     18.641     18.372
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     0.351      0.481      0.789      0.642      0.663      0.802      0.851      0.811      0.658      0.741
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                           -0.015                 1.865                -0.297                 2.077                 1.443
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies                            1 parallel            8 parallel             12 parallel           24 parallel**         32 parallel
> vanilla/patched                          vanilla     patched    vanilla     patched    vanilla    patched    vanilla    patched    vanilla    patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables       2650.1     2664.6    18967.8    19060.4    27534.1    27046.8    30077.9    30110.6    30542.1    30358.7
> Double-Precision Whetstone                  713.7      713.5     5463.6     5455.1     7863.9     7923.8    12725.1    12727.8    17474.3    17463.3
> Execl Throughput                            280.9      283.8     1724.4     1866.5     2029.5     2367.6       2370     2521.3       2453     2506.8
> File Copy 1024 bufsize 2000 maxblocks       891.1      894.2       1423     1457.7     1385.6     1482.2     1226.1     1224.2     1235.9     1265.5
> File Copy 256 bufsize 500 maxblocks         546.9      555.4        949      972.1      882.8      878.6      821.9      817.7      784.7      810.8
> File Copy 4096 bufsize 8000 maxblocks      1743.4     1722.8     3406.5     3438.9     3314.3     3265.9     2801.9     2788.3     2695.2     2781.5
> Pipe Throughput                             426.8      423.4     3207.9       3234     4635.1     4708.9       7326     7335.3     7327.2     7319.7
> Pipe-based Context Switching                110.2      223.5      680.8     1602.2      998.6     2324.6     3122.1     3252.7     3128.6     3337.2
> Process Creation                            130.7      224.4     1001.3     1043.6       1209     1248.2     1337.9     1380.4     1338.6     1280.1
> Shell Scripts (1 concurrent)               1140.5     1257.5     5462.8     6146.4     6435.3     7206.1     7425.2     7636.2     7566.1     7636.6
> Shell Scripts (8 concurrent)                 3492     3586.7     7144.9       7307       7258     7320.2     7295.1     7296.7     7248.6     7252.2
> System Call Overhead                        387.7      387.5     2398.4       2367     2793.8     2752.7     3735.7     3694.2     3752.1     3709.4
> System Benchmarks Index Score               634.8      712.6     2725.8     3005.7     3232.4     3569.7     3981.3     4028.8     4085.2     4126.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score)                       12.256                10.269                10.435                 1.193                 1.006
> ====================================================================================================================================================
> 
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs      48        DOM0 vCPUS  16
> *** RAM        393138 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs                     -j1                   -j20                   -j24                  -j48**               -j62
> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
> ---------------------------------------------------------------------------------------------------------------------------------------
>                               267.78     233.25      36.53      35.53      35.98      34.99      33.46      32.13      33.57      32.54
>                               268.42     233.92      36.82      35.56      36.12       35.2      34.24      32.24      33.64      32.56
>                               268.85     234.39      36.92      35.75      36.15      35.35      34.48      32.86      33.67      32.74
>                               268.98     235.11      36.96      36.01      36.25      35.46      34.73      32.89      33.97      32.83
>                               269.03     236.48      37.04      36.16      36.45      35.63      34.77      32.97      34.12      33.01
>                               269.54     237.05      40.33      36.59      36.57      36.15      34.97      33.09      34.18      33.52
>                               269.99     238.24      40.45      36.78      36.58      36.22      34.99      33.69      34.28      33.63
>                               270.11     238.48      41.13      39.98      40.22      36.24         38      33.92      34.35      33.87
>                               270.96     239.07      41.66      40.81      40.59      36.35      38.99      34.19      34.49      37.24
>                               271.84     240.89      42.07      41.24      40.63      40.06      39.07      36.04      34.69      37.59
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Avg.                         269.55    236.688     38.991     37.441     37.554     36.165      35.77     33.402     34.096     33.953
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Std. Dev.                     1.213      2.503      2.312      2.288      2.031      1.452      2.079      1.142      0.379      1.882
> ---------------------------------------------------------------------------------------------------------------------------------------
>  % improvement                           12.191                 3.975                 3.699                 6.620                 0.419
> ========================================================================================================================================
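
The summary rows of the quoted tables can be recomputed from the raw
numbers. This is a minimal sketch, assuming that 'Std. Dev.' is the
sample standard deviation and that '% improvement' / '% increase' are
relative changes against the vanilla value; both assumptions match the
-j1 column of the first MAKE XEN table and the 1-copy UnixBench Index
Score, but they are inferred from the figures, not stated in the
original mail. The function names are hypothetical.

```python
from statistics import mean, stdev

# Ten "MAKE XEN -j1" timings from the first (16 pCPUs) table.
vanilla = [153.72, 153.81, 153.93, 153.94, 153.98,
           154.01, 154.04, 154.16, 154.18, 154.90]
patched = [152.41, 152.76, 152.79, 152.94, 153.06,
           153.23, 153.34, 153.50, 153.71, 154.67]


def improvement_pct(baseline, new):
    """Relative reduction in percent; positive means `new` is lower (faster)."""
    return (baseline - new) / baseline * 100.0


def increase_pct(baseline, new):
    """Relative growth in percent; positive means `new` is higher."""
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Build time: lower is better, so compare with improvement_pct().
    print(f"vanilla avg {mean(vanilla):.3f}  std dev {stdev(vanilla):.3f}")
    print(f"patched avg {mean(patched):.3f}  std dev {stdev(patched):.3f}")
    print(f"% improvement {improvement_pct(mean(vanilla), mean(patched)):.3f}")
    # UnixBench Index Score (1 parallel copy): higher is better.
    print(f"% increase {increase_pct(496.8, 567.5):.3f}")
```

Using the sample (n - 1) standard deviation rather than the population
one is what reproduces the 0.325 and 0.631 figures in that column.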

I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
tests, you change the -j number (apparently) based on the number of
pcpus available to Xen.  Wouldn't it make more sense to stick with
1/6/8/16/24?  That would allow us to have actually comparable numbers.

But in any case, it seems to me that the numbers do show a uniform
improvement and no regressions -- I think this approach looks really
good, particularly as it is so small and well-contained.

 -George
