From: Dario Faggioli <raistlin@linux.it>
To: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Andre Przywara <andre.przywara@amd.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file
Date: Fri, 13 Apr 2012 00:21:37 +0200	[thread overview]
Message-ID: <1334269297.2417.23.camel@Abyss> (raw)
In-Reply-To: <4F86AD6E.3050705@eu.citrix.com>



On Thu, 2012-04-12 at 11:24 +0100, George Dunlap wrote: 
> > A valid setting will directly impact the node_affinity mask
> > of the guest, i.e., it will change the behaviour of the low level
> > memory allocator. On the other hand, this commit does not affect
> > by any means how the guest's vCPUs are scheduled on the host's
> > pCPUs.
> I would probably break this into three separate patches, starting with 
> the hypervisor, then libxc, then libxl, especially since they tend to 
> have different maintainers and committers.
>
That's fine.

> > +A list of nodes should be specified as follows: `nodes=["0", "3"]`
> > +for the guest to be affine with nodes 0 and 3.
> > +
> Hmm, I think that using "affine" here is technically correct, and is 
> what one would use if writing a research paper; but it's unusual to hear 
> the word in more common English; it would be more common to hear someone 
> describe a VM as "having affinity with".
>
I see...

> How about something like this:
> 
> "The list of NUMA nodes the domain is considered to have affinity with.  
> Memory from the guest will be allocated from these nodes."
> 
Sounds good, I'll go for it. Thanks for looking at this stuff too; I
really need (and enjoy!) some good English training. :-P

> Also, is there a way to specify that the affinity to be to all nodes 
> and/or based on the vcpu mask of the vcpus?
>
Well, yes and no. :-) Right now, if you do not specify any affinity,
the default is 'all nodes'. As we might want to change that, I'll add
something like `nodes="all"`. Similarly, if you don't specify any
"nodes=" but you do have some vcpu-affinity, this patch won't call any
node_affinity setting routine, so the current behaviour of building it
up from the vcpu-affinity is retained. Even with the later patches
that introduce automatic placement, if no node-affinity is specified
while a vcpu-affinity is, the algorithm will try to place the domain
where the vcpu-affinity points.

So, summarizing: I'll add a means of saying "all nodes", while I don't
think anything is needed for saying "do what vcpu-affinity wants". I'm
more than open to suggestions and alternatives if you think this is
not good or clear enough.
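
Just to make it concrete, the relevant bits of a config file would
then look like the sketch below (note that the `nodes="all"` form is
only my proposal at this point and is not accepted by the current
patch):

    # memory for the guest is allocated on (and the domain is affine
    # to) NUMA nodes 0 and 3
    nodes = ["0", "3"]

    # proposed, not yet implemented, way of asking for "any node"
    #nodes = "all"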

> > +/**
> > + * This function explicitly sets the affinity of a domain with the
> > + * host NUMA nodes.
> > + *
> > + * @parm xch a handle to an open hypervisor interface.
> > + * @parm domid the domain id whose node affinity is being set.
> > + * @parm node_affinity the map of the affine nodes.
> > + * @return 0 on success, -1 on failure.
> > + */
> > +int xc_domain_node_affinity(xc_interface *xch,
> > +                            uint32_t domid,
> > +                            xc_nodemap_t node_affinity);
> > +
> Seems like it would be useful to have both a get and a set.
>
Ok, there isn't one right now, but I agree it would make for a saner
interface. I will add a *_get_affinity (and change this one to
*_set_affinity).
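
Something along these lines, I guess (just a sketch: the final names
and exact signatures in the next revision may well differ):

    /* Sketch only: set/get a domain's NUMA node affinity. The names
     * below follow the *_set_affinity/*_get_affinity idea above and
     * are not the final ones. */
    int xc_domain_node_set_affinity(xc_interface *xch,
                                    uint32_t domid,
                                    xc_nodemap_t node_affinity);
    int xc_domain_node_get_affinity(xc_interface *xch,
                                    uint32_t domid,
                                    xc_nodemap_t node_affinity);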

> >   int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap)
> >   {
> >       GC_INIT(ctx);
> > diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> > --- a/tools/libxl/libxl.h
> > +++ b/tools/libxl/libxl.h
> > @@ -727,6 +727,8 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
> >                              libxl_cpumap *cpumap);
> >   int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
> >                                  unsigned int max_vcpus, libxl_cpumap *cpumap);
> > +int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid,
> > +                            libxl_nodemap *nodemap);
> Hmm -- is there really no libxl_get_vcpuaffinity()?  That seems to be a 
> missing component...
>
Apparently not. It seems that `xl vcpu-list' ends up in
print_vcpuinfo() (xl.c:4268 here), which figures things out on its own
with something like this:

    /* CPU AFFINITY */
    print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout);

Should we change this?
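
If we do want a proper getter there, I'd imagine something shaped like
the below (purely hypothetical, just mirroring the existing setter;
not part of this series):

    /* Hypothetical counterpart to libxl_set_vcpuaffinity(); only the
     * kind of interface being discussed, not proposed code. */
    int libxl_get_vcpuaffinity(libxl_ctx *ctx, uint32_t domid,
                               uint32_t vcpuid, libxl_cpumap *cpumap);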

> > +    }
> > +    else {
> > +        /*
> > +         * XXX What would a sane default be? I think doing nothing
> > +         *     (i.e., relying on cpu-affinity/cpupool ==>  the current
> > +         *     behavior) should be just fine.
> > +         *     That would mean we're saying to the user "if you want us
> > +         *     to take care of NUMA, please tell us, maybe just putting
> > +         *     'nodes=auto', but tell us... otherwise we do as usual".
> > +         */
> I think that for this patch, doing nothing is fine (which means removing 
> the whole else clause).  But once we have the auto placement, I think 
> that "nodes=auto" should be the default.
>
Ok, that makes sense. I'm running benchmarks comparing the three
different policies I've implemented so far (see patches 8 and 9), to
understand whether they make sense at all and, if they do, which one
is best suited to be the default policy. I will post the results as
soon as they finish.

> > +    case XEN_DOMCTL_node_affinity:
> > +    {
> > +        domid_t dom = op->domain;
> > +        struct domain *d = rcu_lock_domain_by_id(dom);
> > +        nodemask_t new_affinity;
> > +
> > +        ret = -ESRCH;
> > +        if ( d == NULL )
> > +            break;
> > +
> > +        /* XXX We need an xsm_* for this I guess, right? */
> Yes. :-)
>
Ok, I'll look into it and put it together for the next release of this
series. I'm planning to model it closely on the one for
*_vcpu_affinity; do you think that would be the right approach?
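
Concretely, I'd expect the domctl handler to grow something like this
(only a sketch modelled on the vcpu-affinity case; the hook name is an
assumption, not an existing XSM function):

    /* Sketch: XSM check for the new node-affinity domctl, placed
     * right after the domain lookup. xsm_nodeaffinity() is a
     * hypothetical name, mirroring the vcpu-affinity hook. */
    ret = xsm_nodeaffinity(op->cmd, d);
    if ( ret )
    {
        rcu_unlock_domain(d);
        break;
    }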

> > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> > --- a/xen/include/xen/sched.h
> > +++ b/xen/include/xen/sched.h
> > @@ -346,8 +346,12 @@ struct domain
> >       /* Various mem_events */
> >       struct mem_event_per_domain *mem_event;
> >
> > -    /* Currently computed from union of all vcpu cpu-affinity masks. */
> > +    /*
> > +     * Can be specified by the user. If that is not the case, it is
> > +     * computed from the union of all the vcpu cpu-affinity masks.
> > +     */
> >       nodemask_t node_affinity;
> > +    int has_node_affinity;
> There's something that seems a bit clunky about this; but I can't really 
> think of a better way.  At every least I'd change the name of the 
> element here to something more descriptive; perhaps "auto_node_affinity" 
> (which would invert the meaning)?
>
You know what, I really don't like this either, but I need something
that tells me whether the node affinity should be computed from the
vcpu-affinity/cpupool, or whether the user has set it to something I
should not touch, and I did not find any other way of doing that than
adding this "flag". I sort of took inspiration from d->is_pinned,
although, yes, I know, that one is used for different purposes.

Any suggestion is welcome; in the meantime, I'll change the name and
the semantics to what you're suggesting.
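
Just to make the intended semantics of the (renamed) flag concrete,
the recompute path would end up looking roughly like this (a sketch;
the nodemask variable name is made up):

    /* Sketch: in domain_update_node_affinity(), only overwrite the
     * nodemask while it is still "automatic", i.e. while the user has
     * not set it explicitly. */
    if ( d->auto_node_affinity )
        d->node_affinity = nodemask_from_vcpu_affinities; /* made-up name */
    /* otherwise, keep the user-provided node_affinity untouched */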

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


