From mboxrd@z Thu Jan  1 00:00:00 1970
From: George Dunlap <george.dunlap@eu.citrix.com>
Subject: Re: [PATCH 05 of 10 [RFC]] xl: Explicit node affinity
 specification for guests via config file
Date: Thu, 12 Apr 2012 11:24:46 +0100
Message-ID: <4F86AD6E.3050705@eu.citrix.com>
References: <patchbomb.1334150267@Solace>
	<7e76233448b02810f0ae.1334150272@Solace>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <7e76233448b02810f0ae.1334150272@Solace>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Dario Faggioli <raistlin@linux.it>
Cc: Andre Przywara <andre.przywara@amd.com>, Ian Campbell <Ian.Campbell@citrix.com>, Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>, Juergen Gross <juergen.gross@ts.fujitsu.com>, Ian Jackson <Ian.Jackson@eu.citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>, Jan Beulich <JBeulich@suse.com>
List-Id: xen-devel@lists.xenproject.org

On 11/04/12 14:17, Dario Faggioli wrote:
> Let the user specify the NUMA node affinity of a guest via the
> 'nodes=' config switch. Explicitly listing the intended target host
> nodes is required as of now.
>
> A valid setting will directly impact the node_affinity mask
> of the guest, i.e., it will change the behaviour of the low level
> memory allocator. On the other hand, this commit does not affect
> by any means how the guest's vCPUs are scheduled on the host's
> pCPUs.
I would probably break this into three separate patches, starting with 
the hypervisor, then libxc, then libxl, especially since they tend to 
have different maintainers and committers.
>
> TODO: * Better investigate and test interactions with cpupools.
>
> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -112,6 +112,15 @@ List of which cpus the guest is allowed
>   (all vcpus will run on cpus 0,2,3,5), or `cpus=["2", "3"]` (all vcpus
>   will run on cpus 2 and 3).
>
> +=item B<nodes=[ NODE, NODE, ...]>
> +
> +List of the NUMA nodes of the host the guest should be considered
> +`affine` with. Being affine to a (set of) NUMA node(s) mainly means
> +the guest's memory is going to be allocated on those node(s).
> +
> +A list of nodes should be specified as follows: `nodes=["0", "3"]`
> +for the guest to be affine with nodes 0 and 3.
> +
Hmm, I think that using "affine" here is technically correct, and is 
what one would use if writing a research paper; but it's unusual to hear 
the word in more common English; it would be more common to hear someone 
describe a VM as "having affinity with".  How about something like this:

"The list of NUMA nodes the domain is considered to have affinity with.  
Memory from the guest will be allocated from these nodes."

(Technically you're also not supposed to end a sentence with a 
preposition, but I think "...with which the domain is considered to have 
affinity" is just to awkward.)

Also, is there a way to specify that the affinity to be to all nodes 
and/or based on the vcpu mask of the vcpus?
>   =item B<memory=MBYTES>
>
>   Start the guest with MBYTES megabytes of RAM.
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -110,6 +110,44 @@ int xc_domain_shutdown(xc_interface *xch
>   }
>
>
> +int xc_domain_node_affinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            xc_nodemap_t node_affinity)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;
> +    int nodesize;
> +
> +    nodesize = xc_get_nodemap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get number of nodes");
> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if ( local == NULL )
> +    {
> +        PERROR("Could not allocate memory for domain_node_affinity domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_node_affinity;
> +    domctl.domain = (domid_t)domid;
> +
> +    memcpy(local, node_affinity, nodesize);
> +    set_xen_guest_handle(domctl.u.node_affinity.nodemap.bitmap, local);
> +    domctl.u.node_affinity.nodemap.nr_elems = nodesize * 8;
> +
> +    ret = do_domctl(xch,&domctl);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +
> + out:
> +    return ret;
> +}
> +
>   int xc_vcpu_setaffinity(xc_interface *xch,
>                           uint32_t domid,
>                           int vcpu,
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -520,6 +520,19 @@ int xc_watchdog(xc_interface *xch,
>                  uint32_t id,
>                  uint32_t timeout);
>
> +/**
> + * This function explicitly sets the affinity of a domain with the
> + * host NUMA nodes.
> + *
> + * @parm xch a handle to an open hypervisor interface.
> + * @parm domid the domain id in which vcpus are to be created.
> + * @parm node_affinity the map of the affine nodes.
> + * @return 0 on success, -1 on failure.
> + */
> +int xc_domain_node_affinity(xc_interface *xch,
> +                            uint32_t domind,
> +                            xc_nodemap_t node_affinity);
> +
Seems like it would be useful to have both a get and a set.
>   int xc_vcpu_setaffinity(xc_interface *xch,
>                           uint32_t domid,
>                           int vcpu,
> diff --git a/tools/libxl/gentest.py b/tools/libxl/gentest.py
> --- a/tools/libxl/gentest.py
> +++ b/tools/libxl/gentest.py
> @@ -20,7 +20,7 @@ def randomize_case(s):
>   def randomize_enum(e):
>       return random.choice([v.name for v in e.values])
>
> -handcoded = ["libxl_cpumap", "libxl_key_value_list",
> +handcoded = ["libxl_cpumap", "libxl_nodemap", "libxl_key_value_list",
>                "libxl_cpuid_policy_list", "libxl_file_reference",
>                "libxl_string_list"]
>
> @@ -119,6 +119,19 @@ static void libxl_cpumap_rand_init(libxl
>       }
>   }
>
> +static void libxl_nodemap_rand_init(libxl_nodemap *nodemap)
> +{
> +    int i;
> +    nodemap->size = rand() % 16;
> +    nodemap->map = calloc(nodemap->size, sizeof(*nodemap->map));
> +    libxl_for_each_node(i, *nodemap) {
> +        if (rand() % 2)
> +            libxl_nodemap_set(nodemap, i);
> +        else
> +            libxl_nodemap_reset(nodemap, i);
> +    }
> +}
> +
>   static void libxl_key_value_list_rand_init(libxl_key_value_list *pkvl)
>   {
>       int i, nr_kvp = rand() % 16;
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -3082,6 +3082,16 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
>       return rc;
>   }
>
> +int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid,
> +                            libxl_nodemap *nodemap)
> +{
> +    if (xc_domain_node_affinity(ctx->xch, domid, nodemap->map)) {
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity");
> +        return ERROR_FAIL;
> +    }
> +    return 0;
> +}
> +
>   int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap)
>   {
>       GC_INIT(ctx);
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -727,6 +727,8 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
>                              libxl_cpumap *cpumap);
>   int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
>                                  unsigned int max_vcpus, libxl_cpumap *cpumap);
> +int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid,
> +                            libxl_nodemap *nodemap);
Hmm -- is there really no libxl_get_vcpuaffinity()?  That seems to be a 
missing component...
>   int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap);
>
>   libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -121,6 +121,12 @@ int libxl__domain_build_info_setdefault(
>           libxl_cpumap_set_any(&b_info->cpumap);
>       }
>
> +    if (!b_info->nodemap.size) {
> +        if (libxl_nodemap_alloc(CTX,&b_info->nodemap))
> +            return ERROR_NOMEM;
> +        libxl_nodemap_set_none(&b_info->nodemap);
> +    }
> +
>       if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
>           b_info->max_memkb = 32 * 1024;
>       if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -67,6 +67,7 @@ int libxl__build_pre(libxl__gc *gc, uint
>       char *xs_domid, *con_domid;
>       xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
>       libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus,&info->cpumap);
> +    libxl_set_node_affinity(ctx, domid,&info->nodemap);
>       xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
>       if (info->type == LIBXL_DOMAIN_TYPE_PV)
>           xc_domain_set_memmap_limit(ctx->xch, domid,
> diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c
> --- a/tools/libxl/libxl_json.c
> +++ b/tools/libxl/libxl_json.c
> @@ -119,6 +119,26 @@ out:
>       return s;
>   }
>
> +yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand,
> +                                       libxl_nodemap *nodemap)
> +{
> +    yajl_gen_status s;
> +    int i;
> +
> +    s = yajl_gen_array_open(hand);
> +    if (s != yajl_gen_status_ok) goto out;
> +
> +    libxl_for_each_node(i, *nodemap) {
> +        if (libxl_nodemap_test(nodemap, i)) {
> +            s = yajl_gen_integer(hand, i);
> +            if (s != yajl_gen_status_ok) goto out;
> +        }
> +    }
> +    s = yajl_gen_array_close(hand);
> +out:
> +    return s;
> +}
> +
>   yajl_gen_status libxl_key_value_list_gen_json(yajl_gen hand,
>                                                 libxl_key_value_list *pkvl)
>   {
> diff --git a/tools/libxl/libxl_json.h b/tools/libxl/libxl_json.h
> --- a/tools/libxl/libxl_json.h
> +++ b/tools/libxl/libxl_json.h
> @@ -27,6 +27,7 @@ yajl_gen_status libxl_domid_gen_json(yaj
>   yajl_gen_status libxl_uuid_gen_json(yajl_gen hand, libxl_uuid *p);
>   yajl_gen_status libxl_mac_gen_json(yajl_gen hand, libxl_mac *p);
>   yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand, libxl_cpumap *p);
> +yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand, libxl_nodemap *p);
>   yajl_gen_status libxl_cpuid_policy_list_gen_json(yajl_gen hand,
>                                                    libxl_cpuid_policy_list *p);
>   yajl_gen_status libxl_string_list_gen_json(yajl_gen hand, libxl_string_list *p);
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -11,6 +11,7 @@ libxl_domid = Builtin("domid", json_fn =
>   libxl_uuid = Builtin("uuid", passby=PASS_BY_REFERENCE)
>   libxl_mac = Builtin("mac", passby=PASS_BY_REFERENCE)
>   libxl_cpumap = Builtin("cpumap", dispose_fn="libxl_cpumap_dispose", passby=PASS_BY_REFERENCE)
> +libxl_nodemap = Builtin("nodemap", dispose_fn="libxl_nodemap_dispose", passby=PASS_BY_REFERENCE)
>   libxl_cpuid_policy_list = Builtin("cpuid_policy_list", dispose_fn="libxl_cpuid_dispose", passby=PASS_BY_REFERENCE)
>
>   libxl_string_list = Builtin("string_list", dispose_fn="libxl_string_list_dispose", passby=PASS_BY_REFERENCE)
> @@ -233,6 +234,7 @@ libxl_domain_build_info = Struct("domain
>       ("max_vcpus",       integer),
>       ("cur_vcpus",       integer),
>       ("cpumap",          libxl_cpumap),
> +    ("nodemap",         libxl_nodemap),
>       ("tsc_mode",        libxl_tsc_mode),
>       ("max_memkb",       MemKB),
>       ("target_memkb",    MemKB),
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -515,7 +515,7 @@ static void parse_config_data(const char
>       const char *buf;
>       long l;
>       XLU_Config *config;
> -    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids;
> +    XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids;
>       int pci_power_mgmt = 0;
>       int pci_msitranslate = 1;
>       int pci_permissive = 0;
> @@ -628,6 +628,39 @@ static void parse_config_data(const char
>           free(buf2);
>       }
>
> +    if (!xlu_cfg_get_list (config, "nodes",&nodes, 0, 0)) {
> +        int i, n_nodes = 0;
> +
> +        if (libxl_nodemap_alloc(ctx,&b_info->nodemap)) {
> +            fprintf(stderr, "Unable to allocate nodemap\n");
> +            exit(1);
> +        }
> +
> +        libxl_nodemap_set_none(&b_info->nodemap);
> +        while ((buf = xlu_cfg_get_listitem(nodes, n_nodes)) != NULL) {
> +            i = atoi(buf);
> +            if (!libxl_nodemap_node_valid(&b_info->nodemap, i)) {
> +                fprintf(stderr, "node %d illegal\n", i);
> +                exit(1);
> +            }
> +            libxl_nodemap_set(&b_info->nodemap, i);
> +            n_nodes++;
> +        }
> +
> +        if (n_nodes == 0)
> +            fprintf(stderr, "No valid node found: no affinity will be set\n");
> +    }
> +    else {
> +        /*
> +         * XXX What would a sane default be? I think doing nothing
> +         *     (i.e., relying on cpu-affinity/cpupool ==>  the current
> +         *     behavior) should be just fine.
> +         *     That would mean we're saying to the user "if you want us
> +         *     to take care of NUMA, please tell us, maybe just putting
> +         *     'nodes=auto', but tell us... otherwise we do as usual".
> +         */
I think that for this patch, doing nothing is fine (which means removing 
the whole else clause).  But once we have the auto placement, I think 
that "nodes=auto" should be the default.
> +    }
> +
>       if (!xlu_cfg_get_long (config, "memory",&l, 0)) {
>           b_info->max_memkb = l * 1024;
>           b_info->target_memkb = b_info->max_memkb;
> diff --git a/tools/python/xen/lowlevel/xl/xl.c b/tools/python/xen/lowlevel/xl/xl.c
> --- a/tools/python/xen/lowlevel/xl/xl.c
> +++ b/tools/python/xen/lowlevel/xl/xl.c
> @@ -243,6 +243,18 @@ int attrib__libxl_cpumap_set(PyObject *v
>       return 0;
>   }
>
> +int attrib__libxl_nodemap_set(PyObject *v, libxl_nodemap *pptr)
> +{
> +    int i;
> +    long node;
> +
> +    for (i = 0; i<  PyList_Size(v); i++) {
> +        node = PyInt_AsLong(PyList_GetItem(v, i));
> +        libxl_nodemap_set(pptr, node);
> +    }
> +    return 0;
> +}
> +
>   int attrib__libxl_file_reference_set(PyObject *v, libxl_file_reference *pptr)
>   {
>       return genwrap__string_set(v,&pptr->path);
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -222,6 +222,7 @@ struct domain *domain_create(
>
>       spin_lock_init(&d->node_affinity_lock);
>       d->node_affinity = NODE_MASK_ALL;
> +    d->has_node_affinity = 0;
>
>       spin_lock_init(&d->shutdown_lock);
>       d->shutdown_code = -1;
> @@ -337,7 +338,7 @@ void domain_update_node_affinity(struct
>   {
>       cpumask_var_t cpumask;
>       cpumask_var_t online_affinity;
> -    const cpumask_t *online;
> +    const cpumask_t *online = cpupool_online_cpumask(d->cpupool);
>       nodemask_t nodemask = NODE_MASK_NONE;
>       struct vcpu *v;
>       unsigned int node;
> @@ -350,9 +351,22 @@ void domain_update_node_affinity(struct
>           return;
>       }
>
> -    online = cpupool_online_cpumask(d->cpupool);
> +    spin_lock(&d->node_affinity_lock);
>
> -    spin_lock(&d->node_affinity_lock);
> +    /*
> +     * If someone explicitly told us our NUMA affinity, avoid messing
> +     * that up. Notice, however, that vcpu affinity is still what
> +     * drives all scheduling decisions.
> +     *
> +     * XXX I'm quite sure this is fine wrt to the various v->cpu_affinity
> +     *     (at least, that was the intended design!). Could it be useful
> +     *     to cross-check d->node_affinity against `online`? The basic
> +     *     idea here is "Hey, if you shoot yourself in the foot... You've
> +     *     shot in the foot!", but, you know...
> +     */
> +    if ( d->has_node_affinity )
> +        goto out;
> +
>
>       for_each_vcpu ( d, v )
>       {
> @@ -365,6 +379,8 @@ void domain_update_node_affinity(struct
>               node_set(node, nodemask);
>
>       d->node_affinity = nodemask;
> +
> +out:
>       spin_unlock(&d->node_affinity_lock);
>
>       free_cpumask_var(online_affinity);
> @@ -372,6 +388,31 @@ void domain_update_node_affinity(struct
>   }
>
>
> +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
> +{
> +    spin_lock(&d->node_affinity_lock);
> +
> +    /*
> +     * Being/becoming affine to _no_ nodes is not going to work,
> +     * let's take it as the `reset node affinity` command.
> +     */
> +    if ( nodes_empty(*affinity) )
> +    {
> +        d->has_node_affinity = 0;
> +        spin_unlock(&d->node_affinity_lock);
> +        domain_update_node_affinity(d);
> +        return 0;
> +    }
> +
> +    d->has_node_affinity = 1;
> +    d->node_affinity = *affinity;
> +
> +    spin_unlock(&d->node_affinity_lock);
> +
> +    return 0;
> +}
> +
> +
>   struct domain *get_domain_by_id(domid_t dom)
>   {
>       struct domain *d;
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -621,6 +621,27 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc
>       }
>       break;
>
> +    case XEN_DOMCTL_node_affinity:
> +    {
> +        domid_t dom = op->domain;
> +        struct domain *d = rcu_lock_domain_by_id(dom);
> +        nodemask_t new_affinity;
> +
> +        ret = -ESRCH;
> +        if ( d == NULL )
> +            break;
> +
> +        /* XXX We need an xsm_* for this I guess, right? */
Yes. :-)
> +
> +        ret = xenctl_nodemap_to_nodemask(&new_affinity,
> +&op->u.node_affinity.nodemap);
> +        if ( !ret )
> +            ret = domain_set_node_affinity(d,&new_affinity);
> +
> +        rcu_unlock_domain(d);
> +    }
> +    break;
> +
>       case XEN_DOMCTL_setvcpuaffinity:
>       case XEN_DOMCTL_getvcpuaffinity:
>       {
> diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
> --- a/xen/common/keyhandler.c
> +++ b/xen/common/keyhandler.c
> @@ -217,6 +217,14 @@ static void cpuset_print(char *set, int
>       *set++ = '\0';
>   }
>
> +static void nodeset_print(char *set, int size, const nodemask_t *mask)
> +{
> +    *set++ = '[';
> +    set += nodelist_scnprintf(set, size-2, mask);
> +    *set++ = ']';
> +    *set++ = '\0';
> +}
> +
>   static void periodic_timer_print(char *str, int size, uint64_t period)
>   {
>       if ( period == 0 )
> @@ -272,6 +280,9 @@ static void dump_domains(unsigned char k
>
>           dump_pageframe_info(d);
>
> +        nodeset_print(tmpstr, sizeof(tmpstr),&d->node_affinity);
> +        printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr);
> +
>           printk("VCPU information and callbacks for domain %u:\n",
>                  d->domain_id);
>           for_each_vcpu ( d, v )
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -278,6 +278,15 @@ typedef struct xen_domctl_getvcpuinfo xe
>   DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t);
>
>
> +/* Set the NUMA node(s) with which the guest is `affine`. */
> +/* XEN_DOMCTL_node_affinity */
> +struct xen_domctl_node_affinity {
> +    struct xenctl_map nodemap;   /* IN */
> +};
> +typedef struct xen_domctl_node_affinity xen_domctl_node_affinity_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_domctl_node_affinity_t);
> +
> +
>   /* Get/set which physical cpus a vcpu can execute on. */
>   /* XEN_DOMCTL_setvcpuaffinity */
>   /* XEN_DOMCTL_getvcpuaffinity */
> @@ -914,6 +923,7 @@ struct xen_domctl {
>   #define XEN_DOMCTL_set_access_required           64
>   #define XEN_DOMCTL_audit_p2m                     65
>   #define XEN_DOMCTL_set_virq_handler              66
> +#define XEN_DOMCTL_node_affinity                 67
>   #define XEN_DOMCTL_gdbsx_guestmemio            1000
>   #define XEN_DOMCTL_gdbsx_pausevcpu             1001
>   #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
> @@ -927,6 +937,7 @@ struct xen_domctl {
>           struct xen_domctl_getpageframeinfo  getpageframeinfo;
>           struct xen_domctl_getpageframeinfo2 getpageframeinfo2;
>           struct xen_domctl_getpageframeinfo3 getpageframeinfo3;
> +        struct xen_domctl_node_affinity     node_affinity;
>           struct xen_domctl_vcpuaffinity      vcpuaffinity;
>           struct xen_domctl_shadow_op         shadow_op;
>           struct xen_domctl_max_mem           max_mem;
> diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -8,8 +8,9 @@
>    * See detailed comments in the file linux/bitmap.h describing the
>    * data type on which these nodemasks are based.
>    *
> - * For details of nodemask_scnprintf() and nodemask_parse(),
> - * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
> + * For details of nodemask_scnprintf(), nodelist_scnpintf() and
> + * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse()
> + * in lib/bitmap.c.
>    *
>    * The available nodemask operations are:
>    *
> @@ -48,6 +49,7 @@
>    * unsigned long *nodes_addr(mask)     Array of unsigned long's in mask
>    *
>    * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
> + * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing
>    * int nodemask_parse(ubuf, ulen, mask)        Parse ascii string as nodemask
>    *
>    * for_each_node_mask(node, mask)      for-loop node over mask
> @@ -280,6 +282,14 @@ static inline int __first_unset_node(con
>
>   #define nodes_addr(src) ((src).bits)
>
> +#define nodelist_scnprintf(buf, len, src) \
> +                       __nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES)
> +static inline int __nodelist_scnprintf(char *buf, int len,
> +                                       const nodemask_t *srcp, int nbits)
> +{
> +       return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
> +}
> +
>   #if 0
>   #define nodemask_scnprintf(buf, len, src) \
>                          __nodemask_scnprintf((buf), (len),&(src), MAX_NUMNODES)
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -346,8 +346,12 @@ struct domain
>       /* Various mem_events */
>       struct mem_event_per_domain *mem_event;
>
> -    /* Currently computed from union of all vcpu cpu-affinity masks. */
> +    /*
> +     * Can be specified by the user. If that is not the case, it is
> +     * computed from the union of all the vcpu cpu-affinity masks.
> +     */
>       nodemask_t node_affinity;
> +    int has_node_affinity;
There's something that seems a bit clunky about this; but I can't really 
think of a better way.  At every least I'd change the name of the 
element here to something more descriptive; perhaps "auto_node_affinity" 
(which would invert the meaning)?
>       unsigned int last_alloc_node;
>       spinlock_t node_affinity_lock;
>   };
> @@ -416,6 +420,7 @@ static inline void get_knownalive_domain
>       ASSERT(!(atomic_read(&d->refcnt)&  DOMAIN_DESTROYED));
>   }
>
> +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity);
>   void domain_update_node_affinity(struct domain *d);
>
>   struct domain *domain_create(