Linux virtualization list


* [RFC V2 PATCH 1/4] option: introduce qemu_get_opt_all()
From: Jason Wang @ 2012-06-25 10:04 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <20120625095059.8096.49429.stgit@amd-6168-8-1.englab.nay.redhat.com>

Sometimes we need to pass options such as -netdev tap,fd=100,fd=101,fd=102,
which cannot be parsed properly by qemu_opt_find() because it returns only a
single matching option. So qemu_opt_get_all() is introduced: it fills an array
of pointers with all matching options, up to a caller-supplied maximum, and
returns the number of matches stored.
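
To illustrate the collect-all-matches idea outside of QEMU, here is a minimal
standalone sketch. It is not the code from this patch: struct kv_opt and
kv_get_all() are hypothetical names, and it walks a plain array front to back,
whereas the patch walks the QemuOpts tail queue in reverse.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one parsed "name=value" option. */
struct kv_opt {
    const char *name;
    const char *str;
};

/*
 * Store up to max values whose name matches in out[] and return how many
 * were stored. Matches beyond max are ignored, so a caller that expects an
 * exact count should compare the return value against it.
 */
int kv_get_all(const struct kv_opt *opts, size_t nopts,
               const char *name, const char **out, int max)
{
    int found = 0;
    size_t i;

    assert(opts != NULL || nopts == 0);

    for (i = 0; i < nopts && found < max; i++) {
        if (strcmp(opts[i].name, name) == 0) {
            out[found++] = opts[i].str;
        }
    }
    return found;
}
```

A caller would pass a fixed-size array and its capacity, then compare the
returned count with the number of queues it expects.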

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 qemu-option.c |   19 +++++++++++++++++++
 qemu-option.h |    2 ++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/qemu-option.c b/qemu-option.c
index bb3886c..9263125 100644
--- a/qemu-option.c
+++ b/qemu-option.c
@@ -545,6 +545,25 @@ static QemuOpt *qemu_opt_find(QemuOpts *opts, const char *name)
     return NULL;
 }
 
+int qemu_opt_get_all(QemuOpts *opts, const char *name, const char **optp,
+                     int max)
+{
+    QemuOpt *opt;
+    int index = 0;
+
+    QTAILQ_FOREACH_REVERSE(opt, &opts->head, QemuOptHead, next) {
+        if (strcmp(opt->name, name) == 0) {
+            if (index < max) {
+                optp[index++] = opt->str;
+            }
+            if (index == max) {
+                break;
+            }
+        }
+    }
+    return index;
+}
+
 const char *qemu_opt_get(QemuOpts *opts, const char *name)
 {
     QemuOpt *opt = qemu_opt_find(opts, name);
diff --git a/qemu-option.h b/qemu-option.h
index 951dec3..3c9a273 100644
--- a/qemu-option.h
+++ b/qemu-option.h
@@ -106,6 +106,8 @@ struct QemuOptsList {
     QemuOptDesc desc[];
 };
 
+int qemu_opt_get_all(QemuOpts *opts, const char *name, const char **optp,
+                     int max);
 const char *qemu_opt_get(QemuOpts *opts, const char *name);
 bool qemu_opt_get_bool(QemuOpts *opts, const char *name, bool defval);
 uint64_t qemu_opt_get_number(QemuOpts *opts, const char *name, uint64_t defval);


* [RFC V2 PATCH 2/4] tap: multiqueue support
From: Jason Wang @ 2012-06-25 10:04 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <20120625095059.8096.49429.stgit@amd-6168-8-1.englab.nay.redhat.com>

This patch adds basic support for multiqueue-capable tap devices. When
multiqueue is enabled for a tap device, the user can attach/detach multiple
files (sockets) to the device through TUNATTACHQUEUE/TUNDETACHQUEUE.

Two helpers, tap_attach() and tap_detach(), are introduced to attach and detach
a file. They call the platform-specific helpers tap_fd_attach() and
tap_fd_detach(); only the Linux versions have real implementations, since
multiqueue tap is supported only on Linux.
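
The attach/detach helpers are meant to be idempotent: a queue that is already
in the requested state is left alone, and the enabled flag changes only when
the backend call succeeds. A minimal standalone sketch of that pattern
follows. All names here (struct queue_ops, struct tap_queue, queue_attach(),
queue_detach() and the stub_* backend) are hypothetical, and the stub only
counts calls instead of issuing any ioctl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend hooks: return 0 on success, negative on failure. */
struct queue_ops {
    int (*attach)(int fd);
    int (*detach)(int fd);
};

/* Hypothetical per-queue state: a descriptor plus an enabled flag. */
struct tap_queue {
    int fd;
    unsigned int enabled : 1;
    const struct queue_ops *ops;
};

/* Attach only if currently detached; flip the flag only on success. */
int queue_attach(struct tap_queue *q)
{
    int ret;

    assert(q != NULL && q->ops != NULL);
    if (q->enabled) {
        return 0;
    }
    ret = q->ops->attach(q->fd);
    if (ret == 0) {
        q->enabled = 1;
    }
    return ret;
}

/* Detach only if currently attached; flip the flag only on success. */
int queue_detach(struct tap_queue *q)
{
    int ret;

    assert(q != NULL && q->ops != NULL);
    if (!q->enabled) {
        return 0;
    }
    ret = q->ops->detach(q->fd);
    if (ret == 0) {
        q->enabled = 0;
    }
    return ret;
}

/* Counting stub backend, for demonstration only. */
int stub_backend_calls;

int stub_attach(int fd)
{
    (void)fd;
    stub_backend_calls++;
    return 0;
}

int stub_detach(int fd)
{
    (void)fd;
    stub_backend_calls++;
    return 0;
}

const struct queue_ops stub_ops = { stub_attach, stub_detach };
```

Because a no-op call returns 0, callers can enable or disable queues without
tracking the current state themselves.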

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 net.c             |    4 +
 net/tap-aix.c     |   13 +++-
 net/tap-bsd.c     |   13 +++-
 net/tap-haiku.c   |   13 +++-
 net/tap-linux.c   |   55 +++++++++++++++
 net/tap-linux.h   |    3 +
 net/tap-solaris.c |   13 +++-
 net/tap-win32.c   |   11 +++
 net/tap.c         |  189 ++++++++++++++++++++++++++++++++++-------------------
 net/tap.h         |    7 ++
 10 files changed, 245 insertions(+), 76 deletions(-)

diff --git a/net.c b/net.c
index 4aa416c..eabe830 100644
--- a/net.c
+++ b/net.c
@@ -978,6 +978,10 @@ static const struct {
                 .name = "vhostforce",
                 .type = QEMU_OPT_BOOL,
                 .help = "force vhost on for non-MSIX virtio guests",
+            }, {
+                .name = "queues",
+                .type = QEMU_OPT_NUMBER,
+                .help = "number of queues the backend can provide",
         },
 #endif /* _WIN32 */
             { /* end of list */ }
diff --git a/net/tap-aix.c b/net/tap-aix.c
index e19aaba..f111e0f 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -25,7 +25,8 @@
 #include "net/tap.h"
 #include <stdio.h>
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     fprintf(stderr, "no tap on AIX\n");
     return -1;
@@ -59,3 +60,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+    return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 937a94b..44f3421 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -33,7 +33,8 @@
 #include <net/if_tap.h>
 #endif
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     int fd;
 #ifdef TAPGIFNAME
@@ -145,3 +146,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+    return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-haiku.c b/net/tap-haiku.c
index 91dda8e..6fb6719 100644
--- a/net/tap-haiku.c
+++ b/net/tap-haiku.c
@@ -25,7 +25,8 @@
 #include "net/tap.h"
 #include <stdio.h>
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     fprintf(stderr, "no tap on Haiku\n");
     return -1;
@@ -59,3 +60,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+    return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 41d581b..5d74b53 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -35,7 +35,8 @@
 
 #define PATH_NET_TUN "/dev/net/tun"
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     struct ifreq ifr;
     int fd, ret;
@@ -47,6 +48,8 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
     }
     memset(&ifr, 0, sizeof(ifr));
     ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+    if (!attach)
+        ifr.ifr_flags |= IFF_MULTI_QUEUE;
 
     if (*vnet_hdr) {
         unsigned int features;
@@ -71,7 +74,10 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
         pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
     else
         pstrcpy(ifr.ifr_name, IFNAMSIZ, "tap%d");
-    ret = ioctl(fd, TUNSETIFF, (void *) &ifr);
+    if (attach)
+        ret = ioctl(fd, TUNATTACHQUEUE, (void *) &ifr);
+    else
+        ret = ioctl(fd, TUNSETIFF, (void *) &ifr);
     if (ret != 0) {
         if (ifname[0] != '\0') {
             error_report("could not configure %s (%s): %m", PATH_NET_TUN, ifr.ifr_name);
@@ -197,3 +203,48 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
         }
     }
 }
+
+/* Attach a file descriptor to a TUN/TAP device. The descriptor must
+ * currently be detached.
+ */
+int tap_fd_attach(int fd, const char *ifname)
+{
+    struct ifreq ifr;
+    int ret;
+
+    memset(&ifr, 0, sizeof(ifr));
+
+    ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
+    pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
+
+    ret = ioctl(fd, TUNATTACHQUEUE, (void *) &ifr);
+
+    if (ret != 0) {
+        error_report("could not attach to %s", ifname);
+    }
+
+    return ret;
+}
+
+/* Detach a file descriptor from a TUN/TAP device. This file descriptor must
+ * have been attached to a device.
+ */
+int tap_fd_detach(int fd, const char *ifname)
+{
+    struct ifreq ifr;
+    int ret;
+
+    memset(&ifr, 0, sizeof(ifr));
+
+    ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
+    pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
+
+    ret = ioctl(fd, TUNDETACHQUEUE, (void *) &ifr);
+
+    if (ret != 0) {
+        error_report("could not detach from %s", ifname);
+    }
+
+    return ret;
+}
+
diff --git a/net/tap-linux.h b/net/tap-linux.h
index 659e981..0f5e34e 100644
--- a/net/tap-linux.h
+++ b/net/tap-linux.h
@@ -29,6 +29,8 @@
 #define TUNSETSNDBUF   _IOW('T', 212, int)
 #define TUNGETVNETHDRSZ _IOR('T', 215, int)
 #define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNATTACHQUEUE  _IOW('T', 217, int)
+#define TUNDETACHQUEUE  _IOW('T', 218, int)
 
 #endif
 
@@ -36,6 +38,7 @@
 #define IFF_TAP		0x0002
 #define IFF_NO_PI	0x1000
 #define IFF_VNET_HDR	0x4000
+#define IFF_MULTI_QUEUE 0x0100
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM	0x01	/* You can hand me unchecksummed packets. */
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index cf76463..f7c8e8d 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -173,7 +173,8 @@ static int tap_alloc(char *dev, size_t dev_size)
     return tap_fd;
 }
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach)
 {
     char  dev[10]="";
     int fd;
@@ -225,3 +226,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
                         int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_attach(int fd, const char *ifname)
+{
+    return -1;
+}
+
+int tap_fd_detach(int fd, const char *ifname)
+{
+    return -1;
+}
diff --git a/net/tap-win32.c b/net/tap-win32.c
index a801a55..dae1c00 100644
--- a/net/tap-win32.c
+++ b/net/tap-win32.c
@@ -749,3 +749,14 @@ struct vhost_net *tap_get_vhost_net(VLANClientState *nc)
 {
     return NULL;
 }
+
+int tap_attach(VLANClientState *nc)
+{
+    return -1;
+}
+
+int tap_detach(VLANClientState *nc)
+{
+    return -1;
+}
+
diff --git a/net/tap.c b/net/tap.c
index 5ac4ba3..2b9dcb5 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -53,11 +53,13 @@ typedef struct TAPState {
     int fd;
     char down_script[1024];
     char down_script_arg[128];
+    char ifname[128];
     uint8_t buf[TAP_BUFSIZE];
     unsigned int read_poll : 1;
     unsigned int write_poll : 1;
     unsigned int using_vnet_hdr : 1;
     unsigned int has_ufo: 1;
+    unsigned int enabled:1;
     VHostNetState *vhost_net;
     unsigned host_vnet_hdr_len;
 } TAPState;
@@ -546,7 +548,7 @@ int net_init_bridge(QemuOpts *opts, const char *name, VLANState *vlan)
     return 0;
 }
 
-static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
+static int net_tap_init(QemuOpts *opts, int *vnet_hdr, int attach)
 {
     int fd, vnet_hdr_required;
     char ifname[128] = {0,};
@@ -563,7 +565,9 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
         vnet_hdr_required = 0;
     }
 
-    TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr, vnet_hdr_required));
+    TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr, vnet_hdr_required,
+                      attach));
+
     if (fd < 0) {
         return -1;
     }
@@ -572,7 +576,7 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
     if (setup_script &&
         setup_script[0] != '\0' &&
         strcmp(setup_script, "no") != 0 &&
-        launch_script(setup_script, ifname, fd)) {
+        (!attach && launch_script(setup_script, ifname, fd))) {
         close(fd);
         return -1;
     }
@@ -582,74 +586,11 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
     return fd;
 }
 
-int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan)
+static int __net_init_tap(QemuOpts *opts, Monitor *mon, const char *name,
+                          VLANState *vlan, int fd, int vnet_hdr)
 {
-    TAPState *s;
-    int fd, vnet_hdr = 0;
-    const char *model;
-
-    if (qemu_opt_get(opts, "fd")) {
-        if (qemu_opt_get(opts, "ifname") ||
-            qemu_opt_get(opts, "script") ||
-            qemu_opt_get(opts, "downscript") ||
-            qemu_opt_get(opts, "vnet_hdr") ||
-            qemu_opt_get(opts, "helper")) {
-            error_report("ifname=, script=, downscript=, vnet_hdr=, "
-                         "and helper= are invalid with fd=");
-            return -1;
-        }
-
-        fd = net_handle_fd_param(cur_mon, qemu_opt_get(opts, "fd"));
-        if (fd == -1) {
-            return -1;
-        }
-
-        fcntl(fd, F_SETFL, O_NONBLOCK);
-
-        vnet_hdr = tap_probe_vnet_hdr(fd);
-
-        model = "tap";
-
-    } else if (qemu_opt_get(opts, "helper")) {
-        if (qemu_opt_get(opts, "ifname") ||
-            qemu_opt_get(opts, "script") ||
-            qemu_opt_get(opts, "downscript") ||
-            qemu_opt_get(opts, "vnet_hdr")) {
-            error_report("ifname=, script=, downscript=, and vnet_hdr= "
-                         "are invalid with helper=");
-            return -1;
-        }
-
-        fd = net_bridge_run_helper(qemu_opt_get(opts, "helper"),
-                                   DEFAULT_BRIDGE_INTERFACE);
-        if (fd == -1) {
-            return -1;
-        }
-
-        fcntl(fd, F_SETFL, O_NONBLOCK);
-
-        vnet_hdr = tap_probe_vnet_hdr(fd);
-
-        model = "bridge";
-
-    } else {
-        if (!qemu_opt_get(opts, "script")) {
-            qemu_opt_set(opts, "script", DEFAULT_NETWORK_SCRIPT);
-        }
-
-        if (!qemu_opt_get(opts, "downscript")) {
-            qemu_opt_set(opts, "downscript", DEFAULT_NETWORK_DOWN_SCRIPT);
-        }
+    TAPState *s = net_tap_fd_init(vlan, "tap", name, fd, vnet_hdr);
 
-        fd = net_tap_init(opts, &vnet_hdr);
-        if (fd == -1) {
-            return -1;
-        }
-
-        model = "tap";
-    }
-
-    s = net_tap_fd_init(vlan, model, name, fd, vnet_hdr);
     if (!s) {
         close(fd);
         return -1;
@@ -671,6 +612,7 @@ int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan)
         script     = qemu_opt_get(opts, "script");
         downscript = qemu_opt_get(opts, "downscript");
 
+        pstrcpy(s->ifname, sizeof(s->ifname), ifname);
         snprintf(s->nc.info_str, sizeof(s->nc.info_str),
                  "ifname=%s,script=%s,downscript=%s",
                  ifname, script, downscript);
@@ -704,6 +646,82 @@ int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan)
         return -1;
     }
 
+    s->enabled = 1;
+    return 0;
+}
+
+int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan)
+{
+    int i, fd, vnet_hdr = 0;
+    int numqueues = qemu_opt_get_number(opts, "queues", 1);
+
+    if (qemu_opt_get(opts, "fd")) {
+        const char *fdp[16];
+        if (qemu_opt_get(opts, "ifname") ||
+            qemu_opt_get(opts, "script") ||
+            qemu_opt_get(opts, "downscript") ||
+            qemu_opt_get(opts, "vnet_hdr") ||
+            qemu_opt_get(opts, "helper")) {
+            error_report("ifname=, script=, downscript=, vnet_hdr=, "
+                         "and helper= are invalid with fd=");
+            return -1;
+        }
+
+        if (numqueues != qemu_opt_get_all(opts, "fd", fdp, 16)) {
+            error_report("the number of queues does not match the "
+                         "number of fds passed");
+            return -1;
+        }
+
+        for (i = 0; i < numqueues; i++) {
+            fd = net_handle_fd_param(cur_mon, fdp[i]);
+            if (fd == -1) {
+                return -1;
+            }
+
+            fcntl(fd, F_SETFL, O_NONBLOCK);
+
+            vnet_hdr = tap_probe_vnet_hdr(fd);
+
+            __net_init_tap(opts, cur_mon, name, vlan, fd, vnet_hdr);
+        }
+    } else if (qemu_opt_get(opts, "helper")) {
+        if (qemu_opt_get(opts, "ifname") ||
+            qemu_opt_get(opts, "script") ||
+            qemu_opt_get(opts, "downscript") ||
+            qemu_opt_get(opts, "vnet_hdr")) {
+            error_report("ifname=, script=, downscript=, and vnet_hdr= "
+                         "are invalid with helper=");
+            return -1;
+        }
+
+        /* FIXME: multiqueue helper */
+        fd = net_bridge_run_helper(qemu_opt_get(opts, "helper"),
+                                   DEFAULT_BRIDGE_INTERFACE);
+        if (fd == -1) {
+            return -1;
+        }
+
+        fcntl(fd, F_SETFL, O_NONBLOCK);
+
+        vnet_hdr = tap_probe_vnet_hdr(fd);
+    } else {
+        if (!qemu_opt_get(opts, "script")) {
+            qemu_opt_set(opts, "script", DEFAULT_NETWORK_SCRIPT);
+        }
+
+        if (!qemu_opt_get(opts, "downscript")) {
+            qemu_opt_set(opts, "downscript", DEFAULT_NETWORK_DOWN_SCRIPT);
+        }
+
+        for (i = 0; i < numqueues; i++) {
+            fd = net_tap_init(opts, &vnet_hdr, i != 0);
+            if (fd == -1) {
+                return -1;
+            }
+            __net_init_tap(opts, cur_mon, name, vlan, fd, vnet_hdr);
+        }
+    }
     return 0;
 }
 
@@ -713,3 +731,36 @@ VHostNetState *tap_get_vhost_net(VLANClientState *nc)
     assert(nc->info->type == NET_CLIENT_TYPE_TAP);
     return s->vhost_net;
 }
+
+int tap_attach(VLANClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    int ret;
+
+    if (s->enabled) {
+        return 0;
+    } else {
+        ret = tap_fd_attach(s->fd, s->ifname);
+        if (ret == 0) {
+            s->enabled = 1;
+        }
+        return ret;
+    }
+}
+
+int tap_detach(VLANClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    int ret;
+
+    if (s->enabled == 0) {
+        return 0;
+    } else {
+        ret = tap_fd_detach(s->fd, s->ifname);
+        if (ret == 0) {
+            s->enabled = 0;
+        }
+        return ret;
+    }
+}
+
diff --git a/net/tap.h b/net/tap.h
index b2a9450..cead7ca 100644
--- a/net/tap.h
+++ b/net/tap.h
@@ -34,7 +34,8 @@
 
 int net_init_tap(QemuOpts *opts, const char *name, VLANState *vlan);
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required);
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int attach);
 
 ssize_t tap_read_packet(int tapfd, uint8_t *buf, int maxlen);
 
@@ -51,6 +52,10 @@ int tap_probe_vnet_hdr_len(int fd, int len);
 int tap_probe_has_ufo(int fd);
 void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, int ecn, int ufo);
 void tap_fd_set_vnet_hdr_len(int fd, int len);
+int tap_attach(VLANClientState *vc);
+int tap_detach(VLANClientState *vc);
+int tap_fd_attach(int fd, const char *ifname);
+int tap_fd_detach(int fd, const char *ifname);
 
 int tap_get_fd(VLANClientState *vc);


* [RFC V2 PATCH 3/4] net: multiqueue support
From: Jason Wang @ 2012-06-25 10:04 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <20120625095059.8096.49429.stgit@amd-6168-8-1.englab.nay.redhat.com>

This patch adds multiqueue support for emulated NICs. Each pair of
VLANClientStates is now abstracted as a queue instead of a NIC, and multiple
VLANClientState pointers are stored in the NICState. A queue_index is also
introduced to let emulated NICs know which queue a packet came from or was sent
out on. Virtio-net will be the first user.
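
The relationship between a NIC and its queues can be sketched in isolation as
below. This is a simplified illustration, not the structures from this patch:
the sketch_* names are hypothetical and the queues are stored by value, while
the patch keeps an array of VLANClientState pointers in NICState and uses the
opaque field of each VLANClientState as the back pointer.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_QUEUES 4 /* hypothetical limit for this sketch */

struct sketch_nic;

/* One queue: knows its own index and which NIC owns it. */
struct sketch_queue {
    unsigned int queue_index;
    struct sketch_nic *owner;
};

/* One NIC: owns a fixed-size set of queues. */
struct sketch_nic {
    struct sketch_queue queues[SKETCH_MAX_QUEUES];
    int nqueues;
};

/* Number the queues and point each one back at its NIC. */
void sketch_nic_init(struct sketch_nic *nic, int nqueues)
{
    int i;

    assert(nic != NULL);
    assert(nqueues >= 1 && nqueues <= SKETCH_MAX_QUEUES);

    nic->nqueues = nqueues;
    for (i = 0; i < nqueues; i++) {
        nic->queues[i].queue_index = (unsigned int)i;
        nic->queues[i].owner = nic;
    }
}

/* From any queue, recover the NIC it belongs to. */
struct sketch_nic *sketch_queue_to_nic(const struct sketch_queue *q)
{
    assert(q != NULL);
    return q->owner;
}
```

With the back pointer and the index, a receive or transmit callback that is
handed only a queue can find both the device and the queue number.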

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/dp8393x.c         |    2 +-
 hw/mcf_fec.c         |    2 +-
 hw/qdev-properties.c |   33 +++++++++++++++++++++++-----
 hw/qdev.h            |    3 ++-
 net.c                |   58 +++++++++++++++++++++++++++++++++++++++++++-------
 net.h                |   16 ++++++++++----
 6 files changed, 93 insertions(+), 21 deletions(-)

diff --git a/hw/dp8393x.c b/hw/dp8393x.c
index 017d074..483a868 100644
--- a/hw/dp8393x.c
+++ b/hw/dp8393x.c
@@ -900,7 +900,7 @@ void dp83932_init(NICInfo *nd, target_phys_addr_t base, int it_shift,
 
     s->conf.macaddr = nd->macaddr;
     s->conf.vlan = nd->vlan;
-    s->conf.peer = nd->netdev;
+    s->conf.peers[0] = nd->netdev;
 
     s->nic = qemu_new_nic(&net_dp83932_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/mcf_fec.c b/hw/mcf_fec.c
index ae37bef..69f508d 100644
--- a/hw/mcf_fec.c
+++ b/hw/mcf_fec.c
@@ -473,7 +473,7 @@ void mcf_fec_init(MemoryRegion *sysmem, NICInfo *nd,
 
     s->conf.macaddr = nd->macaddr;
     s->conf.vlan = nd->vlan;
-    s->conf.peer = nd->netdev;
+    s->conf.peers[0] = nd->netdev;
 
     s->nic = qemu_new_nic(&net_mcf_fec_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/qdev-properties.c b/hw/qdev-properties.c
index 9ae3187..d45fcef 100644
--- a/hw/qdev-properties.c
+++ b/hw/qdev-properties.c
@@ -554,16 +554,37 @@ PropertyInfo qdev_prop_chr = {
 
 static int parse_netdev(DeviceState *dev, const char *str, void **ptr)
 {
-    VLANClientState *netdev = qemu_find_netdev(str);
+    VLANClientState ***nc = (VLANClientState ***)ptr;
+    VLANClientState *vcs[MAX_QUEUE_NUM];
+    int queues, i = 0;
+    int ret;
 
-    if (netdev == NULL) {
-        return -ENOENT;
+    *nc = g_malloc(MAX_QUEUE_NUM * sizeof(VLANClientState *));
+    queues = qemu_find_netdev_all(str, vcs, MAX_QUEUE_NUM);
+    if (queues == 0) {
+        ret = -ENOENT;
+        goto err;
     }
-    if (netdev->peer) {
-        return -EEXIST;
+
+    for (i = 0; i < queues; i++) {
+        if (vcs[i] == NULL) {
+            ret = -ENOENT;
+            goto err;
+        }
+
+        if (vcs[i]->peer) {
+            ret = -EEXIST;
+            goto err;
+        }
+
+        (*nc)[i] = vcs[i];
     }
-    *ptr = netdev;
+
     return 0;
+
+err:
+    g_free(*nc);
+    return ret;
 }
 
 static const char *print_netdev(void *ptr)
diff --git a/hw/qdev.h b/hw/qdev.h
index 5386b16..1c023b4 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -248,6 +248,7 @@ extern PropertyInfo qdev_prop_blocksize;
         .defval    = (bool)_defval,                              \
         }
 
+
 #define DEFINE_PROP_UINT8(_n, _s, _f, _d)                       \
     DEFINE_PROP_DEFAULT(_n, _s, _f, _d, qdev_prop_uint8, uint8_t)
 #define DEFINE_PROP_UINT16(_n, _s, _f, _d)                      \
@@ -274,7 +275,7 @@ extern PropertyInfo qdev_prop_blocksize;
 #define DEFINE_PROP_STRING(_n, _s, _f)             \
     DEFINE_PROP(_n, _s, _f, qdev_prop_string, char*)
 #define DEFINE_PROP_NETDEV(_n, _s, _f)             \
-    DEFINE_PROP(_n, _s, _f, qdev_prop_netdev, VLANClientState*)
+    DEFINE_PROP(_n, _s, _f, qdev_prop_netdev, VLANClientState**)
 #define DEFINE_PROP_VLAN(_n, _s, _f)             \
     DEFINE_PROP(_n, _s, _f, qdev_prop_vlan, VLANState*)
 #define DEFINE_PROP_DRIVE(_n, _s, _f) \
diff --git a/net.c b/net.c
index eabe830..026a03a 100644
--- a/net.c
+++ b/net.c
@@ -238,16 +238,40 @@ NICState *qemu_new_nic(NetClientInfo *info,
 {
     VLANClientState *nc;
     NICState *nic;
+    int i;
 
     assert(info->type == NET_CLIENT_TYPE_NIC);
     assert(info->size >= sizeof(NICState));
 
-    nc = qemu_new_net_client(info, conf->vlan, conf->peer, model, name);
+    if (conf->peers) {
+        nc = qemu_new_net_client(info, NULL, conf->peers[0], model, name);
+    } else {
+        nc = qemu_new_net_client(info, conf->vlan, NULL, model, name);
+    }
 
     nic = DO_UPCAST(NICState, nc, nc);
     nic->conf = conf;
     nic->opaque = opaque;
 
+    /* For compatibility with single queue nic */
+    nic->ncs[0] = nc;
+    nc->opaque = nic;
+
+    for (i = 1 ; i < conf->queues; i++) {
+        VLANClientState *vc = g_malloc0(sizeof(*vc));
+        vc->opaque = nic;
+        nic->ncs[i] = vc;
+        vc->peer = conf->peers[i];
+        vc->info = info;
+        vc->queue_index = i;
+        vc->peer->peer = vc;
+        QTAILQ_INSERT_TAIL(&non_vlan_clients, vc, next);
+
+        vc->send_queue = qemu_new_net_queue(qemu_deliver_packet,
+                                            qemu_deliver_packet_iov,
+                                            vc);
+    }
+
     return nic;
 }
 
@@ -283,11 +307,10 @@ void qemu_del_vlan_client(VLANClientState *vc)
 {
     /* If there is a peer NIC, delete and cleanup client, but do not free. */
     if (!vc->vlan && vc->peer && vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
-        NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
-        if (nic->peer_deleted) {
+        if (vc->peer_deleted) {
             return;
         }
-        nic->peer_deleted = true;
+        vc->peer_deleted = true;
         /* Let NIC know peer is gone. */
         vc->peer->link_down = true;
         if (vc->peer->info->link_status_changed) {
@@ -299,8 +322,7 @@ void qemu_del_vlan_client(VLANClientState *vc)
 
     /* If this is a peer NIC and peer has already been deleted, free it now. */
     if (!vc->vlan && vc->peer && vc->info->type == NET_CLIENT_TYPE_NIC) {
-        NICState *nic = DO_UPCAST(NICState, nc, vc);
-        if (nic->peer_deleted) {
+        if (vc->peer_deleted) {
             qemu_free_vlan_client(vc->peer);
         }
     }
@@ -342,14 +364,14 @@ void qemu_foreach_nic(qemu_nic_foreach func, void *opaque)
 
     QTAILQ_FOREACH(nc, &non_vlan_clients, next) {
         if (nc->info->type == NET_CLIENT_TYPE_NIC) {
-            func(DO_UPCAST(NICState, nc, nc), opaque);
+            func((NICState *)nc->opaque, opaque);
         }
     }
 
     QTAILQ_FOREACH(vlan, &vlans, next) {
         QTAILQ_FOREACH(nc, &vlan->clients, next) {
             if (nc->info->type == NET_CLIENT_TYPE_NIC) {
-                func(DO_UPCAST(NICState, nc, nc), opaque);
+                func((NICState *)nc->opaque, opaque);
             }
         }
     }
@@ -674,6 +696,26 @@ VLANClientState *qemu_find_netdev(const char *id)
     return NULL;
 }
 
+int qemu_find_netdev_all(const char *id, VLANClientState **vcs, int max)
+{
+    VLANClientState *vc;
+    int ret = 0;
+
+    QTAILQ_FOREACH(vc, &non_vlan_clients, next) {
+        if (vc->info->type == NET_CLIENT_TYPE_NIC) {
+            continue;
+        }
+        if (!strcmp(vc->name, id) && ret < max) {
+            vcs[ret++] = vc;
+        }
+        if (ret >= max) {
+            break;
+        }
+    }
+
+    return ret;
+}
+
 static int nic_get_free_idx(void)
 {
     int index;
diff --git a/net.h b/net.h
index bdc2a06..40378ce 100644
--- a/net.h
+++ b/net.h
@@ -12,20 +12,24 @@ struct MACAddr {
     uint8_t a[6];
 };
 
+#define MAX_QUEUE_NUM 32
+
 /* qdev nic properties */
 
 typedef struct NICConf {
     MACAddr macaddr;
     VLANState *vlan;
-    VLANClientState *peer;
+    VLANClientState **peers;
     int32_t bootindex;
+    int32_t queues;
 } NICConf;
 
 #define DEFINE_NIC_PROPERTIES(_state, _conf)                            \
     DEFINE_PROP_MACADDR("mac",   _state, _conf.macaddr),                \
     DEFINE_PROP_VLAN("vlan",     _state, _conf.vlan),                   \
-    DEFINE_PROP_NETDEV("netdev", _state, _conf.peer),                   \
-    DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1)
+    DEFINE_PROP_NETDEV("netdev", _state, _conf.peers),                   \
+    DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
+    DEFINE_PROP_INT32("queues", _state, _conf.queues, 1)
 
 /* VLANs support */
 
@@ -72,13 +76,16 @@ struct VLANClientState {
     char *name;
     char info_str[256];
     unsigned receive_disabled : 1;
+    unsigned int queue_index;
+    bool peer_deleted;
+    void *opaque;
 };
 
 typedef struct NICState {
     VLANClientState nc;
+    VLANClientState *ncs[MAX_QUEUE_NUM];
     NICConf *conf;
     void *opaque;
-    bool peer_deleted;
 } NICState;
 
 struct VLANState {
@@ -90,6 +97,7 @@ struct VLANState {
 
 VLANState *qemu_find_vlan(int id, int allocate);
 VLANClientState *qemu_find_netdev(const char *id);
+int qemu_find_netdev_all(const char *id, VLANClientState **vcs, int max);
 VLANClientState *qemu_new_net_client(NetClientInfo *info,
                                      VLANState *vlan,
                                      VLANClientState *peer,


* [RFC V2 PATCH 4/4] virtio-net: add multiqueue support
From: Jason Wang @ 2012-06-25 10:04 UTC (permalink / raw)
  To: krkumar2, habanero, aliguori, rusty, mst, mashirle, qemu-devel,
	virtualization, tahm, jwhan, akong
  Cc: kvm
In-Reply-To: <20120625095059.8096.49429.stgit@amd-6168-8-1.englab.nay.redhat.com>

This patch lets virtio-net transmit and receive packets through multiple
VLANClientStates and presents them to the guest as multiple virtqueues. A new
parameter, 'queues', is introduced to specify the number of queue pairs.

The main goal for vhost support is to let multiqueue be used without changes to
the vhost code. So each vhost_net structure is still used to track a single
VLANClientState and two virtqueues, as in the past. As multiple
VLANClientStates are stored in the NICState, we can easily infer the
corresponding VLANClientState from it and the queue_index.
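
Since each vhost device still owns only nvqs virtqueues, a device-wide
virtqueue index has to be folded into a local index before it is handed to
vhost. The sketch below isolates the expression used for that in hw/vhost.c in
this patch; the function name vhost_local_vq_index() is hypothetical. Reading
index 2 as a control virtqueue that sits between the first and second queue
pairs is an assumption, not something this sketch verifies.

```c
#include <assert.h>

/*
 * Fold a device-wide virtqueue index into the index used inside one vhost
 * device that owns nvqs virtqueues. Indices above 2 are shifted down by one
 * before the modulo, which skips a single slot at index 2.
 */
int vhost_local_vq_index(int idx, int nvqs)
{
    assert(idx >= 0 && nvqs > 0);
    return (idx > 2 ? idx - 1 : idx) % nvqs;
}
```

With nvqs = 2, indices 0 and 1 map to 0 and 1, indices 3 and 4 map to 0 and 1,
and indices 5 and 6 map to 0 and 1 again. Index 2 itself maps to 0 under this
expression.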

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/vhost.c      |   58 ++++---
 hw/vhost.h      |    1 
 hw/vhost_net.c  |    7 +
 hw/vhost_net.h  |    2 
 hw/virtio-net.c |  461 +++++++++++++++++++++++++++++++++++++------------------
 hw/virtio-net.h |    3 
 6 files changed, 355 insertions(+), 177 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 43664e7..6318bb2 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -620,11 +620,12 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
 {
     target_phys_addr_t s, l, a;
     int r;
+    int vhost_vq_index = (idx > 2 ? idx - 1 : idx) % dev->nvqs;
     struct vhost_vring_file file = {
-        .index = idx,
+        .index = vhost_vq_index
     };
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = vhost_vq_index
     };
     struct VirtQueue *vvq = virtio_get_queue(vdev, idx);
 
@@ -670,11 +671,12 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
         goto fail_alloc_ring;
     }
 
-    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
+    r = vhost_virtqueue_set_addr(dev, vq, vhost_vq_index, dev->log_enabled);
     if (r < 0) {
         r = -errno;
         goto fail_alloc;
     }
+
     file.fd = event_notifier_get_fd(virtio_queue_get_host_notifier(vvq));
     r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
     if (r) {
@@ -715,7 +717,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                                     unsigned idx)
 {
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = (idx > 2 ? idx - 1 : idx) % dev->nvqs,
     };
     int r;
     r = ioctl(dev->control, VHOST_GET_VRING_BASE, &state);
@@ -829,7 +831,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
 
     for (i = 0; i < hdev->nvqs; ++i) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, true);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->start_idx + i,
+                                             true);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier binding failed: %d\n", i, -r);
             goto fail_vq;
@@ -839,7 +843,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     return 0;
 fail_vq:
     while (--i >= 0) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->start_idx + i,
+                                             false);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier cleanup error: %d\n", i, -r);
             fflush(stderr);
@@ -860,7 +866,9 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     int i, r;
 
     for (i = 0; i < hdev->nvqs; ++i) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->start_idx + i,
+                                             false);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier cleanup failed: %d\n", i, -r);
             fflush(stderr);
@@ -874,15 +882,17 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 {
     int i, r;
     if (!vdev->binding->set_guest_notifiers) {
-        fprintf(stderr, "binding does not support guest notifiers\n");
+        fprintf(stderr, "binding does not support guest notifier\n");
         r = -ENOSYS;
         goto fail;
     }
 
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
-    if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
-        goto fail_notifiers;
+    if (hdev->start_idx == 0) {
+        r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
+        if (r < 0) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+            goto fail_notifiers;
+        }
     }
 
     r = vhost_dev_set_features(hdev, hdev->log_enabled);
@@ -898,7 +908,7 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         r = vhost_virtqueue_init(hdev,
                                  vdev,
                                  hdev->vqs + i,
-                                 i);
+                                 hdev->start_idx + i);
         if (r < 0) {
             goto fail_vq;
         }
@@ -925,11 +935,13 @@ fail_vq:
         vhost_virtqueue_cleanup(hdev,
                                 vdev,
                                 hdev->vqs + i,
-                                i);
+                                hdev->start_idx + i);
     }
+    i = hdev->nvqs;
 fail_mem:
 fail_features:
-    vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+    if (hdev->start_idx == 0)
+        vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
 fail_notifiers:
 fail:
     return r;
@@ -944,18 +956,22 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
         vhost_virtqueue_cleanup(hdev,
                                 vdev,
                                 hdev->vqs + i,
-                                i);
+                                hdev->start_idx + i);
     }
+
     for (i = 0; i < hdev->n_mem_sections; ++i) {
         vhost_sync_dirty_bitmap(hdev, &hdev->mem_sections[i],
                                 0, (target_phys_addr_t)~0x0ull);
     }
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
-    if (r < 0) {
-        fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
-        fflush(stderr);
+
+    if (hdev->start_idx == 0) {
+        r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, false);
+        if (r < 0) {
+            fprintf(stderr, "vhost guest notifier cleanup failed: %d\n", r);
+            fflush(stderr);
+        }
+        assert(r >= 0);
     }
-    assert (r >= 0);
 
     hdev->started = false;
     g_free(hdev->log);
diff --git a/hw/vhost.h b/hw/vhost.h
index 80e64df..fa5357a 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -34,6 +34,7 @@ struct vhost_dev {
     MemoryRegionSection *mem_sections;
     struct vhost_virtqueue *vqs;
     int nvqs;
+    int start_idx;
     unsigned long long features;
     unsigned long long acked_features;
     unsigned long long backend_features;
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index f672e9d..73a72bb 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -138,13 +138,15 @@ bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
 }
 
 int vhost_net_start(struct vhost_net *net,
-                    VirtIODevice *dev)
+                    VirtIODevice *dev,
+                    int start_idx)
 {
     struct vhost_vring_file file = { };
     int r;
 
     net->dev.nvqs = 2;
     net->dev.vqs = net->vqs;
+    net->dev.start_idx = start_idx;
 
     r = vhost_dev_enable_notifiers(&net->dev, dev);
     if (r < 0) {
@@ -227,7 +229,8 @@ bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
 }
 
 int vhost_net_start(struct vhost_net *net,
-		    VirtIODevice *dev)
+                    VirtIODevice *dev,
+                    int start_idx)
 {
     return -ENOSYS;
 }
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 91e40b1..79a4f09 100644
--- a/hw/vhost_net.h
+++ b/hw/vhost_net.h
@@ -9,7 +9,7 @@ typedef struct vhost_net VHostNetState;
 VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, bool force);
 
 bool vhost_net_query(VHostNetState *net, VirtIODevice *dev);
-int vhost_net_start(VHostNetState *net, VirtIODevice *dev);
+int vhost_net_start(VHostNetState *net, VirtIODevice *dev, int start_idx);
 void vhost_net_stop(VHostNetState *net, VirtIODevice *dev);
 
 void vhost_net_cleanup(VHostNetState *net);
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 3f190d4..d42c4cc 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -26,34 +26,43 @@
 #define MAC_TABLE_ENTRIES    64
 #define MAX_VLAN    (1 << 12)   /* Per 802.1Q definition */
 
-typedef struct VirtIONet
+struct VirtIONet;
+
+typedef struct VirtIONetQueue
 {
-    VirtIODevice vdev;
-    uint8_t mac[ETH_ALEN];
-    uint16_t status;
     VirtQueue *rx_vq;
     VirtQueue *tx_vq;
-    VirtQueue *ctrl_vq;
-    NICState *nic;
     QEMUTimer *tx_timer;
     QEMUBH *tx_bh;
     uint32_t tx_timeout;
-    int32_t tx_burst;
     int tx_waiting;
-    uint32_t has_vnet_hdr;
-    uint8_t has_ufo;
     struct {
         VirtQueueElement elem;
         ssize_t len;
     } async_tx;
+    struct VirtIONet *n;
+    uint8_t vhost_started;
+} VirtIONetQueue;
+
+typedef struct VirtIONet
+{
+    VirtIODevice vdev;
+    uint8_t mac[ETH_ALEN];
+    uint16_t status;
+    VirtIONetQueue vqs[MAX_QUEUE_NUM];
+    VirtQueue *ctrl_vq;
+    NICState *nic;
+    int32_t tx_burst;
+    uint32_t has_vnet_hdr;
+    uint8_t has_ufo;
     int mergeable_rx_bufs;
+    int multiqueue;
     uint8_t promisc;
     uint8_t allmulti;
     uint8_t alluni;
     uint8_t nomulti;
     uint8_t nouni;
     uint8_t nobcast;
-    uint8_t vhost_started;
     struct {
         int in_use;
         int first_multi;
@@ -63,6 +72,7 @@ typedef struct VirtIONet
     } mac_table;
     uint32_t *vlans;
     DeviceState *qdev;
+    uint32_t queues;
 } VirtIONet;
 
 /* TODO
@@ -74,12 +84,25 @@ static VirtIONet *to_virtio_net(VirtIODevice *vdev)
     return (VirtIONet *)vdev;
 }
 
+static int vq_get_pair_index(VirtIONet *n, VirtQueue *vq)
+{
+    int i;
+    for (i = 0; i < n->queues; i++) {
+        if (n->vqs[i].tx_vq == vq || n->vqs[i].rx_vq == vq) {
+            return i;
+        }
+    }
+    assert(0);
+    return -1;
+}
+
 static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 {
     VirtIONet *n = to_virtio_net(vdev);
     struct virtio_net_config netcfg;
 
     stw_p(&netcfg.status, n->status);
+    stw_p(&netcfg.queues, n->queues * 2);
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -103,78 +126,140 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
         (n->status & VIRTIO_NET_S_LINK_UP) && n->vdev.vm_running;
 }
 
-static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
+static void nc_vhost_status(VLANClientState *nc, VirtIONet *n,
+                            uint8_t status)
 {
-    if (!n->nic->nc.peer) {
+    int queue_index = nc->queue_index;
+    VLANClientState *peer = nc->peer;
+    VirtIONetQueue *netq = &n->vqs[nc->queue_index];
+
+    if (!peer) {
         return;
     }
-    if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
+    if (peer->info->type != NET_CLIENT_TYPE_TAP) {
         return;
     }
 
-    if (!tap_get_vhost_net(n->nic->nc.peer)) {
+    if (!tap_get_vhost_net(peer)) {
         return;
     }
-    if (!!n->vhost_started == virtio_net_started(n, status) &&
-                              !n->nic->nc.peer->link_down) {
+    if (!!netq->vhost_started == virtio_net_started(n, status) &&
+                                 !peer->link_down) {
         return;
     }
-    if (!n->vhost_started) {
-        int r;
-        if (!vhost_net_query(tap_get_vhost_net(n->nic->nc.peer), &n->vdev)) {
+    if (!netq->vhost_started) {
+        /* Queue 2 is the ctrl vq, so pair N > 0 starts at 2 * N + 1. */
+        int r, start_idx = queue_index == 0 ? 0 : queue_index * 2 + 1;
+        if (!vhost_net_query(tap_get_vhost_net(peer), &n->vdev)) {
             return;
         }
-        r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
+        r = vhost_net_start(tap_get_vhost_net(peer), &n->vdev, start_idx);
         if (r < 0) {
             error_report("unable to start vhost net: %d: "
                          "falling back on userspace virtio", -r);
         } else {
-            n->vhost_started = 1;
+            netq->vhost_started = 1;
         }
     } else {
-        vhost_net_stop(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
-        n->vhost_started = 0;
+        vhost_net_stop(tap_get_vhost_net(peer), &n->vdev);
+        netq->vhost_started = 0;
+    }
+}
+
+static int peer_attach(VirtIONet *n, int index)
+{
+    if (!n->nic->ncs[index]->peer) {
+        return -1;
+    }
+
+    if (n->nic->ncs[index]->peer->info->type != NET_CLIENT_TYPE_TAP) {
+        return -1;
+    }
+
+    return tap_attach(n->nic->ncs[index]->peer);
+}
+
+static int peer_detach(VirtIONet *n, int index)
+{
+    if (!n->nic->ncs[index]->peer) {
+        return -1;
+    }
+
+    if (n->nic->ncs[index]->peer->info->type != NET_CLIENT_TYPE_TAP) {
+        return -1;
+    }
+
+    return tap_detach(n->nic->ncs[index]->peer);
+}
+
+static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
+{
+    int i;
+    for (i = 0; i < n->queues; i++) {
+        if (!n->multiqueue && i != 0)
+            status = 0;
+        nc_vhost_status(n->nic->ncs[i], n, status);
     }
 }
 
 static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    int i;
 
     virtio_net_vhost_status(n, status);
 
-    if (!n->tx_waiting) {
-        return;
-    }
+    for (i = 0; i < n->queues; i++) {
+        VirtIONetQueue *netq = &n->vqs[i];
+        if (!netq->tx_waiting) {
+            continue;
+        }
+
+        if (!n->multiqueue && i != 0)
+            status = 0;
 
-    if (virtio_net_started(n, status) && !n->vhost_started) {
-        if (n->tx_timer) {
-            qemu_mod_timer(n->tx_timer,
-                           qemu_get_clock_ns(vm_clock) + n->tx_timeout);
+        if (virtio_net_started(n, status) && !netq->vhost_started) {
+            if (netq->tx_timer) {
+                qemu_mod_timer(netq->tx_timer,
+                               qemu_get_clock_ns(vm_clock) + netq->tx_timeout);
+            } else {
+                qemu_bh_schedule(netq->tx_bh);
+            }
         } else {
-            qemu_bh_schedule(n->tx_bh);
+            if (netq->tx_timer) {
+                qemu_del_timer(netq->tx_timer);
+            } else {
+                qemu_bh_cancel(netq->tx_bh);
+            }
         }
-    } else {
-        if (n->tx_timer) {
-            qemu_del_timer(n->tx_timer);
-        } else {
-            qemu_bh_cancel(n->tx_bh);
+    }
+}
+
+static bool virtio_net_is_link_up(VirtIONet *n)
+{
+    int i;
+    for (i = 0; i < n->queues; i++) {
+        if (n->nic->ncs[i]->link_down) {
+            return false;
         }
     }
+    return true;
 }
 
 static void virtio_net_set_link_status(VLANClientState *nc)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    VirtIONet *n = ((NICState *)(nc->opaque))->opaque;
     uint16_t old_status = n->status;
 
-    if (nc->link_down)
+    if (!virtio_net_is_link_up(n)) {
         n->status &= ~VIRTIO_NET_S_LINK_UP;
-    else
+    } else {
         n->status |= VIRTIO_NET_S_LINK_UP;
+    }
 
-    if (n->status != old_status)
+    if (n->status != old_status) {
         virtio_notify_config(&n->vdev);
+    }
 
     virtio_net_set_status(&n->vdev, n->vdev.status);
 }
@@ -202,13 +287,15 @@ static void virtio_net_reset(VirtIODevice *vdev)
 
 static int peer_has_vnet_hdr(VirtIONet *n)
 {
-    if (!n->nic->nc.peer)
+    if (!n->nic->ncs[0]->peer) {
         return 0;
+    }
 
-    if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP)
+    if (n->nic->ncs[0]->peer->info->type != NET_CLIENT_TYPE_TAP) {
         return 0;
+    }
 
-    n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->nc.peer);
+    n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->ncs[0]->peer);
 
     return n->has_vnet_hdr;
 }
@@ -218,7 +305,7 @@ static int peer_has_ufo(VirtIONet *n)
     if (!peer_has_vnet_hdr(n))
         return 0;
 
-    n->has_ufo = tap_has_ufo(n->nic->nc.peer);
+    n->has_ufo = tap_has_ufo(n->nic->ncs[0]->peer);
 
     return n->has_ufo;
 }
@@ -228,9 +315,13 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
     VirtIONet *n = to_virtio_net(vdev);
 
     features |= (1 << VIRTIO_NET_F_MAC);
+    features |= (1 << VIRTIO_NET_F_MULTIQUEUE);
 
     if (peer_has_vnet_hdr(n)) {
-        tap_using_vnet_hdr(n->nic->nc.peer, 1);
+        int i;
+        for (i = 0; i < n->queues; i++) {
+            tap_using_vnet_hdr(n->nic->ncs[i]->peer, 1);
+        }
     } else {
         features &= ~(0x1 << VIRTIO_NET_F_CSUM);
         features &= ~(0x1 << VIRTIO_NET_F_HOST_TSO4);
@@ -248,14 +339,15 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
         features &= ~(0x1 << VIRTIO_NET_F_HOST_UFO);
     }
 
-    if (!n->nic->nc.peer ||
-        n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
+    if (!n->nic->ncs[0]->peer ||
+        n->nic->ncs[0]->peer->info->type != NET_CLIENT_TYPE_TAP) {
         return features;
     }
-    if (!tap_get_vhost_net(n->nic->nc.peer)) {
+    if (!tap_get_vhost_net(n->nic->ncs[0]->peer)) {
         return features;
     }
-    return vhost_net_get_features(tap_get_vhost_net(n->nic->nc.peer), features);
+    return vhost_net_get_features(tap_get_vhost_net(n->nic->ncs[0]->peer),
+                                  features);
 }
 
 static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
@@ -276,25 +368,38 @@ static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
 static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    int i, r;
 
     n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF));
+    n->multiqueue = !!(features & (1 << VIRTIO_NET_F_MULTIQUEUE));
 
-    if (n->has_vnet_hdr) {
-        tap_set_offload(n->nic->nc.peer,
-                        (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
-                        (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
-                        (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
-                        (features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
-                        (features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
-    }
-    if (!n->nic->nc.peer ||
-        n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
-        return;
-    }
-    if (!tap_get_vhost_net(n->nic->nc.peer)) {
-        return;
+    for (i = 0; i < n->queues; i++) {
+        if (!n->multiqueue && i != 0) {
+            r = peer_detach(n, i);
+            assert(r == 0);
+        } else {
+            r = peer_attach(n, i);
+            assert(r == 0);
+
+            if (n->has_vnet_hdr) {
+                tap_set_offload(n->nic->ncs[i]->peer,
+                                (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
+                                (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
+                                (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+                                (features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
+                                (features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
+            }
+            if (!n->nic->ncs[i]->peer ||
+                n->nic->ncs[i]->peer->info->type != NET_CLIENT_TYPE_TAP) {
+                continue;
+            }
+            if (!tap_get_vhost_net(n->nic->ncs[i]->peer)) {
+                continue;
+            }
+            vhost_net_ack_features(tap_get_vhost_net(n->nic->ncs[i]->peer),
+                                   features);
+        }
     }
-    vhost_net_ack_features(tap_get_vhost_net(n->nic->nc.peer), features);
 }
 
 static int virtio_net_handle_rx_mode(VirtIONet *n, uint8_t cmd,
@@ -446,7 +551,7 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    qemu_flush_queued_packets(&n->nic->nc);
+    qemu_flush_queued_packets(n->nic->ncs[vq_get_pair_index(n, vq)]);
 
     /* We now have RX buffers, signal to the IO thread to break out of the
      * select to re-poll the tap file descriptor */
@@ -455,36 +560,37 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 
 static int virtio_net_can_receive(VLANClientState *nc)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    int queue_index = nc->queue_index;
+    VirtIONet *n = ((NICState *)nc->opaque)->opaque;
+
     if (!n->vdev.vm_running) {
         return 0;
     }
 
-    if (!virtio_queue_ready(n->rx_vq) ||
+    if (!virtio_queue_ready(n->vqs[queue_index].rx_vq) ||
         !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return 0;
 
     return 1;
 }
 
-static int virtio_net_has_buffers(VirtIONet *n, int bufsize)
+static int virtio_net_has_buffers(VirtIONet *n, int bufsize, VirtQueue *vq)
 {
-    if (virtio_queue_empty(n->rx_vq) ||
-        (n->mergeable_rx_bufs &&
-         !virtqueue_avail_bytes(n->rx_vq, bufsize, 0))) {
-        virtio_queue_set_notification(n->rx_vq, 1);
+    if (virtio_queue_empty(vq) || (n->mergeable_rx_bufs &&
+        !virtqueue_avail_bytes(vq, bufsize, 0))) {
+        virtio_queue_set_notification(vq, 1);
 
         /* To avoid a race condition where the guest has made some buffers
          * available after the above check but before notification was
          * enabled, check for available buffers again.
          */
-        if (virtio_queue_empty(n->rx_vq) ||
-            (n->mergeable_rx_bufs &&
-             !virtqueue_avail_bytes(n->rx_vq, bufsize, 0)))
+        if (virtio_queue_empty(vq) || (n->mergeable_rx_bufs &&
+            !virtqueue_avail_bytes(vq, bufsize, 0))) {
             return 0;
+        }
     }
 
-    virtio_queue_set_notification(n->rx_vq, 0);
+    virtio_queue_set_notification(vq, 0);
     return 1;
 }
 
@@ -595,12 +701,15 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
 
 static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_t size)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    int queue_index = nc->queue_index;
+    VirtIONet *n = ((NICState *)(nc->opaque))->opaque;
+    VirtQueue *vq = n->vqs[queue_index].rx_vq;
     struct virtio_net_hdr_mrg_rxbuf *mhdr = NULL;
     size_t guest_hdr_len, offset, i, host_hdr_len;
 
-    if (!virtio_net_can_receive(&n->nic->nc))
+    if (!virtio_net_can_receive(n->nic->ncs[queue_index])) {
         return -1;
+    }
 
     /* hdr_len refers to the header we supply to the guest */
     guest_hdr_len = n->mergeable_rx_bufs ?
@@ -608,7 +717,7 @@ static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_
 
 
     host_hdr_len = n->has_vnet_hdr ? sizeof(struct virtio_net_hdr) : 0;
-    if (!virtio_net_has_buffers(n, size + guest_hdr_len - host_hdr_len))
+    if (!virtio_net_has_buffers(n, size + guest_hdr_len - host_hdr_len, vq))
         return 0;
 
     if (!receive_filter(n, buf, size))
@@ -623,7 +732,7 @@ static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_
 
         total = 0;
 
-        if (virtqueue_pop(n->rx_vq, &elem) == 0) {
+        if (virtqueue_pop(vq, &elem) == 0) {
             if (i == 0)
                 return -1;
             error_report("virtio-net unexpected empty queue: "
@@ -675,47 +784,50 @@ static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_
         }
 
         /* signal other side */
-        virtqueue_fill(n->rx_vq, &elem, total, i++);
+        virtqueue_fill(vq, &elem, total, i++);
     }
 
     if (mhdr) {
         stw_p(&mhdr->num_buffers, i);
     }
 
-    virtqueue_flush(n->rx_vq, i);
-    virtio_notify(&n->vdev, n->rx_vq);
+    virtqueue_flush(vq, i);
+    virtio_notify(&n->vdev, vq);
 
     return size;
 }
 
-static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq);
+static int32_t virtio_net_flush_tx(VirtIONet *n, VirtIONetQueue *tvq);
 
 static void virtio_net_tx_complete(VLANClientState *nc, ssize_t len)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    VirtIONet *n = ((NICState *)nc->opaque)->opaque;
+    VirtIONetQueue *netq = &n->vqs[nc->queue_index];
 
-    virtqueue_push(n->tx_vq, &n->async_tx.elem, n->async_tx.len);
-    virtio_notify(&n->vdev, n->tx_vq);
+    virtqueue_push(netq->tx_vq, &netq->async_tx.elem, netq->async_tx.len);
+    virtio_notify(&n->vdev, netq->tx_vq);
 
-    n->async_tx.elem.out_num = n->async_tx.len = 0;
+    netq->async_tx.elem.out_num = netq->async_tx.len = 0;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(netq->tx_vq, 1);
+    virtio_net_flush_tx(n, netq);
 }
 
 /* TX */
-static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
+static int32_t virtio_net_flush_tx(VirtIONet *n, VirtIONetQueue *netq)
 {
     VirtQueueElement elem;
     int32_t num_packets = 0;
+    VirtQueue *vq = netq->tx_vq;
+
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)) {
         return num_packets;
     }
 
     assert(n->vdev.vm_running);
 
-    if (n->async_tx.elem.out_num) {
-        virtio_queue_set_notification(n->tx_vq, 0);
+    if (netq->async_tx.elem.out_num) {
+        virtio_queue_set_notification(vq, 0);
         return num_packets;
     }
 
@@ -747,12 +859,12 @@ static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
             len += hdr_len;
         }
 
-        ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
-                                      virtio_net_tx_complete);
+        ret = qemu_sendv_packet_async(n->nic->ncs[vq_get_pair_index(n, vq)],
+                                      out_sg, out_num, virtio_net_tx_complete);
         if (ret == 0) {
-            virtio_queue_set_notification(n->tx_vq, 0);
-            n->async_tx.elem = elem;
-            n->async_tx.len  = len;
+            virtio_queue_set_notification(vq, 0);
+            netq->async_tx.elem = elem;
+            netq->async_tx.len  = len;
             return -EBUSY;
         }
 
@@ -771,22 +883,23 @@ static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 static void virtio_net_handle_tx_timer(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    VirtIONetQueue *netq = &n->vqs[vq_get_pair_index(n, vq)];
 
     /* This happens when device was stopped but VCPU wasn't. */
     if (!n->vdev.vm_running) {
-        n->tx_waiting = 1;
+        netq->tx_waiting = 1;
         return;
     }
 
-    if (n->tx_waiting) {
+    if (netq->tx_waiting) {
         virtio_queue_set_notification(vq, 1);
-        qemu_del_timer(n->tx_timer);
-        n->tx_waiting = 0;
-        virtio_net_flush_tx(n, vq);
+        qemu_del_timer(netq->tx_timer);
+        netq->tx_waiting = 0;
+        virtio_net_flush_tx(n, netq);
     } else {
-        qemu_mod_timer(n->tx_timer,
-                       qemu_get_clock_ns(vm_clock) + n->tx_timeout);
-        n->tx_waiting = 1;
+        qemu_mod_timer(netq->tx_timer,
+                       qemu_get_clock_ns(vm_clock) + netq->tx_timeout);
+        netq->tx_waiting = 1;
         virtio_queue_set_notification(vq, 0);
     }
 }
@@ -794,48 +907,53 @@ static void virtio_net_handle_tx_timer(VirtIODevice *vdev, VirtQueue *vq)
 static void virtio_net_handle_tx_bh(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    VirtIONetQueue *netq = &n->vqs[vq_get_pair_index(n, vq)];
 
-    if (unlikely(n->tx_waiting)) {
+    if (unlikely(netq->tx_waiting)) {
         return;
     }
-    n->tx_waiting = 1;
+    netq->tx_waiting = 1;
     /* This happens when device was stopped but VCPU wasn't. */
     if (!n->vdev.vm_running) {
         return;
     }
     virtio_queue_set_notification(vq, 0);
-    qemu_bh_schedule(n->tx_bh);
+    qemu_bh_schedule(netq->tx_bh);
 }
 
 static void virtio_net_tx_timer(void *opaque)
 {
-    VirtIONet *n = opaque;
+    VirtIONetQueue *netq = opaque;
+    VirtIONet *n = netq->n;
+
     assert(n->vdev.vm_running);
 
-    n->tx_waiting = 0;
+    netq->tx_waiting = 0;
 
     /* Just in case the driver is not ready on more */
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(netq->tx_vq, 1);
+    virtio_net_flush_tx(n, netq);
 }
 
 static void virtio_net_tx_bh(void *opaque)
 {
-    VirtIONet *n = opaque;
+    VirtIONetQueue *netq = opaque;
+    VirtQueue *vq = netq->tx_vq;
+    VirtIONet *n = netq->n;
     int32_t ret;
 
     assert(n->vdev.vm_running);
 
-    n->tx_waiting = 0;
+    netq->tx_waiting = 0;
 
     /* Just in case the driver is not ready on more */
     if (unlikely(!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)))
         return;
 
-    ret = virtio_net_flush_tx(n, n->tx_vq);
+    ret = virtio_net_flush_tx(n, netq);
     if (ret == -EBUSY) {
         return; /* Notification re-enable handled by tx_complete */
     }
@@ -843,33 +961,39 @@ static void virtio_net_tx_bh(void *opaque)
     /* If we flush a full burst of packets, assume there are
      * more coming and immediately reschedule */
     if (ret >= n->tx_burst) {
-        qemu_bh_schedule(n->tx_bh);
-        n->tx_waiting = 1;
+        qemu_bh_schedule(netq->tx_bh);
+        netq->tx_waiting = 1;
         return;
     }
 
     /* If less than a full burst, re-enable notification and flush
      * anything that may have come in while we weren't looking.  If
      * we find something, assume the guest is still active and reschedule */
-    virtio_queue_set_notification(n->tx_vq, 1);
-    if (virtio_net_flush_tx(n, n->tx_vq) > 0) {
-        virtio_queue_set_notification(n->tx_vq, 0);
-        qemu_bh_schedule(n->tx_bh);
-        n->tx_waiting = 1;
+    virtio_queue_set_notification(vq, 1);
+    if (virtio_net_flush_tx(n, netq) > 0) {
+        virtio_queue_set_notification(vq, 0);
+        qemu_bh_schedule(netq->tx_bh);
+        netq->tx_waiting = 1;
     }
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
 {
     VirtIONet *n = opaque;
+    int i;
 
     /* At this point, backend must be stopped, otherwise
      * it might keep writing to memory. */
-    assert(!n->vhost_started);
+    for (i = 0; i < n->queues; i++) {
+        assert(!n->vqs[i].vhost_started);
+    }
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, ETH_ALEN);
-    qemu_put_be32(f, n->tx_waiting);
+    qemu_put_be32(f, n->queues);
+    for (i = 0; i < n->queues; i++) {
+        qemu_put_be32(f, n->vqs[i].tx_waiting);
+    }
     qemu_put_be32(f, n->mergeable_rx_bufs);
     qemu_put_be16(f, n->status);
     qemu_put_byte(f, n->promisc);
@@ -902,7 +1026,10 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
     }
 
     qemu_get_buffer(f, n->mac, ETH_ALEN);
-    n->tx_waiting = qemu_get_be32(f);
+    n->queues = qemu_get_be32(f);
+    for (i = 0; i < n->queues; i++) {
+        n->vqs[i].tx_waiting = qemu_get_be32(f);
+    }
     n->mergeable_rx_bufs = qemu_get_be32(f);
 
     if (version_id >= 3)
@@ -930,7 +1057,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
             n->mac_table.in_use = 0;
         }
     }
- 
+
     if (version_id >= 6)
         qemu_get_buffer(f, (uint8_t *)n->vlans, MAX_VLAN >> 3);
 
@@ -941,13 +1068,16 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
         }
 
         if (n->has_vnet_hdr) {
-            tap_using_vnet_hdr(n->nic->nc.peer, 1);
-            tap_set_offload(n->nic->nc.peer,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
+            for (i = 0; i < n->queues; i++) {
+                tap_using_vnet_hdr(n->nic->ncs[i]->peer, 1);
+                tap_set_offload(n->nic->ncs[i]->peer,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_UFO)  &
+                        1);
+            }
         }
     }
 
@@ -982,7 +1112,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
 
 static void virtio_net_cleanup(VLANClientState *nc)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    VirtIONet *n = ((NICState *)nc->opaque)->opaque;
 
     n->nic = NULL;
 }
@@ -1000,6 +1130,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
                               virtio_net_conf *net)
 {
     VirtIONet *n;
+    int i;
 
     n = (VirtIONet *)virtio_common_init("virtio-net", VIRTIO_ID_NET,
                                         sizeof(struct virtio_net_config),
@@ -1012,7 +1143,6 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
     n->vdev.bad_features = virtio_net_bad_features;
     n->vdev.reset = virtio_net_reset;
     n->vdev.set_status = virtio_net_set_status;
-    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
 
     if (net->tx && strcmp(net->tx, "timer") && strcmp(net->tx, "bh")) {
         error_report("virtio-net: "
@@ -1021,15 +1151,6 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
         error_report("Defaulting to \"bh\"");
     }
 
-    if (net->tx && !strcmp(net->tx, "timer")) {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_timer);
-        n->tx_timer = qemu_new_timer_ns(vm_clock, virtio_net_tx_timer, n);
-        n->tx_timeout = net->txtimer;
-    } else {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_bh);
-        n->tx_bh = qemu_bh_new(virtio_net_tx_bh, n);
-    }
-    n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
     qemu_macaddr_default_if_unset(&conf->macaddr);
     memcpy(&n->mac[0], &conf->macaddr, sizeof(n->mac));
     n->status = VIRTIO_NET_S_LINK_UP;
@@ -1038,7 +1159,6 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
 
     qemu_format_nic_info_str(&n->nic->nc, conf->macaddr.a);
 
-    n->tx_waiting = 0;
     n->tx_burst = net->txburst;
     n->mergeable_rx_bufs = 0;
     n->promisc = 1; /* for compatibility */
@@ -1046,6 +1166,32 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
     n->mac_table.macs = g_malloc0(MAC_TABLE_ENTRIES * ETH_ALEN);
 
     n->vlans = g_malloc0(MAX_VLAN >> 3);
+    n->queues = conf->queues;
+
+    /* Allocate the rx/tx vqs for each queue pair */
+    for (i = 0; i < n->queues; i++) {
+        n->vqs[i].rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
+        if (net->tx && !strcmp(net->tx, "timer")) {
+            n->vqs[i].tx_vq = virtio_add_queue(&n->vdev, 256,
+                                               virtio_net_handle_tx_timer);
+            n->vqs[i].tx_timer = qemu_new_timer_ns(vm_clock,
+                                                   virtio_net_tx_timer,
+                                                   &n->vqs[i]);
+            n->vqs[i].tx_timeout = net->txtimer;
+        } else {
+            n->vqs[i].tx_vq = virtio_add_queue(&n->vdev, 256,
+                                               virtio_net_handle_tx_bh);
+            n->vqs[i].tx_bh = qemu_bh_new(virtio_net_tx_bh, &n->vqs[i]);
+        }
+
+        n->vqs[i].tx_waiting = 0;
+        n->vqs[i].n = n;
+
+        if (i == 0) {
+            /* keep compatible with the spec and old guests */
+            n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
+        }
+    }
 
     n->qdev = dev;
     register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
@@ -1059,24 +1205,33 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
 void virtio_net_exit(VirtIODevice *vdev)
 {
     VirtIONet *n = DO_UPCAST(VirtIONet, vdev, vdev);
+    int i;
 
     /* This will stop vhost backend if appropriate. */
     virtio_net_set_status(vdev, 0);
 
-    qemu_purge_queued_packets(&n->nic->nc);
+    for (i = 0; i < n->queues; i++) {
+        qemu_purge_queued_packets(n->nic->ncs[i]);
+    }
 
     unregister_savevm(n->qdev, "virtio-net", n);
 
     g_free(n->mac_table.macs);
     g_free(n->vlans);
 
-    if (n->tx_timer) {
-        qemu_del_timer(n->tx_timer);
-        qemu_free_timer(n->tx_timer);
-    } else {
-        qemu_bh_delete(n->tx_bh);
+    for (i = 0; i < n->queues; i++) {
+        VirtIONetQueue *netq = &n->vqs[i];
+        if (netq->tx_timer) {
+            qemu_del_timer(netq->tx_timer);
+            qemu_free_timer(netq->tx_timer);
+        } else {
+            qemu_bh_delete(netq->tx_bh);
+        }
     }
 
-    qemu_del_vlan_client(&n->nic->nc);
     virtio_cleanup(&n->vdev);
+
+    for (i = 0; i < n->queues; i++) {
+        qemu_del_vlan_client(n->nic->ncs[i]);
+    }
 }
diff --git a/hw/virtio-net.h b/hw/virtio-net.h
index 36aa463..b35ba5d 100644
--- a/hw/virtio-net.h
+++ b/hw/virtio-net.h
@@ -44,6 +44,7 @@
 #define VIRTIO_NET_F_CTRL_RX    18      /* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN  19      /* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20   /* Extra RX mode control support */
+#define VIRTIO_NET_F_MULTIQUEUE   22
 
 #define VIRTIO_NET_S_LINK_UP    1       /* Link is up */
 
@@ -72,6 +73,8 @@ struct virtio_net_config
     uint8_t mac[ETH_ALEN];
     /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
     uint16_t status;
+
+    uint16_t queues;
 } QEMU_PACKED;
 
 /* This is the first element of the scatter-gather list.  If you don't
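
A note on the queue allocation loop in virtio_net_init() above: because the
control virtqueue is added right after the first rx/tx pair, the virtqueue
indices are not simply 2*i and 2*i+1. The sketch below is not part of the
patch. The names mq_rx_index(), mq_tx_index() and MQ_CTRL_INDEX are made up
for illustration, and it assumes virtio_add_queue() hands out indices in
call order (rx0, tx0, ctrl, rx1, tx1, ...).

```c
#include <assert.h>

/*
 * Hypothetical helpers, not from the patch: they model the virtqueue
 * numbering that results when queues are added in the order
 * rx0, tx0, ctrl, rx1, tx1, rx2, tx2, ...
 */
enum { MQ_CTRL_INDEX = 2 };

/* Index of the rx virtqueue belonging to queue pair 'pair'. */
static inline int mq_rx_index(int pair)
{
    assert(pair >= 0);
    /* Pair 0 sits before the control queue, later pairs after it. */
    return pair == 0 ? 0 : 2 * pair + 1;
}

/* Index of the tx virtqueue belonging to queue pair 'pair'. */
static inline int mq_tx_index(int pair)
{
    return mq_rx_index(pair) + 1;
}
```

Under these assumptions an old single-queue guest still sees rx at 0, tx at 1
and ctrl at 2, which seems to be the point of the "i == 0" special case.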


* Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: Michael S. Tsirkin @ 2012-06-25 10:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, kvm, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <20120625090829.7263.65026.stgit@amd-6168-8-1.englab.nay.redhat.com>

On Mon, Jun 25, 2012 at 05:16:48PM +0800, Jason Wang wrote:
> Hello All:
> 
> This series is an update version of multiqueue virtio-net driver based on
> Krishna Kumar's work to let virtio-net use multiple rx/tx queues to do the
> packets reception and transmission. Please review and comments.
> 
> Test Environment:
> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
> - Two directed connected 82599
> 
> Test Summary:
> 
> - Highlights: huge improvements on TCP_RR test
> - Lowlights: regression on small packet transmission, higher cpu utilization
>              than single queue, need further optimization

Didn't review yet, reacting to this paragraph:

To avoid regressions, it seems reasonable to make
the device use a single queue by default for now.
Add a way to switch multiqueue on/off using ethtool.

This way guest admin can tune the device for the
workload manually until we manage to implement some
self-tuning heuristics.

-- 
MST


* Re: [net-next RFC V4 PATCH 3/4] virtio: introduce a method to get the irq of a specific virtqueue
From: Michael S. Tsirkin @ 2012-06-25 10:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, kvm, qemu-devel, netdev, mashirle,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <1340617278-8022-1-git-send-email-jasowang@redhat.com>

On Mon, Jun 25, 2012 at 05:41:17PM +0800, Jason Wang wrote:
> Device specific irq optimizations such as irq affinity may be used by virtio
> drivers. So this patch introduce a new method to get the irq of a specific
> virtqueue.
> 
> After this patch, virtio device drivers could query the irq and do device
> specific optimizations. First user would be virtio-net.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/lguest/lguest_device.c |    8 ++++++++
>  drivers/s390/kvm/kvm_virtio.c  |    6 ++++++
>  drivers/virtio/virtio_mmio.c   |    8 ++++++++
>  drivers/virtio/virtio_pci.c    |   12 ++++++++++++
>  include/linux/virtio_config.h  |    4 ++++
>  5 files changed, 38 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
> index 9e8388e..bcd080f 100644
> --- a/drivers/lguest/lguest_device.c
> +++ b/drivers/lguest/lguest_device.c
> @@ -392,6 +392,13 @@ static const char *lg_bus_name(struct virtio_device *vdev)
>  	return "";
>  }
>  
> +static int lg_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
> +{
> +	struct lguest_vq_info *lvq = vq->priv;
> +
> +	return lvq->config.irq;
> +}
> +
>  /* The ops structure which hooks everything together. */
>  static struct virtio_config_ops lguest_config_ops = {
>  	.get_features = lg_get_features,
> @@ -404,6 +411,7 @@ static struct virtio_config_ops lguest_config_ops = {
>  	.find_vqs = lg_find_vqs,
>  	.del_vqs = lg_del_vqs,
>  	.bus_name = lg_bus_name,
> +	.get_vq_irq = lg_get_vq_irq,
>  };
>  
>  /*
> diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
> index d74e9ae..a897de2 100644
> --- a/drivers/s390/kvm/kvm_virtio.c
> +++ b/drivers/s390/kvm/kvm_virtio.c
> @@ -268,6 +268,11 @@ static const char *kvm_bus_name(struct virtio_device *vdev)
>  	return "";
>  }
>  
> +static int kvm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
> +{
> +	return 0x2603;
> +}
> +
>  /*
>   * The config ops structure as defined by virtio config
>   */
> @@ -282,6 +287,7 @@ static struct virtio_config_ops kvm_vq_configspace_ops = {
>  	.find_vqs = kvm_find_vqs,
>  	.del_vqs = kvm_del_vqs,
>  	.bus_name = kvm_bus_name,
> +	.get_vq_irq = kvm_get_vq_irq,
>  };
>  
>  /*
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index f5432b6..2ba37ed 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -411,6 +411,13 @@ static const char *vm_bus_name(struct virtio_device *vdev)
>  	return vm_dev->pdev->name;
>  }
>  
> +static int vm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
> +{
> +	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
> +
> +	return platform_get_irq(vm_dev->pdev, 0);
> +}
> +
>  static struct virtio_config_ops virtio_mmio_config_ops = {
>  	.get		= vm_get,
>  	.set		= vm_set,
> @@ -422,6 +429,7 @@ static struct virtio_config_ops virtio_mmio_config_ops = {
>  	.get_features	= vm_get_features,
>  	.finalize_features = vm_finalize_features,
>  	.bus_name	= vm_bus_name,
> +	.get_vq_irq     = vm_get_vq_irq,
>  };
>  
>  
> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> index adb24f2..c062227 100644
> --- a/drivers/virtio/virtio_pci.c
> +++ b/drivers/virtio/virtio_pci.c
> @@ -607,6 +607,17 @@ static const char *vp_bus_name(struct virtio_device *vdev)
>  	return pci_name(vp_dev->pci_dev);
>  }
>  
> +static int vp_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
> +{
> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +	struct virtio_pci_vq_info *info = vq->priv;
> +
> +	if (vp_dev->intx_enabled)
> +		return vp_dev->pci_dev->irq;
> +	else
> +		return vp_dev->msix_entries[info->msix_vector].vector;
> +}
> +
>  static struct virtio_config_ops virtio_pci_config_ops = {
>  	.get		= vp_get,
>  	.set		= vp_set,
> @@ -618,6 +629,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
>  	.get_features	= vp_get_features,
>  	.finalize_features = vp_finalize_features,
>  	.bus_name	= vp_bus_name,
> +	.get_vq_irq     = vp_get_vq_irq,
>  };
>  
>  static void virtio_pci_release_dev(struct device *_d)
> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
> index fc457f4..acd6930 100644
> --- a/include/linux/virtio_config.h
> +++ b/include/linux/virtio_config.h
> @@ -98,6 +98,9 @@
>   *	vdev: the virtio_device
>   *      This returns a pointer to the bus name a la pci_name from which
>   *      the caller can then copy.
> + * @get_vq_irq: get the irq number of the specific virt queue.
> + *      vdev: the virtio_device
> + *      vq: the virtqueue

What if the vq does not have an IRQ? E.g. control vqs don't.
What if the IRQ is shared between VQs? Between devices?
The need to cleanup affinity on destroy is also nasty.
How about we expose a set_affinity API instead?
Then:
	- non PCI can ignore for now
	- with a per vq vector we can force it
	- with a shared MSI we make it an OR over all affinities
	- with a level interrupt we can ignore it
	- on cleanup we can do it in core
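
The policy listed above can be sketched as a small userspace model. This is
only an illustration of the proposal, not an existing virtio API: struct
model_vq, enum irq_kind and vq_effective_affinity() are hypothetical names,
CPU sets are plain 64-bit masks, and all virtqueues marked as shared are
assumed to share one single vector. Cleanup on destroy is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; none of these exist in the virtio core. */
enum irq_kind {
    IRQ_NONE,        /* non-PCI transport, or a vq without an interrupt */
    IRQ_PER_VQ_MSIX, /* one MSI-X vector per virtqueue */
    IRQ_SHARED_MSIX, /* one MSI-X vector shared by several virtqueues */
    IRQ_LEVEL_INTX,  /* legacy level-triggered interrupt */
};

struct model_vq {
    enum irq_kind kind;
    uint64_t wanted_cpus; /* affinity requested by the driver, one bit per CPU */
};

/*
 * Return the CPU mask to program for the interrupt that serves vqs[idx],
 * or 0 when the request should simply be ignored.
 */
static inline uint64_t vq_effective_affinity(const struct model_vq *vqs,
                                             size_t n, size_t idx)
{
    uint64_t mask = 0;
    size_t i;

    assert(idx < n);
    switch (vqs[idx].kind) {
    case IRQ_PER_VQ_MSIX:
        /* Dedicated vector: the request can be forced as is. */
        return vqs[idx].wanted_cpus;
    case IRQ_SHARED_MSIX:
        /* Shared vector: OR over the affinities of all sharers. */
        for (i = 0; i < n; i++)
            if (vqs[i].kind == IRQ_SHARED_MSIX)
                mask |= vqs[i].wanted_cpus;
        return mask;
    case IRQ_NONE:
    case IRQ_LEVEL_INTX:
    default:
        /* Non-PCI and level interrupts: ignore for now. */
        return 0;
    }
}
```

With this shape the driver never needs to see an IRQ number, so the questions
about vqs without an IRQ or with a shared IRQ stay inside the transport.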


>   */
>  typedef void vq_callback_t(struct virtqueue *);
>  struct virtio_config_ops {
> @@ -116,6 +119,7 @@ struct virtio_config_ops {
>  	u32 (*get_features)(struct virtio_device *vdev);
>  	void (*finalize_features)(struct virtio_device *vdev);
>  	const char *(*bus_name)(struct virtio_device *vdev);
> +	int (*get_vq_irq)(struct virtio_device *vdev, struct virtqueue *vq);
>  };
>  
>  /* If driver didn't advertise the feature, it will never appear. */
> -- 
> 1.7.1


* Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: John Fastabend @ 2012-06-25 14:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, habanero, mashirle, kvm, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <20120625100758.GB19169@redhat.com>

On 6/25/2012 3:07 AM, Michael S. Tsirkin wrote:
> On Mon, Jun 25, 2012 at 05:16:48PM +0800, Jason Wang wrote:
>> Hello All:
>>
>> This series is an update version of multiqueue virtio-net driver based on
>> Krishna Kumar's work to let virtio-net use multiple rx/tx queues to do the
>> packets reception and transmission. Please review and comments.
>>
>> Test Environment:
>> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
>> - Two directed connected 82599
>>
>> Test Summary:
>>
>> - Highlights: huge improvements on TCP_RR test
>> - Lowlights: regression on small packet transmission, higher cpu utilization
>>               than single queue, need further optimization
>
> Didn't review yet, reacting to this paragraph:
>
> To avoid regressions, it seems reasonable to make
> the device use a single queue by default for now.
> Add a way to switch multiqueue on/off using ethtool.
>
> This way guest admin can tune the device for the
> workload manually until we manage to implement some
> self-tuning heuristics.
>

Ethtool already has this switch: 'ethtool -L' can be
used to set the number of tx/rx channels. So you would
likely just need to add a set_channels hook.

.John
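
As a rough illustration of the two suggestions in this thread (single queue
by default, plus a set_channels-style switch), here is a userspace model of
the validation such a hook would probably do. It does not use the kernel's
ethtool structures: struct model_net, model_net_init() and
model_set_channels() are hypothetical names chosen for this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for driver state; not a kernel structure. */
struct model_net {
    unsigned int max_queue_pairs; /* what the device offers */
    unsigned int cur_queue_pairs; /* what the guest currently uses */
};

/* Start with a single queue pair by default, to avoid regressions. */
static inline void model_net_init(struct model_net *n, unsigned int max_pairs)
{
    assert(max_pairs >= 1);
    n->max_queue_pairs = max_pairs;
    n->cur_queue_pairs = 1;
}

/*
 * Model of a set_channels-style handler: accept a request for 'pairs'
 * rx/tx queue pairs if the device can provide them.
 * Returns 0 on success, or -EINVAL with the state left unchanged.
 */
static inline int model_set_channels(struct model_net *n, unsigned int pairs)
{
    if (pairs == 0 || pairs > n->max_queue_pairs)
        return -EINVAL;
    n->cur_queue_pairs = pairs;
    return 0;
}
```

In a real driver the admin would reach such a handler through 'ethtool -L';
how the request maps onto rx, tx or combined counts is left open here.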


* Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: Sridhar Samudrala @ 2012-06-25 17:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, kvm, mst, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <20120625090829.7263.65026.stgit@amd-6168-8-1.englab.nay.redhat.com>

On 6/25/2012 2:16 AM, Jason Wang wrote:
> Hello All:
>
> This series is an update version of multiqueue virtio-net driver based on
> Krishna Kumar's work to let virtio-net use multiple rx/tx queues to do the
> packets reception and transmission. Please review and comments.
>
> Test Environment:
> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
> - Two directed connected 82599
>
> Test Summary:
>
> - Highlights: huge improvements on TCP_RR test
> - Lowlights: regression on small packet transmission, higher cpu utilization
>               than single queue, need further optimization
Does this also scale with increased number of VMs?

Thanks
Sridhar
>
> Analysis of the performance result:
>
> - I count the number of packets sending/receiving during the test, and
>    multiqueue show much more ability in terms of packets per second.
>
> - For the tx regression, multiqueue send about 1-2 times of more packets
>    compared to single queue, and the packets size were much smaller than single
>    queue does. I suspect tcp does less batching in multiqueue, so I hack the
>    tcp_write_xmit() to force more batching, multiqueue works as well as
>    singlequeue for both small transmission and throughput
>
> - I didn't pack the accelerate RFS with virtio-net in this series as it still
>    need further shaping, for the one that interested in this please see:
>    http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html
>
>


* Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: Shirley Ma @ 2012-06-25 18:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <20120625090829.7263.65026.stgit@amd-6168-8-1.englab.nay.redhat.com>

Hello Jason,

Good work. Do you have local guest to guest results?

Thanks
Shirley


* [PATCH 0/4] make balloon pages movable by compaction
From: Rafael Aquini @ 2012-06-25 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Rafael Aquini, Michael S. Tsirkin, linux-kernel,
	virtualization

This patchset follows the main idea discussed at the 2012 LSFMMS session
"Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/

It introduces the required changes to the virtio_balloon driver, as well as
changes to the core compaction & migration bits, in order to allow
memory balloon pages to become movable within a guest.

Rafael Aquini (4):
  mm: introduce compaction and migration for virtio ballooned pages
  virtio_balloon: handle concurrent accesses to virtio_balloon struct
    elements
  virtio_balloon: introduce migration primitives to balloon pages
  mm: add vm event counters for balloon pages compaction

 drivers/virtio/virtio_balloon.c |  142 +++++++++++++++++++++++++++++++++++----
 include/linux/mm.h              |   17 +++++
 include/linux/virtio_balloon.h  |    6 ++
 include/linux/vm_event_item.h   |    2 +
 mm/compaction.c                 |   74 ++++++++++++++++++++
 mm/migrate.c                    |   32 ++++++++-
 mm/vmstat.c                     |    4 ++
 7 files changed, 263 insertions(+), 14 deletions(-)


Preliminary test results:
(2 VCPU 1024MB RAM KVM guest running 3.5.0_rc4+)

* 64MB balloon:
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 0
compact_pages_moved 0
compact_pagemigrate_failed 0
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_migrated 0
compact_balloon_failed 0
compact_balloon_isolated 0
compact_balloon_freed 0
[root@localhost ~]#
[root@localhost ~]# for i in $(seq 1 4); do echo 1> /proc/sys/vm/compact_memory & done &>/dev/null
[1]   Done                    echo > /proc/sys/vm/compact_memory
[2]   Done                    echo > /proc/sys/vm/compact_memory
[3]-  Done                    echo > /proc/sys/vm/compact_memory
[4]+  Done                    echo > /proc/sys/vm/compact_memory
[root@localhost ~]#
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 2683
compact_pages_moved 47502
compact_pagemigrate_failed 61
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_migrated 16384
compact_balloon_failed 0
compact_balloon_isolated 16384
compact_balloon_freed 16384


* 128MB balloon:
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 0
compact_pages_moved 0
compact_pagemigrate_failed 0
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_migrated 0
compact_balloon_failed 0
compact_balloon_isolated 0
compact_balloon_freed 0
[root@localhost ~]#
[root@localhost ~]# for i in $(seq 1 4); do echo 1> /proc/sys/vm/compact_memory & done &>/dev/null
[1]   Done                    echo > /proc/sys/vm/compact_memory
[2]   Done                    echo > /proc/sys/vm/compact_memory
[3]-  Done                    echo > /proc/sys/vm/compact_memory
[4]+  Done                    echo > /proc/sys/vm/compact_memory
[root@localhost ~]# awk '/compact/ {print}' /proc/vmstat
compact_blocks_moved 2624
compact_pages_moved 49195
compact_pagemigrate_failed 54
compact_stall 0
compact_fail 0
compact_success 0
compact_balloon_migrated 29350
compact_balloon_failed 29
compact_balloon_isolated 29379
compact_balloon_freed 29350
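
The counters above can be sanity-checked with a little arithmetic. The
sketch below assumes 4 KiB guest pages (as on x86); balloon_pages() and
counters_consistent() are illustrative helper names, not kernel functions.
A 64 MB balloon then holds 16384 pages, which matches the migrated count of
the first run, and in the 128 MB run the 29379 isolated pages split into
29350 migrated plus 29 failed.

```c
#include <assert.h>

/* Assumption: 4 KiB guest pages, as on x86. */
enum { MODEL_PAGE_SIZE = 4096 };

/* Number of guest pages held by a balloon of 'mib' mebibytes. */
static inline unsigned long balloon_pages(unsigned long mib)
{
    return mib * 1024UL * 1024UL / MODEL_PAGE_SIZE;
}

/* Every isolated page should end up either migrated or failed. */
static inline int counters_consistent(unsigned long isolated,
                                      unsigned long migrated,
                                      unsigned long failed)
{
    return isolated == migrated + failed;
}
```

Note that a 128 MB balloon would hold 32768 pages under the same assumption,
so not every ballooned page was isolated in the second run.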
-- 
1.7.10.2


* [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
From: Rafael Aquini @ 2012-06-25 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Rafael Aquini, Michael S. Tsirkin, linux-kernel,
	virtualization
In-Reply-To: <cover.1340665087.git.aquini@redhat.com>

This patch introduces helper functions that teach compaction and migration bits
how to cope with pages which are part of a guest memory balloon, in order to
make them movable by memory compaction procedures.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 include/linux/mm.h |   17 +++++++++++++
 mm/compaction.c    |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/migrate.c       |   30 +++++++++++++++++++++-
 3 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b36d08c..360656e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1629,5 +1629,22 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
+#if (defined(CONFIG_VIRTIO_BALLOON) || \
+	defined(CONFIG_VIRTIO_BALLOON_MODULE)) && defined(CONFIG_COMPACTION)
+extern int is_balloon_page(struct page *);
+extern int isolate_balloon_page(struct page *);
+extern int putback_balloon_page(struct page *);
+
+/* return 1 if page is part of a guest's memory balloon, 0 otherwise */
+static inline int PageBalloon(struct page *page)
+{
+	return is_balloon_page(page);
+}
+#else
+static inline int PageBalloon(struct page *page)		{ return 0; }
+static inline int isolate_balloon_page(struct page *page)	{ return 0; }
+static inline int putback_balloon_page(struct page *page)	{ return 0; }
+#endif /* (VIRTIO_BALLOON || VIRTIO_BALLOON_MODULE) && COMPACTION */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 7ea259d..8835d55 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -14,6 +14,7 @@
 #include <linux/backing-dev.h>
 #include <linux/sysctl.h>
 #include <linux/sysfs.h>
+#include <linux/export.h>
 #include "internal.h"
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -312,6 +313,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			continue;
 		}
 
+		/*
+		 * For ballooned pages, we need to isolate them before testing
+		 * for PageLRU, as well as skip the LRU page isolation steps.
+		 */
+		if (PageBalloon(page))
+			if (isolate_balloon_page(page))
+				goto isolated_balloon_page;
+
 		if (!PageLRU(page))
 			continue;
 
@@ -338,6 +347,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 		/* Successfully isolated */
 		del_page_from_lru_list(page, lruvec, page_lru(page));
+isolated_balloon_page:
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
 		nr_isolated++;
@@ -903,4 +913,66 @@ void compaction_unregister_node(struct node *node)
 }
 #endif /* CONFIG_SYSFS && CONFIG_NUMA */
 
+#if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)
+/*
+ * Balloon pages special page->mapping.
+ * Users must properly allocate and initialize an instance of balloon_mapping,
+ * and set it as the page->mapping for balloon enlisted page instances.
+ *
+ * address_space_operations necessary methods for ballooned pages:
+ *   .migratepage    - used to perform balloon's page migration (as is)
+ *   .invalidatepage - used to isolate a page from balloon's page list
+ *   .freepage       - used to reinsert an isolated page to balloon's page list
+ */
+struct address_space *balloon_mapping;
+EXPORT_SYMBOL(balloon_mapping);
+
+/* ballooned page id check */
+int is_balloon_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	if (mapping == balloon_mapping)
+		return 1;
+	return 0;
+}
+
+/* __isolate_lru_page() counterpart for a ballooned page */
+int isolate_balloon_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	if (mapping->a_ops->invalidatepage) {
+		/*
+		 * We can race against move_to_new_page() and stumble across a
+		 * locked 'newpage'. If we succeed on isolating it, the result
+		 * tends to be disastrous. So, we sanely skip PageLocked here.
+		 */
+		if (likely(!PageLocked(page) && get_page_unless_zero(page))) {
+			/*
+			 * A ballooned page, by default, has just one refcount.
+			 * Prevent concurrent compaction threads from isolating
+			 * an already isolated balloon page.
+			 */
+			if (page_count(page) == 2) {
+				mapping->a_ops->invalidatepage(page, 0);
+				return 1;
+			}
+			/* Drop refcount taken for this already isolated page */
+			put_page(page);
+		}
+	}
+	return 0;
+}
+
+/* putback_lru_page() counterpart for a ballooned page */
+int putback_balloon_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	if (mapping->a_ops->freepage) {
+		mapping->a_ops->freepage(page);
+		put_page(page);
+		return 1;
+	}
+	return 0;
+}
+#endif /* CONFIG_VIRTIO_BALLOON || CONFIG_VIRTIO_BALLOON_MODULE */
 #endif /* CONFIG_COMPACTION */
diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..ffc02a4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -78,7 +78,10 @@ void putback_lru_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		putback_lru_page(page);
+		if (unlikely(PageBalloon(page)))
+			VM_BUG_ON(!putback_balloon_page(page));
+		else
+			putback_lru_page(page);
 	}
 }
 
@@ -783,6 +786,17 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		}
 	}
 
+	if (PageBalloon(page)) {
+		/*
+		 * A ballooned page does not need any special attention from
+		 * physical to virtual reverse mapping procedures.
+		 * To avoid burning cycles at rmap level,
+		 * skip attempts to unmap PTEs or remap swapcache.
+		 */
+		remap_swapcache = 0;
+		goto skip_unmap;
+	}
+
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -852,6 +866,20 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 			goto out;
 
 	rc = __unmap_and_move(page, newpage, force, offlining, mode);
+
+	if (PageBalloon(newpage)) {
+		/*
+		 * A ballooned page has been migrated already. Now it is
+		 * time to wrap up the counters, hand the old page back to Buddy
+		 * and return.
+		 */
+		list_del(&page->lru);
+		dec_zone_page_state(page, NR_ISOLATED_ANON +
+				    page_is_file_cache(page));
+		put_page(page);
+		__free_page(page);
+		return rc;
+	}
 out:
 	if (rc != -EAGAIN) {
 		/*
-- 
1.7.10.2
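
The reference-count guard in isolate_balloon_page() above may be easier to
follow in a stripped-down form. What follows is a single-threaded userspace
model only: struct model_page and the model_*() functions are hypothetical,
and the real code relies on atomic page reference operations rather than a
plain int. It shows why a page count of exactly 2 is taken to mean "the
balloon's reference plus ours", and why a second isolation attempt backs off.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a page; not struct page. */
struct model_page {
    int refcount;         /* 1 while the page just sits in the balloon */
    bool locked;
    bool on_balloon_list;
};

/* Take a reference unless the page is already on its way to being freed. */
static inline bool model_get_unless_zero(struct model_page *p)
{
    if (p->refcount == 0)
        return false;
    p->refcount++;
    return true;
}

/* Returns 1 if this caller isolated the page, 0 otherwise. */
static inline int model_isolate_balloon_page(struct model_page *p)
{
    if (p->locked || !model_get_unless_zero(p))
        return 0;
    if (p->refcount == 2) {
        /* Only the balloon's reference plus ours: safe to isolate. */
        p->on_balloon_list = false;
        return 1;
    }
    /* Already isolated by someone else; drop the reference we took. */
    p->refcount--;
    return 0;
}

/* Put an isolated page back and drop the isolation reference. */
static inline void model_putback_balloon_page(struct model_page *p)
{
    assert(!p->on_balloon_list && p->refcount >= 2);
    p->on_balloon_list = true;
    p->refcount--;
}
```

In this model an isolated page keeps a count of 2 until it is put back, so
any concurrent isolation attempt sees 3 after taking its own reference and
gives up.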


* [PATCH 2/4] virtio_balloon: handle concurrent accesses to virtio_balloon struct elements
From: Rafael Aquini @ 2012-06-25 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Rafael Aquini, Michael S. Tsirkin, linux-kernel,
	virtualization
In-Reply-To: <cover.1340665087.git.aquini@redhat.com>

This patch introduces access synchronization to critical elements of struct
virtio_balloon, in order to cope with the thread concurrency that the
compaction/migration bits might end up imposing on the balloon driver in
several situations.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 drivers/virtio/virtio_balloon.c |   45 +++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index bfbc15c..d47c5c2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -51,6 +51,10 @@ struct virtio_balloon
 
 	/* Number of balloon pages we've told the Host we're not using. */
 	unsigned int num_pages;
+
+	/* Protect 'pages', 'pfns' & 'num_pfns' against concurrent updates */
+	spinlock_t pfn_list_lock;
+
 	/*
 	 * The pages we've told the Host we're not using.
 	 * Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
@@ -97,21 +101,23 @@ static void balloon_ack(struct virtqueue *vq)
 		complete(&vb->acked);
 }
 
-static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
-{
-	struct scatterlist sg;
-
-	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+/* Protection for concurrent accesses to balloon virtqueues and vb->acked */
+DEFINE_MUTEX(vb_queue_completion);
 
+static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq,
+		      struct scatterlist *sg)
+{
+	mutex_lock(&vb_queue_completion);
 	init_completion(&vb->acked);
 
 	/* We should always be able to add one buffer to an empty queue. */
-	if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
+	if (virtqueue_add_buf(vq, sg, 1, 0, vb, GFP_KERNEL) < 0)
 		BUG();
 	virtqueue_kick(vq);
 
 	/* When host has read buffer, this completes via balloon_ack */
 	wait_for_completion(&vb->acked);
+	mutex_unlock(&vb_queue_completion);
 }
 
 static void set_page_pfns(u32 pfns[], struct page *page)
@@ -126,9 +132,12 @@ static void set_page_pfns(u32 pfns[], struct page *page)
 
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
+	struct scatterlist sg;
+	int alloc_failed = 0;
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
 
+	spin_lock(&vb->pfn_list_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
 	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
 		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
@@ -138,8 +147,7 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 				dev_printk(KERN_INFO, &vb->vdev->dev,
 					   "Out of puff! Can't get %zu pages\n",
 					   num);
-			/* Sleep for at least 1/5 of a second before retry. */
-			msleep(200);
+			alloc_failed = 1;
 			break;
 		}
 		set_page_pfns(vb->pfns + vb->num_pfns, page);
@@ -149,10 +157,19 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 	}
 
 	/* Didn't get any?  Oh well. */
-	if (vb->num_pfns == 0)
+	if (vb->num_pfns == 0) {
+		spin_unlock(&vb->pfn_list_lock);
 		return;
+	}
+
+	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	spin_unlock(&vb->pfn_list_lock);
 
-	tell_host(vb, vb->inflate_vq);
+	/* alloc_page failed, sleep for at least 1/5 of a sec before retry. */
+	if (alloc_failed)
+		msleep(200);
+
+	tell_host(vb, vb->inflate_vq, &sg);
 }
 
 static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
@@ -169,10 +186,12 @@ static void release_pages_by_pfn(const u32 pfns[], unsigned int num)
 static void leak_balloon(struct virtio_balloon *vb, size_t num)
 {
 	struct page *page;
+	struct scatterlist sg;
 
 	/* We can only do one array worth at a time. */
 	num = min(num, ARRAY_SIZE(vb->pfns));
 
+	spin_lock(&vb->pfn_list_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
 	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
 		page = list_first_entry(&vb->pages, struct page, lru);
@@ -180,13 +199,15 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
 		set_page_pfns(vb->pfns + vb->num_pfns, page);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
 	}
+	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	spin_unlock(&vb->pfn_list_lock);
 
 	/*
 	 * Note that if
 	 * virtio_has_feature(vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST);
 	 * is true, we *have* to do it in this order
 	 */
-	tell_host(vb, vb->deflate_vq);
+	tell_host(vb, vb->deflate_vq, &sg);
 	release_pages_by_pfn(vb->pfns, vb->num_pfns);
 }
 
@@ -356,6 +377,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	}
 
 	INIT_LIST_HEAD(&vb->pages);
+	spin_lock_init(&vb->pfn_list_lock);
+
 	vb->num_pages = 0;
 	init_waitqueue_head(&vb->config_change);
 	vb->vdev = vdev;
-- 
1.7.10.2


* [PATCH 3/4] virtio_balloon: introduce migration primitives to balloon pages
From: Rafael Aquini @ 2012-06-25 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Rafael Aquini, Michael S. Tsirkin, linux-kernel,
	virtualization
In-Reply-To: <cover.1340665087.git.aquini@redhat.com>

This patch makes balloon pages movable at allocation time and introduces the
infrastructure needed to perform the balloon page migration operation.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 drivers/virtio/virtio_balloon.c |   96 ++++++++++++++++++++++++++++++++++++++-
 include/linux/virtio_balloon.h  |    6 +++
 2 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d47c5c2..53386aa 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -27,6 +27,8 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -140,8 +142,9 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 	spin_lock(&vb->pfn_list_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
 	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-					__GFP_NOMEMALLOC | __GFP_NOWARN);
+		struct page *page = alloc_page(GFP_HIGHUSER_MOVABLE |
+						__GFP_NORETRY | __GFP_NOWARN |
+						__GFP_NOMEMALLOC);
 		if (!page) {
 			if (printk_ratelimit())
 				dev_printk(KERN_INFO, &vb->vdev->dev,
@@ -154,6 +157,7 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 		totalram_pages--;
 		list_add(&page->lru, &vb->pages);
+		page->mapping = balloon_mapping;
 	}
 
 	/* Didn't get any?  Oh well. */
@@ -195,6 +199,7 @@ static void leak_balloon(struct virtio_balloon *vb, size_t num)
 	for (vb->num_pfns = 0; vb->num_pfns < num;
 	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
 		page = list_first_entry(&vb->pages, struct page, lru);
+		page->mapping = NULL;
 		list_del(&page->lru);
 		set_page_pfns(vb->pfns + vb->num_pfns, page);
 		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
@@ -365,6 +370,77 @@ static int init_vqs(struct virtio_balloon *vb)
 	return 0;
 }
 
+/*
+ * Populate balloon_mapping->a_ops->migratepage method to perform the balloon
+ * page migration task.
+ *
+ * After a ballooned page gets isolated by compaction procedures, this is the
+ * function that performs the page migration on behalf of move_to_new_page(),
+ * when the latter calls (page)->mapping->a_ops->migratepage.
+ *
+ * Page migration for virtio balloon is done in a simple swap fashion which
+ * follows these two steps:
+ *  1) insert newpage into vb->pages list and update the host about it;
+ *  2) update the host about the removed old page from vb->pages list;
+ */
+int virtballoon_migratepage(struct address_space *mapping,
+		struct page *newpage, struct page *page, enum migrate_mode mode)
+{
+	struct virtio_balloon *vb = (void *)mapping->backing_dev_info;
+	struct scatterlist sg;
+
+	/* balloon's page migration 1st step */
+	spin_lock(&vb->pfn_list_lock);
+	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+	list_add(&newpage->lru, &vb->pages);
+	set_page_pfns(vb->pfns, newpage);
+	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	spin_unlock(&vb->pfn_list_lock);
+	tell_host(vb, vb->inflate_vq, &sg);
+
+	/* balloon's page migration 2nd step */
+	spin_lock(&vb->pfn_list_lock);
+	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
+	set_page_pfns(vb->pfns, page);
+	sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
+	spin_unlock(&vb->pfn_list_lock);
+	tell_host(vb, vb->deflate_vq, &sg);
+
+	return 0;
+}
+
+/*
+ * Populate balloon_mapping->a_ops->invalidatepage method to help compaction on
+ * isolating a page from the balloon page list.
+ */
+void virtballoon_isolatepage(struct page *page, unsigned long mode)
+{
+	struct address_space *mapping = page->mapping;
+	struct virtio_balloon *vb = (void *)mapping->backing_dev_info;
+	spin_lock(&vb->pfn_list_lock);
+	list_del(&page->lru);
+	spin_unlock(&vb->pfn_list_lock);
+}
+
+/*
+ * Populate balloon_mapping->a_ops->freepage method to help compaction on
+ * re-inserting an isolated page into the balloon page list.
+ */
+void virtballoon_putbackpage(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct virtio_balloon *vb = (void *)mapping->backing_dev_info;
+	spin_lock(&vb->pfn_list_lock);
+	list_add(&page->lru, &vb->pages);
+	spin_unlock(&vb->pfn_list_lock);
+}
+
+static const struct address_space_operations virtio_balloon_aops = {
+	.migratepage = virtballoon_migratepage,
+	.invalidatepage = virtballoon_isolatepage,
+	.freepage = virtballoon_putbackpage,
+};
+
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
@@ -384,6 +460,19 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->vdev = vdev;
 	vb->need_stats_update = 0;
 
+	/* Init the ballooned page->mapping special balloon_mapping */
+	balloon_mapping = kmalloc(sizeof(*balloon_mapping), GFP_KERNEL);
+	if (!balloon_mapping) {
+		err = -ENOMEM;
+		goto out_free_mapping;
+	}
+
+	INIT_RADIX_TREE(&balloon_mapping->page_tree, GFP_ATOMIC | __GFP_NOWARN);
+	INIT_LIST_HEAD(&balloon_mapping->i_mmap_nonlinear);
+	spin_lock_init(&balloon_mapping->tree_lock);
+	balloon_mapping->a_ops = &virtio_balloon_aops;
+	balloon_mapping->backing_dev_info = (void *)vb;
+
 	err = init_vqs(vb);
 	if (err)
 		goto out_free_vb;
@@ -398,6 +487,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
 out_del_vqs:
 	vdev->config->del_vqs(vdev);
+out_free_mapping:
+	kfree(balloon_mapping);
 out_free_vb:
 	kfree(vb);
 out:
@@ -424,6 +515,7 @@ static void __devexit virtballoon_remove(struct virtio_device *vdev)
 	kthread_stop(vb->thread);
 	remove_common(vb);
 	kfree(vb);
+	kfree(balloon_mapping);
 }
 
 #ifdef CONFIG_PM
diff --git a/include/linux/virtio_balloon.h b/include/linux/virtio_balloon.h
index 652dc8b..db21300 100644
--- a/include/linux/virtio_balloon.h
+++ b/include/linux/virtio_balloon.h
@@ -56,4 +56,10 @@ struct virtio_balloon_stat {
 	u64 val;
 } __attribute__((packed));
 
+#if defined(CONFIG_COMPACTION)
+extern struct address_space *balloon_mapping;
+#else
+struct address_space *balloon_mapping;
+#endif
+
 #endif /* _LINUX_VIRTIO_BALLOON_H */
-- 
1.7.10.2

^ permalink raw reply related

* [PATCH 4/4] mm: add vm event counters for balloon pages compaction
From: Rafael Aquini @ 2012-06-25 23:25 UTC (permalink / raw)
  To: linux-mm
  Cc: Rik van Riel, Rafael Aquini, Michael S. Tsirkin, linux-kernel,
	virtualization
In-Reply-To: <cover.1340665087.git.aquini@redhat.com>

This patch is intended only for test reporting and should be dropped if the
rest of this patchset is accepted for merging.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 drivers/virtio/virtio_balloon.c |    1 +
 include/linux/vm_event_item.h   |    2 ++
 mm/compaction.c                 |    4 +++-
 mm/migrate.c                    |    6 ++++--
 mm/vmstat.c                     |    4 ++++
 5 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 53386aa..c4a929d 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -406,6 +406,7 @@ int virtballoon_migratepage(struct address_space *mapping,
 	spin_unlock(&vb->pfn_list_lock);
 	tell_host(vb, vb->deflate_vq, &sg);
 
+	count_vm_event(COMPACTBALLOONMIGRATED);
 	return 0;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 06f8e38..e330c5a 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -40,6 +40,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_COMPACTION
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
+		COMPACTBALLOONMIGRATED, COMPACTBALLOONFAILED,
+		COMPACTBALLOONISOLATED, COMPACTBALLOONFREED,
 #endif
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/compaction.c b/mm/compaction.c
index 8835d55..cf250b8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -318,8 +318,10 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 * for PageLRU, as well as skip the LRU page isolation steps.
 		 */
 		if (PageBalloon(page))
-			if (isolate_balloon_page(page))
+			if (isolate_balloon_page(page)) {
+				count_vm_event(COMPACTBALLOONISOLATED);
 				goto isolated_balloon_page;
+			}
 
 		if (!PageLRU(page))
 			continue;
diff --git a/mm/migrate.c b/mm/migrate.c
index ffc02a4..3dbca33 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -78,9 +78,10 @@ void putback_lru_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(PageBalloon(page)))
+		if (unlikely(PageBalloon(page))) {
 			VM_BUG_ON(!putback_balloon_page(page));
-		else
+			count_vm_event(COMPACTBALLOONFAILED);
+		} else
 			putback_lru_page(page);
 	}
 }
@@ -878,6 +879,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 				    page_is_file_cache(page));
 		put_page(page);
 		__free_page(page);
+		count_vm_event(COMPACTBALLOONFREED);
 		return rc;
 	}
 out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1bbbbd9..3b7109f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -767,6 +767,10 @@ const char * const vmstat_text[] = {
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
+	"compact_balloon_migrated",
+	"compact_balloon_failed",
+	"compact_balloon_isolated",
+	"compact_balloon_freed",
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-- 
1.7.10.2

^ permalink raw reply related

* Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
From: Konrad Rzeszutek Wilk @ 2012-06-25 23:31 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Rik van Riel, Michael S. Tsirkin, linux-kernel, virtualization,
	linux-mm
In-Reply-To: <7f83427b3894af7969c67acc0f27ab5aa68b4279.1340665087.git.aquini@redhat.com>

On Mon, Jun 25, 2012 at 7:25 PM, Rafael Aquini <aquini@redhat.com> wrote:
> This patch introduces helper functions that teach compaction and migration bits
> how to cope with pages which are part of a guest memory balloon, in order to
> make them movable by memory compaction procedures.
>

Should the names that are exported be prefixed with kvm_?

> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
>  include/linux/mm.h |   17 +++++++++++++
>  mm/compaction.c    |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/migrate.c       |   30 +++++++++++++++++++++-
>  3 files changed, 118 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b36d08c..360656e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1629,5 +1629,22 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
>  static inline bool page_is_guard(struct page *page) { return false; }
>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>
> +#if (defined(CONFIG_VIRTIO_BALLOON) || \
> +       defined(CONFIG_VIRTIO_BALLOON_MODULE)) && defined(CONFIG_COMPACTION)
> +extern int is_balloon_page(struct page *);
> +extern int isolate_balloon_page(struct page *);
> +extern int putback_balloon_page(struct page *);
> +
> +/* return 1 if page is part of a guest's memory balloon, 0 otherwise */
> +static inline int PageBalloon(struct page *page)
> +{
> +       return is_balloon_page(page);
> +}
> +#else
> +static inline int PageBalloon(struct page *page)               { return 0; }
> +static inline int isolate_balloon_page(struct page *page)      { return 0; }
> +static inline int putback_balloon_page(struct page *page)      { return 0; }
> +#endif /* (VIRTIO_BALLOON || VIRTIO_BALLOON_MODULE) && COMPACTION */
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7ea259d..8835d55 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -14,6 +14,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/sysctl.h>
>  #include <linux/sysfs.h>
> +#include <linux/export.h>
>  #include "internal.h"
>
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> @@ -312,6 +313,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>                        continue;
>                }
>
> +               /*
> +                * For ballooned pages, we need to isolate them before testing
> +                * for PageLRU, as well as skip the LRU page isolation steps.
> +                */
> +               if (PageBalloon(page))
> +                       if (isolate_balloon_page(page))
> +                               goto isolated_balloon_page;
> +
>                if (!PageLRU(page))
>                        continue;
>
> @@ -338,6 +347,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>
>                /* Successfully isolated */
>                del_page_from_lru_list(page, lruvec, page_lru(page));
> +isolated_balloon_page:
>                list_add(&page->lru, migratelist);
>                cc->nr_migratepages++;
>                nr_isolated++;
> @@ -903,4 +913,66 @@ void compaction_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>
> +#if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)
> +/*
> + * Balloon pages special page->mapping.
> + * users must properly allocate and initialize an instance of balloon_mapping,
> + * and set it as the page->mapping for balloon enlisted page instances.
> + *
> + * address_space_operations necessary methods for ballooned pages:
> + *   .migratepage    - used to perform balloon's page migration (as is)
> + *   .invalidatepage - used to isolate a page from balloon's page list
> + *   .freepage       - used to reinsert an isolated page to balloon's page list
> + */
> +struct address_space *balloon_mapping;
> +EXPORT_SYMBOL(balloon_mapping);
> +
> +/* ballooned page id check */
> +int is_balloon_page(struct page *page)
> +{
> +       struct address_space *mapping = page->mapping;
> +       if (mapping == balloon_mapping)
> +               return 1;
> +       return 0;
> +}
> +
> +/* __isolate_lru_page() counterpart for a ballooned page */
> +int isolate_balloon_page(struct page *page)
> +{
> +       struct address_space *mapping = page->mapping;
> +       if (mapping->a_ops->invalidatepage) {
> +               /*
> +                * We can race against move_to_new_page() and stumble across a
> +                * locked 'newpage'. If we succeed on isolating it, the result
> +                * tends to be disastrous. So, we sanely skip PageLocked here.
> +                */
> +               if (likely(!PageLocked(page) && get_page_unless_zero(page))) {
> +                       /*
> +                        * A ballooned page, by default, has just one refcount.
> +                        * Prevent concurrent compaction threads from isolating
> +                        * an already isolated balloon page.
> +                        */
> +                       if (page_count(page) == 2) {
> +                               mapping->a_ops->invalidatepage(page, 0);
> +                               return 1;
> +                       }
> +                       /* Drop refcount taken for this already isolated page */
> +                       put_page(page);
> +               }
> +       }
> +       return 0;
> +}
> +
> +/* putback_lru_page() counterpart for a ballooned page */
> +int putback_balloon_page(struct page *page)
> +{
> +       struct address_space *mapping = page->mapping;
> +       if (mapping->a_ops->freepage) {
> +               mapping->a_ops->freepage(page);
> +               put_page(page);
> +               return 1;
> +       }
> +       return 0;
> +}
> +#endif /* CONFIG_VIRTIO_BALLOON || CONFIG_VIRTIO_BALLOON_MODULE */
>  #endif /* CONFIG_COMPACTION */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index be26d5c..ffc02a4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -78,7 +78,10 @@ void putback_lru_pages(struct list_head *l)
>                list_del(&page->lru);
>                dec_zone_page_state(page, NR_ISOLATED_ANON +
>                                page_is_file_cache(page));
> -               putback_lru_page(page);
> +               if (unlikely(PageBalloon(page)))
> +                       VM_BUG_ON(!putback_balloon_page(page));
> +               else
> +                       putback_lru_page(page);
>        }
>  }
>
> @@ -783,6 +786,17 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>                }
>        }
>
> +       if (PageBalloon(page)) {
> +               /*
> +                * A ballooned page does not need any special attention from
> +                * physical to virtual reverse mapping procedures.
> +                * To avoid burning cycles at rmap level,
> +                * skip attempts to unmap PTEs or remap swapcache.
> +                */
> +               remap_swapcache = 0;
> +               goto skip_unmap;
> +       }
> +
>        /*
>         * Corner case handling:
>         * 1. When a new swap-cache page is read into, it is added to the LRU
> @@ -852,6 +866,20 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>                        goto out;
>
>        rc = __unmap_and_move(page, newpage, force, offlining, mode);
> +
> +       if (PageBalloon(newpage)) {
> +               /*
> +                * A ballooned page has been migrated already. Now, it is the
> +                * time to wrap-up counters, handle the old page back to Buddy
> +                * and return.
> +                */
> +               list_del(&page->lru);
> +               dec_zone_page_state(page, NR_ISOLATED_ANON +
> +                                   page_is_file_cache(page));
> +               put_page(page);
> +               __free_page(page);
> +               return rc;
> +       }
>  out:
>        if (rc != -EAGAIN) {
>                /*
> --
> 1.7.10.2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

^ permalink raw reply

* Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
From: Rafael Aquini @ 2012-06-25 23:57 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Rik van Riel, Michael S. Tsirkin, linux-kernel, virtualization,
	linux-mm
In-Reply-To: <CAPbh3rvN0U=xVuqb=7wHkbEAgM=dC67uG-1=m=8GAv9MNX7LWg@mail.gmail.com>

On Mon, Jun 25, 2012 at 07:31:38PM -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 25, 2012 at 7:25 PM, Rafael Aquini <aquini@redhat.com> wrote:
> > This patch introduces helper functions that teach compaction and migration bits
> > how to cope with pages which are part of a guest memory balloon, in order to
> > make them movable by memory compaction procedures.
> >
> 
> Should the names that are exported be prefixed with kvm_?
>
I'd rather keep them as generic as possible, especially because I believe other
balloon drivers can leverage the same technique to make their page lists movable
as well in the near future. However, I would agree with your suggestion if we
end up finding that no one else can reuse this piece of code.
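
For readers following the thread, here is a rough userspace sketch of why the
generic names matter: the mm side only looks at a special mapping and a small
table of callbacks, so any balloon driver could supply its own. Everything
named *_model or demo_* below is hypothetical and not part of the patch, the
refcount is a plain int rather than the kernel's atomic page count, and the
NULL check in model_is_balloon_page() is an addition of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model;

/* Stand-in for the address_space_operations hooks the patch relies on. */
struct balloon_ops {
	void (*isolate)(struct page_model *page);	/* plays .invalidatepage */
	void (*putback)(struct page_model *page);	/* plays .freepage */
};

struct mapping_model {
	const struct balloon_ops *ops;
};

struct page_model {
	struct mapping_model *mapping;
	int refcount;		/* a ballooned page normally holds one reference */
	bool locked;
	bool on_balloon_list;
};

/* One global special mapping, mirroring balloon_mapping in the patch. */
struct mapping_model *balloon_mapping_model;

bool model_is_balloon_page(const struct page_model *page)
{
	/* The NULL check is an extra guard added by this sketch. */
	return page->mapping != NULL && page->mapping == balloon_mapping_model;
}

/* Mirrors the shape of isolate_balloon_page(); single-threaded model only. */
bool model_isolate(struct page_model *page)
{
	const struct balloon_ops *ops = page->mapping->ops;

	if (!ops->isolate || page->locked || page->refcount == 0)
		return false;

	page->refcount++;		/* models get_page_unless_zero() */
	if (page->refcount == 2) {	/* nobody else has isolated it */
		ops->isolate(page);
		return true;
	}
	page->refcount--;		/* already isolated: drop our reference */
	return false;
}

/* Mirrors the shape of putback_balloon_page(). */
bool model_putback(struct page_model *page)
{
	const struct balloon_ops *ops = page->mapping->ops;

	if (!ops->putback)
		return false;
	ops->putback(page);
	page->refcount--;
	return true;
}

/* A hypothetical driver only has to provide these two callbacks. */
void demo_isolate(struct page_model *page) { page->on_balloon_list = false; }
void demo_putback(struct page_model *page) { page->on_balloon_list = true; }

const struct balloon_ops demo_ops = { demo_isolate, demo_putback };
```

A second driver would plug in by pointing balloon_mapping_model at its own
mapping with its own callbacks; nothing in the model_* helpers is specific to
virtio or KVM.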

 
> > Signed-off-by: Rafael Aquini <aquini@redhat.com>
> > ---
> >  include/linux/mm.h |   17 +++++++++++++
> >  mm/compaction.c    |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  mm/migrate.c       |   30 +++++++++++++++++++++-
> >  3 files changed, 118 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index b36d08c..360656e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1629,5 +1629,22 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
> >  static inline bool page_is_guard(struct page *page) { return false; }
> >  #endif /* CONFIG_DEBUG_PAGEALLOC */
> >
> > +#if (defined(CONFIG_VIRTIO_BALLOON) || \
> > +       defined(CONFIG_VIRTIO_BALLOON_MODULE)) && defined(CONFIG_COMPACTION)
> > +extern int is_balloon_page(struct page *);
> > +extern int isolate_balloon_page(struct page *);
> > +extern int putback_balloon_page(struct page *);
> > +
> > +/* return 1 if page is part of a guest's memory balloon, 0 otherwise */
> > +static inline int PageBalloon(struct page *page)
> > +{
> > +       return is_balloon_page(page);
> > +}
> > +#else
> > +static inline int PageBalloon(struct page *page)               { return 0; }
> > +static inline int isolate_balloon_page(struct page *page)      { return 0; }
> > +static inline int putback_balloon_page(struct page *page)      { return 0; }
> > +#endif /* (VIRTIO_BALLOON || VIRTIO_BALLOON_MODULE) && COMPACTION */
> > +
> >  #endif /* __KERNEL__ */
> >  #endif /* _LINUX_MM_H */
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 7ea259d..8835d55 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/backing-dev.h>
> >  #include <linux/sysctl.h>
> >  #include <linux/sysfs.h>
> > +#include <linux/export.h>
> >  #include "internal.h"
> >
> >  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> > @@ -312,6 +313,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> >                        continue;
> >                }
> >
> > +               /*
> > +                * For ballooned pages, we need to isolate them before testing
> > +                * for PageLRU, as well as skip the LRU page isolation steps.
> > +                */
> > +               if (PageBalloon(page))
> > +                       if (isolate_balloon_page(page))
> > +                               goto isolated_balloon_page;
> > +
> >                if (!PageLRU(page))
> >                        continue;
> >
> > @@ -338,6 +347,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> >
> >                /* Successfully isolated */
> >                del_page_from_lru_list(page, lruvec, page_lru(page));
> > +isolated_balloon_page:
> >                list_add(&page->lru, migratelist);
> >                cc->nr_migratepages++;
> >                nr_isolated++;
> > @@ -903,4 +913,66 @@ void compaction_unregister_node(struct node *node)
> >  }
> >  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
> >
> > +#if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)
> > +/*
> > + * Balloon pages special page->mapping.
> > + * users must properly allocate and initialize an instance of balloon_mapping,
> > + * and set it as the page->mapping for balloon enlisted page instances.
> > + *
> > + * address_space_operations necessary methods for ballooned pages:
> > + *   .migratepage    - used to perform balloon's page migration (as is)
> > + *   .invalidatepage - used to isolate a page from balloon's page list
> > + *   .freepage       - used to reinsert an isolated page to balloon's page list
> > + */
> > +struct address_space *balloon_mapping;
> > +EXPORT_SYMBOL(balloon_mapping);
> > +
> > +/* ballooned page id check */
> > +int is_balloon_page(struct page *page)
> > +{
> > +       struct address_space *mapping = page->mapping;
> > +       if (mapping == balloon_mapping)
> > +               return 1;
> > +       return 0;
> > +}
> > +
> > +/* __isolate_lru_page() counterpart for a ballooned page */
> > +int isolate_balloon_page(struct page *page)
> > +{
> > +       struct address_space *mapping = page->mapping;
> > +       if (mapping->a_ops->invalidatepage) {
> > +               /*
> > +                * We can race against move_to_new_page() and stumble across a
> > +                * locked 'newpage'. If we succeed on isolating it, the result
> > +                * tends to be disastrous. So, we sanely skip PageLocked here.
> > +                */
> > +               if (likely(!PageLocked(page) && get_page_unless_zero(page))) {
> > +                       /*
> > +                        * A ballooned page, by default, has just one refcount.
> > +                        * Prevent concurrent compaction threads from isolating
> > +                        * an already isolated balloon page.
> > +                        */
> > +                       if (page_count(page) == 2) {
> > +                               mapping->a_ops->invalidatepage(page, 0);
> > +                               return 1;
> > +                       }
> > +                       /* Drop refcount taken for this already isolated page */
> > +                       put_page(page);
> > +               }
> > +       }
> > +       return 0;
> > +}
> > +
> > +/* putback_lru_page() counterpart for a ballooned page */
> > +int putback_balloon_page(struct page *page)
> > +{
> > +       struct address_space *mapping = page->mapping;
> > +       if (mapping->a_ops->freepage) {
> > +               mapping->a_ops->freepage(page);
> > +               put_page(page);
> > +               return 1;
> > +       }
> > +       return 0;
> > +}
> > +#endif /* CONFIG_VIRTIO_BALLOON || CONFIG_VIRTIO_BALLOON_MODULE */
> >  #endif /* CONFIG_COMPACTION */
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index be26d5c..ffc02a4 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -78,7 +78,10 @@ void putback_lru_pages(struct list_head *l)
> >                list_del(&page->lru);
> >                dec_zone_page_state(page, NR_ISOLATED_ANON +
> >                                page_is_file_cache(page));
> > -               putback_lru_page(page);
> > +               if (unlikely(PageBalloon(page)))
> > +                       VM_BUG_ON(!putback_balloon_page(page));
> > +               else
> > +                       putback_lru_page(page);
> >        }
> >  }
> >
> > @@ -783,6 +786,17 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >                }
> >        }
> >
> > +       if (PageBalloon(page)) {
> > +               /*
> > +                * A ballooned page does not need any special attention from
> > +                * physical to virtual reverse mapping procedures.
> > +                * To avoid burning cycles at rmap level,
> > +                * skip attempts to unmap PTEs or remap swapcache.
> > +                */
> > +               remap_swapcache = 0;
> > +               goto skip_unmap;
> > +       }
> > +
> >        /*
> >         * Corner case handling:
> >         * 1. When a new swap-cache page is read into, it is added to the LRU
> > @@ -852,6 +866,20 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >                        goto out;
> >
> >        rc = __unmap_and_move(page, newpage, force, offlining, mode);
> > +
> > +       if (PageBalloon(newpage)) {
> > +               /*
> > +                * A ballooned page has been migrated already. Now, it is the
> > +                * time to wrap-up counters, handle the old page back to Buddy
> > +                * and return.
> > +                */
> > +               list_del(&page->lru);
> > +               dec_zone_page_state(page, NR_ISOLATED_ANON +
> > +                                   page_is_file_cache(page));
> > +               put_page(page);
> > +               __free_page(page);
> > +               return rc;
> > +       }
> >  out:
> >        if (rc != -EAGAIN) {
> >                /*
> > --
> > 1.7.10.2
> >

^ permalink raw reply

* RE: [PATCH 00/13] drivers: hv: kvp
From: KY Srinivasan @ 2012-06-26  2:29 UTC (permalink / raw)
  To: Greg KH
  Cc: apw@canonical.com, devel@linuxdriverproject.org,
	virtualization@lists.osdl.org, ohering@suse.com,
	linux-kernel@vger.kernel.org
In-Reply-To: <20120622132547.GA2639@kroah.com>



> -----Original Message-----
> From: Greg KH [mailto:gregkh@linuxfoundation.org]
> Sent: Friday, June 22, 2012 9:26 AM
> To: KY Srinivasan
> Cc: apw@canonical.com; devel@linuxdriverproject.org; ohering@suse.com;
> linux-kernel@vger.kernel.org; virtualization@lists.osdl.org
> Subject: Re: [PATCH 00/13] drivers: hv: kvp
> 
> On Fri, Jun 22, 2012 at 01:06:53PM +0000, KY Srinivasan wrote:
> > Are you still missing it; do you want me to resend the whole set?
> 
> Nope, it showed up a few hours later, thanks.  You really should get
> that fixed...

Greg, some additional testing with VM replication on Windows Server 2012 has revealed some issues with this patch set. Please drop the patch set; I will fix the problem and resend it.

Thank you,

K. Y

^ permalink raw reply

* Re: [net-next RFC V4 PATCH 3/4] virtio: introduce a method to get the irq of a specific virtqueue
From: Jason Wang @ 2012-06-26  5:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, habanero, kvm, qemu-devel, netdev, mashirle,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <20120625101439.GC19169@redhat.com>

On 06/25/2012 06:14 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 25, 2012 at 05:41:17PM +0800, Jason Wang wrote:
>> Device specific irq optimizations such as irq affinity may be used by virtio
>> drivers. So this patch introduces a new method to get the irq of a specific
>> virtqueue.
>>
>> After this patch, virtio device drivers could query the irq and do device
>> specific optimizations. First user would be virtio-net.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/lguest/lguest_device.c |    8 ++++++++
>>   drivers/s390/kvm/kvm_virtio.c  |    6 ++++++
>>   drivers/virtio/virtio_mmio.c   |    8 ++++++++
>>   drivers/virtio/virtio_pci.c    |   12 ++++++++++++
>>   include/linux/virtio_config.h  |    4 ++++
>>   5 files changed, 38 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
>> index 9e8388e..bcd080f 100644
>> --- a/drivers/lguest/lguest_device.c
>> +++ b/drivers/lguest/lguest_device.c
>> @@ -392,6 +392,13 @@ static const char *lg_bus_name(struct virtio_device *vdev)
>>   	return "";
>>   }
>>
>> +static int lg_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
>> +{
>> +	struct lguest_vq_info *lvq = vq->priv;
>> +
>> +	return lvq->config.irq;
>> +}
>> +
>>   /* The ops structure which hooks everything together. */
>>   static struct virtio_config_ops lguest_config_ops = {
>>   	.get_features = lg_get_features,
>> @@ -404,6 +411,7 @@ static struct virtio_config_ops lguest_config_ops = {
>>   	.find_vqs = lg_find_vqs,
>>   	.del_vqs = lg_del_vqs,
>>   	.bus_name = lg_bus_name,
>> +	.get_vq_irq = lg_get_vq_irq,
>>   };
>>
>>   /*
>> diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
>> index d74e9ae..a897de2 100644
>> --- a/drivers/s390/kvm/kvm_virtio.c
>> +++ b/drivers/s390/kvm/kvm_virtio.c
>> @@ -268,6 +268,11 @@ static const char *kvm_bus_name(struct virtio_device *vdev)
>>   	return "";
>>   }
>>
>> +static int kvm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
>> +{
>> +	return 0x2603;
>> +}
>> +
>>   /*
>>    * The config ops structure as defined by virtio config
>>    */
>> @@ -282,6 +287,7 @@ static struct virtio_config_ops kvm_vq_configspace_ops = {
>>   	.find_vqs = kvm_find_vqs,
>>   	.del_vqs = kvm_del_vqs,
>>   	.bus_name = kvm_bus_name,
>> +	.get_vq_irq = kvm_get_vq_irq,
>>   };
>>
>>   /*
>> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
>> index f5432b6..2ba37ed 100644
>> --- a/drivers/virtio/virtio_mmio.c
>> +++ b/drivers/virtio/virtio_mmio.c
>> @@ -411,6 +411,13 @@ static const char *vm_bus_name(struct virtio_device *vdev)
>>   	return vm_dev->pdev->name;
>>   }
>>
>> +static int vm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
>> +{
>> +	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
>> +
>> +	return platform_get_irq(vm_dev->pdev, 0);
>> +}
>> +
>>   static struct virtio_config_ops virtio_mmio_config_ops = {
>>   	.get		= vm_get,
>>   	.set		= vm_set,
>> @@ -422,6 +429,7 @@ static struct virtio_config_ops virtio_mmio_config_ops = {
>>   	.get_features	= vm_get_features,
>>   	.finalize_features = vm_finalize_features,
>>   	.bus_name	= vm_bus_name,
>> +	.get_vq_irq     = vm_get_vq_irq,
>>   };
>>
>>
>> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
>> index adb24f2..c062227 100644
>> --- a/drivers/virtio/virtio_pci.c
>> +++ b/drivers/virtio/virtio_pci.c
>> @@ -607,6 +607,17 @@ static const char *vp_bus_name(struct virtio_device *vdev)
>>   	return pci_name(vp_dev->pci_dev);
>>   }
>>
>> +static int vp_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
>> +{
>> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>> +	struct virtio_pci_vq_info *info = vq->priv;
>> +
>> +	if (vp_dev->intx_enabled)
>> +		return vp_dev->pci_dev->irq;
>> +	else
>> +		return vp_dev->msix_entries[info->msix_vector].vector;
>> +}
>> +
>>   static struct virtio_config_ops virtio_pci_config_ops = {
>>   	.get		= vp_get,
>>   	.set		= vp_set,
>> @@ -618,6 +629,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
>>   	.get_features	= vp_get_features,
>>   	.finalize_features = vp_finalize_features,
>>   	.bus_name	= vp_bus_name,
>> +	.get_vq_irq     = vp_get_vq_irq,
>>   };
>>
>>   static void virtio_pci_release_dev(struct device *_d)
>> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
>> index fc457f4..acd6930 100644
>> --- a/include/linux/virtio_config.h
>> +++ b/include/linux/virtio_config.h
>> @@ -98,6 +98,9 @@
>>    *	vdev: the virtio_device
>>    *      This returns a pointer to the bus name a la pci_name from which
>>    *      the caller can then copy.
>> + * @get_vq_irq: get the irq number of the specific virt queue.
>> + *      vdev: the virtio_device
>> + *      vq: the virtqueue
> What if the vq does not have an IRQ? E.g. control vqs don't.
> What if the IRQ is shared between VQs? Between devices?
> The need to cleanup affinity on destroy is also nasty.
> How about we expose a set_affinity API instead?

Or we could expose the irq information, such as sharing and level/edge. But I
think exposing the set_affinity API is enough for multiqueue virtio-net.
> Then:
> 	- non PCI can ignore for now
> 	- with a per vq vector we can force it
> 	- with a shared MSI we make it an OR over all affinities
> 	- with a level interrupt we can ignore it
> 	- on cleanup we can do it in core

Looks good, thanks.
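
For readers following the thread, the policy in the quoted list could be
modelled in userspace C roughly as below. This is only a sketch under stated
assumptions: the irq_mode values, the *_model structures and the model_*
functions are hypothetical, CPU masks are plain bitmasks rather than kernel
cpumasks, and no such virtio API exists in the patch under review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interrupt modes; the names are not taken from virtio code. */
enum irq_mode {
	IRQ_NONE,		/* non-PCI transport: ignored for now */
	IRQ_PER_VQ_MSIX,	/* every virtqueue has its own vector */
	IRQ_SHARED_MSIX,	/* several virtqueues share one vector */
	IRQ_LEVEL_INTX,		/* legacy level interrupt: ignored */
};

struct vq_model {
	unsigned long wanted;	/* CPU bitmask requested for this vq */
};

struct dev_model {
	enum irq_mode mode;
	struct vq_model *vqs;
	size_t nvqs;
};

/*
 * Record the mask requested for vq 'idx' and work out what would be
 * programmed into the vector serving it. Returns true and fills *applied
 * when an affinity should be set, false when the request is ignored.
 */
bool model_set_affinity(struct dev_model *dev, size_t idx,
			unsigned long mask, unsigned long *applied)
{
	size_t i;

	assert(idx < dev->nvqs);
	dev->vqs[idx].wanted = mask;

	switch (dev->mode) {
	case IRQ_PER_VQ_MSIX:
		/* A private vector can simply be forced to the mask. */
		*applied = mask;
		return true;
	case IRQ_SHARED_MSIX:
		/* A shared vector gets the OR of all requested masks. */
		*applied = 0;
		for (i = 0; i < dev->nvqs; i++)
			*applied |= dev->vqs[i].wanted;
		return true;
	case IRQ_NONE:
	case IRQ_LEVEL_INTX:
	default:
		return false;
	}
}

/* Cleanup handled in one place: forget every request on teardown. */
void model_clear_affinity(struct dev_model *dev)
{
	size_t i;

	for (i = 0; i < dev->nvqs; i++)
		dev->vqs[i].wanted = 0;
}
```

With this shape the driver never sees an irq number, so questions such as
vqs without an irq or irqs shared between devices stay inside the transport.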
>
>>    */
>>   typedef void vq_callback_t(struct virtqueue *);
>>   struct virtio_config_ops {
>> @@ -116,6 +119,7 @@ struct virtio_config_ops {
>>   	u32 (*get_features)(struct virtio_device *vdev);
>>   	void (*finalize_features)(struct virtio_device *vdev);
>>   	const char *(*bus_name)(struct virtio_device *vdev);
>> +	int (*get_vq_irq)(struct virtio_device *vdev, struct virtqueue *vq);
>>   };
>>
>>   /* If driver didn't advertise the feature, it will never appear. */
>> -- 
>> 1.7.1

^ permalink raw reply

* Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: Jason Wang @ 2012-06-26  6:02 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: krkumar2, habanero, kvm, mst, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <4FE8A4B6.1080200@us.ibm.com>

On 06/26/2012 01:49 AM, Sridhar Samudrala wrote:
> On 6/25/2012 2:16 AM, Jason Wang wrote:
>> Hello All:
>>
>> This series is an updated version of the multiqueue virtio-net driver,
>> based on Krishna Kumar's work, that lets virtio-net use multiple rx/tx
>> queues for packet reception and transmission. Please review and comment.
>> Test Environment:
>> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
>> - Two directly connected 82599
>>
>> Test Summary:
>>
>> - Highlights: huge improvements on TCP_RR test
>> - Lowlights: regression on small packet transmission, higher cpu utilization
>>               than single queue, need further optimization
> Does this also scale with increased number of VMs?
>

Hi Sridhar:

Good suggestion, I haven't measured that. I will run the tests and post the
results.

Thanks

> Thanks
> Sridhar
>>
>> Analysis of the performance result:
>>
>> - I counted the number of packets sent and received during the test, and
>>    multiqueue handles many more packets per second.
>>
>> - For the tx regression, multiqueue sends about 1-2 times more packets than
>>    single queue, and the packets are much smaller than with single queue.
>>    I suspect tcp does less batching in multiqueue, so I hacked
>>    tcp_write_xmit() to force more batching; with that, multiqueue works as
>>    well as single queue for both small packet transmission and throughput.
>>
>> - I didn't include accelerated RFS for virtio-net in this series as it still
>>    needs further shaping; anyone interested in it, please see:
>>    http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html
>>
>>

^ permalink raw reply

* Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: Jason Wang @ 2012-06-26  6:03 UTC (permalink / raw)
  To: Shirley Ma
  Cc: krkumar2, habanero, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <1340647278.23823.0.camel@oc3660625478.ibm.com>

On 06/26/2012 02:01 AM, Shirley Ma wrote:
> Hello Jason,
>
> Good work. Do you have local guest to guest results?
>
> Thanks
> Shirley
Hi Shirley:

I will run tests to measure the performance and post the results here.

Thanks

^ permalink raw reply

* [rfc] virtio-spec: introduce VIRTIO_NET_F_MULTIQUEUE
From: Jason Wang @ 2012-06-26  7:15 UTC (permalink / raw)
  To: jasowang, mst, rusty; +Cc: netdev, virtualization

This patch introduces multiqueue capability to virtio-net devices. The
number of tx/rx queue pairs available in the device is exposed through config
space, and the driver can negotiate the number of pairs it wishes to use
through the ctrl vq.
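
The virtqueue numbering that the spec text in this patch proposes (receiveq1
at 0, transmitq1 at 1, controlq at 2 when VIRTIO_NET_F_CTRL_VQ is set, then
the remaining pairs) can be written as a small mapping. The following is a
sketch under that reading of the layout; rx_vq_index() and tx_vq_index() are
hypothetical helper names, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: map a 1-based queue pair number to the index of
 * its receive virtqueue. Pairs after the first are shifted by one when
 * the control virtqueue occupies index 2. */
static int rx_vq_index(int pair, bool has_ctrl_vq)
{
	assert(pair >= 1);
	return 2 * (pair - 1) + ((pair > 1 && has_ctrl_vq) ? 1 : 0);
}

/* The transmit virtqueue of a pair directly follows its receive one. */
static int tx_vq_index(int pair, bool has_ctrl_vq)
{
	return rx_vq_index(pair, has_ctrl_vq) + 1;
}
```

For N = 4 with the control vq present, pair 4 maps to indices 7 and 8, i.e.
2N-1 and 2N; without the control vq it maps to 6 and 7, i.e. 2N-2 and 2N-1.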

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 virtio-0.9.5.lyx |  180 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 176 insertions(+), 4 deletions(-)

diff --git a/virtio-0.9.5.lyx b/virtio-0.9.5.lyx
index 3c80ecf..480e9c7 100644
--- a/virtio-0.9.5.lyx
+++ b/virtio-0.9.5.lyx
@@ -56,6 +56,7 @@
 \html_math_output 0
 \html_css_as_file 0
 \html_be_strict false
+\author 2090695081 "Jason Wang" 
 \end_header
 
 \begin_body
@@ -3854,11 +3855,22 @@ ID 1
 \end_layout
 
 \begin_layout Description
-Virtqueues 0:receiveq.
+Virtqueues 
+\change_inserted 2090695081 1340693104
+
+\end_layout
+
+\begin_deeper
+\begin_layout Description
+
+\change_inserted 2090695081 1340693118
+When VIRTIO_NET_F_MULTIQUEUE is not set: 
+\change_unchanged
+0:receiveq.
  1:transmitq.
  2:controlq
 \begin_inset Foot
-status open
+status collapsed
 
 \begin_layout Plain Layout
 Only if VIRTIO_NET_F_CTRL_VQ set
@@ -3867,9 +3879,60 @@ Only if VIRTIO_NET_F_CTRL_VQ set
 \end_inset
 
 
+\change_inserted 2090695081 1340693122
+
 \end_layout
 
 \begin_layout Description
+
+\change_inserted 2090695081 1340693866
+When VIRTIO_NET_F_MULTIQUEUE is set and there's N tx/rx queue pairs: 0:receiveq1.
+ 1:transmitq1.
+ 2:controlq
+\begin_inset Foot
+status collapsed
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340693141
+Only if VIRTIO_NET_F_CTRL_VQ set
+\end_layout
+
+\end_inset
+
+ ...
+ 2N-1
+\begin_inset Foot
+status collapsed
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340693284
+2N-2 If VIRTIO_NET_F_CTRL_VQ not set
+\end_layout
+
+\end_inset
+
+:receiveqN.
+ 2N
+\begin_inset Foot
+status collapsed
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340693302
+2N-1 If VIRTIO_NET_F_CTRL_VQ is not set
+\end_layout
+
+\end_inset
+
+: transmitqN
+\change_unchanged
+
+\end_layout
+
+\end_deeper
+\begin_layout Description
 Feature
 \begin_inset space ~
 \end_inset
@@ -4027,6 +4090,16 @@ VIRTIO_NET_F_CTRL_VLAN
 
 \begin_layout Description
 VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gratuitous packets.
+\change_inserted 2090695081 1340692965
+
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 2090695081 1340693017
+VIRTIO_NET_F_MULTIQUEUE (22) Device has multiple tx/rx queues.
+\change_unchanged
+
 \end_layout
 
 \end_deeper
@@ -4039,11 +4112,22 @@ configuration
 \begin_inset space ~
 \end_inset
 
-layout Two configuration fields are currently defined.
+layout T
+\change_inserted 2090695081 1340693345
+hree
+\change_deleted 2090695081 1340693344
+wo
+\change_unchanged
+ configuration fields are currently defined.
  The mac address field always exists (though is only valid if VIRTIO_NET_F_MAC
  is set), and the status field only exists if VIRTIO_NET_F_STATUS is set.
  Two read-only bits are currently defined for the status field: VIRTIO_NET_S_LIN
 K_UP and VIRTIO_NET_S_ANNOUNCE.
+
+\change_inserted 2090695081 1340693398
+ The num queue pairs fields only exist if VIRTIO_NET_F_MULTIQUEUE is set.
+
+\change_unchanged
  
 \begin_inset listings
 inline false
@@ -4076,6 +4160,17 @@ struct virtio_net_config {
 \begin_layout Plain Layout
 
     u16 status;
+\change_inserted 2090695081 1340692955
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340692962
+
+	u16 num_queue_pairs;
+\change_unchanged
+
 \end_layout
 
 \begin_layout Plain Layout
@@ -4527,7 +4622,7 @@ O features are used, the Guest will need to accept packets of up to 65550
  So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every buffer in the receive
  queue needs to be at least this length 
 \begin_inset Foot
-status open
+status collapsed
 
 \begin_layout Plain Layout
 Obviously each one can be split across multiple descriptor elements.
@@ -4980,6 +5075,83 @@ Sending VIRTIO_NET_CTRL_ANNOUNCE_ACK command through control vq.
 \begin_layout Enumerate
 .
  
+\change_inserted 2090695081 1340693446
+
+\end_layout
+
+\begin_layout Subsection*
+
+\change_inserted 2090695081 1340693500
+Negotiating the number of queue pairs
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 2090695081 1340693733
+If the driver negotiates the VIRTIO_NET_F_MULTIQUEUE (depends on VIRTIO_NET_F_CT
+RL_VQ), it can then negotiate the number of queue pairs it wish to use by
+ placing the number in num_queue_pairs field of virtio_net_ctrl_multiqueue
+ through VIRTIO_NET_CTRL_MULTIQUEUE_NUM command.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 2090695081 1340693782
+If the driver doesn't negotiate the number, all tx/rx queues were enabled
+ by default.
+\end_layout
+
+\begin_layout Standard
+
+\change_inserted 2090695081 1340693616
+\begin_inset listings
+inline false
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340693620
+
+struct virtio_net_ctrl_multiqueue {
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340693627
+
+	u16 num_queue_pairs;
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340693616
+
+};
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340693616
+
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340693639
+
+#define VIRTIO_NET_CTRL_MULTIQUEUE    4
+\end_layout
+
+\begin_layout Plain Layout
+
+\change_inserted 2090695081 1340693646
+
+ #define VIRTIO_NET_CTRL_MULTIQUEUE_NUM        0 
+\end_layout
+
+\end_inset
+
+
 \end_layout
 
 \begin_layout Chapter*
-- 
1.7.9.5
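
For illustration, a driver-side request using the structure and constants
added by the patch above could be laid out as follows. This is only a sketch:
build_mq_request() is a hypothetical helper, and the framing (a class byte and
a command byte followed by the command data, with the device's ack byte kept
separate) is assumed to match the existing control virtqueue commands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_NET_CTRL_MULTIQUEUE      4
#define VIRTIO_NET_CTRL_MULTIQUEUE_NUM  0

struct virtio_net_ctrl_multiqueue {
	uint16_t num_queue_pairs;
};

/* Hypothetical helper: write the device-readable part of the control
 * message (class, command, command data) into buf. The ack byte that
 * the device writes back is not included. Returns the number of bytes
 * used, or 0 if buf is too small. */
static size_t build_mq_request(uint8_t *buf, size_t len, uint16_t pairs)
{
	struct virtio_net_ctrl_multiqueue mq = { .num_queue_pairs = pairs };
	size_t need = 2 + sizeof(mq);

	assert(buf != NULL);
	if (len < need)
		return 0;
	buf[0] = VIRTIO_NET_CTRL_MULTIQUEUE;      /* class */
	buf[1] = VIRTIO_NET_CTRL_MULTIQUEUE_NUM;  /* command */
	memcpy(buf + 2, &mq, sizeof(mq));         /* native byte order */
	return need;
}
```

A driver that never sends this command keeps all tx/rx queues enabled, as the
spec text above states.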

^ permalink raw reply related

* Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
From: Mel Gorman @ 2012-06-26 10:17 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: Rik van Riel, Michael S. Tsirkin, linux-kernel, virtualization,
	linux-mm, Andi Kleen
In-Reply-To: <7f83427b3894af7969c67acc0f27ab5aa68b4279.1340665087.git.aquini@redhat.com>


(apologies if there are excessive typos. I damaged my left hand and
typing is painful).

Adding Andi to cc for question on VM_BUG_ON.

On Mon, Jun 25, 2012 at 08:25:56PM -0300, Rafael Aquini wrote:
> This patch introduces helper functions that teach compaction and migration bits
> how to cope with pages which are part of a guest memory balloon, in order to
> make them movable by memory compaction procedures.
> 
> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
>  include/linux/mm.h |   17 +++++++++++++
>  mm/compaction.c    |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/migrate.c       |   30 +++++++++++++++++++++-
>  3 files changed, 118 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b36d08c..360656e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1629,5 +1629,22 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
>  static inline bool page_is_guard(struct page *page) { return false; }
>  #endif /* CONFIG_DEBUG_PAGEALLOC */
>  
> +#if (defined(CONFIG_VIRTIO_BALLOON) || \
> +	defined(CONFIG_VIRTIO_BALLOON_MODULE)) && defined(CONFIG_COMPACTION)
> +extern int is_balloon_page(struct page *);
> +extern int isolate_balloon_page(struct page *);
> +extern int putback_balloon_page(struct page *);
> +
> +/* return 1 if page is part of a guest's memory balloon, 0 otherwise */
> +static inline int PageBalloon(struct page *page)
> +{
> +	return is_balloon_page(page);
> +}

bool

Why is there both is_balloon_page and PageBalloon? 

is_balloon_page is so simple it should just be a static inline here

extern struct address_space *balloon_mapping;
static inline bool is_balloon_page(struct page *page)
{
	return page->mapping == balloon_mapping;
}
	


> +#else
> +static inline int PageBalloon(struct page *page)		{ return 0; }
> +static inline int isolate_balloon_page(struct page *page)	{ return 0; }
> +static inline int putback_balloon_page(struct page *page)	{ return 0; }
> +#endif /* (VIRTIO_BALLOON || VIRTIO_BALLOON_MODULE) && COMPACTION */
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 7ea259d..8835d55 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -14,6 +14,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/sysctl.h>
>  #include <linux/sysfs.h>
> +#include <linux/export.h>
>  #include "internal.h"
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> @@ -312,6 +313,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  			continue;
>  		}
>  
> +		/*
> +		 * For ballooned pages, we need to isolate them before testing
> +		 * for PageLRU, as well as skip the LRU page isolation steps.
> +		 */

This says what, but not why.

I didn't check the exact mechanics of a balloon page but I expect it's that
balloon pages are not on the LRU. If they are on the LRU, that's pretty dumb.


/*
 * Balloon pages can be migrated but are not on the LRU. Isolate
 * them before LRU checks.
 */


It would be nicer to do this without gotos

/*
 * It is possible to migrate LRU pages and balloon pages. Skip
 * any other type of page
 */
if (is_balloon_page(page)) {
	if (!isolate_balloon_page(page))
		continue;
} else if (PageLRU(page)) {
	....
}

You will need to shuffle things around a little to make it work properly
but if we handle other page types in the future it will be neater
overall.
	

> +		if (PageBalloon(page))
> +			if (isolate_balloon_page(page))
> +				goto isolated_balloon_page;
> +
>  		if (!PageLRU(page))
>  			continue;
>  
> @@ -338,6 +347,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  
>  		/* Successfully isolated */
>  		del_page_from_lru_list(page, lruvec, page_lru(page));
> +isolated_balloon_page:
>  		list_add(&page->lru, migratelist);
>  		cc->nr_migratepages++;
>  		nr_isolated++;
> @@ -903,4 +913,66 @@ void compaction_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>  
> +#if defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE)
> +/*
> + * Balloon pages special page->mapping.
> + * users must properly allocate and initiliaze an instance of balloon_mapping,
> + * and set it as the page->mapping for balloon enlisted page instances.
> + *
> + * address_space_operations necessary methods for ballooned pages:
> + *   .migratepage    - used to perform balloon's page migration (as is)
> + *   .invalidatepage - used to isolate a page from balloon's page list
> + *   .freepage       - used to reinsert an isolated page to balloon's page list
> + */
> +struct address_space *balloon_mapping;
> +EXPORT_SYMBOL(balloon_mapping);
> +

EXPORT_SYMBOL_GPL?

I don't mind how it is exported as such. I'm idly curious if there are
external closed modules that use the driver.

> +/* ballooned page id check */
> +int is_balloon_page(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +	if (mapping == balloon_mapping)
> +		return 1;
> +	return 0;
> +}
> +
> +/* __isolate_lru_page() counterpart for a ballooned page */
> +int isolate_balloon_page(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;

This is a publicly visible function and while your current usage looks
correct it would not hurt to do something like this;

if (WARN_ON(!is_balloon_page(page)))
	return 0;

> +	if (mapping->a_ops->invalidatepage) {
> +		/*
> +		 * We can race against move_to_new_page() and stumble across a
> +		 * locked 'newpage'. If we succeed on isolating it, the result
> +		 * tends to be disastrous. So, we sanely skip PageLocked here.
> +		 */
> +		if (likely(!PageLocked(page) && get_page_unless_zero(page))) {

But the page can get locked after this point.

Would it not be better to do a trylock_page() and unlock the page on
exit after the isolation completes?
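
To make the ordering concrete, here is a userspace model of the
trylock-then-isolate idea. It is not kernel code: struct page_model and the
*_model() helpers are stand-ins invented for this sketch, built on C11
atomics rather than on trylock_page() and get_page_unless_zero().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for struct page: a lock bit and a refcount. */
struct page_model {
	atomic_flag locked;
	atomic_int count;
	bool isolated;
};

static bool trylock_model(struct page_model *p)
{
	return !atomic_flag_test_and_set(&p->locked);
}

static void unlock_model(struct page_model *p)
{
	atomic_flag_clear(&p->locked);
}

/* Take a reference unless the count is already zero. */
static bool get_unless_zero_model(struct page_model *p)
{
	int c = atomic_load(&p->count);

	while (c != 0) {
		if (atomic_compare_exchange_weak(&p->count, &c, c + 1))
			return true;
	}
	return false;
}

/* Returns 1 if the page was isolated, 0 otherwise. The lock is held
 * across the whole check-and-isolate sequence, so nobody else can lock
 * the page between the test and the isolation. */
static int isolate_model(struct page_model *p)
{
	int ret = 0;

	if (!trylock_model(p))
		return 0;
	if (get_unless_zero_model(p)) {
		if (atomic_load(&p->count) == 2) {
			p->isolated = true;  /* stands in for ->invalidatepage() */
			ret = 1;
		} else {
			atomic_fetch_sub(&p->count, 1);  /* already isolated */
		}
	}
	unlock_model(p);
	return ret;
}
```

A page that is already locked is skipped outright, and a second isolation
attempt on the same page fails on the refcount check and drops its extra
reference.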

> +			/*
> +			 * A ballooned page, by default, has just one refcount.
> +			 * Prevent concurrent compaction threads from isolating
> +			 * an already isolated balloon page.
> +			 */
> +			if (page_count(page) == 2) {
> +				mapping->a_ops->invalidatepage(page, 0);
> +				return 1;
> +			}
> +			/* Drop refcount taken for this already isolated page */
> +			put_page(page);
> +		}
> +	}
> +	return 0;
> +}

Otherwise looks reasonable. The comments really help so thanks for that.

> +
> +/* putback_lru_page() counterpart for a ballooned page */
> +int putback_balloon_page(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +	if (mapping->a_ops->freepage) {
> +		mapping->a_ops->freepage(page);
> +		put_page(page);
> +		return 1;
> +	}
> +	return 0;
> +}
> +#endif /* CONFIG_VIRTIO_BALLOON || CONFIG_VIRTIO_BALLOON_MODULE */
>  #endif /* CONFIG_COMPACTION */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index be26d5c..ffc02a4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -78,7 +78,10 @@ void putback_lru_pages(struct list_head *l)
>  		list_del(&page->lru);
>  		dec_zone_page_state(page, NR_ISOLATED_ANON +
>  				page_is_file_cache(page));
> -		putback_lru_page(page);
> +		if (unlikely(PageBalloon(page)))
> +			VM_BUG_ON(!putback_balloon_page(page));

Why not BUG_ON?

What shocked me actually is that VM_BUG_ON code is executed on
!CONFIG_DEBUG_VM builds and has been since 2.6.36 due to commit [4e60c86bd:
gcc-4.6: mm: fix unused but set warnings]. I thought the whole point of
VM_BUG_ON was to avoid expensive and usually unnecessary checks. Andi,
was this deliberate?

Either way, you always want to call putback_balloon_page() so BUG_ON is
more appropriate although gracefully recovering from the situation and a
WARN would be better.
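
The general hazard can be shown with a userspace model. CHECK_COMPILED_OUT
below is a hypothetical debug-only macro that discards its argument; it is
not the kernel's VM_BUG_ON definition (which, as noted above, currently does
evaluate its argument). putback_model() stands in for putback_balloon_page().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debug-only check that drops its argument entirely. */
#define CHECK_COMPILED_OUT(cond)  ((void)0)

static int putback_calls;

/* Stand-in for putback_balloon_page(): has a side effect and returns 1
 * on success. */
static int putback_model(void)
{
	putback_calls++;
	return 1;
}

/* Fragile: the call vanishes whenever the check is compiled out. */
static void putback_inside_check(void)
{
	CHECK_COMPILED_OUT(!putback_model());
}

/* Robust: always make the call, then warn on failure instead of
 * crashing. */
static int putback_then_warn(void)
{
	int ok = putback_model();

	if (!ok)
		fprintf(stderr, "warning: putback failed\n");
	return ok;
}
```

Keeping the call outside the assertion means the page is put back no matter
how the check macro is defined.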

> +		else
> +			putback_lru_page(page);
>  	}
>  }
>  
> @@ -783,6 +786,17 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  		}
>  	}
>  
> +	if (PageBalloon(page)) {
> +		/*
> +		 * A ballooned page does not need any special attention from
> +		 * physical to virtual reverse mapping procedures.
> +		 * To avoid burning cycles at rmap level,
> +		 * skip attempts to unmap PTEs or remap swapcache.
> +		 */
> +		remap_swapcache = 0;
> +		goto skip_unmap;
> +	}
> +
>  	/*
>  	 * Corner case handling:
>  	 * 1. When a new swap-cache page is read into, it is added to the LRU
> @@ -852,6 +866,20 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  			goto out;
>  
>  	rc = __unmap_and_move(page, newpage, force, offlining, mode);
> +
> +	if (PageBalloon(newpage)) {
> +		/*
> +		 * A ballooned page has been migrated already. Now, it is the
> +		 * time to wrap-up counters, handle the old page back to Buddy
> +		 * and return.
> +		 */
> +		list_del(&page->lru);
> +		dec_zone_page_state(page, NR_ISOLATED_ANON +
> +				    page_is_file_cache(page));
> +		put_page(page);
> +		__free_page(page);
> +		return rc;
> +	}
>  out:
>  	if (rc != -EAGAIN) {
>  		/*
> -- 
> 1.7.10.2
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
From: Rik van Riel @ 2012-06-26 13:17 UTC (permalink / raw)
  To: Rafael Aquini; +Cc: linux-mm, Michael S. Tsirkin, linux-kernel, virtualization
In-Reply-To: <7f83427b3894af7969c67acc0f27ab5aa68b4279.1340665087.git.aquini@redhat.com>

On 06/25/2012 07:25 PM, Rafael Aquini wrote:
> This patch introduces helper functions that teach compaction and migration bits
> how to cope with pages which are part of a guest memory balloon, in order to
> make them movable by memory compaction procedures.
>
> Signed-off-by: Rafael Aquini<aquini@redhat.com>

The function fill_balloon in drivers/virtio/virtio_balloon.c
should probably add __GFP_MOVABLE to the gfp mask for alloc_pages,
to keep the pageblock where balloon pages are allocated marked
as movable.

-- 
All rights reversed

^ permalink raw reply

* Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
From: Mel Gorman @ 2012-06-26 13:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, virtualization, Rafael Aquini, linux-kernel,
	Michael S. Tsirkin
In-Reply-To: <4FE9B677.8040409@redhat.com>

On Tue, Jun 26, 2012 at 09:17:43AM -0400, Rik van Riel wrote:
> On 06/25/2012 07:25 PM, Rafael Aquini wrote:
> >This patch introduces helper functions that teach compaction and migration bits
> >how to cope with pages which are part of a guest memory balloon, in order to
> >make them movable by memory compaction procedures.
> >
> >Signed-off-by: Rafael Aquini<aquini@redhat.com>
> 
> The function fill_balloon in drivers/virtio/virtio_balloon.c
> should probably add __GFP_MOVABLE to the gfp mask for alloc_pages,
> to keep the pageblock where balloon pages are allocated marked
> as movable.
> 

You're right, but patch 3 is already doing that at the same time the
migration primitives are introduced. It does mean the full series has to
be applied to do anything but I think it's bisect safe.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [PATCH 1/4] mm: introduce compaction and migration for virtio ballooned pages
From: Andi Kleen @ 2012-06-26 16:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Rafael Aquini, Michael S. Tsirkin, linux-kernel,
	virtualization, linux-mm, Andi Kleen
In-Reply-To: <20120626101729.GF8103@csn.ul.ie>

> 
> What shocked me actually is that VM_BUG_ON code is executed on
> !CONFIG_DEBUG_VM builds and has been since 2.6.36 due to commit [4e60c86bd:
> gcc-4.6: mm: fix unused but set warnings]. I thought the whole point of
> VM_BUG_ON was to avoid expensive and usually unnecessary checks. Andi,
> was this deliberate?

The idea was that the compiler optimizes it away anyways.

I'm not fully sure what putback_balloon_page does, but if it just tests
a bit (without non variable test_bit) it should be ok.

-Andi

^ permalink raw reply
