Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: ibmveth: convert to hw_features
From: David Miller @ 2011-04-20  8:32 UTC (permalink / raw)
  To: mirq-linux; +Cc: netdev, santil
In-Reply-To: <20110419121426.18A2D13909@rere.qmqm.pl>

From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Tue, 19 Apr 2011 14:14:25 +0200 (CEST)

> A minimal conversion.
> 
> ibmveth_set_csum_offload() can be folded into ibmveth_set_features()
> and adapter->rx_csum removed - left for later cleanup.
> 
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Applied.

^ permalink raw reply

* [RFC PATCH 2/2] virtio-net: add multiqueue support
From: Jason Wang @ 2011-04-20  8:33 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, qemu-devel, anthony
In-Reply-To: <20110420082706.32157.59668.stgit@dhcp-91-7.nay.redhat.com.englab.nay.redhat.com>

This patch add the multiqueue ability to virtio-net for both userapce and
vhost. With this patch the kernel side vhost could be reused without
modification to support multiqueue virtio-net nics.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 hw/vhost.c      |   26 ++-
 hw/vhost.h      |    1 
 hw/vhost_net.c  |    7 +
 hw/vhost_net.h  |    2 
 hw/virtio-net.c |  409 +++++++++++++++++++++++++++++++++++--------------------
 hw/virtio-net.h |    2 
 hw/virtio-pci.c |    1 
 hw/virtio.h     |    1 
 8 files changed, 284 insertions(+), 165 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 14b571d..2301d53 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -450,10 +450,10 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
     target_phys_addr_t s, l, a;
     int r;
     struct vhost_vring_file file = {
-        .index = idx,
+        .index = idx % dev->nvqs,
     };
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = idx % dev->nvqs,
     };
     struct VirtQueue *vvq = virtio_get_queue(vdev, idx);
 
@@ -504,12 +504,13 @@ static int vhost_virtqueue_init(struct vhost_dev *dev,
         goto fail_alloc_ring;
     }
 
-    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
+    r = vhost_virtqueue_set_addr(dev, vq, idx % dev->nvqs, dev->log_enabled);
     if (r < 0) {
         r = -errno;
         goto fail_alloc;
     }
     r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, true);
+
     if (r < 0) {
         fprintf(stderr, "Error binding host notifier: %d\n", -r);
         goto fail_host_notifier;
@@ -557,7 +558,7 @@ static void vhost_virtqueue_cleanup(struct vhost_dev *dev,
                                     unsigned idx)
 {
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = idx % dev->nvqs,
     };
     int r;
     r = vdev->binding->set_host_notifier(vdev->binding_opaque, idx, false);
@@ -648,10 +649,13 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         goto fail;
     }
 
-    r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
-    if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
-        goto fail_notifiers;
+    for (i = 0; i < hdev->nvqs; i++) {
+        r = vdev->binding->set_guest_notifier(vdev->binding_opaque,
+                                              hdev->start_idx + i, true);
+        if (r < 0) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+            goto fail_notifiers;
+        }
     }
 
     r = vhost_dev_set_features(hdev, hdev->log_enabled);
@@ -667,7 +671,7 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
         r = vhost_virtqueue_init(hdev,
                                  vdev,
                                  hdev->vqs + i,
-                                 i);
+                                 hdev->start_idx + i);
         if (r < 0) {
             goto fail_vq;
         }
@@ -694,7 +698,7 @@ fail_vq:
         vhost_virtqueue_cleanup(hdev,
                                 vdev,
                                 hdev->vqs + i,
-                                i);
+                                hdev->start_idx + i);
     }
 fail_mem:
 fail_features:
@@ -712,7 +716,7 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
         vhost_virtqueue_cleanup(hdev,
                                 vdev,
                                 hdev->vqs + i,
-                                i);
+                                hdev->start_idx + i);
     }
     vhost_client_sync_dirty_bitmap(&hdev->client, 0,
                                    (target_phys_addr_t)~0x0ull);
diff --git a/hw/vhost.h b/hw/vhost.h
index c8c595a..48b9478 100644
--- a/hw/vhost.h
+++ b/hw/vhost.h
@@ -31,6 +31,7 @@ struct vhost_dev {
     struct vhost_memory *mem;
     struct vhost_virtqueue *vqs;
     int nvqs;
+    int start_idx;
     unsigned long long features;
     unsigned long long acked_features;
     unsigned long long backend_features;
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
index 420e05f..7fc87f8 100644
--- a/hw/vhost_net.c
+++ b/hw/vhost_net.c
@@ -128,7 +128,8 @@ bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
 }
 
 int vhost_net_start(struct vhost_net *net,
-                    VirtIODevice *dev)
+                    VirtIODevice *dev,
+                    int start_idx)
 {
     struct vhost_vring_file file = { };
     int r;
@@ -139,6 +140,7 @@ int vhost_net_start(struct vhost_net *net,
 
     net->dev.nvqs = 2;
     net->dev.vqs = net->vqs;
+    net->dev.start_idx = start_idx;
     r = vhost_dev_start(&net->dev, dev);
     if (r < 0) {
         return r;
@@ -206,7 +208,8 @@ bool vhost_net_query(VHostNetState *net, VirtIODevice *dev)
 }
 
 int vhost_net_start(struct vhost_net *net,
-		    VirtIODevice *dev)
+                    VirtIODevice *dev,
+                    int start_idx)
 {
     return -ENOSYS;
 }
diff --git a/hw/vhost_net.h b/hw/vhost_net.h
index 91e40b1..79a4f09 100644
--- a/hw/vhost_net.h
+++ b/hw/vhost_net.h
@@ -9,7 +9,7 @@ typedef struct vhost_net VHostNetState;
 VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, bool force);
 
 bool vhost_net_query(VHostNetState *net, VirtIODevice *dev);
-int vhost_net_start(VHostNetState *net, VirtIODevice *dev);
+int vhost_net_start(VHostNetState *net, VirtIODevice *dev, int start_idx);
 void vhost_net_stop(VHostNetState *net, VirtIODevice *dev);
 
 void vhost_net_cleanup(VHostNetState *net);
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 6997e02..b4430b7 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -26,26 +26,35 @@
 #define MAC_TABLE_ENTRIES    64
 #define MAX_VLAN    (1 << 12)   /* Per 802.1Q definition */
 
-typedef struct VirtIONet
+struct VirtIONet;
+
+typedef struct VirtIONetQueue
 {
-    VirtIODevice vdev;
-    uint8_t mac[ETH_ALEN];
-    uint16_t status;
     VirtQueue *rx_vq;
     VirtQueue *tx_vq;
-    VirtQueue *ctrl_vq;
-    NICState *nic;
     QEMUTimer *tx_timer;
     QEMUBH *tx_bh;
     uint32_t tx_timeout;
-    int32_t tx_burst;
     int tx_waiting;
-    uint32_t has_vnet_hdr;
-    uint8_t has_ufo;
     struct {
         VirtQueueElement elem;
         ssize_t len;
     } async_tx;
+    struct VirtIONet *n;
+    uint8_t vhost_started;
+} VirtIONetQueue;
+
+typedef struct VirtIONet
+{
+    VirtIODevice vdev;
+    uint8_t mac[ETH_ALEN];
+    uint16_t status;
+    VirtIONetQueue vqs[MAX_QUEUE_NUM];
+    VirtQueue *ctrl_vq;
+    NICState *nic;
+    int32_t tx_burst;
+    uint32_t has_vnet_hdr;
+    uint8_t has_ufo;
     int mergeable_rx_bufs;
     uint8_t promisc;
     uint8_t allmulti;
@@ -53,7 +62,6 @@ typedef struct VirtIONet
     uint8_t nomulti;
     uint8_t nouni;
     uint8_t nobcast;
-    uint8_t vhost_started;
     struct {
         int in_use;
         int first_multi;
@@ -63,6 +71,7 @@ typedef struct VirtIONet
     } mac_table;
     uint32_t *vlans;
     DeviceState *qdev;
+    uint32_t queues;
 } VirtIONet;
 
 /* TODO
@@ -74,12 +83,25 @@ static VirtIONet *to_virtio_net(VirtIODevice *vdev)
     return (VirtIONet *)vdev;
 }
 
+static int vq_get_pair_index(VirtIONet *n, VirtQueue *vq)
+{
+    int i;
+    for (i = 0; i < n->queues; i++) {
+        if (n->vqs[i].tx_vq == vq || n->vqs[i].rx_vq == vq) {
+            return i;
+        }
+    }
+    assert(1);
+    return -1;
+}
+
 static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 {
     VirtIONet *n = to_virtio_net(vdev);
     struct virtio_net_config netcfg;
 
     stw_p(&netcfg.status, n->status);
+    netcfg.queues = n->queues * 2;
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -103,75 +125,104 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
         (n->status & VIRTIO_NET_S_LINK_UP) && n->vdev.vm_running;
 }
 
-static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
+static void nc_vhost_status(VLANClientState *nc, VirtIONet *n,
+                            uint8_t status)
 {
-    if (!n->nic->nc.peer) {
+    int queue_index = nc->queue_index;
+    VLANClientState *peer = nc->peer;
+    VirtIONetQueue *netq = &n->vqs[nc->queue_index];
+
+    if (!peer) {
         return;
     }
-    if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
+    if (peer->info->type != NET_CLIENT_TYPE_TAP) {
         return;
     }
 
-    if (!tap_get_vhost_net(n->nic->nc.peer)) {
+    if (!tap_get_vhost_net(peer)) {
         return;
     }
-    if (!!n->vhost_started == virtio_net_started(n, status) &&
-                              !n->nic->nc.peer->link_down) {
+    if (!!netq->vhost_started == virtio_net_started(n, status) &&
+                                 !peer->link_down) {
         return;
     }
-    if (!n->vhost_started) {
+    if (!netq->vhost_started) {
         int r;
-        if (!vhost_net_query(tap_get_vhost_net(n->nic->nc.peer), &n->vdev)) {
+        if (!vhost_net_query(tap_get_vhost_net(peer), &n->vdev)) {
             return;
         }
-        r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
+        r = vhost_net_start(tap_get_vhost_net(peer), &n->vdev, queue_index * 2);
         if (r < 0) {
             error_report("unable to start vhost net: %d: "
                          "falling back on userspace virtio", -r);
         } else {
-            n->vhost_started = 1;
+            netq->vhost_started = 1;
         }
     } else {
-        vhost_net_stop(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
-        n->vhost_started = 0;
+        vhost_net_stop(tap_get_vhost_net(peer), &n->vdev);
+        netq->vhost_started = 0;
+    }
+}
+
+static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
+{
+    int i;
+    for (i = 0; i < n->queues; i++) {
+        nc_vhost_status(n->nic->ncs[i], n, status);
     }
 }
 
 static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    int i;
 
     virtio_net_vhost_status(n, status);
 
-    if (!n->tx_waiting) {
-        return;
-    }
+    for (i = 0; i < n->queues; i++) {
+        VirtIONetQueue *netq = &n->vqs[i];
+        if (!netq->tx_waiting) {
+            continue;
+        }
 
-    if (virtio_net_started(n, status) && !n->vhost_started) {
-        if (n->tx_timer) {
-            qemu_mod_timer(n->tx_timer,
-                           qemu_get_clock_ns(vm_clock) + n->tx_timeout);
+        if (virtio_net_started(n, status) && !netq->vhost_started) {
+            if (netq->tx_timer) {
+                qemu_mod_timer(netq->tx_timer,
+                               qemu_get_clock_ns(vm_clock) + netq->tx_timeout);
+            } else {
+                qemu_bh_schedule(netq->tx_bh);
+            }
         } else {
-            qemu_bh_schedule(n->tx_bh);
+            if (netq->tx_timer) {
+                qemu_del_timer(netq->tx_timer);
+            } else {
+                qemu_bh_cancel(netq->tx_bh);
+            }
         }
-    } else {
-        if (n->tx_timer) {
-            qemu_del_timer(n->tx_timer);
-        } else {
-            qemu_bh_cancel(n->tx_bh);
+    }
+}
+
+static bool virtio_net_is_link_up(VirtIONet *n)
+{
+    int i;
+    for (i = 0; i < n->queues; i++) {
+        if (n->nic->ncs[i]->link_down) {
+            return false;
         }
     }
+    return true;
 }
 
 static void virtio_net_set_link_status(VLANClientState *nc)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    VirtIONet *n = ((NICState *)(nc->opaque))->opaque;
     uint16_t old_status = n->status;
 
-    if (nc->link_down)
+    if (virtio_net_is_link_up(n)) {
         n->status &= ~VIRTIO_NET_S_LINK_UP;
-    else
+    } else {
         n->status |= VIRTIO_NET_S_LINK_UP;
+    }
 
     if (n->status != old_status)
         virtio_notify_config(&n->vdev);
@@ -202,13 +253,15 @@ static void virtio_net_reset(VirtIODevice *vdev)
 
 static int peer_has_vnet_hdr(VirtIONet *n)
 {
-    if (!n->nic->nc.peer)
+    if (!n->nic->ncs[0]->peer) {
         return 0;
+    }
 
-    if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP)
+    if (n->nic->ncs[0]->peer->info->type != NET_CLIENT_TYPE_TAP) {
         return 0;
+    }
 
-    n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->nc.peer);
+    n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->ncs[0]->peer);
 
     return n->has_vnet_hdr;
 }
@@ -218,7 +271,7 @@ static int peer_has_ufo(VirtIONet *n)
     if (!peer_has_vnet_hdr(n))
         return 0;
 
-    n->has_ufo = tap_has_ufo(n->nic->nc.peer);
+    n->has_ufo = tap_has_ufo(n->nic->ncs[0]->peer);
 
     return n->has_ufo;
 }
@@ -228,9 +281,13 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
     VirtIONet *n = to_virtio_net(vdev);
 
     features |= (1 << VIRTIO_NET_F_MAC);
+    features |= (1 << VIRTIO_NET_F_MULTIQUEUE);
 
     if (peer_has_vnet_hdr(n)) {
-        tap_using_vnet_hdr(n->nic->nc.peer, 1);
+        int i;
+        for (i = 0; i < n->queues; i++) {
+            tap_using_vnet_hdr(n->nic->ncs[i]->peer, 1);
+        }
     } else {
         features &= ~(0x1 << VIRTIO_NET_F_CSUM);
         features &= ~(0x1 << VIRTIO_NET_F_HOST_TSO4);
@@ -248,14 +305,15 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
         features &= ~(0x1 << VIRTIO_NET_F_HOST_UFO);
     }
 
-    if (!n->nic->nc.peer ||
-        n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
+    if (!n->nic->ncs[0]->peer ||
+        n->nic->ncs[0]->peer->info->type != NET_CLIENT_TYPE_TAP) {
         return features;
     }
-    if (!tap_get_vhost_net(n->nic->nc.peer)) {
+    if (!tap_get_vhost_net(n->nic->ncs[0]->peer)) {
         return features;
     }
-    return vhost_net_get_features(tap_get_vhost_net(n->nic->nc.peer), features);
+    return vhost_net_get_features(tap_get_vhost_net(n->nic->ncs[0]->peer),
+                                  features);
 }
 
 static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
@@ -276,25 +334,29 @@ static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
 static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    int i;
 
     n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF));
 
-    if (n->has_vnet_hdr) {
-        tap_set_offload(n->nic->nc.peer,
-                        (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
-                        (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
-                        (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
-                        (features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
-                        (features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
-    }
-    if (!n->nic->nc.peer ||
-        n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP) {
-        return;
-    }
-    if (!tap_get_vhost_net(n->nic->nc.peer)) {
-        return;
+    for (i = 0; i < n->queues; i++) {
+        if (n->has_vnet_hdr) {
+            tap_set_offload(n->nic->ncs[i]->peer,
+                            (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
+                            (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
+                            (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+                            (features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
+                            (features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
+        }
+        if (!n->nic->ncs[i]->peer ||
+            n->nic->ncs[i]->peer->info->type != NET_CLIENT_TYPE_TAP) {
+            continue;
+        }
+        if (!tap_get_vhost_net(n->nic->ncs[i]->peer)) {
+            continue;
+        }
+        vhost_net_ack_features(tap_get_vhost_net(n->nic->ncs[i]->peer),
+                               features);
     }
-    vhost_net_ack_features(tap_get_vhost_net(n->nic->nc.peer), features);
 }
 
 static int virtio_net_handle_rx_mode(VirtIONet *n, uint8_t cmd,
@@ -446,7 +508,7 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    qemu_flush_queued_packets(&n->nic->nc);
+    qemu_flush_queued_packets(n->nic->ncs[vq_get_pair_index(n, vq)]);
 
     /* We now have RX buffers, signal to the IO thread to break out of the
      * select to re-poll the tap file descriptor */
@@ -455,36 +517,36 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 
 static int virtio_net_can_receive(VLANClientState *nc)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    int queue_index = nc->queue_index;
+    VirtIONet *n = ((NICState *)nc->opaque)->opaque;
+
     if (!n->vdev.vm_running) {
         return 0;
     }
 
-    if (!virtio_queue_ready(n->rx_vq) ||
+    if (!virtio_queue_ready(n->vqs[queue_index].rx_vq) ||
         !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return 0;
 
     return 1;
 }
 
-static int virtio_net_has_buffers(VirtIONet *n, int bufsize)
+static int virtio_net_has_buffers(VirtIONet *n, int bufsize, VirtQueue *vq)
 {
-    if (virtio_queue_empty(n->rx_vq) ||
-        (n->mergeable_rx_bufs &&
-         !virtqueue_avail_bytes(n->rx_vq, bufsize, 0))) {
-        virtio_queue_set_notification(n->rx_vq, 1);
+    if (virtio_queue_empty(vq) || (n->mergeable_rx_bufs &&
+        !virtqueue_avail_bytes(vq, bufsize, 0))) {
+        virtio_queue_set_notification(vq, 1);
 
         /* To avoid a race condition where the guest has made some buffers
          * available after the above check but before notification was
          * enabled, check for available buffers again.
          */
-        if (virtio_queue_empty(n->rx_vq) ||
-            (n->mergeable_rx_bufs &&
-             !virtqueue_avail_bytes(n->rx_vq, bufsize, 0)))
+        if (virtio_queue_empty(vq) || (n->mergeable_rx_bufs &&
+             !virtqueue_avail_bytes(vq, bufsize, 0)))
             return 0;
     }
 
-    virtio_queue_set_notification(n->rx_vq, 0);
+    virtio_queue_set_notification(vq, 0);
     return 1;
 }
 
@@ -595,11 +657,13 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
 
 static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_t size)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    int queue_index = nc->queue_index;
+    VirtIONet *n = ((NICState *)(nc->opaque))->opaque;
+    VirtQueue *vq = n->vqs[queue_index].rx_vq;
     struct virtio_net_hdr_mrg_rxbuf *mhdr = NULL;
     size_t guest_hdr_len, offset, i, host_hdr_len;
 
-    if (!virtio_net_can_receive(&n->nic->nc))
+    if (!virtio_net_can_receive(n->nic->ncs[queue_index]))
         return -1;
 
     /* hdr_len refers to the header we supply to the guest */
@@ -608,7 +672,7 @@ static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_
 
 
     host_hdr_len = n->has_vnet_hdr ? sizeof(struct virtio_net_hdr) : 0;
-    if (!virtio_net_has_buffers(n, size + guest_hdr_len - host_hdr_len))
+    if (!virtio_net_has_buffers(n, size + guest_hdr_len - host_hdr_len, vq))
         return 0;
 
     if (!receive_filter(n, buf, size))
@@ -623,7 +687,7 @@ static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_
 
         total = 0;
 
-        if (virtqueue_pop(n->rx_vq, &elem) == 0) {
+        if (virtqueue_pop(vq, &elem) == 0) {
             if (i == 0)
                 return -1;
             error_report("virtio-net unexpected empty queue: "
@@ -675,47 +739,50 @@ static ssize_t virtio_net_receive(VLANClientState *nc, const uint8_t *buf, size_
         }
 
         /* signal other side */
-        virtqueue_fill(n->rx_vq, &elem, total, i++);
+        virtqueue_fill(vq, &elem, total, i++);
     }
 
     if (mhdr) {
         stw_p(&mhdr->num_buffers, i);
     }
 
-    virtqueue_flush(n->rx_vq, i);
-    virtio_notify(&n->vdev, n->rx_vq);
+    virtqueue_flush(vq, i);
+    virtio_notify(&n->vdev, vq);
 
     return size;
 }
 
-static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq);
+static int32_t virtio_net_flush_tx(VirtIONet *n, VirtIONetQueue *tvq);
 
 static void virtio_net_tx_complete(VLANClientState *nc, ssize_t len)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    VirtIONet *n = ((NICState *)nc->opaque)->opaque;
+    VirtIONetQueue *netq = &n->vqs[nc->queue_index];
 
-    virtqueue_push(n->tx_vq, &n->async_tx.elem, n->async_tx.len);
-    virtio_notify(&n->vdev, n->tx_vq);
+    virtqueue_push(netq->tx_vq, &netq->async_tx.elem, netq->async_tx.len);
+    virtio_notify(&n->vdev, netq->tx_vq);
 
-    n->async_tx.elem.out_num = n->async_tx.len = 0;
+    netq->async_tx.elem.out_num = netq->async_tx.len;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(netq->tx_vq, 1);
+    virtio_net_flush_tx(n, netq);
 }
 
 /* TX */
-static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
+static int32_t virtio_net_flush_tx(VirtIONet *n, VirtIONetQueue *netq)
 {
     VirtQueueElement elem;
     int32_t num_packets = 0;
+    VirtQueue *vq = netq->tx_vq;
+
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)) {
         return num_packets;
     }
 
     assert(n->vdev.vm_running);
 
-    if (n->async_tx.elem.out_num) {
-        virtio_queue_set_notification(n->tx_vq, 0);
+    if (netq->async_tx.elem.out_num) {
+        virtio_queue_set_notification(vq, 0);
         return num_packets;
     }
 
@@ -747,12 +814,12 @@ static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
             len += hdr_len;
         }
 
-        ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
-                                      virtio_net_tx_complete);
+        ret = qemu_sendv_packet_async(n->nic->ncs[vq_get_pair_index(n, vq)],
+                                      out_sg, out_num, virtio_net_tx_complete);
         if (ret == 0) {
-            virtio_queue_set_notification(n->tx_vq, 0);
-            n->async_tx.elem = elem;
-            n->async_tx.len  = len;
+            virtio_queue_set_notification(vq, 0);
+            netq->async_tx.elem = elem;
+            netq->async_tx.len  = len;
             return -EBUSY;
         }
 
@@ -771,22 +838,23 @@ static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 static void virtio_net_handle_tx_timer(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    VirtIONetQueue *netq = &n->vqs[vq_get_pair_index(n, vq)];
 
     /* This happens when device was stopped but VCPU wasn't. */
     if (!n->vdev.vm_running) {
-        n->tx_waiting = 1;
+        netq->tx_waiting = 1;
         return;
     }
 
-    if (n->tx_waiting) {
+    if (netq->tx_waiting) {
         virtio_queue_set_notification(vq, 1);
-        qemu_del_timer(n->tx_timer);
-        n->tx_waiting = 0;
-        virtio_net_flush_tx(n, vq);
+        qemu_del_timer(netq->tx_timer);
+        netq->tx_waiting = 0;
+        virtio_net_flush_tx(n, netq);
     } else {
-        qemu_mod_timer(n->tx_timer,
-                       qemu_get_clock_ns(vm_clock) + n->tx_timeout);
-        n->tx_waiting = 1;
+        qemu_mod_timer(netq->tx_timer,
+                       qemu_get_clock_ns(vm_clock) + netq->tx_timeout);
+        netq->tx_waiting = 1;
         virtio_queue_set_notification(vq, 0);
     }
 }
@@ -794,48 +862,53 @@ static void virtio_net_handle_tx_timer(VirtIODevice *vdev, VirtQueue *vq)
 static void virtio_net_handle_tx_bh(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
+    VirtIONetQueue *netq = &n->vqs[vq_get_pair_index(n, vq)];
 
-    if (unlikely(n->tx_waiting)) {
+    if (unlikely(netq->tx_waiting)) {
         return;
     }
-    n->tx_waiting = 1;
+    netq->tx_waiting = 1;
     /* This happens when device was stopped but VCPU wasn't. */
     if (!n->vdev.vm_running) {
         return;
     }
     virtio_queue_set_notification(vq, 0);
-    qemu_bh_schedule(n->tx_bh);
+    qemu_bh_schedule(netq->tx_bh);
 }
 
 static void virtio_net_tx_timer(void *opaque)
 {
-    VirtIONet *n = opaque;
+    VirtIONetQueue *netq = opaque;
+    VirtIONet *n = netq->n;
+
     assert(n->vdev.vm_running);
 
-    n->tx_waiting = 0;
+    netq->tx_waiting = 0;
 
     /* Just in case the driver is not ready on more */
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(netq->tx_vq, 1);
+    virtio_net_flush_tx(n, netq);
 }
 
 static void virtio_net_tx_bh(void *opaque)
 {
-    VirtIONet *n = opaque;
+    VirtIONetQueue *netq = opaque;
+    VirtQueue *vq = netq->tx_vq;
+    VirtIONet *n = netq->n;
     int32_t ret;
 
     assert(n->vdev.vm_running);
 
-    n->tx_waiting = 0;
+    netq->tx_waiting = 0;
 
     /* Just in case the driver is not ready on more */
     if (unlikely(!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)))
         return;
 
-    ret = virtio_net_flush_tx(n, n->tx_vq);
+    ret = virtio_net_flush_tx(n, netq);
     if (ret == -EBUSY) {
         return; /* Notification re-enable handled by tx_complete */
     }
@@ -843,33 +916,39 @@ static void virtio_net_tx_bh(void *opaque)
     /* If we flush a full burst of packets, assume there are
      * more coming and immediately reschedule */
     if (ret >= n->tx_burst) {
-        qemu_bh_schedule(n->tx_bh);
-        n->tx_waiting = 1;
+        qemu_bh_schedule(netq->tx_bh);
+        netq->tx_waiting = 1;
         return;
     }
 
     /* If less than a full burst, re-enable notification and flush
      * anything that may have come in while we weren't looking.  If
      * we find something, assume the guest is still active and reschedule */
-    virtio_queue_set_notification(n->tx_vq, 1);
-    if (virtio_net_flush_tx(n, n->tx_vq) > 0) {
-        virtio_queue_set_notification(n->tx_vq, 0);
-        qemu_bh_schedule(n->tx_bh);
-        n->tx_waiting = 1;
+    virtio_queue_set_notification(vq, 1);
+    if (virtio_net_flush_tx(n, netq) > 0) {
+        virtio_queue_set_notification(vq, 0);
+        qemu_bh_schedule(netq->tx_bh);
+        netq->tx_waiting = 1;
     }
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
 {
     VirtIONet *n = opaque;
+    int i;
 
     /* At this point, backend must be stopped, otherwise
      * it might keep writing to memory. */
-    assert(!n->vhost_started);
+    for (i = 0; i < n->queues; i++) {
+        assert(!n->vqs[i].vhost_started);
+    }
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, ETH_ALEN);
-    qemu_put_be32(f, n->tx_waiting);
+    qemu_put_be32(f, n->queues);
+    for (i = 0; i < n->queues; i++) {
+        qemu_put_be32(f, n->vqs[i].tx_waiting);
+    }
     qemu_put_be32(f, n->mergeable_rx_bufs);
     qemu_put_be16(f, n->status);
     qemu_put_byte(f, n->promisc);
@@ -898,7 +977,10 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
     virtio_load(&n->vdev, f);
 
     qemu_get_buffer(f, n->mac, ETH_ALEN);
-    n->tx_waiting = qemu_get_be32(f);
+    n->queues = qemu_get_be32(f);
+    for (i = 0; i < n->queues; i++) {
+        n->vqs[i].tx_waiting = qemu_get_be32(f);
+    }
     n->mergeable_rx_bufs = qemu_get_be32(f);
 
     if (version_id >= 3)
@@ -926,7 +1008,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
             n->mac_table.in_use = 0;
         }
     }
- 
+
     if (version_id >= 6)
         qemu_get_buffer(f, (uint8_t *)n->vlans, MAX_VLAN >> 3);
 
@@ -937,13 +1019,16 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
         }
 
         if (n->has_vnet_hdr) {
-            tap_using_vnet_hdr(n->nic->nc.peer, 1);
-            tap_set_offload(n->nic->nc.peer,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
-                    (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
+            for(i = 0; i < n->queues; i++) {
+                tap_using_vnet_hdr(n->nic->ncs[i]->peer, 1);
+                tap_set_offload(n->nic->ncs[i]->peer,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
+                        (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_UFO)  &
+                        1);
+           }
         }
     }
 
@@ -978,7 +1063,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
 
 static void virtio_net_cleanup(VLANClientState *nc)
 {
-    VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
+    VirtIONet *n = ((NICState *)nc->opaque)->opaque;
 
     n->nic = NULL;
 }
@@ -996,6 +1081,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
                               virtio_net_conf *net)
 {
     VirtIONet *n;
+    int i;
 
     n = (VirtIONet *)virtio_common_init("virtio-net", VIRTIO_ID_NET,
                                         sizeof(struct virtio_net_config),
@@ -1008,7 +1094,6 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
     n->vdev.bad_features = virtio_net_bad_features;
     n->vdev.reset = virtio_net_reset;
     n->vdev.set_status = virtio_net_set_status;
-    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
 
     if (net->tx && strcmp(net->tx, "timer") && strcmp(net->tx, "bh")) {
         error_report("virtio-net: "
@@ -1017,15 +1102,6 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
         error_report("Defaulting to \"bh\"");
     }
 
-    if (net->tx && !strcmp(net->tx, "timer")) {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_timer);
-        n->tx_timer = qemu_new_timer_ns(vm_clock, virtio_net_tx_timer, n);
-        n->tx_timeout = net->txtimer;
-    } else {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_bh);
-        n->tx_bh = qemu_bh_new(virtio_net_tx_bh, n);
-    }
-    n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
     qemu_macaddr_default_if_unset(&conf->macaddr);
     memcpy(&n->mac[0], &conf->macaddr, sizeof(n->mac));
     n->status = VIRTIO_NET_S_LINK_UP;
@@ -1034,7 +1110,6 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
 
     qemu_format_nic_info_str(&n->nic->nc, conf->macaddr.a);
 
-    n->tx_waiting = 0;
     n->tx_burst = net->txburst;
     n->mergeable_rx_bufs = 0;
     n->promisc = 1; /* for compatibility */
@@ -1042,6 +1117,29 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
     n->mac_table.macs = qemu_mallocz(MAC_TABLE_ENTRIES * ETH_ALEN);
 
     n->vlans = qemu_mallocz(MAX_VLAN >> 3);
+    n->queues = conf->queues;
+
+    /* Allocate per rx/tx vq's */
+    for (i = 0; i < n->queues; i++) {
+        n->vqs[i].rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
+        if (net->tx && !strcmp(net->tx, "timer")) {
+            n->vqs[i].tx_vq = virtio_add_queue(&n->vdev, 256,
+                                               virtio_net_handle_tx_timer);
+            n->vqs[i].tx_timer = qemu_new_timer_ns(vm_clock,
+                                                   virtio_net_tx_timer,
+                                                   &n->vqs[i]);
+            n->vqs[i].tx_timeout = net->txtimer;
+        } else {
+            n->vqs[i].tx_vq = virtio_add_queue(&n->vdev, 256,
+                                               virtio_net_handle_tx_bh);
+            n->vqs[i].tx_bh = qemu_bh_new(virtio_net_tx_bh, &n->vqs[i]);
+        }
+
+        n->vqs[i].tx_waiting = 0;
+        n->vqs[i].n = n;
+    }
+
+    n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
 
     n->qdev = dev;
     register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION,
@@ -1055,24 +1153,33 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf,
 void virtio_net_exit(VirtIODevice *vdev)
 {
     VirtIONet *n = DO_UPCAST(VirtIONet, vdev, vdev);
+    int i;
 
     /* This will stop vhost backend if appropriate. */
     virtio_net_set_status(vdev, 0);
 
-    qemu_purge_queued_packets(&n->nic->nc);
+    for (i = 0; i < n->queues; i++) {
+        qemu_purge_queued_packets(n->nic->ncs[i]);
+    }
 
     unregister_savevm(n->qdev, "virtio-net", n);
 
     qemu_free(n->mac_table.macs);
     qemu_free(n->vlans);
 
-    if (n->tx_timer) {
-        qemu_del_timer(n->tx_timer);
-        qemu_free_timer(n->tx_timer);
-    } else {
-        qemu_bh_delete(n->tx_bh);
+    for (i = 0; i < n->queues; i++) {
+        VirtIONetQueue *netq = &n->vqs[i];
+        if (netq->tx_timer) {
+            qemu_del_timer(netq->tx_timer);
+            qemu_free_timer(netq->tx_timer);
+        } else {
+            qemu_bh_delete(netq->tx_bh);
+        }
     }
 
     virtio_cleanup(&n->vdev);
-    qemu_del_vlan_client(&n->nic->nc);
+
+    for (i = 0; i < n->queues; i++) {
+        qemu_del_vlan_client(n->nic->ncs[i]);
+    }
 }
diff --git a/hw/virtio-net.h b/hw/virtio-net.h
index 8af9a1c..479489f 100644
--- a/hw/virtio-net.h
+++ b/hw/virtio-net.h
@@ -44,6 +44,7 @@
 #define VIRTIO_NET_F_CTRL_RX    18      /* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN  19      /* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20   /* Extra RX mode control support */
+#define VIRTIO_NET_F_MULTIQUEUE   21
 
 #define VIRTIO_NET_S_LINK_UP    1       /* Link is up */
 
@@ -72,6 +73,7 @@ struct virtio_net_config
     uint8_t mac[ETH_ALEN];
     /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
     uint16_t status;
+    uint16_t queues;
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 555f23f..cae311e 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -666,6 +666,7 @@ static const VirtIOBindings virtio_pci_bindings = {
     .query_guest_notifiers = virtio_pci_query_guest_notifiers,
     .set_host_notifier = virtio_pci_set_host_notifier,
     .set_guest_notifiers = virtio_pci_set_guest_notifiers,
+    .set_guest_notifier = virtio_pci_set_guest_notifier,
     .vmstate_change = virtio_pci_vmstate_change,
 };
 
diff --git a/hw/virtio.h b/hw/virtio.h
index bc72289..e939674 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -96,6 +96,7 @@ typedef struct {
     unsigned (*get_features)(void * opaque);
     bool (*query_guest_notifiers)(void * opaque);
     int (*set_guest_notifiers)(void * opaque, bool assigned);
+    int (*set_guest_notifier)(void *opaque, int n, bool assigned);
     int (*set_host_notifier)(void * opaque, int n, bool assigned);
     void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;


^ permalink raw reply related

* Re: [PATCH v2] net: qlcnic: convert to hw_features
From: David Miller @ 2011-04-20  8:32 UTC (permalink / raw)
  To: mirq-linux; +Cc: netdev, amit.salecha, anirban.chakraborty, linux-driver
In-Reply-To: <20110419130357.40FA813909@rere.qmqm.pl>

From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Tue, 19 Apr 2011 15:03:57 +0200 (CEST)

> Bit more than minimal conversion. There might be some issues because
> of qlcnic_set_netdev_features() if it's called after netdev init.
> 
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> ---
> v2: fix compilation issues (I forgot to 'stg refresh' before sending v1)

Applied.

^ permalink raw reply

* Re: [PATCH] net: tun: convert to hw_features
From: David Miller @ 2011-04-20  8:32 UTC (permalink / raw)
  To: mirq-linux; +Cc: netdev, rusty
In-Reply-To: <20110419161310.7508513909@rere.qmqm.pl>

From: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Date: Tue, 19 Apr 2011 18:13:10 +0200 (CEST)

> This changes offload setting behaviour to what I think is correct:
>  - offloads set via ethtool mean what admin wants to use (by default
>    he wants 'em all)
>  - offloads set via ioctl() mean what userspace is expecting to get
>    (this limits which admin wishes are granted)
>  - TUN_NOCHECKSUM is ignored, as it might cause broken packets when
>    forwarded (ip_summed == CHECKSUM_UNNECESSARY means that checksum
>    was verified, not that it can be ignored)
> 
> If TUN_NOCHECKSUM is implemented, it should set skb->csum_* and
> skb->ip_summed (= CHECKSUM_PARTIAL) for known protocols and let others
> be verified by kernel when necessary.
> 
> TUN_NOCHECKSUM handling was introduced by commit
> f43798c27684ab925adde7d8acc34c78c6e50df8:
> 
>     tun: Allow GSO using virtio_net_hdr
>     
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

Applied.

^ permalink raw reply

* [RFC PATCH 0/2] Multiqueue support for qemu(virtio-net)
From: Jason Wang @ 2011-04-20  8:33 UTC (permalink / raw)
  To: krkumar2, kvm, mst, netdev, rusty, qemu-devel, anthony

Inspired by Krishna's patch (http://www.spinics.net/lists/kvm/msg52098.html) and
Michael's suggestions.  The following series adds the multiqueue support for
qemu and enable it for virtio-net (both userspace and vhost).

The aim for this series is to simplified the management and achieve the same
performacne with less codes.

Follows are the differences between this series and Krishna's:

- Add the multiqueue support for qemu and also for userspace virtio-net
- Instead of hacking the vhost module to manipulate kthreads, this patch just
implement the userspace based multiqueues and thus can re-use the existed vhost kernel-side codes without any modification.
- Use 1:1 mapping between TX/RX pairs and vhost kthread because the
implementation is based on usersapce.
- The cli is also changed to make the mgmt easier, the -netdev option of qdev
can now accpet more than one ids. You can start a multiqueue virtio-net device
through:
./qemu-system-x86_64 -netdev tap,id=hn0,vhost=on,fd=X -netdev
tap,id=hn0,vhost=on,fd=Y -device virtio-net-pci,netdev=hn0#hn1,queues=2 ...

The series is very primitive and still need polished.

Suggestions are welcomed.
---

Jason Wang (2):
      net: Add multiqueue support
      virtio-net: add multiqueue support


 hw/qdev-properties.c |   37 ++++-
 hw/qdev.h            |    3 
 hw/vhost.c           |   26 ++-
 hw/vhost.h           |    1 
 hw/vhost_net.c       |    7 +
 hw/vhost_net.h       |    2 
 hw/virtio-net.c      |  409 ++++++++++++++++++++++++++++++++------------------
 hw/virtio-net.h      |    2 
 hw/virtio-pci.c      |    1 
 hw/virtio.h          |    1 
 net.c                |   34 +++-
 net.h                |   15 +-
 12 files changed, 353 insertions(+), 185 deletions(-)

-- 
Jason Wang

^ permalink raw reply

* Re: A patch you wrote some time ago (aka: "[patch 41/54] ICMP: Fix icmp_errors_use_inbound_ifaddr sysctl")
From: Alexander Hoogerhuis @ 2011-04-20  8:38 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Chris Wright, linux-kernel, netdev
In-Reply-To: <4DAE9824.10802@trash.net>

On 20.04.2011 10:24, Patrick McHardy wrote:
>
> That might be a possibility to fix this for your case. But I'm
> wondering why you're turning this on at all and not have routing
> decide the correct source address?

Not a whole lot of tuning, but trying to figure why this would not work 
as any other VRRP implementation would work on other routers/OSes.

My case seems to be a general problem for ICMP errors, as the IP stack 
tends to want to listen more to advice coming back with the source IP of 
the gateway, not a third party.

If you have two machines (A and B) run VRRP and share an IP (C), then 
any ICMP redirect should have the VRRP IP as source (C), and the way it 
works today (with or without sysctl_icmp_errors_use_inbound_ifaddr) is 
that it will have the source set to the primary IP of the source interface.

I suspect this holds for any other ICMP message sent back to hosts in 
the connected network as well, such as PMTU-related issues, etc.

In my case nodes in the connected subnet would get ICMP redirects from 
the primary IPs, and thus not listen to them as they are arriving from 
nodes not listen in the list of known gateways.

It would make more sense when returning ICMP messages the source IP 
would be the actual IP it is recveied on, not the primary IP of the 
interface.

mvh,
A
-- 
Alexander Hoogerhuis | http://no.linkedin.com/in/alexh
Boxed Solutions AS   | +47 908 21 485 - alexh@boxed.no
"Given enough eyeballs, all bugs are shallow." -Eric S. Raymond

^ permalink raw reply

* Re: [PATCH] ehea: Fix a DLPAR bug on ehea_rereg_mrs().
From: David Miller @ 2011-04-20  8:42 UTC (permalink / raw)
  To: leitao; +Cc: netdev
In-Reply-To: <1303241962-32285-1-git-send-email-leitao@linux.vnet.ibm.com>

From: Breno Leitao <leitao@linux.vnet.ibm.com>
Date: Tue, 19 Apr 2011 16:39:22 -0300

> We are currently continuing if ehea_restart_qps() fails, when we
> do a memory DLPAR (remove or add more memory to the system).
> 
> This patch just let the NAPI disabled if the ehea_restart_qps()
> fails.
> 
> Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] bonding: 802.3ad - fix agg_device_up
From: David Miller @ 2011-04-20  8:45 UTC (permalink / raw)
  To: fubar; +Cc: jbohac, netdev, andy, shemminger
In-Reply-To: <21326.1303232103@death>

From: Jay Vosburgh <fubar@us.ibm.com>
Date: Tue, 19 Apr 2011 09:55:03 -0700

> Jiri Bohac <jbohac@suse.cz> wrote:
> 
>>The slave member of struct aggregator does not necessarily point
>>to a slave which is part of the aggregator. It points to the
>>slave structure containing the aggregator structure, while
>>completely different slaves (or no slaves at all) may be part of
>>the aggregator.
>>
>>The agg_device_up() function wrongly uses agg->slave to find the state of the
>>aggregator.  Use agg->lag_ports->slave instead. The bug has been
>>introduced by commit 4cd6fe1c6483cde93e2ec91f58b7af9c9eea51ad.
>>
>>Signed-off-by: Jiri Bohac <jbohac@suse.cz>
> 
> 	One additional note: port->slave can be NULL when the slave is
> transitioning in or out of the bond, but not when the port is part of an
> aggregator, so this usage should be safe.
> 
> 	-J
> 
> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH] bnx2x: dont dereference tcp header on non tcp frames
From: David Miller @ 2011-04-20  8:46 UTC (permalink / raw)
  To: eric.dumazet; +Cc: dmitry, eilong, netdev
In-Reply-To: <1303284713.3186.13.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Apr 2011 09:31:53 +0200

> [PATCH] bnx2x: dont dereference tcp header on non tcp frames
> 
> bnx2x_set_pbd_csum() & bnx2x_set_pbd_csum_e2() use 
> tcp_hdrlen(skb) even for non TCP frames
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Broadcom folks please review.

^ permalink raw reply

* Re: [RFC PATCH 0/2] Multiqueue support for qemu(virtio-net)
From: Krishna Kumar2 @ 2011-04-20  8:52 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, mst, netdev, rusty, qemu-devel
In-Reply-To: <20110420082706.32157.59668.stgit@dhcp-91-7.nay.redhat.com.englab.nay.redhat.com>

Thanks Jason!

So I can use my virtio-net guest driver and test with this patch?
Please provide the script you use to start MQ guest.

Regards,

- KK

Jason Wang <jasowang@redhat.com> wrote on 04/20/2011 02:03:07 PM:

> Jason Wang <jasowang@redhat.com>
> 04/20/2011 02:03 PM
>
> To
>
> Krishna Kumar2/India/IBM@IBMIN, kvm@vger.kernel.org, mst@redhat.com,
> netdev@vger.kernel.org, rusty@rustcorp.com.au, qemu-
> devel@nongnu.org, anthony@codemonkey.ws
>
> cc
>
> Subject
>
> [RFC PATCH 0/2] Multiqueue support for qemu(virtio-net)
>
> Inspired by Krishna's patch
(http://www.spinics.net/lists/kvm/msg52098.html
> ) and
> Michael's suggestions.  The following series adds the multiqueue support
for
> qemu and enable it for virtio-net (both userspace and vhost).
>
> The aim for this series is to simplified the management and achieve the
same
> performacne with less codes.
>
> Follows are the differences between this series and Krishna's:
>
> - Add the multiqueue support for qemu and also for userspace virtio-net
> - Instead of hacking the vhost module to manipulate kthreads, this patch
just
> implement the userspace based multiqueues and thus can re-use the
> existed vhost kernel-side codes without any modification.
> - Use 1:1 mapping between TX/RX pairs and vhost kthread because the
> implementation is based on usersapce.
> - The cli is also changed to make the mgmt easier, the -netdev option of
qdev
> can now accpet more than one ids. You can start a multiqueue virtio-net
device
> through:
> ./qemu-system-x86_64 -netdev tap,id=hn0,vhost=on,fd=X -netdev
> tap,id=hn0,vhost=on,fd=Y -device
virtio-net-pci,netdev=hn0#hn1,queues=2 ...
>
> The series is very primitive and still need polished.
>
> Suggestions are welcomed.
> ---
>
> Jason Wang (2):
>       net: Add multiqueue support
>       virtio-net: add multiqueue support
>
>
>  hw/qdev-properties.c |   37 ++++-
>  hw/qdev.h            |    3
>  hw/vhost.c           |   26 ++-
>  hw/vhost.h           |    1
>  hw/vhost_net.c       |    7 +
>  hw/vhost_net.h       |    2
>  hw/virtio-net.c      |  409 +++++++++++++++++++++++++++++++
> +------------------
>  hw/virtio-net.h      |    2
>  hw/virtio-pci.c      |    1
>  hw/virtio.h          |    1
>  net.c                |   34 +++-
>  net.h                |   15 +-
>  12 files changed, 353 insertions(+), 185 deletions(-)
>
> --
> Jason Wang

^ permalink raw reply

* [PATCH 1/1] powerpc: Fix multicast problem in fs_enet driver
From: Andrea Galbusera @ 2011-04-20  8:55 UTC (permalink / raw)
  To: Pantelis Antoniou, Vitaly Bordug
  Cc: linuxppc-dev, netdev, linux-kernel, Andrea Galbusera

mac-fec.c was setting individual UDP address registers instead of multicast
group address registers when joining a multicast group.
This prevented from correctly receiving UDP multicast packets.
According to datasheet, replaced hash_table_high and hash_table_low
with grp_hash_table_high and grp_hash_table_low respectively.

Tested on a MPC5121 based board.

Signed-off-by: Andrea Galbusera <gizero@gmail.com>
---
 drivers/net/fs_enet/mac-fec.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/fs_enet/mac-fec.c b/drivers/net/fs_enet/mac-fec.c
index 61035fc..b9fbc83 100644
--- a/drivers/net/fs_enet/mac-fec.c
+++ b/drivers/net/fs_enet/mac-fec.c
@@ -226,8 +226,8 @@ static void set_multicast_finish(struct net_device *dev)
 	}
 
 	FC(fecp, r_cntrl, FEC_RCNTRL_PROM);
-	FW(fecp, hash_table_high, fep->fec.hthi);
-	FW(fecp, hash_table_low, fep->fec.htlo);
+	FW(fecp, grp_hash_table_high, fep->fec.hthi);
+	FW(fecp, grp_hash_table_low, fep->fec.htlo);
 }
 
 static void set_multicast_list(struct net_device *dev)
@@ -273,8 +273,8 @@ static void restart(struct net_device *dev)
 	/*
 	 * Reset all multicast.
 	 */
-	FW(fecp, hash_table_high, fep->fec.hthi);
-	FW(fecp, hash_table_low, fep->fec.htlo);
+	FW(fecp, grp_hash_table_high, fep->fec.hthi);
+	FW(fecp, grp_hash_table_low, fep->fec.htlo);
 
 	/*
 	 * Set maximum receive buffer size.
-- 
1.7.0.4

^ permalink raw reply related

* Re: [PATCH net-next-2.6 0/9] SCTP updates for net-next-2.6
From: David Miller @ 2011-04-20  8:55 UTC (permalink / raw)
  To: yjwei; +Cc: netdev, linux-sctp
In-Reply-To: <4DAE8A27.3040007@cn.fujitsu.com>

From: Wei Yongjun <yjwei@cn.fujitsu.com>
Date: Wed, 20 Apr 2011 15:24:23 +0800

> Here is a set of SCTP patches for net-next-2.6, also from vlad's
> lksctp-dev tree.  Please apply.
> 
> Shan Wei (4):
>       sctp: check parameter value of length in ERROR chunk
>       sctp: check invalid value of length parameter in error cause
>       sctp: remove redundant check when walking through a list of TLV parameters
>       sctp: handle ootb packet in chunk order as defined
> 
> Vlad Yasevich (2):
>       sctp: remove completely unsed EMPTY state
>       sctp: bail from sctp_endpoint_lookup_assoc() if not bound.
> 
> Wei Yongjun (3):
>       sctp: fix to check the source address of COOKIE-ECHO chunk
>       sctp: make heartbeat information in sctp_make_heartbeat()
>       sctp: move chunk from retransmit queue to abandoned list

Applied, thanks for doing these backports from Vlad's tree.

^ permalink raw reply

* [PATCH BUG-FIX] ipv6: udp: fix the wrong headroom check
From: Shan Wei @ 2011-04-20  8:52 UTC (permalink / raw)
  To: kuznet, David Miller, pekkas, jmorris,
	yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki, Patrick 

At this point, skb->data points to skb_transport_header.
So, headroom check is wrong. 

For some case:bridge(UFO is on) + eth device(UFO is off),
there is no enough headroom for IPv6 frag head.
But headroom check is always false.

This will bring about data be moved to there prior to skb->head,
when adding IPv6 frag header to skb.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 net/ipv6/udp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 15c3774..9e305d7 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1335,7 +1335,7 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb, u32 features)
 	skb->ip_summed = CHECKSUM_NONE;
 
 	/* Check if there is enough headroom to insert fragment header. */
-	if ((skb_headroom(skb) < frag_hdr_sz) &&
+	if ((skb_mac_header(skb) < skb->head + frag_hdr_sz) &&
 	    pskb_expand_head(skb, frag_hdr_sz, 0, GFP_ATOMIC))
 		goto out;
 
-- 
1.6.3.3
 

^ permalink raw reply related

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Eric Dumazet @ 2011-04-20  9:07 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: casteyde.christian, Christoph Lameter, Andrew Morton, netdev,
	bugzilla-daemon, bugme-daemon, Vegard Nossum
In-Reply-To: <1303286998.3186.18.camel@edumazet-laptop>


I had one splat (before applying any patch), adding CONFIG_PREEMPT to my
build :

[   84.629936] WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88011a7376f0)
[   84.629939] 5868ba1a0188ffff5868ba1a0188ffff7078731a0188ffffe0e1181a0188ffff
[   84.629952]  i i i i i i i i i i i i i i i i u u u u u u u u u u u u u u u u
[   84.629963]                                  ^
[   84.629964] 
[   84.629966] Pid: 2060, comm: 05-wait_for_sys Not tainted 2.6.39-rc4-00237-ge3de956-dirty #550 HP ProLiant BL460c G6
[   84.629969] RIP: 0010:[<ffffffff810f0e80>]  [<ffffffff810f0e80>] kmem_cache_alloc+0x50/0x190
[   84.629977] RSP: 0018:ffff88011a0b9988  EFLAGS: 00010286
[   84.629979] RAX: 0000000000000000 RBX: ffff88011a739a98 RCX: 0000000000620780
[   84.629980] RDX: 0000000000620740 RSI: 0000000000017400 RDI: ffffffff810db8f5
[   84.629982] RBP: ffff88011a0b99b8 R08: ffff88011a261dd8 R09: 0000000000000009
[   84.629983] R10: 0000000000000002 R11: ffff88011fffce00 R12: ffff88011b00a600
[   84.629985] R13: ffff88011a7376f0 R14: 00000000000000d0 R15: ffff88011abbb0c0
[   84.629987] FS:  0000000000000000(0000) GS:ffff88011fc00000(0063) knlGS:00000000f76ff6c0
[   84.629989] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
[   84.629991] CR2: ffff88011998dab8 CR3: 000000011a13f000 CR4: 00000000000006f0
[   84.629992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   84.629994] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
[   84.629995]  [<ffffffff810db8f5>] anon_vma_prepare+0x55/0x190
[   84.630000]  [<ffffffff810cfe90>] handle_pte_fault+0x470/0x6e0
[   84.630002]  [<ffffffff810d1654>] handle_mm_fault+0x124/0x1a0
[   84.630005]  [<ffffffff8147a510>] do_page_fault+0x160/0x560
[   84.630008]  [<ffffffff81477fcf>] page_fault+0x1f/0x30
[   84.630013]  [<ffffffff810b31f8>] generic_file_aio_read+0x528/0x740
[   84.630017]  [<ffffffff810f5461>] do_sync_read+0xd1/0x110
[   84.630021]  [<ffffffff810f5c36>] vfs_read+0xc6/0x160
[   84.630023]  [<ffffffff810f60b0>] sys_read+0x50/0x90
[   84.630025]  [<ffffffff8147f4e9>] sysenter_dispatch+0x7/0x27
[   84.630030]  [<ffffffffffffffff>] 0xffffffffffffffff

ffffffff810f0e6d:       0f 84 b5 00 00 00       je     ffffffff810f0f28 <kmem_cache_alloc+0xf8>
ffffffff810f0e73:       49 63 44 24 20          movslq 0x20(%r12),%rax
ffffffff810f0e78:       48 8d 4a 40             lea    0x40(%rdx),%rcx
ffffffff810f0e7c:       49 8b 34 24             mov    (%r12),%rsi
>>ffffffff810f0e80:       49 8b 5c 05 00          mov    0x0(%r13,%rax,1),%rbx
ffffffff810f0e85:       4c 89 e8                mov    %r13,%rax
ffffffff810f0e88:       e8 83 19 0e 00          callq  ffffffff811d2810 <this_cpu_cmpxchg16b_emu>
ffffffff810f0e8d:       0f 1f 40 00             nopl   0x0(%rax)
ffffffff810f0e91:       84 c0                   test   %al,%al
ffffffff810f0e93:       74 c1                   je     ffffffff810f0e56 <kmem_cache_alloc+0x26>


So this is definitely the access to object->next that triggers the fault.

Problem is : my (2nd) patch only reduces the window of occurrence of the bug, since we can
be preempted/interrupted between the kmemcheck_mark_initialized() and access to object->next :

The preempt/irq could allocate the object and free it again.

So KMEMCHECK would require us to block IRQ to be safe.

Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did in my first patch.

Christoph, please reconsider first version of patch (disabling SLUB_CMPXCHG_DOUBLE if
either KMEMCHECK or DEBUG_PAGEALLOC is defined)

Thanks



^ permalink raw reply

* Re: A patch you wrote some time ago (aka: "[patch 41/54] ICMP: Fix icmp_errors_use_inbound_ifaddr sysctl")
From: Alexander Hoogerhuis @ 2011-04-20  9:11 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Chris Wright, linux-kernel, netdev
In-Reply-To: <4DAE9824.10802@trash.net>

Also, in a quick followup on myself, this is ambiguous, as it does not 
state that it will select the primary IP. I've

The second part is that RFC 5798 (VRRP v3) lists in section 8.1.1:

>    The IPv4 source address of an ICMP redirect should be the address
>    that the end-host used when making its next-hop routing decision.

I.e. the settings in linux, with or without the sysctl flag set, would 
run against this.

mvh,
A
-- 
Alexander Hoogerhuis | http://no.linkedin.com/in/alexh
Boxed Solutions AS   | +47 908 21 485 - alexh@boxed.no
"Given enough eyeballs, all bugs are shallow." -Eric S. Raymond

^ permalink raw reply

* Re: [PATCH 1/3] IPVS: Change of socket usage to enable name space exit.
From: Eric W. Biederman @ 2011-04-20  9:40 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: horms, ja, lvs-devel, netdev, netfilter-devel, hans.schillstrom
In-Reply-To: <1303226705-29178-1-git-send-email-hans@schillstrom.com>

Hans Schillstrom <hans@schillstrom.com> writes:

> This is the first patch in a series of three.
> The cleanup doesn't work when not exit in a clean way by using ipvsadm.
> Killing of a namespace causes a hanging ipvs, this series will cure that.
>
> If the sync daemons run in a namespace while it crashes
> or get killed, there is no way to stop them except for a reboot.
>
> Kernel threads should not increment the use count of a socket.
> By calling sk_change_net() after creating a socket this is avoided.
> sock_release cant be used, instead sk_release_kernel() should be used.
>
> Thanks to Eric W Biederman.
>
> This patch is based on net-next-2.6  ver 2.6.39-rc2

I am not comfortable with this patch on it's own.  What kills the
daemons on network namespace exit?

I would be a little happier if the code also used sock_create_kern
instead of __sock_create. Now that it uses sk_change_net.  Just to make
the code more idiomatic.

But not actually guaranteeing that the namespace exit kills the threads
at this point seems pretty badly broken, as this idiom is not safe
unless we have a hard guarantee that the threads are dead.

Eric

> Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
> ---
>  net/netfilter/ipvs/ip_vs_sync.c |   28 +++++++++++++++++++---------
>  1 files changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
> index 3e7961e..3f87555 100644
> --- a/net/netfilter/ipvs/ip_vs_sync.c
> +++ b/net/netfilter/ipvs/ip_vs_sync.c
> @@ -1309,7 +1309,12 @@ static struct socket *make_send_sock(struct net *net)
>  		pr_err("Error during creation of socket; terminating\n");
>  		return ERR_PTR(result);
>  	}
> -
> +	/*
> +	 * Kernel sockets that are a part of a namespace, should not
> +	 * hold a reference to a namespace in order to allow to stop it.
> +	 * After sk_change_net should be released using sk_release_kernel.
> +	 */
> +	sk_change_net(sock->sk, net);
>  	result = set_mcast_if(sock->sk, ipvs->master_mcast_ifn);
>  	if (result < 0) {
>  		pr_err("Error setting outbound mcast interface\n");
> @@ -1334,8 +1339,8 @@ static struct socket *make_send_sock(struct net *net)
>
>  	return sock;
>
> -  error:
> -	sock_release(sock);
> +error:
> +	sk_release_kernel(sock->sk);
>  	return ERR_PTR(result);
>  }
>
> @@ -1355,7 +1360,12 @@ static struct socket *make_receive_sock(struct net *net)
>  		pr_err("Error during creation of socket; terminating\n");
>  		return ERR_PTR(result);
>  	}
> -
> +	/*
> +	 * Kernel sockets that are a part of a namespace, should not
> +	 * hold a reference to a namespace in order to allow to stop it.
> +	 * After sk_change_net should be released using sk_release_kernel.
> +	 */
> +	sk_change_net(sock->sk, net);
>  	/* it is equivalent to the REUSEADDR option in user-space */
>  	sock->sk->sk_reuse = 1;
>
> @@ -1377,8 +1387,8 @@ static struct socket *make_receive_sock(struct net *net)
>
>  	return sock;
>
> -  error:
> -	sock_release(sock);
> +error:
> +	sk_release_kernel(sock->sk);
>  	return ERR_PTR(result);
>  }
>
> @@ -1473,7 +1483,7 @@ static int sync_thread_master(void *data)
>  		ip_vs_sync_buff_release(sb);
>
>  	/* release the sending multicast socket */
> -	sock_release(tinfo->sock);
> +	sk_release_kernel(tinfo->sock->sk);
>  	kfree(tinfo);
>
>  	return 0;
> @@ -1513,7 +1523,7 @@ static int sync_thread_backup(void *data)
>  	}
>
>  	/* release the sending multicast socket */
> -	sock_release(tinfo->sock);
> +	sk_release_kernel(tinfo->sock->sk);
>  	kfree(tinfo->buf);
>  	kfree(tinfo);
>
> @@ -1601,7 +1611,7 @@ outtinfo:
>  outbuf:
>  	kfree(buf);
>  outsocket:
> -	sock_release(sock);
> +	sk_release_kernel(sock->sk);
>  out:
>  	return result;
>  }
> --
> 1.7.2.3

^ permalink raw reply

* Re: [PATCH] bnx2x: dont dereference tcp header on non tcp frames
From: Dmitry Kravkov @ 2011-04-20  9:40 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet@gmail.com, Eilon Greenstein, netdev@vger.kernel.org
In-Reply-To: <20110420.014657.232735224.davem@davemloft.net>

On Wed, 2011-04-20 at 01:46 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 20 Apr 2011 09:31:53 +0200
> 
> > [PATCH] bnx2x: dont dereference tcp header on non tcp frames
> > 
> > bnx2x_set_pbd_csum() & bnx2x_set_pbd_csum_e2() use 
> > tcp_hdrlen(skb) even for non TCP frames
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Broadcom folks please review.
> 

Following patch fixes udp checksum offload flow and also addresses
issues raised by Eric. We are testing it now.

From: Vladislav Zolotarov <vladz@broadcom.com>

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_cmn.c |   34 ++++++++++++++++++++++++----------
 1 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_cmn.c b/drivers/net/bnx2x/bnx2x_cmn.c
index e83ac6d..16581df 100644
--- a/drivers/net/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/bnx2x/bnx2x_cmn.c
@@ -2019,15 +2019,23 @@ static inline void bnx2x_set_pbd_gso(struct sk_buff *skb,
 static inline  u8 bnx2x_set_pbd_csum_e2(struct bnx2x *bp, struct sk_buff *skb,
 	u32 *parsing_data, u32 xmit_type)
 {
-	*parsing_data |= ((tcp_hdrlen(skb)/4) <<
-		ETH_TX_PARSE_BD_E2_TCP_HDR_LENGTH_DW_SHIFT) &
-		ETH_TX_PARSE_BD_E2_TCP_HDR_LENGTH_DW;
+	*parsing_data |=
+			((((u8 *)skb_transport_header(skb) - skb->data) >> 1) <<
+			ETH_TX_PARSE_BD_E2_TCP_HDR_START_OFFSET_W_SHIFT) &
+			ETH_TX_PARSE_BD_E2_TCP_HDR_START_OFFSET_W;
 
-	*parsing_data |= ((((u8 *)tcp_hdr(skb) - skb->data) / 2) <<
-		ETH_TX_PARSE_BD_E2_TCP_HDR_START_OFFSET_W_SHIFT) &
-		ETH_TX_PARSE_BD_E2_TCP_HDR_START_OFFSET_W;
+	if (xmit_type & XMIT_CSUM_TCP) {
+		*parsing_data |= ((tcp_hdrlen(skb) / 4) <<
+			ETH_TX_PARSE_BD_E2_TCP_HDR_LENGTH_DW_SHIFT) &
+			ETH_TX_PARSE_BD_E2_TCP_HDR_LENGTH_DW;
 
-	return skb_transport_header(skb) + tcp_hdrlen(skb) - skb->data;
+		return skb_transport_header(skb) + tcp_hdrlen(skb) - skb->data;
+	} else
+		/* We support checksum offload for TCP and UDP only.
+		 * No need to pass the UDP header length - it's a constant.
+		 */
+		return skb_transport_header(skb) +
+				sizeof(struct udphdr) - skb->data;
 }
 
 /**
@@ -2043,7 +2051,7 @@ static inline u8 bnx2x_set_pbd_csum(struct bnx2x *bp, struct sk_buff *skb,
 	struct eth_tx_parse_bd_e1x *pbd,
 	u32 xmit_type)
 {
-	u8 hlen = (skb_network_header(skb) - skb->data) / 2;
+	u8 hlen = (skb_network_header(skb) - skb->data) >> 1;
 
 	/* for now NS flag is not used in Linux */
 	pbd->global_data =
@@ -2051,9 +2059,15 @@ static inline u8 bnx2x_set_pbd_csum(struct bnx2x *bp, struct sk_buff *skb,
 			 ETH_TX_PARSE_BD_E1X_LLC_SNAP_EN_SHIFT));
 
 	pbd->ip_hlen_w = (skb_transport_header(skb) -
-			skb_network_header(skb)) / 2;
+			skb_network_header(skb)) >> 1;
 
-	hlen += pbd->ip_hlen_w + tcp_hdrlen(skb) / 2;
+	hlen += pbd->ip_hlen_w;
+
+	/* We support checksum offload for TCP and UDP only */
+	if (xmit_type & XMIT_CSUM_TCP)
+		hlen += tcp_hdrlen(skb) / 2;
+	else
+		hlen += sizeof(struct udphdr) / 2;
 
 	pbd->total_hlen_w = cpu_to_le16(hlen);
 	hlen = hlen*2;
-- 
1.7.2.2





^ permalink raw reply related

* Re: [PATCH 2/3] IPVS: Change of register_pernet_subsys to register_pernet_device
From: Eric W. Biederman @ 2011-04-20  9:46 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: horms, ja, lvs-devel, netdev, netfilter-devel, hans.schillstrom
In-Reply-To: <1303226705-29178-2-git-send-email-hans@schillstrom.com>

Hans Schillstrom <hans@schillstrom.com> writes:

> This is part 1 of a makeover of the init and cleanup
> functions in ip_vs using name space.

That this fixes problems for you is great.  Why does this
fix the problems.  Does ip_vs really act more as a
network device passing packets than as a protocol implementation
or a library?

register_pernet_subsys already gives you the guarantee that
cleanup is in the reverse order of initialization.

I expect you are on the right track but it isn't clear to me from
reading the patch and it's description why this code should work.

Eric


> Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
> ---
>  net/netfilter/ipvs/ip_vs_app.c   |    4 ++--
>  net/netfilter/ipvs/ip_vs_conn.c  |    4 ++--
>  net/netfilter/ipvs/ip_vs_core.c  |    6 +++---
>  net/netfilter/ipvs/ip_vs_ctl.c   |    6 +++---
>  net/netfilter/ipvs/ip_vs_est.c   |    4 ++--
>  net/netfilter/ipvs/ip_vs_ftp.c   |    4 ++--
>  net/netfilter/ipvs/ip_vs_lblc.c  |    6 +++---
>  net/netfilter/ipvs/ip_vs_lblcr.c |    6 +++---
>  net/netfilter/ipvs/ip_vs_proto.c |    4 ++--
>  net/netfilter/ipvs/ip_vs_sync.c  |    4 ++--
>  10 files changed, 24 insertions(+), 24 deletions(-)
>
> diff --git a/net/netfilter/ipvs/ip_vs_app.c b/net/netfilter/ipvs/ip_vs_app.c
> index 2dc6de1..7e8e769 100644
> --- a/net/netfilter/ipvs/ip_vs_app.c
> +++ b/net/netfilter/ipvs/ip_vs_app.c
> @@ -599,12 +599,12 @@ int __init ip_vs_app_init(void)
>  {
>  	int rv;
>  
> -	rv = register_pernet_subsys(&ip_vs_app_ops);
> +	rv = register_pernet_device(&ip_vs_app_ops);
>  	return rv;
>  }
>  
>  
>  void ip_vs_app_cleanup(void)
>  {
> -	unregister_pernet_subsys(&ip_vs_app_ops);
> +	unregister_pernet_device(&ip_vs_app_ops);
>  }
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index c97bd45..36cd5ea 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -1309,7 +1309,7 @@ int __init ip_vs_conn_init(void)
>  		rwlock_init(&__ip_vs_conntbl_lock_array[idx].l);
>  	}
>  
> -	retc = register_pernet_subsys(&ipvs_conn_ops);
> +	retc = register_pernet_device(&ipvs_conn_ops);
>  
>  	/* calculate the random value for connection hash */
>  	get_random_bytes(&ip_vs_conn_rnd, sizeof(ip_vs_conn_rnd));
> @@ -1319,7 +1319,7 @@ int __init ip_vs_conn_init(void)
>  
>  void ip_vs_conn_cleanup(void)
>  {
> -	unregister_pernet_subsys(&ipvs_conn_ops);
> +	unregister_pernet_device(&ipvs_conn_ops);
>  	/* Release the empty cache */
>  	kmem_cache_destroy(ip_vs_conn_cachep);
>  	vfree(ip_vs_conn_tab);
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 07accf6..a7bb81d 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -1913,7 +1913,7 @@ static int __init ip_vs_init(void)
>  {
>  	int ret;
>  
> -	ret = register_pernet_subsys(&ipvs_core_ops);	/* Alloc ip_vs struct */
> +	ret = register_pernet_device(&ipvs_core_ops);	/* Alloc ip_vs struct */
>  	if (ret < 0)
>  		return ret;
>  
> @@ -1964,7 +1964,7 @@ cleanup_sync:
>  	ip_vs_control_cleanup();
>    cleanup_estimator:
>  	ip_vs_estimator_cleanup();
> -	unregister_pernet_subsys(&ipvs_core_ops);	/* free ip_vs struct */
> +	unregister_pernet_device(&ipvs_core_ops);	/* free ip_vs struct */
>  	return ret;
>  }
>  
> @@ -1977,7 +1977,7 @@ static void __exit ip_vs_cleanup(void)
>  	ip_vs_protocol_cleanup();
>  	ip_vs_control_cleanup();
>  	ip_vs_estimator_cleanup();
> -	unregister_pernet_subsys(&ipvs_core_ops);	/* free ip_vs struct */
> +	unregister_pernet_device(&ipvs_core_ops);	/* free ip_vs struct */
>  	pr_info("ipvs unloaded.\n");
>  }
>  
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index ae47090..08715d8 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -3657,7 +3657,7 @@ int __init ip_vs_control_init(void)
>  		INIT_LIST_HEAD(&ip_vs_svc_fwm_table[idx]);
>  	}
>  
> -	ret = register_pernet_subsys(&ipvs_control_ops);
> +	ret = register_pernet_device(&ipvs_control_ops);
>  	if (ret) {
>  		pr_err("cannot register namespace.\n");
>  		goto err;
> @@ -3682,7 +3682,7 @@ int __init ip_vs_control_init(void)
>  	return 0;
>  
>  err_net:
> -	unregister_pernet_subsys(&ipvs_control_ops);
> +	unregister_pernet_device(&ipvs_control_ops);
>  err:
>  	return ret;
>  }
> @@ -3691,7 +3691,7 @@ err:
>  void ip_vs_control_cleanup(void)
>  {
>  	EnterFunction(2);
> -	unregister_pernet_subsys(&ipvs_control_ops);
> +	unregister_pernet_device(&ipvs_control_ops);
>  	ip_vs_genl_unregister();
>  	nf_unregister_sockopt(&ip_vs_sockopts);
>  	LeaveFunction(2);
> diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
> index 8c8766c..759163e 100644
> --- a/net/netfilter/ipvs/ip_vs_est.c
> +++ b/net/netfilter/ipvs/ip_vs_est.c
> @@ -216,11 +216,11 @@ int __init ip_vs_estimator_init(void)
>  {
>  	int rv;
>  
> -	rv = register_pernet_subsys(&ip_vs_app_ops);
> +	rv = register_pernet_device(&ip_vs_app_ops);
>  	return rv;
>  }
>  
>  void ip_vs_estimator_cleanup(void)
>  {
> -	unregister_pernet_subsys(&ip_vs_app_ops);
> +	unregister_pernet_device(&ip_vs_app_ops);
>  }
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index 6b5dd6d..dfa04d3 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -451,7 +451,7 @@ int __init ip_vs_ftp_init(void)
>  {
>  	int rv;
>  
> -	rv = register_pernet_subsys(&ip_vs_ftp_ops);
> +	rv = register_pernet_device(&ip_vs_ftp_ops);
>  	return rv;
>  }
>  
> @@ -460,7 +460,7 @@ int __init ip_vs_ftp_init(void)
>   */
>  static void __exit ip_vs_ftp_exit(void)
>  {
> -	unregister_pernet_subsys(&ip_vs_ftp_ops);
> +	unregister_pernet_device(&ip_vs_ftp_ops);
>  }
>  
>  
> diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
> index 87e40ea..96765d0 100644
> --- a/net/netfilter/ipvs/ip_vs_lblc.c
> +++ b/net/netfilter/ipvs/ip_vs_lblc.c
> @@ -603,20 +603,20 @@ static int __init ip_vs_lblc_init(void)
>  {
>  	int ret;
>  
> -	ret = register_pernet_subsys(&ip_vs_lblc_ops);
> +	ret = register_pernet_device(&ip_vs_lblc_ops);
>  	if (ret)
>  		return ret;
>  
>  	ret = register_ip_vs_scheduler(&ip_vs_lblc_scheduler);
>  	if (ret)
> -		unregister_pernet_subsys(&ip_vs_lblc_ops);
> +		unregister_pernet_device(&ip_vs_lblc_ops);
>  	return ret;
>  }
>  
>  static void __exit ip_vs_lblc_cleanup(void)
>  {
>  	unregister_ip_vs_scheduler(&ip_vs_lblc_scheduler);
> -	unregister_pernet_subsys(&ip_vs_lblc_ops);
> +	unregister_pernet_device(&ip_vs_lblc_ops);
>  }
>  
>  
> diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
> index 90f618a..5de425f 100644
> --- a/net/netfilter/ipvs/ip_vs_lblcr.c
> +++ b/net/netfilter/ipvs/ip_vs_lblcr.c
> @@ -799,20 +799,20 @@ static int __init ip_vs_lblcr_init(void)
>  {
>  	int ret;
>  
> -	ret = register_pernet_subsys(&ip_vs_lblcr_ops);
> +	ret = register_pernet_device(&ip_vs_lblcr_ops);
>  	if (ret)
>  		return ret;
>  
>  	ret = register_ip_vs_scheduler(&ip_vs_lblcr_scheduler);
>  	if (ret)
> -		unregister_pernet_subsys(&ip_vs_lblcr_ops);
> +		unregister_pernet_device(&ip_vs_lblcr_ops);
>  	return ret;
>  }
>  
>  static void __exit ip_vs_lblcr_cleanup(void)
>  {
>  	unregister_ip_vs_scheduler(&ip_vs_lblcr_scheduler);
> -	unregister_pernet_subsys(&ip_vs_lblcr_ops);
> +	unregister_pernet_device(&ip_vs_lblcr_ops);
>  }
>  
>  
> diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c
> index 17484a4..f7021fc 100644
> --- a/net/netfilter/ipvs/ip_vs_proto.c
> +++ b/net/netfilter/ipvs/ip_vs_proto.c
> @@ -382,7 +382,7 @@ int __init ip_vs_protocol_init(void)
>  	REGISTER_PROTOCOL(&ip_vs_protocol_esp);
>  #endif
>  	pr_info("Registered protocols (%s)\n", &protocols[2]);
> -	return register_pernet_subsys(&ipvs_proto_ops);
> +	return register_pernet_device(&ipvs_proto_ops);
>  
>  	return 0;
>  }
> @@ -393,7 +393,7 @@ void ip_vs_protocol_cleanup(void)
>  	struct ip_vs_protocol *pp;
>  	int i;
>  
> -	unregister_pernet_subsys(&ipvs_proto_ops);
> +	unregister_pernet_device(&ipvs_proto_ops);
>  	/* unregister all the ipvs protocols */
>  	for (i = 0; i < IP_VS_PROTO_TAB_SIZE; i++) {
>  		while ((pp = ip_vs_proto_table[i]) != NULL)
> diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
> index 3f87555..1aeca1d 100644
> --- a/net/netfilter/ipvs/ip_vs_sync.c
> +++ b/net/netfilter/ipvs/ip_vs_sync.c
> @@ -1692,10 +1692,10 @@ static struct pernet_operations ipvs_sync_ops = {
>  
>  int __init ip_vs_sync_init(void)
>  {
> -	return register_pernet_subsys(&ipvs_sync_ops);
> +	return register_pernet_device(&ipvs_sync_ops);
>  }
>  
>  void ip_vs_sync_cleanup(void)
>  {
> -	unregister_pernet_subsys(&ipvs_sync_ops);
> +	unregister_pernet_device(&ipvs_sync_ops);
>  }

^ permalink raw reply

* Re: [PATCH 3/3] IPVS: init and cleanup restructuring.
From: Hans Schillstrom @ 2011-04-20  9:56 UTC (permalink / raw)
  To: Julian Anastasov, ebiederm
  Cc: horms, lvs-devel, netdev, netfilter-devel, hans.schillstrom
In-Reply-To: <alpine.LFD.2.00.1104200207340.1724@ja.ssi.bg>

Thanks everyone
I will send a new patch series today based on your comments

Regards
Hans

^ permalink raw reply

* Re: [PATCH 1/3] IPVS: Change of socket usage to enable name space exit.
From: Eric W. Biederman @ 2011-04-20 10:00 UTC (permalink / raw)
  To: Simon Horman
  Cc: Hans Schillstrom, ja, lvs-devel, netdev, netfilter-devel,
	hans.schillstrom
In-Reply-To: <20110419221145.GC12922@verge.net.au>

Simon Horman <horms@verge.net.au> writes:

> On Tue, Apr 19, 2011 at 05:25:03PM +0200, Hans Schillstrom wrote:
>> This is the first patch in a series of three.
>> The cleanup doesn't work when not exit in a clean way by using ipvsadm.
>> Killing of a namespace causes a hanging ipvs, this series will cure that.
>> 
>> If the sync daemons run in a namespace while it crashes
>> or get killed, there is no way to stop them except for a reboot.
>> 
>> Kernel threads should not increment the use count of a socket.
>> By calling sk_change_net() after creating a socket this is avoided.
>> sock_release cant be used, instead sk_release_kernel() should be used.
>> 
>> Thanks to Eric W Biederman.
>> 
>> This patch is based on net-next-2.6  ver 2.6.39-rc2
>
> Thanks Hans and Eric.
>
> Is it only this 1st patch that is intended for 2.6.39?
> The entire series feels a bit long to be applied
> this late in the rc series.

Hans was tracking two bugs that I gave him some advice on.
The first was the threads keeping namespaces from exiting
at the proper time, which the first patch appears to almost
fix.  I didn't see the code in that to ensure the threads
were gone.

The second was a something that sounded like packets flying
around during network namespace subsys cleanup.  Simply
not having network devices and sockets open in the namespace
should be enough to guarantee you don't have packets flying
around.

My feeling is that the patches are in the right neighbourhood
but not quite there yet.

Eric

^ permalink raw reply

* Re: [Bugme-new] [Bug 33502] New: Caught 64-bit read from uninitialized memory in __alloc_skb
From: Eric Dumazet @ 2011-04-20 10:02 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: casteyde.christian, Christoph Lameter, Andrew Morton, netdev,
	bugzilla-daemon, bugme-daemon, Vegard Nossum
In-Reply-To: <1303290464.3186.32.camel@edumazet-laptop>

Le mercredi 20 avril 2011 à 11:07 +0200, Eric Dumazet a écrit :
> I had one splat (before applying any patch), adding CONFIG_PREEMPT to my
> build :
> 
> [   84.629936] WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88011a7376f0)
> [   84.629939] 5868ba1a0188ffff5868ba1a0188ffff7078731a0188ffffe0e1181a0188ffff
> [   84.629952]  i i i i i i i i i i i i i i i i u u u u u u u u u u u u u u u u
> [   84.629963]                                  ^
> [   84.629964] 
> [   84.629966] Pid: 2060, comm: 05-wait_for_sys Not tainted 2.6.39-rc4-00237-ge3de956-dirty #550 HP ProLiant BL460c G6
> [   84.629969] RIP: 0010:[<ffffffff810f0e80>]  [<ffffffff810f0e80>] kmem_cache_alloc+0x50/0x190
> [   84.629977] RSP: 0018:ffff88011a0b9988  EFLAGS: 00010286
> [   84.629979] RAX: 0000000000000000 RBX: ffff88011a739a98 RCX: 0000000000620780
> [   84.629980] RDX: 0000000000620740 RSI: 0000000000017400 RDI: ffffffff810db8f5
> [   84.629982] RBP: ffff88011a0b99b8 R08: ffff88011a261dd8 R09: 0000000000000009
> [   84.629983] R10: 0000000000000002 R11: ffff88011fffce00 R12: ffff88011b00a600
> [   84.629985] R13: ffff88011a7376f0 R14: 00000000000000d0 R15: ffff88011abbb0c0
> [   84.629987] FS:  0000000000000000(0000) GS:ffff88011fc00000(0063) knlGS:00000000f76ff6c0
> [   84.629989] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> [   84.629991] CR2: ffff88011998dab8 CR3: 000000011a13f000 CR4: 00000000000006f0
> [   84.629992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   84.629994] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
> [   84.629995]  [<ffffffff810db8f5>] anon_vma_prepare+0x55/0x190
> [   84.630000]  [<ffffffff810cfe90>] handle_pte_fault+0x470/0x6e0
> [   84.630002]  [<ffffffff810d1654>] handle_mm_fault+0x124/0x1a0
> [   84.630005]  [<ffffffff8147a510>] do_page_fault+0x160/0x560
> [   84.630008]  [<ffffffff81477fcf>] page_fault+0x1f/0x30
> [   84.630013]  [<ffffffff810b31f8>] generic_file_aio_read+0x528/0x740
> [   84.630017]  [<ffffffff810f5461>] do_sync_read+0xd1/0x110
> [   84.630021]  [<ffffffff810f5c36>] vfs_read+0xc6/0x160
> [   84.630023]  [<ffffffff810f60b0>] sys_read+0x50/0x90
> [   84.630025]  [<ffffffff8147f4e9>] sysenter_dispatch+0x7/0x27
> [   84.630030]  [<ffffffffffffffff>] 0xffffffffffffffff
> 
> ffffffff810f0e6d:       0f 84 b5 00 00 00       je     ffffffff810f0f28 <kmem_cache_alloc+0xf8>
> ffffffff810f0e73:       49 63 44 24 20          movslq 0x20(%r12),%rax
> ffffffff810f0e78:       48 8d 4a 40             lea    0x40(%rdx),%rcx
> ffffffff810f0e7c:       49 8b 34 24             mov    (%r12),%rsi
> >>ffffffff810f0e80:       49 8b 5c 05 00          mov    0x0(%r13,%rax,1),%rbx
> ffffffff810f0e85:       4c 89 e8                mov    %r13,%rax
> ffffffff810f0e88:       e8 83 19 0e 00          callq  ffffffff811d2810 <this_cpu_cmpxchg16b_emu>
> ffffffff810f0e8d:       0f 1f 40 00             nopl   0x0(%rax)
> ffffffff810f0e91:       84 c0                   test   %al,%al
> ffffffff810f0e93:       74 c1                   je     ffffffff810f0e56 <kmem_cache_alloc+0x26>
> 
> 
> So this is definitely the access to object->next that triggers the fault.
> 
> Problem is : my (2nd) patch only reduces the window of occurrence of the bug, since we can
> be preempted/interrupted between the kmemcheck_mark_initialized() and access to object->next :
> 
> The preempt/irq could allocate the object and free it again.
> 
> So KMEMCHECK would require us to block IRQ to be safe.
> 
> Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did in my first patch.
> 
> Christoph, please reconsider first version of patch (disabling SLUB_CMPXCHG_DOUBLE if
> either KMEMCHECK or DEBUG_PAGEALLOC is defined)

Or this alternate one : keep cmpxchg_double() stuff but block IRQ in
slab_alloc() if KMEMCHECK or DEBUG_PAGEALLOC

I tested it and got no kmemcheck splat

Thanks

[PATCH] slub: block IRQ in slab_alloc() if KMEMCHECK or DEBUG_PAGEALLOC

Christian Casteyde reported a KMEMCHECK splat in slub code.

Problem is now we are lockless and allow IRQ in slab_alloc, the object
we manipulate from freelist can be allocated and freed right before we
try to read object->next

Same problem can happen with DEBUG_PAGEALLOC

Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Cc: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   16 ++++++++++++----
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 94d2a33..9c34a2b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1881,7 +1881,13 @@ debug:
  * If not then __slab_alloc is called for slow processing.
  *
  * Otherwise we can simply pick the next object from the lockless free list.
+ * Since we access object->next, we must disable irqs if KMEMCHECK or
+ * DEBUG_PAGEALLOC is defined, since object could be allocated and freed
+ * under us.
  */
+#if !defined(CONFIG_CMPXCHG_LOCAL) || defined(CONFIG_KMEMCHECK) || defined(CONFIG_DEBUG_PAGEALLOC)
+#define MASK_IRQ_IN_SLAB_ALLOC
+#endif
 static __always_inline void *slab_alloc(struct kmem_cache *s,
 		gfp_t gfpflags, int node, unsigned long addr)
 {
@@ -1889,16 +1895,18 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 	struct kmem_cache_cpu *c;
 #ifdef CONFIG_CMPXCHG_LOCAL
 	unsigned long tid;
-#else
+#endif
+#ifdef MASK_IRQ_IN_SLAB_ALLOC
 	unsigned long flags;
 #endif
 
 	if (slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifdef MASK_IRQ_IN_SLAB_ALLOC
 	local_irq_save(flags);
-#else
+#endif
+#ifdef CONFIG_CMPXCHG_LOCAL
 redo:
 #endif
 
@@ -1954,7 +1962,7 @@ redo:
 		stat(s, ALLOC_FASTPATH);
 	}
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifdef MASK_IRQ_IN_SLAB_ALLOC
 	local_irq_restore(flags);
 #endif
 



^ permalink raw reply related

* Re: [PATCH 3/3] IPVS: init and cleanup restructuring.
From: Hans Schillstrom @ 2011-04-20 10:41 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: horms, ebiederm, lvs-devel, netdev, netfilter-devel,
	hans.schillstrom
In-Reply-To: <alpine.LFD.2.00.1104200207340.1724@ja.ssi.bg>

On Wednesday, April 20, 2011 01:19:38 Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Tue, 19 Apr 2011, Hans Schillstrom wrote:
> 
> > This patch tries to restore the initial init and cleanup
> > sequences that was before name space patch.
> > 
> > The number of calls to register_pernet_device have been
> > reduced to one for the ip_vs.ko
> > Schedulers still have their own calls.
> > 
> > This patch adds a function __ip_vs_service_cleanup()
> > and a throttle or actually on/off switch for
> > the netfilter hooks.
> > 
> > The nf hooks will be enabled when the first service is loaded
> > and disabled when the last service is removed or when a
> > name space exit starts.
> 
> 	For me using _net suffix is more clear compared
> to __ prefix for the pernet methods.
> 
Sure I'll do that.

> 	For ip_vs_in: may be we can move the check here:
> 
> +	net = skb_net(skb);
> +	if (net->ipvs->throttle)
> +		return NF_ACCEPT;
>  	ip_vs_fill_iphdr(af, skb_network_header(skb), &iph);
>  
>  	/* Bad... Do not break raw sockets */

Done

> 
> 	It will save such checks later in ICMP funcs. But this
> throttle idea looks dangerous for cleanup. It does not use RCU.
> The readers can cache the 0 in throttle for long time.
> May be by using register_pernet_device we are in list with other
> devices and it is still possible some device used by
> our dst_cache to be unregistered before IPVS or we to be
> unregistered before such device and some race with throttle
> to happen. throttle is good when enabling traffic with
> the first virtual service, later it can slowly stop the traffic
> but we can not rely on it during netns cleanup.

OK , I will revert back to register_pernet_subsys(),
and use one single register_pernet_device()  exit hook for :
 - the throttle to disable traffic
 - stop the kernel threads.

> 
> 	So, there are 2 problems with the devices:
> 
> - if we use _device pernet registration we can see packets
> in our netns during cleanup. I assume this is possible
> when IPVS is unregistered before such devices.
> 
> - dests can cache dst and to hold the device after it is
> unregistered in netns, obviously for very short time until
> IPVS is later unregistered from netns. And for long time
> if device is unregistered but netns remains.
> 
> 	Also, in most of the cases svc->refcnt is above 0 because
> dests can be in trash list. You should be lucky to delete the
> service without any connections, only then ip_vs_svc_unhash can
> see refcnt == 0.
> 
> 	So, may be we have to use register_pernet_subsys (not
> _device). We need just to register notifier with
> register_netdevice_notifier and to catch NETDEV_UNREGISTER,
> so that if any dest uses this device we have to release the dst:
> 
I made a quick test of the concept and it seems to work, 
(A mix of new and old connections at 1GB/s into 4 namespaces when killing them all)

> 	- lock mutex
> 	- for every dest (also in trash):
> 	spin_lock_bh(&dest->dst_lock);
> 	if (dest->dst_cache && dest->dst_cache->dev == unregistered_dev)
> 		ip_vs_dst_reset(dest);
> 	spin_unlock_bh(&dest->dst_lock);


Here is a what i did

static inline void
__ip_vs_dev_reset(struct ip_vs_dest *dest, struct net_device *dev)
{
	spin_lock_bh(&dest->dst_lock);
	if (dest->dst_cache && dest->dst_cache->dev == dev)
		ip_vs_dst_reset(dest);
	spin_unlock_bh(&dest->dst_lock);
}

static void ip_vs_svc_reset(struct ip_vs_service *svc, struct net_device *dev)
{
	struct ip_vs_dest *dest;

	list_for_each_entry(dest, &svc->destinations, n_list) {
		__ip_vs_dev_reset(dest, dev);
	}
}

static int ip_vs_dst_event(struct notifier_block *this, unsigned long event,
			    void *ptr)
{
	struct net_device *dev = ptr;
	struct net *net = dev_net(dev);
	struct ip_vs_service *svc;
	struct ip_vs_dest *dest;
	unsigned int idx;

	if (event != NETDEV_UNREGISTER)
		return NOTIFY_DONE;

	EnterFunction(2);
	mutex_lock(&__ip_vs_mutex);
	for (idx = 0; idx<IP_VS_SVC_TAB_SIZE; idx++) {
		write_lock_bh(&__ip_vs_svc_lock);
		list_for_each_entry(svc, &ip_vs_svc_table[idx], s_list) {
			if (net_eq(svc->net, net))
				ip_vs_svc_reset(svc, dev);
		}
		list_for_each_entry(svc, &ip_vs_svc_fwm_table[idx], f_list) {
			if (net_eq(svc->net, net))
				ip_vs_svc_reset(svc, dev);
		}
		write_unlock_bh(&__ip_vs_svc_lock);
	}

	list_for_each_entry(dest, &net_ipvs(net)->dest_trash, n_list) {
		__ip_vs_dev_reset(dest, dev);
	}
	mutex_unlock(&__ip_vs_mutex);
	LeaveFunction(2);
	return NOTIFY_DONE;
}

static struct notifier_block ip_vs_dst_notifier = {
	.notifier_call = ip_vs_dst_event,
};


int __init ip_vs_control_init(void)
...
	at the end
	ret = register_netdevice_notifier(&ip_vs_dst_notifier);
...

and an unregister_ of course.


> 
> 	There are many examples for this, eg. net/core/fib_rules.c
> Then we are sure on cleanup we can not see traffic for our net
> because all devices are unregistered before us. We don't have
> to rely on throttle to stop the traffic during cleanup. And
> we do not hold devices after NETDEV_UNREGISTER.
> 
> 	I can prepare such patch but in next days. We need such
> code anyways because the dests can hold such dsts when no
> traffic is present and we can see again this "waiting for %s" ...
> message.
> 
> 	throttle still can be used but now it can not stop
> the traffic if connections exist.
> 
> 	For __ip_vs_service_cleanup: it still has to use mutex.

OK I will try to use unlock methods, if it doesn't work use the mutex.

> Or we can avoid it by introducing ip_vs_unlink_service_nolock:
> ip_vs_flush will look like your __ip_vs_service_cleanup and
> will call ip_vs_unlink_service_nolock. ip_vs_unlink_service_nolock
> will be called by ip_vs_flush and by ip_vs_unlink_service.
> You can try such changes, if not, I'll prepare some patches
> after 2-3 days.
> 

Regards
Hans

^ permalink raw reply

* Re: [PATCH BUG-FIX] ipv6: udp: fix the wrong headroom check
From: Herbert Xu @ 2011-04-20 10:50 UTC (permalink / raw)
  To: Shan Wei
  Cc: kuznet, David Miller, pekkas, jmorris,
	yoshfuji@linux-ipv6.org >> YOSHIFUJI Hideaki,
	Patrick McHardy, netdev
In-Reply-To: <4DAE9EE1.1050405@cn.fujitsu.com>

On Wed, Apr 20, 2011 at 04:52:49PM +0800, Shan Wei wrote:
> At this point, skb->data points to skb_transport_header.
> So, headroom check is wrong. 
> 
> For some case:bridge(UFO is on) + eth device(UFO is off),
> there is no enough headroom for IPv6 frag head.
> But headroom check is always false.
> 
> This will bring about data be moved to there prior to skb->head,
> when adding IPv6 frag header to skb.
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>

Ouch.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [RFC v3 5/6] j1939: rename NAME to UUID?
From: Oliver Hartkopp @ 2011-04-20 10:59 UTC (permalink / raw)
  To: Kurt Van Dijck
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20110420072439.GB332-ozGf4kBk5synFtIcQ8t7k3L8HoS0Hn3T@public.gmane.org>

On 20.04.2011 09:24, Kurt Van Dijck wrote:
> On Fri, Apr 15, 2011 at 07:57:25PM +0200, Oliver Hartkopp wrote:
>> On 13.04.2011 06:49, Kurt Van Dijck wrote:
>>> Oliver et.al.,
>>
>> Thinking about the approach to implement the j1939 address claiming (AC) in
>> userspace, i discovered two ways which could both be hidden inside some
>> easy-to-use helper functions:
>>
>> 1. implement a thread (e.g. within a library) which opens a CAN_RAW socket on
>> a specific CAN-interface and takes care of the AC procedure and monitors
>> ongoing AC procedures on the bus. In this case every j1939 application
>> requiring AC internally would monitor all the AC handling on itself (which
>> should be no general problem - written only once).
>>
>> 2. create j1939ac daemon(s) using PF_UNIX-sockets to be named e.g.
>> j1939ac_can0, j1939ac_can1, etc. - these daemons take care for all AC
>> requirements of the host it is running on. The PF_UNIX-sockets are used in
>> SOCK_DGRAM mode and only the j1939 processes that need AC can then register
>> their NAME by sending a request datagram, and get back the j1939-address once
>> it is claimed (and all the updates on changes). As the j1939ac daemons are
>> running on the same host as the j1939 application processes, optional the
>> process' PID could be provided to the daemon during the registering process,
>> so that the daemon can send a signal to a signal handler of the application
>> process (if you would like to omit the select() syscall to handle both the
>> j1939 and PF_UNIX sockets).
>>
>> ->   <Req><Name="A3B5667799332242" PID="12345">
>> <-   <Resp><ACState="claimed" Name="A3B5667799332242" Address="1B">
>> (some time)
>> <-   <Resp><ACState="changed" Name="A3B5667799332242" Address="1C">
>>
>> This is a sketch that could be put into simple C-structs that are sent via the
>> PF_UNIX DGRAM socket.
>>
>> In all suggested cases (using a thread, daemon with/without signal) the AC
>> procedure can be managed in userspace without real pain.
> I seriously doubt this statement.
> define 'real pain'.

The same pain as writing a DNS or DHCP daemon which is doing a similar job ...

>> But especially with
>> less pain than putting the AC process into kernelspace and provide your
>> suggested socket API with bind/connect/... in very different manners.
> * You're only counting LOC in kernel, and not in userspace.

Yes! Things that can be left out of the kernel, should be implemented and
maintained in userspace - at least to reduce complexity and potential security
issues.

> * The constructs you present create some kind of infrastructure. You will
>   need a torough documentation of the 'good practice' since that's crucial
>   to the correct operation of the stack. Did you ever consider the impact
>   on the userspace application side.

Yes.

>   Remember that userspace programs should be easy to write.
>   I found your proposals impact application development very hard.

define 'very hard' ;-)

> I still think my passive support in kernel performs better and gives an
> easier API to get things done with.

Kurt, the problem for me is, that you constantly state that your approach is
the best. For me it is not.

The major issue in your implementation is the lack of the possibility to
simulate several j1939 ECUs on one Linux host talking to each other via
virtual CAN busses to create a complete j1939 network. And so far you did not
address this request.

There are several j1939 implementations that are running (or can be made
running) completely in userspace using raw-sockets. Obviously many people are
convenient with these implementations.

I'm fine to place the j1939 data transfer part (supporting the segmented
transfer of long j1939 PDUs) into the kernel - but not all the address
claiming and the binding of j1939 addresses to network interfaces that also
kills the requested feature of simulating a complete j1939 network.

As an amicable approach i would suggest to proceed in two steps:

1. post and mainline the j1939 bits that deal with the data transfer only
   - no address claiming / j1939 name handling
   - no binding of j1939 addresses to CAN network devices
   - no extensions in the current af_can.c
   - etc.

2. discuss the implementation of the (optional) j1939 address claiming

For me this approach makes sense for j1939 newbies and also experienced j1939
users that may become interested in the Linux mainline implementation.

Then - discussing with a larger number of potential and real j1939 users - we
should face the address claiming and its possible implementation options.

Best regards,
Oliver

^ permalink raw reply

* Re: Add NAPI support to ll_temac driver
From: Michal Simek @ 2011-04-20 11:06 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Eric Dumazet, netdev
In-Reply-To: <1303218898.3464.59.camel@localhost>

Hi,

Ben Hutchings wrote:
> On Tue, 2011-04-19 at 14:48 +0200, Michal Simek wrote:
>> Ben Hutchings wrote:
>>> On Tue, 2011-04-19 at 12:43 +0200, Eric Dumazet wrote:
>>> [...]
>>>> One possible way to get better performance is to change driver to
>>>> allocate skbs only right before calling netif_rx(), so that you dont
>>>> have to access cold sk_buff data twice (once when allocating skb and put
>>>> it in ring buffer, a second time when receiving frame)
>>>>
>>>> drivers/net/niu.c is a good example for this (NAPI + netdev_alloc_skb()
>>>> just in time + pull in skbhead only first cache line of packet)
>>> [...]
>>>
>>> If the hardware can do RX checksumming (it's not clear) then the driver
>>> should pass the paged buffers into GRO and that will take care of skb
>>> allocation as necessary.
>> Hardware supports RX and TX partial checksumming. I can enable it. The driver 
>> has also this option and from my tests there is of course some performance 
>> improvemetn.
>>
>> Just for sure - here are links on documentation.
>> http://www.xilinx.com/support/documentation/ip_documentation/xps_ll_temac.pdf
>> or
>> http://www.xilinx.com/support/documentation/ip_documentation/axi_ethernet/v2_01_a/ds759_axi_ethernet.pdf
> 
> I'm not going to read those.  Just providing brief advice.
> 
>> About SKB allocation. I fixed our non mainline driver to allocate skb based on 
>> current mtu size. Mainline driver allocate max mtu (9k). This has also impact on 
>> performance because Microblaze works with smaller SKBs.
>>
>> Can you please be more specific about passing the paged buffers into GRO?
>> Or point me to any documentation or code which can help me to understand what 
>> that means.
> 
> You would use napi_get_frags() to get a new or recycled skb, fill in
> skb->frags, then call napi_gro_frags() to pass it into GRO.  The benet,
> cxgb3 and sfc drivers do this.

I have measured TX path and I have found that driver design is not so good.
It is always create one BD for one SKB and it starts DMA to copy packet to 
controller and send it. On 66MHz cpu it takes approximately 800 cpu cycles (not 
800 instructions) for sending (1.5k packet).
Current driver also enable irq for TX and when the packet is send interrupt is 
generated and skb is freed.
I see that it takes more time to handle the IRQ than busy waiting when DMA is 
done. I looked at sfc driver and there is any TX queue and any notifier. Hos 
does it work? Is it required to have any hw support?

Thanks,
Michal



-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox