Netdev List
 help / color / mirror / Atom feed
* [RFC][PATCH v5 06/19] Add a function to indicate if device use external buffer.
From: xiaohui.xin @ 2010-05-07  9:34 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-6-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 31d9c4a..0cb78f4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,11 @@ extern void netdev_mp_port_detach(struct net_device *dev);
 extern int netdev_mp_port_prep(struct net_device *dev,
 			struct mpassthru_port *port);
 
+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+	return (dev && dev->mp_port);
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v5 05/19] Add a function make external buffer owner to query capability.
From: xiaohui.xin @ 2010-05-07  9:34 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-5-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The external buffer owner can use the functions to get
the capability of the underlying NIC driver.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhaonew@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    2 +
 net/core/dev.c            |   51 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 183c786..31d9c4a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_attach(struct net_device *dev,
 				 struct mpassthru_port *port);
 extern void netdev_mp_port_detach(struct net_device *dev);
+extern int netdev_mp_port_prep(struct net_device *dev,
+			struct mpassthru_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index ecbb6b1..37b389a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2497,6 +2497,57 @@ void netdev_mp_port_detach(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_mp_port_detach);
 
+/* To support meidate passthru(zero-copy) with NIC driver,
+ * we'd better query NIC driver for the capability it can
+ * provide, especially for packet split mode, now we only
+ * query for the header size, and the payload a descriptor
+ * may carry. If a driver does not use the API to export,
+ * then we may try to use a default value, currently,
+ * we use the default value from an IGB driver. Now,
+ * it's only called by mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+			struct mpassthru_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	/* needed by packet split */
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.5.4.4

^ permalink raw reply related

* [RFC][PATCH v5 04/19] Add a ndo_mp_port_prep pointer to net_device_ops.
From: xiaohui.xin @ 2010-05-07  9:34 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-4-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

If the driver want to allocate external buffers,
then it can export it's capability, as the skb
buffer header length, the page length can be DMA, etc.
The external buffers owner may utilize this.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index efb575a..183c786 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -707,6 +707,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						struct mpassthru_port *port);
+#endif
 };
 
 /*
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v5 02/19] Add a new struct for device to manipulate external buffer.
From: xiaohui.xin @ 2010-05-07  9:34 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-2-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |   19 ++++++++++++++++++-
 1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..bae725c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,22 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/* Add a structure in structure net_device, the new field is
+ * named as mp_port. It's for mediate passthru (zero-copy).
+ * It contains the capability for the net device driver,
+ * a socket, and an external buffer creator, external means
+ * skb buffer belongs to the device may not be allocated from
+ * kernel space.
+ */
+struct mpassthru_port	{
+	int		hdr_len;
+	int		data_len;
+	int		npages;
+	unsigned	flags;
+	struct socket	*sock;
+	struct skb_external_page *(*ctor)(struct mpassthru_port *,
+				struct sk_buff *, int);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +968,8 @@ struct net_device {
 	struct macvlan_port	*macvlan_port;
 	/* GARP */
 	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mpassthru_port	*mp_port;
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
-- 
1.5.4.4

^ permalink raw reply related

* [RFC][PATCH v5 00/19] Provide a zero-copy method on KVM virtio-net.
From: xiaohui.xin @ 2010-05-07  9:34 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike

We provide an zero-copy method which driver side may get external
buffers to DMA. Here external means driver don't use kernel space
to allocate skb buffers. Currently the external buffer can be from
guest virtio-net driver.

The idea is simple, just to pin the guest VM user space and then
let host NIC driver has the chance to directly DMA to it. 
The patches are based on vhost-net backend driver. We add a device
which provides proto_ops as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. KVM guest who use the
vhost-net backend may bind any ethX interface in the host side to
get copyless data transfer thru guest virtio-net frontend.

patch 01-13:  	net core changes.
patch 14-18:  	new device as interface to mantpulate external buffers.
patch 19: 	for vhost-net.

The guest virtio-net driver submits multiple requests thru vhost-net
backend driver to the kernel. And the requests are queued and then
completed after corresponding actions in h/w are done.

For read, user space buffers are dispensed to NIC driver for rx when
a page constructor API is invoked. Means NICs can allocate user buffers
from a page constructor. We add a hook in netif_receive_skb() function
to intercept the incoming packets, and notify the zero-copy device.

For write, the zero-copy deivce may allocates a new host skb and puts
payload on the skb_shinfo(skb)->frags, and copied the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

Here, we have ever considered 2 ways to utilize the page constructor
API to dispense the user buffers.

One:	Modify __alloc_skb() function a bit, it can only allocate a 
	structure of sk_buff, and the data pointer is pointing to a 
	user buffer which is coming from a page constructor API.
	Then the shinfo of the skb is also from guest.
	When packet is received from hardware, the skb->data is filled
	directly by h/w. What we have done is in this way.

	Pros:	We can avoid any copy here.
	Cons:	Guest virtio-net driver needs to allocate skb as almost
		the same method with the host NIC drivers, say the size
		of netdev_alloc_skb() and the same reserved space in the
		head of skb. Many NIC drivers are the same with guest and
		ok for this. But some lastest NIC drivers reserves special
		room in skb head. To deal with it, we suggest to provide
		a method in guest virtio-net driver to ask for parameter
		we interest from the NIC driver when we know which device 
		we have bind to do zero-copy. Then we ask guest to do so.
		

Two:	Modify driver to get user buffer allocated from a page constructor
	API(to substitute alloc_page()), the user buffer are used as payload
	buffers and filled by h/w directly when packet is received. Driver
	should associate the pages with skb (skb_shinfo(skb)->frags). For 
	the head buffer side, let host allocates skb, and h/w fills it. 
	After that, the data filled in host skb header will be copied into
	guest header buffer which is submitted together with the payload buffer.

	Pros:	We could less care the way how guest or host allocates their
		buffers.
	Cons:	We still need a bit copy here for the skb header.

We are not sure which way is the better here. This is the first thing we want
to get comments from the community. We wish the modification to the network
part will be generic which not used by vhost-net backend only, but a user
application may use it as well when the zero-copy device may provides async
read/write operations later.

We have got comments from Michael. And he said the first method will break
the compatiblity of virtio-net driver and may complicate the qemu live 
migration. Currently, we tried to ignore the skb_reserve() if the device
is doing zero-copy. Then guest virtio-net driver wil not changed. So we now
continue to go with the first way. 
But comments about the two ways are still appreicated.

We provide multiple submits and asynchronous notifiicaton to 
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later. But for simple
test with netperf, we found bindwidth up and CPU % up too,
but the bindwidth up ratio is much more than CPU % up ratio.

What we have not done yet:
	packet split support
	To support GRO
	Performance tuning

what we have done in v1:
	polish the RCU usage
	deal with write logging in asynchroush mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it looks generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
	a small fix for missing dev_put when fail
	using dynamic minor instead of static minor number
	a __KERNEL__ protect to mp_get_sock()

what we have done in v2:
	
	remove most of the RCU usage, since the ctor pointer is only
	changed by BIND/UNBIND ioctl, and during that time, NIC will be
	stopped to get good cleanup(all outstanding requests are finished),
	so the ctor pointer cannot be raced into wrong situation.

	Remove the struct vhost_notifier with struct kiocb.
	Let vhost-net backend to alloc/free the kiocb and transfer them
	via sendmsg/recvmsg.

	use get_user_pages_fast() and set_page_dirty_lock() when read.

	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
	the async write logging is rewritten 
	a drafted synchronous write function for qemu live migration
	a limit for locked pages from get_user_pages_fast() to prevent Dos
	by using RLIMIT_MEMLOCK
	

what we have done in v4:
	add iocb completion callback from vhost-net to queue iocb in mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in mp device which access structures from vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for net core part.
	

what we have done in v5:
	address Arnd Bergmann's comments
		-remove IFF_MPASSTHRU_EXCL flag in mp device
		-Add CONFIG_COMPAT macro
		-remove mp_release ops
	move dev_is_mpassthru() as inline func
	fix a bug in memory relinquish
	Apply to current git tree.
			
performance:
	using netperf with GSO/TSO disabled, 10G NIC, 
	disabled packet split mode, with raw socket case compared to vhost.

	bindwidth will be from 1.1Gbps to 1.7Gbps
	CPU % from 120%-140% to 140%-160%

^ permalink raw reply

* [RFC][PATCH v5 17/19] Export proto_ops to vhost-net driver.
From: xiaohui.xin @ 2010-05-07  9:35 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-17-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Currently, vhost-net is only user to the mp device.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/mpassthru.c |  322 ++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 318 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index 8538a87..96b314a 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -577,8 +577,322 @@ failed:
 	return NULL;
 }
 
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor = NULL;
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	struct ethhdr *eth;
+	struct kiocb *iocb = NULL;
+	int len, i;
+
+	struct virtio_net_hdr hdr = {
+		.flags = 0,
+		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	};
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		if (skb_shinfo(skb)->destructor_arg) {
+			info = container_of(skb_shinfo(skb)->destructor_arg,
+					struct page_info, ext_page);
+			info->skb = skb;
+			if (skb->len > info->len) {
+				mp->dev->stats.rx_dropped++;
+				DBG(KERN_INFO "Discarded truncated rx packet: "
+				    " len %d > %zd\n", skb->len, info->len);
+				info->total = skb->len;
+				goto clean;
+			} else {
+				int i;
+				struct skb_shared_info *gshinfo =
+					(struct skb_shared_info *)
+					(&info->ushinfo);
+				struct skb_shared_info *hshinfo =
+					skb_shinfo(skb);
+
+				if (gshinfo->nr_frags < hshinfo->nr_frags)
+					goto clean;
+				eth = eth_hdr(skb);
+				skb_push(skb, ETH_HLEN);
+
+				hdr.hdr_len = skb_headlen(skb);
+				info->total = skb->len;
+
+				for (i = 0; i < gshinfo->nr_frags; i++)
+					gshinfo->frags[i].size = 0;
+				for (i = 0; i < hshinfo->nr_frags; i++)
+					gshinfo->frags[i].size =
+						hshinfo->frags[i].size;
+			}
+		} else {
+			/* The skb composed with kernel buffers
+			 * in case external buffers are not sufficent.
+			 * The case should be rare.
+			 */
+			unsigned long flags;
+			int i;
+			struct skb_shared_info *gshinfo = NULL;
+
+			info = NULL;
+
+			spin_lock_irqsave(&ctor->read_lock, flags);
+			if (!list_empty(&ctor->readq)) {
+				info = list_first_entry(&ctor->readq,
+						struct page_info, list);
+				list_del(&info->list);
+			}
+			spin_unlock_irqrestore(&ctor->read_lock, flags);
+			if (!info) {
+				DBG(KERN_INFO
+				    "No external buffer avaliable %p\n",
+				    skb);
+				skb_queue_head(&sk->sk_receive_queue,
+						skb);
+				break;
+			}
+			info->skb = skb;
+			/* compute the guest skb frags info */
+			gshinfo = (struct skb_shared_info *)
+				  (info->ext_page.start +
+				  SKB_DATA_ALIGN(info->ext_page.size));
+
+			if (gshinfo->nr_frags < skb_shinfo(skb)->nr_frags)
+				goto clean;
+
+			eth = eth_hdr(skb);
+			skb_push(skb, ETH_HLEN);
+			info->total = skb->len;
+
+			for (i = 0; i < gshinfo->nr_frags; i++)
+				gshinfo->frags[i].size = 0;
+			for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+				gshinfo->frags[i].size =
+					skb_shinfo(skb)->frags[i].size;
+			hdr.hdr_len = min_t(int, skb->len,
+					info->iov[1].iov_len);
+			skb_copy_datagram_iovec(skb, 0, info->iov, skb->len);
+		}
+
+		len = memcpy_toiovec(info->hdr, (unsigned char *)&hdr,
+				sizeof hdr);
+		if (len) {
+			DBG(KERN_INFO
+				"Unable to write vnet_hdr at addr %p: %d\n",
+				info->hdr->iov_base, len);
+			goto clean;
+		}
+
+		iocb = create_iocb(info, skb->len + sizeof(hdr));
+		continue;
+
+clean:
+		kfree_skb(skb);
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ctor->cache, info);
+	}
+	return;
+}
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct iovec *iov = m->msg_iov;
+	struct page_info *info = NULL;
+	struct frag frags[MAX_SKB_FRAGS];
+	struct sk_buff *skb;
+	int count = m->msg_iovlen;
+	int total = 0, header, n, i, len, rc;
+	unsigned long base;
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return -ENODEV;
+
+	total = iov_length(iov, count);
+
+	if (total < ETH_HLEN)
+		return -EINVAL;
+
+	if (total <= COPY_THRESHOLD)
+		goto copy;
+
+	n = 0;
+	for (i = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+		if (!len)
+			continue;
+		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (n > MAX_SKB_FRAGS)
+			return -EINVAL;
+	}
+
+copy:
+	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+	skb = alloc_skb(header + NET_IP_ALIGN, GFP_ATOMIC);
+	if (!skb)
+		goto drop;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+
+	skb_set_network_header(skb, ETH_HLEN);
+
+	memcpy_fromiovec(skb->data, iov, header);
+	skb_put(skb, header);
+	skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
+
+	if (header == total) {
+		rc = total;
+		info = alloc_small_page_info(ctor, iocb, total);
+	} else {
+		info = alloc_page_info(ctor, iocb, iov, count, frags, 0, total);
+		if (info)
+			for (i = 0; info->pages[i]; i++) {
+				skb_add_rx_frag(skb, i, info->pages[i],
+						frags[i].offset, frags[i].size);
+				info->pages[i] = NULL;
+			}
+	}
+	if (info != NULL) {
+		info->desc_pos = iocb->ki_pos;
+		info->total = total;
+		info->skb = skb;
+		skb_shinfo(skb)->destructor_arg = &info->ext_page;
+		skb->dev = mp->dev;
+		ctor->wq_len++;
+		dev_queue_xmit(skb);
+		return 0;
+	}
+drop:
+	kfree_skb(skb);
+	if (info) {
+		for (i = 0; info->pages[i]; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(info->ctor->cache, info);
+	}
+	mp->dev->stats.tx_dropped++;
+	return -ENOMEM;
+}
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len,
+		int flags)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_ctor *ctor;
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+	int npages, payload;
+	struct page_info *info;
+	struct frag frags[MAX_SKB_FRAGS];
+	unsigned long base;
+	int i, len;
+	unsigned long flag;
+
+	if (!(flags & MSG_DONTWAIT))
+		return -EINVAL;
+
+	ctor = rcu_dereference(mp->ctor);
+	if (!ctor)
+		return -EINVAL;
+
+	/* Error detections in case invalid external buffer */
+	if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
+			mp->dev->features & NETIF_F_SG) {
+		return -EINVAL;
+	}
+
+	npages = ctor->port.npages;
+	payload = ctor->port.data_len;
+
+	/* If KVM guest virtio-net FE driver use SG feature */
+	if (count > 2) {
+		for (i = 2; i < count; i++) {
+			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+			len = iov[i].iov_len;
+			if (npages == 1)
+				len = min_t(int, len, PAGE_SIZE - base);
+			else if (base)
+				break;
+			payload -= len;
+			if (payload <= 0)
+				goto proceed;
+			if (npages == 1 || (len & ~PAGE_MASK))
+				break;
+		}
+	}
+
+	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+		goto proceed;
+
+	return -EINVAL;
+
+proceed:
+	/* skip the virtnet head */
+	iov++;
+	count--;
+
+	if (!ctor->lock_pages)
+		set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
+				iocb->ki_user_data * 4096,
+				iocb->ki_user_data * 4096);
+
+	/* Translate address to kernel */
+	info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
+	if (!info)
+		return -ENOMEM;
+	info->len = total_len;
+	info->hdr[0].iov_base = iocb->ki_iovec[0].iov_base;
+	info->hdr[0].iov_len = iocb->ki_iovec[0].iov_len;
+	info->offset = frags[0].offset;
+	info->desc_pos = iocb->ki_pos;
+
+	iov--;
+	count++;
+
+	memcpy(info->iov, iov, sizeof(struct iovec) * count);
+
+	spin_lock_irqsave(&ctor->read_lock, flag);
+	list_add_tail(&info->list, &ctor->readq);
+	spin_unlock_irqrestore(&ctor->read_lock, flag);
+
+	ctor->rq_len++;
+
+	return 0;
+}
+
 /* Ops structure to mimic raw sockets with mp device */
 static const struct proto_ops mp_socket_ops = {
+	.sendmsg = mp_sendmsg,
+	.recvmsg = mp_recvmsg,
 };
 
 static struct proto mp_proto = {
@@ -701,10 +1015,10 @@ static long mp_chr_ioctl(struct file *file, unsigned int cmd,
 		sk->sk_sndbuf = INT_MAX;
 		container_of(sk, struct mp_sock, sk)->mp = mp;
 
-		sk->sk_destruct = NULL;
-		sk->sk_data_ready = NULL;
-		sk->sk_write_space = NULL;
-		sk->sk_state_change = NULL;
+		sk->sk_destruct = mp_sock_destruct;
+		sk->sk_data_ready = mp_sock_data_ready;
+		sk->sk_write_space = mp_sock_write_space;
+		sk->sk_state_change = mp_sock_state_change;
 		ret = mp_attach(mp, file);
 		if (ret < 0)
 			goto err_free_sk;
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v5 16/19] Manipulate external buffers in mp device.
From: xiaohui.xin @ 2010-05-07  9:35 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-16-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

How external buffer comes from, how to destroy.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/mpassthru.c |  254 ++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 251 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index 33cc123..8538a87 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -160,6 +160,39 @@ static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
 	return ret;
 }
 
+/* The main function to allocate external buffers */
+static struct skb_external_page *page_ctor(struct mpassthru_port *port,
+		struct sk_buff *skb, int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_ctor *ctor;
+	struct page_info *info = NULL;
+
+	ctor = container_of(port, struct page_ctor, port);
+
+	spin_lock_irqsave(&ctor->read_lock, flags);
+	if (!list_empty(&ctor->readq)) {
+		info = list_first_entry(&ctor->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&ctor->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++) {
+		get_page(info->pages[i]);
+		info->frag[i].page = info->pages[i];
+		info->frag[i].page_offset = i ? 0 : info->offset;
+		info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
+			port->data_len;
+	}
+	info->skb = skb;
+	info->ext_page.frags = info->frag;
+	info->ext_page.ushinfo = &info->ushinfo;
+	return &info->ext_page;
+}
+
 static int page_ctor_attach(struct mp_struct *mp)
 {
 	int rc;
@@ -192,7 +225,7 @@ static int page_ctor_attach(struct mp_struct *mp)
 
 	dev_hold(dev);
 	ctor->dev = dev;
-	ctor->port.ctor = NULL;
+	ctor->port.ctor = page_ctor;
 	ctor->port.sock = &mp->socket;
 	ctor->lock_pages = 0;
 	rc = netdev_mp_port_attach(dev, &ctor->port);
@@ -260,11 +293,66 @@ static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
 	return 0;
 }
 
+static void relinquish_resource(struct page_ctor *ctor)
+{
+	if (!(ctor->dev->flags & IFF_UP) &&
+			!(ctor->wq_len + ctor->rq_len))
+		kmem_cache_destroy(ctor->cache);
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+	struct page_info *info = (struct page_info *)(iocb->private);
+	int i;
+
+	if (info->flags == INFO_READ) {
+		for (i = 0; i < info->pnum; i++) {
+			if (info->pages[i]) {
+				set_page_dirty_lock(info->pages[i]);
+				put_page(info->pages[i]);
+			}
+		}
+		info->skb->destructor = NULL;
+		kfree_skb(info->skb);
+		info->ctor->rq_len--;
+	} else
+		info->ctor->wq_len--;
+	/* Decrement the number of locked pages */
+	info->ctor->lock_pages -= info->pnum;
+	kmem_cache_free(info->ctor->cache, info);
+	relinquish_resource(info->ctor);
+
+	return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+	struct kiocb *iocb = NULL;
+
+	iocb = info->iocb;
+	if (!iocb)
+		return iocb;
+	iocb->ki_flags = 0;
+	iocb->ki_users = 1;
+	iocb->ki_key = 0;
+	iocb->ki_ctx = NULL;
+	iocb->ki_cancel = NULL;
+	iocb->ki_retry = NULL;
+	iocb->ki_iovec = NULL;
+	iocb->ki_eventfd = NULL;
+	iocb->ki_pos = info->desc_pos;
+	iocb->ki_nbytes = size;
+	iocb->ki_dtor(iocb);
+	iocb->private = (void *)info;
+	iocb->ki_dtor = mp_ki_dtor;
+
+	return iocb;
+}
+
 static int page_ctor_detach(struct mp_struct *mp)
 {
 	struct page_ctor *ctor;
 	struct page_info *info;
-	struct kiocb *iocb = NULL;
 	int i;
 
 	/* locked by mp_mutex */
@@ -276,12 +364,17 @@ static int page_ctor_detach(struct mp_struct *mp)
 		for (i = 0; i < info->pnum; i++)
 			if (info->pages[i])
 				put_page(info->pages[i]);
+		create_iocb(info, 0);
+		ctor->rq_len--;
 		kmem_cache_free(ctor->cache, info);
 	}
+
+	relinquish_resource(ctor);
+
 	set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
 			   ctor->o_rlim.rlim_cur,
 			   ctor->o_rlim.rlim_max);
-	kmem_cache_destroy(ctor->cache);
+
 	netdev_mp_port_detach(ctor->dev);
 	dev_put(ctor->dev);
 
@@ -329,6 +422,161 @@ static void mp_put(struct mp_file *mfile)
 		mp_detach(mfile->mp);
 }
 
+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_external_page *ext_page)
+{
+	struct page_info *info;
+	struct page_ctor *ctor;
+	struct sock *sk;
+	struct sk_buff *skb;
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	if (!ext_page)
+		return;
+	info = container_of(ext_page, struct page_info, ext_page);
+	if (!info)
+		return;
+	ctor = info->ctor;
+	skb = info->skb;
+
+	if ((info->flags == INFO_READ) && info->skb)
+		info->skb->head = NULL;
+
+	/* If the info->total is 0, make it to be reused */
+	if (!info->total) {
+		spin_lock_irqsave(&ctor->read_lock, flags);
+		list_add(&info->list, &ctor->readq);
+		spin_unlock_irqrestore(&ctor->read_lock, flags);
+		return;
+	}
+
+	if (info->flags == INFO_READ)
+		return;
+
+	/* For transmit, we should wait for the DMA finish by hardware.
+	 * Queue the notifier to wake up the backend driver
+	 */
+
+	iocb = create_iocb(info, info->total);
+
+	sk = ctor->port.sock->sk;
+	sk->sk_write_space(sk);
+
+	return;
+}
+
+/* For small exteranl buffers transmit, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+		struct kiocb *iocb, int total)
+{
+	struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->total = total;
+	info->ext_page.dtor = page_dtor;
+	info->ctor = ctor;
+	info->flags = INFO_WRITE;
+	info->iocb = iocb;
+	return info;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the external buffer address.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+		struct kiocb *iocb, struct iovec *iov,
+		int count, struct frag *frags,
+		int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base, lock_limit;
+	struct page_info *info = NULL;
+
+	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+	lock_limit >>= PAGE_SHIFT;
+
+	if (ctor->lock_pages + count > lock_limit) {
+		printk(KERN_INFO "exceed the locked memory rlimit.");
+		return NULL;
+	}
+
+	info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+				&info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif
+
+	info->total = total;
+	info->ext_page.dtor = page_dtor;
+	info->ctor = ctor;
+	info->pnum = j;
+	info->iocb = iocb;
+	if (!npages)
+		info->flags = INFO_WRITE;
+	if (info->flags == INFO_READ) {
+		info->ext_page.start = (u8 *)(((unsigned long)
+				(pfn_to_kaddr(page_to_pfn(info->pages[0]))) +
+				frags[0].offset));
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+		info->ext_page.size = SKB_DATA_ALIGN(
+				iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD);
+#else
+		info->ext_page.size = SKB_DATA_ALIGN(
+				iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD) -
+			NET_IP_ALIGN - NET_SKB_PAD;
+#endif
+	}
+	/* increment the number of locked pages */
+	ctor->lock_pages += j;
+	return info;
+
+failed:
+	for (i = 0; i < j; i++)
+		put_page(info->pages[i]);
+
+	kmem_cache_free(ctor->cache, info);
+
+	return NULL;
+}
+
 /* Ops structure to mimic raw sockets with mp device */
 static const struct proto_ops mp_socket_ops = {
 };
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v5 14/19] Add header file for mp device.
From: xiaohui.xin @ 2010-05-07  9:35 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-14-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/mpassthru.h |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v5 13/19] To skip GRO if buffer is external currently.
From: xiaohui.xin @ 2010-05-07  9:35 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-13-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/dev.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index dc2f225..6c6b2fe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2787,6 +2787,10 @@ enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 	if (skb_is_gso(skb) || skb_has_frags(skb))
 		goto normal;
 
+	/* currently GRO is not supported by mediate passthru */
+	if (dev_is_mpassthru(skb->dev))
+		goto normal;
+
 	rcu_read_lock();
 	list_for_each_entry_rcu(ptype, head, list) {
 		if (ptype->type != type || ptype->dev || !ptype->gro_receive)
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v5 11/19] Use callback to deal with skb_release_data() specially.
From: xiaohui.xin @ 2010-05-07  9:34 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-11-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

If buffer is external, then use the callback to destruct
buffers.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/skbuff.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 169f22c..5d93b2d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 
 static void skb_release_data(struct sk_buff *skb)
 {
+	/* check if the skb has external buffers, we have use destructor_arg
+	 * here to indicate
+	 */
+	struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
+
 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
@@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		/* if the skb has external buffers, use destructor here,
+		 * since after that skb->head will be kfree, in case skb->head
+		 * from external buffer cannot use kfree to destroy.
+		 */
+		if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
+			ext_page->dtor(ext_page);
 		kfree(skb->head);
 	}
 }
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v5 03/19] Export 2 func for device to assign/deassign new strucure
From: xiaohui.xin @ 2010-05-07  9:34 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-3-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    3 +++
 net/core/dev.c            |   28 ++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bae725c..efb575a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1592,6 +1592,9 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_attach(struct net_device *dev,
+				 struct mpassthru_port *port);
+extern void netdev_mp_port_detach(struct net_device *dev);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index f769098..ecbb6b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2469,6 +2469,34 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+/* Export two functions to assign/de-assign mp_port pointer
+ * to a net device.
+ */
+
+int netdev_mp_port_attach(struct net_device *dev,
+			struct mpassthru_port *port)
+{
+	/* locked by mp_mutex */
+	if (rcu_dereference(dev->mp_port))
+		return -EBUSY;
+
+	rcu_assign_pointer(dev->mp_port, port);
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_mp_port_attach);
+
+void netdev_mp_port_detach(struct net_device *dev)
+{
+	/* locked by mp_mutex */
+	if (!rcu_dereference(dev->mp_port))
+		return;
+
+	rcu_assign_pointer(dev->mp_port, NULL);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(netdev_mp_port_detach);
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v5 01/19] Add a new structure for skb buffer from external.
From: xiaohui.xin @ 2010-05-07  9:34 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1273224906-4874-1-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..cf309c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,18 @@ struct skb_shared_info {
 	void *		destructor_arg;
 };
 
+/* The structure is for a skb which skb->data may point to
+ * an external buffer, which is not allocated from kernel space.
+ * Since the buffer is external, then the shinfo or frags are
+ * also extern too. It also contains a destructor for itself.
+ */
+struct skb_external_page {
+	u8              *start;
+	int             size;
+	struct skb_frag_struct *frags;
+	struct skb_shared_info *ushinfo;
+	void		(*dtor)(struct skb_external_page *);
+};
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.5.4.4


^ permalink raw reply related

* Re: VLAN I/F's and TX queue.
From: Joakim Tjernlund @ 2010-05-07  9:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Patrick McHardy, netdev
In-Reply-To: <1273222403.2261.26.camel@edumazet-laptop>

Eric Dumazet <eric.dumazet@gmail.com> wrote on 2010/05/07 10:53:23:
>
> Le vendredi 07 mai 2010 à 10:04 +0200, Joakim Tjernlund a écrit :
> > Joakim Tjernlund/Transmode wrote on 2010/05/03 13:34:28:
> > >
> > > We noted dropped pkgs on our VLAN interfaces and i stated to look
> > > for a cause. Here is a ifconfig example:
> > >
> > > eth0      Link encap:Ethernet  HWaddr 00:AA:BB:CC:DD:EE
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:8886910 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:8880219 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:100
> > >           RX bytes:1626842951 (1.5 GiB)  TX bytes:1555540810 (1.4 GiB)
> > >
> > > eth0.1    Link encap:Ethernet  HWaddr 00:AA:BB:CC:DD:EE
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:2163164 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:2161943 errors:0 dropped:98 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:2467090557 (2.2 GiB)  TX bytes:2480246455 (2.3 GiB)
> > >
> > > eth0.1.1  Link encap:Ethernet  HWaddr 00:AA:BB:CC:DD:EE
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:2163164 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:2161943 errors:0 dropped:98 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:2458437901 (2.2 GiB)  TX bytes:2471598683 (2.3 GiB)
> > >
> > > Here I note that txqueuelen is 0 for eth0.1/eth0.1.1 and 100 for eth0 and
> > > that it is only eth0.1 and eth0.1.1 that drops pkgs. It feels as if eth0.1
> > > bypasses eth0's tx queue and passes pkgs directly to the HW driver. Is that so?
> > > If so, that feels a bit strange and I am not sure how to best
> > > fix this. Any ides?
> > >
> > > Using kernel 2.6.33
> >
> > So I did some more testing
> > two nodes A and B connected over a slow link.
> > Create two VLAN's as above and start pinging from A to B
> > with pkg size 9600, start a few(4-10) parallel ping processes.
> >
> > Now I see dropped packages on B, the receiver of pings, and no
> > pkg loss on A.
> >
>
> dropped on RX path or TX path ?

On TX path(see the ifconfig listing above)

>
> > 1) since the link is symmetrical, why do I only see pkg loss
> >    at B?
> >
> > 2) pkg loss in B only manifests on the VLAN's interfaces and
> >    always in pair as if one lost pkg is counted twice?
> >
>
> Congestion notifications can be stacked since commit cbbef5e183079
> (vlan/macvlan: propagate transmission state to upper layers)

I see.

>
> > 3) I would expect lost pkgs to be accounted on eth0 instead of
> >    the VLAN interface(s) since that is where the pkg is lost, why
> >    isn't it so?
>
> You try to send packets on eth0.XXX, some are dropped, and accounted for
> on eth0.XXX stats. What is wrong with this ?

In this case one lost pkg is accounted for twice, once on eth0.1 and
once more on eth0.1.1. Note that eth0.1.1 is stacked on
top of eth0.1

I would at least expect eth0 to also account lost pkgs too.
I was confused by the current accounting as I knew that
the underlying HW I/F should be the only I/F that could
drop pkgs.

>
> If you want to avoid this, just add queues to your vlans
>
> ip link add link eth0 eth0.103 txqueuelen 100 type vlan id 103

From memory now, but that didn't help. Still accounts pgks
as described. Why would where to account pkgs be affected by
queue or no queue?

  Jocke


^ permalink raw reply

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: David Miller @ 2010-05-07  9:12 UTC (permalink / raw)
  To: lars.eggert; +Cc: eric.dumazet, bhaskie, shemminger, bhutchings, netdev
In-Reply-To: <70125082-4A0B-4E54-A7F6-B6CDC0375E78@nokia.com>

From: Lars Eggert <lars.eggert@nokia.com>
Date: Fri, 7 May 2010 11:46:40 +0300

> You'll obviously still want TCP-MD5 support for talking to existing
> equipment, but significant cycles would IMO be better spent on
> TCP-AO.

Code we have and users use which is unstable and crashes is more
important to work on and fix than code we don't have which
user's therefore don't use.

You're wrong from just about every possible angle.

Whoever finds AO useful will work on it, just as was the case with MD5
support.  And right now, that "whoever" is definitely not us. :-)


^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Increase NET_SKB_PAD to 64 bytes
From: Michal Simek @ 2010-05-07  9:02 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, hadi, therbert, microblaze-uclinux
In-Reply-To: <20100507.013216.133893678.davem@davemloft.net>

David Miller wrote:
> From: Michal Simek <monstr@monstr.eu>
> Date: Fri, 07 May 2010 09:53:48 +0200
> 
>> I will add this Microblaze patch to my repo for testing and anyway
>> should go through my repo.
> 
> It's already in the net-next-2.6 tree.

hmm. Not happy from it. I will test that patch from linux-next.

Michal



-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

^ permalink raw reply

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Bhaskar Dutta @ 2010-05-07  8:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, Ben Hutchings, netdev, David Miller
In-Reply-To: <1273219222.2261.11.camel@edumazet-laptop>

On Fri, May 7, 2010 at 1:30 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 07 mai 2010 à 07:39 +0200, Eric Dumazet a écrit :
>> Le jeudi 06 mai 2010 à 17:25 +0530, Bhaskar Dutta a écrit :
>> > On Thu, May 6, 2010 at 12:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> > > I am not familiar with this code, but I suspect same per_cpu data can be
>> > > used at both time by a sender (process context) and by a receiver
>> > > (softirq context).
>> > >
>> > > To trigger this, you need at least two active md5 sockets.
>> > >
>> > > tcp_get_md5sig_pool() should probably disable bh to make sure current
>> > > cpu wont be preempted by softirq processing
>> > >
>> > >
>> > > Something like :
>> > >
>> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
>> > > index fb5c66b..e232123 100644
>> > > --- a/include/net/tcp.h
>> > > +++ b/include/net/tcp.h
>> > > @@ -1221,12 +1221,15 @@ struct tcp_md5sig_pool          *tcp_get_md5sig_pool(void)
>> > >        struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
>> > >        if (!ret)
>> > >                put_cpu();
>> > > +       else
>> > > +               local_bh_disable();
>> > >        return ret;
>> > >  }
>> > >
>> > >  static inline void             tcp_put_md5sig_pool(void)
>> > >  {
>> > >        __tcp_put_md5sig_pool();
>> > > +       local_bh_enable();
>> > >        put_cpu();
>> > >  }
>> > >
>> > >
>> > >
>> >
>> > I put in the above change and ran some load tests with around 50
>> > active TCP connections doing MD5.
>> > I could see only 1 bad packet in 30 min (earlier the problem used to
>> > occur instantaneously and repeatedly).
>> >
>>
>>
>> > I think there is another possibility of being preempted when calling
>> > tcp_alloc_md5sig_pool()
>> > this function releases the spinlock when calling __tcp_alloc_md5sig_pool().
>> >
>> > I will run some more tests after changing the  tcp_alloc_md5sig_pool
>> > and see if the problem is completely resolved.
>
> Here is my official patch submission, could you please test it ?
>


Eric,

Thanks a lot! I will test it out and let you know.
BTW this patch seems to essentially do the same as the earlier fix you
had posted (where you just do bh disable/enable).
Am I missing something?

With the earlier fix, I ran load tests with 80 TCP connections for
over 6 hrs and found 5 bad checksum packets.
So there is still a problem. Without the fix I see a bad packet every
minute or so.

Bhaskar

^ permalink raw reply

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Eric Dumazet @ 2010-05-07  8:55 UTC (permalink / raw)
  To: Lars Eggert
  Cc: Bhaskar Dutta, Stephen Hemminger, Ben Hutchings,
	netdev@vger.kernel.org
In-Reply-To: <70125082-4A0B-4E54-A7F6-B6CDC0375E78@nokia.com>

Le vendredi 07 mai 2010 à 11:46 +0300, Lars Eggert a écrit :
> On 2010-5-6, at 15:06, Eric Dumazet wrote:
> > This code should be completely rewritten for linux-2.6.35, its very ugly
> > and over complex, yet it is not scalable.
> 
> Let me quickly point out that the IETF has recently replaced TCP-MD5 with TCP-AO (http://tools.ietf.org/html/draft-ietf-tcpm-tcp-auth-opt-11) which is days or weeks away from being published as an RFC. (The draft has been approved and it waiting for the RFC Editor to finish the final copy-editing.)
> 
> You'll obviously still want TCP-MD5 support for talking to existing equipment, but significant cycles would IMO be better spent on TCP-AO.
> 

Thanks for this info !



^ permalink raw reply

* Re: VLAN I/F's and TX queue.
From: Eric Dumazet @ 2010-05-07  8:53 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: netdev, Patrick McHardy
In-Reply-To: <OF3F4D7278.DC6A6E51-ONC125771C.002B1AF4-C125771C.002C64E3@transmode.se>

Le vendredi 07 mai 2010 à 10:04 +0200, Joakim Tjernlund a écrit :
> Joakim Tjernlund/Transmode wrote on 2010/05/03 13:34:28:
> >
> > We noted dropped pkgs on our VLAN interfaces and i stated to look
> > for a cause. Here is a ifconfig example:
> >
> > eth0      Link encap:Ethernet  HWaddr 00:AA:BB:CC:DD:EE
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:8886910 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:8880219 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:100
> >           RX bytes:1626842951 (1.5 GiB)  TX bytes:1555540810 (1.4 GiB)
> >
> > eth0.1    Link encap:Ethernet  HWaddr 00:AA:BB:CC:DD:EE
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:2163164 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:2161943 errors:0 dropped:98 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:2467090557 (2.2 GiB)  TX bytes:2480246455 (2.3 GiB)
> >
> > eth0.1.1  Link encap:Ethernet  HWaddr 00:AA:BB:CC:DD:EE
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:2163164 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:2161943 errors:0 dropped:98 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:2458437901 (2.2 GiB)  TX bytes:2471598683 (2.3 GiB)
> >
> > Here I note that txqueuelen is 0 for eth0.1/eth0.1.1 and 100 for eth0 and
> > that it is only eth0.1 and eth0.1.1 that drops pkgs. It feels as if eth0.1
> > bypasses eth0's tx queue and passes pkgs directly to the HW driver. Is that so?
> > If so, that feels a bit strange and I am not sure how to best
> > fix this. Any ides?
> >
> > Using kernel 2.6.33
> 
> So I did some more testing
> two nodes A and B connected over a slow link.
> Create two VLAN's as above and start pinging from A to B
> with pkg size 9600, start a few(4-10) parallel ping processes.
> 
> Now I see dropped packages on B, the receiver of pings, and no
> pkg loss on A.
> 

dropped on RX path or TX path ?

> 1) since the link is symmetrical, why do I only see pkg loss
>    at B?
> 
> 2) pkg loss in B only manifests on the VLAN's interfaces and
>    always in pair as if one lost pkg is counted twice?
> 

Congestion notifications can be stacked since commit cbbef5e183079
(vlan/macvlan: propagate transmission state to upper layers)

> 3) I would expect lost pkgs to be accounted on eth0 instead of
>    the VLAN interface(s) since that is where the pkg is lost, why
>    isn't it so?

You try to send packets on eth0.XXX, some are dropped, and accounted for
on eth0.XXX stats. What is wrong with this ?

If you want to avoid this, just add queues to your vlans

ip link add link eth0 eth0.103 txqueuelen 100 type vlan id 103

Patrick what do you think of special casing NET_XMIT_CN ?


diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index b5249c5..c671b1a 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -327,6 +327,8 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
 	len = skb->len;
 	ret = dev_queue_xmit(skb);
 
+	ret = net_xmit_eval(ret);
+
 	if (likely(ret == NET_XMIT_SUCCESS)) {
 		txq->tx_packets++;
 		txq->tx_bytes += len;
@@ -353,6 +355,8 @@ static netdev_tx_t vlan_dev_hwaccel_hard_start_xmit(struct sk_buff *skb,
 	len = skb->len;
 	ret = dev_queue_xmit(skb);
 
+	ret = net_xmit_eval(ret);
+
 	if (likely(ret == NET_XMIT_SUCCESS)) {
 		txq->tx_packets++;
 		txq->tx_bytes += len;



^ permalink raw reply related

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Lars Eggert @ 2010-05-07  8:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Bhaskar Dutta, Stephen Hemminger, Ben Hutchings,
	netdev@vger.kernel.org
In-Reply-To: <1273147586.2357.63.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 616 bytes --]

On 2010-5-6, at 15:06, Eric Dumazet wrote:
> This code should be completely rewritten for linux-2.6.35, its very ugly
> and over complex, yet it is not scalable.

Let me quickly point out that the IETF has recently replaced TCP-MD5 with TCP-AO (http://tools.ietf.org/html/draft-ietf-tcpm-tcp-auth-opt-11) which is days or weeks away from being published as an RFC. (The draft has been approved and it waiting for the RFC Editor to finish the final copy-editing.)

You'll obviously still want TCP-MD5 support for talking to existing equipment, but significant cycles would IMO be better spent on TCP-AO.

Lars

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 2490 bytes --]

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Increase NET_SKB_PAD to 64 bytes
From: David Miller @ 2010-05-07  8:32 UTC (permalink / raw)
  To: monstr; +Cc: eric.dumazet, netdev, hadi, therbert, microblaze-uclinux
In-Reply-To: <4BE3C70C.4060705@monstr.eu>

From: Michal Simek <monstr@monstr.eu>
Date: Fri, 07 May 2010 09:53:48 +0200

> I will add this Microblaze patch to my repo for testing and anyway
> should go through my repo.

It's already in the net-next-2.6 tree.

^ permalink raw reply

* Re: VLAN I/F's and TX queue.
From: Joakim Tjernlund @ 2010-05-07  8:04 UTC (permalink / raw)
  To: netdev
In-Reply-To: <OF5A42C874.3AF220FE-ONC1257718.003ABC6E-C1257718.003F94D2@LocalDomain>

Joakim Tjernlund/Transmode wrote on 2010/05/03 13:34:28:
>
> We noted dropped pkgs on our VLAN interfaces and i stated to look
> for a cause. Here is a ifconfig example:
>
> eth0      Link encap:Ethernet  HWaddr 00:AA:BB:CC:DD:EE
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:8886910 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:8880219 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:100
>           RX bytes:1626842951 (1.5 GiB)  TX bytes:1555540810 (1.4 GiB)
>
> eth0.1    Link encap:Ethernet  HWaddr 00:AA:BB:CC:DD:EE
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:2163164 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:2161943 errors:0 dropped:98 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:2467090557 (2.2 GiB)  TX bytes:2480246455 (2.3 GiB)
>
> eth0.1.1  Link encap:Ethernet  HWaddr 00:AA:BB:CC:DD:EE
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:2163164 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:2161943 errors:0 dropped:98 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:2458437901 (2.2 GiB)  TX bytes:2471598683 (2.3 GiB)
>
> Here I note that txqueuelen is 0 for eth0.1/eth0.1.1 and 100 for eth0 and
> that it is only eth0.1 and eth0.1.1 that drops pkgs. It feels as if eth0.1
> bypasses eth0's tx queue and passes pkgs directly to the HW driver. Is that so?
> If so, that feels a bit strange and I am not sure how to best
> fix this. Any ides?
>
> Using kernel 2.6.33

So I did some more testing
two nodes A and B connected over a slow link.
Create two VLAN's as above and start pinging from A to B
with pkg size 9600, start a few(4-10) parallel ping processes.

Now I see dropped packages on B, the receiver of pings, and no
pkg loss on A.

1) since the link is symmetrical, why do I only see pkg loss
   at B?

2) pkg loss in B only manifests on the VLAN's interfaces and
   always in pair as if one lost pkg is counted twice?

3) I would expect lost pkgs to be accounted on eth0 instead of
   the VLAN interface(s) since that is where the pkg is lost, why
   isn't it so?

    Jocke


^ permalink raw reply

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Eric Dumazet @ 2010-05-07  8:00 UTC (permalink / raw)
  To: Bhaskar Dutta; +Cc: Stephen Hemminger, Ben Hutchings, netdev, David Miller
In-Reply-To: <1273210774.2222.45.camel@edumazet-laptop>

Le vendredi 07 mai 2010 à 07:39 +0200, Eric Dumazet a écrit :
> Le jeudi 06 mai 2010 à 17:25 +0530, Bhaskar Dutta a écrit :
> > On Thu, May 6, 2010 at 12:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > > I am not familiar with this code, but I suspect same per_cpu data can be
> > > used at both time by a sender (process context) and by a receiver
> > > (softirq context).
> > >
> > > To trigger this, you need at least two active md5 sockets.
> > >
> > > tcp_get_md5sig_pool() should probably disable bh to make sure current
> > > cpu wont be preempted by softirq processing
> > >
> > >
> > > Something like :
> > >
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index fb5c66b..e232123 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -1221,12 +1221,15 @@ struct tcp_md5sig_pool          *tcp_get_md5sig_pool(void)
> > >        struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
> > >        if (!ret)
> > >                put_cpu();
> > > +       else
> > > +               local_bh_disable();
> > >        return ret;
> > >  }
> > >
> > >  static inline void             tcp_put_md5sig_pool(void)
> > >  {
> > >        __tcp_put_md5sig_pool();
> > > +       local_bh_enable();
> > >        put_cpu();
> > >  }
> > >
> > >
> > >
> > 
> > I put in the above change and ran some load tests with around 50
> > active TCP connections doing MD5.
> > I could see only 1 bad packet in 30 min (earlier the problem used to
> > occur instantaneously and repeatedly).
> > 
> 
> 
> > I think there is another possibility of being preempted when calling
> > tcp_alloc_md5sig_pool()
> > this function releases the spinlock when calling __tcp_alloc_md5sig_pool().
> > 
> > I will run some more tests after changing the  tcp_alloc_md5sig_pool
> > and see if the problem is completely resolved.

Here is my official patch submission, could you please test it ?

Thanks

[PATCH] tcp: fix MD5 (RFC2385) support

TCP MD5 support uses percpu data for temporary storage. It currently
disables preemption so that same storage cannot be reclaimed by another
thread on same cpu.

We also have to make sure a softirq handler wont try to use also same
context. Various bug reports demonstrated corruptions.

Fix is to disable preemption and BH.

Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/tcp.h |   21 +++------------------
 net/ipv4/tcp.c    |   34 ++++++++++++++++++++++++----------
 2 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 75be5a2..aa04b9a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1197,30 +1197,15 @@ extern int			tcp_v4_md5_do_del(struct sock *sk,
 extern struct tcp_md5sig_pool * __percpu *tcp_alloc_md5sig_pool(struct sock *);
 extern void			tcp_free_md5sig_pool(void);
 
-extern struct tcp_md5sig_pool	*__tcp_get_md5sig_pool(int cpu);
-extern void			__tcp_put_md5sig_pool(void);
+extern struct tcp_md5sig_pool	*tcp_get_md5sig_pool(void);
+extern void			tcp_put_md5sig_pool(void);
+
 extern int tcp_md5_hash_header(struct tcp_md5sig_pool *, struct tcphdr *);
 extern int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *, struct sk_buff *,
 				 unsigned header_len);
 extern int tcp_md5_hash_key(struct tcp_md5sig_pool *hp,
 			    struct tcp_md5sig_key *key);
 
-static inline
-struct tcp_md5sig_pool		*tcp_get_md5sig_pool(void)
-{
-	int cpu = get_cpu();
-	struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
-	if (!ret)
-		put_cpu();
-	return ret;
-}
-
-static inline void		tcp_put_md5sig_pool(void)
-{
-	__tcp_put_md5sig_pool();
-	put_cpu();
-}
-
 /* write queue abstraction */
 static inline void tcp_write_queue_purge(struct sock *sk)
 {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0f8caf6..296150b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2839,7 +2839,6 @@ static void __tcp_free_md5sig_pool(struct tcp_md5sig_pool * __percpu *pool)
 			if (p->md5_desc.tfm)
 				crypto_free_hash(p->md5_desc.tfm);
 			kfree(p);
-			p = NULL;
 		}
 	}
 	free_percpu(pool);
@@ -2937,25 +2936,40 @@ retry:
 
 EXPORT_SYMBOL(tcp_alloc_md5sig_pool);
 
-struct tcp_md5sig_pool *__tcp_get_md5sig_pool(int cpu)
+
+/**
+ *	tcp_get_md5sig_pool - get md5sig_pool for this user
+ *
+ *	We use percpu structure, so if we succeed, we exit with preemption
+ *	and BH disabled, to make sure another thread or softirq handling
+ *	wont try to get same context.
+ */
+struct tcp_md5sig_pool *tcp_get_md5sig_pool(void)
 {
 	struct tcp_md5sig_pool * __percpu *p;
-	spin_lock_bh(&tcp_md5sig_pool_lock);
+
+	local_bh_disable();
+
+	spin_lock(&tcp_md5sig_pool_lock);
 	p = tcp_md5sig_pool;
 	if (p)
 		tcp_md5sig_users++;
-	spin_unlock_bh(&tcp_md5sig_pool_lock);
-	return (p ? *per_cpu_ptr(p, cpu) : NULL);
-}
+	spin_unlock(&tcp_md5sig_pool_lock);
+
+	if (p)
+		return *per_cpu_ptr(p, smp_processor_id());
 
-EXPORT_SYMBOL(__tcp_get_md5sig_pool);
+	local_bh_enable();
+	return NULL;
+}
+EXPORT_SYMBOL(tcp_get_md5sig_pool);
 
-void __tcp_put_md5sig_pool(void)
+void tcp_put_md5sig_pool(void)
 {
+	local_bh_enable();
 	tcp_free_md5sig_pool();
 }
-
-EXPORT_SYMBOL(__tcp_put_md5sig_pool);
+EXPORT_SYMBOL(tcp_put_md5sig_pool);
 
 int tcp_md5_hash_header(struct tcp_md5sig_pool *hp,
 			struct tcphdr *th)



^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: Increase NET_SKB_PAD to 64 bytes
From: Michal Simek @ 2010-05-07  7:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, hadi, therbert, microblaze-uclinux
In-Reply-To: <1273209321.2222.36.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le jeudi 06 mai 2010 à 22:02 -0700, David Miller a écrit :
> 
>> Seeing this made me go check who was overriding NET_IP_ALIGN or
>> NET_SKB_PAD.
>>
>> The powerpc bits are legitimate, but the microblaze case is complete
>> bogosity.  It defines NET_IP_ALIGN to the default (2) and sets
>> NET_SKB_PAD to L1_CACHE_BYTES which on microblaze is 4 and
>> significantly smaller than the default.
>>
>> So I'm going to delete them in net-next-2.6 like so:
>>
>> --------------------
>> microblaze: Kill NET_SKB_PAD and NET_IP_ALIGN overrides.
>>
>> NET_IP_ALIGN defaults to 2, no need to override.
>>
>> NET_SKB_PAD is now 64, which is much larger than microblaze's
>> L1_CACHE_SIZE so no need to override that either.
>>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>> ---
>>  arch/microblaze/include/asm/system.h |   10 ----------
>>  1 files changed, 0 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/microblaze/include/asm/system.h b/arch/microblaze/include/asm/system.h
>> index 48c4f03..b1e2f07 100644
>> --- a/arch/microblaze/include/asm/system.h
>> +++ b/arch/microblaze/include/asm/system.h
>> @@ -97,14 +97,4 @@ extern struct dentry *of_debugfs_root;
>>  
>>  #define arch_align_stack(x) (x)
>>  
>> -/*
>> - * MicroBlaze doesn't handle unaligned accesses in hardware.
>> - *
>> - * Based on this we force the IP header alignment in network drivers.
>> - * We also modify NET_SKB_PAD to be a cacheline in size, thus maintaining
>> - * cacheline alignment of buffers.
>> - */
>> -#define NET_IP_ALIGN	2
>> -#define NET_SKB_PAD	L1_CACHE_BYTES
>> -
>>  #endif /* _ASM_MICROBLAZE_SYSTEM_H */
> 
> Yes, this seems strange it actually worked if L1_CACHE_BYTES = 4

This was fault which I fixed. I sent pull request to Linus yesterday 
with contains patch which fix it.
L1_CACHE_BYTES was setup to 32 which is maximum cache line length on 
Microblaze.

I will add this Microblaze patch to my repo for testing and anyway 
should go through my repo.

Thanks,
Michal

-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

^ permalink raw reply

* Re: [PATCH v21 020/100] c/r: documentation
From: Oren Laadan @ 2010-05-07  6:54 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andrew Morton, containers, linux-kernel, Serge Hallyn,
	Matt Helsley, Pavel Emelyanov, linux-api, linux-mm, linux-fsdevel,
	netdev, Dave Hansen
In-Reply-To: <4BE32640.8010402@oracle.com>


Thanks for reading carefully through and pointing out
glitches and inconsistencies. I'll fix it for next post.

Oren.


On 05/06/2010 04:27 PM, Randy Dunlap wrote:
> On Sat,  1 May 2010 10:15:02 -0400 Oren Laadan wrote:
> 
>> Covers application checkpoint/restart, overall design, interfaces,
>> usage, shared objects, and and checkpoint image format.
>>
>> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
>> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
>> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
>> Tested-by: Serge E. Hallyn <serue@us.ibm.com>
>> ---
>>  Documentation/checkpoint/checkpoint.c      |   38 +++
>>  Documentation/checkpoint/readme.txt        |  370 ++++++++++++++++++++++++++++
>>  Documentation/checkpoint/self_checkpoint.c |   69 +++++
>>  Documentation/checkpoint/self_restart.c    |   40 +++
>>  Documentation/checkpoint/usage.txt         |  247 +++++++++++++++++++
>>  5 files changed, 764 insertions(+), 0 deletions(-)
>>  create mode 100644 Documentation/checkpoint/checkpoint.c
>>  create mode 100644 Documentation/checkpoint/readme.txt
>>  create mode 100644 Documentation/checkpoint/self_checkpoint.c
>>  create mode 100644 Documentation/checkpoint/self_restart.c
>>  create mode 100644 Documentation/checkpoint/usage.txt
> 
>> diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
>> new file mode 100644
>> index 0000000..4fa5560
>> --- /dev/null
>> +++ b/Documentation/checkpoint/readme.txt
>> @@ -0,0 +1,370 @@
>> +
> ...
>> +In contrast, when checkpointing a subtree of a container it is up to
>> +the user to ensure that dependencies either don't exist or can be
>> +safely ignored. This is useful, for instance, for HPC scenarios or
>> +even a user that would like to periodically checkpoint a long-running
> 
>                who
> 
>> +batch job.
>> +
> ...
> 
>> +
>> +Checkpoint image format
>> +=======================
>> +
> ...
> 
>> +
>> +The container configuration section containers information that is
> 
>                                        contains
> 
>> +global to the container. Security (LSM) configuration is one example.
>> +Network configuration and container-wide mounts may also go here, so
>> +that the userspace restart coordinator can re-create a suitable
>> +environment.
>> +
> ...
> 
>> +
>> +Then the state of all tasks is saved, in the order that they appear in
>> +the tasks array above. For each state, we save data like task_struct,
>> +namespaces, open files, memory layout, memory contents, cpu state,
> 
>                                                            CPU (throughout, please)
> 
>> +signals and signal handlers, etc. For resources that are shared among
>> +multiple processes, we first checkpoint said resource (and only once),
>> +and in the task data we give a reference to it. More about shared
>> +resources below.
>> +
> ...
> 
>> +
>> +Shared objects
>> +==============
>> +
>> +Many resources may be shared by multiple tasks (e.g. file descriptors,
>> +memory address space, etc), or even have multiple references from
> 
>                          etc.),
> 
>> +other resources (e.g. a single inode that represents two ends of a
>> +pipe).
>> +
> ...
> 
>> +Memory contents format
>> +======================
>> +
>> +The memory contents of a given memory address space (->mm) is dumped
> 
>                                                               are (I think)
> 
>> +as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
>> +This header details the vma properties, and a reference to a file
>> +(if file backed) or an inode (or shared memory) object.
>> +
>> +The vma header is followed by the actual contents - but only those
>> +pages that need to be saved, i.e. dirty pages. They are written in
>> +chunks of data, where each chunks contains a header that indicates
> 
>                               chunk
> 
>> +that number of pages in the chunk, followed by an array of virtual
> 
>    the
> 
>> +addresses and then an array of actual page contents. The last chunk
>> +holds zero pages.
>> +
> ...
> 
>> +Kernel interfaces
>> +=================
>> +
>> +* To checkpoint a vma, the 'struct vm_operations_struct' needs to
>> +  provide a method ->checkpoint:
>> +    int checkpoint(struct ckpt_ctx *, struct vma_struct *)
>> +  Restart requires a matching (exported) restore:
>> +    int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
>> +
>> +* To checkpoint a file, the 'struct file_operations' needs to provide
>> +  the methods ->checkpoint and ->collect:
>> +    int checkpoint(struct ckpt_ctx *, struct file *)
>> +    int collect(struct ckpt_ctx *, struct file *)
>> +  Restart requires a matching (exported) restore:
>> +    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
>> +  For most file systems, generic_file_{checkpoint,restore}() can be
>> +  used.
>> +
>> +* To checkpoint a socket, the 'struct proto_ops' needs to provide
> 
>      To checkpoint/restart a socket,
> 
>> +  the methods ->checkpoint, ->collect and ->restore:
>> +    int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
>> +    int collect(struct ckpt_ctx *ctx, struct socket *sock);
>> +    int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)
> 
> 
>> diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
>> new file mode 100644
>> index 0000000..c6fc045
>> --- /dev/null
>> +++ b/Documentation/checkpoint/usage.txt
>> @@ -0,0 +1,247 @@
>> +
>> +	      How to use Checkpoint-Restart
>> +	=========================================
>> +
>> +
>> +API
>> +===
>> +
>> +The API consists of three new system calls:
>> +
>> +* long checkpoint(pid_t pid, int fd, unsigned long flag, int logfd);
> 
>                                                       flags,
> 
>> +
>> + Checkpoint a (sub-)container whose root task is identified by @pid,
>> + to the open file indicated by @fd. If @logfd isn't -1, it indicates
>> + an open file to which error and debug messages are written. @flags
>> + may be one or more of:
>> +   - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
>> + (other value are not allowed).
>> +
>> + Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
>> + it returns from a restart, and -1 if an error occurs. The ckptid will
>> + uniquely identify a checkpoint image, for as long as the checkpoint
>> + is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
>> + partial checkpoint, residing in kernel memory).
>> +
>> +* long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd);
>> +
>> + Restart a process hierarchy from a checkpoint image that is read from
>> + the blob stored in the file indicated by @fd.  If @logfd isn't -1, it
>> + indicates an open file to which error and debug messages are written.
>> + @flags will have future meaning (must be 0 for now). @pid indicates
>> + the root of the hierarchy as seen in the coordinator's pid-namespace,
>> + and is expected to be a child of the coordinator. @flags may be one
>> + or more of:
>> +   - RESTART_TASKSELF : (self) restart of a single process
>> +   - RESTART_FROEZN : processes remain frozen once restart completes
> 
>                 FROZEN ?
> 
>> +   - RESTART_GHOST : process is a ghost (placeholder for a pid)
> 
> about @flags:  Above says both of these:
> a) @flags will have future meaning (must be 0 for now)
> b) @flags may be one or more of:
> 
> so please decide which one it is ;)
> 
>> + (Note that this argument may mean 'ckptid' to identify an in-kernel
>> + checkpoint image, with some @flags in the future).
>> +
>> + Returns: -1 if an error occurs, 0 on success when restarting from a
>> + "self" checkpoint, and return value of system call at the time of the
>> + checkpoint when restarting from an "external" checkpoint.
>> +
> ...
>> +
>> +Sysctl/proc
>> +===========
>> +
>> +/proc/sys/kernel/ckpt_unpriv_allowed		[default = 1]
>> +  controls whether c/r operation is allowed for unprivileged users
> 
>                       C/R
> 
>> +
>> +
>> +Operation
>> +=========
>> +
>> +The granularity of a checkpoint usually is a process hierarchy. The
>> +'pid' argument is interpreted in the caller's pid namespace. So to
>> +checkpoint a container whose init task (pid 1 in that pidns) appears
>> +as pid 3497 the caller's pidns, the caller must use pid 3497. Passing
>> +pid 1 will attempt to checkpoint the caller's container, and if the
>> +caller isn't privileged and init is owned by root, it will fail.
>> +
>> +Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
>> +which does not refer to a container's init task, then sys_checkpoint()
>> +would return -EINVAL.
> 
>    returns -EINVAL.
> 
> ...
> 
>> +
>> +
>> +User tools
>> +==========
>> +
>> +* checkpoint(1): a tool to perform a checkpoint of a container/subtree
>> +* restart(1): a tool to restart a container/subtree
>> +* ckptinfo: a tool to examine a checkpoint image
>> +
>> +It is best to use the dedicated user tools for checkpoint and restart.
>> +
>> +If you insist, then here is a code snippet that illustrates how a
>> +checkpoint is initiated by a process inside a container - the logic is
>> +similar to fork():
>> +	...
>> +	ckptid = checkpoint(0, ...);
>> +	switch (crid) {
> 
> 	       (ckptid) ?
> 
>> +	case -1:
>> +		perror("checkpoint failed");
>> +		break;
>> +	default:
>> +		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);
> 
> s/ret/ckptid/ ?
> 
>> +		/* proceed with execution after checkpoint */
>> +		...
>> +		break;
>> +	case 0:
>> +		fprintf(stderr, "returned after restart\n");
>> +		/* proceed with action required following a restart */
>> +		...
>> +		break;
>> +	}
>> +	...
>> +
>> +And to initiate a restart, the process in an empty container can use
>> +logic similar to execve():
>> +	...
>> +	if (restart(pid, ...) < 0)
>> +		perror("restart failed");
>> +	/* only get here if restart failed */
>> +	...
>> +
>> +Note, that the code also supports "self" checkpoint, where a process
> 
>    Note that
> 
>> +can checkpoint itself. This mode does not capture the relationships of
>> +the task with other tasks, or any shared resources. It is useful for
>> +application that wish to be able to save and restore their state.
> 
>    applications
> 
>> +They will either not use (or care about) shared resources, or they
>> +will be aware of the operations and adapt suitably after a restart.
>> +The code above can also be used for "self" checkpoint.
>> +
>> +
>> +You may find the following sample programs useful:
>> +
>> +* checkpoint.c: accepts a 'pid' and checkpoint that task to stdout
> 
>                                        checkpoints
> 
>> +* self_checkpoint.c: a simple test program doing self-checkpoint
>> +* self_restart.c: restarts a (self-) checkpoint image from stdin
>> +
>> +See also the utilities 'checkpoint' and 'restart' (from user-cr).
>> +
>> +
>> +"External" checkpoint
>> +=====================
>> +
>> +To do "external" checkpoint, you need to first freeze that other task
>> +either using the freezer cgroup.
> 
> eh?  cannot parse that.
> 
>> +
>> +Restart does not preserve the original PID yet, (because we haven't
>> +solved yet the fork-with-specific-pid issue). In a real scenario, you
>> +probably want to first create a new names space, and have the init
> 
>                                        namespace,
> 
>> +task there call 'sys_restart()'.
>> +
>> +I tested it this way:
> 
> ...
> 
> ---
> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [microblaze-uclinux] Re: [PATCH net-next-2.6] net: Increase NET_SKB_PAD to 64 bytes
From: David Miller @ 2010-05-07  6:29 UTC (permalink / raw)
  To: john.williams; +Cc: microblaze-uclinux, netdev, hadi, therbert, monstr
In-Reply-To: <m2g1d3f23371005062228y267b398cue82f518c7638929e@mail.gmail.com>

From: John Williams <john.williams@petalogix.com>
Date: Fri, 7 May 2010 15:28:08 +1000

> There will be some patches coming from Michal that cleans all of this
> up - MicroBlaze has a configurable cacheline length, we have some
> patches that set this to the longest possible (32 bytes) in a
> conservative assumption.

32 it the old default, the new one is 64.

So either way, these settings are completely bogus.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox