Netdev List


* [PATCH v12 00/17] Provide a zero-copy method on KVM virtio-net.
From: xiaohui.xin @ 2010-09-30 14:04 UTC
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike

We provide a zero-copy method by which the driver side may get external
buffers to DMA into. Here external means the driver does not use kernel
space to allocate skb buffers. Currently the external buffers can come
from the guest virtio-net driver.

The idea is simple: pin the guest VM's user-space pages and then
give the host NIC driver the chance to DMA directly to them.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops (sendmsg/recvmsg) to vhost-net to
send/receive directly to/from the NIC driver. A KVM guest that uses the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.
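
To give a feel for what pinning involves: a guest buffer rarely starts on a
page boundary, so the number of pages to pin depends on both its offset and
its length. Below is a minimal user-space sketch of that arithmetic. It
assumes 4 KiB pages; pages_spanned() and the SKETCH_* macros are illustrative
names and are not part of the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel uses PAGE_SHIFT/PAGE_SIZE instead. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Number of pages touched by the byte range [base, base + len). */
size_t pages_spanned(size_t base, size_t len)
{
	size_t offset = base & (SKETCH_PAGE_SIZE - 1);

	/* Guard against overflow in the rounding below. */
	assert(len <= SIZE_MAX - 2 * SKETCH_PAGE_SIZE);
	if (len == 0)
		return 0;
	/* Round (offset + len) up to a whole number of pages. */
	return (offset + len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

For example, a 4096-byte buffer that starts one byte into a page touches two
pages, so two pages would have to be pinned for it.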

patch 01-10:  	net core and kernel changes.
patch 11-13:  	new device as an interface to manipulate external buffers.
patch 14: 	for vhost-net.
patch 15:	An example of modifying a NIC driver to use napi_gro_frags().
patch 16:	An example of how to get guest buffers in a driver
		that uses napi_gro_frags().
patch 17:	A patch to address comments from Michael S. Tsirkin
		by adding 2 new ioctls to the mp device.
		We split it out here to make review easier.
		It still needs revision.

The guest virtio-net driver submits multiple requests through the vhost-net
backend driver to the kernel. The requests are queued and then
completed after the corresponding actions in h/w are done.
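
The submit/complete split described above can be pictured with a small
user-space model: a request is marked pending at submit time, flipped to done
when the (simulated) hardware finishes, and only then reaped by the backend.
This is a sketch under our own assumptions; the sketch_* names and the
fixed-size slot table are hypothetical and do not reflect how the mp device
actually tracks its kiocbs.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_REQS 8

enum sketch_state { SKETCH_FREE, SKETCH_PENDING, SKETCH_DONE };

struct sketch_req {
	enum sketch_state state;
	size_t bytes;		/* filled in at completion time */
};

struct sketch_queue {
	struct sketch_req reqs[SKETCH_MAX_REQS];
};

/* Submit a request; returns its slot index, or -1 if every slot is in use. */
int sketch_submit(struct sketch_queue *q)
{
	assert(q != NULL);
	for (int i = 0; i < SKETCH_MAX_REQS; i++) {
		if (q->reqs[i].state == SKETCH_FREE) {
			q->reqs[i].state = SKETCH_PENDING;
			q->reqs[i].bytes = 0;
			return i;
		}
	}
	return -1;
}

/* Called when the simulated hardware has finished the request in `slot`. */
void sketch_complete(struct sketch_queue *q, int slot, size_t bytes)
{
	assert(q != NULL && slot >= 0 && slot < SKETCH_MAX_REQS);
	assert(q->reqs[slot].state == SKETCH_PENDING);
	q->reqs[slot].state = SKETCH_DONE;
	q->reqs[slot].bytes = bytes;
}

/* Reap one completed request; returns its slot, or -1 if none is done yet. */
int sketch_reap(struct sketch_queue *q, size_t *bytes)
{
	assert(q != NULL && bytes != NULL);
	for (int i = 0; i < SKETCH_MAX_REQS; i++) {
		if (q->reqs[i].state == SKETCH_DONE) {
			*bytes = q->reqs[i].bytes;
			q->reqs[i].state = SKETCH_FREE;
			return i;
		}
	}
	return -1;
}
```

Note that a request submitted later may be reaped first if the hardware
completes it first; the model does not impose completion order.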

For read, user space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked. This means NICs can allocate user buffers
from a page constructor. We add a hook in the netif_receive_skb() function
to intercept the incoming packets and notify the zero-copy device.
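
As a rough illustration of the read path, the following user-space sketch
models a pool of buffers posted by the guest and a constructor that hands
them to the driver in FIFO order, returning NULL when nothing has been
posted. The sketch_* names and the fixed-size ring are assumptions made for
the example; the real device keeps a locked list of pinned pages.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_POOL_SLOTS 4

struct sketch_pool {
	void *bufs[SKETCH_POOL_SLOTS];	/* ring of posted buffers */
	size_t head;			/* next buffer to hand out */
	size_t count;			/* buffers currently posted */
};

/* Guest side: post a receive buffer. Returns 0, or -1 if the ring is full. */
int sketch_post_buffer(struct sketch_pool *pool, void *buf)
{
	assert(pool != NULL && buf != NULL);
	if (pool->count == SKETCH_POOL_SLOTS)
		return -1;
	pool->bufs[(pool->head + pool->count) % SKETCH_POOL_SLOTS] = buf;
	pool->count++;
	return 0;
}

/*
 * Driver side: the "page constructor". Returns the oldest posted buffer,
 * or NULL when the guest has not posted any.
 */
void *sketch_page_ctor(struct sketch_pool *pool)
{
	void *buf;

	assert(pool != NULL);
	if (pool->count == 0)
		return NULL;
	buf = pool->bufs[pool->head];
	pool->head = (pool->head + 1) % SKETCH_POOL_SLOTS;
	pool->count--;
	return buf;
}
```

A driver built on such a constructor has to cope with the NULL case, since
the guest may not have posted buffers fast enough.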

For write, the zero-copy device allocates a new host skb, puts the
payload in skb_shinfo(skb)->frags, and copies the header to skb->data.
The request remains pending until the skb is transmitted by h/w.
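
Placing the payload in frags means describing the guest buffer as a list of
(offset, size) pieces, one per page. The sketch below shows that split in
user space, assuming 4 KiB pages. struct sketch_frag loosely mirrors the
struct frag used in the mp device patch, while split_into_frags() is a
hypothetical helper written only for this illustration; the header copy is
not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

struct sketch_frag {
	uint16_t offset;	/* offset of the fragment inside its page */
	uint16_t size;		/* payload bytes located in that page */
};

/*
 * Split the byte range [base, base + len) into per-page fragments.
 * Returns the number of fragments written, or -1 if more than `max`
 * fragments would be needed.
 */
int split_into_frags(size_t base, size_t len,
		     struct sketch_frag *frags, int max)
{
	int n = 0;

	assert(frags != NULL && max > 0);
	while (len > 0) {
		size_t offset = base & (SKETCH_PAGE_SIZE - 1);
		size_t room = SKETCH_PAGE_SIZE - offset;
		size_t size = len < room ? len : room;

		if (n == max)
			return -1;
		frags[n].offset = (uint16_t)offset;
		frags[n].size = (uint16_t)size;
		base += size;
		len -= size;
		n++;
	}
	return n;
}
```

Only the first fragment can start at a non-zero offset; every later fragment
begins at the start of its page.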

We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later.

What we have not done yet:
	Performance tuning

what we have done in v1:
	polish the RCU usage
	deal with write logging in asynchronous mode in vhost
	add notifier block for mp device
	rename page_ctor to mp_port in netdevice.h to make it look generic
	add mp_dev_change_flags() for mp device to change NIC state
	add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
	a small fix for a missing dev_put on failure
	use a dynamic minor instead of a static minor number
	a __KERNEL__ protect to mp_get_sock()

what we have done in v2:
	
	remove most of the RCU usage, since the ctor pointer is only
	changed by the BIND/UNBIND ioctls, and during that time the NIC is
	stopped to get a clean shutdown (all outstanding requests are finished),
	so the ctor pointer cannot be raced into a wrong state.

	Replace struct vhost_notifier with struct kiocb.
	Let the vhost-net backend alloc/free the kiocb and transfer them
	via sendmsg/recvmsg.

	use get_user_pages_fast() and set_page_dirty_lock() on read.

	Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
	the async write logging is rewritten
	a draft synchronous write function for qemu live migration
	a limit on pages locked by get_user_pages_fast(), using
	RLIMIT_MEMLOCK, to prevent DoS
	

what we have done in v4:
	add iocb completion callback from vhost-net to queue iocb in mp device
	replace vq->receiver by mp_sock_data_ready()
	remove stuff in mp device which access structures from vhost-net
	modify skb_reserve() to ignore host NIC driver reserved space
	rebase to the latest vhost tree
	split large patches into small pieces, especially for net core part.
	

what we have done in v5:
	address Arnd Bergmann's comments
		-remove IFF_MPASSTHRU_EXCL flag in mp device
		-Add CONFIG_COMPAT macro
		-remove mp_release ops
	move dev_is_mpassthru() as inline func
	fix a bug in memory relinquish
	Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
	move create_iocb() out of page_dtor, which may run in interrupt context
	-This removes the potential issue of taking a lock in interrupt context
	make the caches used by mp and vhost static, created/destroyed in the
	module init/exit functions.
	-This allows multiple mp guests to be created at the same time.

what we have done in v7:
	some cleanup to prepare for PS mode support

what we have done in v8:
	discard the modifications that pointed skb->data to the guest buffer directly.
	Add code to modify the driver to support napi_gro_frags(), following
	Herbert's comments, in order to support PS mode.
	Add mergeable buffer support in mp device.
	Add GSO/GRO support in mp device.
	Address comments from Eric Dumazet about cache line and rcu usage.

what we have done in v9:
	The v8 patch was based on a fix in dev_gro_receive().
	But Herbert did not agree with the fix we sent out,
	and he suggested another fix. v9 is rebased onto that fix.
	

what we have done in v10:
	Fix a partial csum error.
	Clean up some unused fields in struct page_info{} in mp device.
	Change kmem_cache_zalloc() to kmem_cache_alloc(), based on comments from Michael S. Tsirkin.

what we have done in v11:
	Address comments from Michael S. Tsirkin by adding two new ioctls to the mp device.
	But this still needs revision.

what we have done in v12:
	Address most comments from Ben Hutchings, except the compat ioctls.
	As the comments are scattered, they are not split into a separate patch.
	Change struct mpassthru_port to struct mp_port, and struct page_ctor
	to struct page_pool.
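
The locked-page limit mentioned under v3 comes down to comparing a running
count of pinned pages against RLIMIT_MEMLOCK expressed in pages. A minimal
sketch of that check follows, assuming 4 KiB pages. sketch_may_lock() is a
hypothetical helper, not a function from the series; in user space the limit
in bytes could be read with getrlimit(RLIMIT_MEMLOCK, ...).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Return true if pinning `new_pages` more pages keeps the total at or
 * below the RLIMIT_MEMLOCK soft limit, given here in bytes.
 */
bool sketch_may_lock(size_t locked_pages, size_t new_pages,
		     unsigned long long memlock_bytes)
{
	unsigned long long limit_pages = memlock_bytes >> SKETCH_PAGE_SHIFT;

	assert(new_pages > 0);
	return (unsigned long long)locked_pages + new_pages <= limit_pages;
}
```

With a 64 KiB limit this admits exactly 16 pages; a request that would bring
the total to 17 is refused.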
 
Performance:
	We have seen the requests for performance data on the mailing list,
	and we are now looking into this.


* [PATCH v12 16/17] An example how to alloc user buffer based on napi_gro_frags() interface.
From: xiaohui.xin @ 2010-09-30 14:04 UTC
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

This example is based on the ixgbe driver, which uses napi_gro_frags().
It can get buffers from the guest side directly using netdev_alloc_page()
and release guest buffers using netdev_free_page().

---
 drivers/net/ixgbe/ixgbe_main.c |   25 +++++++++++++++++++++----
 1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 905d6d2..0977f2f 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -691,7 +691,14 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
 					struct net_device *dev)
 {
-	return true;
+	return dev_is_mpassthru(dev);
+}
+
+static u32 get_page_skb_offset(struct net_device *dev)
+{
+	if (!dev_is_mpassthru(dev))
+		return 0;
+	return dev->mp_port->vnet_hlen;
 }
 
 /**
@@ -764,7 +771,8 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 				adapter->alloc_rx_page_failed++;
 				goto no_buffers;
 			}
-			bi->page_skb_offset = 0;
+			bi->page_skb_offset =
+				get_page_skb_offset(adapter->netdev);
 			bi->dma = pci_map_page(pdev, bi->page_skb,
 					bi->page_skb_offset,
 					(PAGE_SIZE / 2),
@@ -899,8 +907,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
-		if (is_no_buffer(rx_buffer_info))
+		if (is_no_buffer(rx_buffer_info)) {
+			printk("no buffers\n");
 			break;
+		}
 
 		cleaned = true;
 
@@ -959,6 +969,12 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 						rx_buffer_info->page_skb,
 						rx_buffer_info->page_skb_offset,
 						len);
+				if (dev_is_mpassthru(netdev) &&
+						netdev->mp_port->hash)
+					skb_shinfo(skb)->destructor_arg =
+					netdev->mp_port->hash(netdev,
+					rx_buffer_info->page_skb);
+
 				rx_buffer_info->page_skb = NULL;
 				skb->len += len;
 				skb->data_len += len;
@@ -976,7 +992,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			                   upper_len);
 
 			if ((rx_ring->rx_buf_len > (PAGE_SIZE / 2)) ||
-			    (page_count(rx_buffer_info->page) != 1))
+			    (page_count(rx_buffer_info->page) != 1) ||
+				dev_is_mpassthru(netdev))
 				rx_buffer_info->page = NULL;
 			else
 				get_page(rx_buffer_info->page);
-- 
1.7.3



* [PATCH v12 12/17] Add mp(mediate passthru) device.
From: xiaohui.xin @ 2010-09-30 14:04 UTC
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

This patch adds the mp (mediate passthru) device, which is now
based on the vhost-net backend driver and provides proto_ops
to send/receive guest buffer data from/to the guest virtio-net
driver.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/mpassthru.c | 1380 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1380 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..1a114d1
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1380 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16     offset;
+	u16     size;
+};
+
+#define	HASH_BUCKETS	(8192*2)
+
+struct page_info {
+	struct list_head        list;
+	struct page_info	*next;
+	struct page_info	*prev;
+	struct page             *pages[MAX_SKB_FRAGS];
+	struct sk_buff          *skb;
+	struct page_pool        *pool;
+
+	/* The pointer relayed to the skb, to indicate whether
+	 * it's an externally allocated skb or a kernel one
+	 */
+	struct skb_ext_page    ext_page;
+	/* flag to indicate read or write */
+#define INFO_READ                      0
+#define INFO_WRITE                     1
+	unsigned                flags;
+	/* exact number of locked pages */
+	unsigned                pnum;
+
+	/* The fields after that is for backend
+	 * driver, now for vhost-net.
+	 */
+	/* the kiocb structure related to */
+	struct kiocb            *iocb;
+	/* the ring descriptor index */
+	unsigned int            desc_pos;
+	/* the iovec coming from backend, we only
+	* need few of them */
+	struct iovec            hdr[2];
+	struct iovec            iov[2];
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+struct page_pool {
+	/* the queue for rx side */
+	struct list_head        readq;
+	/* the lock to protect readq */
+	spinlock_t		read_lock;
+	/* record the original rlimit */
+	struct rlimit		o_rlim;
+	/* record the locked pages */
+	int			lock_pages;
+	/* the device according to */
+	struct net_device	*dev;
+	/* the mp_port according to dev */
+	struct mp_port		port;
+	/* the hash_table list to find each locked page */
+	struct page_info	**hash_table;
+};
+
+struct mp_struct {
+	struct mp_file		*mfile;
+	struct net_device       *dev;
+	struct page_pool	*pool;
+	struct socket           socket;
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock		sk;
+	struct mp_struct	*mp;
+};
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mp_port *port,
+				      struct sk_buff *skb,
+				      int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_pool *pool;
+	struct page_info *info = NULL;
+
+	if (npages != 1)
+		BUG();
+	pool = container_of(port, struct page_pool, port);
+
+	spin_lock_irqsave(&pool->read_lock, flags);
+	if (!list_empty(&pool->readq)) {
+		info = list_first_entry(&pool->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&pool->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++)
+		get_page(info->pages[i]);
+	info->skb = skb;
+	return &info->ext_page;
+}
+
+static struct page_info *mp_hash_lookup(struct page_pool *pool,
+					struct page *page);
+static struct page_info *mp_hash_delete(struct page_pool *pool,
+					struct page_info *info);
+
+static struct skb_ext_page *mp_lookup(struct net_device *dev,
+				      struct page *page)
+{
+	struct mp_struct *mp =
+		container_of(dev->mp_port->sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool = mp->pool;
+	struct page_info *info;
+
+	info = mp_hash_lookup(pool, page);
+	if (!info)
+		return NULL;
+	return &info->ext_page;
+}
+
+static int page_pool_attach(struct mp_struct *mp)
+{
+	int rc;
+	struct page_pool *pool;
+	struct net_device *dev = mp->dev;
+
+	/* locked by mp_mutex */
+	if (mp->pool)
+		return -EBUSY;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return -ENOMEM;
+	rc = netdev_mp_port_prep(dev, &pool->port);
+	if (rc)
+		goto fail;
+
+	INIT_LIST_HEAD(&pool->readq);
+	spin_lock_init(&pool->read_lock);
+	pool->hash_table = kzalloc(sizeof(struct page_info *) * HASH_BUCKETS,
+			GFP_KERNEL);
+	if (!pool->hash_table)
+		goto fail;
+
+	dev_hold(dev);
+	pool->dev = dev;
+	pool->port.ctor = page_ctor;
+	pool->port.sock = &mp->socket;
+	pool->port.hash = mp_lookup;
+	pool->lock_pages = 0;
+
+	/* locked by mp_mutex */
+	dev->mp_port = &pool->port;
+	mp->pool = pool;
+
+	return 0;
+
+fail:
+	kfree(pool);
+	dev_put(dev);
+
+	return rc;
+}
+
+struct page_info *info_dequeue(struct page_pool *pool)
+{
+	unsigned long flags;
+	struct page_info *info = NULL;
+	spin_lock_irqsave(&pool->read_lock, flags);
+	if (!list_empty(&pool->readq)) {
+		info = list_first_entry(&pool->readq,
+				struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&pool->read_lock, flags);
+	return info;
+}
+
+static int set_memlock_rlimit(struct page_pool *pool, int resource,
+			      unsigned long cur, unsigned long max)
+{
+	struct rlimit new_rlim, *old_rlim;
+	int retval;
+
+	if (resource != RLIMIT_MEMLOCK)
+		return -EINVAL;
+	new_rlim.rlim_cur = cur;
+	new_rlim.rlim_max = max;
+
+	old_rlim = current->signal->rlim + resource;
+
+	/* remember the old rlimit value when backend enabled */
+	pool->o_rlim.rlim_cur = old_rlim->rlim_cur;
+	pool->o_rlim.rlim_max = old_rlim->rlim_max;
+
+	if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+			!capable(CAP_SYS_RESOURCE))
+		return -EPERM;
+
+	retval = security_task_setrlimit(resource, &new_rlim);
+	if (retval)
+		return retval;
+
+	task_lock(current->group_leader);
+	*old_rlim = new_rlim;
+	task_unlock(current->group_leader);
+	return 0;
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+	struct page_info *info = (struct page_info *)(iocb->private);
+	int i;
+
+	if (info->flags == INFO_READ) {
+		for (i = 0; i < info->pnum; i++) {
+			if (info->pages[i]) {
+				set_page_dirty_lock(info->pages[i]);
+				put_page(info->pages[i]);
+			}
+		}
+		mp_hash_delete(info->pool, info);
+		if (info->skb) {
+			info->skb->destructor = NULL;
+			kfree_skb(info->skb);
+		}
+	}
+	/* Decrement the number of locked pages */
+	info->pool->lock_pages -= info->pnum;
+	kmem_cache_free(ext_page_info_cache, info);
+
+	return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+	struct kiocb *iocb = NULL;
+
+	iocb = info->iocb;
+	if (!iocb)
+		return iocb;
+	iocb->ki_flags = 0;
+	iocb->ki_users = 1;
+	iocb->ki_key = 0;
+	iocb->ki_ctx = NULL;
+	iocb->ki_cancel = NULL;
+	iocb->ki_retry = NULL;
+	iocb->ki_eventfd = NULL;
+	iocb->ki_pos = info->desc_pos;
+	iocb->ki_nbytes = size;
+	iocb->ki_dtor(iocb);
+	iocb->private = (void *)info;
+	iocb->ki_dtor = mp_ki_dtor;
+
+	return iocb;
+}
+
+static int page_pool_detach(struct mp_struct *mp)
+{
+	struct page_pool *pool;
+	struct page_info *info;
+	int i;
+
+	/* locked by mp_mutex */
+	pool = mp->pool;
+	if (!pool)
+		return -ENODEV;
+
+	while ((info = info_dequeue(pool))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		create_iocb(info, 0);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+
+	set_memlock_rlimit(pool, RLIMIT_MEMLOCK,
+			   pool->o_rlim.rlim_cur,
+			   pool->o_rlim.rlim_max);
+
+	/* locked by mp_mutex */
+	pool->dev->mp_port = NULL;
+	dev_put(pool->dev);
+
+	mp->pool = NULL;
+	kfree(pool->hash_table);
+	kfree(pool);
+	return 0;
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+	mp->mfile = NULL;
+
+	dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+	page_pool_detach(mp);
+	dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+
+	/* Drop the extra count on the net device */
+	dev_put(mp->dev);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+	mutex_lock(&mp_mutex);
+	__mp_detach(mp);
+	mutex_unlock(&mp_mutex);
+}
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+	struct mp_struct *mp = NULL;
+	if (atomic_inc_not_zero(&mfile->count))
+		mp = mfile->mp;
+
+	return mp;
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+	if (atomic_dec_and_test(&mfile->count)) {
+		if (!rtnl_is_locked()) {
+			rtnl_lock();
+			mp_detach(mfile->mp);
+			rtnl_unlock();
+		} else
+			mp_detach(mfile->mp);
+	}
+}
+
+static void iocb_tag(struct kiocb *iocb)
+{
+	iocb->ki_flags = 1;
+}
+
+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_ext_page *ext_page)
+{
+	struct page_info *info;
+	struct page_pool *pool;
+	struct sock *sk;
+	struct sk_buff *skb;
+
+	if (!ext_page)
+		return;
+	info = container_of(ext_page, struct page_info, ext_page);
+	if (!info)
+		return;
+	pool = info->pool;
+	skb = info->skb;
+
+	if (info->flags == INFO_READ) {
+		create_iocb(info, 0);
+		return;
+	}
+
+	/* For transmit, we should wait for the DMA finish by hardware.
+	 * Queue the notifier to wake up the backend driver
+	 */
+
+	iocb_tag(info->iocb);
+	sk = pool->port.sock->sk;
+	sk->sk_write_space(sk);
+
+	return;
+}
+
+/* For transmits of small external buffers, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_pool *pool,
+		struct kiocb *iocb, int total)
+{
+	struct page_info *info =
+		kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->ext_page.dtor = page_dtor;
+	info->pool = pool;
+	info->flags = INFO_WRITE;
+	info->iocb = iocb;
+	info->pnum = 0;
+	return info;
+}
+
+typedef u32 key_mp_t;
+static inline key_mp_t mp_hash(struct page *page, int buckets)
+{
+	key_mp_t k;
+#if BITS_PER_LONG == 64
+	k = ((((unsigned long)page << 32UL) >> 32UL) /
+			sizeof(struct page)) % buckets ;
+#elif BITS_PER_LONG == 32
+	k = ((unsigned long)page / sizeof(struct page)) % buckets;
+#endif
+
+	return k;
+}
+
+static void mp_hash_insert(struct page_pool *pool,
+		struct page *page, struct page_info *page_info)
+{
+	struct page_info *tmp;
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	if (!pool->hash_table[key]) {
+		pool->hash_table[key] = page_info;
+		return;
+	}
+
+	tmp = pool->hash_table[key];
+	while (tmp->next)
+		tmp = tmp->next;
+
+	tmp->next = page_info;
+	page_info->prev = tmp;
+	return;
+}
+
+static struct page_info *mp_hash_delete(struct page_pool *pool,
+					struct page_info *info)
+{
+	key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+
+	tmp = pool->hash_table[key];
+	while (tmp) {
+		if (tmp == info) {
+			if (!tmp->prev) {
+				pool->hash_table[key] = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = NULL;
+			} else {
+				tmp->prev->next = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = tmp->prev;
+			}
+			return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+static struct page_info *mp_hash_lookup(struct page_pool *pool,
+					struct page *page)
+{
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+
+	int i;
+	tmp = pool->hash_table[key];
+	while (tmp) {
+		for (i = 0; i < tmp->pnum; i++) {
+			if (tmp->pages[i] == page)
+				return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the external buffer address.
+ */
+static struct page_info *alloc_page_info(struct page_pool *pool,
+		struct kiocb *iocb, struct iovec *iov,
+		int count, struct frag *frags,
+		int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base, lock_limit;
+	struct page_info *info = NULL;
+
+	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+	lock_limit >>= PAGE_SHIFT;
+
+	if (pool->lock_pages + count > lock_limit && npages) {
+		printk(KERN_INFO "exceed the locked memory rlimit.\n");
+		return NULL;
+	}
+
+	info = kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
+	
+	if (!info)
+		return NULL;
+	info->skb = NULL;
+	info->next = info->prev = NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+				&info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif
+
+	info->ext_page.dtor = page_dtor;
+	info->ext_page.page = info->pages[0];
+	info->pool = pool;
+	info->pnum = j;
+	info->iocb = iocb;
+	if (!npages)
+		info->flags = INFO_WRITE;
+	else
+		info->flags = INFO_READ;
+
+	if (info->flags == INFO_READ) {
+		if (frags[0].offset == 0 && iocb->ki_iovec[0].iov_len) {
+			frags[0].offset = iocb->ki_iovec[0].iov_len;
+			pool->port.vnet_hlen = iocb->ki_iovec[0].iov_len;
+		}
+		for (i = 0; i < j; i++)
+			mp_hash_insert(pool, info->pages[i], info);
+	}
+	/* increment the number of locked pages */
+	pool->lock_pages += j;
+	return info;
+
+failed:
+	for (i = 0; i < j; i++)
+		put_page(info->pages[i]);
+
+	kmem_cache_free(ext_page_info_cache, info);
+
+	return NULL;
+}
+
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	if (sk_has_sleeper(sk))
+		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool = NULL;
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	int len;
+
+	pool = mp->pool;
+	if (!pool)
+		return;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		struct page *page;
+		int off;
+		int size = 0, i = 0;
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct skb_ext_page *ext_page =
+			(struct skb_ext_page *)(shinfo->destructor_arg);
+		struct virtio_net_hdr_mrg_rxbuf hdr = {
+			.hdr.flags = 0,
+			.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
+		};
+
+		if (skb->ip_summed == CHECKSUM_COMPLETE)
+			printk(KERN_INFO "Complete checksum occurs\n");
+
+		if (shinfo->frags[0].page == ext_page->page) {
+			info = container_of(ext_page,
+					    struct page_info,
+					    ext_page);
+			if (shinfo->nr_frags)
+				hdr.num_buffers = shinfo->nr_frags;
+			else
+				hdr.num_buffers = shinfo->nr_frags + 1;
+		} else {
+			info = container_of(ext_page,
+					    struct page_info,
+					    ext_page);
+			hdr.num_buffers = shinfo->nr_frags + 1;
+		}
+		skb_push(skb, ETH_HLEN);
+
+		if (skb_is_gso(skb)) {
+			hdr.hdr.hdr_len = skb_headlen(skb);
+			hdr.hdr.gso_size = shinfo->gso_size;
+			if (shinfo->gso_type & SKB_GSO_TCPV4)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+			else if (shinfo->gso_type & SKB_GSO_TCPV6)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+			else if (shinfo->gso_type & SKB_GSO_UDP)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+			else
+				BUG();
+			if (shinfo->gso_type & SKB_GSO_TCP_ECN)
+				hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+
+		} else
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			hdr.hdr.csum_start =
+				skb->csum_start - skb_headroom(skb);
+			hdr.hdr.csum_offset = skb->csum_offset;
+		}
+
+		off = info->hdr[0].iov_len;
+		len = memcpy_toiovec(info->iov, (unsigned char *)&hdr, off);
+		if (len) {
+			pr_debug("Unable to write vnet_hdr at addr '%p': '%d'\n",
+				info->iov, len);
+			goto clean;
+		}
+
+		memcpy_toiovec(info->iov, skb->data, skb_headlen(skb));
+
+		info->iocb->ki_left = hdr.num_buffers;
+		if (shinfo->frags[0].page == ext_page->page) {
+			size = shinfo->frags[0].size +
+				shinfo->frags[0].page_offset - off;
+			i = 1;
+		} else {
+			size = skb_headlen(skb);
+			i = 0;
+		}
+		create_iocb(info, off + size);
+		for (i = i; i < shinfo->nr_frags; i++) {
+			page = shinfo->frags[i].page;
+			info = mp_hash_lookup(pool, shinfo->frags[i].page);
+			create_iocb(info, shinfo->frags[i].size);
+		}
+		info->skb = skb;
+		shinfo->nr_frags = 0;
+		shinfo->destructor_arg = NULL;
+		continue;
+clean:
+		kfree_skb(skb);
+		for (i = 0; i < info->pnum; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	return;
+}
+
+static inline struct sk_buff *mp_alloc_skb(struct sock *sk, size_t prepad,
+					   size_t len, size_t linear,
+					   int noblock, int *err)
+{
+	struct sk_buff *skb;
+
+	/* Under a page?  Don't bother with paged skb. */
+	if (prepad + len < PAGE_SIZE || !linear)
+		linear = len;
+
+	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
+			err);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, prepad);
+	skb_put(skb, linear);
+	skb->data_len = len - linear;
+	skb->len += len - linear;
+
+	return skb;
+}
+
+static int mp_skb_from_vnet_hdr(struct sk_buff *skb,
+		struct virtio_net_hdr *vnet_hdr)
+{
+	unsigned short gso_type = 0;
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+			gso_type = SKB_GSO_TCPV4;
+			break;
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			gso_type = SKB_GSO_TCPV6;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			gso_type = SKB_GSO_UDP;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
+			gso_type |= SKB_GSO_TCP_ECN;
+
+		if (vnet_hdr->gso_size == 0)
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		if (!skb_partial_csum_set(skb, vnet_hdr->csum_start,
+					vnet_hdr->csum_offset))
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		skb_shinfo(skb)->gso_size = vnet_hdr->gso_size;
+		skb_shinfo(skb)->gso_type = gso_type;
+
+		/* Header must be checked, and gso_segs computed. */
+		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+		skb_shinfo(skb)->gso_segs = 0;
+	}
+	return 0;
+}
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct virtio_net_hdr vnet_hdr = {0};
+	int hdr_len = 0;
+	struct page_pool *pool;
+	struct iovec *iov = m->msg_iov;
+	struct page_info *info = NULL;
+	struct frag frags[MAX_SKB_FRAGS];
+	struct sk_buff *skb;
+	int count = m->msg_iovlen;
+	int total = 0, header, n, i, len, rc;
+	unsigned long base;
+
+	pool = mp->pool;
+	if (!pool)
+		return -ENODEV;
+
+	total = iov_length(iov, count);
+
+	if (total < ETH_HLEN)
+		return -EINVAL;
+
+	if (total <= COPY_THRESHOLD)
+		goto copy;
+
+	n = 0;
+	for (i = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+		if (!len)
+			continue;
+		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (n > MAX_SKB_FRAGS)
+			return -EINVAL;
+	}
+
+copy:
+	hdr_len = sizeof(vnet_hdr);
+	if ((total - iocb->ki_iovec[0].iov_len) < 0)
+		return -EINVAL;
+
+	rc = memcpy_fromiovecend((void *)&vnet_hdr, iocb->ki_iovec, 0, hdr_len);
+	if (rc < 0)
+		return -EINVAL;
+
+	if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+			vnet_hdr.csum_start + vnet_hdr.csum_offset + 2 >
+			vnet_hdr.hdr_len)
+		vnet_hdr.hdr_len = vnet_hdr.csum_start +
+			vnet_hdr.csum_offset + 2;
+
+	if (vnet_hdr.hdr_len > total)
+		return -EINVAL;
+
+	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+	skb = mp_alloc_skb(sock->sk, NET_IP_ALIGN, header,
+			   iocb->ki_iovec[0].iov_len, 1, &rc);
+
+	if (!skb)
+		goto drop;
+
+	skb_set_network_header(skb, ETH_HLEN);
+	memcpy_fromiovec(skb->data, iov, header);
+
+	skb_reset_mac_header(skb);
+	skb->protocol = eth_hdr(skb)->h_proto;
+
+	rc = mp_skb_from_vnet_hdr(skb, &vnet_hdr);
+	if (rc)
+		goto drop;
+
+	if (header == total) {
+		rc = total;
+		info = alloc_small_page_info(pool, iocb, total);
+	} else {
+		info = alloc_page_info(pool, iocb, iov, count, frags, 0, total);
+		if (info)
+			for (i = 0; i < info->pnum; i++) {
+				skb_add_rx_frag(skb, i, info->pages[i],
+						frags[i].offset, frags[i].size);
+				info->pages[i] = NULL;
+			}
+	}
+	if (!pool->lock_pages)
+		sock->sk->sk_state_change(sock->sk);
+
+	if (info != NULL) {
+		info->desc_pos = iocb->ki_pos;
+		info->skb = skb;
+		skb_shinfo(skb)->destructor_arg = &info->ext_page;
+		skb->dev = mp->dev;
+		create_iocb(info, total);
+		dev_queue_xmit(skb);
+		return 0;
+	}
+drop:
+	kfree_skb(skb);
+	if (info) {
+		for (i = 0; i < info->pnum; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	mp->dev->stats.tx_dropped++;
+	return -ENOMEM;
+}
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len,
+		int flags)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool;
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+	int npages, payload;
+	struct page_info *info;
+	struct frag frags[MAX_SKB_FRAGS];
+	unsigned long base;
+	int i, len;
+	unsigned long flag;
+
+	if (!(flags & MSG_DONTWAIT))
+		return -EINVAL;
+
+	pool = mp->pool;
+	if (!pool)
+		return -EINVAL;
+
+	/* Error detection in case of an invalid external buffer */
+	if (count > 2 && iov[1].iov_len < pool->port.hdr_len &&
+			mp->dev->features & NETIF_F_SG) {
+		return -EINVAL;
+	}
+
+	npages = pool->port.npages;
+	payload = pool->port.data_len;
+
+	/* If the KVM guest virtio-net FE driver uses the SG feature */
+	if (count > 2) {
+		for (i = 2; i < count; i++) {
+			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+			len = iov[i].iov_len;
+			if (npages == 1)
+				len = min_t(int, len, PAGE_SIZE - base);
+			else if (base)
+				break;
+			payload -= len;
+			if (payload <= 0)
+				goto proceed;
+			if (npages == 1 || (len & ~PAGE_MASK))
+				break;
+		}
+	}
+
+	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+		goto proceed;
+
+	return -EINVAL;
+
+proceed:
+	/* skip the virtnet head */
+	if (count > 1) {
+		iov++;
+		count--;
+	}
+
+	if (!pool->lock_pages) {
+		set_memlock_rlimit(pool, RLIMIT_MEMLOCK,
+				iocb->ki_user_data * 4096 * 2,
+				iocb->ki_user_data * 4096 * 2);
+	}
+
+	/* Translate address to kernel */
+	info = alloc_page_info(pool, iocb, iov, count, frags, npages, 0);
+	if (!info)
+		return -ENOMEM;
+	info->hdr[0].iov_base = iocb->ki_iovec[0].iov_base;
+	info->hdr[0].iov_len = iocb->ki_iovec[0].iov_len;
+	iocb->ki_iovec[0].iov_len = 0;
+	iocb->ki_left = 0;
+	info->desc_pos = iocb->ki_pos;
+
+	if (count > 1) {
+		iov--;
+		count++;
+	}
+
+	memcpy(info->iov, iov, sizeof(struct iovec) * count);
+
+	spin_lock_irqsave(&pool->read_lock, flag);
+	list_add_tail(&info->list, &pool->readq);
+	spin_unlock_irqrestore(&pool->read_lock, flag);
+
+	return 0;
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+	.sendmsg = mp_sendmsg,
+	.recvmsg = mp_recvmsg,
+};
+
+static struct proto mp_proto = {
+	.name           = "mp",
+	.owner          = THIS_MODULE,
+	.obj_size       = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+	struct mp_file *mfile;
+	cycle_kernel_lock();
+
+	pr_debug("mp: mp_chr_open\n");
+	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+	if (!mfile)
+		return -ENOMEM;
+	atomic_set(&mfile->count, 0);
+	mfile->mp = NULL;
+	mfile->net = get_net(current->nsproxy->net_ns);
+	file->private_data = mfile;
+	return 0;
+}
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	int err;
+
+	netif_tx_lock_bh(mp->dev);
+
+	err = -EINVAL;
+
+	if (mfile->mp)
+		goto out;
+
+	err = -EBUSY;
+	if (mp->mfile)
+		goto out;
+
+	err = 0;
+	mfile->mp = mp;
+	mp->mfile = mfile;
+	mp->socket.file = file;
+	dev_hold(mp->dev);
+	sock_hold(mp->socket.sk);
+	atomic_inc(&mfile->count);
+
+out:
+	netif_tx_unlock_bh(mp->dev);
+	return err;
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+	struct mp_struct *mp = mp_get(mfile);
+
+	if (!mp)
+		return -EINVAL;
+
+	mp_detach(mp);
+	sock_put(mp->socket.sk);
+	mp_put(mfile);
+	return 0;
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+	struct net_device *dev;
+	void __user* argp = (void __user *)arg;
+	struct ifreq ifr;
+	struct sock *sk;
+	int ret;
+
+	ret = -EINVAL;
+
+	switch (cmd) {
+	case MPASSTHRU_BINDDEV:
+		ret = -EFAULT;
+		if (copy_from_user(&ifr, argp, sizeof ifr))
+			break;
+
+		ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+		ret = -ENODEV;
+
+		rtnl_lock();
+		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+		if (!dev) {
+			rtnl_unlock();
+			break;
+		}
+
+		mutex_lock(&mp_mutex);
+
+		ret = -EBUSY;
+
+		/* the device can be bound only once */
+		if (dev_is_mpassthru(dev))
+			goto err_dev_put;
+
+		mp = mfile->mp;
+		if (mp)
+			goto err_dev_put;
+
+		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+		if (!mp) {
+			ret = -ENOMEM;
+			goto err_dev_put;
+		}
+		mp->dev = dev;
+		ret = -ENOMEM;
+
+		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+		if (!sk)
+			goto err_free_mp;
+
+		init_waitqueue_head(&mp->socket.wait);
+		mp->socket.ops = &mp_socket_ops;
+		sock_init_data(&mp->socket, sk);
+		sk->sk_sndbuf = INT_MAX;
+		container_of(sk, struct mp_sock, sk)->mp = mp;
+
+		sk->sk_destruct = mp_sock_destruct;
+		sk->sk_data_ready = mp_sock_data_ready;
+		sk->sk_write_space = mp_sock_write_space;
+		sk->sk_state_change = mp_sock_state_change;
+		ret = mp_attach(mp, file);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ret = page_pool_attach(mp);
+		if (ret < 0)
+			goto err_free_sk;
+		dev_change_flags(mp->dev, mp->dev->flags & (~IFF_UP));
+		dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+		sk->sk_state_change(sk);
+out:
+		mutex_unlock(&mp_mutex);
+		rtnl_unlock();
+		break;
+err_free_sk:
+		sk_free(sk);
+err_free_mp:
+		kfree(mp);
+err_dev_put:
+		dev_put(dev);
+		goto out;
+
+	case MPASSTHRU_UNBINDDEV:
+		rtnl_lock();
+		ret = do_unbind(mfile);
+		rtnl_unlock();
+		break;
+
+	default:
+		break;
+	}
+	return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp = mp_get(mfile);
+	struct sock *sk;
+	unsigned int mask = 0;
+
+	if (!mp)
+		return POLLERR;
+
+	sk = mp->socket.sk;
+
+	poll_wait(file, &mp->socket.wait, wait);
+
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(sk) ||
+		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+			 sock_writeable(sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+	if (mp->dev->reg_state != NETREG_REGISTERED)
+		mask = POLLERR;
+
+	mp_put(mfile);
+	return mask;
+}
+
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct mp_struct *mp = mp_get(file->private_data);
+	struct sock *sk;
+	struct sk_buff *skb;
+	int len, err;
+	ssize_t result = 0;
+
+	if (!mp)
+		return -EBADFD;
+	sk = mp->socket.sk;
+	/* currently, async is not supported.
+	 * but we may support real async aio from user application,
+	 * maybe qemu virtio-net backend.
+	 */
+	if (!is_sync_kiocb(iocb))
+		return -EFAULT;
+
+	len = iov_length(iov, count);
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+				  file->f_flags & O_NONBLOCK, &err);
+
+	if (!skb)
+		return -ENOMEM;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EAGAIN;
+	}
+
+	skb->protocol = eth_type_trans(skb, mp->dev);
+	skb->dev = mp->dev;
+
+	dev_queue_xmit(skb);
+
+	mp_put(file->private_data);
+	return result;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+
+	/*
+	 * Ignore return value since an error only means there was nothing to
+	 * do
+	 */
+	do_unbind(mfile);
+
+	put_net(mfile->net);
+	kfree(mfile);
+
+	return 0;
+}
+
+#ifdef CONFIG_COMPAT
+static long mp_chr_compat_ioctl(struct file *f, unsigned int ioctl,
+				unsigned long arg)
+{
+	return mp_chr_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations mp_fops = {
+	.owner  = THIS_MODULE,
+	.llseek = no_llseek,
+	.write  = do_sync_write,
+	.aio_write = mp_chr_aio_write,
+	.poll   = mp_chr_poll,
+	.unlocked_ioctl = mp_chr_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = mp_chr_compat_ioctl,
+#endif
+	.open   = mp_chr_open,
+	.release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mp",
+	.nodename = "net/mp",
+	.fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+		unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mp_port *port;
+	struct mp_struct *mp = NULL;
+	struct socket *sock = NULL;
+	struct sock *sk;
+
+	port = dev->mp_port;
+	if (port == NULL)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		sock = dev->mp_port->sock;
+		mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+		do_unbind(mp->mfile);
+		break;
+	case NETDEV_CHANGE:
+		sk = dev->mp_port->sock->sk;
+		sk->sk_state_change(sk);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+	.notifier_call  = mp_device_event,
+};
+
+static int mp_init(void)
+{
+	int err = 0;
+
+	ext_page_info_cache = kmem_cache_create("skb_page_info",
+						sizeof(struct page_info),
+						0, SLAB_HWCACHE_ALIGN, NULL);
+	if (!ext_page_info_cache)
+		return -ENOMEM;
+
+	err = misc_register(&mp_miscdev);
+	if (err) {
+		printk(KERN_ERR "mp: Can't register misc device\n");
+		kmem_cache_destroy(ext_page_info_cache);
+	} else {
+		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+				mp_miscdev.minor);
+		register_netdevice_notifier(&mp_notifier_block);
+	}
+	return err;
+}
+
+void mp_exit(void)
+{
+	unregister_netdevice_notifier(&mp_notifier_block);
+	misc_deregister(&mp_miscdev);
+	kmem_cache_destroy(ext_page_info_cache);
+}
+
+/* Get an underlying socket object from mp file.  Returns error unless file is
+ * attached to a device.  The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+
+	if (file->f_op != &mp_fops)
+		return ERR_PTR(-EINVAL);
+	mp = mp_get(mfile);
+	if (!mp)
+		return ERR_PTR(-EBADFD);
+	mp_put(mfile);
+	return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_exit);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
-- 
1.7.3
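
For readers following the frag-count check at the top of this hunk in
mp_sendmsg(): the expression adds the buffer's offset within its first page
to its length and rounds up to whole pages. A minimal userspace sketch of
that arithmetic, assuming 4 KiB pages; the demo_* and DEMO_* names are
hypothetical and not part of the patch.

```c
#include <stddef.h>

/* Assumed for illustration: 4 KiB pages, as on x86. */
#define DEMO_PAGE_SHIFT  12
#define DEMO_PAGE_SIZE   (1UL << DEMO_PAGE_SHIFT)
/* Low-bit mask; plays the role of the kernel's ~PAGE_MASK. */
#define DEMO_OFFSET_MASK (DEMO_PAGE_SIZE - 1)

/*
 * Number of pages touched by the buffer [base, base + len).
 * Mirrors ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT.
 */
unsigned long demo_pages_spanned(unsigned long base, size_t len)
{
	unsigned long offset = base & DEMO_OFFSET_MASK;

	return (offset + len + DEMO_OFFSET_MASK) >> DEMO_PAGE_SHIFT;
}
```

A page-aligned 4096-byte buffer spans one page, while the same length
starting one byte into a page spans two, which is why an unaligned guest
buffer can exceed MAX_SKB_FRAGS sooner than its length alone suggests.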


^ permalink raw reply related

* [PATCH v12 13/17] Add a kconfig entry and make entry for mp device.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/Kconfig  |   10 ++++++++++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  Zero-copy network I/O support. We call it mediate passthru to
+	  distinguish it from hardware passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.7.3


^ permalink raw reply related

* [PATCH v12 04/17] Add a function make external buffer owner to query capability.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <59d8a50047ee01e26658fd676d26c0162b79e5fd.1285853725.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The external buffer owner can use this function to query
the capabilities of the underlying NIC driver.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhaonew@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>

---
 include/linux/netdevice.h |    2 +
 net/core/dev.c            |   49 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 24a31e7..27f5024 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1608,6 +1608,8 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+				struct mp_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..c11e32c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2468,6 +2468,55 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we query the NIC driver for the capabilities it can
+ * provide, especially for packet split mode. For now we only
+ * query the header size and the payload a descriptor
+ * may carry. If a driver does not export these via the API,
+ * we fall back to default values; currently these are
+ * the defaults from the IGB driver. For now,
+ * this is only called by the mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+		struct mp_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.7.3


^ permalink raw reply related

* [PATCH v12 01/17] Add a new structure for skb buffer from external.
From: xiaohui.xin @ 2010-09-30 14:04 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1285855474-12110-1-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..74af06c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,15 @@ struct skb_shared_info {
 	void *		destructor_arg;
 };
 
+/* This structure describes an skb whose pages may point to
+ * an external buffer that was not allocated from kernel space.
+ * It also carries its own destructor.
+ */
+ */
+struct skb_ext_page {
+	struct		page *page;
+	void		(*dtor)(struct skb_ext_page *);
+};
+
 /* We divide dataref into two halves.  The higher 16 bits hold references
  * to the payload part of skb->data.  The lower 16 bits hold references to
  * the entire skb->data.  A clone of a headerless skb holds the length of
-- 
1.7.3
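
One way an owner might use struct skb_ext_page is to embed it in a larger
per-buffer record and recover that record inside the destructor. The mp
device patch in this series stores &info->ext_page as destructor_arg, which
suggests this layout, but whether its destructor works exactly like this is
an assumption. A userspace sketch with hypothetical demo_* names:

```c
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() helper. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same shape as struct skb_ext_page in the patch; page is opaque here. */
struct demo_ext_page {
	void *page;
	void (*dtor)(struct demo_ext_page *);
};

/* Hypothetical owner record, loosely modelled on the mp device's page_info. */
struct demo_page_info {
	int released;                  /* set once the buffer is handed back */
	struct demo_ext_page ext_page; /* embedded, so the owner is recoverable */
};

/* Destructor: recover the owner from the embedded member, mark it released. */
void demo_page_dtor(struct demo_ext_page *ext)
{
	struct demo_page_info *info =
		demo_container_of(ext, struct demo_page_info, ext_page);

	info->released = 1;
}

/* What a consumer holding only the ext_page pointer would do when done. */
void demo_release(struct demo_ext_page *ext)
{
	if (ext && ext->dtor)
		ext->dtor(ext);
}
```

The consumer never needs to know the owner's type; it only calls dtor, and
the owner finds its own bookkeeping from the embedded member.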


^ permalink raw reply related

* [PATCH net-next 2/2] ipv4: rcu conversion in ip_route_output_slow
From: Eric Dumazet @ 2010-09-30 13:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

ip_route_output_slow() is enclosed in an rcu_read_lock() protected
section, so that no references are taken/released on device, thanks to
__ip_dev_find() & dev_get_by_index_rcu()

Tested with ip route cache disabled, and a stress test :

Before patch:

elapsed time :

real	1m38.347s
user	0m11.909s
sys	23m51.501s

Profile:

13788.00 22.7% ip_route_output_slow [kernel]
 7875.00 13.0% dst_destroy          [kernel]
 3925.00  6.5% fib_semantic_match   [kernel]
 3144.00  5.2% fib_rules_lookup     [kernel]
 3061.00  5.0% dst_alloc            [kernel]
 2276.00  3.7% rt_set_nexthop       [kernel]
 1762.00  2.9% fib_table_lookup     [kernel]
 1538.00  2.5% _raw_read_lock       [kernel]
 1358.00  2.2% ip_output            [kernel]

After patch:

real	1m28.808s
user	0m13.245s
sys	20m37.293s


10950.00 17.2% ip_route_output_slow [kernel]
10726.00 16.9% dst_destroy          [kernel]
 5170.00  8.1% fib_semantic_match   [kernel]
 3937.00  6.2% dst_alloc            [kernel]
 3635.00  5.7% rt_set_nexthop       [kernel]
 2900.00  4.6% fib_rules_lookup     [kernel]
 2240.00  3.5% fib_table_lookup     [kernel]
 1427.00  2.2% _raw_read_lock       [kernel]
 1157.00  1.8% kmem_cache_alloc     [kernel]

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/route.c |   38 ++++++++++++--------------------------
 1 files changed, 12 insertions(+), 26 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 98beda4..9b60129 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2490,6 +2490,7 @@ static int ip_mkroute_output(struct rtable **rp,
 
 /*
  * Major route resolver routine.
+ * called with rcu_read_lock();
  */
 
 static int ip_route_output_slow(struct net *net, struct rtable **rp,
@@ -2508,7 +2509,7 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 			    .iif = net->loopback_dev->ifindex,
 			    .oif = oldflp->oif };
 	struct fib_result res;
-	unsigned flags = 0;
+	unsigned int flags = 0;
 	struct net_device *dev_out = NULL;
 	int free_res = 0;
 	int err;
@@ -2538,7 +2539,7 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 		    (ipv4_is_multicast(oldflp->fl4_dst) ||
 		     oldflp->fl4_dst == htonl(0xFFFFFFFF))) {
 			/* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */
-			dev_out = ip_dev_find(net, oldflp->fl4_src);
+			dev_out = __ip_dev_find(net, oldflp->fl4_src, false);
 			if (dev_out == NULL)
 				goto out;
 
@@ -2563,26 +2564,21 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 
 		if (!(oldflp->flags & FLOWI_FLAG_ANYSRC)) {
 			/* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */
-			dev_out = ip_dev_find(net, oldflp->fl4_src);
-			if (dev_out == NULL)
+			if (!__ip_dev_find(net, oldflp->fl4_src, false))
 				goto out;
-			dev_put(dev_out);
-			dev_out = NULL;
 		}
 	}
 
 
 	if (oldflp->oif) {
-		dev_out = dev_get_by_index(net, oldflp->oif);
+		dev_out = dev_get_by_index_rcu(net, oldflp->oif);
 		err = -ENODEV;
 		if (dev_out == NULL)
 			goto out;
 
 		/* RACE: Check return value of inet_select_addr instead. */
-		if (rcu_dereference_raw(dev_out->ip_ptr) == NULL) {
-			dev_put(dev_out);
+		if (rcu_dereference(dev_out->ip_ptr) == NULL)
 			goto out;	/* Wrong error code */
-		}
 
 		if (ipv4_is_local_multicast(oldflp->fl4_dst) ||
 		    oldflp->fl4_dst == htonl(0xFFFFFFFF)) {
@@ -2605,10 +2601,7 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 		fl.fl4_dst = fl.fl4_src;
 		if (!fl.fl4_dst)
 			fl.fl4_dst = fl.fl4_src = htonl(INADDR_LOOPBACK);
-		if (dev_out)
-			dev_put(dev_out);
 		dev_out = net->loopback_dev;
-		dev_hold(dev_out);
 		fl.oif = net->loopback_dev->ifindex;
 		res.type = RTN_LOCAL;
 		flags |= RTCF_LOCAL;
@@ -2642,8 +2635,6 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 			res.type = RTN_UNICAST;
 			goto make_route;
 		}
-		if (dev_out)
-			dev_put(dev_out);
 		err = -ENETUNREACH;
 		goto out;
 	}
@@ -2652,10 +2643,7 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 	if (res.type == RTN_LOCAL) {
 		if (!fl.fl4_src)
 			fl.fl4_src = fl.fl4_dst;
-		if (dev_out)
-			dev_put(dev_out);
 		dev_out = net->loopback_dev;
-		dev_hold(dev_out);
 		fl.oif = dev_out->ifindex;
 		if (res.fi)
 			fib_info_put(res.fi);
@@ -2675,28 +2663,23 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
 	if (!fl.fl4_src)
 		fl.fl4_src = FIB_RES_PREFSRC(res);
 
-	if (dev_out)
-		dev_put(dev_out);
 	dev_out = FIB_RES_DEV(res);
-	dev_hold(dev_out);
 	fl.oif = dev_out->ifindex;
 
 
 make_route:
 	err = ip_mkroute_output(rp, &res, &fl, oldflp, dev_out, flags);
 
-
 	if (free_res)
 		fib_res_put(&res);
-	if (dev_out)
-		dev_put(dev_out);
 out:	return err;
 }
 
 int __ip_route_output_key(struct net *net, struct rtable **rp,
 			  const struct flowi *flp)
 {
-	unsigned hash;
+	unsigned int hash;
+	int res;
 	struct rtable *rth;
 
 	if (!rt_caching(net))
@@ -2727,7 +2710,10 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
 	rcu_read_unlock_bh();
 
 slow_output:
-	return ip_route_output_slow(net, rp, flp);
+	rcu_read_lock();
+	res = ip_route_output_slow(net, rp, flp);
+	rcu_read_unlock();
+	return res;
 }
 EXPORT_SYMBOL_GPL(__ip_route_output_key);
 



^ permalink raw reply related

* [PATCH net-next 1/2] ipv4: introduce __ip_dev_find()
From: Eric Dumazet @ 2010-09-30 13:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

ip_dev_find(net, addr) finds a device given an IPv4 source address and
takes a reference on it.

Introduce __ip_dev_find(), taking a third argument, to optionally take
the device reference. Callers not asking the reference to be taken
should be in an rcu_read_lock() protected section.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/inetdevice.h |    7 ++++++-
 net/ipv4/fib_frontend.c    |   32 +++++++++++++++++++-------------
 2 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 1ec09bb..ccd5b07 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -159,7 +159,12 @@ struct in_ifaddr {
 extern int register_inetaddr_notifier(struct notifier_block *nb);
 extern int unregister_inetaddr_notifier(struct notifier_block *nb);
 
-extern struct net_device *ip_dev_find(struct net *net, __be32 addr);
+extern struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref);
+static inline struct net_device *ip_dev_find(struct net *net, __be32 addr)
+{
+	return __ip_dev_find(net, addr, true);
+}
+
 extern int		inet_addr_onlink(struct in_device *in_dev, __be32 a, __be32 b);
 extern int		devinet_ioctl(struct net *net, unsigned int cmd, void __user *);
 extern void		devinet_init(void);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 981f3c5..4a69a95 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -147,34 +147,40 @@ static void fib_flush(struct net *net)
 		rt_cache_flush(net, -1);
 }
 
-/*
- *	Find the first device with a given source address.
+/**
+ * __ip_dev_find - find the first device with a given source address.
+ * @net: the net namespace
+ * @addr: the source address
+ * @devref: if true, take a reference on the found device
+ *
+ * If a caller uses devref=false, it should be protected by RCU
  */
-
-struct net_device * ip_dev_find(struct net *net, __be32 addr)
+struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 {
-	struct flowi fl = { .nl_u = { .ip4_u = { .daddr = addr } },
-			    .flags = FLOWI_FLAG_MATCH_ANY_IIF };
-	struct fib_result res;
+	struct flowi fl = {
+		.nl_u = {
+			.ip4_u = {
+				.daddr = addr
+			}
+		},
+		.flags = FLOWI_FLAG_MATCH_ANY_IIF
+	};
+	struct fib_result res = { 0 };
 	struct net_device *dev = NULL;
 
-#ifdef CONFIG_IP_MULTIPLE_TABLES
-	res.r = NULL;
-#endif
-
 	if (fib_lookup(net, &fl, &res))
 		return NULL;
 	if (res.type != RTN_LOCAL)
 		goto out;
 	dev = FIB_RES_DEV(res);
 
-	if (dev)
+	if (dev && devref)
 		dev_hold(dev);
 out:
 	fib_res_put(&res);
 	return dev;
 }
-EXPORT_SYMBOL(ip_dev_find);
+EXPORT_SYMBOL(__ip_dev_find);
 
 /*
  * Find address type as if only "dev" was present in the system. If
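

The calling convention introduced by this patch, a lookup that takes a
reference only when asked plus a thin wrapper that always asks, can be
modelled in userspace as follows. This is a sketch only: struct demo_dev and
both functions are hypothetical, a plain counter stands in for dev_hold(),
and the RCU protection required for devref=false is not modelled at all.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device: an address and a refcount. */
struct demo_dev {
	unsigned int addr;
	int refcnt;
};

/* Analogue of __ip_dev_find(): look up by address, optionally take a ref. */
struct demo_dev *demo_dev_find_ref(struct demo_dev *devs, size_t n,
				   unsigned int addr, bool devref)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (devs[i].addr == addr) {
			if (devref)
				devs[i].refcnt++; /* role of dev_hold() */
			return &devs[i];
		}
	}
	return NULL;
}

/* Analogue of the ip_dev_find() inline wrapper: always takes the ref. */
struct demo_dev *demo_dev_find(struct demo_dev *devs, size_t n,
			       unsigned int addr)
{
	return demo_dev_find_ref(devs, n, addr, true);
}
```

A caller passing false gets the pointer without touching the counter and so
has nothing to release afterwards, which is what lets the follow-up patch
drop the dev_hold()/dev_put() pairs in ip_route_output_slow().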



^ permalink raw reply related

* pull request: wireless-2.6 2010-09-29
From: John W. Linville @ 2010-09-29 20:33 UTC (permalink / raw)
  To: davem; +Cc: linux-wireless, netdev, linux-kernel

Dave,

Here are two more fixes intended for 2.6.36.  One fixes a use-after-free
error, the other fixes a reported regression (bug 17722).  Both are
reasonably small and well documented in the commit logs.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit 01db403cf99f739f86903314a489fb420e0e254f:

  tcp: Fix >4GB writes on 64-bit. (2010-09-27 20:24:54 -0700)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git master

Florian Mickler (1):
      iwl3945: queue the right work if the scan needs to be aborted

Johannes Berg (1):
      mac80211: fix use-after-free

 drivers/net/wireless/iwlwifi/iwl-agn-lib.c  |    2 +-
 drivers/net/wireless/iwlwifi/iwl3945-base.c |    2 +-
 net/mac80211/rx.c                           |    4 ----
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/iwl-agn-lib.c b/drivers/net/wireless/iwlwifi/iwl-agn-lib.c
index 9dd9e64..8fd00a6 100644
--- a/drivers/net/wireless/iwlwifi/iwl-agn-lib.c
+++ b/drivers/net/wireless/iwlwifi/iwl-agn-lib.c
@@ -1411,7 +1411,7 @@ void iwlagn_request_scan(struct iwl_priv *priv, struct ieee80211_vif *vif)
 	clear_bit(STATUS_SCAN_HW, &priv->status);
 	clear_bit(STATUS_SCANNING, &priv->status);
 	/* inform mac80211 scan aborted */
-	queue_work(priv->workqueue, &priv->scan_completed);
+	queue_work(priv->workqueue, &priv->abort_scan);
 }
 
 int iwlagn_manage_ibss_station(struct iwl_priv *priv,
diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c b/drivers/net/wireless/iwlwifi/iwl3945-base.c
index 59a308b..d31661c 100644
--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
@@ -3018,7 +3018,7 @@ void iwl3945_request_scan(struct iwl_priv *priv, struct ieee80211_vif *vif)
 	clear_bit(STATUS_SCANNING, &priv->status);
 
 	/* inform mac80211 scan aborted */
-	queue_work(priv->workqueue, &priv->scan_completed);
+	queue_work(priv->workqueue, &priv->abort_scan);
 }
 
 static void iwl3945_bg_restart(struct work_struct *data)
diff --git a/net/mac80211/rx.c b/net/mac80211/rx.c
index fa0f37e..2862428 100644
--- a/net/mac80211/rx.c
+++ b/net/mac80211/rx.c
@@ -2199,9 +2199,6 @@ static void ieee80211_rx_cooked_monitor(struct ieee80211_rx_data *rx,
 	struct net_device *prev_dev = NULL;
 	struct ieee80211_rx_status *status = IEEE80211_SKB_RXCB(skb);
 
-	if (status->flag & RX_FLAG_INTERNAL_CMTR)
-		goto out_free_skb;
-
 	if (skb_headroom(skb) < sizeof(*rthdr) &&
 	    pskb_expand_head(skb, sizeof(*rthdr), 0, GFP_ATOMIC))
 		goto out_free_skb;
@@ -2260,7 +2257,6 @@ static void ieee80211_rx_cooked_monitor(struct ieee80211_rx_data *rx,
 	} else
 		goto out_free_skb;
 
-	status->flag |= RX_FLAG_INTERNAL_CMTR;
 	return;
 
  out_free_skb:
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply related

* Re: Packet time delays on multi-core systems
From: Eric Dumazet @ 2010-09-30 12:46 UTC (permalink / raw)
  To: Alexey Vlasov; +Cc: Linux Kernel Mailing List, netdev
In-Reply-To: <20100930123048.GA4094@beaver.vrungel.ru>

On Thursday, 30 September 2010 at 16:30 +0400, Alexey Vlasov wrote:
> On Wed, Sep 29, 2010 at 11:45:21PM +0200, Eric Dumazet wrote:
> > But if you send SYN packets at the same time (logged), this might slow
> > down the reception (and answers) of ICMP frames. The LOG target can be
> > quite expensive...
> 
> Yes, it's clear that some slowdown can appear, but 100 ms is too much,
> and this happens at 200 SYN packets in 2 minutes, just as in my example.
> On old servers, where the NIC has no separate tx/rx queues, I don't see
> such a situation even at more than 1000 SYN packets per second.

Because all cpus were servicing interrupts, which was good for your
needs. Things apparently changed with 2.6.32. 

You have a multiqueue NIC, but are using a single CPU to handle the
workload.

> 
> > Does using other rules give the same problem?
> >
> > iptables -A OUTPUT -p tcp -m tcp --dport 80 --tcp-flags
> 
> No, only LOG produces this behaviour.
> 

^ permalink raw reply

* Re: Packet time delays on multi-core systems
From: Eric Dumazet @ 2010-09-30 12:44 UTC (permalink / raw)
  To: Alexey Vlasov; +Cc: Linux Kernel Mailing List, netdev
In-Reply-To: <20100930122321.GA1575@beaver.vrungel.ru>

On Thursday, 30 September 2010 at 16:23 +0400, Alexey Vlasov wrote:
> On Thu, Sep 30, 2010 at 08:33:52AM +0200, Eric Dumazet wrote:
> > On Thursday, 30 September 2010 at 10:24 +0400, Alexey Vlasov wrote:
> > > Here I found some dude with the same problem:
> > > http://lkml.org/lkml/2010/7/9/340
> > > 
> > 
> > In your opinion it's the same problem.
> > 
> > But the description you gave is completely different.
> > 
> > You have time skew only when activating a particular iptables rule.
>  
> Well, I spread the NIC interrupts, namely the tx/rx queues, across different
> processors and got normal pings with the LOG rule added.
> 
> I also found that the overruns counter is constantly growing; I don't know if the two are connected.
> RX packets:2831439546 errors:0 dropped:134726 overruns:947671733 frame:0
> TX packets:2880849825 errors:0 dropped:0 overruns:0 carrier:0
> 
> Rather strange that only one processor was involved; even in top it was
> clear that ksoftirqd was eating up to 100% of the first processor.
> 

OK, that is because CPU0 alone receives the interrupts of all queues.

> Here is the typical distribution of interrupts on the new servers:
>            CPU0    CPU1    CPU2    CPU3 ... CPU23
> 752:         11       0       0       0 ...     0 PCI-MSI-edge eth0
> 753: 2799366721       0       0       0 ...     0 PCI-MSI-edge eth0-rx3
> 754: 2821840553       0       0       0 ...     0 PCI-MSI-edge eth0-rx2
> 755: 2786117044       0       0       0 ...     0 PCI-MSI-edge eth0-rx1
> 756: 2896099336       0       0       0 ...     0 PCI-MSI-edge eth0-rx0
> 757: 1808404680       0       0       0 ...     0 PCI-MSI-edge eth0-tx3
> 758: 1797855130       0       0       0 ...     0 PCI-MSI-edge eth0-tx2
> 759: 1807222032       0       0       0 ...     0 PCI-MSI-edge eth0-tx1
> 760: 1820309360       0       0       0 ...     0 PCI-MSI-edge eth0-tx0
> 

echo 01 >/proc/irq/*/eth0-rx0/../smp_affinity
echo 02 >/proc/irq/*/eth0-rx1/../smp_affinity
echo 04 >/proc/irq/*/eth0-rx2/../smp_affinity
echo 08 >/proc/irq/*/eth0-rx3/../smp_affinity


cat /proc/irq/*/eth0-rx0/../smp_affinity
cat /proc/irq/*/eth0-rx1/../smp_affinity
cat /proc/irq/*/eth0-rx2/../smp_affinity
cat /proc/irq/*/eth0-rx3/../smp_affinity
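
The values written above are hexadecimal CPU bitmasks: bit n selects CPU n,
so 01, 02, 04 and 08 pin the four rx queues to CPU0 through CPU3. A minimal
sketch of that mapping, assuming the CPU index fits in one unsigned long;
demo_cpu_mask() and demo_format_mask() are hypothetical helper names.

```c
#include <stdio.h>

/* Bitmask selecting a single CPU: bit n set means CPU n may take the IRQ. */
unsigned long demo_cpu_mask(unsigned int cpu)
{
	if (cpu >= sizeof(unsigned long) * 8)
		return 0;	/* out of range for this sketch */
	return 1UL << cpu;
}

/* Format the mask as hex, the representation written to smp_affinity. */
int demo_format_mask(char *buf, size_t len, unsigned long mask)
{
	return snprintf(buf, len, "%02lx", mask);
}
```

For the tx queues the same scheme applies; the next free CPUs on this box
would be CPU4 through CPU7, i.e. masks 10, 20, 40 and 80.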



> On the old ones:
>            CPU0       CPU1       CPU2  ...      CPU8
> 502:  522320256  522384039  522327386  ... 522380267 PCI-MSI-edge eth0
> 

Which network driver is it on the new box, and which was it on the old box?

If you switch to 2.6.35, you can use RPS to dispatch packets to several
CPUs, in case interrupt affinity cannot be changed (all interrupts
still handled by CPU0).
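
RPS is configured per receive queue by writing a hexadecimal CPU bitmask to
/sys/class/net/<dev>/queues/rx-<n>/rps_cpus, where a mask usually covers
several CPUs rather than one. A rough sketch of building such a mask for a
contiguous CPU range; demo_cpu_range_mask() is a hypothetical helper, and
which CPUs to include is a tuning decision this sketch does not address.

```c
/* Build a bitmask with bits first..last (inclusive) set. */
unsigned long demo_cpu_range_mask(unsigned int first, unsigned int last)
{
	const unsigned int nbits = sizeof(unsigned long) * 8;
	unsigned long mask = 0;
	unsigned int cpu;

	/* an empty range (first > last) yields 0, which leaves RPS off */
	for (cpu = first; cpu <= last && cpu < nbits; cpu++)
		mask |= 1UL << cpu;
	return mask;
}
```

For CPUs 0 through 3 the mask is 0xf, so the value written to rps_cpus for
that queue would be "f".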




^ permalink raw reply

* Re: Packet time delays on multi-core systems
From: Alexey Vlasov @ 2010-09-30 12:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Mailing List, netdev
In-Reply-To: <1285796721.5211.156.camel@edumazet-laptop>

On Wed, Sep 29, 2010 at 11:45:21PM +0200, Eric Dumazet wrote:
> But if you send (logged) SYN packets at the same time, this might slow
> down the reception (and answers) of ICMP frames. The LOG target can be
> quite expensive...

Yes, it's clear that some slowdown can appear, but 100 ms is too much,
and this happens at just 200 SYN packets in 2 minutes, as in my example.
On the old servers, whose NICs have no separate tx/rx queues, I don't see
such a situation even at more than 1000 SYN packets per second.

> Does using other rules give the same problem?
>
> iptables -A OUTPUT -p tcp -m tcp --dport 80 --tcp-flags

No, only LOG shows this behavior.

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply

* Re: Packet time delays on multi-core systems
From: Alexey Vlasov @ 2010-09-30 12:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Mailing List, netdev
In-Reply-To: <1285828432.5211.812.camel@edumazet-laptop>

On Thu, Sep 30, 2010 at 08:33:52AM +0200, Eric Dumazet wrote:
> On Thursday 30 September 2010 at 10:24 +0400, Alexey Vlasov wrote:
> > Here I found some dude with the same problem:
> > http://lkml.org/lkml/2010/7/9/340
> > 
> 
> In your opinion it's the same problem.
> 
> But the description you gave is completely different.
> 
> You have time skew only when activating a particular iptables rule.
 
Well, I spread the NIC interrupts, namely the tx/rx queues, across
different processors and got normal ping times even with the LOG rule added.

I also found that overruns are constantly growing; I don't know whether the two are connected.
RX packets:2831439546 errors:0 dropped:134726 overruns:947671733 frame:0
TX packets:2880849825 errors:0 dropped:0 overruns:0 carrier:0

It is rather strange that only one processor was involved; even in top it
was clear that ksoftirqd eats the first processor up to 100%.

Here goes the typical distribution of interrupts on new servers:
           CPU0    CPU1    CPU2    CPU3 ... CPU23
752:         11       0       0       0 ...     0 PCI-MSI-edge eth0
753: 2799366721       0       0       0 ...     0 PCI-MSI-edge eth0-rx3
754: 2821840553       0       0       0 ...     0 PCI-MSI-edge eth0-rx2
755: 2786117044       0       0       0 ...     0 PCI-MSI-edge eth0-rx1
756: 2896099336       0       0       0 ...     0 PCI-MSI-edge eth0-rx0
757: 1808404680       0       0       0 ...     0 PCI-MSI-edge eth0-tx3
758: 1797855130       0       0       0 ...     0 PCI-MSI-edge eth0-tx2
759: 1807222032       0       0       0 ...     0 PCI-MSI-edge eth0-tx1
760: 1820309360       0       0       0 ...     0 PCI-MSI-edge eth0-tx0

On the old ones:
           CPU0       CPU1       CPU2  ...      CPU8
502:  522320256  522384039  522327386  ... 522380267 PCI-MSI-edge eth0

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply

* Re: VLAN packets silently dropped in promiscuous mode
From: Eric Dumazet @ 2010-09-30 12:16 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Roger Luethi, David Miller, Jesse Gross, netdev
In-Reply-To: <4CA46B72.2090100@trash.net>

On Thursday 30 September 2010 at 12:50 +0200, Patrick McHardy wrote:

> This should be fine as long as the packets are properly marked
> with PACKET_OTHERHOST.

Ah, thanks for the tip, Patrick!

I tested the following patch on tg3 and it does the right thing this
time.


[PATCH] vlan: dont drop packets from unknown vlans in promiscuous mode

Roger Luethi noticed packets for unknown VLANs getting silently dropped
even in promiscuous mode.

Check for promiscuous mode in __vlan_hwaccel_rx() and vlan_gro_common()
before drops.

As suggested by Patrick, mark such packets to have skb->pkt_type set to
PACKET_OTHERHOST to make sure they are dropped by IP stack.

Reported-by: Roger Luethi <rl@hellgate.ch>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
---
 net/8021q/vlan_core.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 01ddb04..0eb96f7 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -24,8 +24,11 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
 
 	if (vlan_dev)
 		skb->dev = vlan_dev;
-	else if (vlan_id)
-		goto drop;
+	else if (vlan_id) {
+		if (!(skb->dev->flags & IFF_PROMISC))
+			goto drop;
+		skb->pkt_type = PACKET_OTHERHOST;
+	}
 
 	return (polling ? netif_receive_skb(skb) : netif_rx(skb));
 
@@ -102,8 +105,11 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
 
 	if (vlan_dev)
 		skb->dev = vlan_dev;
-	else if (vlan_id)
-		goto drop;
+	else if (vlan_id) {
+		if (!(skb->dev->flags & IFF_PROMISC))
+			goto drop;
+		skb->pkt_type = PACKET_OTHERHOST;
+	}
 
 	for (p = napi->gro_list; p; p = p->next) {
 		NAPI_GRO_CB(p)->same_flow =



^ permalink raw reply related

* Re: [PATCH] ipv4: remove all rt cache entries on UNREGISTER event
From: Nicolas Dichtel @ 2010-09-30 11:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Octavian Purdila
In-Reply-To: <1285751929.2615.30.camel@edumazet-laptop>

The patch works well in my case.
In fact, it is more proper to return an error to the daemon to let it
know that the packet was not sent.

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

Regards,
Nicolas

Eric Dumazet wrote:
> I found following patch was enough to avoid route being created if
> device is down. This is still racy and needs more thinking.
> 
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index ac6559c..1ee0b1a 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2586,9 +2586,10 @@ static int ip_route_output_slow(struct net *net, struct rtable **rp,
>  			goto out;
>  
>  		/* RACE: Check return value of inet_select_addr instead. */
> -		if (__in_dev_get_rtnl(dev_out) == NULL) {
> +		if (!(dev_out->flags & IFF_UP) || __in_dev_get_rtnl(dev_out) == NULL) {
>  			dev_put(dev_out);
> -			goto out;	/* Wrong error code */
> +			err = -ENETUNREACH;
> +			goto out;
>  		}
>  
>  		if (ipv4_is_local_multicast(oldflp->fl4_dst) ||
> 
> 


^ permalink raw reply

* Re: [PATCH] nf_nat_snmp: fix checksum calculation (v3)
From: Patrick McHardy @ 2010-09-30 11:03 UTC (permalink / raw)
  To: 王韬; +Cc: akpm, bugzilla-daemon, bugme-daemon, netdev, shemminger
In-Reply-To: <AANLkTi=FbjWdAVfmoBK8E21gDEqxYyDsUyoJsGAfRumW@mail.gmail.com>

On 30.09.2010 10:25, 王韬 wrote:
> Dear Patrick,
> 
> I am Clark Wang. I have one question: which version of the kernel will
> contain this patch?

It's currently contained in the net-2.6.git tree, so it should be
in the final .36 release.

^ permalink raw reply

* Re: VLAN packets silently dropped in promiscuous mode
From: Patrick McHardy @ 2010-09-30 10:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Roger Luethi, David Miller, Jesse Gross, netdev
In-Reply-To: <1285843234.2615.278.camel@edumazet-laptop>

On 30.09.2010 12:40, Eric Dumazet wrote:
> On Thursday 30 September 2010 at 11:55 +0200, Eric Dumazet wrote:
>> On Thursday 30 September 2010 at 11:16 +0200, Eric Dumazet wrote:
>>
>>> Agreed
>>>
>>> Could you try following patch, based on net-next-2.6 ?
>>
>> Here is the official patch I cooked for net-2.6 (linux-2.6)
>>
>> I tested it successfully with a tg3 NIC.
>>
>> Thanks
>>
>> [PATCH] vlan: dont drop packets from unknown vlans in promiscuous mode
>>
>> Roger Luethi noticed packets for unknown VLANs getting silently dropped
>> even in promiscuous mode.
>>
>> Check for promiscuous mode in __vlan_hwaccel_rx() and vlan_gro_common()
>> before drops.
>>
>> Reported-by: Roger Luethi <rl@hellgate.ch>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> ---
>>  net/8021q/vlan_core.c |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
>> index 01ddb04..ba502b4 100644
>> --- a/net/8021q/vlan_core.c
>> +++ b/net/8021q/vlan_core.c
>> @@ -24,7 +24,7 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
>>  
>>  	if (vlan_dev)
>>  		skb->dev = vlan_dev;
>> -	else if (vlan_id)
>> +	else if (vlan_id && !(skb->dev->flags & IFF_PROMISC))
>>  		goto drop;
>>  
>>  	return (polling ? netif_receive_skb(skb) : netif_rx(skb));
>> @@ -102,7 +102,7 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
>>  
>>  	if (vlan_dev)
>>  		skb->dev = vlan_dev;
>> -	else if (vlan_id)
>> +	else if (vlan_id && !(skb->dev->flags & IFF_PROMISC))
>>  		goto drop;
>>  
>>  	for (p = napi->gro_list; p; p = p->next) {
>>
> 
> 
> Hmm, packets are delivered not only to tcpdump but also on other stacks,
> on ethX.
> 
> So this is a domain violation.

This should be fine as long as the packets are properly marked
with PACKET_OTHERHOST.
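
The decision being discussed can be modelled in user space as a small
pure function. This is only a sketch of the logic, not kernel code: the
enum rx_verdict type and vlan_rx_verdict() are hypothetical names, and
the boolean arguments stand in for the vlan_dev lookup result and for
the IFF_PROMISC test on skb->dev->flags.

```c
#include <stdbool.h>

/* Hypothetical verdicts; the kernel acts on the skb directly instead. */
enum rx_verdict {
	RX_DELIVER,		/* deliver unchanged */
	RX_DELIVER_OTHERHOST,	/* deliver, marked as PACKET_OTHERHOST */
	RX_DROP			/* silently drop */
};

static enum rx_verdict vlan_rx_verdict(bool have_vlan_dev,
				       unsigned int vlan_id, bool promisc)
{
	/* A configured VLAN device exists: normal delivery on that device. */
	if (have_vlan_dev)
		return RX_DELIVER;

	/* Tagged for an unknown VLAN. */
	if (vlan_id) {
		if (!promisc)
			return RX_DROP;
		/* Sniffers still see it; the IP stack ignores OTHERHOST. */
		return RX_DELIVER_OTHERHOST;
	}

	/* vlan_id == 0 without a VLAN device: delivered as before. */
	return RX_DELIVER;
}
```

Marking rather than dropping keeps the frame visible to packet taps such
as tcpdump while upper-layer protocols discard it.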

^ permalink raw reply

* Re: VLAN packets silently dropped in promiscuous mode
From: Eric Dumazet @ 2010-09-30 10:40 UTC (permalink / raw)
  To: Roger Luethi; +Cc: David Miller, Jesse Gross, netdev, Patrick McHardy
In-Reply-To: <1285840532.2615.196.camel@edumazet-laptop>

On Thursday 30 September 2010 at 11:55 +0200, Eric Dumazet wrote:
> On Thursday 30 September 2010 at 11:16 +0200, Eric Dumazet wrote:
> 
> > Agreed
> > 
> > Could you try following patch, based on net-next-2.6 ?
> 
> Here is the official patch I cooked for net-2.6 (linux-2.6)
> 
> I tested it successfully with a tg3 NIC.
> 
> Thanks
> 
> [PATCH] vlan: dont drop packets from unknown vlans in promiscuous mode
> 
> Roger Luethi noticed packets for unknown VLANs getting silently dropped
> even in promiscuous mode.
> 
> Check for promiscuous mode in __vlan_hwaccel_rx() and vlan_gro_common()
> before drops.
> 
> Reported-by: Roger Luethi <rl@hellgate.ch>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/8021q/vlan_core.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
> index 01ddb04..ba502b4 100644
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -24,7 +24,7 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
>  
>  	if (vlan_dev)
>  		skb->dev = vlan_dev;
> -	else if (vlan_id)
> +	else if (vlan_id && !(skb->dev->flags & IFF_PROMISC))
>  		goto drop;
>  
>  	return (polling ? netif_receive_skb(skb) : netif_rx(skb));
> @@ -102,7 +102,7 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
>  
>  	if (vlan_dev)
>  		skb->dev = vlan_dev;
> -	else if (vlan_id)
> +	else if (vlan_id && !(skb->dev->flags & IFF_PROMISC))
>  		goto drop;
>  
>  	for (p = napi->gro_list; p; p = p->next) {
> 


Hmm, packets are delivered not only to tcpdump but also on other stacks,
on ethX.

So this is a domain violation.

Patch is not applicable as is.




^ permalink raw reply

* Re: VLAN packets silently dropped in promiscuous mode
From: Eric Dumazet @ 2010-09-30  9:55 UTC (permalink / raw)
  To: Roger Luethi, David Miller; +Cc: Jesse Gross, netdev, Patrick McHardy
In-Reply-To: <1285838215.2615.126.camel@edumazet-laptop>

On Thursday 30 September 2010 at 11:16 +0200, Eric Dumazet wrote:

> Agreed
> 
> Could you try following patch, based on net-next-2.6 ?

Here is the official patch I cooked for net-2.6 (linux-2.6)

I tested it successfully with a tg3 NIC.

Thanks

[PATCH] vlan: dont drop packets from unknown vlans in promiscuous mode

Roger Luethi noticed packets for unknown VLANs getting silently dropped
even in promiscuous mode.

Check for promiscuous mode in __vlan_hwaccel_rx() and vlan_gro_common()
before drops.

Reported-by: Roger Luethi <rl@hellgate.ch>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/8021q/vlan_core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 01ddb04..ba502b4 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -24,7 +24,7 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
 
 	if (vlan_dev)
 		skb->dev = vlan_dev;
-	else if (vlan_id)
+	else if (vlan_id && !(skb->dev->flags & IFF_PROMISC))
 		goto drop;
 
 	return (polling ? netif_receive_skb(skb) : netif_rx(skb));
@@ -102,7 +102,7 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
 
 	if (vlan_dev)
 		skb->dev = vlan_dev;
-	else if (vlan_id)
+	else if (vlan_id && !(skb->dev->flags & IFF_PROMISC))
 		goto drop;
 
 	for (p = napi->gro_list; p; p = p->next) {



^ permalink raw reply related

* Re: [PATCH] de2104x: disable media debug messages by default
From: Michał Mirosław @ 2010-09-30  9:46 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: jgarzik, netdev, Kernel development list
In-Reply-To: <201009282018.57547.linux@rainbow-software.org>

2010/9/28 Ondrej Zary <linux@rainbow-software.org>:
> Print media debug messages only when HW debug is enabled.
>
> Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
>
> --- linux-2.6.36-rc3-/drivers/net/tulip/de2104x.c       2010-09-28 19:50:51.000000000 +0200
> +++ linux-2.6.36-rc3/drivers/net/tulip/de2104x.c        2010-09-28 20:05:34.000000000 +0200
> @@ -948,8 +948,9 @@ static void de_set_media (struct de_priv
>        else
>                macmode &= ~FullDuplex;
>
> -       if (netif_msg_link(de)) {
> +       if (netif_msg_link(de))
>                dev_info(&de->dev->dev, "set link %s\n", media_name[media]);

You can use netif_info(de, link, de->dev, ...) instead and get the
'ethX: ' prefix for free.

> +       if (netif_msg_hw(de)) {
>                dev_info(&de->dev->dev, "mode 0x%x, sia 0x%x,0x%x,0x%x,0x%x\n",
>                         dr32(MacMode), dr32(SIAStatus),
>                         dr32(CSR13), dr32(CSR14), dr32(CSR15));

Same here.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: Something wrong with tx/rx/sg/gso with dhclient etc (Was: Linux 2.6.35.6/e1000e does not receive replies from DHCP server, 2.6.33 works)
From: Sami Farin @ 2010-09-30  9:44 UTC (permalink / raw)
  To: Tantilov, Emil S; +Cc: Sami Farin, e1000-devel, linux-kernel, netdev
In-Reply-To: <EA929A9653AAE14F841771FB1DE5A13660222E867B@rrsmsx501.amr.corp.intel.com>

On Wed, Sep 29, 2010 at 17:40:08 -0600, Tantilov, Emil S wrote:
> Sami Farin wrote:
> > False alarm, 2.6.35.6+latest git e1000e does not work any better.
> > I was just lucky.
> 
> Is your `latest git` from Linus or net-next tree? 

Linus.

> All of our latest patches go into net-next, so if you haven't already, give it a try and see if it resolves your issue.

Maybe later =)

> > 
> > One thing was common: when it works, I get this line shortly
> > before dhclient starts working:
> > 
> > eth0: IPv6 duplicate address fe80::219:d1ff:fe00:5f01 detected!
> 
> I have a system with the same device ID (it is not the same board) and could not reproduce any issues with DHCP on 2.6.35.6 kernel. 

Okay, thanks for trying.
 
> Aside from checking the latest net-next tree, there are some other things to look into:
> 
> 1. Is AMT enabled - there is usually a manageability tab/option in the BIOS. If you have that option try enabling/disabling it and see if it makes a difference.

I believe I haven't used AMT, but I will check that option.

> 2. Make sure your BIOS is up to date.

Latest .ISO update which worked was CO6079P from Aug 2008, maybe I can
get the USB boot to work..

> If any of the above does not help your situation please file a bug at e1000.sf.net and include the following information:

I find it odd I stop seeing the reply packets just like that..
Can this be e1000e bug/feature or something else?
For example, I did "make oldconfig" in net-next, and:

----------------
CONFIG_NETFILTER_XT_TARGET_CHECKSUM:

This option adds a `CHECKSUM' target, which can be used in the iptables mangle table.

You can use this target to compute and fill in the checksum in
a packet that lacks a checksum.  This is particularly useful,
if you need to work around old applications such as dhcp clients,       <<=== here
that do not work well with checksum offloads, but don't want to disable
checksum offload in your device.
----------------

But not even tcpdump sees the packets..  shouldn't it see them despite
rx/tx settings?  What could be eating the packets?

FYI, I just played with "rx off tx off sg off gso off" and
"rx on tx on sg on gso on". When the options were all on, I could not
even get an ARP reply from my router! Two seconds after I turned them off,
everything started working. dhclient also worked, now that I tried, with all off.
I believe there is something fishy in the rx, tx, sg and/or gso features,
which worked in 2.6.33 AFAICT. Am I sounding ambiguous? ;)

> 1. lspci -vvv
> 2. ethtool -e eth0
> 3. there is a tool called ethregs which you can download from this site. If you can, include the output of ethregs -s 00:19.0 
> 4. kernel config
> 5. anything that you think may be related - like setup, type of traffic etc.
> 
> Thanks,
> Emil

-- 
Do what you love because life is too short for anything else.



^ permalink raw reply

* Re: VLAN packets silently dropped in promiscuous mode
From: Eric Dumazet @ 2010-09-30  9:16 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Jesse Gross, netdev, Patrick McHardy
In-Reply-To: <20100930080703.GA10827@core.hellgate.ch>

On Thursday 30 September 2010 at 10:07 +0200, Roger Luethi wrote:
> On Wed, 29 Sep 2010 10:44:26 -0700, Jesse Gross wrote:
> > On Wed, Sep 29, 2010 at 4:37 AM, Roger Luethi <rl@hellgate.ch> wrote:
> > > I noticed packets for unknown VLANs getting silently dropped even in
> > > promiscuous mode (this is true only for the hardware accelerated path).
> > > netif_nit_deliver was introduced specifically to prevent that, but the
> > > function gets called only _after_ packets from unknown VLANs have been
> > > dropped.
> > 
> > Some drivers are fixing this on a case by case basis by disabling
> > hardware accelerated VLAN stripping when in promiscuous mode, i.e.:
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5f6c01819979afbfec7e0b15fe52371b8eed87e8
> > 
> > However, at this point it is more or less random which drivers do
> > this.  It would obviously be much better if it were consistent.
> 
> My understanding is this. Hardware VLAN tagging and stripping can always be
> enabled. The kernel passes 802.1Q information along with the stripped
> header to libpcap which reassembles the original header where necessary.
> Works for me.
> 
> Hardware VLAN filtering, on the other hand, must be disabled in promiscuous
> mode. But doing that in the driver makes no difference now as the current
> VLAN code drops the packets so preserved before they are passed to the pcap
> interface. That appears to be a bug.

Agreed

Could you try following patch, based on net-next-2.6 ?

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 0eb486d..fabdedb 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -101,7 +101,7 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
 
 	if (vlan_dev)
 		skb->dev = vlan_dev;
-	else if (vlan_id)
+	else if (vlan_id && !(skb->dev->flags & IFF_PROMISC))
 		goto drop;
 
 	for (p = napi->gro_list; p; p = p->next) {



^ permalink raw reply related

* Re: [MeeGo-Dev][PATCH v3] Topcliff: Update PCH_CAN driver to 2.6.35
From: Wolfgang Grandegger @ 2010-09-30  9:10 UTC (permalink / raw)
  To: Masayuki Ohtak
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w, Samuel Ortiz,
	margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, Tomoya MORINAGA,
	meego-dev-WXzIur8shnEAvxtiuMwx3w, David S. Miller,
	Christian Pellegrin, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <4C9C7C6F.1000003-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>

Hi Ohtake,

here comes my review, sorry for the delay.

On 09/24/2010 12:24 PM, Masayuki Ohtak wrote:
> Hi Wolfgang and Marc,
> 
> We have modified a large part of our driver based on other accepted Socket CAN drivers.
> Additionally, we have reduced the number of lines from 3601 to 1444.

Much better, but I believe it could be reduced even further.

> Please check below.
> 
> Thanks, Ohtake(OKISemi)
> 
> ---
> CAN driver of Topcliff PCH
> 
> Topcliff PCH is the platform controller hub that is going to be used in
> Intel's upcoming general embedded platform. All IO peripherals in
> Topcliff PCH are actually devices sitting on AMBA bus. 
> Topcliff PCH has CAN I/F. This driver enables CAN function.
> 
> Signed-off-by: Masayuki Ohtake <masa-korg-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
> ---
>  drivers/net/can/Kconfig   |    8 +
>  drivers/net/can/Makefile  |    1 +
>  drivers/net/can/pch_can.c | 1444 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 1453 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/can/pch_can.c
> 
> diff --git a/drivers/net/can/Kconfig b/drivers/net/can/Kconfig
> index 2c5227c..5c98a20 100644
> --- a/drivers/net/can/Kconfig
> +++ b/drivers/net/can/Kconfig
> @@ -73,6 +73,14 @@ config CAN_JANZ_ICAN3
>  	  This driver can also be built as a module. If so, the module will be
>  	  called janz-ican3.ko.
>  
> +config PCH_CAN
> +	tristate "PCH CAN"
> +	depends on  CAN_DEV
> +	---help---
> +	  This driver is for PCH CAN of Topcliff which is an IOH for x86
> +	  embedded processor.
> +	  This driver can access CAN bus.
> +
>  source "drivers/net/can/mscan/Kconfig"
>  
>  source "drivers/net/can/sja1000/Kconfig"
> diff --git a/drivers/net/can/Makefile b/drivers/net/can/Makefile
> index 9047cd0..3ddc6a7 100644
> --- a/drivers/net/can/Makefile
> +++ b/drivers/net/can/Makefile
> @@ -16,5 +16,6 @@ obj-$(CONFIG_CAN_TI_HECC)	+= ti_hecc.o
>  obj-$(CONFIG_CAN_MCP251X)	+= mcp251x.o
>  obj-$(CONFIG_CAN_BFIN)		+= bfin_can.o
>  obj-$(CONFIG_CAN_JANZ_ICAN3)	+= janz-ican3.o
> +obj-$(CONFIG_PCH_CAN)		+= pch_can.o

Please provide patches against David Miller's "net-next-2.6" GIT tree and
use the prefix "can: " in your subject next time. See
http://svn.berlios.de/wsvn/socketcan/trunk/README.submitting-patches
for further information.

>  ccflags-$(CONFIG_CAN_DEBUG_DEVICES) := -DDEBUG
> diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
> new file mode 100644
> index 0000000..8c1731b
> --- /dev/null
> +++ b/drivers/net/can/pch_can.c
> @@ -0,0 +1,1444 @@
> +/*
> + * Copyright (C) 1999 - 2010 Intel Corporation.
> + * Copyright (C) 2010 OKI SEMICONDUCTOR Co., LTD.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307, USA.
> + */
> +
> +#include <linux/interrupt.h>
> +#include <linux/delay.h>
> +#include <linux/io.h>
> +#include <linux/module.h>
> +#include <linux/sched.h>
> +#include <linux/pci.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/netdevice.h>
> +#include <linux/skbuff.h>
> +#include <linux/can.h>
> +#include <linux/can/dev.h>
> +#include <linux/can/error.h>
> +
> +#define MAX_BITRATE		0x3e8

Dead code? At least it's not used anywhere.

> +
> +#define MAX_MSG_OBJ		32
> +#define MSG_OBJ_RX		0 /* The receive message object flag. */
> +#define MSG_OBJ_TX		1 /* The transmit message object flag. */
> +
> +#define ENABLE			1 /* The enable flag */
> +#define DISABLE			0 /* The disable flag */
> +#define CAN_CTRL_INIT		0x0001 /* The INIT bit of CANCONT register. */
> +#define CAN_CTRL_IE		0x0002 /* The IE bit of CAN control register */
> +#define CAN_CTRL_IE_SIE_EIE	0x000e
> +#define CAN_CTRL_CCE		0x0040
> +#define CAN_CTRL_OPT		0x0080 /* The OPT bit of CANCONT register. */
> +#define CAN_OPT_SILENT		0x0008 /* The Silent bit of CANOPT reg. */
> +#define CAN_OPT_LBACK		0x0010 /* The LoopBack bit of CANOPT reg. */
> +#define CAN_CMASK_RX_TX_SET	0x00f3
> +#define CAN_CMASK_RX_TX_GET	0x0073
> +#define CAN_CMASK_ALL		0xff
> +#define CAN_CMASK_RDWR		0x80
> +#define CAN_CMASK_ARB		0x20
> +#define CAN_CMASK_CTRL		0x10
> +#define CAN_CMASK_MASK		0x40
> +
> +#define CAN_IF_MCONT_NEWDAT	0x8000
> +#define CAN_IF_MCONT_INTPND	0x2000
> +#define CAN_IF_MCONT_UMASK		0x1000
> +#define CAN_IF_MCONT_TXIE		0x0800
> +#define CAN_IF_MCONT_RXIE		0x0400
> +#define CAN_IF_MCONT_RMTEN		0x0200
> +#define CAN_IF_MCONT_TXRQXT		0x0100
> +#define CAN_IF_MCONT_EOB		0x0080
> +#define CAN_IF_MCONT_MSGLOST		0x4000
> +#define CAN_MASK2_MDIR_MXTD		0xc000
> +#define CAN_ID2_DIR			0x2000
> +#define CAN_ID_MSGVAL			0x8000
> +
> +#define CAN_STATUS_INT			0x8000
> +#define CAN_IF_CREQ_BUSY		0x8000
> +#define CAN_ID2_XTD			0x4000
> +
> +#define CAN_REC				0x00007f00
> +#define CAN_TEC				0x000000ff
> +
> +#define PCH_RX_OK			0x00000010
> +#define PCH_TX_OK			0x00000008
> +#define PCH_BUS_OFF			0x00000080
> +#define PCH_EWARN			0x00000040
> +#define PCH_EPASSIV			0x00000020

> +#define PCH_LEC0			0x00000001
> +#define PCH_LEC1			0x00000002
> +#define PCH_LEC2			0x00000004
> +#define PCH_LEC_ALL			(PCH_LEC0 | PCH_LEC1 | PCH_LEC2)
> +#define PCH_STUF_ERR			PCH_LEC0
> +#define PCH_FORM_ERR			PCH_LEC1
> +#define PCH_ACK_ERR			(PCH_LEC0 | PCH_LEC1)
> +#define PCH_BIT1_ERR			PCH_LEC2
> +#define PCH_BIT0_ERR			(PCH_LEC0 | PCH_LEC2)
> +#define PCH_CRC_ERR			(PCH_LEC1 | PCH_LEC2)

enum {
	PCH_LEC_STUF_ERR = 0,
	PCH_LEC_FORM_ERR,
	PCH_LEC_ACK_ERR,
	...
	PCH_LEC_ALL
};

Seems more appropriate. More comments below.
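
One possible way to spell such an enum out is sketched here. It is an
assumption about intent, not the driver's actual code: the values are
chosen to keep the numeric codes of the quoted PCH_LEC0..PCH_LEC2
combinations (so the stuff error is 1, not 0), and PCH_LEC_NONE as well
as pch_lec_from_status() are hypothetical names.

```c
/* Last-error codes, numbered like the quoted PCH_LEC0..PCH_LEC2 macros. */
enum pch_lec {
	PCH_LEC_NONE = 0,	/* hypothetical name for "no error" */
	PCH_LEC_STUF_ERR,	/* 1: PCH_LEC0 */
	PCH_LEC_FORM_ERR,	/* 2: PCH_LEC1 */
	PCH_LEC_ACK_ERR,	/* 3: PCH_LEC0 | PCH_LEC1 */
	PCH_LEC_BIT1_ERR,	/* 4: PCH_LEC2 */
	PCH_LEC_BIT0_ERR,	/* 5: PCH_LEC0 | PCH_LEC2 */
	PCH_LEC_CRC_ERR,	/* 6: PCH_LEC1 | PCH_LEC2 */
	PCH_LEC_ALL		/* 7: mask covering all three LEC bits */
};

/* Hypothetical helper: extract the LEC field from a status register value. */
static enum pch_lec pch_lec_from_status(unsigned int stat)
{
	return (enum pch_lec)(stat & PCH_LEC_ALL);
}
```

A switch over the returned value can then replace a chain of mask
comparisons in the error handler.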

> +
> +/* bit position of certain controller bits. */
> +#define BIT_BITT_BRP			0
> +#define BIT_BITT_SJW			6
> +#define BIT_BITT_TSEG1			8
> +#define BIT_BITT_TSEG2			12
> +#define BIT_IF1_MCONT_RXIE		10
> +#define BIT_IF2_MCONT_TXIE		11
> +#define BIT_BRPE_BRPE			6
> +#define BIT_ES_TXERRCNT			0
> +#define BIT_ES_RXERRCNT			8
> +#define MSK_BITT_BRP			0x3f
> +#define MSK_BITT_SJW			0xc0
> +#define MSK_BITT_TSEG1			0xf00
> +#define MSK_BITT_TSEG2			0x7000
> +#define MSK_BRPE_BRPE			0x3c0
> +#define MSK_BRPE_GET			0x0f
> +#define MSK_CTRL_IE_SIE_EIE		0x07
> +#define MSK_MCONT_TXIE			0x08
> +#define MSK_MCONT_RXIE			0x10
> +#define PCH_CAN_NO_TX_BUFF		1
> +#define PCI_DEVICE_ID_INTEL_PCH1_CAN	0x8818
> +#define COUNTER_LIMIT 0xFFFF

Keep alignment?

> +#define PCH_CAN_CLK			50000	/* 50MHz */

Please specify it in Hz already here.

> +
> +/* Total 32 OBJs */
> +#define PCH_RX_OBJ_NUM	1
> +#define PCH_TX_OBJ_NUM	1
> +#define PCH_OBJ_NUM (PCH_TX_OBJ_NUM + PCH_RX_OBJ_NUM)

Please explain briefly which message objects are used for what purpose,
either here or in the initialization code.

> +
> +#define	PCH_CAN_ACTIVE	0
> +#define	PCH_CAN_LISTEN	1
> +#define PCH_CAN_STOP	0
> +#define PCH_CAN_RUN	1
> +
> +#define PCH_CAN_ENABLE	0
> +#define PCH_CAN_DISABLE	1
> +#define PCH_CAN_ALL	2
> +#define PCH_CAN_NONE	3

The above are used in switch cases and should therefore be anonymous
enums. I suggested removing them because I'm not a great fan of
helper functions which are called just *once*.

> +
> +struct pch_can_regs {
> +	u32 cont;
> +	u32 stat;
> +	u32 errc;
> +	u32 bitt;
> +	u32 intr;
> +	u32 opt;
> +	u32 brpe;
> +	u32 reserve1;
> +	u32 if1_creq;
> +	u32 if1_cmask;
> +	u32 if1_mask1;
> +	u32 if1_mask2;
> +	u32 if1_id1;
> +	u32 if1_id2;
> +	u32 if1_mcont;
> +	u32 if1_dataa1;
> +	u32 if1_dataa2;
> +	u32 if1_datab1;
> +	u32 if1_datab2;
> +	u32 reserve2;
> +	u32 reserve3[12];
> +	u32 if2_creq;
> +	u32 if2_cmask;
> +	u32 if2_mask1;
> +	u32 if2_mask2;
> +	u32 if2_id1;
> +	u32 if2_id2;
> +	u32 if2_mcont;
> +	u32 if2_dataa1;
> +	u32 if2_dataa2;
> +	u32 if2_datab1;
> +	u32 if2_datab2;
> +	u32 reserve4;
> +	u32 reserve5[20];
> +	u32 treq1;
> +	u32 treq2;
> +	u32 reserve6[2];
> +	u32 reserve7[56];
> +	u32 reserve8[3];
> +	u32 srst;
> +};

Nice.

> +struct pch_can_priv {
> +	struct can_priv can;
> +	void __iomem *base;
> +	unsigned int can_num;
> +	struct pci_dev *dev;
> +	unsigned int tx_enable[MAX_MSG_OBJ];
> +	unsigned int rx_enable[MAX_MSG_OBJ];
> +	unsigned int rx_link[MAX_MSG_OBJ];
> +	unsigned int int_enables;
> +	unsigned int int_stat;
> +	unsigned int bus_off_interrupt;
> +	struct net_device *ndev;
> +	spinlock_t msgif_reg_lock; /* Message Interface Registers Access Lock*/
> +	unsigned int msg_obj[MAX_MSG_OBJ];
> +	struct pch_can_regs *regs;

Please add __iomem. Do you need both, regs *and* base?

> +};
> +
> +static struct can_bittiming_const pch_can_bittiming_const = {
> +	.name = KBUILD_MODNAME,

Not sure what KBUILD_MODNAME is. Should be "pch_can", the name of the
driver.

> +	.tseg1_min = 1,
> +	.tseg1_max = 16,
> +	.tseg2_min = 1,
> +	.tseg2_max = 8,
> +	.sjw_max = 4,
> +	.brp_min = 1,
> +	.brp_max = 1024, /* 6bit + extended 4bit */
> +	.brp_inc = 1,
> +};
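
The "6bit + extended 4bit" comment gives 2^(6+4) = 1024 prescaler values.
A minimal sketch of how a brp value could be split across the two fields,
using the mask and shift macros quoted earlier in the patch, follows. It
assumes the hardware stores brp - 1, as C_CAN-style controllers commonly
do; that is not confirmed by the patch text shown here. struct brp_fields
and split_brp() are hypothetical names.

```c
#include <stdint.h>

#define MSK_BITT_BRP	0x3f	/* low 6 bits, from the quoted patch */
#define MSK_BRPE_BRPE	0x3c0	/* extension bits, from the quoted patch */
#define BIT_BRPE_BRPE	6	/* shift of the extension, from the quoted patch */

/* Hypothetical container for the two register fields. */
struct brp_fields {
	uint32_t brp;	/* value for the 6-bit BRP field */
	uint32_t brpe;	/* value for the 4-bit BRPE field */
};

/* Split a prescaler in the range 1..1024 into its two register fields. */
static struct brp_fields split_brp(uint32_t brp)
{
	uint32_t raw = brp - 1;	/* assumption: hardware stores brp - 1 */
	struct brp_fields f;

	f.brp = raw & MSK_BITT_BRP;
	f.brpe = (raw & MSK_BRPE_BRPE) >> BIT_BRPE_BRPE;
	return f;
}
```

Under that assumption brp_max = 1024 maps to all ten bits set (0x3f and
0xf), which is why the maximum is 1024 rather than 1023.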
> +
> +static const struct pci_device_id pch_can_pcidev_id[] __devinitdata = {
> +	{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PCH1_CAN)},
> +	{}
> +};

Please use DEFINE_PCI_DEVICE_TABLE.

> +static inline void pch_can_bit_set(u32 *addr, u32 mask)
> +{
> +	iowrite32((ioread32(addr) | mask), addr);

Outer brackets not needed!

> +}
> +
> +static inline void pch_can_bit_clear(u32 *addr, u32 mask)
> +{
> +	iowrite32((ioread32(addr) & ~(mask)), addr);

Ditto.

> +}
> +
> +static void pch_can_set_run_mode(struct pch_can_priv *priv, u32 mode)
> +{
> +	switch (mode) {
> +	case PCH_CAN_RUN:
> +		pch_can_bit_clear(&(priv->regs)->cont, CAN_CTRL_INIT);
> +		break;
> +
> +	case PCH_CAN_STOP:
> +		pch_can_bit_set(&(priv->regs)->cont, CAN_CTRL_INIT);
> +		break;
> +
> +	default:
> +		dev_err(&priv->ndev->dev, "%s -> Invalid Mode.\n", __func__);
> +		break;
> +	}
> +}
> +
> +static void pch_can_get_run_mode(struct pch_can_priv *priv, u32 *mode)
> +{
> +	u32 reg_val = ioread32(&(priv->regs)->cont);

I don't think you need the brackets around "priv->regs". Therefore I
suggest s/&(priv->regs)/&priv->regs/ for the whole file.

> +
> +	if (reg_val & CAN_CTRL_INIT)
> +		*mode = PCH_CAN_STOP;
> +	else
> +		*mode = PCH_CAN_RUN;
> +}

These are the helper functions I complained about above. And reg_val is
not really needed.

> +static void pch_can_set_optmode(struct pch_can_priv *priv)
> +{
> +	u32 reg_val = ioread32(&(priv->regs)->opt);
> +
> +	if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
> +		reg_val |= CAN_OPT_SILENT;
> +
> +	if (priv->can.ctrlmode & CAN_CTRLMODE_LOOPBACK)
> +		reg_val |= CAN_OPT_LBACK;
> +
> +	pch_can_bit_set(&(priv->regs)->cont, CAN_CTRL_OPT);
> +	iowrite32(reg_val, &(priv->regs)->opt);
> +}
> +
> +static void pch_can_set_int_custom(struct pch_can_priv *priv)
> +{
> +	/* Clearing the IE, SIE and EIE bits of Can control register. */
> +	pch_can_bit_clear(&(priv->regs)->cont, CAN_CTRL_IE_SIE_EIE);
> +
> +	/* Appropriately setting them. */
> +	pch_can_bit_set(&(priv->regs)->cont,
> +			((priv->int_enables & MSK_CTRL_IE_SIE_EIE) << 1));
> +}
> +
> +/* This function retrieves interrupt enabled for the CAN device. */
> +static void pch_can_get_int_enables(struct pch_can_priv *priv, u32 *enables)
> +{
> +	u32 reg_ctrl_val = ioread32(&(priv->regs)->cont);
> +
> +	/* Obtaining the status of IE, SIE and EIE interrupt bits. */
> +	*enables = ((reg_ctrl_val & CAN_CTRL_IE_SIE_EIE) >> 1);

Do you really need an extra variable?

> +}
> +
> +static void pch_can_set_int_enables(struct pch_can_priv *priv, u32 interrupt_no)
> +{
> +	switch (interrupt_no) {
> +	case PCH_CAN_ENABLE:
> +		pch_can_bit_set(&(priv->regs)->cont, CAN_CTRL_IE);
> +		break;
> +
> +	case PCH_CAN_DISABLE:
> +		pch_can_bit_clear(&(priv->regs)->cont, CAN_CTRL_IE);
> +		break;
> +
> +	case PCH_CAN_ALL:
> +		pch_can_bit_set(&(priv->regs)->cont, CAN_CTRL_IE_SIE_EIE);
> +		break;
> +
> +	case PCH_CAN_NONE:
> +		pch_can_bit_clear(&(priv->regs)->cont, CAN_CTRL_IE_SIE_EIE);
> +		break;
> +
> +	default:
> +		dev_err(&priv->ndev->dev, "Invalid interrupt number.\n");
> +		break;
> +	}
> +}
> +
> +static void pch_can_check_if1_busy(struct pch_can_priv *priv, u32 num)
> +{
> +	u32 counter = COUNTER_LIMIT;
> +	u32 if1_creq;
> +
> +	iowrite32(num, &(priv->regs)->if1_creq);
> +	while (counter) {
> +		if1_creq = (ioread32(&(priv->regs)->if1_creq)) &
> +				     CAN_IF_CREQ_BUSY;
> +		if (!if1_creq)
> +			break;
> +		counter--;
> +	}
> +	if (!counter)
> +		dev_err(&priv->ndev->dev, "IF1 BUSY Flag is set forever.\n");

Please use a defined delay for the above timeout. How long does it
usually take the bit to toggle? A small delay, e.g. udelay(1) could be
fine. This function is called in the time critical path!
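
As a rough illustration of what I mean by a defined delay, here is a
user-space sketch. The register access and the delay are stand-ins
(reg_read() and delay_us() are hypothetical; in the driver they would be
ioread32() and udelay()), and both the BUSY bit position and the 100 us
budget are assumptions that need checking against the data sheet:

```c
#include <stdbool.h>
#include <stdint.h>

#define CAN_IF_CREQ_BUSY   0x8000u /* assumed position of the BUSY flag */
#define IF_BUSY_TIMEOUT_US 100u    /* hypothetical budget in microseconds */

/* Stand-ins so the sketch builds outside the kernel. */
static uint32_t fake_creq;         /* plays the role of the IFx_CREQ register */
static unsigned int busy_reads;    /* number of reads that still report BUSY */
static unsigned int waited_us;     /* total simulated delay */

static uint32_t reg_read(const uint32_t *reg)
{
	if (busy_reads) {
		busy_reads--;
		return *reg | CAN_IF_CREQ_BUSY;
	}
	return *reg & ~CAN_IF_CREQ_BUSY;
}

static void delay_us(unsigned int us)
{
	waited_us += us;
}

/* Returns true once BUSY clears, false when the time budget is used up. */
static bool wait_if_not_busy(uint32_t *creq, uint32_t num)
{
	unsigned int budget = IF_BUSY_TIMEOUT_US;

	*creq = num;			/* iowrite32(num, creq) in the driver */
	while (reg_read(creq) & CAN_IF_CREQ_BUSY) {
		if (!budget--)
			return false;	/* timed out after a known duration */
		delay_us(1);
	}
	return true;
}
```

With this shape the worst case is bounded in time rather than in loop
iterations, and the caller can decide what to do on failure.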

> +}
> +
> +static void pch_can_check_if2_busy(struct pch_can_priv *priv, u32 num)
> +{
> +	u32 counter = COUNTER_LIMIT;
> +	u32 if2_creq;
> +
> +	iowrite32(num, &(priv->regs)->if2_creq);
> +	while (counter) {
> +		if2_creq = (ioread32(&(priv->regs)->if2_creq)) &
> +				     CAN_IF_CREQ_BUSY;
> +		if (!if2_creq)
> +			break;
> +		counter--;
> +	}
> +	if (!counter)
> +		dev_err(&priv->ndev->dev, "IF2 BUSY Flag is set forever.\n");
> +}

Duplicated code!
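
The two functions differ only in the register they poll, so one helper
taking the register address should do. A minimal user-space sketch,
assuming the BUSY bit position and using plain memory accesses in place
of iowrite32()/ioread32(); the name pch_can_check_if_busy and the struct
are made up for the example:

```c
#include <stdint.h>

#define CAN_IF_CREQ_BUSY 0x8000u /* assumed position of the BUSY flag */
#define COUNTER_LIMIT    10u     /* placeholder, the patch has its own value */

/* Only the two command request registers are modelled here. */
struct fake_if_regs {
	volatile uint32_t if1_creq;
	volatile uint32_t if2_creq;
};

/* One helper for both interfaces: returns 0 on success, -1 on timeout. */
static int pch_can_check_if_busy(volatile uint32_t *creq, uint32_t num)
{
	uint32_t counter = COUNTER_LIMIT;

	*creq = num;				/* iowrite32(num, creq) */
	while (counter) {
		if (!(*creq & CAN_IF_CREQ_BUSY))	/* ioread32(creq) */
			return 0;
		counter--;
	}
	return -1;	/* the caller knows which interface it asked for */
}
```

The callers would then pass &priv->regs->if1_creq or
&priv->regs->if2_creq as needed.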

> +static void pch_can_set_rx_enable(struct pch_can_priv *priv, u32 buff_num,
> +				  u32 set)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> +	/*Reading the receive buffer data from RAM to Interface1 registers */

Space after /* ?

> +	iowrite32(CAN_CMASK_RX_TX_GET, &(priv->regs)->if1_cmask);
> +	pch_can_check_if1_busy(priv, buff_num); /* Read from MsgRAN */
> +
> +	/* Setting the IF1MASK1 register to access MsgVal and RxIE bits */
> +	iowrite32((CAN_CMASK_RDWR | CAN_CMASK_ARB | CAN_CMASK_CTRL),
> +		  (&(priv->regs)->if1_cmask));
> +
> +	if (set == ENABLE) {
> +		/* Setting the MsgVal and RxIE bits */
> +		pch_can_bit_set(&(priv->regs)->if1_mcont, CAN_IF_MCONT_RXIE);
> +		pch_can_bit_set(&(priv->regs)->if1_id2, CAN_ID_MSGVAL);
> +
> +	} else if (set == DISABLE) {
> +		/* Resetting the MsgVal and RxIE bits */
> +		pch_can_bit_clear(&(priv->regs)->if1_mcont, CAN_IF_MCONT_RXIE);
> +		pch_can_bit_clear(&(priv->regs)->if1_id2, CAN_ID_MSGVAL);
> +	}
> +
> +	pch_can_check_if1_busy(priv, buff_num); /* Write to MsgRAM */
> +	spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> +}
> +
> +static void pch_can_rx_enable_all(struct pch_can_priv *priv)
> +{
> +	u32 i;
> +
> +	/* Traversing to obtain the object configured as receivers. */
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_RX)
> +			pch_can_set_rx_enable(priv, i + 1, ENABLE);
> +	}
> +}
> +
> +static void pch_can_rx_disable_all(struct pch_can_priv *priv)
> +{
> +	u32 i;
> +
> +	/* Traversing to obtain the object configured as receivers. */
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_RX)
> +			pch_can_set_rx_enable(priv, (i + 1), DISABLE);
> +	}
> +}
> +
> +static void pch_can_set_tx_enable(struct pch_can_priv *priv, u32 buff_num,
> +				 u32 set)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> +	/* Reading the Msg buffer from Message RAM to Interface2 registers. */
> +	iowrite32(CAN_CMASK_RX_TX_GET, (&(priv->regs)->if1_cmask));
> +	pch_can_check_if1_busy(priv, buff_num);
> +
> +	/* Setting the IF2CMASK register for accessing the
> +		MsgVal and TxIE bits */
> +	iowrite32((CAN_CMASK_RDWR | CAN_CMASK_ARB | CAN_CMASK_CTRL),
> +		 (&(priv->regs)->if1_cmask));
> +
> +	if (set == ENABLE) {
> +		/* Setting the MsgVal and TxIE bits */
> +		pch_can_bit_set(&(priv->regs)->if1_mcont, CAN_IF_MCONT_TXIE);
> +		pch_can_bit_set(&(priv->regs)->if1_id2, CAN_ID_MSGVAL);
> +	} else if (set == DISABLE) {
> +		/* Resetting the MsgVal and TxIE bits. */
> +		pch_can_bit_clear(&(priv->regs)->if1_mcont, CAN_IF_MCONT_TXIE);
> +		pch_can_bit_clear(&(priv->regs)->if1_id2, CAN_ID_MSGVAL);
> +	}
> +
> +	pch_can_check_if1_busy(priv, buff_num); /* Write to MsgRAM */
> +	spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> +}
> +
> +static void pch_can_tx_enable_all(struct pch_can_priv *priv)
> +{
> +	u32 i;
> +
> +	/* Traversing to obtain the object configured as transmit object. */
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_TX)
> +			pch_can_set_tx_enable(priv, (i + 1), ENABLE);
> +	}
> +}
> +
> +static void pch_can_tx_disable_all(struct pch_can_priv *priv)
> +{
> +	u32 i;
> +
> +	/* Traversing to obtain the object configured as transmit object. */
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_TX)
> +			pch_can_set_tx_enable(priv, (i + 1), DISABLE);
> +	}
> +}
> +
> +static void pch_can_get_rx_enable(struct pch_can_priv *priv, u32 buff_num,
> +				 u32 *enable)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> +	iowrite32(CAN_CMASK_RX_TX_GET, (&(priv->regs)->if1_cmask));
> +	pch_can_check_if1_busy(priv, buff_num);
> +
> +	if (((ioread32(&(priv->regs)->if1_id2)) & CAN_ID_MSGVAL) &&
> +			((ioread32(&(priv->regs)->if1_mcont)) &
> +			CAN_IF_MCONT_RXIE))
> +		*enable = ENABLE;
> +	else
> +		*enable = DISABLE;
> +	spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> +}
> +
> +static void pch_can_get_tx_enable(struct pch_can_priv *priv, u32 buff_num,
> +				 u32 *enable)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> +	iowrite32(CAN_CMASK_RX_TX_GET, &(priv->regs)->if1_cmask);
> +	pch_can_check_if1_busy(priv, buff_num);
> +
> +	if (((ioread32(&(priv->regs)->if1_id2)) & CAN_ID_MSGVAL) &&
> +			((ioread32(&(priv->regs)->if1_mcont)) &
> +			CAN_IF_MCONT_TXIE)) {
> +		*enable = ENABLE;
> +	} else {
> +		*enable = DISABLE;
> +	}
> +	spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> +}
> +
> +static int pch_can_int_pending(struct pch_can_priv *priv)
> +{
> +	return ioread32(&(priv->regs)->intr) & 0xffff;
> +}
> +
> +static void pch_can_set_rx_buffer_link(struct pch_can_priv *priv,
> +				       u32 buffer_num, u32 set)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> +	iowrite32(CAN_CMASK_RX_TX_GET, &(priv->regs)->if1_cmask);
> +	pch_can_check_if1_busy(priv, buffer_num);
> +	iowrite32((CAN_CMASK_RDWR | CAN_CMASK_CTRL), &(priv->regs)->if1_cmask);
> +	if (set == ENABLE)
> +		pch_can_bit_clear(&(priv->regs)->if1_mcont, CAN_IF_MCONT_EOB);
> +	else
> +		pch_can_bit_set(&(priv->regs)->if1_mcont, CAN_IF_MCONT_EOB);
> +
> +	pch_can_check_if1_busy(priv, buffer_num);
> +	spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> +}
> +
> +static void pch_can_get_rx_buffer_link(struct pch_can_priv *priv,
> +				       u32 buffer_num, u32 *link)
> +{
> +	u32 reg_val;

Really needed?

> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> +	iowrite32(CAN_CMASK_RX_TX_GET, &(priv->regs)->if1_cmask);
> +	pch_can_check_if1_busy(priv, buffer_num);
> +
> +	reg_val = ioread32(&(priv->regs)->if1_mcont);
> +	if (reg_val & CAN_IF_MCONT_EOB)
> +		*link = DISABLE;
> +	else
> +		*link = ENABLE;
> +	spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> +}
> +
> +static void pch_can_clear_buffers(struct pch_can_priv *priv)
> +{
> +	u32 i;
> +	u32 rx_buff_num;
> +	u32 tx_buff_num;

Really needed?

> +
> +	iowrite32(CAN_CMASK_RX_TX_SET, &(priv->regs)->if1_cmask);
> +	iowrite32(CAN_CMASK_RX_TX_SET, &(priv->regs)->if2_cmask);
> +	iowrite32(0xffff, &(priv->regs)->if1_mask1);
> +	iowrite32(0xffff, &(priv->regs)->if1_mask2);
> +	iowrite32(0xffff, &(priv->regs)->if2_mask1);
> +	iowrite32(0xffff, &(priv->regs)->if2_mask2);
> +
> +	iowrite32(0x0, &(priv->regs)->if1_id1);
> +	iowrite32(0x0, &(priv->regs)->if1_id2);
> +	iowrite32(0x0, &(priv->regs)->if2_id1);
> +	iowrite32(0x0, &(priv->regs)->if2_id2);
> +	iowrite32(0x0, &(priv->regs)->if1_mcont);
> +	iowrite32(0x0, &(priv->regs)->if2_mcont);
> +	iowrite32(0x0, &(priv->regs)->if1_dataa1);
> +	iowrite32(0x0, &(priv->regs)->if1_dataa2);
> +	iowrite32(0x0, &(priv->regs)->if1_datab1);
> +	iowrite32(0x0, &(priv->regs)->if1_datab2);
> +	iowrite32(0x0, &(priv->regs)->if2_dataa1);
> +	iowrite32(0x0, &(priv->regs)->if2_dataa2);
> +	iowrite32(0x0, &(priv->regs)->if2_datab1);
> +	iowrite32(0x0, &(priv->regs)->if2_datab2);
> +
> +	for (i = 1; i <= (MAX_MSG_OBJ / 2); i++) {
> +		rx_buff_num = 2 * i;
> +		tx_buff_num = (2 * i) - 1;
> +
> +		iowrite32(rx_buff_num, &(priv->regs)->if1_creq);
> +		iowrite32(tx_buff_num, &(priv->regs)->if2_creq);
> +
> +		mdelay(10);
> +	}
> +}
> +
> +static void pch_can_config_rx_tx_buffers(struct pch_can_priv *priv)
> +{
> +	u32 i;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> +	/* For accssing MsgVal, ID and EOB bit */
> +	iowrite32((CAN_CMASK_RDWR | CAN_CMASK_ARB | CAN_CMASK_CTRL),
> +		 (&(priv->regs)->if1_cmask));
> +	iowrite32((CAN_CMASK_RDWR | CAN_CMASK_ARB | CAN_CMASK_CTRL),
> +		 (&(priv->regs)->if2_cmask));
> +	iowrite32(0x0, (&(priv->regs)->if1_id1));
> +	iowrite32(0x0, (&(priv->regs)->if1_id2));
> +
> +	/* Resetting DIR bit for reception */
> +	iowrite32(0x0, (&(priv->regs)->if2_id1));
> +
> +	/* Setting DIR bit for transmission */
> +	iowrite32((CAN_ID2_DIR | (0x7ff << 2)),
> +				(&(priv->regs)->if2_id2));
> +
> +	/* Setting EOB bit for receiver */
> +	iowrite32(CAN_IF_MCONT_EOB, &(priv->regs)->if1_mcont);
> +
> +	/* Setting EOB bit for transmitter */
> +	iowrite32(CAN_IF_MCONT_EOB, (&(priv->regs)->if2_mcont));
> +
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_RX)
> +			pch_can_check_if1_busy(priv, i + 1);
> +		else if (priv->msg_obj[i] == MSG_OBJ_TX)
> +			pch_can_check_if2_busy(priv, i + 1);
> +		else
> +			dev_err(&priv->ndev->dev, "Invalid OBJ\n");
> +	}
> +
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_RX) {
> +			iowrite32(CAN_CMASK_RX_TX_GET,
> +				&(priv->regs)->if1_cmask);
> +			pch_can_check_if1_busy(priv, i+1);
> +
> +			pch_can_bit_clear(&(priv->regs)->if1_id2, 0x1fff);
> +			pch_can_bit_clear(&(priv->regs)->if1_id2, CAN_ID2_XTD);

Couldn't it be set with just one call?

> +			iowrite32(0, (&(priv->regs)->if1_id1));
> +			pch_can_bit_set(&(priv->regs)->if1_id2, 0);
> +			pch_can_bit_set(&(priv->regs)->if1_mcont,
> +					CAN_IF_MCONT_UMASK);
> +			pch_can_bit_set(&(priv->regs)->if2_mcont,
> +					CAN_IF_MCONT_UMASK);
> +
> +			iowrite32(0, &(priv->regs)->if1_mask1);
> +			pch_can_bit_clear(&(priv->regs)->if1_mask2, 0x1fff);
> +			pch_can_bit_clear(&(priv->regs)->if1_mask2,
> +					  CAN_MASK2_MDIR_MXTD);
> +
> +			iowrite32(0, &(priv->regs)->if2_mask1);
> +			pch_can_bit_clear(&(priv->regs)->if2_mask2, 0x1fff);
> +
> +			/* Setting CMASK for writing */
> +			iowrite32((CAN_CMASK_RDWR | CAN_CMASK_MASK |
> +				   CAN_CMASK_ARB | CAN_CMASK_CTRL),
> +				  (&(priv->regs)->if1_cmask));
> +
> +			pch_can_check_if1_busy(priv, i+1);
> +		}
> +	}
> +	spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> +}
> +
> +static void pch_can_open(struct pch_can_priv *priv)

Probably pch_can_init is the better name.

> +{
> +	/* Stopping the Can device. */
> +	pch_can_set_run_mode(priv, PCH_CAN_STOP);
> +
> +	/* Clearing all the message object buffers. */
> +	pch_can_clear_buffers(priv);
> +
> +	/* Configuring the respective message object as either rx/tx object. */
> +	pch_can_config_rx_tx_buffers(priv);
> +
> +	/* Enabling all receive objects. */
> +	pch_can_rx_enable_all(priv);
> +
> +	/* Enabling all transmit objects. */
> +	pch_can_tx_enable_all(priv);
> +
> +	/* Enabling the interrupts. */
> +	pch_can_set_int_enables(priv, PCH_CAN_ALL);
> +
> +	/* Setting the CAN to run mode. */
> +	pch_can_set_run_mode(priv, PCH_CAN_RUN);

Hm, you start the controller here... more later.

> +}
> +
> +static void pch_can_release(struct pch_can_priv *priv)
> +{
> +	/* Stooping the CAN device. */
> +	pch_can_set_run_mode(priv, PCH_CAN_STOP);
> +
> +	/* Disabling the interrupts. */
> +	pch_can_set_int_enables(priv, PCH_CAN_NONE);
> +
> +	/* Disabling all the receive object. */
> +	pch_can_rx_disable_all(priv);
> +
> +	/* Disabling all the transmit object. */
> +	pch_can_tx_disable_all(priv);
> +}
> +
> +/* This function clears interrupt(s) from the CAN device. */
> +static void pch_can_int_clr(struct pch_can_priv *priv, u32 mask)
> +{
> +	if (mask == CAN_STATUS_INT) {
> +		ioread32(&(priv->regs)->stat);
> +		return;
> +	}
> +
> +	/* Clear interrupt for transmit object */
> +	if (priv->msg_obj[mask - 1] == MSG_OBJ_TX) {
> +		/* Setting CMASK for clearing interrupts for
> +					 frame transmission. */
> +		iowrite32((CAN_CMASK_RDWR | CAN_CMASK_CTRL | CAN_CMASK_ARB),
> +					(&(priv->regs)->if2_cmask));
> +
> +		/* Resetting the ID registers. */
> +		pch_can_bit_set(&(priv->regs)->if2_id2,
> +			       (CAN_ID2_DIR | (0x7ff << 2)));
> +		iowrite32(0x0, (&(priv->regs)->if2_id1));
> +
> +		/* Claring NewDat, TxRqst & IntPnd */
> +		pch_can_bit_clear(&(priv->regs)->if2_mcont,
> +				  (CAN_IF_MCONT_NEWDAT | CAN_IF_MCONT_INTPND |
> +				   CAN_IF_MCONT_TXRQXT));
> +		pch_can_check_if2_busy(priv, mask);
> +	}
> +	/* Clear interrupt for receive object */
> +	else if (priv->msg_obj[mask - 1] == MSG_OBJ_RX) {

Should be "} else if ..."

> +		/* Setting CMASK for clearing the reception interrupts. */
> +		iowrite32((CAN_CMASK_RDWR | CAN_CMASK_CTRL | CAN_CMASK_ARB),
> +			  (&(priv->regs)->if2_cmask));
> +
> +		/* Clearing the Dir bit. */
> +		pch_can_bit_clear(&(priv->regs)->if2_id2, CAN_ID2_DIR);
> +
> +		/* Clearing NewDat & IntPnd */
> +		pch_can_bit_clear(&(priv->regs)->if2_mcont,
> +				  (CAN_IF_MCONT_NEWDAT | CAN_IF_MCONT_INTPND));
> +
> +		pch_can_check_if2_busy(priv, mask);
> +	}
> +}
> +
> +static int pch_can_get_buffer_status(struct pch_can_priv *priv)
> +{
> +	u32 reg_treq1;
> +	u32 reg_treq2;

Really needed?

> +
> +	/* Reading the transmission request registers. */
> +	reg_treq1 = (ioread32(&(priv->regs)->treq1) & 0xffff);
> +	reg_treq2 = ((ioread32(&(priv->regs)->treq2) & 0xffff) << 16);
> +
> +	return reg_treq1 | reg_treq2;
> +}
> +
> +static void pch_can_reset(struct pch_can_priv *priv)
> +{
> +	/* write to sw reset register */
> +	iowrite32(1, (&(priv->regs)->srst));
> +	iowrite32(0, (&(priv->regs)->srst));
> +}
> +
> +static void pch_can_msg_obj(struct net_device *ndev, u32 status)
> +{
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +	u32 reg;
> +	struct sk_buff *skb;
> +	struct can_frame *cf;
> +	canid_t id;
> +	u32 ide;
> +	u32 rtr;
> +	int i, j;
> +	struct net_device_stats *stats = &(priv->ndev->stats);
> +
> +	/* Reading the messsage object from the Message RAM */
> +	iowrite32(CAN_CMASK_RX_TX_GET, &(priv->regs)->if2_cmask);
> +	pch_can_check_if2_busy(priv, status);
> +
> +	/* Reading the MCONT register. */
> +	reg = ioread32(&(priv->regs)->if2_mcont);
> +	reg &= 0xffff;
> +
> +	/* If MsgLost bit set. */
> +	if (reg & CAN_IF_MCONT_MSGLOST) {
> +		pch_can_bit_clear(&(priv->regs)->if2_mcont,
> +				  CAN_IF_MCONT_MSGLOST);
> +		dev_err(&priv->ndev->dev, "Msg Obj is overwritten.\n");

That should create an error message as well.

> +	}
> +	/* Read the direction bit for determination of remote frame . */
> +	rtr = (ioread32((&(priv->regs)->if2_id2)) &  CAN_ID2_DIR);
> +	/* Clearing interrupts. */
> +	pch_can_int_clr(priv, status);
> +	/* Hanlde reception interrupt */

Typo!

> +	if (priv->msg_obj[status - 1] == MSG_OBJ_RX) {
> +		if (!(reg & CAN_IF_MCONT_NEWDAT)) {
> +			dev_err(&priv->ndev->dev, "MCONT_NEWDAT isn't SET.\n");
> +			return;
> +		}
> +		skb = alloc_can_skb(priv->ndev, &cf);
> +		if (!skb)
> +			return;
> +
> +		ide = ((ioread32(&(priv->regs)->if2_id2)) & CAN_ID2_XTD) >> 14;
> +		if (ide) {
> +			id = (ioread32(&(priv->regs)->if2_id1) & 0xffff);
> +			id |= (((ioread32(&(priv->regs)->if2_id2)) &
> +					    0x1fff) << 16);
> +			cf->can_id = (id & CAN_EFF_MASK) | CAN_EFF_FLAG;
> +		} else {
> +			id = (((ioread32(&(priv->regs)->if2_id2)) &
> +					  (CAN_SFF_MASK << 2)) >> 2);
> +			cf->can_id = (id & CAN_SFF_MASK);
> +		}
> +
> +		if (rtr) {
> +			cf->can_dlc = 0;
> +			cf->can_id |= CAN_RTR_FLAG;
> +		} else {
> +			cf->can_dlc = ((ioread32(&(priv->regs)->if2_mcont)) &
> +						   0x0f);
> +		}
> +
> +		/* Reading back the data. */
> +		for (i = 0, j = 0; i < cf->can_dlc; j++) {
> +			reg = ioread32(&(priv->regs)->if2_dataa1 + j*4);
> +			cf->data[i++] = cpu_to_le32(reg & 0xff);
> +			if (i == cf->can_dlc)
> +				break;
> +			cf->data[i++] = cpu_to_le32((reg & (0xff << 8)) >> 8);
> +		}
> +		netif_rx(skb);
> +		stats->rx_packets++;
> +		stats->rx_bytes += cf->can_dlc;
> +	} else if (priv->msg_obj[status - 1] == MSG_OBJ_TX) {
> +		/* Hanlde transmission interrupt */

Typo!

> +		can_get_echo_skb(priv->ndev, 0);
> +		netif_wake_queue(priv->ndev);
> +	}
> +}
> +
> +static void pch_can_error(struct net_device *ndev, u32 status)
> +{
> +	struct sk_buff *skb;
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +	struct can_frame *cf;
> +	u32 errc;
> +	struct net_device_stats *stats = &(priv->ndev->stats);
> +
> +	skb = alloc_can_err_skb(ndev, &cf);
> +	if (!skb) {
> +		dev_err(&ndev->dev, "%s -> No memory.\n", __func__);

Please drop the error message.

> +		return;
> +	}
> +
> +	if (status & PCH_BUS_OFF) {
> +		if (!priv->bus_off_interrupt) {
> +			pch_can_tx_disable_all(priv);
> +			pch_can_rx_disable_all(priv);
> +
> +			priv->can.state = CAN_STATE_BUS_OFF;
> +			cf->can_id |= CAN_ERR_BUSOFF;
> +			can_bus_off(ndev);
> +
> +			priv->bus_off_interrupt = 1;
> +			pch_can_set_run_mode(priv, PCH_CAN_RUN);

Hm, you automatically restart the controller after a bus-off. That's not
the intended behaviour. It's up to the user to define how and when the
device should recover from bus-off. For further information read

http://lxr.linux.no/#linux+v2.6.35.7/Documentation/networking/can.txt#L767
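
To make the intended split clearer, here is a small user-space model.
All names are hypothetical stand-ins, not driver code: the interrupt
path only records the bus-off and leaves the controller stopped, and the
restart happens solely in the do_set_mode(CAN_MODE_START) path:

```c
#include <stdbool.h>

/* Stand-ins for the CAN_STATE_* values used by the driver. */
enum fake_can_state { FAKE_STATE_ERROR_ACTIVE, FAKE_STATE_BUS_OFF };

struct fake_can {
	enum fake_can_state state;
	bool running;		/* models the controller being in run mode */
};

/* Interrupt path: note the bus-off, do NOT put the controller in run mode. */
static void fake_handle_bus_off(struct fake_can *can)
{
	can->state = FAKE_STATE_BUS_OFF;
	can->running = false;
	/* The driver would flag CAN_ERR_BUSOFF and call can_bus_off(ndev). */
}

/* do_set_mode(CAN_MODE_START) path: the only place that restarts. */
static int fake_do_set_mode_start(struct fake_can *can)
{
	can->state = FAKE_STATE_ERROR_ACTIVE;
	can->running = true;
	return 0;
}
```

Whether the restart is triggered manually or after a configured
restart delay is then a policy decision outside the driver.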

> +		}
> +	}

> +	/* Warning interrupt. */
> +	if (status & PCH_EWARN) {
> +		priv->can.state = CAN_STATE_ERROR_WARNING;
> +		priv->can.can_stats.error_warning++;
> +		cf->can_id |= CAN_ERR_CRTL;
> +		errc = ioread32((&(priv->regs)->errc));
> +		if (((errc & CAN_REC) >> 8) > 96)
> +			cf->data[1] |= CAN_ERR_CRTL_RX_WARNING;
> +		if ((errc & CAN_TEC) > 96)
> +			cf->data[1] |= CAN_ERR_CRTL_TX_WARNING;
> +		dev_warn(&ndev->dev, "%s -> Warning interrupt.\n", __func__);
> +	}
> +	/* Error passive interrupt. */
> +	if (status & PCH_EPASSIV) {
> +		priv->can.can_stats.error_passive++;
> +		priv->can.state = CAN_STATE_ERROR_PASSIVE;
> +		cf->can_id |= CAN_ERR_CRTL;
> +		errc = ioread32((&(priv->regs)->errc));
> +		if (((errc & CAN_REC) >> 8) > 127)
> +			cf->data[1] |= CAN_ERR_CRTL_RX_PASSIVE;
> +		if ((errc & CAN_TEC) > 127)
> +			cf->data[1] |= CAN_ERR_CRTL_TX_PASSIVE;
> +		dev_err(&ndev->dev,
> +			"%s -> Error interrupt.\n", __func__);
> +	}
> +
> +	if (status & PCH_STUF_ERR)
> +		cf->data[2] |= CAN_ERR_PROT_STUFF;
> +
> +	if (status & PCH_FORM_ERR)
> +		cf->data[2] |= CAN_ERR_PROT_FORM;
> +
> +	if (status & PCH_ACK_ERR)
> +		cf->data[2] |= CAN_ERR_PROT_LOC_ACK | CAN_ERR_PROT_LOC_ACK_DEL;
> +
> +	if ((status & PCH_BIT1_ERR) || (status & PCH_BIT0_ERR))
> +		cf->data[2] |= CAN_ERR_PROT_BIT;
> +
> +	if (status & PCH_CRC_ERR)
> +		cf->data[2] |= CAN_ERR_PROT_LOC_CRC_SEQ |
> +				CAN_ERR_PROT_LOC_CRC_DEL;
> +
> +	if (status & PCH_LEC_ALL)
> +		iowrite32(status | PCH_LEC_ALL,
> +			  &(priv->regs)->stat);

A bit-wise test of the above values is wrong, I believe. Please use the
switch statement instead.
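
Something along these lines, as a sketch only. I'm assuming that the
last error code is an encoded 3-bit field in the low bits of the status
register, as on Bosch C_CAN; the mask, the numeric codes and the enum
names below are my assumptions, not taken from your header:

```c
#include <stdint.h>

#define FAKE_LEC_MASK 0x7u	/* assumed: LEC occupies bits 2..0 */

/* Hypothetical result type; the driver would set CAN_ERR_PROT_* flags. */
enum fake_prot_err {
	FAKE_ERR_NONE,
	FAKE_ERR_STUFF,
	FAKE_ERR_FORM,
	FAKE_ERR_ACK,
	FAKE_ERR_BIT,
	FAKE_ERR_CRC,
};

/* Decode the LEC field as a value, not as a set of independent bits. */
static enum fake_prot_err fake_decode_lec(uint32_t status)
{
	switch (status & FAKE_LEC_MASK) {
	case 1:
		return FAKE_ERR_STUFF;
	case 2:
		return FAKE_ERR_FORM;
	case 3:
		return FAKE_ERR_ACK;
	case 4:	/* bit1 error */
	case 5:	/* bit0 error */
		return FAKE_ERR_BIT;
	case 6:
		return FAKE_ERR_CRC;
	default:	/* 0: no error, 7: unused */
		return FAKE_ERR_NONE;
	}
}
```

With an encoded field, the value 3 has both low bits set, so separate
"status & 1" and "status & 2" tests would wrongly report two errors at
once where only a single one occurred.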

> +
> +	stats->rx_packets++;
> +	stats->rx_bytes += cf->can_dlc;
> +	netif_rx(skb);
> +}
> +
> +static irqreturn_t pch_can_handler(int irq, void *dev_id)

A better name making clear that it's the interrupt handler would be nice.

> +{
> +	u32 int_stat;
> +	u32 reg_stat;
> +	struct net_device *ndev = (struct net_device *)dev_id;
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +	int_stat = pch_can_int_pending(priv);
> +
> +	if (!int_stat)
> +		return IRQ_NONE;
> +
> +	if (int_stat == CAN_STATUS_INT) {
> +		reg_stat = ioread32((&(priv->regs)->stat));
> +		if (reg_stat & (PCH_BUS_OFF | PCH_LEC_ALL | PCH_EWARN |
> +								PCH_EPASSIV)) {
> +			if ((reg_stat & PCH_LEC_ALL) != PCH_LEC_ALL)
> +				pch_can_error(ndev, reg_stat);
> +		}
> +
> +		/* Recover from Bus Off */
> +		if (!reg_stat && priv->bus_off_interrupt) {
> +			priv->bus_off_interrupt = 0;
> +			pch_can_tx_enable_all(priv);
> +			pch_can_rx_enable_all(priv);
> +
> +			dev_info(&priv->ndev->dev, "BusOff stage recovered.\n");

Bogus bus-off handling, more later...

> +		}
> +
> +		if (reg_stat & PCH_RX_OK)
> +			pch_can_bit_clear(&(priv->regs)->stat, PCH_RX_OK);
> +
> +		if (reg_stat & PCH_TX_OK)
> +			pch_can_bit_clear(&(priv->regs)->stat, PCH_TX_OK);

Could be done in one call, I think.

> +		int_stat = pch_can_int_pending(priv);
> +	}
> +
> +	if ((int_stat > 0) && (int_stat <= MAX_MSG_OBJ))
> +		pch_can_msg_obj(ndev, int_stat);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +static int pch_set_bittiming(struct net_device *ndev)
> +{
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +	const struct can_bittiming *bt = &priv->can.bittiming;
> +	u32 curr_mode;
> +	u32 reg1; /* CANBIT */
> +	u32 reg2; /* BEPE */

Why not "u32 canbit" then?

> +	u32 brp;
> +
> +	pch_can_get_run_mode(priv, &curr_mode);
> +	if (curr_mode == PCH_CAN_RUN)
> +		pch_can_set_run_mode(priv, PCH_CAN_STOP);

The device is stopped when this function is called. Please remove.

> +
> +	/* Setting the CCE bit for accessing the Can Timing register. */
> +	pch_can_bit_set(&(priv->regs)->cont, CAN_CTRL_CCE);
> +
> +	brp = (bt->tq) / (1000000/PCH_CAN_CLK) - 1;
> +	reg1 = brp & MSK_BITT_BRP;
> +	reg1 |= (bt->sjw - 1) << BIT_BITT_SJW;
> +	reg1 |= (bt->phase_seg1 + bt->prop_seg - 1) << BIT_BITT_TSEG1;
> +	reg1 |= (bt->phase_seg2 - 1) << BIT_BITT_TSEG2;
> +	reg2 = (brp & MSK_BRPE_BRPE) >> BIT_BRPE_BRPE;
> +	iowrite32(reg1, (&(priv->regs)->bitt));
> +	iowrite32(reg2, (&(priv->regs)->brpe));
> +	pch_can_bit_clear(&(priv->regs)->cont, CAN_CTRL_CCE);
> +
> +	if (curr_mode == PCH_CAN_RUN)
> +		pch_can_set_run_mode(priv, PCH_CAN_RUN);

Ditto.

> +	return 0;
> +}
> +
> +static void pch_can_start(struct net_device *ndev)
> +{
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +
> +	if (priv->can.state != CAN_STATE_STOPPED)
> +		pch_can_reset(priv);
> +
> +	pch_set_bittiming(ndev);
> +	pch_can_set_optmode(priv);
> +	priv->can.state = CAN_STATE_ERROR_ACTIVE;

Hm, where do you really start the controller? I'm missing
pch_can_set_run_mode(priv, PCH_CAN_RUN).

> +	return;
> +}
> +
> +static int pch_can_do_set_mode(struct net_device *ndev, enum can_mode mode)
> +{
> +	int ret = 0;
> +
> +	switch (mode) {
> +	case CAN_MODE_START:
> +		pch_can_start(ndev);
> +		netif_wake_queue(ndev);
> +		break;
> +	default:
> +		ret = -EOPNOTSUPP;
> +		break;
> +	}
> +
> +	return ret;
> +}

Note that this function is called when the device will recover from bus-off.

> +static int pch_can_get_state(const struct net_device *ndev,
> +			     enum can_state *state)
> +{
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +
> +	*state = priv->can.state;
> +	return 0;
> +}

There is no need for that function as the driver handles state changes
in the interrupt handler.

> +static int pch_open(struct net_device *ndev)

That's confusing! Please use the prefix pch_can throughout this file.

> +{
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +	int retval;
> +
> +	pch_can_open(priv);

This function already starts the controller, which is too *early*.

> +
> +	retval = pci_enable_msi(priv->dev);
> +	if (retval) {
> +		dev_err(&ndev->dev, "Unable to allocate MSI ret=%d\n", retval);
> +		goto pci_en_msi_err;
> +	}
> +
> +	/* Regsitering the interrupt. */
> +	retval = request_irq(priv->dev->irq, pch_can_handler, IRQF_SHARED,
> +			     ndev->name, ndev);
> +	if (retval) {
> +		dev_err(&ndev->dev, "request_irq failed.\n");
> +		goto req_irq_err;
> +	}
> +
> +	/* Assuming that no bus off interrupt. */
> +	priv->bus_off_interrupt = 0;
> +
> +	/* Open common can device */
> +	retval = open_candev(ndev);
> +	if (retval) {
> +		dev_err(ndev->dev.parent, "open_candev() failed %d\n", retval);
> +		goto err_open_candev;
> +	}
> +
> +	pch_can_start(ndev);

The above function should finally start the controller.

> +	netif_start_queue(ndev);
> +
> +	return 0;
> +
> +err_open_candev:
> +	free_irq(priv->dev->irq, ndev);
> +req_irq_err:
> +	pci_disable_msi(priv->dev);
> +pci_en_msi_err:
> +	pch_can_release(priv);
> +
> +	return retval;
> +}
> +
> +static int pch_close(struct net_device *ndev)
> +{
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +
> +	netif_stop_queue(ndev);
> +	pch_can_release(priv);
> +	free_irq(priv->dev->irq, ndev);
> +	pci_disable_msi(priv->dev);
> +	close_candev(ndev);
> +	priv->can.state = CAN_STATE_STOPPED;
> +	return 0;
> +}
> +
> +static int pch_get_free_msg_obj(struct net_device *ndev)
> +{
> +	u32 buffer_status = 0;
> +	u32 tx_disable_counter = 0;
> +	u32 tx_buffer_avail = 0;
> +	u32 status;
> +	s32 i;
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +
> +	/* Getting the message object status. */
> +	buffer_status = (u32) pch_can_get_buffer_status(priv);
> +
> +	/* Getting the free transmit message object. */
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if ((priv->msg_obj[i] == MSG_OBJ_TX)) {
> +			/* Checking whether the object is enabled. */
> +			pch_can_get_tx_enable(priv, i + 1, &status);
> +			if (status == ENABLE) {
> +				if (!((buffer_status >> i) & 1)) {
> +					tx_buffer_avail = (i + 1);
> +					break;
> +				}
> +			} else {
> +				tx_disable_counter++;
> +			}
> +		}
> +	}
> +
> +	/* If no transmit object available. */
> +	if (!tx_buffer_avail) {
> +		/* If no object is enabled. */
> +		if ((tx_disable_counter == PCH_TX_OBJ_NUM)) {
> +			dev_err(&ndev->dev, "All tx buffers are disabled.\n");
> +			return -EPERM;
> +		} else {
> +			dev_err(&ndev->dev, "%s:No tx buf free.\n", __func__);
> +			return -PCH_CAN_NO_TX_BUFF;
> +		}
> +	}
> +	return tx_buffer_avail;
> +}
> +
> +static netdev_tx_t pch_xmit(struct sk_buff *skb, struct net_device *ndev)
> +{
> +	canid_t id;
> +	u32 id1 = 0;
> +	u32 id2 = 0;

Need these values to be preset?

> +	u32 run_mode;
> +	u32 i, j;

It's common to use type "int" for the usual incrementer value... as you
do in other places as well. Please check!

> +	unsigned long flags;
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +	struct can_frame *cf = (struct can_frame *)skb->data;
> +	struct net_device_stats *stats = &ndev->stats;
> +	u32 tx_buffer_avail = 0;
> +
> +	if (can_dropped_invalid_skb(ndev, skb))
> +		return NETDEV_TX_OK;
> +
> +	/* Getting the current CAN mode. */
> +	pch_can_get_run_mode(priv, &run_mode);
> +	if (run_mode != PCH_CAN_RUN) {
> +		dev_err(&ndev->dev, "CAN stopped on transmit attempt.\n");
> +		return -EPERM;
> +	}

Can this happen? I think this check can be removed. Anyway, -EPERM is
not a valid return value for that function.

> +
> +	tx_buffer_avail = pch_get_free_msg_obj(ndev);
> +	if (tx_buffer_avail < 0)
> +		return tx_buffer_avail;

Wrong return value?
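
Two things here: tx_buffer_avail is declared as u32 above, so the
"< 0" test can never be true, and a negative errno is not a valid
netdev_tx_t. A user-space sketch of what I would expect instead; the
enum stands in for NETDEV_TX_OK/NETDEV_TX_BUSY, and the helper names,
the busy mask and the object count are invented for the example:

```c
#define FAKE_NUM_TX_OBJ 4	/* hypothetical number of transmit objects */

/* Stand-ins for NETDEV_TX_OK and NETDEV_TX_BUSY. */
enum fake_tx_result { FAKE_TX_OK, FAKE_TX_BUSY };

static int fake_queue_stopped;	/* models netif_stop_queue() being called */

/* Returns a 1-based free object number, or -1 if all are busy. */
static int fake_get_free_msg_obj(unsigned int busy_mask)
{
	int i;

	for (i = 0; i < FAKE_NUM_TX_OBJ; i++) {
		if (!((busy_mask >> i) & 1))
			return i + 1;
	}
	return -1;
}

static enum fake_tx_result fake_xmit(unsigned int busy_mask)
{
	/* Signed on purpose, so a negative error value survives the test. */
	int obj = fake_get_free_msg_obj(busy_mask);

	if (obj < 0) {
		fake_queue_stopped = 1;	/* netif_stop_queue(ndev) */
		return FAKE_TX_BUSY;
	}
	return FAKE_TX_OK;
}
```

Stopping the queue when no object is free and waking it from the TX
done interrupt should keep the stack from spinning on a busy return.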

> +
> +	/* Attaining the lock. */
> +	spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> +
> +	/* Reading the Msg Obj from the Msg RAM to the Interface register. */
> +	iowrite32(CAN_CMASK_RX_TX_GET, &(priv->regs)->if1_cmask);
> +	pch_can_check_if1_busy(priv, tx_buffer_avail);
> +
> +	/* Setting the CMASK register. */
> +	pch_can_bit_set(&(priv->regs)->if1_cmask, CAN_CMASK_ALL);
> +
> +	/* If ID extended is set. */
> +	if (cf->can_id & CAN_EFF_FLAG) {
> +		id =  cf->can_id & CAN_EFF_MASK;
> +		id1 = id & 0xffff;
> +		id2 = ((id & (0x1fff << 16)) >> 16) | CAN_ID2_XTD;

Please use some more macro definitions for the sake of readability.
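
For example, roughly like this. The CAN_* values are the ones from
<linux/can.h>; the FAKE_* macro names are only suggestions, with the
XTD bit position derived from the ">> 14" used in the receive path:

```c
#include <stdint.h>

/* Values as defined in <linux/can.h>, repeated to keep this standalone. */
#define CAN_EFF_FLAG 0x80000000u
#define CAN_EFF_MASK 0x1fffffffu
#define CAN_SFF_MASK 0x000007ffu

/* Suggested names for the magic numbers (hypothetical). */
#define FAKE_ID1_MASK      0xffffu	/* ID bits 15..0 live in IFx_ID1 */
#define FAKE_ID2_EXT_MASK  0x1fffu	/* ID bits 28..16 live in IFx_ID2 */
#define FAKE_ID2_EXT_SHIFT 16
#define FAKE_ID2_STD_SHIFT 2		/* 11-bit ID starts at bit 2 of IFx_ID2 */
#define FAKE_ID2_XTD       (1u << 14)

/* Split a SocketCAN can_id into the two ID register values. */
static void fake_pack_id(uint32_t can_id, uint32_t *id1, uint32_t *id2)
{
	if (can_id & CAN_EFF_FLAG) {
		uint32_t id = can_id & CAN_EFF_MASK;

		*id1 = id & FAKE_ID1_MASK;
		*id2 = ((id >> FAKE_ID2_EXT_SHIFT) & FAKE_ID2_EXT_MASK) |
		       FAKE_ID2_XTD;
	} else {
		*id1 = 0;
		*id2 = (can_id & CAN_SFF_MASK) << FAKE_ID2_STD_SHIFT;
	}
}
```

The same macros could then be reused in the receive path, where the
identifier is assembled from the two registers again.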

> +	} else {
> +		id =  cf->can_id & CAN_SFF_MASK;
> +		id1 = 0;
> +		id2 = ((id & CAN_SFF_MASK) << 2);
> +	}
> +	pch_can_bit_clear(&(priv->regs)->if1_id1, 0xffff);
> +	pch_can_bit_clear(&(priv->regs)->if1_id2, 0x1fff | CAN_ID2_XTD);
> +	pch_can_bit_set(&(priv->regs)->if1_id1, id1);
> +	pch_can_bit_set(&(priv->regs)->if1_id2, id2);
> +
> +	/* If remote frame has to be transmitted.. */
> +	if (cf->can_id & CAN_RTR_FLAG)
> +		pch_can_bit_clear(&(priv->regs)->if1_id2, CAN_ID2_DIR);
> +
> +	for (i = 0, j = 0; i < cf->can_dlc; j++) {
> +		iowrite32(le32_to_cpu(cf->data[i++]),
> +			 (&(priv->regs)->if1_dataa1) + j*4);
> +		if (i == cf->can_dlc)
> +			break;
> +		iowrite32(le32_to_cpu(cf->data[i++] << 8),
> +			 (&(priv->regs)->if1_dataa1) + j*4);
> +	}
> +	can_put_echo_skb(skb, ndev, 0);
> +
> +	/* Updating the size of the data. */
> +	pch_can_bit_clear(&(priv->regs)->if1_mcont, 0x0f);
> +	pch_can_bit_set(&(priv->regs)->if1_mcont, cf->can_dlc);
> +
> +	/* Clearing IntPend, NewDat & TxRqst */
> +	pch_can_bit_clear(&(priv->regs)->if1_mcont,
> +			   (CAN_IF_MCONT_NEWDAT | CAN_IF_MCONT_INTPND |
> +			    CAN_IF_MCONT_TXRQXT));
> +
> +	/* Setting NewDat, TxRqst bits */
> +	pch_can_bit_set(&(priv->regs)->if1_mcont,
> +			 (CAN_IF_MCONT_NEWDAT | CAN_IF_MCONT_TXRQXT));
> +
> +	pch_can_check_if1_busy(priv, tx_buffer_avail);
> +	spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> +
> +	stats->tx_bytes += cf->can_dlc;
> +	stats->tx_packets++;

That should be incremented when the TX done interrupt is handled.

> +
> +	return NETDEV_TX_OK;
> +}
> +
> +static const struct net_device_ops pch_can_netdev_ops = {
> +	.ndo_open		= pch_open,
> +	.ndo_stop		= pch_close,
> +	.ndo_start_xmit		= pch_xmit,
> +};
> +
> +static void __devexit pch_can_remove(struct pci_dev *pdev)
> +{
> +	struct net_device *ndev = pci_get_drvdata(pdev);
> +	struct pch_can_priv *priv = netdev_priv(ndev);
> +
> +	unregister_candev(priv->ndev);
> +	free_candev(priv->ndev);
> +	pci_iounmap(pdev, priv->base);
> +	pci_release_regions(pdev);
> +	pci_disable_device(pdev);
> +	pci_set_drvdata(pdev, NULL);
> +	pch_can_reset(priv);
> +}
> +
> +#ifdef CONFIG_PM
> +static int pch_can_suspend(struct pci_dev *pdev, pm_message_t state)
> +{
> +	int i;			/* Counter variable. */
> +	int retval;		/* Return value. */
> +	u32 buf_stat;	/* Variable for reading the transmit buffer status. */
> +	u32 counter = 0xFFFFFF;
> +
> +	struct net_device *dev = pci_get_drvdata(pdev);
> +	struct pch_can_priv *priv = netdev_priv(dev);
> +
> +	/* Stop the CAN controller */
> +	pch_can_set_run_mode(priv, PCH_CAN_STOP);
> +
> +	/* Indicate that we are aboutto/in suspend */
> +	priv->can.state = CAN_STATE_SLEEPING;
> +
> +	/* Waiting for all transmission to complete. */
> +	while (counter) {
> +		buf_stat = pch_can_get_buffer_status(priv);
> +		if (!buf_stat)
> +			break;
> +		counter--;
> +	}
> +	if (!counter)
> +		dev_err(&pdev->dev, "%s -> Transmission time out.\n", __func__);

Timeout without defined delay!

> +	/* Save interrupt configuration and then disable them */
> +	pch_can_get_int_enables(priv, &(priv->int_enables));
> +	pch_can_set_int_enables(priv, PCH_CAN_DISABLE);
> +
> +	/* Save Tx buffer enable state */
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_TX)
> +			pch_can_get_tx_enable(priv, (i + 1),
> +					      &(priv->tx_enable[i]));
> +	}
> +
> +	/* Disable all Transmit buffers */
> +	pch_can_tx_disable_all(priv);
> +
> +	/* Save Rx buffer enable state */
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_RX) {
> +			pch_can_get_rx_enable(priv, (i + 1),
> +						&(priv->rx_enable[i]));
> +			pch_can_get_rx_buffer_link(priv, (i + 1),
> +						&(priv->rx_link[i]));
> +		}
> +	}
> +
> +	/* Disable all Receive buffers */
> +	pch_can_rx_disable_all(priv);
> +	retval = pci_save_state(pdev);
> +	if (retval) {
> +		dev_err(&pdev->dev, "pci_save_state failed.\n");
> +	} else {
> +		pci_enable_wake(pdev, PCI_D3hot, 0);
> +		pci_disable_device(pdev);
> +		pci_set_power_state(pdev, pci_choose_state(pdev, state));
> +	}
> +
> +	return retval;
> +}
> +
> +static int pch_can_resume(struct pci_dev *pdev)
> +{
> +	int i;			/* Counter variable. */
> +	int retval;		/* Return variable. */
> +	struct net_device *dev = pci_get_drvdata(pdev);
> +	struct pch_can_priv *priv = netdev_priv(dev);
> +
> +	pci_set_power_state(pdev, PCI_D0);
> +	pci_restore_state(pdev);
> +	retval = pci_enable_device(pdev);
> +	if (retval) {
> +		dev_err(&pdev->dev, "pci_enable_device failed.\n");
> +		return retval;
> +	}
> +
> +	pci_enable_wake(pdev, PCI_D3hot, 0);
> +
> +	priv->can.state = CAN_STATE_ERROR_ACTIVE;
> +
> +	/* Disabling all interrupts. */
> +	pch_can_set_int_enables(priv, PCH_CAN_DISABLE);
> +
> +	/* Setting the CAN device in Stop Mode. */
> +	pch_can_set_run_mode(priv, PCH_CAN_STOP);
> +
> +	/* Configuring the transmit and receive buffers. */
> +	pch_can_config_rx_tx_buffers(priv);
> +
> +	/* Restore the CAN state */
> +	pch_set_bittiming(dev);
> +
> +	/* Listen/Active */
> +	pch_can_set_optmode(priv);
> +
> +	/* Enabling the transmit buffer. */
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_TX) {
> +			pch_can_set_tx_enable(priv, i + 1,
> +					      priv->tx_enable[i]);
> +		}
> +	}
> +
> +	/* Configuring the receive buffer and enabling them. */
> +	for (i = 0; i < PCH_OBJ_NUM; i++) {
> +		if (priv->msg_obj[i] == MSG_OBJ_RX) {
> +			/* Restore buffer link */
> +			pch_can_set_rx_buffer_link(priv, i + 1,
> +						   priv->rx_link[i]);
> +
> +			/* Restore buffer enables */
> +			pch_can_set_rx_enable(priv, i + 1, priv->rx_enable[i]);
> +		}
> +	}
> +
> +	/* Enable CAN Interrupts */
> +	pch_can_set_int_custom(priv);
> +
> +	/* Restore Run Mode */
> +	pch_can_set_run_mode(priv, PCH_CAN_RUN);
> +
> +	return retval;
> +}

Have the suspend and resume functions been tested?

> +#else
> +#define pch_can_suspend NULL
> +#define pch_can_resume NULL
> +#endif

Please add an empty line here.

> +static int __devinit pch_can_probe(struct pci_dev *pdev,
> +				   const struct pci_device_id *id)
> +{
> +	struct net_device *ndev;
> +	struct pch_can_priv *priv;
> +	int rc;
> +	int index;
> +	void __iomem *addr;
> +
> +	rc = pci_enable_device(pdev);
> +	if (rc) {
> +		dev_err(&pdev->dev, "Failed pci_enable_device %d\n", rc);
> +		goto probe_exit_endev;
> +	}
> +
> +	rc = pci_request_regions(pdev, KBUILD_MODNAME);
> +	if (rc) {
> +		dev_err(&pdev->dev, "Failed pci_request_regions %d\n", rc);
> +		goto probe_exit_pcireq;
> +	}
> +
> +	addr = pci_iomap(pdev, 1, 0);
> +	if (!addr) {
> +		rc = -EIO;
> +		dev_err(&pdev->dev, "Failed pci_iomap\n");
> +		goto probe_exit_ipmap;
> +	}
> +
> +	ndev = alloc_candev(sizeof(struct pch_can_priv), 1);
> +	if (!ndev) {
> +		rc = -ENOMEM;
> +		dev_err(&pdev->dev, "Failed alloc_candev\n");
> +		goto probe_exit_alloc_candev;
> +	}
> +
> +	priv = netdev_priv(ndev);
> +	priv->ndev = ndev;
> +	priv->base = addr;
> +	priv->regs = addr;
> +	priv->dev = pdev;
> +	priv->can.bittiming_const = &pch_can_bittiming_const;
> +	priv->can.do_set_mode = pch_can_do_set_mode;
> +	priv->can.do_get_state = pch_can_get_state;

Not needed! See above.

Could you please also implement do_get_berr_counter()?

> +	priv->can.ctrlmode_supported = CAN_CTRLMODE_LISTENONLY |
> +				       CAN_CTRLMODE_LOOPBACK;
> +	ndev->irq = pdev->irq;
> +	ndev->flags |= IFF_ECHO;
> +
> +	pci_set_drvdata(pdev, ndev);
> +	SET_NETDEV_DEV(ndev, &pdev->dev);
> +	ndev->netdev_ops = &pch_can_netdev_ops;
> +
> +	priv->can.clock.freq = PCH_CAN_CLK * 1000; /* kHz to Hz */
> +	for (index = 0; index < PCH_RX_OBJ_NUM;)
> +		priv->msg_obj[index++] = MSG_OBJ_RX;
> +
> +	for (index = index;  index < PCH_OBJ_NUM;)
> +		priv->msg_obj[index++] = MSG_OBJ_TX;
> +
> +	rc = register_candev(ndev);
> +	if (rc) {
> +		dev_err(&pdev->dev, "Failed register_candev %d\n", rc);
> +		goto probe_exit_reg_candev;
> +	}
> +
> +	return 0;
> +
> +probe_exit_reg_candev:
> +	free_candev(ndev);
> +probe_exit_alloc_candev:
> +	pci_iounmap(pdev, addr);
> +probe_exit_ipmap:
> +	pci_release_regions(pdev);
> +probe_exit_pcireq:
> +	pci_disable_device(pdev);
> +probe_exit_endev:
> +	return rc;
> +}
> +
> +static struct pci_driver pch_can_pcidev = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = pch_can_pcidev_id,
> +	.probe = pch_can_probe,
> +	.remove = __devexit_p(pch_can_remove),
> +	.suspend = pch_can_suspend,
> +	.resume = pch_can_resume,
> +};
> +
> +static int __init pch_can_pci_init(void)
> +{
> +	return pci_register_driver(&pch_can_pcidev);
> +}
> +module_init(pch_can_pci_init);
> +
> +static void __exit pch_can_pci_exit(void)
> +{
> +	pci_unregister_driver(&pch_can_pcidev);
> +}
> +module_exit(pch_can_pci_exit);
> +
> +MODULE_DESCRIPTION("Controller Area Network Driver");
> +MODULE_LICENSE("GPL");

Should this be "GPL v2"?

> +MODULE_VERSION("0.94");
> +MODULE_DEVICE_TABLE(pci, pch_can_pcidev_id);

Please move MODULE_DEVICE_TABLE() directly below the declaration of
pch_can_pcidev_id.

In this driver you are using just *one* RX object. This means that the
CPU must handle new messages as quickly as possible, otherwise messages
will be lost, right? This will certainly not make users happy.
Any chance to use more RX objects in FIFO mode?
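
To make the concern concrete, here is a toy userspace model. It is not a
description of the PCH hardware: it simply assumes that frames arrive in
bursts, that the CPU drains every RX object between bursts, and that any
frame arriving while all RX objects are full is lost. The function name
frames_lost() and its parameters are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: bursts[i] frames arrive back to back before the CPU gets to
 * run; rx_objs is the number of RX objects available to hold them.
 * Frames beyond rx_objs in a single burst are counted as lost. */
static unsigned int frames_lost(const unsigned int *bursts, size_t n,
				unsigned int rx_objs)
{
	unsigned int lost = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (bursts[i] > rx_objs)
			lost += bursts[i] - rx_objs;
	}
	return lost;
}
```

Under these assumptions, a single RX object drops every frame after the
first in each burst, while a FIFO at least as deep as the longest burst
drops none.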

Thanks,

Wolfgang.

^ permalink raw reply

* [PATCH] Phonet: restore flow control credits when sending fails
From: Kumar A Sanghvi @ 2010-09-30  8:33 UTC (permalink / raw)
  To: netdev, davem, remi.denis-courmont, eric.dumazet
  Cc: gulshan.karmani, Kumar Sanghvi, Linus Walleij

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

This patch restores the flow control patch below, submitted by Rémi
Denis-Courmont, which was accidentally lost when the Pipe controller
patch was applied to Phonet.

	commit 1a98214feef2221cd7c24b17cd688a5a9d85b2ea
	Author: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
	Date:   Mon Aug 30 12:57:03 2010 +0000

	Phonet: restore flow control credits when sending fails

	Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
Acked-by: Linus Walleij <linus.walleij@stericsson.com>
---
 net/phonet/pep.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index 9746c6d..aa3d870 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -1289,6 +1289,7 @@ static int pipe_skb_send(struct sock *sk, struct sk_buff *skb)
 {
 	struct pep_sock *pn = pep_sk(sk);
 	struct pnpipehdr *ph;
+	int err;
 #ifdef CONFIG_PHONET_PIPECTRLR
 	struct sockaddr_pn spn = {
 		.spn_family = AF_PHONET,
@@ -1315,10 +1316,15 @@ static int pipe_skb_send(struct sock *sk, struct sk_buff *skb)
 		ph->message_id = PNS_PIPE_DATA;
 	ph->pipe_handle = pn->pipe_handle;
 #ifdef CONFIG_PHONET_PIPECTRLR
-	return pn_skb_send(sk, skb, &spn);
+	err = pn_skb_send(sk, skb, &spn);
 #else
-	return pn_skb_send(sk, skb, &pipe_srv);
+	err = pn_skb_send(sk, skb, &pipe_srv);
 #endif
+
+	if (err && pn_flow_safe(pn->tx_fc))
+		atomic_inc(&pn->tx_credits);
+	return err;
+
 }
 
 static int pep_sendmsg(struct kiocb *iocb, struct sock *sk,
-- 
1.7.2.dirty
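
For readers following the thread, a minimal userspace sketch of the credit
handling this patch restores. It uses C11 atomics rather than the kernel
atomic API, and struct fake_pipe, take_credit(), pipe_send() and the two
send stubs are hypothetical names. The sketch assumes a credit is taken
before the send is attempted; it only shows that a failed send hands that
credit back when flow control is in use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pipe socket state. */
struct fake_pipe {
	atomic_int tx_credits;
	bool flow_safe;		/* stand-in for pn_flow_safe(pn->tx_fc) */
};

/* Hypothetical transmit hook: returns 0 on success, negative on error. */
typedef int (*send_fn)(void *frame);

static int send_ok(void *frame)   { (void)frame; return 0; }
static int send_fail(void *frame) { (void)frame; return -1; }

/* Decrement tx_credits unless it is already zero. */
static bool take_credit(struct fake_pipe *p)
{
	int cur = atomic_load(&p->tx_credits);

	while (cur > 0) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&p->tx_credits, &cur, cur - 1))
			return true;
	}
	return false;
}

/* Take a credit, try to send, and give the credit back on failure. */
static int pipe_send(struct fake_pipe *p, send_fn send, void *frame)
{
	int err;

	if (p->flow_safe && !take_credit(p))
		return -1;	/* no credit available */

	err = send(frame);
	if (err && p->flow_safe)
		atomic_fetch_add(&p->tx_credits, 1);
	return err;
}
```

Without the give-back step, every failed send would permanently consume a
credit and the sender could stall even though nothing was transmitted.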


^ permalink raw reply related

* Re: [PATCH 4/4] Phonet: restore flow control credits when sending fails
From: Kumar SANGHVI @ 2010-09-30  8:31 UTC (permalink / raw)
  To: netdev@vger.kernel.org, davem@davemloft.net,
	remi.denis-courmont@nokia.com, "eric.dumazet@gmail.co
  Cc: Gulshan KARMANI, Linus WALLEIJ
In-Reply-To: <1285835105-20293-1-git-send-email-kumar.sanghvi@stericsson.com>

Hi All,

On Thu, Sep 30, 2010 at 10:25:05 +0200, Kumar A SANGHVI wrote:
> From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
> 
> This patch restores the flow control patch below, submitted by Rémi
> Denis-Courmont, which was accidentally lost when the Pipe controller
> patch was applied to Phonet.
> 
> 	commit 1a98214feef2221cd7c24b17cd688a5a9d85b2ea
> 	Author: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
> 	Date:   Mon Aug 30 12:57:03 2010 +0000
> 
> 	Phonet: restore flow control credits when sending fails
> 
> 	Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
> 	Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> Signed-off-by: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
> Acked-by: Linus Walleij <linus.walleij@stericsson.com>

Please discard this.
I will send a new patch.

Thanks,
Kumar. 

^ permalink raw reply
