Netdev List
 help / color / mirror / Atom feed
* [PATCH 09/12] nfs: disable data cache revalidation for swapfiles
From: Mel Gorman @ 2012-05-17 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
	Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
	Mike Christie, Eric B Munson, Mel Gorman
In-Reply-To: <1337266285-8102-1-git-send-email-mgorman@suse.de>

The VM does not like PG_private set on PG_swapcache pages. As suggested
by Trond in http://lkml.org/lkml/2006/8/25/348, this patch disables
NFS data cache revalidation on swap files.  as it does not make
sense to have other clients change the file while it is being used as
swap. This avoids setting PG_private on swap pages, since there ought
to be no further races with invalidate_inode_pages2() to deal with.

Since we cannot set PG_private we cannot use page->private which
is already used by PG_swapcache pages to store the nfs_page. Thus
augment the new nfs_page_find_request logic.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/nfs/inode.c |    6 ++++++
 fs/nfs/write.c |   49 +++++++++++++++++++++++++++++++++++--------------
 2 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e8bbfa5..af43ef6 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -880,6 +880,12 @@ int nfs_revalidate_mapping(struct inode *inode, struct address_space *mapping)
 	struct nfs_inode *nfsi = NFS_I(inode);
 	int ret = 0;
 
+	/*
+	 * swapfiles are not supposed to be shared.
+	 */
+	if (IS_SWAPFILE(inode))
+		goto out;
+
 	if ((nfsi->cache_validity & NFS_INO_REVAL_PAGECACHE)
 			|| nfs_attribute_cache_expired(inode)
 			|| NFS_STALE(inode)) {
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 8223b2c..21cfe71 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -111,15 +111,28 @@ static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
 	set_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags);
 }
 
-static struct nfs_page *nfs_page_find_request_locked(struct page *page)
+static struct nfs_page *
+nfs_page_find_request_locked(struct nfs_inode *nfsi, struct page *page)
 {
 	struct nfs_page *req = NULL;
 
-	if (PagePrivate(page)) {
+	if (PagePrivate(page))
 		req = (struct nfs_page *)page_private(page);
-		if (req != NULL)
-			kref_get(&req->wb_kref);
+	else if (unlikely(PageSwapCache(page))) {
+		struct nfs_page *freq, *t;
+
+		/* Linearly search the commit list for the correct req */
+		list_for_each_entry_safe(freq, t, &nfsi->commit_list, wb_list) {
+			if (freq->wb_page == page) {
+				req = freq;
+				break;
+			}
+		}
 	}
+
+	if (req)
+		kref_get(&req->wb_kref);
+
 	return req;
 }
 
@@ -129,7 +142,7 @@ static struct nfs_page *nfs_page_find_request(struct page *page)
 	struct nfs_page *req = NULL;
 
 	spin_lock(&inode->i_lock);
-	req = nfs_page_find_request_locked(page);
+	req = nfs_page_find_request_locked(NFS_I(inode), page);
 	spin_unlock(&inode->i_lock);
 	return req;
 }
@@ -232,7 +245,7 @@ static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblo
 
 	spin_lock(&inode->i_lock);
 	for (;;) {
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(NFS_I(inode), page);
 		if (req == NULL)
 			break;
 		if (nfs_lock_request_dontget(req))
@@ -385,9 +398,15 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
 	spin_lock(&inode->i_lock);
 	if (!nfsi->npages && nfs_have_delegation(inode, FMODE_WRITE))
 		inode->i_version++;
-	set_bit(PG_MAPPED, &req->wb_flags);
-	SetPagePrivate(req->wb_page);
-	set_page_private(req->wb_page, (unsigned long)req);
+	/*
+	 * Swap-space should not get truncated. Hence no need to plug the race
+	 * with invalidate/truncate.
+	 */
+	if (likely(!PageSwapCache(req->wb_page))) {
+		set_bit(PG_MAPPED, &req->wb_flags);
+		SetPagePrivate(req->wb_page);
+		set_page_private(req->wb_page, (unsigned long)req);
+	}
 	nfsi->npages++;
 	kref_get(&req->wb_kref);
 	spin_unlock(&inode->i_lock);
@@ -404,9 +423,11 @@ static void nfs_inode_remove_request(struct nfs_page *req)
 	BUG_ON (!NFS_WBACK_BUSY(req));
 
 	spin_lock(&inode->i_lock);
-	set_page_private(req->wb_page, 0);
-	ClearPagePrivate(req->wb_page);
-	clear_bit(PG_MAPPED, &req->wb_flags);
+	if (likely(!PageSwapCache(req->wb_page))) {
+		set_page_private(req->wb_page, 0);
+		ClearPagePrivate(req->wb_page);
+		clear_bit(PG_MAPPED, &req->wb_flags);
+	}
 	nfsi->npages--;
 	spin_unlock(&inode->i_lock);
 	nfs_release_request(req);
@@ -646,7 +667,7 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
 	spin_lock(&inode->i_lock);
 
 	for (;;) {
-		req = nfs_page_find_request_locked(page);
+		req = nfs_page_find_request_locked(NFS_I(inode), page);
 		if (req == NULL)
 			goto out_unlock;
 
@@ -1691,7 +1712,7 @@ int nfs_wb_page_cancel(struct inode *inode, struct page *page)
  */
 int nfs_wb_page(struct inode *inode, struct page *page)
 {
-	loff_t range_start = page_offset(page);
+	loff_t range_start = page_file_offset(page);
 	loff_t range_end = range_start + (loff_t)(PAGE_CACHE_SIZE - 1);
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_ALL,
-- 
1.7.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH 11/12] nfs: Prevent page allocator recursions with swap over NFS.
From: Mel Gorman @ 2012-05-17 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
	Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
	Mike Christie, Eric B Munson, Mel Gorman
In-Reply-To: <1337266285-8102-1-git-send-email-mgorman@suse.de>

GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate
IO, just not of any filesystem data.

The problem is that previously NOFS was correct because that avoids
recursion into the NFS code. With swap-over-NFS, it is no longer
correct as swap IO can lead to this recursion.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/nfs/pagelist.c |    2 +-
 fs/nfs/write.c    |    7 ++++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
index 77aa83e..8e80d71 100644
--- a/fs/nfs/pagelist.c
+++ b/fs/nfs/pagelist.c
@@ -29,7 +29,7 @@ static struct kmem_cache *nfs_page_cachep;
 static inline struct nfs_page *
 nfs_page_alloc(void)
 {
-	struct nfs_page	*p = kmem_cache_zalloc(nfs_page_cachep, GFP_KERNEL);
+	struct nfs_page	*p = kmem_cache_zalloc(nfs_page_cachep, GFP_NOIO);
 	if (p)
 		INIT_LIST_HEAD(&p->wb_list);
 	return p;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 21cfe71..abdbe61 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -52,7 +52,7 @@ static mempool_t *nfs_commit_mempool;
 
 struct nfs_write_data *nfs_commitdata_alloc(void)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOFS);
+	struct nfs_write_data *p = mempool_alloc(nfs_commit_mempool, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -72,7 +72,7 @@ EXPORT_SYMBOL_GPL(nfs_commit_free);
 
 struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 {
-	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOFS);
+	struct nfs_write_data *p = mempool_alloc(nfs_wdata_mempool, GFP_NOIO);
 
 	if (p) {
 		memset(p, 0, sizeof(*p));
@@ -81,7 +81,8 @@ struct nfs_write_data *nfs_writedata_alloc(unsigned int pagecount)
 		if (pagecount <= ARRAY_SIZE(p->page_array))
 			p->pagevec = p->page_array;
 		else {
-			p->pagevec = kcalloc(pagecount, sizeof(struct page *), GFP_NOFS);
+			p->pagevec = kcalloc(pagecount, sizeof(struct page *),
+					GFP_NOIO);
 			if (!p->pagevec) {
 				mempool_free(p, nfs_wdata_mempool);
 				p = NULL;
-- 
1.7.9.2

^ permalink raw reply related

* [PATCH 12/12] Avoid dereferencing bd_disk during swap_entry_free for network storage
From: Mel Gorman @ 2012-05-17 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
	Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
	Mike Christie, Eric B Munson, Mel Gorman
In-Reply-To: <1337266285-8102-1-git-send-email-mgorman-l3A5Bk7waGM@public.gmane.org>

Commit [b3a27d: swap: Add swap slot free callback to
block_device_operations] dereferences p->bdev->bd_disk but this is a
NULL dereference if using swap-over-NFS. This patch checks SWP_BLKDEV
on the swap_info_struct before dereferencing.

With reference to this callback, Christoph Hellwig stated "Please
just remove the callback entirely.  It has no user outside the staging
tree and was added clearly against the rules for that staging tree".
This would also be my preference but there was not an obvious way of
keeping zram in staging/ happy.

Signed-off-by: Xiaotian Feng <dfeng-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
---
 mm/swapfile.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 80b3415..d85d842 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -547,7 +547,6 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 
 	/* free if no reference */
 	if (!usage) {
-		struct gendisk *disk = p->bdev->bd_disk;
 		if (offset < p->lowest_bit)
 			p->lowest_bit = offset;
 		if (offset > p->highest_bit)
@@ -557,9 +556,11 @@ static unsigned char swap_entry_free(struct swap_info_struct *p,
 			swap_list.next = p->type;
 		nr_swap_pages++;
 		p->inuse_pages--;
-		if ((p->flags & SWP_BLKDEV) &&
-				disk->fops->swap_slot_free_notify)
-			disk->fops->swap_slot_free_notify(p->bdev, offset);
+		if (p->flags & SWP_BLKDEV) {
+			struct gendisk *disk = p->bdev->bd_disk;
+			if (disk->fops->swap_slot_free_notify)
+				disk->fops->swap_slot_free_notify(p->bdev, offset);
+		}
 	}
 
 	return usage;
-- 
1.7.9.2

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 10/12] nfs: enable swap on NFS
From: Mel Gorman @ 2012-05-17 14:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux-MM, Linux-Netdev, Linux-NFS, LKML, David Miller,
	Trond Myklebust, Neil Brown, Christoph Hellwig, Peter Zijlstra,
	Mike Christie, Eric B Munson, Mel Gorman
In-Reply-To: <1337266285-8102-1-git-send-email-mgorman@suse.de>

Implement the new swapfile a_ops for NFS and hook up ->direct_IO. This
will set the NFS socket to SOCK_MEMALLOC and run socket reconnect
under PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the
protocol ->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related
objects and the early (re)setting of SOCK_MEMALLOC should allow us
to receive the packets required for the TCP connection buildup.

[dfeng@redhat.com: Fix handling of multiple swap files]
[a.p.zijlstra@chello.nl: Original patch]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/nfs/Kconfig              |    8 ++++
 fs/nfs/direct.c             |   94 +++++++++++++++++++++++++++++--------------
 fs/nfs/file.c               |   22 +++++++++-
 include/linux/nfs_fs.h      |    4 +-
 include/linux/sunrpc/xprt.h |    3 ++
 net/sunrpc/Kconfig          |    5 +++
 net/sunrpc/clnt.c           |    2 +
 net/sunrpc/sched.c          |    7 +++-
 net/sunrpc/xprtsock.c       |   53 ++++++++++++++++++++++++
 9 files changed, 161 insertions(+), 37 deletions(-)

diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index 2a0e6c5..ff93d0c 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -75,6 +75,14 @@ config NFS_V4
 
 	  If unsure, say Y.
 
+config NFS_SWAP
+	bool "Provide swap over NFS support"
+	default n
+	depends on NFS_FS
+	select SUNRPC_SWAP
+	help
+	  This option enables swapon to work on files located on NFS mounts.
+
 config NFS_V4_1
 	bool "NFS client support for NFSv4.1 (EXPERIMENTAL)"
 	depends on NFS_FS && NFS_V4 && EXPERIMENTAL
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 481be7f..2ef7395 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -111,17 +111,28 @@ static inline int put_dreq(struct nfs_direct_req *dreq)
  * @nr_segs: size of iovec array
  *
  * The presence of this routine in the address space ops vector means
- * the NFS client supports direct I/O.  However, we shunt off direct
- * read and write requests before the VFS gets them, so this method
- * should never be called.
+ * the NFS client supports direct I/O. However, for most direct IO, we
+ * shunt off direct read and write requests before the VFS gets them,
+ * so this method is only ever called for swap.
  */
 ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t pos, unsigned long nr_segs)
 {
+#ifndef CONFIG_NFS_SWAP
 	dprintk("NFS: nfs_direct_IO (%s) off/no(%Ld/%lu) EINVAL\n",
 			iocb->ki_filp->f_path.dentry->d_name.name,
 			(long long) pos, nr_segs);
 
 	return -EINVAL;
+#else
+	VM_BUG_ON(iocb->ki_left != PAGE_SIZE);
+	VM_BUG_ON(iocb->ki_nbytes != PAGE_SIZE);
+
+	if (rw == READ || rw == KERNEL_READ)
+		return nfs_file_direct_read(iocb, iov, nr_segs, pos,
+				rw == READ ? true : false);
+	return nfs_file_direct_write(iocb, iov, nr_segs, pos,
+				rw == WRITE ? true : false);
+#endif /* CONFIG_NFS_SWAP */
 }
 
 static void nfs_direct_dirty_pages(struct page **pages, unsigned int pgbase, size_t count)
@@ -278,7 +289,7 @@ static const struct rpc_call_ops nfs_read_direct_ops = {
  */
 static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq,
 						const struct iovec *iov,
-						loff_t pos)
+						loff_t pos, bool uio)
 {
 	struct nfs_open_context *ctx = dreq->ctx;
 	struct inode *inode = ctx->dentry->d_inode;
@@ -312,13 +323,22 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq,
 		if (unlikely(!data))
 			break;
 
-		down_read(&current->mm->mmap_sem);
-		result = get_user_pages(current, current->mm, user_addr,
-					data->npages, 1, 0, data->pagevec, NULL);
-		up_read(&current->mm->mmap_sem);
-		if (result < 0) {
-			nfs_readdata_free(data);
-			break;
+		if (uio) {
+			down_read(&current->mm->mmap_sem);
+			result = get_user_pages(current, current->mm, user_addr,
+				data->npages, 1, 0, data->pagevec, NULL);
+			up_read(&current->mm->mmap_sem);
+			if (result < 0) {
+				nfs_readdata_free(data);
+				break;
+			}
+		} else {
+			WARN_ON(data->npages != 1);
+			result = get_kernel_page(user_addr, 1, data->pagevec);
+			if (WARN_ON(result != 1)) {
+				nfs_readdata_free(data);
+				break;
+			}
 		}
 		if ((unsigned)result < data->npages) {
 			bytes = result * PAGE_SIZE;
@@ -386,7 +406,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq,
 static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 					      const struct iovec *iov,
 					      unsigned long nr_segs,
-					      loff_t pos)
+					      loff_t pos, bool uio)
 {
 	ssize_t result = -EINVAL;
 	size_t requested_bytes = 0;
@@ -396,7 +416,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 
 	for (seg = 0; seg < nr_segs; seg++) {
 		const struct iovec *vec = &iov[seg];
-		result = nfs_direct_read_schedule_segment(dreq, vec, pos);
+		result = nfs_direct_read_schedule_segment(dreq, vec, pos, uio);
 		if (result < 0)
 			break;
 		requested_bytes += result;
@@ -420,7 +440,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 }
 
 static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov,
-			       unsigned long nr_segs, loff_t pos)
+			       unsigned long nr_segs, loff_t pos, bool uio)
 {
 	ssize_t result = -ENOMEM;
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
@@ -438,7 +458,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov,
 	if (!is_sync_kiocb(iocb))
 		dreq->iocb = iocb;
 
-	result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos);
+	result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos, uio);
 	if (!result)
 		result = nfs_direct_wait(dreq);
 out_release:
@@ -705,7 +725,8 @@ static const struct rpc_call_ops nfs_write_direct_ops = {
  */
 static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
 						 const struct iovec *iov,
-						 loff_t pos, int sync)
+						 loff_t pos, int sync,
+						 bool uio)
 {
 	struct nfs_open_context *ctx = dreq->ctx;
 	struct inode *inode = ctx->dentry->d_inode;
@@ -739,13 +760,22 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
 		if (unlikely(!data))
 			break;
 
-		down_read(&current->mm->mmap_sem);
-		result = get_user_pages(current, current->mm, user_addr,
-					data->npages, 0, 0, data->pagevec, NULL);
-		up_read(&current->mm->mmap_sem);
-		if (result < 0) {
-			nfs_writedata_free(data);
-			break;
+		if (uio) {
+			down_read(&current->mm->mmap_sem);
+			result = get_user_pages(current, current->mm, user_addr,
+				data->npages, 0, 0, data->pagevec, NULL);
+			up_read(&current->mm->mmap_sem);
+			if (result < 0) {
+				nfs_writedata_free(data);
+				break;
+			}
+		} else {
+			WARN_ON(data->npages != 1);
+			result = get_kernel_page(user_addr, 0, data->pagevec);
+			if (WARN_ON(result != 1)) {
+				nfs_writedata_free(data);
+				break;
+			}
 		}
 		if ((unsigned)result < data->npages) {
 			bytes = result * PAGE_SIZE;
@@ -817,7 +847,8 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq,
 static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 					       const struct iovec *iov,
 					       unsigned long nr_segs,
-					       loff_t pos, int sync)
+					       loff_t pos, int sync,
+					       bool uio)
 {
 	ssize_t result = 0;
 	size_t requested_bytes = 0;
@@ -828,7 +859,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 	for (seg = 0; seg < nr_segs; seg++) {
 		const struct iovec *vec = &iov[seg];
 		result = nfs_direct_write_schedule_segment(dreq, vec,
-							   pos, sync);
+							   pos, sync, uio);
 		if (result < 0)
 			break;
 		requested_bytes += result;
@@ -853,7 +884,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 
 static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos,
-				size_t count)
+				size_t count, bool uio)
 {
 	ssize_t result = -ENOMEM;
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
@@ -877,7 +908,8 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov,
 	if (!is_sync_kiocb(iocb))
 		dreq->iocb = iocb;
 
-	result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, sync);
+	result = nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos,
+								sync, uio);
 	if (!result)
 		result = nfs_direct_wait(dreq);
 out_release:
@@ -908,7 +940,7 @@ out:
  * cache.
  */
 ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t pos)
+				unsigned long nr_segs, loff_t pos, bool uio)
 {
 	ssize_t retval = -EINVAL;
 	struct file *file = iocb->ki_filp;
@@ -933,7 +965,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov,
 
 	task_io_account_read(count);
 
-	retval = nfs_direct_read(iocb, iov, nr_segs, pos);
+	retval = nfs_direct_read(iocb, iov, nr_segs, pos, uio);
 	if (retval > 0)
 		iocb->ki_pos = pos + retval;
 
@@ -964,7 +996,7 @@ out:
  * is no atomic O_APPEND write facility in the NFS protocol.
  */
 ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t pos)
+				unsigned long nr_segs, loff_t pos, bool uio)
 {
 	ssize_t retval = -EINVAL;
 	struct file *file = iocb->ki_filp;
@@ -996,7 +1028,7 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov,
 
 	task_io_account_write(count);
 
-	retval = nfs_direct_write(iocb, iov, nr_segs, pos, count);
+	retval = nfs_direct_write(iocb, iov, nr_segs, pos, count, uio);
 
 	if (retval > 0)
 		iocb->ki_pos = pos + retval;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 6ead5e3..0e330f2 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -187,7 +187,7 @@ nfs_file_read(struct kiocb *iocb, const struct iovec *iov,
 	ssize_t result;
 
 	if (iocb->ki_filp->f_flags & O_DIRECT)
-		return nfs_file_direct_read(iocb, iov, nr_segs, pos);
+		return nfs_file_direct_read(iocb, iov, nr_segs, pos, true);
 
 	dprintk("NFS: read(%s/%s, %lu@%lu)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
@@ -486,6 +486,20 @@ static int nfs_launder_page(struct page *page)
 	return nfs_wb_page(inode, page);
 }
 
+#ifdef CONFIG_NFS_SWAP
+static int nfs_swap_activate(struct swap_info_struct *sis, struct file *file,
+						sector_t *span)
+{
+	*span = sis->pages;
+	return xs_swapper(NFS_CLIENT(file->f_mapping->host)->cl_xprt, 1);
+}
+
+static void nfs_swap_deactivate(struct file *file)
+{
+	xs_swapper(NFS_CLIENT(file->f_mapping->host)->cl_xprt, 0);
+}
+#endif
+
 const struct address_space_operations nfs_file_aops = {
 	.readpage = nfs_readpage,
 	.readpages = nfs_readpages,
@@ -500,6 +514,10 @@ const struct address_space_operations nfs_file_aops = {
 	.migratepage = nfs_migrate_page,
 	.launder_page = nfs_launder_page,
 	.error_remove_page = generic_error_remove_page,
+#ifdef CONFIG_NFS_SWAP
+	.swap_activate = nfs_swap_activate,
+	.swap_deactivate = nfs_swap_deactivate,
+#endif
 };
 
 /*
@@ -574,7 +592,7 @@ static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov,
 	size_t count = iov_length(iov, nr_segs);
 
 	if (iocb->ki_filp->f_flags & O_DIRECT)
-		return nfs_file_direct_write(iocb, iov, nr_segs, pos);
+		return nfs_file_direct_write(iocb, iov, nr_segs, pos, true);
 
 	dprintk("NFS: write(%s/%s, %lu@%Ld)\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name,
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 52a1bdb..1f4efab 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -481,10 +481,10 @@ extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *, loff_t,
 			unsigned long);
 extern ssize_t nfs_file_direct_read(struct kiocb *iocb,
 			const struct iovec *iov, unsigned long nr_segs,
-			loff_t pos);
+			loff_t pos, bool uio);
 extern ssize_t nfs_file_direct_write(struct kiocb *iocb,
 			const struct iovec *iov, unsigned long nr_segs,
-			loff_t pos);
+			loff_t pos, bool uio);
 
 /*
  * linux/fs/nfs/dir.c
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 77d278d..cff40aa 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -174,6 +174,8 @@ struct rpc_xprt {
 	unsigned long		state;		/* transport state */
 	unsigned char		shutdown   : 1,	/* being shut down */
 				resvport   : 1; /* use a reserved port */
+	unsigned int		swapper;	/* we're swapping over this
+						   transport */
 	unsigned int		bind_index;	/* bind function index */
 
 	/*
@@ -316,6 +318,7 @@ void			xprt_release_rqst_cong(struct rpc_task *task);
 void			xprt_disconnect_done(struct rpc_xprt *xprt);
 void			xprt_force_disconnect(struct rpc_xprt *xprt);
 void			xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie);
+int			xs_swapper(struct rpc_xprt *xprt, int enable);
 
 /*
  * Reserved bit positions in xprt->state
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index 9fe8857..03d03e3 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -21,6 +21,11 @@ config SUNRPC_XPRT_RDMA
 
 	  If unsure, say N.
 
+config SUNRPC_SWAP
+	bool
+	depends on SUNRPC
+	select NETVM
+
 config RPCSEC_GSS_KRB5
 	tristate "Secure RPC: Kerberos V mechanism"
 	depends on SUNRPC && CRYPTO
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index adf2990..d6f4e623 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -719,6 +719,8 @@ void rpc_task_set_client(struct rpc_task *task, struct rpc_clnt *clnt)
 		atomic_inc(&clnt->cl_count);
 		if (clnt->cl_softrtry)
 			task->tk_flags |= RPC_TASK_SOFT;
+		if (task->tk_client->cl_xprt->swapper)
+			task->tk_flags |= RPC_TASK_SWAPPER;
 		/* Add to the client's list of all tasks */
 		spin_lock(&clnt->cl_lock);
 		list_add_tail(&task->tk_task, &clnt->cl_tasks);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 994cfea..83a4c43 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -812,7 +812,10 @@ static void rpc_async_schedule(struct work_struct *work)
 void *rpc_malloc(struct rpc_task *task, size_t size)
 {
 	struct rpc_buffer *buf;
-	gfp_t gfp = RPC_IS_SWAPPER(task) ? GFP_ATOMIC : GFP_NOWAIT;
+	gfp_t gfp = GFP_NOWAIT;
+
+	if (RPC_IS_SWAPPER(task))
+		gfp |= __GFP_MEMALLOC;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
@@ -886,7 +889,7 @@ static void rpc_init_task(struct rpc_task *task, const struct rpc_task_setup *ta
 static struct rpc_task *
 rpc_alloc_task(void)
 {
-	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOFS);
+	return (struct rpc_task *)mempool_alloc(rpc_task_mempool, GFP_NOIO);
 }
 
 /*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 890b03f..b84df34 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1930,6 +1930,45 @@ out:
 	xprt_wake_pending_tasks(xprt, status);
 }
 
+#ifdef CONFIG_SUNRPC_SWAP
+static void xs_set_memalloc(struct rpc_xprt *xprt)
+{
+	struct sock_xprt *transport = container_of(xprt, struct sock_xprt,
+			xprt);
+
+	if (xprt->swapper)
+		sk_set_memalloc(transport->inet);
+}
+
+/**
+ * xs_swapper - Tag this transport as being used for swap.
+ * @xprt: transport to tag
+ * @enable: enable/disable
+ *
+ */
+int xs_swapper(struct rpc_xprt *xprt, int enable)
+{
+	struct sock_xprt *transport = container_of(xprt, struct sock_xprt,
+			xprt);
+	int err = 0;
+
+	if (enable) {
+		xprt->swapper++;
+		xs_set_memalloc(xprt);
+	} else if (xprt->swapper) {
+		xprt->swapper--;
+		sk_clear_memalloc(transport->inet);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(xs_swapper);
+#else
+static void xs_set_memalloc(struct rpc_xprt *xprt)
+{
+}
+#endif
+
 static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 {
 	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
@@ -1954,6 +1993,8 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		transport->sock = sock;
 		transport->inet = sk;
 
+		xs_set_memalloc(xprt);
+
 		write_unlock_bh(&sk->sk_callback_lock);
 	}
 	xs_udp_do_set_buffer_size(xprt);
@@ -1965,11 +2006,15 @@ static void xs_udp_setup_socket(struct work_struct *work)
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct rpc_xprt *xprt = &transport->xprt;
 	struct socket *sock = transport->sock;
+	unsigned long pflags = current->flags;
 	int status = -EIO;
 
 	if (xprt->shutdown)
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	/* Start by resetting any existing state */
 	xs_reset_transport(transport);
 	sock = xs_create_sock(xprt, transport,
@@ -1988,6 +2033,7 @@ static void xs_udp_setup_socket(struct work_struct *work)
 out:
 	xprt_clear_connecting(xprt);
 	xprt_wake_pending_tasks(xprt, status);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /*
@@ -2078,6 +2124,8 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 	if (!xprt_bound(xprt))
 		goto out;
 
+	xs_set_memalloc(xprt);
+
 	/* Tell the socket layer to start connecting... */
 	xprt->stat.connect_count++;
 	xprt->stat.connect_start = jiffies;
@@ -2108,11 +2156,15 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 		container_of(work, struct sock_xprt, connect_worker.work);
 	struct socket *sock = transport->sock;
 	struct rpc_xprt *xprt = &transport->xprt;
+	unsigned long pflags = current->flags;
 	int status = -EIO;
 
 	if (xprt->shutdown)
 		goto out;
 
+	if (xprt->swapper)
+		current->flags |= PF_MEMALLOC;
+
 	if (!sock) {
 		clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
 		sock = xs_create_sock(xprt, transport,
@@ -2174,6 +2226,7 @@ out_eagain:
 out:
 	xprt_clear_connecting(xprt);
 	xprt_wake_pending_tasks(xprt, status);
+	tsk_restore_flags(current, pflags, PF_MEMALLOC);
 }
 
 /**
-- 
1.7.9.2

^ permalink raw reply related

* Re: [PATCH 3/7] netfilter: xt_HMARK: modulus is expensive for hash calculation
From: Pablo Neira Ayuso @ 2012-05-17 14:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Laight, netfilter-devel, davem, netdev
In-Reply-To: <1337243968.3403.3.camel@edumazet-glaptop>

On Thu, May 17, 2012 at 10:39:28AM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 09:16 +0100, David Laight wrote:
> > > From: Pablo Neira Ayuso <pablo@netfilter.org>
> > > 
> > > Use:
> > > 
> > > ((u64)(HASH_VAL * HASH_SIZE)) >> 32
> > > 
> > > as suggested by David S. Miller.
> > 
> > That (u64) cast is very unlikely to have any effect.
> > If you want a 64 bit result from the product of two
> > 32 bit values, you have to cast one of the 32 bit values
> > prior to the multiply - as in the patch below.
> 
> Hey, Changelog is a bit wrong (for several reasons) but code is correct.
> 
> return (((u64)hash * info->hmodulus) >> 32) + info->hoffset;

Sorry, for the mistake in the changelog. I copied & pasted it from the
mailing list discussion.

^ permalink raw reply

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Willy Tarreau @ 2012-05-17 15:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <20120517121800.GA18052@1wt.eu>

Hi Eric,

On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> Hi Eric,
> 
> I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> exhibited by your patch 'tcp: allow splice() to build full TSO
> packets' which unfortunately I am very interested in !
> 
> What I'm observing is that TCP transmits using splice() stall
> quite quickly if I'm using pipes larger than 64kB (even 65537
> is enough to reliably observe the stall).
(...)

I managed to fix the issue and I really think that the fix makes sense.
I'm appending the patch, please could you review it and if approved,
push it for inclusion ?

BTW, your patch significantly improves performance here. On this
machine I was reaching max 515 Mbps of proxied traffic, and with
the patch I reach 665 Mbps with the same test ! I think you managed
to fix what caused splice() to always be slightly slower than
recv/send() for a long time in my tests with a number of NICs !

Thanks,
Willy

-------

>From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 17 May 2012 16:48:56 +0200
Subject: [PATCH] tcp: force push data out when buffers are missing

Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
significantly improved splice() performance for some workloads but
caused stalls when pipe buffers were larger than socket buffers.

The issue seems to happen when no data can be copied at all due to
lack of buffers, which results in pending data never being pushed.

This change checks if all pending data has been pushed or not and
pushes them when waiting for send buffers.
---
 net/ipv4/tcp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 80b988f..f6d9e5f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -850,7 +850,7 @@ new_segment:
 wait_for_sndbuf:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-		if (copied)
+		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
 			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
 
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
-- 
1.7.2.1.45.g54fbc

^ permalink raw reply related

* [PATCH net-next-2.6] pppoe: remove unused return value from two methods.
From: Rami Rosen @ 2012-05-17 15:04 UTC (permalink / raw)
  To: David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 168 bytes --]

Hi,

 The patch removes unused return value from __delete_item() and
 delete_item() methods in drivers/net/ppp/pppoe.c.

Signed-off-by: Rami Rosen <ramirose@gmail.com>

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 1273 bytes --]

diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index 2fa1a9b..dd15b8f 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -201,7 +201,7 @@ static int __set_item(struct pppoe_net *pn, struct pppox_sock *po)
 	return 0;
 }
 
-static struct pppox_sock *__delete_item(struct pppoe_net *pn, __be16 sid,
+static void __delete_item(struct pppoe_net *pn, __be16 sid,
 					char *addr, int ifindex)
 {
 	int hash = hash_item(sid, addr);
@@ -220,8 +220,6 @@ static struct pppox_sock *__delete_item(struct pppoe_net *pn, __be16 sid,
 		src = &ret->next;
 		ret = ret->next;
 	}
-
-	return ret;
 }
 
 /**********************************************************************
@@ -264,16 +262,12 @@ static inline struct pppox_sock *get_item_by_addr(struct net *net,
 	return pppox_sock;
 }
 
-static inline struct pppox_sock *delete_item(struct pppoe_net *pn, __be16 sid,
+static inline void delete_item(struct pppoe_net *pn, __be16 sid,
 					char *addr, int ifindex)
 {
-	struct pppox_sock *ret;
-
 	write_lock_bh(&pn->hash_lock);
-	ret = __delete_item(pn, sid, addr, ifindex);
+	__delete_item(pn, sid, addr, ifindex);
 	write_unlock_bh(&pn->hash_lock);
-
-	return ret;
 }
 
 /***************************************************************************

^ permalink raw reply related

* [PATCH net-next] etherdevice: fix comments
From: Stephen Hemminger @ 2012-05-17 15:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Fix some minor problems in comments of etherdevice.h
 * Warning is out dated, file hasn't moved or disappeared in many years and
    is unlikely to do so soon.
 * Capitalize Ethernet consistently since it is a proper name
 * Fix descriptive comment of padding
 * Spelling and grammar fix for alignment comment

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
 

--- a/include/linux/etherdevice.h	2012-05-17 07:28:39.911045825 -0700
+++ b/include/linux/etherdevice.h	2012-05-17 08:15:26.448550013 -0700
@@ -18,8 +18,6 @@
  *		as published by the Free Software Foundation; either version
  *		2 of the License, or (at your option) any later version.
  *
- *	WARNING: This move may well be temporary. This file will get merged with others RSN.
- *
  */
 #ifndef _LINUX_ETHERDEVICE_H
 #define _LINUX_ETHERDEVICE_H
@@ -159,7 +157,7 @@ static inline void eth_hw_addr_random(st
  * @addr1: Pointer to a six-byte array containing the Ethernet address
  * @addr2: Pointer other six-byte array containing the Ethernet address
  *
- * Compare two ethernet addresses, returns 0 if equal, non-zero otherwise.
+ * Compare two Ethernet addresses, returns 0 if equal, non-zero otherwise.
  * Unlike memcmp(), it doesn't return a value suitable for sorting.
  */
 static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
@@ -176,7 +174,7 @@ static inline unsigned compare_ether_add
  * @addr1: Pointer to a six-byte array containing the Ethernet address
  * @addr2: Pointer other six-byte array containing the Ethernet address
  *
- * Compare two ethernet addresses, returns true if equal
+ * Compare two Ethernet addresses, returns true if equal
  */
 static inline bool ether_addr_equal(const u8 *addr1, const u8 *addr2)
 {
@@ -197,13 +195,13 @@ static inline unsigned long zap_last_2by
  * @addr1: Pointer to an array of 8 bytes
  * @addr2: Pointer to an other array of 8 bytes
  *
- * Compare two ethernet addresses, returns true if equal, false otherwise.
+ * Compare two Ethernet addresses, returns true if equal, false otherwise.
  *
  * The function doesn't need any conditional branches and possibly uses
  * word memory accesses on CPU allowing cheap unaligned memory reads.
- * arrays = { byte1, byte2, byte3, byte4, byte6, byte7, pad1, pad2}
+ * arrays = { byte1, byte2, byte3, byte4, byte5, byte6, pad1, pad2 }
  *
- * Please note that alignment of addr1 & addr2 is only guaranted to be 16 bits.
+ * Please note that alignment of addr1 & addr2 are only guaranteed to be 16 bits.
  */
 
 static inline bool ether_addr_equal_64bits(const u8 addr1[6+2],
@@ -257,7 +255,7 @@ static inline bool is_etherdev_addr(cons
  * @a: Pointer to Ethernet header
  * @b: Pointer to Ethernet header
  *
- * Compare two ethernet headers, returns 0 if equal.
+ * Compare two Ethernet headers, returns 0 if equal.
  * This assumes that the network header (i.e., IP header) is 4-byte
  * aligned OR the platform can handle unaligned access.  This is the
  * case for all packets coming into netif_receive_skb or similar

^ permalink raw reply

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: Tejun Heo @ 2012-05-17 15:19 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Andrew Morton, cgroups, linux-mm, devel, kamezawa.hiroyu, netdev,
	Li Zefan, Johannes Weiner, Michal Hocko
In-Reply-To: <4FB4CA4D.50608@parallels.com>

On Thu, May 17, 2012 at 01:52:13PM +0400, Glauber Costa wrote:
> Andrew is right. It seems we will need that mutex after all. Just
> this is not a race, and neither something that should belong in the
> static_branch interface.

Yeah, with a completely different comment.  It just needs to wrap
->activated alteration and static key inc/dec, right?

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
From: Shirley Ma @ 2012-05-17 15:28 UTC (permalink / raw)
  To: Jason Wang; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <4FB469A1.3040507@redhat.com>

On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
> Didn't see how this affact skb->len. And for truesize, I think they
> are 
> different, when the offset were not zero, the data in this vector
> were 
> divided into two parts. First part is copied into skb directly, and
> the 
> second were pinned from a whole userspace page by
> get_user_pages_fast(), 
> so we need count the whole page to the socket limit to prevent evil 
> application. 

What I meant that the code for skb->truesize has double added the first
offset if any left from that vector (partically copied into skb
directly, and then count pagesize which includes the offset (truesize +=
PAGE_SIZE)).

Thanks
Shirley

^ permalink raw reply

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
From: Shirley Ma @ 2012-05-17 15:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm,
	davem
In-Reply-To: <4FB4677A.8020402@redhat.com>

On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
> The problem is we may stop the tx queue when there no enough capacity
> to 
> place packets, at this moment  we depends on the tx interrupt to 
> re-enable the tx queue. So if we didn't poll the vhost during
> callback, 
> guest may lose the tx interrupt to re-enable the tx queue which could 
> stall the whole tx queue. 

VHOST_MAX_PEND should handle the capacity. 

Hasn't the above situation been handled in handle_tx() code?:
...
                        if (unlikely(num_pends > VHOST_MAX_PEND)) {
                                tx_poll_start(net, sock);
                                set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
                                break;
                        }
...

Thanks
Shirley

^ permalink raw reply

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Eric Dumazet @ 2012-05-17 15:43 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev
In-Reply-To: <20120517150157.GA19274@1wt.eu>

On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote:
> Hi Eric,
> 
> On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > exhibited by your patch 'tcp: allow splice() to build full TSO
> > packets' which unfortunately I am very interested in !
> > 
> > What I'm observing is that TCP transmits using splice() stall
> > quite quickly if I'm using pipes larger than 64kB (even 65537
> > is enough to reliably observe the stall).
> (...)
> 
> I managed to fix the issue and I really think that the fix makes sense.
> I'm appending the patch, please could you review it and if approved,
> push it for inclusion ?
> 
> BTW, your patch significantly improves performance here. On this
> machine I was reaching max 515 Mbps of proxied traffic, and with
> the patch I reach 665 Mbps with the same test ! I think you managed
> to fix what caused splice() to always be slightly slower than
> recv/send() for a long time in my tests with a number of NICs !
> 

What NIC do you use exactly ?

With commit 1d0c0b328a6 in net-next
(net: makes skb_splice_bits() aware of skb->head_frag)
You'll be able to get even more speed, if NIC uses frag to hold frame.

> Thanks,
> Willy
> 
> -------
> 
> From 6da6a21798d0156e647a993c31782eec739fa5df Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 17 May 2012 16:48:56 +0200
> Subject: [PATCH] tcp: force push data out when buffers are missing
> 
> Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> significantly improved splice() performance for some workloads but
> caused stalls when pipe buffers were larger than socket buffers.
> 

But... This seems bug was already there ? Only having more data in pipe
trigger it more often ?

> The issue seems to happen when no data can be copied at all due to
> lack of buffers, which results in pending data never being pushed.
> 
> This change checks if all pending data has been pushed or not and
> pushes them when waiting for send buffers.
> ---
>  net/ipv4/tcp.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 80b988f..f6d9e5f 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -850,7 +850,7 @@ new_segment:
>  wait_for_sndbuf:
>  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  wait_for_memory:
> -		if (copied)
> +		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
>  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
>  
>  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)

Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 
( tcp: tcp_sendpages() should call tcp_push() once )

^ permalink raw reply

* Re: [PATCH net-next] tcp: bool conversions
From: Eric Dumazet @ 2012-05-17 15:46 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev
In-Reply-To: <1337258796.2496.6.camel@bwh-desktop.uk.solarflarecom.com>

On Thu, 2012-05-17 at 13:46 +0100, Ben Hutchings wrote:
> On Thu, 2012-05-17 at 11:15 +0200, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > bool conversions where possible.
> 
> There's a bit more than bool conversions here:
> 
> [...]
> > --- a/net/ipv4/tcp_hybla.c
> > +++ b/net/ipv4/tcp_hybla.c
> [...]
> > @@ -24,8 +24,7 @@ struct hybla {
> >  	u32   minrtt;	      /* Minimum smoothed round trip time value seen */
> >  };
> >  
> > -/* Hybla reference round trip time (default= 1/40 sec = 25 ms),
> > -   expressed in jiffies */
> > +/* Hybla reference round trip time (default= 1/40 sec = 25 ms), in ms */
> >  static int rtt0 = 25;
> >  module_param(rtt0, int, 0644);
> >  MODULE_PARM_DESC(rtt0, "reference rout trip time (ms)");
> > @@ -39,7 +38,7 @@ static inline void hybla_recalc_param (struct sock *sk)
> >  	ca->rho_3ls = max_t(u32, tcp_sk(sk)->srtt / msecs_to_jiffies(rtt0), 8);
> >  	ca->rho = ca->rho_3ls >> 3;
> >  	ca->rho2_7ls = (ca->rho_3ls * ca->rho_3ls) << 1;
> > -	ca->rho2 = ca->rho2_7ls >>7;
> > +	ca->rho2 = ca->rho2_7ls >> 7;
> >  }
> >  
> >  static void hybla_init(struct sock *sk)
> [...] 
> > @@ -67,6 +66,7 @@ static void hybla_init(struct sock *sk)
> >  static void hybla_state(struct sock *sk, u8 ca_state)
> >  {
> >  	struct hybla *ca = inet_csk_ca(sk);
> > +
> >  	ca->hybla_en = (ca_state == TCP_CA_Open);
> >  }
> >  
> [...]
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> [...]
> > -static __inline__ int tcp_in_window(u32 seq, u32 end_seq, u32 s_win, u32 e_win)
> > +static bool tcp_in_window(u32 seq, u32 end_seq, u32 s_win, u32 e_win)
> >  { 
> [...]
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> [...]
> >  /* Does at least the first segment of SKB fit into the send window? */
> > -static inline int tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
> > -				   unsigned int cur_mss)
> > +static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
> > +			     const struct sk_buff *skb,
> > +			     unsigned int cur_mss)
> [...]
> 


You got me ;)

99 % of the patch is about bool conversion.

I have no problem removing the spaces/inline parts.

^ permalink raw reply

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Willy Tarreau @ 2012-05-17 15:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1337269380.3403.10.camel@edumazet-glaptop>

On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 17:01 +0200, Willy Tarreau wrote:
> > Hi Eric,
> > 
> > On Thu, May 17, 2012 at 02:18:00PM +0200, Willy Tarreau wrote:
> > > Hi Eric,
> > > 
> > > I'm facing a regression in stable 3.2.17 and 3.0.31 which is
> > > exhibited by your patch 'tcp: allow splice() to build full TSO
> > > packets' which unfortunately I am very interested in !
> > > 
> > > What I'm observing is that TCP transmits using splice() stall
> > > quite quickly if I'm using pipes larger than 64kB (even 65537
> > > is enough to reliably observe the stall).
> > (...)
> > 
> > I managed to fix the issue and I really think that the fix makes sense.
> > I'm appending the patch, please could you review it and if approved,
> > push it for inclusion ?
> > 
> > BTW, your patch significantly improves performance here. On this
> > machine I was reaching max 515 Mbps of proxied traffic, and with
> > the patch I reach 665 Mbps with the same test ! I think you managed
> > to fix what caused splice() to always be slightly slower than
> > recv/send() for a long time in my tests with a number of NICs !
> > 
> 
> What NIC do you use exactly ?

It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
dreamplugs and almost all ARM-based cheap NAS boxes.

> With commit 1d0c0b328a6 in net-next
> (net: makes skb_splice_bits() aware of skb->head_frag)
> You'll be able to get even more speed, if NIC uses frag to hold frame.

I'm going to check this now, sounds interesting :-)

The NIC does not support TSO but I've seen an alternate driver for this
NIC which pretends to do TSO and in fact builds header frags so that the
NIC is able to send all frames at once. I think it's already what GSO is
doing but I'm wondering whether it would be possible to get more speed
by doing this than by relying on GSO to (possibly) split the frags earlier.

Note that I'm not quite skilled on this, so do not hesitate to correct me
if I'm saying stupid things.

(...)

> > Commit 2f533844242 (tcp: allow splice() to build full TSO packets)
> > significantly improved splice() performance for some workloads but
> > caused stalls when pipe buffers were larger than socket buffers.
> 
> But... This seems bug was already there ? Only having more data in pipe
> trigger it more often ?

I honnestly don't know. As I said in the initial mail, I suspect this is
the case because I don't see in your patch what could cause the bug, but
I suspect it makes it more frequent. Still, I could not manage to reproduce
the bug without your patch. So maybe we switched from just below a limit
to just over.

> > The issue seems to happen when no data can be copied at all due to
> > lack of buffers, which results in pending data never being pushed.
> > 
> > This change checks if all pending data has been pushed or not and
> > pushes them when waiting for send buffers.
> > ---
> >  net/ipv4/tcp.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 80b988f..f6d9e5f 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -850,7 +850,7 @@ new_segment:
> >  wait_for_sndbuf:
> >  		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> >  wait_for_memory:
> > -		if (copied)
> > +		if (tcp_sk(sk)->pushed_seq != tcp_sk(sk)->write_seq)
> >  			tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
> >  
> >  		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
> 
> Can you check you have commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 
> ( tcp: tcp_sendpages() should call tcp_push() once )

Yes I have it, in fact I'm on -stable (3.0.31) which includes your backport
of the two patches in a single one. It's just a few lines below, out of the
context of this patch.

Best regards,
Willy

^ permalink raw reply

* [PATCH] ipv6: correct the ipv6 option name - Pad0 to Pad1
From: Eldad Zack @ 2012-05-17 16:00 UTC (permalink / raw)
  To: Stephen Hemminger, David S. Miller, Alexey Kuznetsov,
	James Morris, Hideaki YOSHIFUJI, Patrick McHardy,
	Jamal Hadi Salim
  Cc: bridge, netdev, linux-kernel, Eldad Zack

The padding destination or hop-by-hop option is called Pad1 and not Pad0.

See RFC2460 (4.2) or the IANA ipv6-parameters registry:
http://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xml

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>

diff --git a/include/linux/in6.h b/include/linux/in6.h
index 5c83d9e..cba469b 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -142,7 +142,7 @@ struct in6_flowlabel_req {
 /*
  *	IPv6 TLV options.
  */
-#define IPV6_TLV_PAD0		0
+#define IPV6_TLV_PAD1		0
 #define IPV6_TLV_PADN		1
 #define IPV6_TLV_ROUTERALERT	5
 #define IPV6_TLV_JUMBO		194
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 5ca4c50..b665812 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -460,8 +460,8 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
 	hopopt[3] = 2;				/* Length of RA Option */
 	hopopt[4] = 0;				/* Type = 0x0000 (MLD) */
 	hopopt[5] = 0;
-	hopopt[6] = IPV6_TLV_PAD0;		/* Pad0 */
-	hopopt[7] = IPV6_TLV_PAD0;		/* Pad0 */
+	hopopt[6] = IPV6_TLV_PAD1;		/* Pad1 */
+	hopopt[7] = IPV6_TLV_PAD1;		/* Pad1 */
 
 	skb_put(skb, sizeof(*ip6h) + 8);
 
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index dce55d4..e41456b 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -558,7 +558,7 @@ static int check_hbh_len(struct sk_buff *skb)
 		int optlen = nh[off + 1] + 2;
 
 		switch (nh[off]) {
-		case IPV6_TLV_PAD0:
+		case IPV6_TLV_PAD1:
 			optlen = 1;
 			break;
 
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index 9aa3d01..5d32e7a 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -127,7 +127,7 @@ static int zero_out_mutable_opts(struct ipv6_opt_hdr *opthdr)
 
 		switch (opt[off]) {
 
-		case IPV6_TLV_PAD0:
+		case IPV6_TLV_PAD1:
 			optlen = 1;
 			break;
 		default:
@@ -171,7 +171,7 @@ static void ipv6_rearrange_destopt(struct ipv6hdr *iph, struct ipv6_opt_hdr *des
 
 		switch (opt[off]) {
 
-		case IPV6_TLV_PAD0:
+		case IPV6_TLV_PAD1:
 			optlen = 1;
 			break;
 		default:
diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index a93bd23..a3cded6 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -75,7 +75,7 @@ int ipv6_find_tlv(struct sk_buff *skb, int offset, int type)
 			return offset;
 
 		switch (opttype) {
-		case IPV6_TLV_PAD0:
+		case IPV6_TLV_PAD1:
 			optlen = 1;
 			break;
 		default:
@@ -156,7 +156,7 @@ static int ip6_parse_tlv(struct tlvtype_proc *procs, struct sk_buff *skb)
 		int i;
 
 		switch (nh[off]) {
-		case IPV6_TLV_PAD0:
+		case IPV6_TLV_PAD1:
 			optlen = 1;
 			break;
 
diff --git a/net/ipv6/mip6.c b/net/ipv6/mip6.c
index 2e02f7c..5b087c3 100644
--- a/net/ipv6/mip6.c
+++ b/net/ipv6/mip6.c
@@ -46,7 +46,7 @@ static inline void *mip6_padn(__u8 *data, __u8 padlen)
 	if (!data)
 		return NULL;
 	if (padlen == 1) {
-		data[0] = IPV6_TLV_PAD0;
+		data[0] = IPV6_TLV_PAD1;
 	} else if (padlen > 1) {
 		data[0] = IPV6_TLV_PADN;
 		data[1] = padlen - 2;
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 882124c..2c8ad7c 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -397,7 +397,7 @@ static int tcf_csum_ipv6_hopopts(struct ipv6_opt_hdr *ip6xh,
 
 	while (len > 1) {
 		switch (xh[off]) {
-		case IPV6_TLV_PAD0:
+		case IPV6_TLV_PAD1:
 			optlen = 1;
 			break;
 		case IPV6_TLV_JUMBO:
-- 
1.7.10

^ permalink raw reply related

* Re: [PATCH] ipv6: correct the ipv6 option name - Pad0 to Pad1
From: Stephen Hemminger @ 2012-05-17 16:09 UTC (permalink / raw)
  To: Eldad Zack
  Cc: Hideaki YOSHIFUJI, netdev, bridge, Jamal Hadi Salim, James Morris,
	David S. Miller, Alexey Kuznetsov, linux-kernel
In-Reply-To: <1337270425-21873-1-git-send-email-eldad@fogrefinery.com>

On Thu, 17 May 2012 18:00:25 +0200
Eldad Zack <eldad@fogrefinery.com> wrote:

> The padding destination or hop-by-hop option is called Pad1 and not Pad0.
> 
> See RFC2460 (4.2) or the IANA ipv6-parameters registry:
> http://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xml
> 
> Signed-off-by: Eldad Zack <eldad@fogrefinery.com>

If standards bodies want to call a 0 a 1, then I guess we have to follow :-)

^ permalink raw reply

* Re: [PATCH] ipv6: correct the ipv6 option name - Pad0 to Pad1
From: Eldad Zack @ 2012-05-17 16:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Jamal Hadi Salim, bridge,
	netdev, linux-kernel
In-Reply-To: <20120517090923.0d0930e9@nehalam.linuxnetplumber.net>


On Thu, 17 May 2012, Stephen Hemminger wrote:
> On Thu, 17 May 2012 18:00:25 +0200
> Eldad Zack <eldad@fogrefinery.com> wrote:
> 
> > The padding destination or hop-by-hop option is called Pad1 and not Pad0.
> > 
> > See RFC2460 (4.2) or the IANA ipv6-parameters registry:
> > http://www.iana.org/assignments/ipv6-parameters/ipv6-parameters.xml
> > 
> > Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
> 
> If standards bodies want to call a 0 a 1, then I guess we have to follow :-)

Yeah, Pad0 is a much better name for it. That's probably why it never
occured to any one for so long :-)

^ permalink raw reply

* Re: [RFC 13/13] USB: Disable hub-initiated LPM for comms devices.
From: Sarah Sharp @ 2012-05-17 16:29 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-usb, Alan Stern, linux-bluetooth, gigaset307x-common,
	netdev, linux-wireless, ath9k-devel, libertas-dev, users
In-Reply-To: <20120517144951.GB27597@kroah.com>

On Thu, May 17, 2012 at 07:49:51AM -0700, Greg Kroah-Hartman wrote:
> On Wed, May 16, 2012 at 09:52:20PM -0700, Sarah Sharp wrote:
> > On Wed, May 16, 2012 at 04:20:19PM -0700, Greg Kroah-Hartman wrote:
> > > On Wed, May 16, 2012 at 03:45:28PM -0700, Sarah Sharp wrote:
> > > > The Intel Windows folks are disabling hub-initiated LPM for all USB
> > > > communications devices under a xHCI USB 3.0 host.  In order to keep
> > > > the Linux behavior as close as possible to Windows, we need to do the
> > > > same in Linux.
> > > 
> > > How is the USB core on Windows determining that LPM should be turned off
> > > for these devices?  Surely they aren't modifying each individual driver
> > > like this is, right?  Any way we also can do this in the core?
> > 
> > No, I don't think they're modifying individual drivers.  Maybe they
> > placed a shim/filter driver below other drivers?
> 
> They can do this in their driver by just watching the device class type.
> 
> > Basically, I don't know the exact details of what the Windows folks are
> > doing.  The recommendation from the Intel Windows team was simply to
> > turn hub-initiated LPM off for "all communications devices".  Perhaps
> > the Windows USB core is looking for specific USB class codes?  Or maybe
> > it has some older API that lets the core know it's a communications
> > device?
> > 
> > I'm not really sure we can do it in the USB core with out basically
> > duplicating all the class/PID/VID matching in the communications driver.
> > I think just adding a flag might be the best way.  I'm open to
> > suggestions though.
> 
> You can detect something as "simple" as a class type, which I bet is all
> that Windows is going to be able to do as well.
> 
> > > Or, turn it around the other way, and only enable it if we know it's
> > > safe to do so, in each driver, but I guess that would be even messier.
> > 
> > Yeah, I think it would be messier.
> 
> Ok, this is probably the best solution for us as well, sorry for the
> noise.

No worries. :)

One of the problems with this patch is that several mailing lists
rejected it because of the large number of recipients in the Cc list,
even when just the mailing lists were on the Cc list.  Should I break it
up into a series of patches per driver?

Also, do I need to wait on acks from the other maintainers for this
change?  My instinct is no, since it shouldn't change the behavior of
their drivers with the USB devices that are out there today, but I don't
want to step on anyone's toes.

Sarah Sharp

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Kieran Mansley @ 2012-05-17 16:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev
In-Reply-To: <1337101280.8512.1108.camel@edumazet-glaptop>

On Tue, 2012-05-15 at 19:01 +0200, Eric Dumazet wrote:
> You could try setting tcp_adv_win_scale to -2

I've been unable to reproduce this at all this afternoon (not sure why)
so haven't been able to try that suggestion yet.

>From looking in to what it does it sounds like it could well help, but
only in the sense of tuning the system to make it harder to hit the race
rather than fixing the race altogether.  Similarly for fixing
skb->truesize (we spotted this last week when we first hit the problem,
and adjusting it didn't make much difference).

I'll take another look tomorrow and see if I can work out what's changed
to make the drops disappear.

Kieran 

^ permalink raw reply

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Eric Dumazet @ 2012-05-17 16:33 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev
In-Reply-To: <20120517155621.GK14498@1wt.eu>

On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:

> It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
> dreamplugs and almost all ARM-based cheap NAS boxes.
> 
> > With commit 1d0c0b328a6 in net-next
> > (net: makes skb_splice_bits() aware of skb->head_frag)
> > You'll be able to get even more speed, if NIC uses frag to hold frame.
> 
> I'm going to check this now, sounds interesting :-)

splice(socket -> pipe) does a copy of frame from skb->head to a page
fragment.

With latest patches (net-next), we can provide a frag for skb->head and
avoid this copy in splice(). You would have a true zero copy
socket->socket mode.

tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-17 16:37 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev
In-Reply-To: <1337272292.1681.16.camel@kjm-desktop.uk.level5networks.com>

On Thu, 2012-05-17 at 17:31 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-15 at 19:01 +0200, Eric Dumazet wrote:
> > You could try setting tcp_adv_win_scale to -2
> 
> I've been unable to reproduce this at all this afternoon (not sure why)
> so haven't been able to try that suggestion yet.
> 
> From looking in to what it does it sounds like it could well help, but
> only in the sense of tuning the system to make it harder to hit the race
> rather than fixing the race altogether.  Similarly for fixing
> skb->truesize (we spotted this last week when we first hit the problem,
> and adjusting it didn't make much difference).

Hmm, I was not suggesting running production servers with this setting,
only do an experiment.

With net-next and tcp coalescing, I no longer have TCPBacklogDrops /
collapses, but I dont have sfc card/driver.

^ permalink raw reply

* Re: [RFC 13/13] USB: Disable hub-initiated LPM for comms devices.
From: Greg Kroah-Hartman @ 2012-05-17 16:39 UTC (permalink / raw)
  To: Sarah Sharp
  Cc: linux-usb, Alan Stern, linux-bluetooth, gigaset307x-common,
	netdev, linux-wireless, ath9k-devel, libertas-dev, users
In-Reply-To: <20120517162922.GC4967@xanatos>

On Thu, May 17, 2012 at 09:29:22AM -0700, Sarah Sharp wrote:
> One of the problems with this patch is that several mailing lists
> rejected it because of the large number of recipients in the Cc list,
> even when just the mailing lists were on the Cc list.  Should I break it
> up into a series of patches per driver?

No need to do that, I can just take it.

> Also, do I need to wait on acks from the other maintainers for this
> change?  My instinct is no, since it shouldn't change the behavior of
> their drivers with the USB devices that are out there today, but I don't
> want to step on anyone's toes.

Don't worry about it, we change USB drivers in other trees all the time :)

greg k-h

^ permalink raw reply

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Willy Tarreau @ 2012-05-17 16:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1337272404.3403.18.camel@edumazet-glaptop>

On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote:
> On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> 
> > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> > driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
> > dreamplugs and almost all ARM-based cheap NAS boxes.
> > 
> > > With commit 1d0c0b328a6 in net-next
> > > (net: makes skb_splice_bits() aware of skb->head_frag)
> > > You'll be able to get even more speed, if NIC uses frag to hold frame.
> > 
> > I'm going to check this now, sounds interesting :-)
> 
> splice(socket -> pipe) does a copy of frame from skb->head to a page
> fragment.
> 
> With latest patches (net-next), we can provide a frag for skb->head and
> avoid this copy in splice(). You would have a true zero copy
> socket->socket mode.

That's what I've read in your commit messages. Indeed, we've been waiting
for this to happen for a long time. It should bring a real benefit on small
devices like the one I'm playing with which have very low memory bandwidth,
because it's a shame to copy the data and thrash the cache when we don't
need to see them.

I'm currently trying to get mainline to boot then I'll switch to net-next.

> tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.

I'll try to port your patch (8d4057 tg3: provide frags as skb head) to
mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
send a patch :-)

Cheers,
Willy

^ permalink raw reply

* [PATCH 2/5] drivers/net: delete all code/drivers depending on CONFIG_MCA
From: Paul Gortmaker @ 2012-05-17 16:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: JBottomley, Paul Gortmaker, David S. Miller, James Bottomley,
	netdev
In-Reply-To: <1337272841-25931-1-git-send-email-paul.gortmaker@windriver.com>

The support for CONFIG_MCA is being removed, since the 20
year old hardware simply isn't capable of meeting today's
software demands on CPU and memory resources.

This commit removes any MCA specific net drivers, and removes
any MCA specific probe/support code from drivers that were
doing a dual ISA/MCA role.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 Documentation/networking/3c509.txt     |    1 -
 Documentation/networking/fore200e.txt  |    6 +-
 drivers/net/Space.c                    |   16 +-
 drivers/net/ethernet/3com/3c509.c      |  123 +---
 drivers/net/ethernet/8390/Kconfig      |   24 -
 drivers/net/ethernet/8390/Makefile     |    1 -
 drivers/net/ethernet/8390/ne2.c        |  798 ---------------
 drivers/net/ethernet/8390/smc-mca.c    |  575 -----------
 drivers/net/ethernet/amd/depca.c       |  210 +----
 drivers/net/ethernet/fujitsu/at1700.c  |  120 +---
 drivers/net/ethernet/i825xx/3c523.c    | 1312 -------------------------
 drivers/net/ethernet/i825xx/3c523.h    |  355 -------
 drivers/net/ethernet/i825xx/3c527.c    | 1660 --------------------------------
 drivers/net/ethernet/i825xx/Kconfig    |   22 -
 drivers/net/ethernet/i825xx/Makefile   |    2 -
 drivers/net/ethernet/i825xx/eexpress.c |   60 +--
 drivers/net/ethernet/natsemi/Kconfig   |   20 +-
 drivers/net/ethernet/natsemi/Makefile  |    1 -
 18 files changed, 18 insertions(+), 5288 deletions(-)
 delete mode 100644 drivers/net/ethernet/8390/ne2.c
 delete mode 100644 drivers/net/ethernet/8390/smc-mca.c
 delete mode 100644 drivers/net/ethernet/i825xx/3c523.c
 delete mode 100644 drivers/net/ethernet/i825xx/3c523.h
 delete mode 100644 drivers/net/ethernet/i825xx/3c527.c

diff --git a/Documentation/networking/3c509.txt b/Documentation/networking/3c509.txt
index dcc9eaf..fbf722e 100644
--- a/Documentation/networking/3c509.txt
+++ b/Documentation/networking/3c509.txt
@@ -25,7 +25,6 @@ models:
   3c509B (later revision of the ISA card; supports full-duplex)
   3c589 (PCMCIA)
   3c589B (later revision of the 3c589; supports full-duplex)
-  3c529 (MCA)
   3c579 (EISA)
 
 Large portions of this documentation were heavily borrowed from the guide
diff --git a/Documentation/networking/fore200e.txt b/Documentation/networking/fore200e.txt
index f648eb2..d52af53 100644
--- a/Documentation/networking/fore200e.txt
+++ b/Documentation/networking/fore200e.txt
@@ -11,12 +11,10 @@ i386, alpha (untested), powerpc, sparc and sparc64 archs.
 
 The intent is to enable the use of different models of FORE adapters at the
 same time, by hosts that have several bus interfaces (such as PCI+SBUS,
-PCI+MCA or PCI+EISA).
+or PCI+EISA).
 
 Only PCI and SBUS devices are currently supported by the driver, but support
-for other bus interfaces such as EISA should not be too hard to add (this may
-be more tricky for the MCA bus, though, as FORE made some MCA-specific
-modifications to the adapter's AALI interface).
+for other bus interfaces such as EISA should not be too hard to add.
 
 
 Firmware Copyright Notice
diff --git a/drivers/net/Space.c b/drivers/net/Space.c
index 486e2dc..e3f0fac 100644
--- a/drivers/net/Space.c
+++ b/drivers/net/Space.c
@@ -133,22 +133,9 @@ static struct devprobe2 eisa_probes[] __initdata = {
 	{NULL, 0},
 };
 
-static struct devprobe2 mca_probes[] __initdata = {
-#ifdef CONFIG_NE2_MCA
-	{ne2_probe, 0},
-#endif
-#ifdef CONFIG_ELMC		/* 3c523 */
-	{elmc_probe, 0},
-#endif
-#ifdef CONFIG_ELMC_II		/* 3c527 */
-	{mc32_probe, 0},
-#endif
-	{NULL, 0},
-};
-
 /*
  * ISA probes that touch addresses < 0x400 (including those that also
- * look for EISA/PCI/MCA cards in addition to ISA cards).
+ * look for EISA/PCI cards in addition to ISA cards).
  */
 static struct devprobe2 isa_probes[] __initdata = {
 #if defined(CONFIG_HP100) && defined(CONFIG_ISA)	/* ISA, EISA */
@@ -278,7 +265,6 @@ static void __init ethif_probe2(int unit)
 
 	(void)(	probe_list2(unit, m68k_probes, base_addr == 0) &&
 		probe_list2(unit, eisa_probes, base_addr == 0) &&
-		probe_list2(unit, mca_probes, base_addr == 0) &&
 		probe_list2(unit, isa_probes, base_addr == 0) &&
 		probe_list2(unit, parport_probes, base_addr == 0));
 }
diff --git a/drivers/net/ethernet/3com/3c509.c b/drivers/net/ethernet/3com/3c509.c
index 41719da..1a8eef2 100644
--- a/drivers/net/ethernet/3com/3c509.c
+++ b/drivers/net/ethernet/3com/3c509.c
@@ -69,7 +69,6 @@
 #define TX_TIMEOUT  (400*HZ/1000)
 
 #include <linux/module.h>
-#include <linux/mca.h>
 #include <linux/isa.h>
 #include <linux/pnp.h>
 #include <linux/string.h>
@@ -102,7 +101,7 @@ static int el3_debug = 2;
 #endif
 
 /* Used to do a global count of all the cards in the system.  Must be
- * a global variable so that the mca/eisa probe routines can increment
+ * a global variable so that the eisa probe routines can increment
  * it */
 static int el3_cards = 0;
 #define EL3_MAX_CARDS 8
@@ -163,7 +162,7 @@ enum RxFilter {
  */
 #define SKB_QUEUE_SIZE	64
 
-enum el3_cardtype { EL3_ISA, EL3_PNP, EL3_MCA, EL3_EISA };
+enum el3_cardtype { EL3_ISA, EL3_PNP, EL3_EISA };
 
 struct el3_private {
 	spinlock_t lock;
@@ -505,41 +504,6 @@ static struct eisa_driver el3_eisa_driver = {
 static int eisa_registered;
 #endif
 
-#ifdef CONFIG_MCA
-static int el3_mca_probe(struct device *dev);
-
-static short el3_mca_adapter_ids[] __initdata = {
-		0x627c,
-		0x627d,
-		0x62db,
-		0x62f6,
-		0x62f7,
-		0x0000
-};
-
-static char *el3_mca_adapter_names[] __initdata = {
-		"3Com 3c529 EtherLink III (10base2)",
-		"3Com 3c529 EtherLink III (10baseT)",
-		"3Com 3c529 EtherLink III (test mode)",
-		"3Com 3c529 EtherLink III (TP or coax)",
-		"3Com 3c529 EtherLink III (TP)",
-		NULL
-};
-
-static struct mca_driver el3_mca_driver = {
-		.id_table = el3_mca_adapter_ids,
-		.driver = {
-				.name = "3c529",
-				.bus = &mca_bus_type,
-				.probe = el3_mca_probe,
-				.remove = __devexit_p(el3_device_remove),
-				.suspend = el3_suspend,
-				.resume  = el3_resume,
-		},
-};
-static int mca_registered;
-#endif /* CONFIG_MCA */
-
 static const struct net_device_ops netdev_ops = {
 	.ndo_open 		= el3_open,
 	.ndo_stop	 	= el3_close,
@@ -600,76 +564,6 @@ static void el3_common_remove (struct net_device *dev)
 	free_netdev (dev);
 }
 
-#ifdef CONFIG_MCA
-static int __init el3_mca_probe(struct device *device)
-{
-	/* Based on Erik Nygren's (nygren@mit.edu) 3c529 patch,
-	 * heavily modified by Chris Beauregard
-	 * (cpbeaure@csclub.uwaterloo.ca) to support standard MCA
-	 * probing.
-	 *
-	 * redone for multi-card detection by ZP Gu (zpg@castle.net)
-	 * now works as a module */
-
-	short i;
-	int ioaddr, irq, if_port;
-	__be16 phys_addr[3];
-	struct net_device *dev = NULL;
-	u_char pos4, pos5;
-	struct mca_device *mdev = to_mca_device(device);
-	int slot = mdev->slot;
-	int err;
-
-	pos4 = mca_device_read_stored_pos(mdev, 4);
-	pos5 = mca_device_read_stored_pos(mdev, 5);
-
-	ioaddr = ((short)((pos4&0xfc)|0x02)) << 8;
-	irq = pos5 & 0x0f;
-
-
-	pr_info("3c529: found %s at slot %d\n",
-		el3_mca_adapter_names[mdev->index], slot + 1);
-
-	/* claim the slot */
-	strncpy(mdev->name, el3_mca_adapter_names[mdev->index],
-			sizeof(mdev->name));
-	mca_device_set_claim(mdev, 1);
-
-	if_port = pos4 & 0x03;
-
-	irq = mca_device_transform_irq(mdev, irq);
-	ioaddr = mca_device_transform_ioport(mdev, ioaddr);
-	if (el3_debug > 2) {
-		pr_debug("3c529: irq %d  ioaddr 0x%x  ifport %d\n", irq, ioaddr, if_port);
-	}
-	EL3WINDOW(0);
-	for (i = 0; i < 3; i++)
-		phys_addr[i] = htons(read_eeprom(ioaddr, i));
-
-	dev = alloc_etherdev(sizeof (struct el3_private));
-	if (dev == NULL) {
-		release_region(ioaddr, EL3_IO_EXTENT);
-		return -ENOMEM;
-	}
-
-	netdev_boot_setup_check(dev);
-
-	el3_dev_fill(dev, phys_addr, ioaddr, irq, if_port, EL3_MCA);
-	dev_set_drvdata(device, dev);
-	err = el3_common_init(dev);
-
-	if (err) {
-		dev_set_drvdata(device, NULL);
-		free_netdev(dev);
-		return -ENOMEM;
-	}
-
-	el3_devs[el3_cards++] = dev;
-	return 0;
-}
-
-#endif /* CONFIG_MCA */
-
 #ifdef CONFIG_EISA
 static int __init el3_eisa_probe (struct device *device)
 {
@@ -1547,11 +1441,6 @@ static int __init el3_init_module(void)
 	if (!ret)
 		eisa_registered = 1;
 #endif
-#ifdef CONFIG_MCA
-	ret = mca_register_driver(&el3_mca_driver);
-	if (!ret)
-		mca_registered = 1;
-#endif
 
 #ifdef CONFIG_PNP
 	if (pnp_registered)
@@ -1563,10 +1452,6 @@ static int __init el3_init_module(void)
 	if (eisa_registered)
 		ret = 0;
 #endif
-#ifdef CONFIG_MCA
-	if (mca_registered)
-		ret = 0;
-#endif
 	return ret;
 }
 
@@ -1584,10 +1469,6 @@ static void __exit el3_cleanup_module(void)
 	if (eisa_registered)
 		eisa_driver_unregister(&el3_eisa_driver);
 #endif
-#ifdef CONFIG_MCA
-	if (mca_registered)
-		mca_unregister_driver(&el3_mca_driver);
-#endif
 }
 
 module_init (el3_init_module);
diff --git a/drivers/net/ethernet/8390/Kconfig b/drivers/net/ethernet/8390/Kconfig
index 910895c..2e53867 100644
--- a/drivers/net/ethernet/8390/Kconfig
+++ b/drivers/net/ethernet/8390/Kconfig
@@ -182,18 +182,6 @@ config NE2000
 	  To compile this driver as a module, choose M here. The module
 	  will be called ne.
 
-config NE2_MCA
-	tristate "NE/2 (ne2000 MCA version) support"
-	depends on MCA_LEGACY
-	select CRC32
-	---help---
-	  If you have a network (Ethernet) card of this type, say Y and read
-	  the Ethernet-HOWTO, available from
-	  <http://www.tldp.org/docs.html#howto>.
-
-	  To compile this driver as a module, choose M here. The module
-	  will be called ne2.
-
 config NE2K_PCI
 	tristate "PCI NE2000 and clones support (see help)"
 	depends on PCI
@@ -267,18 +255,6 @@ config STNIC
 
 	  If unsure, say N.
 
-config ULTRAMCA
-	tristate "SMC Ultra MCA support"
-	depends on MCA
-	select CRC32
-	---help---
-	  If you have a network (Ethernet) card of this type and are running
-	  an MCA based system (PS/2), say Y and read the Ethernet-HOWTO,
-	  available from <http://www.tldp.org/docs.html#howto>.
-
-	  To compile this driver as a module, choose M here. The module
-	  will be called smc-mca.
-
 config ULTRA
 	tristate "SMC Ultra support"
 	depends on ISA
diff --git a/drivers/net/ethernet/8390/Makefile b/drivers/net/ethernet/8390/Makefile
index 3337d7f..d13790b 100644
--- a/drivers/net/ethernet/8390/Makefile
+++ b/drivers/net/ethernet/8390/Makefile
@@ -24,6 +24,5 @@ obj-$(CONFIG_PCMCIA_PCNET) += pcnet_cs.o 8390.o
 obj-$(CONFIG_STNIC) += stnic.o 8390.o
 obj-$(CONFIG_ULTRA) += smc-ultra.o 8390.o
 obj-$(CONFIG_ULTRA32) += smc-ultra32.o 8390.o
-obj-$(CONFIG_ULTRAMCA) += smc-mca.o 8390.o
 obj-$(CONFIG_WD80x3) += wd.o 8390.o
 obj-$(CONFIG_ZORRO8390) += zorro8390.o 8390.o
diff --git a/drivers/net/ethernet/8390/ne2.c b/drivers/net/ethernet/8390/ne2.c
deleted file mode 100644
index ef85839..0000000
diff --git a/drivers/net/ethernet/8390/smc-mca.c b/drivers/net/ethernet/8390/smc-mca.c
deleted file mode 100644
index 7a68590..0000000
diff --git a/drivers/net/ethernet/amd/depca.c b/drivers/net/ethernet/amd/depca.c
index 7f7b99a..c771de7 100644
--- a/drivers/net/ethernet/amd/depca.c
+++ b/drivers/net/ethernet/amd/depca.c
@@ -155,23 +155,10 @@
     2 depca's in a PC).
 
     ************************************************************************
-    Support for MCA EtherWORKS cards added 11-3-98.
+    Support for MCA EtherWORKS cards added 11-3-98. (MCA since deleted)
     Verified to work with up to 2 DE212 cards in a system (although not
       fully stress-tested).
 
-    Currently known bugs/limitations:
-
-    Note:  with the MCA stuff as a module, it trusts the MCA configuration,
-           not the command line for IRQ and memory address.  You can
-           specify them if you want, but it will throw your values out.
-           You still have to pass the IO address it was configured as
-           though.
-
-    ************************************************************************
-    TO DO:
-    ------
-
-
     Revision History
     ----------------
 
@@ -261,10 +248,6 @@
 #include <asm/io.h>
 #include <asm/dma.h>
 
-#ifdef CONFIG_MCA
-#include <linux/mca.h>
-#endif
-
 #ifdef CONFIG_EISA
 #include <linux/eisa.h>
 #endif
@@ -360,44 +343,6 @@ static struct eisa_driver depca_eisa_driver = {
 };
 #endif
 
-#ifdef CONFIG_MCA
-/*
-** Adapter ID for the MCA EtherWORKS DE210/212 adapter
-*/
-#define DE210_ID 0x628d
-#define DE212_ID 0x6def
-
-static short depca_mca_adapter_ids[] = {
-	DE210_ID,
-	DE212_ID,
-	0x0000
-};
-
-static char *depca_mca_adapter_name[] = {
-	"DEC EtherWORKS MC Adapter (DE210)",
-	"DEC EtherWORKS MC Adapter (DE212)",
-	NULL
-};
-
-static enum depca_type depca_mca_adapter_type[] = {
-	de210,
-	de212,
-	0
-};
-
-static int depca_mca_probe (struct device *);
-
-static struct mca_driver depca_mca_driver = {
-	.id_table = depca_mca_adapter_ids,
-	.driver   = {
-		.name   = depca_string,
-		.bus    = &mca_bus_type,
-		.probe  = depca_mca_probe,
-		.remove = __devexit_p(depca_device_remove),
-	},
-};
-#endif
-
 static int depca_isa_probe (struct platform_device *);
 
 static int __devexit depca_isa_remove(struct platform_device *pdev)
@@ -464,8 +409,7 @@ struct depca_private {
 	char adapter_name[DEPCA_STRLEN];	/* /proc/ioports string                  */
 	enum depca_type adapter;		/* Adapter type */
 	enum {
-                DEPCA_BUS_MCA = 1,
-                DEPCA_BUS_ISA,
+                DEPCA_BUS_ISA = 1,
                 DEPCA_BUS_EISA,
         } depca_bus;	        /* type of bus */
 	struct depca_init init_block;	/* Shadow Initialization block            */
@@ -624,12 +568,6 @@ static int __init depca_hw_init (struct net_device *dev, struct device *device)
 	       dev_name(device), depca_signature[lp->adapter], ioaddr);
 
 	switch (lp->depca_bus) {
-#ifdef CONFIG_MCA
-	case DEPCA_BUS_MCA:
-		printk(" (MCA slot %d)", to_mca_device(device)->slot + 1);
-		break;
-#endif
-
 #ifdef CONFIG_EISA
 	case DEPCA_BUS_EISA:
 		printk(" (EISA slot %d)", to_eisa_device(device)->slot);
@@ -661,10 +599,7 @@ static int __init depca_hw_init (struct net_device *dev, struct device *device)
 	if (nicsr & BUF) {
 		nicsr &= ~BS;	/* DEPCA RAM in top 32k */
 		netRAM -= 32;
-
-		/* Only EISA/ISA needs start address to be re-computed */
-		if (lp->depca_bus != DEPCA_BUS_MCA)
-			mem_start += 0x8000;
+		mem_start += 0x8000;
 	}
 
 	if ((mem_len = (NUM_RX_DESC * (sizeof(struct depca_rx_desc) + RX_BUFF_SZ) + NUM_TX_DESC * (sizeof(struct depca_tx_desc) + TX_BUFF_SZ) + sizeof(struct depca_init)))
@@ -1325,130 +1260,6 @@ static int __init depca_common_init (u_long ioaddr, struct net_device **devp)
 	return status;
 }
 
-#ifdef CONFIG_MCA
-/*
-** Microchannel bus I/O device probe
-*/
-static int __init depca_mca_probe(struct device *device)
-{
-	unsigned char pos[2];
-	unsigned char where;
-	unsigned long iobase, mem_start;
-	int irq, err;
-	struct mca_device *mdev = to_mca_device (device);
-	struct net_device *dev;
-	struct depca_private *lp;
-
-	/*
-	** Search for the adapter.  If an address has been given, search
-	** specifically for the card at that address.  Otherwise find the
-	** first card in the system.
-	*/
-
-	pos[0] = mca_device_read_stored_pos(mdev, 2);
-	pos[1] = mca_device_read_stored_pos(mdev, 3);
-
-	/*
-	** IO of card is handled by bits 1 and 2 of pos0.
-	**
-	**    bit2 bit1    IO
-	**       0    0    0x2c00
-	**       0    1    0x2c10
-	**       1    0    0x2c20
-	**       1    1    0x2c30
-	*/
-	where = (pos[0] & 6) >> 1;
-	iobase = 0x2c00 + (0x10 * where);
-
-	/*
-	** Found the adapter we were looking for. Now start setting it up.
-	**
-	** First work on decoding the IRQ.  It's stored in the lower 4 bits
-	** of pos1.  Bits are as follows (from the ADF file):
-	**
-	**      Bits
-	**   3   2   1   0    IRQ
-	**   --------------------
-	**   0   0   1   0     5
-	**   0   0   0   1     9
-	**   0   1   0   0    10
-	**   1   0   0   0    11
-	*/
-	where = pos[1] & 0x0f;
-	switch (where) {
-	case 1:
-		irq = 9;
-		break;
-	case 2:
-		irq = 5;
-		break;
-	case 4:
-		irq = 10;
-		break;
-	case 8:
-		irq = 11;
-		break;
-	default:
-		printk("%s: mca_probe IRQ error.  You should never get here (%d).\n", mdev->name, where);
-		return -EINVAL;
-	}
-
-	/*
-	** Shared memory address of adapter is stored in bits 3-5 of pos0.
-	** They are mapped as follows:
-	**
-	**    Bit
-	**   5  4  3       Memory Addresses
-	**   0  0  0       C0000-CFFFF (64K)
-	**   1  0  0       C8000-CFFFF (32K)
-	**   0  0  1       D0000-DFFFF (64K)
-	**   1  0  1       D8000-DFFFF (32K)
-	**   0  1  0       E0000-EFFFF (64K)
-	**   1  1  0       E8000-EFFFF (32K)
-	*/
-	where = (pos[0] & 0x18) >> 3;
-	mem_start = 0xc0000 + (where * 0x10000);
-	if (pos[0] & 0x20) {
-		mem_start += 0x8000;
-	}
-
-	/* claim the slot */
-	strncpy(mdev->name, depca_mca_adapter_name[mdev->index],
-		sizeof(mdev->name));
-	mca_device_set_claim(mdev, 1);
-
-        /*
-	** Get everything allocated and initialized...  (almost just
-	** like the ISA and EISA probes)
-	*/
-	irq = mca_device_transform_irq(mdev, irq);
-	iobase = mca_device_transform_ioport(mdev, iobase);
-
-	if ((err = depca_common_init (iobase, &dev)))
-		goto out_unclaim;
-
-	dev->irq = irq;
-	dev->base_addr = iobase;
-	lp = netdev_priv(dev);
-	lp->depca_bus = DEPCA_BUS_MCA;
-	lp->adapter = depca_mca_adapter_type[mdev->index];
-	lp->mem_start = mem_start;
-
-	if ((err = depca_hw_init(dev, device)))
-		goto out_free;
-
-	return 0;
-
- out_free:
-	free_netdev (dev);
-	release_region (iobase, DEPCA_TOTAL_SIZE);
- out_unclaim:
-	mca_device_set_claim(mdev, 0);
-
-	return err;
-}
-#endif
-
 /*
 ** ISA bus I/O device probe
 */
@@ -2059,15 +1870,10 @@ static int __init depca_module_init (void)
 {
 	int err = 0;
 
-#ifdef CONFIG_MCA
-	err = mca_register_driver(&depca_mca_driver);
-	if (err)
-		goto err;
-#endif
 #ifdef CONFIG_EISA
 	err = eisa_driver_register(&depca_eisa_driver);
 	if (err)
-		goto err_mca;
+		goto err_eisa;
 #endif
 	err = platform_driver_register(&depca_isa_driver);
 	if (err)
@@ -2079,11 +1885,6 @@ static int __init depca_module_init (void)
 err_eisa:
 #ifdef CONFIG_EISA
 	eisa_driver_unregister(&depca_eisa_driver);
-err_mca:
-#endif
-#ifdef CONFIG_MCA
-	mca_unregister_driver(&depca_mca_driver);
-err:
 #endif
 	return err;
 }
@@ -2091,9 +1892,6 @@ err:
 static void __exit depca_module_exit (void)
 {
 	int i;
-#ifdef CONFIG_MCA
-        mca_unregister_driver (&depca_mca_driver);
-#endif
 #ifdef CONFIG_EISA
         eisa_driver_unregister (&depca_eisa_driver);
 #endif
diff --git a/drivers/net/ethernet/fujitsu/at1700.c b/drivers/net/ethernet/fujitsu/at1700.c
index 3d94797..4b80dc4 100644
--- a/drivers/net/ethernet/fujitsu/at1700.c
+++ b/drivers/net/ethernet/fujitsu/at1700.c
@@ -27,7 +27,7 @@
 	ATI provided their EEPROM configuration code header file.
     Thanks to NIIBE Yutaka <gniibe@mri.co.jp> for bug fixes.
 
-    MCA bus (AT1720) support by Rene Schmit <rene@bss.lu>
+    MCA bus (AT1720) support (now deleted) by Rene Schmit <rene@bss.lu>
 
   Bugs:
 	The MB86965 has a design flaw that makes all probes unreliable.  Not
@@ -38,7 +38,6 @@
 #include <linux/errno.h>
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
-#include <linux/mca-legacy.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
@@ -79,24 +78,6 @@ static unsigned at1700_probe_list[] __initdata = {
 	0x260, 0x280, 0x2a0, 0x240, 0x340, 0x320, 0x380, 0x300, 0
 };
 
-/*
- *	MCA
- */
-#ifdef CONFIG_MCA_LEGACY
-static int at1700_ioaddr_pattern[] __initdata = {
-	0x00, 0x04, 0x01, 0x05, 0x02, 0x06, 0x03, 0x07
-};
-
-static int at1700_mca_probe_list[] __initdata = {
-	0x400, 0x1400, 0x2400, 0x3400, 0x4400, 0x5400, 0x6400, 0x7400, 0
-};
-
-static int at1700_irq_pattern[] __initdata = {
-	0x00, 0x00, 0x00, 0x30, 0x70, 0xb0, 0x00, 0x00,
-	0x00, 0xf0, 0x34, 0x74, 0xb4, 0x00, 0x00, 0xf4, 0x00
-};
-#endif
-
 /* use 0 for production, 1 for verification, >2 for debug */
 #ifndef NET_DEBUG
 #define NET_DEBUG 1
@@ -114,7 +95,6 @@ struct net_local {
 	uint tx_queue_ready:1;			/* Tx queue is ready to be sent. */
 	uint rx_started:1;			/* Packets are Rxing. */
 	uchar tx_queue;				/* Number of packet on the Tx queue. */
-	char mca_slot;				/* -1 means ISA */
 	ushort tx_queue_len;			/* Current length of the Tx queue. */
 };
 
@@ -166,21 +146,6 @@ static void set_rx_mode(struct net_device *dev);
 static void net_tx_timeout (struct net_device *dev);
 
 
-#ifdef CONFIG_MCA_LEGACY
-struct at1720_mca_adapters_struct {
-	char* name;
-	int id;
-};
-/* rEnE : maybe there are others I don't know off... */
-
-static struct at1720_mca_adapters_struct at1720_mca_adapters[] __initdata = {
-	{ "Allied Telesys AT1720AT",	0x6410 },
-	{ "Allied Telesys AT1720BT", 	0x6413 },
-	{ "Allied Telesys AT1720T",	0x6416 },
-	{ NULL, 0 },
-};
-#endif
-
 /* Check for a network adaptor of this type, and return '0' iff one exists.
    If dev->base_addr == 0, probe all likely locations.
    If dev->base_addr == 1, always return failure.
@@ -194,11 +159,6 @@ static int irq;
 
 static void cleanup_card(struct net_device *dev)
 {
-#ifdef CONFIG_MCA_LEGACY
-	struct net_local *lp = netdev_priv(dev);
-	if (lp->mca_slot >= 0)
-		mca_mark_as_unused(lp->mca_slot);
-#endif
 	free_irq(dev->irq, NULL);
 	release_region(dev->base_addr, AT1700_IO_EXTENT);
 }
@@ -273,7 +233,7 @@ static int __init at1700_probe1(struct net_device *dev, int ioaddr)
 	static const char fmv_irqmap_pnp[8] = {3, 4, 5, 7, 9, 10, 11, 15};
 	static const char at1700_irqmap[8] = {3, 4, 5, 9, 10, 11, 14, 15};
 	unsigned int i, irq, is_fmv18x = 0, is_at1700 = 0;
-	int slot, ret = -ENODEV;
+	int ret = -ENODEV;
 	struct net_local *lp = netdev_priv(dev);
 
 	if (!request_region(ioaddr, AT1700_IO_EXTENT, DRV_NAME))
@@ -288,64 +248,6 @@ static int __init at1700_probe1(struct net_device *dev, int ioaddr)
 		   ioaddr, read_eeprom(ioaddr, 4), read_eeprom(ioaddr, 5),
 		   read_eeprom(ioaddr, 6), inw(ioaddr + EEPROM_Ctrl));
 #endif
-
-#ifdef CONFIG_MCA_LEGACY
-	/* rEnE (rene@bss.lu): got this from 3c509 driver source , adapted for AT1720 */
-
-    /* Based on Erik Nygren's (nygren@mit.edu) 3c529 patch, heavily
-	modified by Chris Beauregard (cpbeaure@csclub.uwaterloo.ca)
-	to support standard MCA probing. */
-
-	/* redone for multi-card detection by ZP Gu (zpg@castle.net) */
-	/* now works as a module */
-
-	if (MCA_bus) {
-		int j;
-		int l_i;
-		u_char pos3, pos4;
-
-		for (j = 0; at1720_mca_adapters[j].name != NULL; j ++) {
-			slot = 0;
-			while (slot != MCA_NOTFOUND) {
-
-				slot = mca_find_unused_adapter( at1720_mca_adapters[j].id, slot );
-				if (slot == MCA_NOTFOUND) break;
-
-				/* if we get this far, an adapter has been detected and is
-				enabled */
-
-				pos3 = mca_read_stored_pos( slot, 3 );
-				pos4 = mca_read_stored_pos( slot, 4 );
-
-				for (l_i = 0; l_i < 8; l_i++)
-					if (( pos3 & 0x07) == at1700_ioaddr_pattern[l_i])
-						break;
-				ioaddr = at1700_mca_probe_list[l_i];
-
-				for (irq = 0; irq < 0x10; irq++)
-					if (((((pos4>>4) & 0x0f) | (pos3 & 0xf0)) & 0xff) == at1700_irq_pattern[irq])
-						break;
-
-					/* probing for a card at a particular IO/IRQ */
-				if ((dev->irq && dev->irq != irq) ||
-				    (dev->base_addr && dev->base_addr != ioaddr)) {
-				  	slot++;		/* probing next slot */
-				  	continue;
-				}
-
-				dev->irq = irq;
-
-				/* claim the slot */
-				mca_set_adapter_name( slot, at1720_mca_adapters[j].name );
-				mca_mark_as_used(slot);
-
-				goto found;
-			}
-		}
-		/* if we get here, we didn't find an MCA adapter - try ISA */
-	}
-#endif
-	slot = -1;
 	/* We must check for the EEPROM-config boards first, else accessing
 	   IOCONFIG0 will move the board! */
 	if (at1700_probe_list[inb(ioaddr + IOCONFIG1) & 0x07] == ioaddr &&
@@ -360,11 +262,7 @@ static int __init at1700_probe1(struct net_device *dev, int ioaddr)
 		goto err_out;
 	}
 
-#ifdef CONFIG_MCA_LEGACY
-found:
-#endif
-
-		/* Reset the internal state machines. */
+	/* Reset the internal state machines. */
 	outb(0, ioaddr + RESET);
 
 	if (is_at1700) {
@@ -380,11 +278,11 @@ found:
 					break;
 			}
 			if (i == 8) {
-				goto err_mca;
+				goto err_out;
 			}
 		} else {
 			if (fmv18x_probe_list[inb(ioaddr + IOCONFIG) & 0x07] != ioaddr)
-				goto err_mca;
+				goto err_out;
 			irq = fmv_irqmap[(inb(ioaddr + IOCONFIG)>>6) & 0x03];
 		}
 	}
@@ -464,23 +362,17 @@ found:
 	spin_lock_init(&lp->lock);
 
 	lp->jumpered = is_fmv18x;
-	lp->mca_slot = slot;
 	/* Snarf the interrupt vector now. */
 	ret = request_irq(irq, net_interrupt, 0, DRV_NAME, dev);
 	if (ret) {
 		printk(KERN_ERR "AT1700 at %#3x is unusable due to a "
 		       "conflict on IRQ %d.\n",
 		       ioaddr, irq);
-		goto err_mca;
+		goto err_out;
 	}
 
 	return 0;
 
-err_mca:
-#ifdef CONFIG_MCA_LEGACY
-	if (slot >= 0)
-		mca_mark_as_unused(slot);
-#endif
 err_out:
 	release_region(ioaddr, AT1700_IO_EXTENT);
 	return ret;
diff --git a/drivers/net/ethernet/i825xx/3c523.c b/drivers/net/ethernet/i825xx/3c523.c
deleted file mode 100644
index 8451ecd..0000000
diff --git a/drivers/net/ethernet/i825xx/3c523.h b/drivers/net/ethernet/i825xx/3c523.h
deleted file mode 100644
index 6956441..0000000
diff --git a/drivers/net/ethernet/i825xx/3c527.c b/drivers/net/ethernet/i825xx/3c527.c
deleted file mode 100644
index 278e791..0000000
diff --git a/drivers/net/ethernet/i825xx/Kconfig b/drivers/net/ethernet/i825xx/Kconfig
index ca1ae98..fed5080 100644
--- a/drivers/net/ethernet/i825xx/Kconfig
+++ b/drivers/net/ethernet/i825xx/Kconfig
@@ -43,28 +43,6 @@ config EL16
 	  To compile this driver as a module, choose M here. The module
 	  will be called 3c507.
 
-config ELMC
-	tristate "3c523 \"EtherLink/MC\" support"
-	depends on MCA_LEGACY
-	---help---
-	  If you have a network (Ethernet) card of this type, say Y and read
-	  the Ethernet-HOWTO, available from
-	  <http://www.tldp.org/docs.html#howto>.
-
-	  To compile this driver as a module, choose M here. The module
-	  will be called 3c523.
-
-config ELMC_II
-	tristate "3c527 \"EtherLink/MC 32\" support (EXPERIMENTAL)"
-	depends on MCA && MCA_LEGACY
-	---help---
-	  If you have a network (Ethernet) card of this type, say Y and read
-	  the Ethernet-HOWTO, available from
-	  <http://www.tldp.org/docs.html#howto>.
-
-	  To compile this driver as a module, choose M here. The module
-	  will be called 3c527.
-
 config ARM_ETHER1
 	tristate "Acorn Ether1 support"
 	depends on ARM && ARCH_ACORN
diff --git a/drivers/net/ethernet/i825xx/Makefile b/drivers/net/ethernet/i825xx/Makefile
index f68a369..6adff85 100644
--- a/drivers/net/ethernet/i825xx/Makefile
+++ b/drivers/net/ethernet/i825xx/Makefile
@@ -7,8 +7,6 @@ obj-$(CONFIG_EEXPRESS) += eexpress.o
 obj-$(CONFIG_EEXPRESS_PRO) += eepro.o
 obj-$(CONFIG_ELPLUS) += 3c505.o
 obj-$(CONFIG_EL16) += 3c507.o
-obj-$(CONFIG_ELMC) += 3c523.o
-obj-$(CONFIG_ELMC_II) += 3c527.o
 obj-$(CONFIG_LP486E) += lp486e.o
 obj-$(CONFIG_NI52) += ni52.o
 obj-$(CONFIG_SUN3_82586) += sun3_82586.o
diff --git a/drivers/net/ethernet/i825xx/eexpress.c b/drivers/net/ethernet/i825xx/eexpress.c
index cc2e66a..7a6a2f0 100644
--- a/drivers/net/ethernet/i825xx/eexpress.c
+++ b/drivers/net/ethernet/i825xx/eexpress.c
@@ -9,7 +9,7 @@
  * Many modifications, and currently maintained, by
  *  Philip Blundell <philb@gnu.org>
  * Added the Compaq LTE  Alan Cox <alan@lxorguk.ukuu.org.uk>
- * Added MCA support Adam Fritzler
+ * Added MCA support Adam Fritzler (now deleted)
  *
  * Note - this driver is experimental still - it has problems on faster
  * machines. Someone needs to sit down and go through it line by line with
@@ -111,7 +111,6 @@
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
 #include <linux/skbuff.h>
-#include <linux/mca-legacy.h>
 #include <linux/spinlock.h>
 #include <linux/bitops.h>
 #include <linux/jiffies.h>
@@ -227,16 +226,6 @@ static unsigned short start_code[] = {
 /* maps irq number to EtherExpress magic value */
 static char irqrmap[] = { 0,0,1,2,3,4,0,0,0,1,5,6,0,0,0,0 };
 
-#ifdef CONFIG_MCA_LEGACY
-/* mapping of the first four bits of the second POS register */
-static unsigned short mca_iomap[] = {
-	0x270, 0x260, 0x250, 0x240, 0x230, 0x220, 0x210, 0x200,
-	0x370, 0x360, 0x350, 0x340, 0x330, 0x320, 0x310, 0x300
-};
-/* bits 5-7 of the second POS register */
-static char mca_irqmap[] = { 12, 9, 3, 4, 5, 10, 11, 15 };
-#endif
-
 /*
  * Prototypes for Linux interface
  */
@@ -340,53 +329,6 @@ static int __init do_express_probe(struct net_device *dev)
 
 	dev->if_port = 0xff; /* not set */
 
-#ifdef CONFIG_MCA_LEGACY
-	if (MCA_bus) {
-		int slot = 0;
-
-		/*
-		 * Only find one card at a time.  Subsequent calls
-		 * will find others, however, proper multicard MCA
-		 * probing and setup can't be done with the
-		 * old-style Space.c init routines.  -- ASF
-		 */
-		while (slot != MCA_NOTFOUND) {
-			int pos0, pos1;
-
-			slot = mca_find_unused_adapter(0x628B, slot);
-			if (slot == MCA_NOTFOUND)
-				break;
-
-			pos0 = mca_read_stored_pos(slot, 2);
-			pos1 = mca_read_stored_pos(slot, 3);
-			ioaddr = mca_iomap[pos1&0xf];
-
-			dev->irq = mca_irqmap[(pos1>>4)&0x7];
-
-			/*
-			 * XXX: Transceiver selection is done
-			 * differently on the MCA version.
-			 * How to get it to select something
-			 * other than external/AUI is currently
-			 * unknown.  This code is just for looks. -- ASF
-			 */
-			if ((pos0 & 0x7) == 0x1)
-				dev->if_port = AUI;
-			else if ((pos0 & 0x7) == 0x5) {
-				if (pos1 & 0x80)
-					dev->if_port = BNC;
-				else
-					dev->if_port = TPE;
-			}
-
-			mca_set_adapter_name(slot, "Intel EtherExpress 16 MCA");
-			mca_set_adapter_procfn(slot, NULL, dev);
-			mca_mark_as_used(slot);
-
-			break;
-		}
-	}
-#endif
 	if (ioaddr&0xfe00) {
 		if (!request_region(ioaddr, EEXP_IO_EXTENT, "EtherExpress"))
 			return -EBUSY;
diff --git a/drivers/net/ethernet/natsemi/Kconfig b/drivers/net/ethernet/natsemi/Kconfig
index eb836f7..f157334 100644
--- a/drivers/net/ethernet/natsemi/Kconfig
+++ b/drivers/net/ethernet/natsemi/Kconfig
@@ -6,9 +6,8 @@ config NET_VENDOR_NATSEMI
 	bool "National Semi-conductor devices"
 	default y
 	depends on AMIGA_PCMCIA || ARM || EISA || EXPERIMENTAL || H8300 || \
-		   ISA || M32R || MAC || MACH_JAZZ || MACH_TX49XX || MCA || \
-		   MCA_LEGACY || MIPS || PCI || PCMCIA || SUPERH || \
-		   XTENSA_PLATFORM_XT2000 || ZORRO
+		   ISA || M32R || MAC || MACH_JAZZ || MACH_TX49XX || MIPS || \
+		   PCI || PCMCIA || SUPERH || XTENSA_PLATFORM_XT2000 || ZORRO
 	---help---
 	  If you have a network (Ethernet) card belonging to this class, say Y
 	  and read the Ethernet-HOWTO, available from
@@ -21,21 +20,6 @@ config NET_VENDOR_NATSEMI
 
 if NET_VENDOR_NATSEMI
 
-config IBMLANA
-	tristate "IBM LAN Adapter/A support"
-	depends on MCA
-	---help---
-	  This is a Micro Channel Ethernet adapter.  You need to set
-	  CONFIG_MCA to use this driver.  It is both available as an in-kernel
-	  driver and as a module.
-
-	  To compile this driver as a module, choose M here. The only
-	  currently supported card is the IBM LAN Adapter/A for Ethernet.  It
-	  will both support 16K and 32K memory windows, however a 32K window
-	  gives a better security against packet losses.  Usage of multiple
-	  boards with this driver should be possible, but has not been tested
-	  up to now due to lack of hardware.
-
 config MACSONIC
 	tristate "Macintosh SONIC based ethernet (onboard, NuBus, LC, CS)"
 	depends on MAC
diff --git a/drivers/net/ethernet/natsemi/Makefile b/drivers/net/ethernet/natsemi/Makefile
index 9aa5dea..764c532 100644
--- a/drivers/net/ethernet/natsemi/Makefile
+++ b/drivers/net/ethernet/natsemi/Makefile
@@ -2,7 +2,6 @@
 # Makefile for the National Semi-conductor Sonic devices.
 #
 
-obj-$(CONFIG_IBMLANA) += ibmlana.o
 obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MIPS_JAZZ_SONIC) += jazzsonic.o
 obj-$(CONFIG_NATSEMI) += natsemi.o
-- 
1.7.9.1

^ permalink raw reply related

* Re: Stable regression with 'tcp: allow splice() to build full TSO packets'
From: Eric Dumazet @ 2012-05-17 16:47 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev
In-Reply-To: <20120517164016.GL14498@1wt.eu>

On Thu, 2012-05-17 at 18:40 +0200, Willy Tarreau wrote:
> On Thu, May 17, 2012 at 06:33:24PM +0200, Eric Dumazet wrote:
> > On Thu, 2012-05-17 at 17:56 +0200, Willy Tarreau wrote:
> > > On Thu, May 17, 2012 at 05:43:00PM +0200, Eric Dumazet wrote:
> > 
> > > It's the NIC included in the system-on-chip (Marvell 88F6281), and the NIC
> > > driver is mv643xx. It's the same hardware you find in sheevaplugs, guruplugs,
> > > dreamplugs and almost all ARM-based cheap NAS boxes.
> > > 
> > > > With commit 1d0c0b328a6 in net-next
> > > > (net: makes skb_splice_bits() aware of skb->head_frag)
> > > > You'll be able to get even more speed, if NIC uses frag to hold frame.
> > > 
> > > I'm going to check this now, sounds interesting :-)
> > 
> > splice(socket -> pipe) does a copy of frame from skb->head to a page
> > fragment.
> > 
> > With latest patches (net-next), we can provide a frag for skb->head and
> > avoid this copy in splice(). You would have a true zero copy
> > socket->socket mode.
> 
> That's what I've read in your commit messages. Indeed, we've been waiting
> for this to happen for a long time. It should bring a real benefit on small
> devices like the one I'm playing with which have very low memory bandwidth,
> because it's a shame to copy the data and thrash the cache when we don't
> need to see them.
> 
> I'm currently trying to get mainline to boot then I'll switch to net-next.
> 
> > tg3 drivers uses this new thing, mv643xx_eth.c could be changed as well.
> 
> I'll try to port your patch (8d4057 tg3: provide frags as skb head) to
> mv643xx_eth as an exercise. If I succeed and notice an improvement, I'll
> send a patch :-)

In fact I could post a generic patch, to make netdev_alloc_skb() do this
automatically for all network drivers.

Its probably less than 30 lines of code.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox