Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] tcp: fix crashes in do_tcp_sendpages()
From: Joe Perches @ 2012-12-02  1:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <1354403222.20109.539.camel@edumazet-glaptop>

On Sat, 2012-12-01 at 15:07 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Recent network changes allowed high order pages being used
> for skb fragments.
> 
> This uncovered a bug in do_tcp_sendpages() which was assuming its caller
> provided an array of order-0 page pointers.
> 
> We only have to deal with a single page in this function, and its order
> is irrelevant.
[]
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
[]
> @@ -830,8 +830,8 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
>  	return mss_now;
>  }
>  
> -static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
> -			 size_t psize, int flags)
> +static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> +				size_t size, int flags)

Shouldn't this now be named do_tcp_sendpage ?

^ permalink raw reply

* Re: [PATCH net 1/1] 8139cp: fix coherent mapping leak in error path.
From: David Miller @ 2012-12-02  1:41 UTC (permalink / raw)
  To: romieu; +Cc: jgarzik, dwmw2, jasowang, netdev, jogreene
In-Reply-To: <20121201230850.GA10935@electric-eye.fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Sun, 2 Dec 2012 00:08:50 +0100

> cp_open
> [...]
>         rc = cp_alloc_rings(cp);
>         if (rc)
>                 return rc;
> 
> cp_alloc_rings
> [...]
>         mem = dma_alloc_coherent(&cp->pdev->dev, CP_RING_BYTES,
>                                  &cp->ring_dma, GFP_KERNEL);
> 
> - cp_alloc_rings never frees the coherent mapping it allocates
> - neither do cp_open when cp_alloc_rings fails
> 
> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] tcp: fix crashes in do_tcp_sendpages()
From: David Miller @ 2012-12-02  1:41 UTC (permalink / raw)
  To: eric.dumazet; +Cc: w, netdev
In-Reply-To: <1354403222.20109.539.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 01 Dec 2012 15:07:02 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> Recent network changes allowed high order pages being used
> for skb fragments.
> 
> This uncovered a bug in do_tcp_sendpages() which was assuming its caller
> provided an array of order-0 page pointers.
> 
> We only have to deal with a single page in this function, and its order
> is irrelevant.
> 
> Reported-by: Willy Tarreau <w@1wt.eu>
> Tested-by: Willy Tarreau <w@1wt.eu>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply

* [PATCH] vhost-blk: Add vhost-blk support v6
From: Asias He @ 2012-12-02  1:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jens Axboe, kvm, netdev, linux-kernel, virtualization,
	Christoph Hellwig, David S. Miller

vhost-blk is an in-kernel virito-blk device accelerator.

Due to lack of proper in-kernel AIO interface, this version converts
guest's I/O request to bio and use submit_bio() to submit I/O directly.
So this version any supports raw block device as guest's disk image,
e.g. /dev/sda, /dev/ram0. We can add file based image support to
vhost-blk once we have in-kernel AIO interface. There are some work in
progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown:

   http://marc.info/?l=linux-fsdevel&m=133312234313122

Performance evaluation:
-----------------------------
LKVM: Fio with libaio ioengine on 1 Fusion IO device
IOPS(k)    Before    After    Improvement
seq-read   107       121      +13.0%
seq-write  130       179      +37.6%
rnd-read   102       122      +19.6%
rnd-write  125       159      +27.0%

QEMU: Fio with libaio ioengine on 1 Fusion IO device
IOPS(k)    Before    After    Improvement
seq-read   76        123      +61.8%
seq-write  139       173      +24.4%
rnd-read   73        120      +64.3%
rnd-write  75        156      +108.0%

QEMU: Fio with libaio ioengine on 1 Ramdisk device
IOPS(k)    Before    After    Improvement
seq-read   138       437      +216%
seq-write  191       436      +128%
rnd-read   137       426      +210%
rnd-write  140       415      +196%

QEMU: Fio with libaio ioengine on 8 Ramdisk device
50% read + 50% write
IOPS(k)    Before    After    Improvement
randrw     64/64     189/189  +195%/+195%

Userspace bits:
-----------------------------
1) LKVM
The latest vhost-blk userspace bits for kvm tool can be found here:
git@github.com:asias/linux-kvm.git blk.vhost-blk

2) QEMU
The latest vhost-blk userspace prototype for QEMU can be found here:
git@github.com:asias/qemu.git blk.vhost-blk

Changes in v6:
- Use inline req_page_list to reduce kmalloc
- Switch to single thread model, thanks mst!
- Wait until requests fired before vhost_blk_flush to be finished

Changes in v5:
- Do not assume the buffer layout
- Fix wakeup race

Changes in v4:
- Mark req->status as userspace pointer
- Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status()
- Add if (need_resched()) schedule() in blk thread
- Kill vhost_blk_stop_vq() and move it into vhost_blk_stop()
- Use vq_err() instead of pr_warn()
- Fail un Unsupported request
- Add flush in vhost_blk_set_features()

Changes in v3:
- Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph!
- Check file passed by user is a raw block device file

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/vhost/Kconfig     |   1 +
 drivers/vhost/Kconfig.blk |  10 +
 drivers/vhost/Makefile    |   2 +
 drivers/vhost/blk.c       | 724 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/blk.h       |   8 +
 5 files changed, 745 insertions(+)
 create mode 100644 drivers/vhost/Kconfig.blk
 create mode 100644 drivers/vhost/blk.c
 create mode 100644 drivers/vhost/blk.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 202bba6..acd8038 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -11,4 +11,5 @@ config VHOST_NET
 
 if STAGING
 source "drivers/vhost/Kconfig.tcm"
+source "drivers/vhost/Kconfig.blk"
 endif
diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk
new file mode 100644
index 0000000..ff8ab76
--- /dev/null
+++ b/drivers/vhost/Kconfig.blk
@@ -0,0 +1,10 @@
+config VHOST_BLK
+	tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)"
+	depends on BLOCK &&  EXPERIMENTAL && m
+	---help---
+	  This kernel module can be loaded in host kernel to accelerate
+	  guest block with virtio_blk. Not to be confused with virtio_blk
+	  module itself which needs to be loaded in guest kernel.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost_blk.
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index a27b053..1a8a4a5 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
 
 obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o
+obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
+vhost_blk-y := blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index 0000000..e4ca4b6
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,724 @@
+/*
+ * Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan <tailai.ly@taobao.com>
+ *
+ * Copyright (C) 2012 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * virtio-blk server in host kernel.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/vhost.h>
+#include <linux/virtio_blk.h>
+#include <linux/mutex.h>
+#include <linux/file.h>
+#include <linux/kthread.h>
+#include <linux/blkdev.h>
+#include <linux/llist.h>
+
+#include "vhost.c"
+#include "vhost.h"
+#include "blk.h"
+
+static DEFINE_IDA(vhost_blk_index_ida);
+
+enum {
+	VHOST_BLK_VQ_REQ = 0,
+	VHOST_BLK_VQ_MAX = 1,
+};
+
+struct req_page_list {
+	struct page **pages;
+	int pages_nr;
+};
+
+#define NR_INLINE 16
+
+struct vhost_blk_req {
+	struct req_page_list inline_pl[NR_INLINE];
+	struct page *inline_page[NR_INLINE];
+	struct bio *inline_bio[NR_INLINE];
+	struct req_page_list *pl;
+	int during_flush;
+	bool use_inline;
+
+	struct llist_node llnode;
+
+	struct vhost_blk *blk;
+
+	struct iovec *iov;
+	int iov_nr;
+
+	struct bio **bio;
+	atomic_t bio_nr;
+
+	struct iovec status[1];
+
+	sector_t sector;
+	int write;
+	u16 head;
+	long len;
+};
+
+struct vhost_blk {
+	wait_queue_head_t flush_wait;
+	struct iovec iov[UIO_MAXIOV];
+	struct vhost_blk_req *reqs;
+	struct vhost_virtqueue vq;
+	struct llist_head llhead;
+	atomic_t req_inflight[2];
+	struct vhost_work work;
+	spinlock_t flush_lock;
+	struct vhost_dev dev;
+	int during_flush;
+	u16 reqs_nr;
+	int index;
+};
+
+static int move_iovec(struct iovec *from, struct iovec *to,
+		      size_t len, int iov_count)
+{
+	int seg = 0;
+	size_t size;
+
+	while (len && seg < iov_count) {
+		if (from->iov_len == 0) {
+			++from;
+			continue;
+		}
+		size = min(from->iov_len, len);
+		to->iov_base = from->iov_base;
+		to->iov_len = size;
+		from->iov_len -= size;
+		from->iov_base += size;
+		len -= size;
+		++from;
+		++to;
+		++seg;
+	}
+	return seg;
+}
+
+static inline int iov_num_pages(struct iovec *iov)
+{
+	return (PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) -
+	       ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
+}
+
+static inline int vhost_blk_set_status(struct vhost_blk_req *req, u8 status)
+{
+	struct vhost_blk *blk = req->blk;
+	int ret;
+
+	ret = memcpy_toiovecend(req->status, &status, 0, sizeof(status));
+
+	if (ret) {
+		vq_err(&blk->vq, "Failed to write status\n");
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static void vhost_blk_req_done(struct bio *bio, int err)
+{
+	struct vhost_blk_req *req = bio->bi_private;
+	struct vhost_blk *blk = req->blk;
+
+	if (err)
+		req->len = err;
+
+	if (atomic_dec_and_test(&req->bio_nr)) {
+		llist_add(&req->llnode, &blk->llhead);
+		vhost_work_queue(&blk->dev, &blk->work);
+	}
+
+	bio_put(bio);
+}
+
+static void vhost_blk_req_umap(struct vhost_blk_req *req)
+{
+	struct req_page_list *pl;
+	int i, j;
+
+	if (req->pl) {
+		for (i = 0; i < req->iov_nr; i++) {
+			pl = &req->pl[i];
+			for (j = 0; j < pl->pages_nr; j++) {
+				if (!req->write)
+					set_page_dirty_lock(pl->pages[j]);
+				page_cache_release(pl->pages[j]);
+			}
+		}
+	}
+
+	if (!req->use_inline)
+		kfree(req->pl);
+}
+
+static int vhost_blk_bio_make(struct vhost_blk_req *req,
+			      struct block_device *bdev)
+{
+	int pages_nr_total, i, j, ret;
+	struct iovec *iov = req->iov;
+	int iov_nr = req->iov_nr;
+	struct page **pages, *page;
+	struct bio *bio = NULL;
+	int bio_nr = 0;
+	void *buf;
+
+	pages_nr_total = 0;
+	for (i = 0; i < iov_nr; i++)
+		pages_nr_total += iov_num_pages(&iov[i]);
+
+	if (unlikely(req->write == WRITE_FLUSH)) {
+		req->use_inline = true;
+		req->pl = NULL;
+		req->bio = req->inline_bio;
+
+		bio = bio_alloc(GFP_KERNEL, 1);
+		if (!bio)
+			return -ENOMEM;
+
+		bio->bi_sector  = req->sector;
+		bio->bi_bdev    = bdev;
+		bio->bi_private = req;
+		bio->bi_end_io  = vhost_blk_req_done;
+		req->bio[bio_nr++] = bio;
+
+		goto out;
+	}
+
+	if (pages_nr_total > NR_INLINE) {
+		int pl_len, page_len, bio_len;
+
+		req->use_inline = false;
+		pl_len = iov_nr * sizeof(req->pl[0]);
+		page_len = pages_nr_total * sizeof(struct page *);
+		bio_len = pages_nr_total * sizeof(struct bio *);
+
+		buf = kmalloc(pl_len + page_len + bio_len, GFP_KERNEL);
+		if (!buf)
+			return -ENOMEM;
+
+		req->pl	= buf;
+		pages = buf + pl_len;
+		req->bio = buf + pl_len + page_len;
+	} else {
+		req->use_inline = true;
+		req->pl = req->inline_pl;
+		pages = req->inline_page;
+		req->bio = req->inline_bio;
+	}
+
+	req->iov_nr = 0;
+	for (i = 0; i < iov_nr; i++) {
+		int pages_nr = iov_num_pages(&iov[i]);
+		unsigned long iov_base, iov_len;
+		struct req_page_list *pl;
+
+		iov_base = (unsigned long)iov[i].iov_base;
+		iov_len  = (unsigned long)iov[i].iov_len;
+
+		/* TODO: Limit the total number of pages pinned */
+		ret = get_user_pages_fast(iov_base, pages_nr,
+					  !req->write, pages);
+		if (ret != pages_nr)
+			goto fail;
+
+		req->iov_nr++;
+		pl = &req->pl[i];
+		pl->pages_nr = pages_nr;
+		pl->pages = pages;
+
+		for (j = 0; j < pages_nr; j++) {
+			unsigned int off, len;
+			page = pages[j];
+			off = iov_base & ~PAGE_MASK;
+			len = PAGE_SIZE - off;
+			if (len > iov_len)
+				len = iov_len;
+
+			while (!bio || bio_add_page(bio, page, len, off) <= 0) {
+				bio = bio_alloc(GFP_KERNEL, pages_nr_total);
+				if (!bio)
+					goto fail;
+				bio->bi_sector  = req->sector;
+				bio->bi_bdev    = bdev;
+				bio->bi_private = req;
+				bio->bi_end_io  = vhost_blk_req_done;
+				req->bio[bio_nr++] = bio;
+			}
+			req->sector	+= len >> 9;
+			iov_base	+= len;
+			iov_len		-= len;
+		}
+
+		pages += pages_nr;
+	}
+out:
+	atomic_set(&req->bio_nr, bio_nr);
+	return 0;
+
+fail:
+	for (i = 0; i < bio_nr; i++)
+		bio_put(req->bio[i]);
+	vhost_blk_req_umap(req);
+	return -ENOMEM;
+}
+
+static inline void vhost_blk_bio_send(struct vhost_blk_req *req)
+{
+	struct blk_plug plug;
+	int i, bio_nr;
+
+	bio_nr = atomic_read(&req->bio_nr);
+	blk_start_plug(&plug);
+	for (i = 0; i < bio_nr; i++)
+		submit_bio(req->write, req->bio[i]);
+	blk_finish_plug(&plug);
+}
+
+static int vhost_blk_req_submit(struct vhost_blk_req *req, struct file *file)
+{
+
+	struct inode *inode = file->f_mapping->host;
+	struct block_device *bdev = inode->i_bdev;
+	int ret;
+
+	ret = vhost_blk_bio_make(req, bdev);
+	if (ret < 0)
+		return ret;
+
+	vhost_blk_bio_send(req);
+
+	spin_lock(&req->blk->flush_lock);
+	req->during_flush = req->blk->during_flush;
+	atomic_inc(&req->blk->req_inflight[req->during_flush]);
+	spin_unlock(&req->blk->flush_lock);
+
+	return ret;
+}
+
+static int vhost_blk_req_handle(struct vhost_virtqueue *vq,
+				struct virtio_blk_outhdr *hdr,
+				u16 head, u16 out, u16 in,
+				struct file *file)
+{
+	struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+	unsigned char id[VIRTIO_BLK_ID_BYTES];
+	struct vhost_blk_req *req;
+	int ret, len;
+	u8 status;
+
+	req		= &blk->reqs[head];
+	req->head	= head;
+	req->blk	= blk;
+	req->sector	= hdr->sector;
+	req->iov	= blk->iov;
+
+	req->len	= iov_length(vq->iov, out + in) - sizeof(status);
+	req->iov_nr	= move_iovec(vq->iov, req->iov, req->len, out + in);
+
+	move_iovec(vq->iov, req->status, sizeof(status), out + in);
+
+	switch (hdr->type) {
+	case VIRTIO_BLK_T_OUT:
+		req->write = WRITE;
+		ret = vhost_blk_req_submit(req, file);
+		break;
+	case VIRTIO_BLK_T_IN:
+		req->write = READ;
+		ret = vhost_blk_req_submit(req, file);
+		break;
+	case VIRTIO_BLK_T_FLUSH:
+		req->write = WRITE_FLUSH;
+		ret = vhost_blk_req_submit(req, file);
+		break;
+	case VIRTIO_BLK_T_GET_ID:
+		ret = snprintf(id, VIRTIO_BLK_ID_BYTES,
+			       "vhost-blk%d", blk->index);
+		if (ret < 0)
+			break;
+		len = ret;
+		ret = memcpy_toiovecend(req->iov, id, 0, len);
+		status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+		ret = vhost_blk_set_status(req, status);
+		if (ret)
+			break;
+		vhost_add_used_and_signal(&blk->dev, vq, head, len);
+		break;
+	default:
+		vq_err(vq, "Unsupported request type %d\n", hdr->type);
+		status = VIRTIO_BLK_S_UNSUPP;
+		ret = vhost_blk_set_status(req, status);
+		if (ret)
+			break;
+		vhost_add_used_and_signal(&blk->dev, vq, head, 0);
+	}
+
+	return ret;
+}
+
+/* Guest kick us for I/O submit */
+static void vhost_blk_handle_guest_kick(struct vhost_work *work)
+{
+	struct virtio_blk_outhdr hdr;
+	struct vhost_virtqueue *vq;
+	struct vhost_blk *blk;
+	int in, out, ret;
+	struct file *f;
+	u16 head;
+
+	vq = container_of(work, struct vhost_virtqueue, poll.work);
+	blk = container_of(vq->dev, struct vhost_blk, dev);
+
+	/* TODO: check that we are running from vhost_worker? */
+	f = rcu_dereference_check(vq->private_data, 1);
+	if (!f)
+		return;
+
+	vhost_disable_notify(&blk->dev, vq);
+	for (;;) {
+		head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
+					 ARRAY_SIZE(vq->iov),
+					 &out, &in, NULL, NULL);
+		if (unlikely(head < 0))
+			break;
+
+		if (unlikely(head == vq->num)) {
+			if (unlikely(vhost_enable_notify(&blk->dev, vq))) {
+				vhost_disable_notify(&blk->dev, vq);
+				continue;
+			}
+			break;
+		}
+		move_iovec(vq->iov, vq->hdr, sizeof(hdr), out);
+		ret = memcpy_fromiovecend((unsigned char *)&hdr, vq->hdr, 0,
+					   sizeof(hdr));
+		if (ret < 0) {
+			vq_err(vq, "Failed to get block header!\n");
+			vhost_discard_vq_desc(vq, 1);
+			break;
+		}
+
+		if (vhost_blk_req_handle(vq, &hdr, head, out, in, f) < 0)
+			break;
+
+		if (!llist_empty(&blk->llhead)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+/* Host kick us for I/O completion */
+static void vhost_blk_handle_host_kick(struct vhost_work *work)
+{
+
+	struct vhost_virtqueue *vq;
+	struct vhost_blk_req *req;
+	struct llist_node *llnode;
+	struct vhost_blk *blk;
+	bool added, zero;
+	u8 status;
+	int ret;
+
+	blk = container_of(work, struct vhost_blk, work);
+	vq = &blk->vq;
+
+	llnode = llist_del_all(&blk->llhead);
+	added = false;
+	while (llnode) {
+		req = llist_entry(llnode, struct vhost_blk_req, llnode);
+		llnode = llist_next(llnode);
+
+		vhost_blk_req_umap(req);
+
+		status = req->len > 0 ?  VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+		ret = vhost_blk_set_status(req, status);
+		if (unlikely(ret))
+			continue;
+		vhost_add_used(&blk->vq, req->head, req->len);
+		added = true;
+
+		spin_lock(&req->blk->flush_lock);
+		zero = atomic_dec_and_test(
+				&req->blk->req_inflight[req->during_flush]);
+		if (zero && !req->during_flush)
+			wake_up(&blk->flush_wait);
+		spin_unlock(&req->blk->flush_lock);
+
+	}
+	if (likely(added))
+		vhost_signal(&blk->dev, &blk->vq);
+}
+
+static void vhost_blk_flush(struct vhost_blk *blk)
+{
+	spin_lock(&blk->flush_lock);
+	blk->during_flush = 1;
+	spin_unlock(&blk->flush_lock);
+
+	vhost_poll_flush(&blk->vq.poll);
+	vhost_work_flush(&blk->dev, &blk->work);
+	/*
+	 * Wait until requests fired before the flush to be finished
+	 * req_inflight[0] is used to track the requests fired before the flush
+	 * req_inflight[1] is used to track the requests fired during the flush
+	 */
+	wait_event(blk->flush_wait, !atomic_read(&blk->req_inflight[0]));
+
+	spin_lock(&blk->flush_lock);
+	blk->during_flush = 0;
+	spin_unlock(&blk->flush_lock);
+}
+
+static inline void vhost_blk_stop(struct vhost_blk *blk, struct file **file)
+{
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct file *f;
+
+	mutex_lock(&vq->mutex);
+	f = rcu_dereference_protected(vq->private_data,
+				      lockdep_is_held(&vq->mutex));
+	rcu_assign_pointer(vq->private_data, NULL);
+	mutex_unlock(&vq->mutex);
+
+	*file = f;
+}
+
+static int vhost_blk_open(struct inode *inode, struct file *file)
+{
+	struct vhost_blk *blk;
+	int ret;
+
+	blk = kzalloc(sizeof(*blk), GFP_KERNEL);
+	if (!blk) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = ida_simple_get(&vhost_blk_index_ida, 0, 0, GFP_KERNEL);
+	if (ret < 0)
+		goto out_dev;
+	blk->index = ret;
+
+	blk->vq.handle_kick = vhost_blk_handle_guest_kick;
+	atomic_set(&blk->req_inflight[0], 0);
+	atomic_set(&blk->req_inflight[1], 0);
+	blk->during_flush = 0;
+	spin_lock_init(&blk->flush_lock);
+	init_waitqueue_head(&blk->flush_wait);
+
+	ret = vhost_dev_init(&blk->dev, &blk->vq, VHOST_BLK_VQ_MAX);
+	if (ret < 0)
+		goto out_dev;
+	file->private_data = blk;
+
+	vhost_work_init(&blk->work, vhost_blk_handle_host_kick);
+
+	return ret;
+out_dev:
+	kfree(blk);
+out:
+	return ret;
+}
+
+static int vhost_blk_release(struct inode *inode, struct file *f)
+{
+	struct vhost_blk *blk = f->private_data;
+	struct file *file;
+
+	ida_simple_remove(&vhost_blk_index_ida, blk->index);
+	vhost_blk_stop(blk, &file);
+	vhost_blk_flush(blk);
+	vhost_dev_cleanup(&blk->dev, false);
+	if (file)
+		fput(file);
+	kfree(blk->reqs);
+	kfree(blk);
+
+	return 0;
+}
+
+static int vhost_blk_set_features(struct vhost_blk *blk, u64 features)
+{
+	mutex_lock(&blk->dev.mutex);
+	blk->dev.acked_features = features;
+	vhost_blk_flush(blk);
+	mutex_unlock(&blk->dev.mutex);
+
+	return 0;
+}
+
+static long vhost_blk_set_backend(struct vhost_blk *blk, unsigned index, int fd)
+{
+	struct vhost_virtqueue *vq = &blk->vq;
+	struct file *file, *oldfile;
+	struct inode *inode;
+	int ret;
+
+	mutex_lock(&blk->dev.mutex);
+	ret = vhost_dev_check_owner(&blk->dev);
+	if (ret)
+		goto out_dev;
+
+	if (index >= VHOST_BLK_VQ_MAX) {
+		ret = -ENOBUFS;
+		goto out_dev;
+	}
+
+	mutex_lock(&vq->mutex);
+
+	if (!vhost_vq_access_ok(vq)) {
+		ret = -EFAULT;
+		goto out_vq;
+	}
+
+	file = fget(fd);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out_vq;
+	}
+
+	/* Only raw block device is supported for now */
+	inode = file->f_mapping->host;
+	if (!S_ISBLK(inode->i_mode)) {
+		ret = -EFAULT;
+		goto out_file;
+	}
+
+	oldfile = rcu_dereference_protected(vq->private_data,
+			lockdep_is_held(&vq->mutex));
+	if (file != oldfile) {
+		rcu_assign_pointer(vq->private_data, file);
+
+		ret = vhost_init_used(vq);
+		if (ret)
+			goto out_file;
+	}
+
+	mutex_unlock(&vq->mutex);
+
+	if (oldfile) {
+		vhost_blk_flush(blk);
+		fput(oldfile);
+	}
+
+	mutex_unlock(&blk->dev.mutex);
+	return 0;
+
+out_file:
+	fput(file);
+out_vq:
+	mutex_unlock(&vq->mutex);
+out_dev:
+	mutex_unlock(&blk->dev.mutex);
+	return ret;
+}
+
+static long vhost_blk_reset_owner(struct vhost_blk *blk)
+{
+	struct file *file = NULL;
+	int err;
+
+	mutex_lock(&blk->dev.mutex);
+	err = vhost_dev_check_owner(&blk->dev);
+	if (err)
+		goto done;
+	vhost_blk_stop(blk, &file);
+	vhost_blk_flush(blk);
+	err = vhost_dev_reset_owner(&blk->dev);
+done:
+	mutex_unlock(&blk->dev.mutex);
+	if (file)
+		fput(file);
+	return err;
+}
+
+static int vhost_blk_setup(struct vhost_blk *blk)
+{
+	blk->reqs_nr = blk->vq.num;
+
+	blk->reqs = kmalloc(sizeof(struct vhost_blk_req) * blk->reqs_nr,
+			    GFP_KERNEL);
+	if (!blk->reqs)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
+			    unsigned long arg)
+{
+	struct vhost_blk *blk = f->private_data;
+	void __user *argp = (void __user *)arg;
+	struct vhost_vring_file backend;
+	u64 __user *featurep = argp;
+	u64 features;
+	int ret;
+
+	switch (ioctl) {
+	case VHOST_BLK_SET_BACKEND:
+		if (copy_from_user(&backend, argp, sizeof(backend)))
+			return -EFAULT;
+		return vhost_blk_set_backend(blk, backend.index, backend.fd);
+	case VHOST_GET_FEATURES:
+		features = VHOST_BLK_FEATURES;
+		if (copy_to_user(featurep, &features, sizeof(features)))
+			return -EFAULT;
+		return 0;
+	case VHOST_SET_FEATURES:
+		if (copy_from_user(&features, featurep, sizeof(features)))
+			return -EFAULT;
+		if (features & ~VHOST_BLK_FEATURES)
+			return -EOPNOTSUPP;
+		return vhost_blk_set_features(blk, features);
+	case VHOST_RESET_OWNER:
+		return vhost_blk_reset_owner(blk);
+	default:
+		mutex_lock(&blk->dev.mutex);
+		ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
+		if (!ret && ioctl == VHOST_SET_VRING_NUM)
+			ret = vhost_blk_setup(blk);
+		vhost_blk_flush(blk);
+		mutex_unlock(&blk->dev.mutex);
+		return ret;
+	}
+}
+
+static const struct file_operations vhost_blk_fops = {
+	.owner          = THIS_MODULE,
+	.open           = vhost_blk_open,
+	.release        = vhost_blk_release,
+	.llseek		= noop_llseek,
+	.unlocked_ioctl = vhost_blk_ioctl,
+};
+
+static struct miscdevice vhost_blk_misc = {
+	MISC_DYNAMIC_MINOR,
+	"vhost-blk",
+	&vhost_blk_fops,
+};
+
+static int vhost_blk_init(void)
+{
+	return misc_register(&vhost_blk_misc);
+}
+
+static void vhost_blk_exit(void)
+{
+	misc_deregister(&vhost_blk_misc);
+}
+
+module_init(vhost_blk_init);
+module_exit(vhost_blk_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio_blk");
diff --git a/drivers/vhost/blk.h b/drivers/vhost/blk.h
new file mode 100644
index 0000000..2f674f0
--- /dev/null
+++ b/drivers/vhost/blk.h
@@ -0,0 +1,8 @@
+#include <linux/vhost.h>
+
+enum {
+	VHOST_BLK_FEATURES = (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+			     (1ULL << VIRTIO_RING_F_EVENT_IDX),
+};
+/* VHOST_BLK specific defines */
+#define VHOST_BLK_SET_BACKEND _IOW(VHOST_VIRTIO, 0x50, struct vhost_vring_file)
-- 
1.7.11.7

^ permalink raw reply related

* Re: TCP and reordering
From: David Woodhouse @ 2012-12-02  1:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benjamin LaHaise, Vijay Subramanian, David Miller, saku,
	rick.jones2, netdev
In-Reply-To: <1354122996.14302.427.camel@edumazet-glaptop>

[-- Attachment #1: Type: text/plain, Size: 2958 bytes --]

On Wed, 2012-11-28 at 09:16 -0800, Eric Dumazet wrote:
> On Wed, 2012-11-28 at 16:41 +0000, David Woodhouse wrote:
> 
> > Another fun issue with tunnelling protocols and BQL... packets tend to
> > *grow* as they get encapsulated. So you might end up calling
> > netdev_sent_queue() with a given size, then netdev_completed_queue()
> > with a bigger packet later...
> 
> Its the driver responsibility to maintain the coherent 'bytes' value for
> each transmitted/completed packet.
> 
> If a driver calls an external entity, it cannot possibly use BQL, unless
> doing an approximation (bytes becomes a fixed value)
> 
> BQL was really something to control/limit queueing on ethernet links,
> not for stacked devices, as stacked devices normally have no queue.

Oh, of *course*... this is why my kernel would panic if I attempted to
add BQL to BR2684. The ATM low-level driver was pushing a header onto
the skb, and I ended up calling netdev_completed_queue() with a larger
'bytes' value than the one I'd called netdev_sent_queue() with. Which
leads to a BUG(), which immediately results in a panic. A moderately
suboptimal failure mode, when a nasty warning and disabling BQL on this
interface might have been nicer.

However, *perhaps* it isn't so hard to get a consistent 'bytes' value.
This version appears to work... can we use something along these lines
in the general case? What if skb_is_nonlinear()?

This probably doesn't work for L2TP where it's going to be passed down
the whole stack and get a new network header and everything. But for
BR2684 is this at least a salvageable approach?

--- net/atm/br2684.c~	2012-12-01 16:35:49.000000000 +0000
+++ net/atm/br2684.c	2012-12-02 01:18:35.216607088 +0000
@@ -180,6 +180,11 @@ static struct notifier_block atm_dev_not
 	.notifier_call = atm_dev_event,
 };
 
+static unsigned int skb_acct_len(struct sk_buff *skb)
+{
+	return skb_tail_pointer(skb) - skb_network_header(skb);
+}
+
 /* chained vcc->pop function.  Check if we should wake the netif_queue */
 static void br2684_pop(struct atm_vcc *vcc, struct sk_buff *skb)
 {
@@ -191,6 +196,8 @@ static void br2684_pop(struct atm_vcc *v
 	/* If the queue space just went up from zero, wake */
 	if (atomic_inc_return(&brvcc->qspace) == 1)
 		netif_wake_queue(brvcc->device);
+
+	netdev_completed_queue(brvcc->device, 1, skb_acct_len(skb));
 }
 
 /*
@@ -265,6 +272,7 @@ static int br2684_xmit_vcc(struct sk_buf
 			netif_wake_queue(brvcc->device);
 	}
 
+	netdev_sent_queue(brvcc->device, skb_acct_len(skb));
 	/* If this fails immediately, the skb will be freed and br2684_pop()
 	   will wake the queue if appropriate. Just return an error so that
 	   the stats are updated correctly */
@@ -710,6 +718,7 @@ static int br2684_create(void __user *ar
 		return err;
 	}
 
+	netdev_reset_queue(netdev);
 	write_lock_irq(&devs_lock);
 
 	brdev->payload = payload;




-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply

* [PATCH] workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay
From: Tejun Heo @ 2012-12-02  1:11 UTC (permalink / raw)
  To: Anders Kaseorg
  Cc: Herbert Xu, John W. Linville, netdev, linux-wireless,
	linux-kernel, Zlatko Calusic, Joonsoo Kim
In-Reply-To: <alpine.DEB.2.00.1212011852100.25064@dr-wily.mit.edu>

>From 8852aac25e79e38cc6529f20298eed154f60b574 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Sat, 1 Dec 2012 16:23:42 -0800

8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()")
implemented mod_delayed_work[_on]() using the improved
try_to_grab_pending().  The function is later used, among others, to
replace [__]candel_delayed_work() + queue_delayed_work() combinations.

Unfortunately, a delayed_work item w/ zero @delay is handled slightly
differently by mod_delayed_work_on() compared to
queue_delayed_work_on().  The latter skips timer altogether and
directly queues it using queue_work_on() while the former schedules
timer which will expire on the closest tick.  This means, when @delay
is zero, that [__]cancel_delayed_work() + queue_delayed_work_on()
makes the target item immediately executable while
mod_delayed_work_on() may induce delay of upto a full tick.

This somewhat subtle difference breaks some of the converted users.
e.g. block queue plugging uses delayed_work for deferred processing
and uses mod_delayed_work_on() when the queue needs to be immediately
unplugged.  The above problem manifested as noticeably higher number
of context switches under certain circumstances.

The difference in behavior was caused by missing special case handling
for 0 delay in mod_delayed_work_on() compared to
queue_delayed_work_on().  Joonsoo Kim posted a patch to add it -
("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1].
The patch was queued for 3.8 but it was described as optimization and
I missed that it was a correctness issue.

As both queue_delayed_work_on() and mod_delayed_work_on() use
__queue_delayed_work() for queueing, it seems that the better approach
is to move the 0 delay special handling to the function instead of
duplicating it in mod_delayed_work_on().

Fix the problem by moving 0 delay special case handling from
queue_delayed_work_on() to __queue_delayed_work().  This replaces
Joonsoo's patch.

[1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU>
Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu>
LKML-Reference: <50A78AA9.5040904@iskon.hr>
Cc: Joonsoo Kim <js1304@gmail.com>
---
Applied to wq/for-3.7-fixes.  Pull request sent.

Thanks!

 kernel/workqueue.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ac25db1..084aa47 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1364,6 +1364,17 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 	BUG_ON(timer_pending(timer));
 	BUG_ON(!list_empty(&work->entry));
 
+	/*
+	 * If @delay is 0, queue @dwork->work immediately.  This is for
+	 * both optimization and correctness.  The earliest @timer can
+	 * expire is on the closest next tick and delayed_work users depend
+	 * on that there's no such delay when @delay is 0.
+	 */
+	if (!delay) {
+		__queue_work(cpu, wq, &dwork->work);
+		return;
+	}
+
 	timer_stats_timer_set_start_info(&dwork->timer);
 
 	/*
@@ -1417,9 +1428,6 @@ bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	bool ret = false;
 	unsigned long flags;
 
-	if (!delay)
-		return queue_work_on(cpu, wq, &dwork->work);
-
 	/* read the comment in __queue_work() */
 	local_irq_save(flags);
 
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH 00/17] ATM fixes for pppoatm/br2684
From: David Woodhouse @ 2012-12-02  0:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, chas, krzysiek
In-Reply-To: <1354383226.21562.351.camel@shinybook.infradead.org>

[-- Attachment #1: Type: text/plain, Size: 1676 bytes --]

On Sat, 2012-12-01 at 17:33 +0000, David Woodhouse wrote:
> 
> Very glad I added the BUILD_BUG_ON on the cb struct size now. Perhaps
> there should be a generic helper for that? Something like
>  skb_cb_cast(struct foo_cb, skb) could do it automatically...?

Something like this, perhaps? Using skb_cast_cb() would then make it
fairly much impossible to accidentally overflow the size of the skb cb.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2af494..b4cb2cf 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -567,6 +567,11 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 	return (struct rtable *)skb_dst(skb);
 }
 
+#define skb_cast_cb(skb, type)  ({				\
+	BUILD_BUG_ON(sizeof(type) > sizeof((skb)->cb));		\
+	(type *)(skb)->cb;					\
+	})
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void skb_tx_error(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index 6619a8a..8dccf82 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -97,7 +97,7 @@ struct solos_skb_cb {
 };
 
 
-#define SKB_CB(skb)		((struct solos_skb_cb *)skb->cb)
+#define SKB_CB(skb)	skb_cast_cb(skb, struct solos_skb_cb)
 
 #define PKT_DATA	0
 #define PKT_COMMAND	1
@@ -1324,8 +1324,6 @@ static struct pci_driver fpga_driver = {
 
 static int __init solos_pci_init(void)
 {
-	BUILD_BUG_ON(sizeof(struct solos_skb_cb) > sizeof(((struct sk_buff *)0)->cb));
-
 	printk(KERN_INFO "Solos PCI Driver Version %s\n", VERSION);
 	return pci_register_driver(&fpga_driver);
 }


-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply related

* Re: [PATCH 00/17] ATM fixes for pppoatm/br2684
From: David Woodhouse @ 2012-12-02  0:35 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, chas, krzysiek
In-Reply-To: <20121201.114440.331740099591161757.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 2447 bytes --]

On Sat, 2012-12-01 at 11:44 -0500, David Miller wrote:
> 
> > drivers/atm/solos-pci.c: In function ‘solos_pci_init’:
> > drivers/atm/solos-pci.c:1329:2: error: size of unnamed array is
> negative
> 
> It's from adding the completion to the solos skb cb, you can't do
> that.  It won't fit on 64-bit when all debugging kconfig options are
> enabled.

Thanks for catching that. I've just posted a [v2] version of the
offending patch, which no longer puts a completion into the skb cb.

I'd appreciate a slightly more clueful eye looking over the incremental
patch (below) just to confirm that the new method is correct, but it
certainly seems to work. This version is identical to the one I posted
earlier, except that I use dev_kfree_skb() in the pclose() function
instead of dev_kfree_skb_any(). We know this will be called from a
suitable context, and it even uses GFP_KERNEL a few lines higher up.

If that's OK, please pull the resulting tree from
	git://git.infradead.org/users/dwmw2/atm.git


diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index e59bcfd..6619a8a 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -92,7 +92,6 @@ struct pkt_hdr {
 };
 
 struct solos_skb_cb {
-	struct completion c;
 	struct atm_vcc *vcc;
 	uint32_t dma_addr;
 };
@@ -853,13 +852,14 @@ static void pclose(struct atm_vcc *vcc)
 	header->vci = cpu_to_le16(vcc->vci);
 	header->type = cpu_to_le16(PKT_PCLOSE);
 
-	init_completion(&SKB_CB(skb)->c);
-
+	skb_get(skb);
 	fpga_queue(card, port, skb, NULL);
 
-	if (!wait_for_completion_timeout(&SKB_CB(skb)->c, 5 * HZ))
-		dev_warn(&card->dev->dev, "Timeout waiting for VCC close on port %d\n",
-			 port);
+	if (!wait_event_timeout(card->param_wq, !skb_shared(skb), 5 * HZ))
+		dev_warn(&card->dev->dev,
+			 "Timeout waiting for VCC close on port %d\n", port);
+
+	dev_kfree_skb(skb);
 
 	/* Hold up vcc_destroy_socket() (our caller) until solos_bh() in the
 	   tasklet has finished processing any incoming packets (and, more to
@@ -990,10 +990,8 @@ static uint32_t fpga_tx(struct solos_card *card)
 				atomic_inc(&vcc->stats->tx);
 				solos_pop(vcc, oldskb);
 			} else {
-				struct pkt_hdr *header = (void *)oldskb->data;
-				if (le16_to_cpu(header->type) == PKT_PCLOSE)
-					complete(&SKB_CB(oldskb)->c);
 				dev_kfree_skb_irq(oldskb);
+				wake_up(&card->param_wq);
 			}
 		}
 	}

-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply related

* [PATCH v2 07/17] solos-pci: Wait for pending TX to complete when releasing vcc
From: David Woodhouse @ 2012-12-02  0:17 UTC (permalink / raw)
  To: netdev; +Cc: chas, krzysiek
In-Reply-To: <1354235736-26833-8-git-send-email-dwmw2@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 2566 bytes --]

We should no longer be calling the old pop routine for the vcc, after
vcc_release() has completed. Make sure we wait for any pending TX skbs
to complete, by waiting for our own PKT_PCLOSE control skb to be sent.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
---
v2: Don't put a struct completion into skb->cb[]. It doesn't fit, so
instead use the existing card->param_wq, hold a refcount on the skb, and
wait for the fpga_tx code to free it.

I won't repost the rest of the series; they're materially unchanged
apart from a tiny amount of massaging to make them apply with this
change, so it would just be noise.

 drivers/atm/solos-pci.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index 9851093..026bdc1 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -866,6 +866,7 @@ static int popen(struct atm_vcc *vcc)
 static void pclose(struct atm_vcc *vcc)
 {
 	struct solos_card *card = vcc->dev->dev_data;
+	unsigned char port = SOLOS_CHAN(vcc->dev);
 	struct sk_buff *skb;
 	struct pkt_hdr *header;
 
@@ -881,11 +882,18 @@ static void pclose(struct atm_vcc *vcc)
 	header->vci = cpu_to_le16(vcc->vci);
 	header->type = cpu_to_le16(PKT_PCLOSE);
 
-	fpga_queue(card, SOLOS_CHAN(vcc->dev), skb, NULL);
+	skb_get(skb);
+	fpga_queue(card, port, skb, NULL);
 
 	clear_bit(ATM_VF_ADDR, &vcc->flags);
 	clear_bit(ATM_VF_READY, &vcc->flags);
 
+	if (!wait_event_timeout(card->param_wq, !skb_shared(skb), 5 * HZ))
+		dev_warn(&card->dev->dev,
+			 "Timeout waiting for VCC close on port %d\n", port);
+
+	dev_kfree_skb(skb);
+
 	/* Hold up vcc_destroy_socket() (our caller) until solos_bh() in the
 	   tasklet has finished processing any incoming packets (and, more to
 	   the point, using the vcc pointer). */
@@ -1011,9 +1019,10 @@ static uint32_t fpga_tx(struct solos_card *card)
 			if (vcc) {
 				atomic_inc(&vcc->stats->tx);
 				solos_pop(vcc, oldskb);
-			} else
+			} else {
 				dev_kfree_skb_irq(oldskb);
-
+				wake_up(&card->param_wq);
+			}
 		}
 	}
 	/* For non-DMA TX, write the 'TX start' bit for all four ports simultaneously */
@@ -1345,6 +1354,8 @@ static struct pci_driver fpga_driver = {
 
 static int __init solos_pci_init(void)
 {
+	BUILD_BUG_ON(sizeof(struct solos_skb_cb) > sizeof(((struct sk_buff *)0)->cb));
+
 	printk(KERN_INFO "Solos PCI Driver Version %s\n", VERSION);
 	return pci_register_driver(&fpga_driver);
 }
-- 
1.8.0


-- 
dwmw2


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]

^ permalink raw reply related

* Re: Wireless regression in workqueue: use mod_delayed_work() instead of __cancel + queue
From: Anders Kaseorg @ 2012-12-01 23:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Herbert Xu, John W. Linville, netdev, linux-wireless,
	linux-kernel
In-Reply-To: <20121201143926.GB2685@htj.dyndns.org>

On Sat, 1 Dec 2012, Tejun Heo wrote:
> Can you please test this one too?  Thanks!
> 
> […]
> +	if (!delay) {
> +		__queue_work(cpu, wq, &dwork->work);
> +		return;
> +	}
> +
> […]
> -	if (!delay)
> -		return queue_work_on(cpu, wq, &dwork->work);
> -

Yes, this one fixes the bug too (on v3.7.0-rc7).

Anders

^ permalink raw reply

* [PATCH net 1/1] 8139cp: fix coherent mapping leak in error path.
From: Francois Romieu @ 2012-12-01 23:08 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: David Miller, dwmw2, jasowang, netdev, John Greene

cp_open
[...]
        rc = cp_alloc_rings(cp);
        if (rc)
                return rc;

cp_alloc_rings
[...]
        mem = dma_alloc_coherent(&cp->pdev->dev, CP_RING_BYTES,
                                 &cp->ring_dma, GFP_KERNEL);

- cp_alloc_rings never frees the coherent mapping it allocates
- neither do cp_open when cp_alloc_rings fails

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/realtek/8139cp.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index b01f83a..609125a 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -1060,17 +1060,22 @@ static int cp_init_rings (struct cp_private *cp)
 
 static int cp_alloc_rings (struct cp_private *cp)
 {
+	struct device *d = &cp->pdev->dev;
 	void *mem;
+	int rc;
 
-	mem = dma_alloc_coherent(&cp->pdev->dev, CP_RING_BYTES,
-				 &cp->ring_dma, GFP_KERNEL);
+	mem = dma_alloc_coherent(d, CP_RING_BYTES, &cp->ring_dma, GFP_KERNEL);
 	if (!mem)
 		return -ENOMEM;
 
 	cp->rx_ring = mem;
 	cp->tx_ring = &cp->rx_ring[CP_RX_RING_SIZE];
 
-	return cp_init_rings(cp);
+	rc = cp_init_rings(cp);
+	if (rc < 0)
+		dma_free_coherent(d, CP_RING_BYTES, cp->rx_ring, cp->ring_dma);
+
+	return rc;
 }
 
 static void cp_clean_rings (struct cp_private *cp)
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net] tcp: fix crashes in do_tcp_sendpages()
From: Eric Dumazet @ 2012-12-01 23:07 UTC (permalink / raw)
  To: Willy Tarreau, David Miller; +Cc: netdev
In-Reply-To: <20121201224044.GL25450@1wt.eu>

From: Eric Dumazet <edumazet@google.com>

Recent network changes allowed high order pages being used
for skb fragments.

This uncovered a bug in do_tcp_sendpages() which was assuming its caller
provided an array of order-0 page pointers.

We only have to deal with a single page in this function, and its order
is irrelevant.

Reported-by: Willy Tarreau <w@1wt.eu>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp.c |   15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 083092e..e457c7a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -830,8 +830,8 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
-			 size_t psize, int flags)
+static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+				size_t size, int flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int mss_now, size_goal;
@@ -858,12 +858,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 		goto out_err;
 
-	while (psize > 0) {
+	while (size > 0) {
 		struct sk_buff *skb = tcp_write_queue_tail(sk);
-		struct page *page = pages[poffset / PAGE_SIZE];
 		int copy, i;
-		int offset = poffset % PAGE_SIZE;
-		int size = min_t(size_t, psize, PAGE_SIZE - offset);
 		bool can_coalesce;
 
 		if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
@@ -912,8 +909,8 @@ new_segment:
 			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
 
 		copied += copy;
-		poffset += copy;
-		if (!(psize -= copy))
+		offset += copy;
+		if (!(size -= copy))
 			goto out;
 
 		if (skb->len < size_goal || (flags & MSG_OOB))
@@ -960,7 +957,7 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 					flags);
 
 	lock_sock(sk);
-	res = do_tcp_sendpages(sk, &page, offset, size, flags);
+	res = do_tcp_sendpages(sk, page, offset, size, flags);
 	release_sock(sk);
 	return res;
 }

^ permalink raw reply related

* Re: GRO + splice panics in 3.7.0-rc5
From: Willy Tarreau @ 2012-12-01 22:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1354401121.20109.531.camel@edumazet-glaptop>

On Sat, Dec 01, 2012 at 02:32:01PM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-01 at 13:47 -0800, Eric Dumazet wrote:
> 
> > Thanks a lot Willy
> > 
> > I believe do_tcp_sendpages() needs a fix, I'll send a patch asap 
> > 
> 
> Could you try the following patch ?
> 
> do_tcp_sendpages() looks really wrong, as only one page is provided by
> the caller.

Excellent Eric, it's rock solid now both with the reproducer and with
haproxy!  Feel free to add my Tested-By if you want.

I really think we should feed this to Linus quickly before he releases 3.7,
as I'm realizing that it would really not be nice to have a user-triggerable
crash in a release :-/

Thanks !
Willy

^ permalink raw reply

* Re: GRO + splice panics in 3.7.0-rc5
From: Eric Dumazet @ 2012-12-01 22:32 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev
In-Reply-To: <1354398458.20109.528.camel@edumazet-glaptop>

On Sat, 2012-12-01 at 13:47 -0800, Eric Dumazet wrote:

> Thanks a lot Willy
> 
> I believe do_tcp_sendpages() needs a fix, I'll send a patch asap 
> 

Could you try the following patch ?

do_tcp_sendpages() looks really wrong, as only one page is provided by
the caller.

Thanks !

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e6eace1..6976dba 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -831,8 +831,8 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
-			 size_t psize, int flags)
+static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+				size_t size, int flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int mss_now, size_goal;
@@ -859,12 +859,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 		goto out_err;
 
-	while (psize > 0) {
+	while (size > 0) {
 		struct sk_buff *skb = tcp_write_queue_tail(sk);
-		struct page *page = pages[poffset / PAGE_SIZE];
 		int copy, i;
-		int offset = poffset % PAGE_SIZE;
-		int size = min_t(size_t, psize, PAGE_SIZE - offset);
 		bool can_coalesce;
 
 		if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
@@ -913,8 +910,8 @@ new_segment:
 			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
 
 		copied += copy;
-		poffset += copy;
-		if (!(psize -= copy))
+		offset += copy;
+		if (!(size -= copy))
 			goto out;
 
 		if (skb->len < size_goal || (flags & MSG_OOB))
@@ -961,7 +958,7 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 					flags);
 
 	lock_sock(sk);
-	res = do_tcp_sendpages(sk, &page, offset, size, flags);
+	res = do_tcp_sendpages(sk, page, offset, size, flags);
 	release_sock(sk);
 	return res;
 }

^ permalink raw reply related

* Re: [RFT PATCH] 8139cp: properly support change of MTU values
From: Francois Romieu @ 2012-12-01 21:49 UTC (permalink / raw)
  To: John Greene; +Cc: netdev, David S. Miller
In-Reply-To: <1354308680-9236-1-git-send-email-jogreene@redhat.com>

John Greene <jogreene@redhat.com> :
[...]
> Testing: has been test on virtual 8139cp setup without issue,
> have no access real hardware 8139cp, need testing help.

You should Cc: David Woodhouse <dwmw2@infradead.org> for testing.

[...]
> diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
> index 6cb96b4..7847c83 100644
> --- a/drivers/net/ethernet/realtek/8139cp.c
> +++ b/drivers/net/ethernet/realtek/8139cp.c
[...]
> @@ -1244,22 +1241,11 @@ static int cp_change_mtu(struct net_device *dev, int new_mtu)
>  		return 0;
>  	}
>  
> -	spin_lock_irqsave(&cp->lock, flags);
> -
> -	cp_stop_hw(cp);			/* stop h/w and free rings */
> -	cp_clean_rings(cp);
> -
> +	/* network IS up, close it, reset MTU, and come up again. */
> +	cp_close(dev);
>  	dev->mtu = new_mtu;
> -	cp_set_rxbufsize(cp);		/* set new rx buf size */
> -
> -	rc = cp_init_rings(cp);		/* realloc and restart h/w */
> -	cp_start_hw(cp);
> -
> -	spin_unlock_irqrestore(&cp->lock, flags);
> -
> -	return rc;
> +	return cp_open(dev);

netif_running is true since cp_close() is not running from dev_close.
Bad things will happen if cp_open fails while cp_interrupt was waiting
for the lock.

-- 
Ueimor

^ permalink raw reply

* Re: GRO + splice panics in 3.7.0-rc5
From: Eric Dumazet @ 2012-12-01 21:47 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: netdev
In-Reply-To: <20121201205227.GA28390@1wt.eu>

On Sat, 2012-12-01 at 21:52 +0100, Willy Tarreau wrote:
> OK now I have a simple and reliable reproducer for the bug. It
> just creates 2 TCP sockets connected to each other and splices
> from one to the other. It looks 100% reliable, I'm attaching it
> hoping it can help, it's much easier and more reliable than
> starting haproxy with clients and servers!
> 
> I'm noting that it only crashes when I send more than one page
> of data. 4096 bytes and below are OK, 4097 and above do crash.
> 
> I have added a few printk() and it dies in get_page() called from
> do_tcp_sendpages() :
> 
>   888                  i = skb_shinfo(skb)->nr_frags;
>   889  printk(KERN_DEBUG "%d: page=%p i=%d skb=%p offset=%d\n", __LINE__, page, i, skb, offset);
>   890                  can_coalesce = skb_can_coalesce(skb, i, page, offset);
>   891  printk(KERN_DEBUG "%d: can_coalesce=%d\n", __LINE__, can_coalesce);
>   892                  if (!can_coalesce && i >= MAX_SKB_FRAGS) {
>   893  printk(KERN_DEBUG "%d\n", __LINE__);
>   894                          tcp_mark_push(tp, skb);
>   895                          goto new_segment;
>   896                  }
>   897                  if (!sk_wmem_schedule(sk, copy)) {
>   898  printk(KERN_DEBUG "%d\n", __LINE__);
>   899                          goto wait_for_memory;
>   900                  }
>   901
>   902                  if (can_coalesce) {
>   903  printk(KERN_DEBUG "%d\n", __LINE__);
>   904                          skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
>   905                  } else {
>   906  printk(KERN_DEBUG "%d\n", __LINE__);
>   907                          get_page(page);
>   908  printk(KERN_DEBUG "%d\n", __LINE__);
>   909                          skb_fill_page_desc(skb, i, page, offset, copy);
>   910                  }
>   911
>   912  printk(KERN_DEBUG "%d\n", __LINE__);
>   913                  skb->len += copy;
>   914                  skb->data_len += copy;
> 
> dmesg starts with :
> 
>   889: page=f77b9ca0 i=0 skb=f5b91540 offset=0
>   891: can_coalesce=0
>   906
>   908
>   912
>   871
>   875
>   880
>   889: page=f6e7a300 i=0 skb=f46f5c80 offset=2513
>   891: can_coalesce=0
>   906
>   908
>   912
>   889: page=0000fb80 i=1 skb=f46f5c80 offset=0
>   891: can_coalesce=0
>   906
>   BUG: unable to handle kernel paging request at 0000fb80
> IP: [<c13b4878>] tcp_sendpage+0x568/0x6d0
> 
> If that can help, I also noticed this one once with a
> page that could look valid, so the error was propagated
> further (the send() was 64kB) :
> 
>   ...
>   906
>   908
>   912
>   889: page=f6e79f80 i=6 skb=f37a1c80 offset=0
>   891: can_coalesce=0
>   906
>   908
>   912
>   889: page=c13d58a7 i=7 skb=f37a1c80 offset=927
>   891: can_coalesce=0
>   906
>   908
>   912
>   889: page=c13b4310 i=8 skb=f37a1c80 offset=3877
>   891: can_coalesce=0
>   906
>   BUG: unable to handle kernel paging request at 8b31752a
>   IP: [<c1094371>] __get_page_tail+0x31/0x90
>   *pde = 00000000 
> 
> Now I'm really at loss, so do not hesitate to ask me for more
> info, because I don't know where to look.

Thanks a lot Willy

I believe do_tcp_sendpages() needs a fix, I'll send a patch asap 

^ permalink raw reply

* WARNING!!! VIRUS DETECTED AND SPY, UPDATE NOW!!!
From: Webmail System Administrator @ 2012-12-01  4:00 UTC (permalink / raw)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=win-1251, Size: 395 bytes --]

 Your webmail account is corrupted by virus, you are to advise to upgrade.
 You need to upgrade your email limit quota to 10GB within the next 48
hours. Use the below web link to upgrade your valid email account:

https://docs.google.com/a/blumail.org/spreadsheet/viewform?formkey=dHV1ODN5bnkxNDVxODgtb0RMTk9DNlE6MQ

Thank you for using our email.
Copyright ©2012 Email Helpdesk Centre.

^ permalink raw reply

* Re: GRO + splice panics in 3.7.0-rc5
From: Willy Tarreau @ 2012-12-01 20:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <20121201194304.GI25450@1wt.eu>

[-- Attachment #1: Type: text/plain, Size: 2878 bytes --]

OK now I have a simple and reliable reproducer for the bug. It
just creates 2 TCP sockets connected to each other and splices
from one to the other. It looks 100% reliable, I'm attaching it
hoping it can help, it's much easier and more reliable than
starting haproxy with clients and servers!

I'm noting that it only crashes when I send more than one page
of data. 4096 bytes and below are OK, 4097 and above do crash.

I have added a few printk() and it dies in get_page() called from
do_tcp_sendpages() :

  888                  i = skb_shinfo(skb)->nr_frags;
  889  printk(KERN_DEBUG "%d: page=%p i=%d skb=%p offset=%d\n", __LINE__, page, i, skb, offset);
  890                  can_coalesce = skb_can_coalesce(skb, i, page, offset);
  891  printk(KERN_DEBUG "%d: can_coalesce=%d\n", __LINE__, can_coalesce);
  892                  if (!can_coalesce && i >= MAX_SKB_FRAGS) {
  893  printk(KERN_DEBUG "%d\n", __LINE__);
  894                          tcp_mark_push(tp, skb);
  895                          goto new_segment;
  896                  }
  897                  if (!sk_wmem_schedule(sk, copy)) {
  898  printk(KERN_DEBUG "%d\n", __LINE__);
  899                          goto wait_for_memory;
  900                  }
  901
  902                  if (can_coalesce) {
  903  printk(KERN_DEBUG "%d\n", __LINE__);
  904                          skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
  905                  } else {
  906  printk(KERN_DEBUG "%d\n", __LINE__);
  907                          get_page(page);
  908  printk(KERN_DEBUG "%d\n", __LINE__);
  909                          skb_fill_page_desc(skb, i, page, offset, copy);
  910                  }
  911
  912  printk(KERN_DEBUG "%d\n", __LINE__);
  913                  skb->len += copy;
  914                  skb->data_len += copy;

dmesg starts with :

  889: page=f77b9ca0 i=0 skb=f5b91540 offset=0
  891: can_coalesce=0
  906
  908
  912
  871
  875
  880
  889: page=f6e7a300 i=0 skb=f46f5c80 offset=2513
  891: can_coalesce=0
  906
  908
  912
  889: page=0000fb80 i=1 skb=f46f5c80 offset=0
  891: can_coalesce=0
  906
  BUG: unable to handle kernel paging request at 0000fb80
IP: [<c13b4878>] tcp_sendpage+0x568/0x6d0

If that can help, I also noticed this one once with a
page that could look valid, so the error was propagated
further (the send() was 64kB) :

  ...
  906
  908
  912
  889: page=f6e79f80 i=6 skb=f37a1c80 offset=0
  891: can_coalesce=0
  906
  908
  912
  889: page=c13d58a7 i=7 skb=f37a1c80 offset=927
  891: can_coalesce=0
  906
  908
  912
  889: page=c13b4310 i=8 skb=f37a1c80 offset=3877
  891: can_coalesce=0
  906
  BUG: unable to handle kernel paging request at 8b31752a
  IP: [<c1094371>] __get_page_tail+0x31/0x90
  *pde = 00000000 

Now I'm really at loss, so do not hesitate to ask me for more
info, because I don't know where to look.

Cheers,
Willy


[-- Attachment #2: splice-bug.c --]
[-- Type: text/plain, Size: 907 bytes --]

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void fail(const char *e)
{
	perror(e);
	exit(1);
}

int main(int argc, char **argv)
{
	struct sockaddr sa;
	socklen_t sl;
	int p[2];
	int l, s, r;
	char buffer[4097];

	l = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (l < 0)
		fail("L: socket()");

	if (listen(l, 1) < 0)
		fail("L: listen()");

	sl = sizeof(sa);
	if (getsockname(l, &sa, &sl) < 0)
		fail("L: getsockname()");

	if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0)
		fail("S: socket()");

	if (connect(s, &sa, sl) < 0)
		fail("S: connect()");

	sl = sizeof(sa);
	if ((r = accept(l, &sa, &sl)) < 0)
		fail("R: accept()");

	pipe(p);

	send(s, buffer, sizeof(buffer), MSG_DONTWAIT);

	l = splice(r, NULL, p[1], NULL, 1048576, 3);
	splice(p[0], NULL, s, NULL, l, 3);

	exit(0);
}

^ permalink raw reply

* Re: [PATCH] irda: ep7211-sir: Convert to platform_diver
From: Arnd Bergmann @ 2012-12-01 20:50 UTC (permalink / raw)
  To: Alexander Shiyan; +Cc: netdev, Samuel Ortiz
In-Reply-To: <1354360046-4818-1-git-send-email-shc_work@mail.ru>

On Saturday 01 December 2012, Alexander Shiyan wrote:
> This patch converts ep7211-sir driver to platform_driver.
> Since driver can be used not only for EP7211 CPU, function names
> was be renamed to generic clps711x...
> 
> Signed-off-by: Alexander Shiyan <shc_work@mail.ru>

Acked-by: Arnd Bergmann <arnd@arndb.de>

^ permalink raw reply

* [PATCH] net: phy: smsc: force all capable mode if the phy is started in powerdown mode
From: Philippe Reynes @ 2012-12-01 20:44 UTC (permalink / raw)
  To: netdev, linux-kernel, davem
  Cc: otavio, javier, jkosina, eric.jarrige, julien.boibessot,
	thomas.petazzoni, Philippe Reynes

A SMSC PHY in power down mode can't be used.
If a SMSC PHY is in this mode in the config_init
stage, the mode "all capable" is set. So the PHY
could then be used.

Signed-off-by: Philippe Reynes <tremyfr@yahoo.fr>
---
 drivers/net/phy/smsc.c  |   27 ++++++++++++++++++++++++++-
 include/linux/smscphy.h |    5 +++++
 2 files changed, 31 insertions(+), 1 deletions(-)

diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index 88e3991..63018d0 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -43,7 +43,32 @@ static int smsc_phy_ack_interrupt(struct phy_device *phydev)
 
 static int smsc_phy_config_init(struct phy_device *phydev)
 {
-	int rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS);
+	/*
+	 * If the SMSC PHY is in power down mode, then set it
+	 * in all capable mode before using it.
+	 */
+	int rc = phy_read(phydev, MII_LAN83C185_SPECIAL_MODES);
+	if (rc < 0)
+		return rc;
+
+	if ((rc & MII_LAN83C185_MODE_MASK) == MII_LAN83C185_MODE_POWERDOWN) {
+		int timeout = 50000;
+
+		/* set "all capable" mode and reset the phy */
+		rc |= MII_LAN83C185_MODE_ALL;
+		phy_write(phydev, MII_LAN83C185_SPECIAL_MODES, rc);
+		phy_write(phydev, MII_BMCR, BMCR_RESET);
+
+		/* wait end of reset (max 500 ms) */
+		do {
+			udelay(10);
+			if (timeout-- == 0)
+				return -1;
+			rc = phy_read(phydev, MII_BMCR);
+		} while (rc & BMCR_RESET);
+	}
+
+	rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS);
 	if (rc < 0)
 		return rc;
 
diff --git a/include/linux/smscphy.h b/include/linux/smscphy.h
index ce718cb..f4bf16e 100644
--- a/include/linux/smscphy.h
+++ b/include/linux/smscphy.h
@@ -4,6 +4,7 @@
 #define MII_LAN83C185_ISF 29 /* Interrupt Source Flags */
 #define MII_LAN83C185_IM  30 /* Interrupt Mask */
 #define MII_LAN83C185_CTRL_STATUS 17 /* Mode/Status Register */
+#define MII_LAN83C185_SPECIAL_MODES 18 /* Special Modes Register */
 
 #define MII_LAN83C185_ISF_INT1 (1<<1) /* Auto-Negotiation Page Received */
 #define MII_LAN83C185_ISF_INT2 (1<<2) /* Parallel Detection Fault */
@@ -22,4 +23,8 @@
 #define MII_LAN83C185_EDPWRDOWN (1 << 13) /* EDPWRDOWN */
 #define MII_LAN83C185_ENERGYON  (1 << 1)  /* ENERGYON */
 
+#define MII_LAN83C185_MODE_MASK      0xE0
+#define MII_LAN83C185_MODE_POWERDOWN 0xC0 /* Power Down mode */
+#define MII_LAN83C185_MODE_ALL       0xE0 /* All capable mode */
+
 #endif /* __LINUX_SMSCPHY_H__ */
-- 
1.7.4.4

^ permalink raw reply related

* Re: sky2: Correct mistakenly switched read/write sequence
From: Lino Sanfilippo @ 2012-12-01 20:06 UTC (permalink / raw)
  To: David Miller; +Cc: LinoSanfilippo, shemminger, mlindner, netdev
In-Reply-To: <20121201.122701.2065141028797970034.davem@davemloft.net>

On Sat, Dec 01, 2012 at 12:27:01PM -0500, David Miller wrote:
> 
> > In sky2_all_down() the order of the read()/write() access to B0_IMSK seems to
> > be mistakenly switched. The original intention was obviously to avoid PCI write
> > posting.
> > This patch fixes the order.
> > 
> > Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
> 
> I would say that no such intention exists at all.
> 
> The read is there because a long time ago the result as used
> to compute an 'imask' variable.
> 
> Please see commit:
> 
> commit d72ff8fa7f8b344382963721f842256825c4660b

I see. Indeed I was not aware of the history that led to the recent code, so
thanks for pointing that out.

However, since Stephen thinks the patch is useful nevertheless:
@Stephen please feel free to adjust the commit message as you think would make 
the most sense.

Regards,
Lino

^ permalink raw reply

* Re: [PATCH] net: fix possible deadlocks in rtnl_trylock/unlock
From: Simon Wunderlich @ 2012-12-01 20:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r, Simon Wunderlich,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1354391074.20109.527.camel@edumazet-glaptop>

[-- Attachment #1: Type: text/plain, Size: 1707 bytes --]

On Sat, Dec 01, 2012 at 11:44:34AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-01 at 20:04 +0100, Sven Eckelmann wrote:
> > On Saturday 01 December 2012 10:36:12 Eric Dumazet wrote:
> >  
> > > > diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
> > > > index c5c0593..c122782 100644
> > > > --- a/net/bridge/br_sysfs_br.c
> > > > +++ b/net/bridge/br_sysfs_br.c
> > > > @@ -142,7 +142,7 @@ static ssize_t store_stp_state(struct device *d,
> > > > 
> > > >  	if (!rtnl_trylock())
> > > >  	
> > > >  		return restart_syscall();
> > > >  	
> > > >  	br_stp_set_enabled(br, val);
> > > > 
> > > > -	rtnl_unlock();
> > > > +	__rtnl_unlock();
> > > > 
> > > >  	return len;
> > > >  
> > > >  }
> > > 
> > > I have no idea of why you believe there is a problem here.
> > > 
> > > Could you explain how net_todo_list could be not empty ?
> > > 
> > > As long as no device is unregistered between
> > > rtnl_trylock()/rtnl_unlock(), there is no possible deadlock.
> > 
> > I am not sure what "here" means for your. At least batman-adv tries to 
> > unregister a device -> problem [1]. I will not make any judgements about the 
> > other uses in the kernel/other parts patched by Simon.
> > 
> 
> I was reacting to the change in net/bridge/br_sysfs_br.c
> 
> rtnl_trylock() could set a boolean flag to explicitly WARN_ON()
> in case we try to unregister a device.

Well, I'm not sure if this can happen in the bridge code, but from looking at the
code it doesn't appear to be impossible. It would be better to be sure that it can't
deadlock IMHO.

(Although doing unlock/lock/unlock in an unlock function is also a little "uncommon").

Cheers,
	Simon

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH] net: fix possible deadlocks in rtnl_trylock/unlock
From: Sven Eckelmann @ 2012-12-01 19:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, Simon Wunderlich
In-Reply-To: <1354391074.20109.527.camel@edumazet-glaptop>

[-- Attachment #1: Type: text/plain, Size: 798 bytes --]

On Saturday 01 December 2012 11:44:34 Eric Dumazet wrote:
[...]
> > > I have no idea of why you believe there is a problem here.
> > > 
> > > Could you explain how net_todo_list could be not empty ?
> > > 
> > > As long as no device is unregistered between
> > > rtnl_trylock()/rtnl_unlock(), there is no possible deadlock.
> > 
> > I am not sure what "here" means for your. At least batman-adv tries to
> > unregister a device -> problem [1]. I will not make any judgements about
> > the other uses in the kernel/other parts patched by Simon.
> 
> I was reacting to the change in net/bridge/br_sysfs_br.c

Yes, in this context it makes more sense.

> rtnl_trylock() could set a boolean flag to explicitly WARN_ON()
> in case we try to unregister a device.

Sounds interesting.

Kind regards,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: Re: [B.A.T.M.A.N.] [PATCH] net: fix possible deadlocks in rtnl_trylock/unlock
From: Eric Dumazet @ 2012-12-01 19:44 UTC (permalink / raw)
  To: Sven Eckelmann
  Cc: b.a.t.m.a.n, Simon Wunderlich, netdev, davem, Simon Wunderlich
In-Reply-To: <10264267.HX5FJDguj3@sven-laptop.home.narfation.org>

On Sat, 2012-12-01 at 20:04 +0100, Sven Eckelmann wrote:
> On Saturday 01 December 2012 10:36:12 Eric Dumazet wrote:
>  
> > > diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
> > > index c5c0593..c122782 100644
> > > --- a/net/bridge/br_sysfs_br.c
> > > +++ b/net/bridge/br_sysfs_br.c
> > > @@ -142,7 +142,7 @@ static ssize_t store_stp_state(struct device *d,
> > > 
> > >  	if (!rtnl_trylock())
> > >  	
> > >  		return restart_syscall();
> > >  	
> > >  	br_stp_set_enabled(br, val);
> > > 
> > > -	rtnl_unlock();
> > > +	__rtnl_unlock();
> > > 
> > >  	return len;
> > >  
> > >  }
> > 
> > I have no idea of why you believe there is a problem here.
> > 
> > Could you explain how net_todo_list could be not empty ?
> > 
> > As long as no device is unregistered between
> > rtnl_trylock()/rtnl_unlock(), there is no possible deadlock.
> 
> I am not sure what "here" means for your. At least batman-adv tries to 
> unregister a device -> problem [1]. I will not make any judgements about the 
> other uses in the kernel/other parts patched by Simon.
> 

I was reacting to the change in net/bridge/br_sysfs_br.c

rtnl_trylock() could set a boolean flag to explicitly WARN_ON()
in case we try to unregister a device.

^ permalink raw reply

* Re: GRO + splice panics in 3.7.0-rc5
From: Willy Tarreau @ 2012-12-01 19:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1353023344.10798.8.camel@edumazet-glaptop>

Hi Eric,

On Thu, Nov 15, 2012 at 03:49:04PM -0800, Eric Dumazet wrote:
> On Thu, 2012-11-15 at 23:28 +0100, Willy Tarreau wrote:
> > Hello,
> > 
> > I was just about to make a quick comparison between LRO and GRO in
> > 3.7.0-rc5 to see if LRO still had the big advantage I've always observed,
> > but I failed the test because as soon as I enable LRO + splice, the kernel
> > panics and reboots.
> > 
> > I could not yet manage to catch the panic output, I could just reliably
> > reproduce it, it crashes instantly.
> > 
> > All I can say at the moment is the following :
> >   - test consist in forwarding HTTP traffic between two NICs via haproxy
> >   - driver used was myri10ge
> >   - LRO + recv+send : OK
> >   - LRO + splice : OK
> >   - GRO + recv+send : OK
> >   - GRO + splice : panic
> >   - no such problem was observed in 3.6.6 so I think this is a recent
> >     regression.
> > 
> > I'll go back digging for more information, but as I'm used to often see
> > Eric suggest the right candidates for reverting, I wanted to report the
> > issue here in case there are easy ones to try first.
> 
> Hi Willy
> 
> Nothing particular comes to mind, there were a lot of recent changes
> that could trigger this kind of bug.
> 
> A stack trace would be useful of course ;)

First, sorry for the long delay, I would really have liked to debug
this earlier.

I could finally bisect the regression to this commit :

  69b08f62e17439ee3d436faf0b9a7ca6fffb78db is the first bad commit
  commit 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
  Author: Eric Dumazet <edumazet@google.com>
  Date:   Wed Sep 26 06:46:57 2012 +0000

    net: use bigger pages in __netdev_alloc_frag

The regression is still present in latest git (7c17e486), and goes away
if I revert this patch.

I get this trace both on my 64-bit 10Gig machines with a myri10ge NIC,
and on my 32-bit eeepc with an atl1c NIC at 100 Mbps, so this is not
related to the driver nor the architecture.

The trace looks like this on my eeepc, it crashes in tcp_sendpage()
(more precisely in do_tcp_sendpages() which was inlined in the former) :

BUG: unable to handle kernel NULL pointer dereference at 00000b50
IP: [<c13fdf9c>] tcp_sendpage+0x28c/0x5d0
*pde = 00000000 
Oops: 0000 [#1] SMP 
Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss uvcvideo videobuf2_core videobuf2_vmalloc videobuf2_memops ath9k mac80211 snd_hda_codec_realtek microcode ath9k_common psmouse ath9k_hw serio_raw uhci_hcd ath sg cfg80211 snd_hda_intel lpc_ich ehci_hcd mfd_core snd_hda_codec atl1c snd_hwdep rtc_cmos snd_pcm snd_timer snd snd_page_alloc eeepc_laptop sparse_keymap evdev
Pid: 2880, comm: haproxy Not tainted 3.7.0-rc7-eeepc #1 ASUSTeK Computer INC. 1005HA/1005HA
EIP: 0060:[<c13fdf9c>] EFLAGS: 00210246 CPU: 0
EIP is at tcp_sendpage+0x28c/0x5d0
EAX: c16a4138 EBX: f5202500 ECX: 00000b50 EDX: f5a72f68
ESI: f5333cc0 EDI: 000005a8 EBP: f55b7e34 ESP: f55b7df0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 00000b50 CR3: 35556000 CR4: 000007d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process haproxy (pid: 2880, ti=f55b6000 task=f62060c0 task.ti=f55b6000)
Stack:
 00000000 00000002 00000000 00000582 02000000 f52025a0 000005a8 00000002
 00000b50 00001582 00000000 f6eb3900 00000b50 00000000 f5202500 c13fdd10
 00000040 f55b7e5c c14201c7 000005a8 00000040 f55b7ecc f6eb3900 f5fc1680
Call Trace:
 [<c13fdd10>] ? sk_stream_alloc_skb+0xf0/0xf0
 [<c14201c7>] inet_sendpage+0x57/0xc0
 [<c1420170>] ? inet_dgram_connect+0x80/0x80
 [<c13b6d99>] kernel_sendpage+0x29/0x50
 [<c13b6de8>] sock_sendpage+0x28/0x30
 [<c13b6dc0>] ? kernel_sendpage+0x50/0x50
 [<c10e55b5>] pipe_to_sendpage+0x65/0x80
 [<c10e5647>] splice_from_pipe_feed+0x77/0x150
 [<c10e5550>] ? splice_from_pipe_begin+0x10/0x10
 [<c10e5a84>] __splice_from_pipe+0x54/0x60
 [<c10e5550>] ? splice_from_pipe_begin+0x10/0x10
 [<c10e7190>] splice_from_pipe+0x50/0x70
 [<c10e7200>] ? default_file_splice_write+0x50/0x50
 [<c10e7222>] generic_splice_sendpage+0x22/0x30
 [<c10e5550>] ? splice_from_pipe_begin+0x10/0x10
 [<c10e5af6>] do_splice_from+0x66/0x90
 [<c10e787f>] sys_splice+0x42f/0x480
 [<c14671cc>] syscall_call+0x7/0xb
 [<c1460000>] ? blk_iopoll_cpu_notify+0x2c/0x5d
Code: 7d 08 0f 47 7d 08 39 f8 0f 4e f8 8b 43 1c 8b 40 60 85 c0 74 09 39 7b 64 0f 8c 21 01 00 00 80 7d ce 00 0f 85 ef 00 00 00 8b 4d dc <8b> 01 f6 c4 80 0f 85 0e 03 00 00 8b 45 dc f0 ff 40 10 8b 55 d8
EIP: [<c13fdf9c>] tcp_sendpage+0x28c/0x5d0 SS:ESP 0068:f55b7df0
CR2: 0000000000000b50
---[ end trace a07522433caf9655 ]---

In order to reproduce it, I'm just running haproxy with TCP splicing
enabled. It simply forwards data from one TCP socket to another one.

In case this matters, here's what the sequence looks like when it works
(just relevant syscalls). fd 7 = client retrieving data, fd 8 = server
providing data :

  socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 8
  fcntl64(8, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
  setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  connect(8, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("10.8.3.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
  send(8, "GET /?s=100k HTTP/1.1\r\nConnection"..., 120, MSG_DONTWAIT|MSG_NOSIGNAL) = 120
  pipe([9, 10])                           = 0

  splice(0x8, 0, 0xa, 0, 0x40000000, 0x3) = 5792
  splice(0x8, 0, 0xa, 0, 0x40000000, 0x3) = 1448
  splice(0x8, 0, 0xa, 0, 0x40000000, 0x3) = -1 EAGAIN (Resource temporarily unavailable)
  splice(0x9, 0, 0x7, 0, 0x1c48, 0x3)     = 7240

And the splice() block above repeats over and over with varying sizes.

It's worth mentionning that the issue does not happend when recv/send are
used instead of splice(). Changing the pipe size using F_SETPIPE_SZ does
not change the behaviour either (was using 1 MB when it first appeared and
thought it was the cause).

I managed to catch the oops with strace running on the process. It crashed
on the first splice() call on the sending side and confirms the stack trace :

  pipe([9, 10])                           = 0
  splice(0x8, 0, 0xa, 0, 0x40000000, 0x3) = 10136
  splice(0x9, 0, 0x7, 0, 0x2798, 0x3)     ---> never completes, oops here.

Are there any additional information I can provide to help debug this ?

Seeing what the patch does, I don't think it's the direct responsible
but that it triggers an existing bug somewhere else. However I don't
know how to try to trigger the same bug without the patch.

The corresponding disassembled code looks like this :

    1ee3:       0f 47 7d 08             cmova  0x8(%ebp),%edi
    1ee7:       39 f8                   cmp    %edi,%eax
    1ee9:       0f 4e f8                cmovle %eax,%edi
    1eec:       8b 43 1c                mov    0x1c(%ebx),%eax
    1eef:       8b 40 60                mov    0x60(%eax),%eax
    1ef2:       85 c0                   test   %eax,%eax
    1ef4:       74 09                   je     1eff <tcp_sendpage+0x27f>
    1ef6:       39 7b 64                cmp    %edi,0x64(%ebx)
    1ef9:       0f 8c 21 01 00 00       jl     2020 <tcp_sendpage+0x3a0>
    1eff:       80 7d ce 00             cmpb   $0x0,-0x32(%ebp)
    1f03:       0f 85 ef 00 00 00       jne    1ff8 <tcp_sendpage+0x378>
    1f09:       8b 4d dc                mov    -0x24(%ebp),%ecx
--> 1f0c:       8b 01                   mov    (%ecx),%eax
    1f0e:       f6 c4 80                test   $0x80,%ah
    1f11:       0f 85 0b 03 00 00       jne    2222 <tcp_sendpage+0x5a2>
    1f17:       8b 45 dc                mov    -0x24(%ebp),%eax
    1f1a:       f0 ff 40 10             lock incl 0x10(%eax)
    1f1e:       8b 55 d8                mov    -0x28(%ebp),%edx
    1f21:       8b 4d dc                mov    -0x24(%ebp),%ecx
    1f24:       8d 04 d5 20 00 00 00    lea    0x20(,%edx,8),%eax
    1f2b:       03 86 90 00 00 00       add    0x90(%esi),%eax

This seems to be at the beginning of this block, at new_segment :

   866                  int size = min_t(size_t, psize, PAGE_SIZE - offset);
   867                  bool can_coalesce;
   868
   869                  if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
   870  new_segment:
   871                          if (!sk_stream_memory_free(sk))
   872                                  goto wait_for_sndbuf;
   873

I'm trying to add some debug code.

Regards,
Willy

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox