* [PATCH] ipv6: Fix default route failover when CONFIG_IPV6_ROUTER_PREF=n
From: Paul Marks @ 2012-12-02 10:00 UTC (permalink / raw)
To: davem, yoshfuji; +Cc: netdev, Paul Marks
I believe this commit from 2008 was incorrect:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=398bcbebb6f721ac308df1e3d658c0029bb74503
When CONFIG_IPV6_ROUTER_PREF is disabled, the kernel should follow
RFC4861 section 6.3.6: if no route is NUD_VALID, then traffic should be
sprayed across all routers (indirectly triggering NUD) until one of them
becomes NUD_VALID.
However, the following experiment demonstrates that this does not work:
1) Connect to an IPv6 network.
2) Change the router's MAC (and link-local) address.
The kernel will lock onto the first router and never try the new one, even
if the first becomes unreachable. This patch fixes the problem by
allowing rt6_check_neigh() to return 0; if all routers return 0, then
rt6_select() will fall back to round-robin behavior.
This patch should have no effect when CONFIG_IPV6_ROUTER_PREF=y.
Note that rt6_check_neigh() is only used in a boolean context, so the
presence of both "m = 1" and "m = 2" was irrelevant and confusing.
Signed-off-by: Paul Marks <pmarks@google.com>
---
net/ipv6/route.c | 13 +++++--------
1 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b1e6cf0..60b2995 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -490,7 +490,7 @@ static inline int rt6_check_dev(struct rt6_info *rt, int oif)
static inline int rt6_check_neigh(struct rt6_info *rt)
{
struct neighbour *neigh;
- int m;
+ int m = 0;
neigh = rt->n;
if (rt->rt6i_flags & RTF_NONEXTHOP ||
@@ -499,16 +499,13 @@ static inline int rt6_check_neigh(struct rt6_info *rt)
else if (neigh) {
read_lock_bh(&neigh->lock);
if (neigh->nud_state & NUD_VALID)
- m = 2;
+ m = 1;
#ifdef CONFIG_IPV6_ROUTER_PREF
- else if (neigh->nud_state & NUD_FAILED)
- m = 0;
-#endif
- else
+ else if (!(neigh->nud_state & NUD_FAILED))
m = 1;
+#endif
read_unlock_bh(&neigh->lock);
- } else
- m = 0;
+ }
return m;
}
--
1.7.7.3
^ permalink raw reply related
* Re: TCP and reordering
From: David Woodhouse @ 2012-12-02 9:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Benjamin LaHaise, Vijay Subramanian, David Miller, saku,
rick.jones2, netdev
In-Reply-To: <1354416005.20109.557.camel@edumazet-glaptop>
[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]
On Sat, 2012-12-01 at 18:40 -0800, Eric Dumazet wrote:
> > +static unsigned int skb_acct_len(struct sk_buff *skb)
> > +{
> > + return skb_tail_pointer(skb) - skb_network_header(skb);
> > +}
>
> Apparently this driver doesnt add any feature at alloc_netdev() time, so
> it might work. (no frags in any skb)
Well... as long as no driver appends stuff to the *tail* of the skb. As
long as they only *prepend* headers for the device to use, we're fine.
But I think that's fairly safe, for now.
For PPP I think I'll end up keeping a tuple of
{ skb, orig_len, orig_destructor, orig_sk }
for each skb which is in-flight. Since the queues of those should be
*short*, and should generally complete in-order, that shouldn't be too
much overhead. It'll only need a small static array of them per-channel.
(Before complaining about a *static* array holding metadata about items
on a queue, stop thinking of the PPP channel being a queue and start
thinking of it more like a ring buffer.)
--
dwmw2
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]
^ permalink raw reply
* Re: [PATCH 00/17] ATM fixes for pppoatm/br2684
From: David Woodhouse @ 2012-12-02 8:14 UTC (permalink / raw)
To: David Miller; +Cc: netdev, chas, krzysiek
In-Reply-To: <20121201.204906.1703696018528746748.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 803 bytes --]
On Sat, 2012-12-01 at 20:49 -0500, David Miller wrote:
>
> I actually prefer what we do now, which is do the BUILD_BUG_ON()
> once in the subsystem specific code, usually the initializer.
>
> It's part of creating a new SKB cb, adding that assertion somewhere.
Where it's *subsystem* code that's great... but *drivers* get to use the
skb cb too. And driver authors aren't always so reliable :)
Looking just at solos-pci, there was no feedback during the initial
submission that I ought to be using a BUILD_BUG_ON to limit the size of
struct solos_skb_cb. It was just luck, really, that I remembered to do
it — as an afterthought in a later iteration of the patch.
Giving driver authors fewer ways to shoot themselves in the foot always
seems like a good idea...
--
dwmw2
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]
^ permalink raw reply
* WEBMAIL ACCOUNT UPDATE NOW!!!
From: Webmail System Administrator @ 2012-12-01 11:36 UTC (permalink / raw)
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=win-1251, Size: 395 bytes --]
Your webmail account is corrupted by virus, you are to advise to upgrade.
You need to upgrade your email limit quota to 10GB within the next 48
hours. Use the below web link to upgrade your valid email account:
https://docs.google.com/a/blumail.org/spreadsheet/viewform?formkey=dHV1ODN5bnkxNDVxODgtb0RMTk9DNlE6MQ
Thank you for using our email.
Copyright ©2012 Email Helpdesk Centre.
^ permalink raw reply
* Re: GRO + splice panics in 3.7.0-rc5
From: Eric Dumazet @ 2012-12-02 2:58 UTC (permalink / raw)
To: Willy Tarreau; +Cc: netdev
In-Reply-To: <20121201224044.GL25450@1wt.eu>
On Sat, 2012-12-01 at 23:40 +0100, Willy Tarreau wrote:
> Excellent Eric, it's rock solid now both with the reproducer and with
> haproxy! Feel free to add my Tested-By if you want.
You did a very good work, thanks again Willy.
^ permalink raw reply
* [PATCH 1/6] bna: remove useless calls to memset().
From: Cyril Roelandt @ 2012-12-02 2:40 UTC (permalink / raw)
To: linux-kernel; +Cc: kernel-janitors, Cyril Roelandt, rmody, netdev
In-Reply-To: <1354416022-20189-1-git-send-email-tipecaml@gmail.com>
These calls are followed by calls to memcpy() on the same memory area, so they
can safely be removed.
Signed-off-by: Cyril Roelandt <tipecaml@gmail.com>
---
drivers/net/ethernet/brocade/bna/bfa_ioc.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/net/ethernet/brocade/bna/bfa_ioc.c b/drivers/net/ethernet/brocade/bna/bfa_ioc.c
index 959c58e..3227fdd 100644
--- a/drivers/net/ethernet/brocade/bna/bfa_ioc.c
+++ b/drivers/net/ethernet/brocade/bna/bfa_ioc.c
@@ -2273,7 +2273,6 @@ bfa_ioc_get_type(struct bfa_ioc *ioc)
static void
bfa_ioc_get_adapter_serial_num(struct bfa_ioc *ioc, char *serial_num)
{
- memset(serial_num, 0, BFA_ADAPTER_SERIAL_NUM_LEN);
memcpy(serial_num,
(void *)ioc->attr->brcd_serialnum,
BFA_ADAPTER_SERIAL_NUM_LEN);
@@ -2282,7 +2281,6 @@ bfa_ioc_get_adapter_serial_num(struct bfa_ioc *ioc, char *serial_num)
static void
bfa_ioc_get_adapter_fw_ver(struct bfa_ioc *ioc, char *fw_ver)
{
- memset(fw_ver, 0, BFA_VERSION_LEN);
memcpy(fw_ver, ioc->attr->fw_version, BFA_VERSION_LEN);
}
@@ -2304,7 +2302,6 @@ bfa_ioc_get_pci_chip_rev(struct bfa_ioc *ioc, char *chip_rev)
static void
bfa_ioc_get_adapter_optrom_ver(struct bfa_ioc *ioc, char *optrom_ver)
{
- memset(optrom_ver, 0, BFA_VERSION_LEN);
memcpy(optrom_ver, ioc->attr->optrom_version,
BFA_VERSION_LEN);
}
@@ -2312,7 +2309,6 @@ bfa_ioc_get_adapter_optrom_ver(struct bfa_ioc *ioc, char *optrom_ver)
static void
bfa_ioc_get_adapter_manufacturer(struct bfa_ioc *ioc, char *manufacturer)
{
- memset(manufacturer, 0, BFA_ADAPTER_MFG_NAME_LEN);
memcpy(manufacturer, BFA_MFG_NAME, BFA_ADAPTER_MFG_NAME_LEN);
}
--
1.7.10.4
^ permalink raw reply related
* Re: TCP and reordering
From: Eric Dumazet @ 2012-12-02 2:40 UTC (permalink / raw)
To: David Woodhouse
Cc: Benjamin LaHaise, Vijay Subramanian, David Miller, saku,
rick.jones2, netdev
In-Reply-To: <1354411840.21562.373.camel@shinybook.infradead.org>
On Sun, 2012-12-02 at 01:30 +0000, David Woodhouse wrote:
> Oh, of *course*... this is why my kernel would panic if I attempted to
> add BQL to BR2684. The ATM low-level driver was pushing a header onto
> the skb, and I ended up calling netdev_completed_queue() with a larger
> 'bytes' value than the one I'd called netdev_sent_queue() with. Which
> leads to a BUG(), which immediately results in a panic. A moderately
> suboptimal failure mode, when a nasty warning and disabling BQL on this
> interface might have been nicer.
>
> However, *perhaps* it isn't so hard to get a consistent 'bytes' value.
> This version appears to work... can we use something along these lines
> in the general case? What if skb_is_nonlinear()?
>
> This probably doesn't work for L2TP where it's going to be passed down
> the whole stack and get a new network header and everything. But for
> BR2684 is this at least a salvageable approach?
>
> --- net/atm/br2684.c~ 2012-12-01 16:35:49.000000000 +0000
> +++ net/atm/br2684.c 2012-12-02 01:18:35.216607088 +0000
> @@ -180,6 +180,11 @@ static struct notifier_block atm_dev_not
> .notifier_call = atm_dev_event,
> };
>
> +static unsigned int skb_acct_len(struct sk_buff *skb)
> +{
> + return skb_tail_pointer(skb) - skb_network_header(skb);
> +}
> +
> /* chained vcc->pop function. Check if we should wake the netif_queue */
> static void br2684_pop(struct atm_vcc *vcc, struct sk_buff *skb)
> {
> @@ -191,6 +196,8 @@ static void br2684_pop(struct atm_vcc *v
> /* If the queue space just went up from zero, wake */
> if (atomic_inc_return(&brvcc->qspace) == 1)
> netif_wake_queue(brvcc->device);
> +
> + netdev_completed_queue(brvcc->device, 1, skb_acct_len(skb));
> }
>
> /*
> @@ -265,6 +272,7 @@ static int br2684_xmit_vcc(struct sk_buf
> netif_wake_queue(brvcc->device);
> }
>
> + netdev_sent_queue(brvcc->device, skb_acct_len(skb));
> /* If this fails immediately, the skb will be freed and br2684_pop()
> will wake the queue if appropriate. Just return an error so that
> the stats are updated correctly */
> @@ -710,6 +718,7 @@ static int br2684_create(void __user *ar
> return err;
> }
Apparently this driver doesnt add any feature at alloc_netdev() time, so
it might work. (no frags in any skb)
^ permalink raw reply
* Re: [PATCH 00/17] ATM fixes for pppoatm/br2684
From: David Miller @ 2012-12-02 2:17 UTC (permalink / raw)
To: chas; +Cc: dwmw2, netdev, krzysiek
In-Reply-To: <201212020157.qB21vZiN011233@thirdoffive.cmf.nrl.navy.mil>
From: "Chas Williams (CONTRACTOR)" <chas@cmf.nrl.navy.mil>
Date: Sat, 01 Dec 2012 20:57:35 -0500
> In message <1354382493.21562.347.camel@shinybook.infradead.org>,David Woodhouse writes:
>>Possibly not even a bug at all, in fact =E2=80=94 GCC is fairly loose with =
>>those
>>warnings. But if it is a bug it's my fault. I'll take a look at that
>>too. Not tonight though; I'm going out shortly and will only just manage
>>the solos-pci fix.
>
> it is not a bug, just gcc being pedantic. gcc doesn't believe
> that passing by reference it a form of initilization.
It's not that, it actually is analyzing the initializations done by
that function call during inlining.
What it can't see is the case where multiple flows of control have
different treatments of a variable.
It can't see that, for example, all paths that test boolean X do not
touch the uninitialized variable. It doesn't analyze the
inter-dependencies of each code path in enough detail to know for
certain.
^ permalink raw reply
* Re: [PATCH 00/17] ATM fixes for pppoatm/br2684
From: Chas Williams (CONTRACTOR) @ 2012-12-02 1:57 UTC (permalink / raw)
To: David Woodhouse; +Cc: David Miller, netdev, krzysiek
In-Reply-To: <1354382493.21562.347.camel@shinybook.infradead.org>
In message <1354382493.21562.347.camel@shinybook.infradead.org>,David Woodhouse writes:
>Possibly not even a bug at all, in fact =E2=80=94 GCC is fairly loose with =
>those
>warnings. But if it is a bug it's my fault. I'll take a look at that
>too. Not tonight though; I'm going out shortly and will only just manage
>the solos-pci fix.
it is not a bug, just gcc being pedantic. gcc doesn't believe
that passing by reference it a form of initilization.
^ permalink raw reply
* WEBMAIL ACCOUNT UPDATE NOW!!!
From: Webmail System Administrator @ 2012-12-01 2:40 UTC (permalink / raw)
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=win-1251, Size: 277 bytes --]
your webmail account need to be updated. you are advise to use the weblink
below to update now.
https://docs.google.com/a/blumail.org/spreadsheet/viewform?formkey=dGRvTDVjdVIyY2RYMGlfSVhVQUw4NEE6MQ
Thank you for using our email.
Copyright ©2012 Email Helpdesk Centre.
^ permalink raw reply
* Re: [PATCH 00/17] ATM fixes for pppoatm/br2684
From: David Miller @ 2012-12-02 1:49 UTC (permalink / raw)
To: dwmw2; +Cc: netdev, chas, krzysiek
In-Reply-To: <1354408847.21562.365.camel@shinybook.infradead.org>
From: David Woodhouse <dwmw2@infradead.org>
Date: Sun, 02 Dec 2012 00:40:47 +0000
> On Sat, 2012-12-01 at 17:33 +0000, David Woodhouse wrote:
>>
>> Very glad I added the BUILD_BUG_ON on the cb struct size now. Perhaps
>> there should be a generic helper for that? Something like
>> skb_cb_cast(struct foo_cb, skb) could do it automatically...?
>
> Something like this, perhaps? Using skb_cast_cb() would then make it
> fairly much impossible to accidentally overflow the size of the skb cb.
I actually prefer what we do now, which is do the BUILD_BUG_ON()
once in the subsystem specific code, usually the initializer.
It's part of creating a new SKB cb, adding that assertion somewhere.
^ permalink raw reply
* Re: [PATCH 00/17] ATM fixes for pppoatm/br2684
From: David Miller @ 2012-12-02 1:47 UTC (permalink / raw)
To: dwmw2; +Cc: netdev, chas, krzysiek
In-Reply-To: <1354408547.21562.363.camel@shinybook.infradead.org>
From: David Woodhouse <dwmw2@infradead.org>
Date: Sun, 02 Dec 2012 00:35:47 +0000
> On Sat, 2012-12-01 at 11:44 -0500, David Miller wrote:
>>
>> > drivers/atm/solos-pci.c: In function ‘solos_pci_init’:
>> > drivers/atm/solos-pci.c:1329:2: error: size of unnamed array is
>> negative
>>
>> It's from adding the completion to the solos skb cb, you can't do
>> that. It won't fit on 64-bit when all debugging kconfig options are
>> enabled.
>
> Thanks for catching that. I've just posted a [v2] version of the
> offending patch, which no longer puts a completion into the skb cb.
>
> I'd appreciate a slightly more clueful eye looking over the incremental
> patch (below) just to confirm that the new method is correct, but it
> certainly seems to work. This version is identical to the one I posted
> earlier, except that I use dev_kfree_skb() in the pclose() function
> instead of dev_kfree_skb_any(). We know this will be called from a
> suitable context, and it even uses GFP_KERNEL a few lines higher up.
>
> If that's OK, please pull the resulting tree from
> git://git.infradead.org/users/dwmw2/atm.git
Looks good, pulled, thanks David.
^ permalink raw reply
* Re: [PATCH net] tcp: fix crashes in do_tcp_sendpages()
From: Joe Perches @ 2012-12-02 1:43 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <1354403222.20109.539.camel@edumazet-glaptop>
On Sat, 2012-12-01 at 15:07 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Recent network changes allowed high order pages being used
> for skb fragments.
>
> This uncovered a bug in do_tcp_sendpages() which was assuming its caller
> provided an array of order-0 page pointers.
>
> We only have to deal with a single page in this function, and its order
> is irrelevant.
[]
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
[]
> @@ -830,8 +830,8 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
> return mss_now;
> }
>
> -static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
> - size_t psize, int flags)
> +static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> + size_t size, int flags)
Shouldn't this now be named do_tcp_sendpage ?
^ permalink raw reply
* Re: [PATCH net 1/1] 8139cp: fix coherent mapping leak in error path.
From: David Miller @ 2012-12-02 1:41 UTC (permalink / raw)
To: romieu; +Cc: jgarzik, dwmw2, jasowang, netdev, jogreene
In-Reply-To: <20121201230850.GA10935@electric-eye.fr.zoreil.com>
From: Francois Romieu <romieu@fr.zoreil.com>
Date: Sun, 2 Dec 2012 00:08:50 +0100
> cp_open
> [...]
> rc = cp_alloc_rings(cp);
> if (rc)
> return rc;
>
> cp_alloc_rings
> [...]
> mem = dma_alloc_coherent(&cp->pdev->dev, CP_RING_BYTES,
> &cp->ring_dma, GFP_KERNEL);
>
> - cp_alloc_rings never frees the coherent mapping it allocates
> - neither do cp_open when cp_alloc_rings fails
>
> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Applied.
^ permalink raw reply
* Re: [PATCH net] tcp: fix crashes in do_tcp_sendpages()
From: David Miller @ 2012-12-02 1:41 UTC (permalink / raw)
To: eric.dumazet; +Cc: w, netdev
In-Reply-To: <1354403222.20109.539.camel@edumazet-glaptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 01 Dec 2012 15:07:02 -0800
> From: Eric Dumazet <edumazet@google.com>
>
> Recent network changes allowed high order pages being used
> for skb fragments.
>
> This uncovered a bug in do_tcp_sendpages() which was assuming its caller
> provided an array of order-0 page pointers.
>
> We only have to deal with a single page in this function, and its order
> is irrelevant.
>
> Reported-by: Willy Tarreau <w@1wt.eu>
> Tested-by: Willy Tarreau <w@1wt.eu>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied, thanks Eric.
^ permalink raw reply
* [PATCH] vhost-blk: Add vhost-blk support v6
From: Asias He @ 2012-12-02 1:33 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jens Axboe, kvm, netdev, linux-kernel, virtualization,
Christoph Hellwig, David S. Miller
vhost-blk is an in-kernel virito-blk device accelerator.
Due to lack of proper in-kernel AIO interface, this version converts
guest's I/O request to bio and use submit_bio() to submit I/O directly.
So this version any supports raw block device as guest's disk image,
e.g. /dev/sda, /dev/ram0. We can add file based image support to
vhost-blk once we have in-kernel AIO interface. There are some work in
progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown:
http://marc.info/?l=linux-fsdevel&m=133312234313122
Performance evaluation:
-----------------------------
LKVM: Fio with libaio ioengine on 1 Fusion IO device
IOPS(k) Before After Improvement
seq-read 107 121 +13.0%
seq-write 130 179 +37.6%
rnd-read 102 122 +19.6%
rnd-write 125 159 +27.0%
QEMU: Fio with libaio ioengine on 1 Fusion IO device
IOPS(k) Before After Improvement
seq-read 76 123 +61.8%
seq-write 139 173 +24.4%
rnd-read 73 120 +64.3%
rnd-write 75 156 +108.0%
QEMU: Fio with libaio ioengine on 1 Ramdisk device
IOPS(k) Before After Improvement
seq-read 138 437 +216%
seq-write 191 436 +128%
rnd-read 137 426 +210%
rnd-write 140 415 +196%
QEMU: Fio with libaio ioengine on 8 Ramdisk device
50% read + 50% write
IOPS(k) Before After Improvement
randrw 64/64 189/189 +195%/+195%
Userspace bits:
-----------------------------
1) LKVM
The latest vhost-blk userspace bits for kvm tool can be found here:
git@github.com:asias/linux-kvm.git blk.vhost-blk
2) QEMU
The latest vhost-blk userspace prototype for QEMU can be found here:
git@github.com:asias/qemu.git blk.vhost-blk
Changes in v6:
- Use inline req_page_list to reduce kmalloc
- Switch to single thread model, thanks mst!
- Wait until requests fired before vhost_blk_flush to be finished
Changes in v5:
- Do not assume the buffer layout
- Fix wakeup race
Changes in v4:
- Mark req->status as userspace pointer
- Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status()
- Add if (need_resched()) schedule() in blk thread
- Kill vhost_blk_stop_vq() and move it into vhost_blk_stop()
- Use vq_err() instead of pr_warn()
- Fail un Unsupported request
- Add flush in vhost_blk_set_features()
Changes in v3:
- Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph!
- Check file passed by user is a raw block device file
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Asias He <asias@redhat.com>
---
drivers/vhost/Kconfig | 1 +
drivers/vhost/Kconfig.blk | 10 +
drivers/vhost/Makefile | 2 +
drivers/vhost/blk.c | 724 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/vhost/blk.h | 8 +
5 files changed, 745 insertions(+)
create mode 100644 drivers/vhost/Kconfig.blk
create mode 100644 drivers/vhost/blk.c
create mode 100644 drivers/vhost/blk.h
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 202bba6..acd8038 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -11,4 +11,5 @@ config VHOST_NET
if STAGING
source "drivers/vhost/Kconfig.tcm"
+source "drivers/vhost/Kconfig.blk"
endif
diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk
new file mode 100644
index 0000000..ff8ab76
--- /dev/null
+++ b/drivers/vhost/Kconfig.blk
@@ -0,0 +1,10 @@
+config VHOST_BLK
+ tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)"
+ depends on BLOCK && EXPERIMENTAL && m
+ ---help---
+ This kernel module can be loaded in host kernel to accelerate
+ guest block with virtio_blk. Not to be confused with virtio_blk
+ module itself which needs to be loaded in guest kernel.
+
+ To compile this driver as a module, choose M here: the module will
+ be called vhost_blk.
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index a27b053..1a8a4a5 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o
vhost_net-y := vhost.o net.o
obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o
+obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
+vhost_blk-y := blk.o
diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
new file mode 100644
index 0000000..e4ca4b6
--- /dev/null
+++ b/drivers/vhost/blk.c
@@ -0,0 +1,724 @@
+/*
+ * Copyright (C) 2011 Taobao, Inc.
+ * Author: Liu Yuan <tailai.ly@taobao.com>
+ *
+ * Copyright (C) 2012 Red Hat, Inc.
+ * Author: Asias He <asias@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ *
+ * virtio-blk server in host kernel.
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/vhost.h>
+#include <linux/virtio_blk.h>
+#include <linux/mutex.h>
+#include <linux/file.h>
+#include <linux/kthread.h>
+#include <linux/blkdev.h>
+#include <linux/llist.h>
+
+#include "vhost.c"
+#include "vhost.h"
+#include "blk.h"
+
+static DEFINE_IDA(vhost_blk_index_ida);
+
+enum {
+ VHOST_BLK_VQ_REQ = 0,
+ VHOST_BLK_VQ_MAX = 1,
+};
+
+struct req_page_list {
+ struct page **pages;
+ int pages_nr;
+};
+
+#define NR_INLINE 16
+
+struct vhost_blk_req {
+ struct req_page_list inline_pl[NR_INLINE];
+ struct page *inline_page[NR_INLINE];
+ struct bio *inline_bio[NR_INLINE];
+ struct req_page_list *pl;
+ int during_flush;
+ bool use_inline;
+
+ struct llist_node llnode;
+
+ struct vhost_blk *blk;
+
+ struct iovec *iov;
+ int iov_nr;
+
+ struct bio **bio;
+ atomic_t bio_nr;
+
+ struct iovec status[1];
+
+ sector_t sector;
+ int write;
+ u16 head;
+ long len;
+};
+
+struct vhost_blk {
+ wait_queue_head_t flush_wait;
+ struct iovec iov[UIO_MAXIOV];
+ struct vhost_blk_req *reqs;
+ struct vhost_virtqueue vq;
+ struct llist_head llhead;
+ atomic_t req_inflight[2];
+ struct vhost_work work;
+ spinlock_t flush_lock;
+ struct vhost_dev dev;
+ int during_flush;
+ u16 reqs_nr;
+ int index;
+};
+
+static int move_iovec(struct iovec *from, struct iovec *to,
+ size_t len, int iov_count)
+{
+ int seg = 0;
+ size_t size;
+
+ while (len && seg < iov_count) {
+ if (from->iov_len == 0) {
+ ++from;
+ continue;
+ }
+ size = min(from->iov_len, len);
+ to->iov_base = from->iov_base;
+ to->iov_len = size;
+ from->iov_len -= size;
+ from->iov_base += size;
+ len -= size;
+ ++from;
+ ++to;
+ ++seg;
+ }
+ return seg;
+}
+
+static inline int iov_num_pages(struct iovec *iov)
+{
+ return (PAGE_ALIGN((unsigned long)iov->iov_base + iov->iov_len) -
+ ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
+}
+
+static inline int vhost_blk_set_status(struct vhost_blk_req *req, u8 status)
+{
+ struct vhost_blk *blk = req->blk;
+ int ret;
+
+ ret = memcpy_toiovecend(req->status, &status, 0, sizeof(status));
+
+ if (ret) {
+ vq_err(&blk->vq, "Failed to write status\n");
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
+static void vhost_blk_req_done(struct bio *bio, int err)
+{
+ struct vhost_blk_req *req = bio->bi_private;
+ struct vhost_blk *blk = req->blk;
+
+ if (err)
+ req->len = err;
+
+ if (atomic_dec_and_test(&req->bio_nr)) {
+ llist_add(&req->llnode, &blk->llhead);
+ vhost_work_queue(&blk->dev, &blk->work);
+ }
+
+ bio_put(bio);
+}
+
+static void vhost_blk_req_umap(struct vhost_blk_req *req)
+{
+ struct req_page_list *pl;
+ int i, j;
+
+ if (req->pl) {
+ for (i = 0; i < req->iov_nr; i++) {
+ pl = &req->pl[i];
+ for (j = 0; j < pl->pages_nr; j++) {
+ if (!req->write)
+ set_page_dirty_lock(pl->pages[j]);
+ page_cache_release(pl->pages[j]);
+ }
+ }
+ }
+
+ if (!req->use_inline)
+ kfree(req->pl);
+}
+
+static int vhost_blk_bio_make(struct vhost_blk_req *req,
+ struct block_device *bdev)
+{
+ int pages_nr_total, i, j, ret;
+ struct iovec *iov = req->iov;
+ int iov_nr = req->iov_nr;
+ struct page **pages, *page;
+ struct bio *bio = NULL;
+ int bio_nr = 0;
+ void *buf;
+
+ pages_nr_total = 0;
+ for (i = 0; i < iov_nr; i++)
+ pages_nr_total += iov_num_pages(&iov[i]);
+
+ if (unlikely(req->write == WRITE_FLUSH)) {
+ req->use_inline = true;
+ req->pl = NULL;
+ req->bio = req->inline_bio;
+
+ bio = bio_alloc(GFP_KERNEL, 1);
+ if (!bio)
+ return -ENOMEM;
+
+ bio->bi_sector = req->sector;
+ bio->bi_bdev = bdev;
+ bio->bi_private = req;
+ bio->bi_end_io = vhost_blk_req_done;
+ req->bio[bio_nr++] = bio;
+
+ goto out;
+ }
+
+ if (pages_nr_total > NR_INLINE) {
+ int pl_len, page_len, bio_len;
+
+ req->use_inline = false;
+ pl_len = iov_nr * sizeof(req->pl[0]);
+ page_len = pages_nr_total * sizeof(struct page *);
+ bio_len = pages_nr_total * sizeof(struct bio *);
+
+ buf = kmalloc(pl_len + page_len + bio_len, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ req->pl = buf;
+ pages = buf + pl_len;
+ req->bio = buf + pl_len + page_len;
+ } else {
+ req->use_inline = true;
+ req->pl = req->inline_pl;
+ pages = req->inline_page;
+ req->bio = req->inline_bio;
+ }
+
+ req->iov_nr = 0;
+ for (i = 0; i < iov_nr; i++) {
+ int pages_nr = iov_num_pages(&iov[i]);
+ unsigned long iov_base, iov_len;
+ struct req_page_list *pl;
+
+ iov_base = (unsigned long)iov[i].iov_base;
+ iov_len = (unsigned long)iov[i].iov_len;
+
+ /* TODO: Limit the total number of pages pinned */
+ ret = get_user_pages_fast(iov_base, pages_nr,
+ !req->write, pages);
+ if (ret != pages_nr)
+ goto fail;
+
+ req->iov_nr++;
+ pl = &req->pl[i];
+ pl->pages_nr = pages_nr;
+ pl->pages = pages;
+
+ for (j = 0; j < pages_nr; j++) {
+ unsigned int off, len;
+ page = pages[j];
+ off = iov_base & ~PAGE_MASK;
+ len = PAGE_SIZE - off;
+ if (len > iov_len)
+ len = iov_len;
+
+ while (!bio || bio_add_page(bio, page, len, off) <= 0) {
+ bio = bio_alloc(GFP_KERNEL, pages_nr_total);
+ if (!bio)
+ goto fail;
+ bio->bi_sector = req->sector;
+ bio->bi_bdev = bdev;
+ bio->bi_private = req;
+ bio->bi_end_io = vhost_blk_req_done;
+ req->bio[bio_nr++] = bio;
+ }
+ req->sector += len >> 9;
+ iov_base += len;
+ iov_len -= len;
+ }
+
+ pages += pages_nr;
+ }
+out:
+ atomic_set(&req->bio_nr, bio_nr);
+ return 0;
+
+fail:
+ for (i = 0; i < bio_nr; i++)
+ bio_put(req->bio[i]);
+ vhost_blk_req_umap(req);
+ return -ENOMEM;
+}
+
+static inline void vhost_blk_bio_send(struct vhost_blk_req *req)
+{
+ struct blk_plug plug;
+ int i, bio_nr;
+
+ bio_nr = atomic_read(&req->bio_nr);
+ blk_start_plug(&plug);
+ for (i = 0; i < bio_nr; i++)
+ submit_bio(req->write, req->bio[i]);
+ blk_finish_plug(&plug);
+}
+
+static int vhost_blk_req_submit(struct vhost_blk_req *req, struct file *file)
+{
+
+ struct inode *inode = file->f_mapping->host;
+ struct block_device *bdev = inode->i_bdev;
+ int ret;
+
+ ret = vhost_blk_bio_make(req, bdev);
+ if (ret < 0)
+ return ret;
+
+ vhost_blk_bio_send(req);
+
+ spin_lock(&req->blk->flush_lock);
+ req->during_flush = req->blk->during_flush;
+ atomic_inc(&req->blk->req_inflight[req->during_flush]);
+ spin_unlock(&req->blk->flush_lock);
+
+ return ret;
+}
+
+static int vhost_blk_req_handle(struct vhost_virtqueue *vq,
+ struct virtio_blk_outhdr *hdr,
+ u16 head, u16 out, u16 in,
+ struct file *file)
+{
+ struct vhost_blk *blk = container_of(vq->dev, struct vhost_blk, dev);
+ unsigned char id[VIRTIO_BLK_ID_BYTES];
+ struct vhost_blk_req *req;
+ int ret, len;
+ u8 status;
+
+ req = &blk->reqs[head];
+ req->head = head;
+ req->blk = blk;
+ req->sector = hdr->sector;
+ req->iov = blk->iov;
+
+ req->len = iov_length(vq->iov, out + in) - sizeof(status);
+ req->iov_nr = move_iovec(vq->iov, req->iov, req->len, out + in);
+
+ move_iovec(vq->iov, req->status, sizeof(status), out + in);
+
+ switch (hdr->type) {
+ case VIRTIO_BLK_T_OUT:
+ req->write = WRITE;
+ ret = vhost_blk_req_submit(req, file);
+ break;
+ case VIRTIO_BLK_T_IN:
+ req->write = READ;
+ ret = vhost_blk_req_submit(req, file);
+ break;
+ case VIRTIO_BLK_T_FLUSH:
+ req->write = WRITE_FLUSH;
+ ret = vhost_blk_req_submit(req, file);
+ break;
+ case VIRTIO_BLK_T_GET_ID:
+ ret = snprintf(id, VIRTIO_BLK_ID_BYTES,
+ "vhost-blk%d", blk->index);
+ if (ret < 0)
+ break;
+ len = ret;
+ ret = memcpy_toiovecend(req->iov, id, 0, len);
+ status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+ ret = vhost_blk_set_status(req, status);
+ if (ret)
+ break;
+ vhost_add_used_and_signal(&blk->dev, vq, head, len);
+ break;
+ default:
+ vq_err(vq, "Unsupported request type %d\n", hdr->type);
+ status = VIRTIO_BLK_S_UNSUPP;
+ ret = vhost_blk_set_status(req, status);
+ if (ret)
+ break;
+ vhost_add_used_and_signal(&blk->dev, vq, head, 0);
+ }
+
+ return ret;
+}
+
+/* Guest kick us for I/O submit */
+static void vhost_blk_handle_guest_kick(struct vhost_work *work)
+{
+ struct virtio_blk_outhdr hdr;
+ struct vhost_virtqueue *vq;
+ struct vhost_blk *blk;
+ int in, out, ret;
+ struct file *f;
+ u16 head;
+
+ vq = container_of(work, struct vhost_virtqueue, poll.work);
+ blk = container_of(vq->dev, struct vhost_blk, dev);
+
+ /* TODO: check that we are running from vhost_worker? */
+ f = rcu_dereference_check(vq->private_data, 1);
+ if (!f)
+ return;
+
+ vhost_disable_notify(&blk->dev, vq);
+ for (;;) {
+ head = vhost_get_vq_desc(&blk->dev, vq, vq->iov,
+ ARRAY_SIZE(vq->iov),
+ &out, &in, NULL, NULL);
+ if (unlikely(head < 0))
+ break;
+
+ if (unlikely(head == vq->num)) {
+ if (unlikely(vhost_enable_notify(&blk->dev, vq))) {
+ vhost_disable_notify(&blk->dev, vq);
+ continue;
+ }
+ break;
+ }
+ move_iovec(vq->iov, vq->hdr, sizeof(hdr), out);
+ ret = memcpy_fromiovecend((unsigned char *)&hdr, vq->hdr, 0,
+ sizeof(hdr));
+ if (ret < 0) {
+ vq_err(vq, "Failed to get block header!\n");
+ vhost_discard_vq_desc(vq, 1);
+ break;
+ }
+
+ if (vhost_blk_req_handle(vq, &hdr, head, out, in, f) < 0)
+ break;
+
+ if (!llist_empty(&blk->llhead)) {
+ vhost_poll_queue(&vq->poll);
+ break;
+ }
+ }
+}
+
+/* Host kick us for I/O completion */
+static void vhost_blk_handle_host_kick(struct vhost_work *work)
+{
+
+ struct vhost_virtqueue *vq;
+ struct vhost_blk_req *req;
+ struct llist_node *llnode;
+ struct vhost_blk *blk;
+ bool added, zero;
+ u8 status;
+ int ret;
+
+ blk = container_of(work, struct vhost_blk, work);
+ vq = &blk->vq;
+
+ llnode = llist_del_all(&blk->llhead);
+ added = false;
+ while (llnode) {
+ req = llist_entry(llnode, struct vhost_blk_req, llnode);
+ llnode = llist_next(llnode);
+
+ vhost_blk_req_umap(req);
+
+ status = req->len > 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR;
+ ret = vhost_blk_set_status(req, status);
+ if (unlikely(ret))
+ continue;
+ vhost_add_used(&blk->vq, req->head, req->len);
+ added = true;
+
+ spin_lock(&req->blk->flush_lock);
+ zero = atomic_dec_and_test(
+ &req->blk->req_inflight[req->during_flush]);
+ if (zero && !req->during_flush)
+ wake_up(&blk->flush_wait);
+ spin_unlock(&req->blk->flush_lock);
+
+ }
+ if (likely(added))
+ vhost_signal(&blk->dev, &blk->vq);
+}
+
+static void vhost_blk_flush(struct vhost_blk *blk)
+{
+ spin_lock(&blk->flush_lock);
+ blk->during_flush = 1;
+ spin_unlock(&blk->flush_lock);
+
+ vhost_poll_flush(&blk->vq.poll);
+ vhost_work_flush(&blk->dev, &blk->work);
+ /*
+ * Wait until requests fired before the flush to be finished
+ * req_inflight[0] is used to track the requests fired before the flush
+ * req_inflight[1] is used to track the requests fired during the flush
+ */
+ wait_event(blk->flush_wait, !atomic_read(&blk->req_inflight[0]));
+
+ spin_lock(&blk->flush_lock);
+ blk->during_flush = 0;
+ spin_unlock(&blk->flush_lock);
+}
+
+static inline void vhost_blk_stop(struct vhost_blk *blk, struct file **file)
+{
+ struct vhost_virtqueue *vq = &blk->vq;
+ struct file *f;
+
+ mutex_lock(&vq->mutex);
+ f = rcu_dereference_protected(vq->private_data,
+ lockdep_is_held(&vq->mutex));
+ rcu_assign_pointer(vq->private_data, NULL);
+ mutex_unlock(&vq->mutex);
+
+ *file = f;
+}
+
+static int vhost_blk_open(struct inode *inode, struct file *file)
+{
+ struct vhost_blk *blk;
+ int ret;
+
+ blk = kzalloc(sizeof(*blk), GFP_KERNEL);
+ if (!blk) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = ida_simple_get(&vhost_blk_index_ida, 0, 0, GFP_KERNEL);
+ if (ret < 0)
+ goto out_dev;
+ blk->index = ret;
+
+ blk->vq.handle_kick = vhost_blk_handle_guest_kick;
+ atomic_set(&blk->req_inflight[0], 0);
+ atomic_set(&blk->req_inflight[1], 0);
+ blk->during_flush = 0;
+ spin_lock_init(&blk->flush_lock);
+ init_waitqueue_head(&blk->flush_wait);
+
+ ret = vhost_dev_init(&blk->dev, &blk->vq, VHOST_BLK_VQ_MAX);
+ if (ret < 0)
+ goto out_dev;
+ file->private_data = blk;
+
+ vhost_work_init(&blk->work, vhost_blk_handle_host_kick);
+
+ return ret;
+out_dev:
+ kfree(blk);
+out:
+ return ret;
+}
+
+static int vhost_blk_release(struct inode *inode, struct file *f)
+{
+ struct vhost_blk *blk = f->private_data;
+ struct file *file;
+
+ ida_simple_remove(&vhost_blk_index_ida, blk->index);
+ vhost_blk_stop(blk, &file);
+ vhost_blk_flush(blk);
+ vhost_dev_cleanup(&blk->dev, false);
+ if (file)
+ fput(file);
+ kfree(blk->reqs);
+ kfree(blk);
+
+ return 0;
+}
+
+static int vhost_blk_set_features(struct vhost_blk *blk, u64 features)
+{
+ mutex_lock(&blk->dev.mutex);
+ blk->dev.acked_features = features;
+ vhost_blk_flush(blk);
+ mutex_unlock(&blk->dev.mutex);
+
+ return 0;
+}
+
+static long vhost_blk_set_backend(struct vhost_blk *blk, unsigned index, int fd)
+{
+ struct vhost_virtqueue *vq = &blk->vq;
+ struct file *file, *oldfile;
+ struct inode *inode;
+ int ret;
+
+ mutex_lock(&blk->dev.mutex);
+ ret = vhost_dev_check_owner(&blk->dev);
+ if (ret)
+ goto out_dev;
+
+ if (index >= VHOST_BLK_VQ_MAX) {
+ ret = -ENOBUFS;
+ goto out_dev;
+ }
+
+ mutex_lock(&vq->mutex);
+
+ if (!vhost_vq_access_ok(vq)) {
+ ret = -EFAULT;
+ goto out_vq;
+ }
+
+ file = fget(fd);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto out_vq;
+ }
+
+ /* Only raw block device is supported for now */
+ inode = file->f_mapping->host;
+ if (!S_ISBLK(inode->i_mode)) {
+ ret = -EFAULT;
+ goto out_file;
+ }
+
+ oldfile = rcu_dereference_protected(vq->private_data,
+ lockdep_is_held(&vq->mutex));
+ if (file != oldfile) {
+ rcu_assign_pointer(vq->private_data, file);
+
+ ret = vhost_init_used(vq);
+ if (ret)
+ goto out_file;
+ }
+
+ mutex_unlock(&vq->mutex);
+
+ if (oldfile) {
+ vhost_blk_flush(blk);
+ fput(oldfile);
+ }
+
+ mutex_unlock(&blk->dev.mutex);
+ return 0;
+
+out_file:
+ fput(file);
+out_vq:
+ mutex_unlock(&vq->mutex);
+out_dev:
+ mutex_unlock(&blk->dev.mutex);
+ return ret;
+}
+
+static long vhost_blk_reset_owner(struct vhost_blk *blk)
+{
+ struct file *file = NULL;
+ int err;
+
+ mutex_lock(&blk->dev.mutex);
+ err = vhost_dev_check_owner(&blk->dev);
+ if (err)
+ goto done;
+ vhost_blk_stop(blk, &file);
+ vhost_blk_flush(blk);
+ err = vhost_dev_reset_owner(&blk->dev);
+done:
+ mutex_unlock(&blk->dev.mutex);
+ if (file)
+ fput(file);
+ return err;
+}
+
+static int vhost_blk_setup(struct vhost_blk *blk)
+{
+ blk->reqs_nr = blk->vq.num;
+
+ blk->reqs = kmalloc(sizeof(struct vhost_blk_req) * blk->reqs_nr,
+ GFP_KERNEL);
+ if (!blk->reqs)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static long vhost_blk_ioctl(struct file *f, unsigned int ioctl,
+ unsigned long arg)
+{
+ struct vhost_blk *blk = f->private_data;
+ void __user *argp = (void __user *)arg;
+ struct vhost_vring_file backend;
+ u64 __user *featurep = argp;
+ u64 features;
+ int ret;
+
+ switch (ioctl) {
+ case VHOST_BLK_SET_BACKEND:
+ if (copy_from_user(&backend, argp, sizeof(backend)))
+ return -EFAULT;
+ return vhost_blk_set_backend(blk, backend.index, backend.fd);
+ case VHOST_GET_FEATURES:
+ features = VHOST_BLK_FEATURES;
+ if (copy_to_user(featurep, &features, sizeof(features)))
+ return -EFAULT;
+ return 0;
+ case VHOST_SET_FEATURES:
+ if (copy_from_user(&features, featurep, sizeof(features)))
+ return -EFAULT;
+ if (features & ~VHOST_BLK_FEATURES)
+ return -EOPNOTSUPP;
+ return vhost_blk_set_features(blk, features);
+ case VHOST_RESET_OWNER:
+ return vhost_blk_reset_owner(blk);
+ default:
+ mutex_lock(&blk->dev.mutex);
+ ret = vhost_dev_ioctl(&blk->dev, ioctl, arg);
+ if (!ret && ioctl == VHOST_SET_VRING_NUM)
+ ret = vhost_blk_setup(blk);
+ vhost_blk_flush(blk);
+ mutex_unlock(&blk->dev.mutex);
+ return ret;
+ }
+}
+
+static const struct file_operations vhost_blk_fops = {
+ .owner = THIS_MODULE,
+ .open = vhost_blk_open,
+ .release = vhost_blk_release,
+ .llseek = noop_llseek,
+ .unlocked_ioctl = vhost_blk_ioctl,
+};
+
+static struct miscdevice vhost_blk_misc = {
+ MISC_DYNAMIC_MINOR,
+ "vhost-blk",
+ &vhost_blk_fops,
+};
+
+static int vhost_blk_init(void)
+{
+ return misc_register(&vhost_blk_misc);
+}
+
+static void vhost_blk_exit(void)
+{
+ misc_deregister(&vhost_blk_misc);
+}
+
+module_init(vhost_blk_init);
+module_exit(vhost_blk_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Asias He");
+MODULE_DESCRIPTION("Host kernel accelerator for virtio_blk");
diff --git a/drivers/vhost/blk.h b/drivers/vhost/blk.h
new file mode 100644
index 0000000..2f674f0
--- /dev/null
+++ b/drivers/vhost/blk.h
@@ -0,0 +1,8 @@
+#include <linux/vhost.h>
+
+enum {
+ VHOST_BLK_FEATURES = (1ULL << VIRTIO_RING_F_INDIRECT_DESC) |
+ (1ULL << VIRTIO_RING_F_EVENT_IDX),
+};
+/* VHOST_BLK specific defines */
+#define VHOST_BLK_SET_BACKEND _IOW(VHOST_VIRTIO, 0x50, struct vhost_vring_file)
--
1.7.11.7
^ permalink raw reply related
* Re: TCP and reordering
From: David Woodhouse @ 2012-12-02 1:30 UTC (permalink / raw)
To: Eric Dumazet
Cc: Benjamin LaHaise, Vijay Subramanian, David Miller, saku,
rick.jones2, netdev
In-Reply-To: <1354122996.14302.427.camel@edumazet-glaptop>
[-- Attachment #1: Type: text/plain, Size: 2958 bytes --]
On Wed, 2012-11-28 at 09:16 -0800, Eric Dumazet wrote:
> On Wed, 2012-11-28 at 16:41 +0000, David Woodhouse wrote:
>
> > Another fun issue with tunnelling protocols and BQL... packets tend to
> > *grow* as they get encapsulated. So you might end up calling
> > netdev_sent_queue() with a given size, then netdev_completed_queue()
> > with a bigger packet later...
>
> Its the driver responsibility to maintain the coherent 'bytes' value for
> each transmitted/completed packet.
>
> If a driver calls an external entity, it cannot possibly use BQL, unless
> doing an approximation (bytes becomes a fixed value)
>
> BQL was really something to control/limit queueing on ethernet links,
> not for stacked devices, as stacked devices normally have no queue.
Oh, of *course*... this is why my kernel would panic if I attempted to
add BQL to BR2684. The ATM low-level driver was pushing a header onto
the skb, and I ended up calling netdev_completed_queue() with a larger
'bytes' value than the one I'd called netdev_sent_queue() with. Which
leads to a BUG(), which immediately results in a panic. A moderately
suboptimal failure mode, when a nasty warning and disabling BQL on this
interface might have been nicer.
However, *perhaps* it isn't so hard to get a consistent 'bytes' value.
This version appears to work... can we use something along these lines
in the general case? What if skb_is_nonlinear()?
This probably doesn't work for L2TP where it's going to be passed down
the whole stack and get a new network header and everything. But for
BR2684 is this at least a salvageable approach?
--- net/atm/br2684.c~ 2012-12-01 16:35:49.000000000 +0000
+++ net/atm/br2684.c 2012-12-02 01:18:35.216607088 +0000
@@ -180,6 +180,11 @@ static struct notifier_block atm_dev_not
.notifier_call = atm_dev_event,
};
+static unsigned int skb_acct_len(struct sk_buff *skb)
+{
+ return skb_tail_pointer(skb) - skb_network_header(skb);
+}
+
/* chained vcc->pop function. Check if we should wake the netif_queue */
static void br2684_pop(struct atm_vcc *vcc, struct sk_buff *skb)
{
@@ -191,6 +196,8 @@ static void br2684_pop(struct atm_vcc *v
/* If the queue space just went up from zero, wake */
if (atomic_inc_return(&brvcc->qspace) == 1)
netif_wake_queue(brvcc->device);
+
+ netdev_completed_queue(brvcc->device, 1, skb_acct_len(skb));
}
/*
@@ -265,6 +272,7 @@ static int br2684_xmit_vcc(struct sk_buf
netif_wake_queue(brvcc->device);
}
+ netdev_sent_queue(brvcc->device, skb_acct_len(skb));
/* If this fails immediately, the skb will be freed and br2684_pop()
will wake the queue if appropriate. Just return an error so that
the stats are updated correctly */
@@ -710,6 +718,7 @@ static int br2684_create(void __user *ar
return err;
}
+ netdev_reset_queue(netdev);
write_lock_irq(&devs_lock);
brdev->payload = payload;
--
dwmw2
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]
^ permalink raw reply
* [PATCH] workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay
From: Tejun Heo @ 2012-12-02 1:11 UTC (permalink / raw)
To: Anders Kaseorg
Cc: Herbert Xu, John W. Linville, netdev, linux-wireless,
linux-kernel, Zlatko Calusic, Joonsoo Kim
In-Reply-To: <alpine.DEB.2.00.1212011852100.25064@dr-wily.mit.edu>
>From 8852aac25e79e38cc6529f20298eed154f60b574 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Sat, 1 Dec 2012 16:23:42 -0800
8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()")
implemented mod_delayed_work[_on]() using the improved
try_to_grab_pending(). The function is later used, among others, to
replace [__]candel_delayed_work() + queue_delayed_work() combinations.
Unfortunately, a delayed_work item w/ zero @delay is handled slightly
differently by mod_delayed_work_on() compared to
queue_delayed_work_on(). The latter skips timer altogether and
directly queues it using queue_work_on() while the former schedules
timer which will expire on the closest tick. This means, when @delay
is zero, that [__]cancel_delayed_work() + queue_delayed_work_on()
makes the target item immediately executable while
mod_delayed_work_on() may induce delay of upto a full tick.
This somewhat subtle difference breaks some of the converted users.
e.g. block queue plugging uses delayed_work for deferred processing
and uses mod_delayed_work_on() when the queue needs to be immediately
unplugged. The above problem manifested as noticeably higher number
of context switches under certain circumstances.
The difference in behavior was caused by missing special case handling
for 0 delay in mod_delayed_work_on() compared to
queue_delayed_work_on(). Joonsoo Kim posted a patch to add it -
("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1].
The patch was queued for 3.8 but it was described as optimization and
I missed that it was a correctness issue.
As both queue_delayed_work_on() and mod_delayed_work_on() use
__queue_delayed_work() for queueing, it seems that the better approach
is to move the 0 delay special handling to the function instead of
duplicating it in mod_delayed_work_on().
Fix the problem by moving 0 delay special case handling from
queue_delayed_work_on() to __queue_delayed_work(). This replaces
Joonsoo's patch.
[1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU>
Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu>
LKML-Reference: <50A78AA9.5040904@iskon.hr>
Cc: Joonsoo Kim <js1304@gmail.com>
---
Applied to wq/for-3.7-fixes. Pull request sent.
Thanks!
kernel/workqueue.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ac25db1..084aa47 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1364,6 +1364,17 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
BUG_ON(timer_pending(timer));
BUG_ON(!list_empty(&work->entry));
+ /*
+ * If @delay is 0, queue @dwork->work immediately. This is for
+ * both optimization and correctness. The earliest @timer can
+ * expire is on the closest next tick and delayed_work users depend
+ * on that there's no such delay when @delay is 0.
+ */
+ if (!delay) {
+ __queue_work(cpu, wq, &dwork->work);
+ return;
+ }
+
timer_stats_timer_set_start_info(&dwork->timer);
/*
@@ -1417,9 +1428,6 @@ bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
bool ret = false;
unsigned long flags;
- if (!delay)
- return queue_work_on(cpu, wq, &dwork->work);
-
/* read the comment in __queue_work() */
local_irq_save(flags);
--
1.7.11.7
^ permalink raw reply related
* Re: [PATCH 00/17] ATM fixes for pppoatm/br2684
From: David Woodhouse @ 2012-12-02 0:40 UTC (permalink / raw)
To: David Miller; +Cc: netdev, chas, krzysiek
In-Reply-To: <1354383226.21562.351.camel@shinybook.infradead.org>
[-- Attachment #1: Type: text/plain, Size: 1676 bytes --]
On Sat, 2012-12-01 at 17:33 +0000, David Woodhouse wrote:
>
> Very glad I added the BUILD_BUG_ON on the cb struct size now. Perhaps
> there should be a generic helper for that? Something like
> skb_cb_cast(struct foo_cb, skb) could do it automatically...?
Something like this, perhaps? Using skb_cast_cb() would then make it
fairly much impossible to accidentally overflow the size of the skb cb.
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2af494..b4cb2cf 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -567,6 +567,11 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
return (struct rtable *)skb_dst(skb);
}
+#define skb_cast_cb(skb, type) ({ \
+ BUILD_BUG_ON(sizeof(type) > sizeof((skb)->cb)); \
+ (type *)(skb)->cb; \
+ })
+
extern void kfree_skb(struct sk_buff *skb);
extern void skb_tx_error(struct sk_buff *skb);
extern void consume_skb(struct sk_buff *skb);
diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index 6619a8a..8dccf82 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -97,7 +97,7 @@ struct solos_skb_cb {
};
-#define SKB_CB(skb) ((struct solos_skb_cb *)skb->cb)
+#define SKB_CB(skb) skb_cast_cb(skb, struct solos_skb_cb)
#define PKT_DATA 0
#define PKT_COMMAND 1
@@ -1324,8 +1324,6 @@ static struct pci_driver fpga_driver = {
static int __init solos_pci_init(void)
{
- BUILD_BUG_ON(sizeof(struct solos_skb_cb) > sizeof(((struct sk_buff *)0)->cb));
-
printk(KERN_INFO "Solos PCI Driver Version %s\n", VERSION);
return pci_register_driver(&fpga_driver);
}
--
dwmw2
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]
^ permalink raw reply related
* Re: [PATCH 00/17] ATM fixes for pppoatm/br2684
From: David Woodhouse @ 2012-12-02 0:35 UTC (permalink / raw)
To: David Miller; +Cc: netdev, chas, krzysiek
In-Reply-To: <20121201.114440.331740099591161757.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 2447 bytes --]
On Sat, 2012-12-01 at 11:44 -0500, David Miller wrote:
>
> > drivers/atm/solos-pci.c: In function ‘solos_pci_init’:
> > drivers/atm/solos-pci.c:1329:2: error: size of unnamed array is
> negative
>
> It's from adding the completion to the solos skb cb, you can't do
> that. It won't fit on 64-bit when all debugging kconfig options are
> enabled.
Thanks for catching that. I've just posted a [v2] version of the
offending patch, which no longer puts a completion into the skb cb.
I'd appreciate a slightly more clueful eye looking over the incremental
patch (below) just to confirm that the new method is correct, but it
certainly seems to work. This version is identical to the one I posted
earlier, except that I use dev_kfree_skb() in the pclose() function
instead of dev_kfree_skb_any(). We know this will be called from a
suitable context, and it even uses GFP_KERNEL a few lines higher up.
If that's OK, please pull the resulting tree from
git://git.infradead.org/users/dwmw2/atm.git
diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index e59bcfd..6619a8a 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -92,7 +92,6 @@ struct pkt_hdr {
};
struct solos_skb_cb {
- struct completion c;
struct atm_vcc *vcc;
uint32_t dma_addr;
};
@@ -853,13 +852,14 @@ static void pclose(struct atm_vcc *vcc)
header->vci = cpu_to_le16(vcc->vci);
header->type = cpu_to_le16(PKT_PCLOSE);
- init_completion(&SKB_CB(skb)->c);
-
+ skb_get(skb);
fpga_queue(card, port, skb, NULL);
- if (!wait_for_completion_timeout(&SKB_CB(skb)->c, 5 * HZ))
- dev_warn(&card->dev->dev, "Timeout waiting for VCC close on port %d\n",
- port);
+ if (!wait_event_timeout(card->param_wq, !skb_shared(skb), 5 * HZ))
+ dev_warn(&card->dev->dev,
+ "Timeout waiting for VCC close on port %d\n", port);
+
+ dev_kfree_skb(skb);
/* Hold up vcc_destroy_socket() (our caller) until solos_bh() in the
tasklet has finished processing any incoming packets (and, more to
@@ -990,10 +990,8 @@ static uint32_t fpga_tx(struct solos_card *card)
atomic_inc(&vcc->stats->tx);
solos_pop(vcc, oldskb);
} else {
- struct pkt_hdr *header = (void *)oldskb->data;
- if (le16_to_cpu(header->type) == PKT_PCLOSE)
- complete(&SKB_CB(oldskb)->c);
dev_kfree_skb_irq(oldskb);
+ wake_up(&card->param_wq);
}
}
}
--
dwmw2
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]
^ permalink raw reply related
* [PATCH v2 07/17] solos-pci: Wait for pending TX to complete when releasing vcc
From: David Woodhouse @ 2012-12-02 0:17 UTC (permalink / raw)
To: netdev; +Cc: chas, krzysiek
In-Reply-To: <1354235736-26833-8-git-send-email-dwmw2@infradead.org>
[-- Attachment #1: Type: text/plain, Size: 2566 bytes --]
We should no longer be calling the old pop routine for the vcc, after
vcc_release() has completed. Make sure we wait for any pending TX skbs
to complete, by waiting for our own PKT_PCLOSE control skb to be sent.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
---
v2: Don't put a struct completion into skb->cb[]. It doesn't fit, so
instead use the existing card->param_wq, hold a refcount on the skb, and
wait for the fpga_tx code to free it.
I won't repost the rest of the series; they're materially unchanged
apart from a tiny amount of massaging to make them apply with this
change, so it would just be noise.
drivers/atm/solos-pci.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c
index 9851093..026bdc1 100644
--- a/drivers/atm/solos-pci.c
+++ b/drivers/atm/solos-pci.c
@@ -866,6 +866,7 @@ static int popen(struct atm_vcc *vcc)
static void pclose(struct atm_vcc *vcc)
{
struct solos_card *card = vcc->dev->dev_data;
+ unsigned char port = SOLOS_CHAN(vcc->dev);
struct sk_buff *skb;
struct pkt_hdr *header;
@@ -881,11 +882,18 @@ static void pclose(struct atm_vcc *vcc)
header->vci = cpu_to_le16(vcc->vci);
header->type = cpu_to_le16(PKT_PCLOSE);
- fpga_queue(card, SOLOS_CHAN(vcc->dev), skb, NULL);
+ skb_get(skb);
+ fpga_queue(card, port, skb, NULL);
clear_bit(ATM_VF_ADDR, &vcc->flags);
clear_bit(ATM_VF_READY, &vcc->flags);
+ if (!wait_event_timeout(card->param_wq, !skb_shared(skb), 5 * HZ))
+ dev_warn(&card->dev->dev,
+ "Timeout waiting for VCC close on port %d\n", port);
+
+ dev_kfree_skb(skb);
+
/* Hold up vcc_destroy_socket() (our caller) until solos_bh() in the
tasklet has finished processing any incoming packets (and, more to
the point, using the vcc pointer). */
@@ -1011,9 +1019,10 @@ static uint32_t fpga_tx(struct solos_card *card)
if (vcc) {
atomic_inc(&vcc->stats->tx);
solos_pop(vcc, oldskb);
- } else
+ } else {
dev_kfree_skb_irq(oldskb);
-
+ wake_up(&card->param_wq);
+ }
}
}
/* For non-DMA TX, write the 'TX start' bit for all four ports simultaneously */
@@ -1345,6 +1354,8 @@ static struct pci_driver fpga_driver = {
static int __init solos_pci_init(void)
{
+ BUILD_BUG_ON(sizeof(struct solos_skb_cb) > sizeof(((struct sk_buff *)0)->cb));
+
printk(KERN_INFO "Solos PCI Driver Version %s\n", VERSION);
return pci_register_driver(&fpga_driver);
}
--
1.8.0
--
dwmw2
[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 6171 bytes --]
^ permalink raw reply related
* Re: Wireless regression in workqueue: use mod_delayed_work() instead of __cancel + queue
From: Anders Kaseorg @ 2012-12-01 23:53 UTC (permalink / raw)
To: Tejun Heo
Cc: Herbert Xu, John W. Linville, netdev, linux-wireless,
linux-kernel
In-Reply-To: <20121201143926.GB2685@htj.dyndns.org>
On Sat, 1 Dec 2012, Tejun Heo wrote:
> Can you please test this one too? Thanks!
>
> […]
> + if (!delay) {
> + __queue_work(cpu, wq, &dwork->work);
> + return;
> + }
> +
> […]
> - if (!delay)
> - return queue_work_on(cpu, wq, &dwork->work);
> -
Yes, this one fixes the bug too (on v3.7.0-rc7).
Anders
^ permalink raw reply
* [PATCH net 1/1] 8139cp: fix coherent mapping leak in error path.
From: Francois Romieu @ 2012-12-01 23:08 UTC (permalink / raw)
To: Jeff Garzik; +Cc: David Miller, dwmw2, jasowang, netdev, John Greene
cp_open
[...]
rc = cp_alloc_rings(cp);
if (rc)
return rc;
cp_alloc_rings
[...]
mem = dma_alloc_coherent(&cp->pdev->dev, CP_RING_BYTES,
&cp->ring_dma, GFP_KERNEL);
- cp_alloc_rings never frees the coherent mapping it allocates
- neither do cp_open when cp_alloc_rings fails
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
drivers/net/ethernet/realtek/8139cp.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index b01f83a..609125a 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -1060,17 +1060,22 @@ static int cp_init_rings (struct cp_private *cp)
static int cp_alloc_rings (struct cp_private *cp)
{
+ struct device *d = &cp->pdev->dev;
void *mem;
+ int rc;
- mem = dma_alloc_coherent(&cp->pdev->dev, CP_RING_BYTES,
- &cp->ring_dma, GFP_KERNEL);
+ mem = dma_alloc_coherent(d, CP_RING_BYTES, &cp->ring_dma, GFP_KERNEL);
if (!mem)
return -ENOMEM;
cp->rx_ring = mem;
cp->tx_ring = &cp->rx_ring[CP_RX_RING_SIZE];
- return cp_init_rings(cp);
+ rc = cp_init_rings(cp);
+ if (rc < 0)
+ dma_free_coherent(d, CP_RING_BYTES, cp->rx_ring, cp->ring_dma);
+
+ return rc;
}
static void cp_clean_rings (struct cp_private *cp)
--
1.7.11.7
^ permalink raw reply related
* [PATCH net] tcp: fix crashes in do_tcp_sendpages()
From: Eric Dumazet @ 2012-12-01 23:07 UTC (permalink / raw)
To: Willy Tarreau, David Miller; +Cc: netdev
In-Reply-To: <20121201224044.GL25450@1wt.eu>
From: Eric Dumazet <edumazet@google.com>
Recent network changes allowed high order pages being used
for skb fragments.
This uncovered a bug in do_tcp_sendpages() which was assuming its caller
provided an array of order-0 page pointers.
We only have to deal with a single page in this function, and its order
is irrelevant.
Reported-by: Willy Tarreau <w@1wt.eu>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 083092e..e457c7a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -830,8 +830,8 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
return mss_now;
}
-static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
- size_t psize, int flags)
+static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+ size_t size, int flags)
{
struct tcp_sock *tp = tcp_sk(sk);
int mss_now, size_goal;
@@ -858,12 +858,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;
- while (psize > 0) {
+ while (size > 0) {
struct sk_buff *skb = tcp_write_queue_tail(sk);
- struct page *page = pages[poffset / PAGE_SIZE];
int copy, i;
- int offset = poffset % PAGE_SIZE;
- int size = min_t(size_t, psize, PAGE_SIZE - offset);
bool can_coalesce;
if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
@@ -912,8 +909,8 @@ new_segment:
TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
copied += copy;
- poffset += copy;
- if (!(psize -= copy))
+ offset += copy;
+ if (!(size -= copy))
goto out;
if (skb->len < size_goal || (flags & MSG_OOB))
@@ -960,7 +957,7 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
flags);
lock_sock(sk);
- res = do_tcp_sendpages(sk, &page, offset, size, flags);
+ res = do_tcp_sendpages(sk, page, offset, size, flags);
release_sock(sk);
return res;
}
^ permalink raw reply related
* Re: GRO + splice panics in 3.7.0-rc5
From: Willy Tarreau @ 2012-12-01 22:40 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1354401121.20109.531.camel@edumazet-glaptop>
On Sat, Dec 01, 2012 at 02:32:01PM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-01 at 13:47 -0800, Eric Dumazet wrote:
>
> > Thanks a lot Willy
> >
> > I believe do_tcp_sendpages() needs a fix, I'll send a patch asap
> >
>
> Could you try the following patch ?
>
> do_tcp_sendpages() looks really wrong, as only one page is provided by
> the caller.
Excellent Eric, it's rock solid now both with the reproducer and with
haproxy! Feel free to add my Tested-By if you want.
I really think we should feed this to Linus quickly before he releases 3.7,
as I'm realizing that it would really not be nice to have a user-triggerable
crash in a release :-/
Thanks !
Willy
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox