* Re: [PATCH] netfilter, ipvs: Avoid undefined order of evaluation in assignments to struct nf_conn *
From: Simon Horman @ 2011-05-29 23:23 UTC (permalink / raw)
To: Jesper Juhl
Cc: linux-kernel, netfilter, coreteam, netfilter-devel, lvs-devel,
netdev, David S. Miller, Patrick McHardy, Julian Anastasov,
Wensong Zhang, Pablo Neira Ayuso
In-Reply-To: <alpine.LNX.2.00.1105292011450.4411@swampdragon.chaosbits.net>
On Sun, May 29, 2011 at 08:22:56PM +0200, Jesper Juhl wrote:
> In net/netfilter/ipvs/ip_vs_nfct.c::ip_vs_update_conntrack(),
> net/netfilter/ipvs/ip_vs_xmit.c::ip_vs_nat_xmit(),
> net/netfilter/ipvs/ip_vs_xmit.c::ip_vs_nat_xmit_v6(),
> net/netfilter/ipvs/ip_vs_xmit.c::ip_vs_icmp_xmit)()
> net/netfilter/ipvs/ip_vs_xmit.c::and ip_vs_icmp_xmit_v6() we do this:
> ...
> struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);
> ...
>
> Since '=' is not a sequence point the order of these assignments happening
> is undefined. Luckily it's easy to avoid by just doing what is obviously
> the intended thing:
> struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
>
> Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Simon Horman <horms@verge.net.au>
^ permalink raw reply
* Re: Ooops with 2.6.39
From: Andrew Morton @ 2011-05-29 22:45 UTC (permalink / raw)
To: Robert Goldner; +Cc: linux-kernel, netdev
In-Reply-To: <4DE1E013.6010300@au-79.de>
(cc netdev)
On Sun, 29 May 2011 07:56:35 +0200 Robert Goldner <robert@au-79.de> wrote:
> Hi,
>
> sometimes I got an Ooops while the system is starting:
>
> [ 15.824365] ------------[ cut here ]------------
> [ 15.824616] WARNING: at net/sched/sch_generic.c:256
> dev_watchdog+0x1d6/0x1e0()
> [ 15.824868] Hardware name:
> [ 15.825098] NETDEV WATCHDOG: eth0 (sis900): transmit queue 0 timed out
> [ 15.825333] Modules linked in: coretemp qt1010 af9013 dvb_pll mt2060
> dvb_usb_nova_t_usb2 dvb_usb_dibusb_common dib3000mc psmouse
> dibx000_common pcspkr dvb_usb_af9015 ohci_hcd
> [ 15.829066] Pid: 1462, comm: startpar Not tainted 2.6.39-agathe #2
> [ 15.829301] Call Trace:
> [ 15.829534] [<c103c79d>] warn_slowpath_common+0x6d/0xa0
> [ 15.829948] [<c13ff216>] ? dev_watchdog+0x1d6/0x1e0
> [ 15.830183] [<c13ff216>] ? dev_watchdog+0x1d6/0x1e0
> [ 15.830417] [<c103c84e>] warn_slowpath_fmt+0x2e/0x30
> [ 15.830828] [<c13ff216>] dev_watchdog+0x1d6/0x1e0
> [ 15.831241] [<c104ecc8>] ? __queue_work+0xb8/0x300
> [ 15.831476] [<c10048a8>] ? handle_irq+0x18/0x90
> [ 15.831889] [<c1041bc7>] ? irq_exit+0x37/0x80
> [ 15.832127] [<c10044f6>] ? do_IRQ+0x46/0xb0
> [ 15.832541] [<c1080189>] ? rcu_start_gp+0x149/0x1d0
> [ 15.832954] [<c1046a24>] run_timer_softirq+0xe4/0x200
> [ 15.833189] [<c1041900>] ? local_bh_enable+0x80/0x80
> [ 15.833426] [<c151ade9>] ? common_interrupt+0x29/0x30
> [ 15.833838] [<c13ff040>] ? qdisc_reset+0x40/0x40
> [ 15.834250] [<c1041978>] __do_softirq+0x78/0x100
> [ 15.834841] [<c1041900>] ? local_bh_enable+0x80/0x80
> [ 15.835097] <IRQ> [<c1041bfe>] ? irq_exit+0x6e/0x80
> [ 15.835188] [<c1019dc6>] ? smp_apic_timer_interrupt+0x56/0x90
> [ 15.835247] [<c151a46a>] ? apic_timer_interrupt+0x2a/0x30
> [ 15.835305] [<c151a168>] ? resume_userspace+0x14/0x14
> [ 15.835360] ---[ end trace 519c5412e162986a ]---
>
>
> The level of pain is very low, because the system is still running very
> well.
>
> If somebody needs some more information, please let me know.
>
Which net device driver is being used?
^ permalink raw reply
* [PATCH] drivers/net: ks8842 Fix crash on received packet when in PIO mode.
From: Dennis Aberilla @ 2011-05-29 21:46 UTC (permalink / raw)
To: info, davem; +Cc: netdev
This patch fixes a driver crash during packet reception due to not enough
bytes allocated in the skb. Since the loop reads out 4 bytes at a time, we
need to allow for up to 3 bytes of slack space.
Signed-off-by: Dennis Aberilla <denzzzhome@yahoo.com>
---
diff --git a/drivers/net/ks8842.c b/drivers/net/ks8842.c
index f0d8346..9bd0f55 100644
--- a/drivers/net/ks8842.c
+++ b/drivers/net/ks8842.c
@@ -662,7 +662,7 @@ static void ks8842_rx_frame(struct net_device *netdev,
/* check the status */
if ((status & RXSR_VALID) && !(status & RXSR_ERROR)) {
- struct sk_buff *skb = netdev_alloc_skb_ip_align(netdev, len);
+ struct sk_buff *skb = netdev_alloc_skb_ip_align(netdev, len + 3);
if (skb) {
--
Thanks.
|Dennis
=======================================================================
This email, including any attachments, is only for the intended
addressee. It is subject to copyright, is confidential and may be
the subject of legal or other privilege, none of which is waived or
lost by reason of this transmission.
If the receiver is not the intended addressee, please accept our
apologies, notify us by return, delete all copies and perform no
other act on the email.
Unfortunately, we cannot warrant that the email has not been
altered or corrupted during transmission.
=======================================================================
^ permalink raw reply related
* [PATCH] ip_options_compile: properly handle unaligned pointer
From: Chris Metcalf @ 2011-05-29 20:55 UTC (permalink / raw)
To: netdev, linux-kernel, kaber
The current code takes an unaligned pointer and does htonl() on it to
make it big-endian, then does a memcpy(). The problem is that the
compiler decides that since the pointer is to a __be32, it is legal
to optimize the copy into a processor word store. However, on an
architecture that does not handled unaligned writes in kernel space,
this produces an unaligned exception fault.
The solution is to track the pointer as a "char *" (which removes a bunch
of unpleasant casts in any case), and then just use put_unaligned_be32()
to write the value to memory.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
---
net/ipv4/ip_options.c | 15 ++++++++-------
1 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index 2391b24..a12d33f 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -14,6 +14,7 @@
#include <linux/slab.h>
#include <linux/types.h>
#include <asm/uaccess.h>
+#include <asm/unaligned.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/icmp.h>
@@ -352,7 +353,7 @@ int ip_options_compile(struct net *net,
goto error;
}
if (optptr[2] <= optlen) {
- __be32 *timeptr = NULL;
+ unsigned char *timeptr = NULL;
if (optptr[2]+3 > optptr[1]) {
pp_ptr = optptr + 2;
goto error;
@@ -361,7 +362,7 @@ int ip_options_compile(struct net *net,
case IPOPT_TS_TSONLY:
opt->ts = optptr - iph;
if (skb)
- timeptr = (__be32*)&optptr[optptr[2]-1];
+ timeptr = &optptr[optptr[2]-1];
opt->ts_needtime = 1;
optptr[2] += 4;
break;
@@ -373,7 +374,7 @@ int ip_options_compile(struct net *net,
opt->ts = optptr - iph;
if (rt) {
memcpy(&optptr[optptr[2]-1], &rt->rt_spec_dst, 4);
- timeptr = (__be32*)&optptr[optptr[2]+3];
+ timeptr = &optptr[optptr[2]+3];
}
opt->ts_needaddr = 1;
opt->ts_needtime = 1;
@@ -391,7 +392,7 @@ int ip_options_compile(struct net *net,
if (inet_addr_type(net, addr) == RTN_UNICAST)
break;
if (skb)
- timeptr = (__be32*)&optptr[optptr[2]+3];
+ timeptr = &optptr[optptr[2]+3];
}
opt->ts_needtime = 1;
optptr[2] += 8;
@@ -405,10 +406,10 @@ int ip_options_compile(struct net *net,
}
if (timeptr) {
struct timespec tv;
- __be32 midtime;
+ u32 midtime;
getnstimeofday(&tv);
- midtime = htonl((tv.tv_sec % 86400) * MSEC_PER_SEC + tv.tv_nsec / NSEC_PER_MSEC);
- memcpy(timeptr, &midtime, sizeof(__be32));
+ midtime = (tv.tv_sec % 86400) * MSEC_PER_SEC + tv.tv_nsec / NSEC_PER_MSEC;
+ put_unaligned_be32(midtime, timeptr);
opt->is_changed = 1;
}
} else {
--
1.6.5.2
^ permalink raw reply related
* [PATCH] netfilter, ipvs: Avoid undefined order of evaluation in assignments to struct nf_conn *
From: Jesper Juhl @ 2011-05-29 18:22 UTC (permalink / raw)
To: linux-kernel
Cc: netfilter, coreteam, netfilter-devel, lvs-devel, netdev,
David S. Miller, Patrick McHardy, Julian Anastasov, Simon Horman,
Wensong Zhang
In net/netfilter/ipvs/ip_vs_nfct.c::ip_vs_update_conntrack(),
net/netfilter/ipvs/ip_vs_xmit.c::ip_vs_nat_xmit(),
net/netfilter/ipvs/ip_vs_xmit.c::ip_vs_nat_xmit_v6(),
net/netfilter/ipvs/ip_vs_xmit.c::ip_vs_icmp_xmit)()
net/netfilter/ipvs/ip_vs_xmit.c::and ip_vs_icmp_xmit_v6() we do this:
...
struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);
...
Since '=' is not a sequence point the order of these assignments happening
is undefined. Luckily it's easy to avoid by just doing what is obviously
the intended thing:
struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
---
ip_vs_nfct.c | 2 +-
ip_vs_xmit.c | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
Compile tested only.
Patch is against Linus' tree as of a few minutes ago.
diff --git a/net/netfilter/ipvs/ip_vs_nfct.c b/net/netfilter/ipvs/ip_vs_nfct.c
index f454c80..a3d86c2 100644
--- a/net/netfilter/ipvs/ip_vs_nfct.c
+++ b/net/netfilter/ipvs/ip_vs_nfct.c
@@ -82,7 +82,7 @@ void
ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp, int outin)
{
enum ip_conntrack_info ctinfo;
- struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);
+ struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
struct nf_conntrack_tuple new_tuple;
if (ct == NULL || nf_ct_is_confirmed(ct) || nf_ct_is_untracked(ct) ||
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index ee319a4..16d129e 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -544,7 +544,7 @@ ip_vs_nat_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
if (cp->flags & IP_VS_CONN_F_SYNC && local) {
enum ip_conntrack_info ctinfo;
- struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);
+ struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
if (ct && !nf_ct_is_untracked(ct)) {
IP_VS_DBG_RL_PKT(10, AF_INET, pp, skb, 0,
@@ -661,7 +661,7 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
if (cp->flags & IP_VS_CONN_F_SYNC && local) {
enum ip_conntrack_info ctinfo;
- struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);
+ struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
if (ct && !nf_ct_is_untracked(ct)) {
IP_VS_DBG_RL_PKT(10, AF_INET6, pp, skb, 0,
@@ -1176,7 +1176,7 @@ ip_vs_icmp_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
if (cp->flags & IP_VS_CONN_F_SYNC && local) {
enum ip_conntrack_info ctinfo;
- struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);
+ struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
if (ct && !nf_ct_is_untracked(ct)) {
IP_VS_DBG(10, "%s(): "
@@ -1296,7 +1296,7 @@ ip_vs_icmp_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
if (cp->flags & IP_VS_CONN_F_SYNC && local) {
enum ip_conntrack_info ctinfo;
- struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);
+ struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
if (ct && !nf_ct_is_untracked(ct)) {
IP_VS_DBG(10, "%s(): "
--
Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/
Don't top-post http://www.catb.org/jargon/html/T/top-post.html
Plain text mails only, please.
^ permalink raw reply related
* Re: Kernel crash after using new Intel NIC (igb)
From: Ingo Molnar @ 2011-05-29 12:33 UTC (permalink / raw)
To: Eric Dumazet
Cc: Arun Sharma, David Miller, Maximilian Engelhardt, linux-kernel,
netdev, StuStaNet Vorstand, Yann Dupont, Denys Fedoryshchenko,
Thomas Gleixner
In-Reply-To: <1306654983.30021.10.camel@edumazet-laptop>
* Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I asked Arun if he wanted to make this himself, because initial
> idea was coming from him, not because I did not want to make it ;)
Hey, fair enough and sorry about the fuss! :-)
Thanks,
Ingo
^ permalink raw reply
* Re: [PATCH] net: ep93xx_eth: drop GFP_DMA from memory allocations
From: Russell King - ARM Linux @ 2011-05-29 11:27 UTC (permalink / raw)
To: Mika Westerberg; +Cc: netdev, hsweeten, ryan, kernel, linux-arm-kernel
In-Reply-To: <20110529105946.GA2655@acer>
On Sun, May 29, 2011 at 01:59:46PM +0300, Mika Westerberg wrote:
> FYI, I just enabled DMA debugging and I've got:
>
> [ 1.980000] WARNING: at lib/dma-debug.c:911 check_sync+0x460/0x510()
> [ 1.980000] NULL NULL: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000c5a11800] [size=78 bytes]
That's because of the 'allocate one buffer, map it once, then treat it
as two buffers' thing. DMA API debugging requires that the struct
device, and device address match:
static struct dma_debug_entry *hash_bucket_find(struct hash_bucket *bucket,
struct dma_debug_entry *ref)
{
struct dma_debug_entry *entry, *ret = NULL;
int matches = 0, match_lvl, last_lvl = 0;
list_for_each_entry(entry, &bucket->list, list) {
if ((entry->dev_addr != ref->dev_addr) ||
(entry->dev != ref->dev))
continue;
so the practice of using dma_sync_single_for_xxx() with partial buffers
is prohibited by this code (which I've always believed to be the right
answer.) I've always believed that dma_sync_single_range_for_xxx() is
the correct interface for doing this kind of thing.
Others may have a different view, in which case _something_ needs to get
fixed because their view is inconsistent with the debugging code!
^ permalink raw reply
* Re: [PATCH] net: ep93xx_eth: drop GFP_DMA from memory allocations
From: Mika Westerberg @ 2011-05-29 10:59 UTC (permalink / raw)
To: Russell King - ARM Linux; +Cc: netdev, hsweeten, ryan, kernel, linux-arm-kernel
In-Reply-To: <20110529103825.GZ24876@n2100.arm.linux.org.uk>
On Sun, May 29, 2011 at 11:38:25AM +0100, Russell King - ARM Linux wrote:
> On Sun, May 29, 2011 at 12:03:57PM +0300, Mika Westerberg wrote:
> > diff --git a/drivers/net/arm/ep93xx_eth.c b/drivers/net/arm/ep93xx_eth.c
> > index 5a77001..e495f76 100644
> > --- a/drivers/net/arm/ep93xx_eth.c
> > +++ b/drivers/net/arm/ep93xx_eth.c
> > @@ -494,7 +494,7 @@ static int ep93xx_alloc_buffers(struct ep93xx_priv *ep)
> > int i;
> >
> > ep->descs = dma_alloc_coherent(NULL, sizeof(struct ep93xx_descs),
> > - &ep->descs_dma_addr, GFP_KERNEL | GFP_DMA);
> > + &ep->descs_dma_addr, GFP_KERNEL);
>
> This one is correct anyway - whether the allocation comes from the DMA
> zone or not is something which should be controlled by the DMA mask and
> not the driver passing random GFP flags to dma_alloc_coherent.
>
> However, because it is passing a NULL device to dma_alloc_coherent(), it
> assumes that it is for an old ISA device, and so will continue to perform
> a GFP_DMA allocation. The solution - pass a struct device and set the
> DMA masks correctly.
Ok, thanks. I'll send updated patch soon.
> > @@ -525,7 +525,7 @@ static int ep93xx_alloc_buffers(struct ep93xx_priv *ep)
> > void *page;
> > dma_addr_t d;
> >
> > - page = (void *)__get_free_page(GFP_KERNEL | GFP_DMA);
> > + page = (void *)__get_free_page(GFP_KERNEL);
> > if (page == NULL)
> > goto err;
> >
>
> Wow, just looked at what this driver is doing with the DMA buffer handling,
> and I'm wondering how come its soo broken.
>
[...]
>
> While it may work on ep93xx, it's not respecting the DMA API rules,
> and with DMA debugging enabled it will probably encounter quite a few
> warnings.
FYI, I just enabled DMA debugging and I've got:
[ 1.980000] WARNING: at lib/dma-debug.c:911 check_sync+0x460/0x510()
[ 1.980000] NULL NULL: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000c5a11800] [size=78 b
ytes]
[ 1.980000] Modules linked in:
[ 1.980000] [<c0035498>] (unwind_backtrace+0x0/0xf4) from [<c0043ea4>] (warn_slowpath_common+0x48/0x60)
[ 1.980000] [<c0043ea4>] (warn_slowpath_common+0x48/0x60) from [<c0043f50>] (warn_slowpath_fmt+0x30/0x40)
[ 1.980000] [<c0043f50>] (warn_slowpath_fmt+0x30/0x40) from [<c01e2a00>] (check_sync+0x460/0x510)
[ 1.980000] [<c01e2a00>] (check_sync+0x460/0x510) from [<c01e2ea4>] (debug_dma_sync_single_for_cpu+0x50/0x5c)
[ 1.980000] [<c01e2ea4>] (debug_dma_sync_single_for_cpu+0x50/0x5c) from [<c0229a40>] (ep93xx_xmit+0x80/0x144)
[ 1.980000] [<c0229a40>] (ep93xx_xmit+0x80/0x144) from [<c0284558>] (dev_hard_start_xmit+0x33c/0x5e8)
[ 1.980000] [<c0284558>] (dev_hard_start_xmit+0x33c/0x5e8) from [<c02992d8>] (sch_direct_xmit+0xa8/0x1c0)
[ 1.980000] [<c02992d8>] (sch_direct_xmit+0xa8/0x1c0) from [<c028494c>] (dev_queue_xmit+0x148/0x46c)
[ 1.980000] [<c028494c>] (dev_queue_xmit+0x148/0x46c) from [<c028ed24>] (neigh_resolve_output+0x110/0x3bc)
[ 1.980000] [<c028ed24>] (neigh_resolve_output+0x110/0x3bc) from [<c02f7898>] (ip6_finish_output2+0x388/0x458)
[ 1.980000] [<c02f7898>] (ip6_finish_output2+0x388/0x458) from [<c03071a4>] (ndisc_send_skb+0x1dc/0x330)
[ 1.980000] [<c03071a4>] (ndisc_send_skb+0x1dc/0x330) from [<c0308be4>] (ndisc_send_ns+0x7c/0xac)
[ 1.980000] [<c0308be4>] (ndisc_send_ns+0x7c/0xac) from [<c02fb7e0>] (addrconf_dad_timer+0x144/0x15c)
[ 1.980000] [<c02fb7e0>] (addrconf_dad_timer+0x144/0x15c) from [<c0050760>] (run_timer_softirq+0x110/0x23c)
[ 1.980000] [<c0050760>] (run_timer_softirq+0x110/0x23c) from [<c0049ec0>] (__do_softirq+0xa4/0x168)
[ 1.980000] [<c0049ec0>] (__do_softirq+0xa4/0x168) from [<c004a148>] (irq_exit+0x54/0x60)
[ 1.980000] [<c004a148>] (irq_exit+0x54/0x60) from [<c0023034>] (asm_do_IRQ+0x34/0x84)
[ 1.980000] [<c0023034>] (asm_do_IRQ+0x34/0x84) from [<c002f4f8>] (__irq_svc+0x38/0xd4)
...
I'll prepare a separate patch where these are fixed.
^ permalink raw reply
* Re: [PATCH] net: ep93xx_eth: drop GFP_DMA from memory allocations
From: Russell King - ARM Linux @ 2011-05-29 10:38 UTC (permalink / raw)
To: Mika Westerberg; +Cc: netdev, hsweeten, ryan, kernel, linux-arm-kernel
In-Reply-To: <1306659837-23553-1-git-send-email-mika.westerberg@iki.fi>
On Sun, May 29, 2011 at 12:03:57PM +0300, Mika Westerberg wrote:
> diff --git a/drivers/net/arm/ep93xx_eth.c b/drivers/net/arm/ep93xx_eth.c
> index 5a77001..e495f76 100644
> --- a/drivers/net/arm/ep93xx_eth.c
> +++ b/drivers/net/arm/ep93xx_eth.c
> @@ -494,7 +494,7 @@ static int ep93xx_alloc_buffers(struct ep93xx_priv *ep)
> int i;
>
> ep->descs = dma_alloc_coherent(NULL, sizeof(struct ep93xx_descs),
> - &ep->descs_dma_addr, GFP_KERNEL | GFP_DMA);
> + &ep->descs_dma_addr, GFP_KERNEL);
This one is correct anyway - whether the allocation comes from the DMA
zone or not is something which should be controlled by the DMA mask and
not the driver passing random GFP flags to dma_alloc_coherent.
However, because it is passing a NULL device to dma_alloc_coherent(), it
assumes that it is for an old ISA device, and so will continue to perform
a GFP_DMA allocation. The solution - pass a struct device and set the
DMA masks correctly.
> @@ -525,7 +525,7 @@ static int ep93xx_alloc_buffers(struct ep93xx_priv *ep)
> void *page;
> dma_addr_t d;
>
> - page = (void *)__get_free_page(GFP_KERNEL | GFP_DMA);
> + page = (void *)__get_free_page(GFP_KERNEL);
> if (page == NULL)
> goto err;
>
Wow, just looked at what this driver is doing with the DMA buffer handling,
and I'm wondering how come its soo broken.
So, to summarize what its doing:
1. It allocates buffers for rx and tx.
2. It maps them with dma_map_single().
This transfers ownership of the buffer to the DMA device.
3. In ep93xx_xmit,
3a. It copies the data into the buffer with skb_copy_and_csum_dev()
This violates the DMA buffer ownership rules - the CPU should
not be writing to this buffer while it is (in principle) owned
by the DMA device.
3b. It then calls dma_sync_single_for_cpu() for the buffer.
This transfers ownership of the buffer to the CPU, which surely
is the wrong direction.
4. In ep93xx_rx,
4a. It calls dma_sync_single_for_cpu() for the buffer.
This at least transfers the DMA buffer ownership to the CPU
before the CPU reads the buffer
4b. It then uses skb_copy_to_linear_data() to copy the data out.
At no point does it transfer ownership back to the DMA device.
5. When the driver is removed, it dma_unmap_single()'s the buffer.
This transfers ownership of the buffer to the CPU.
6. It frees the buffer.
While it may work on ep93xx, it's not respecting the DMA API rules,
and with DMA debugging enabled it will probably encounter quite a few
warnings.
^ permalink raw reply
* Re: [PATCH] pxa: don't ask for a buffer from DMA zone
From: Eric Miao @ 2011-05-29 10:28 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Dmitry Eremin-Solenikov, Samuel Ortiz, netdev, linux-arm-kernel
In-Reply-To: <20110529102456.GX24876@n2100.arm.linux.org.uk>
On Sun, May 29, 2011 at 6:24 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Sun, May 29, 2011 at 12:42:55PM +0400, Dmitry Eremin-Solenikov wrote:
>> PXA don't have special DMA zone. And since
>> 197b59ae6e8bee56fcef37ea2482dc08414e2ac (mm: fail GFP_DMA allocations
>> when ZONE_DMA is not configured) allocation with GFP_DMA set will fail
>> with a trace like this:
>
> These buffers are never used with DMA, its only used with the PIO activity
> when in SIR mode. When in FIR mode, and DMA is being used, we copy it to
> a block of memory allocated by dma_coherent_alloc(). So the GFP_DMA
> annotation here is redundant.
>
> And that's probably more important to document in the changelog... its not
> that PXA doesn't have a special DMA zone, it's that the driver doesn't do
> DMA on these buffers so its pointless marking them with GFP_DMA.
>
Not really sure about the whole story, but further speaking, for
architectures like
PXA and most other ARM SoCs, where DMA can happen anywhere, does this
mean we don't even need to put GFP_DMA flag when doing memory allocation?
>> static int pxa_irda_init_iobuf(iobuff_t *io, int size)
>> {
>> - io->head = kmalloc(size, GFP_KERNEL | GFP_DMA);
>> + io->head = kmalloc(size, GFP_KERNEL);
>> if (io->head != NULL) {
>> io->truesize = size;
>> io->in_frame = FALSE;
>> --
>> 1.7.4.4
>>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
^ permalink raw reply
* Re: [PATCH] pxa: don't ask for a buffer from DMA zone
From: Russell King - ARM Linux @ 2011-05-29 10:24 UTC (permalink / raw)
To: Dmitry Eremin-Solenikov; +Cc: Samuel Ortiz, netdev, Eric Miao, linux-arm-kernel
In-Reply-To: <1306658575-17160-1-git-send-email-dbaryshkov@gmail.com>
On Sun, May 29, 2011 at 12:42:55PM +0400, Dmitry Eremin-Solenikov wrote:
> PXA don't have special DMA zone. And since
> 197b59ae6e8bee56fcef37ea2482dc08414e2ac (mm: fail GFP_DMA allocations
> when ZONE_DMA is not configured) allocation with GFP_DMA set will fail
> with a trace like this:
These buffers are never used with DMA, its only used with the PIO activity
when in SIR mode. When in FIR mode, and DMA is being used, we copy it to
a block of memory allocated by dma_coherent_alloc(). So the GFP_DMA
annotation here is redundant.
And that's probably more important to document in the changelog... its not
that PXA doesn't have a special DMA zone, it's that the driver doesn't do
DMA on these buffers so its pointless marking them with GFP_DMA.
> static int pxa_irda_init_iobuf(iobuff_t *io, int size)
> {
> - io->head = kmalloc(size, GFP_KERNEL | GFP_DMA);
> + io->head = kmalloc(size, GFP_KERNEL);
> if (io->head != NULL) {
> io->truesize = size;
> io->in_frame = FALSE;
> --
> 1.7.4.4
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply
* Re: [Xen-devel] Re: [PATCH] ethtool: ETHTOOL_SFEATURES: remove NETIF_F_COMPAT return
From: Michał Mirosław @ 2011-05-29 9:38 UTC (permalink / raw)
To: Jesse Gross
Cc: dev-yBygre7rU0TnMu66kgdUjQ,
xen-devel-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR@public.gmane.org,
Ian Campbell, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
xen-api-GuqFBffKawuULHF6PoxzQEEOCMrvLtNR, Ben Hutchings,
David Miller
In-Reply-To: <BANLkTime8PHYe+BFELt92gg7SZ91xKvAwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Sat, May 28, 2011 at 10:31:03AM -0700, Jesse Gross wrote:
> 2011/5/28 Ian Campbell <Ian.Campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>:
> > On Sat, 2011-05-28 at 08:35 +0100, Michał Mirosław wrote:
> >> On Sat, May 28, 2011 at 12:25:55AM +0100, Ben Hutchings wrote:
> >> > On Fri, 2011-05-27 at 18:34 +0200, Michał Mirosław wrote:
> >> > > On Fri, May 27, 2011 at 04:45:50PM +0100, Ben Hutchings wrote:
> >> > > > On Fri, 2011-05-27 at 17:28 +0200, Michał Mirosław wrote:
> >> > [...]
> >> > > > > (note: ETHTOOL_S{SG,...} are not ever going away)
> >> > > > > - causes NETIF_F_* to be an ABI
> >> > > > If feature flag numbers are not stable then what is the point of
> >> > > > /sys/class/net/<name>/features? Also, I'm not aware that features have
> >> > > > ever been renumbered in the past.
> >> > > Since no NETIF_F_* were exported earlier, I assume /sys/class/net/*/features
> >> > > is a debugging aid. What is it used for besides that?
> >> > xen-api <https://github.com/xen-org/xen-api> uses it in
> >> > scripts/InterfaceReconfigureVswitch.py. Though it doesn't seem to be
> >> > used for a particularly good reason...
> >> Look like it should use ETHTOOL_GFLAGS instead for netdev_has_vlan_accel().
> ETHTOOL_GFLAGS didn't expose the vlan acceleration flags until 2.6.37,
> which is why /sys/class/net was used instead.
https://github.com/xen-org/xen-api/commit/78b8078e6ae3cf48179859bed6350bb326987546
The commit using it was introduced after 2.6.37 kernel was released and uses
undocumented acccess path to the bits in question. What is the kernel patch
this commit is referring to?
Best Regards,
Michał Mirosław
^ permalink raw reply
* [PATCH] net: ep93xx_eth: drop GFP_DMA from memory allocations
From: Mika Westerberg @ 2011-05-29 9:03 UTC (permalink / raw)
To: netdev; +Cc: kernel, hsweeten, ryan, linux-arm-kernel
Commit a197b59ae6e8 (mm: fail GFP_DMA allocations when ZONE_DMA is not
configured) made page allocator to return NULL if GFP_DMA is set but
CONFIG_ZONE_DMA is disabled.
This causes ep93xx_eth to fail:
WARNING: at mm/page_alloc.c:2251 __alloc_pages_nodemask+0x11c/0x638()
Modules linked in:
[<c0035498>] (unwind_backtrace+0x0/0xf4) from [<c0043da4>] (warn_slowpath_common+0x48/0x60)
[<c0043da4>] (warn_slowpath_common+0x48/0x60) from [<c0043dd8>] (warn_slowpath_null+0x1c/0x24)
[<c0043dd8>] (warn_slowpath_null+0x1c/0x24) from [<c0083b6c>] (__alloc_pages_nodemask+0x11c/0x638)
[<c0083b6c>] (__alloc_pages_nodemask+0x11c/0x638) from [<c00366fc>] (__dma_alloc+0x8c/0x3ec)
[<c00366fc>] (__dma_alloc+0x8c/0x3ec) from [<c0036adc>] (dma_alloc_coherent+0x54/0x60)
[<c0036adc>] (dma_alloc_coherent+0x54/0x60) from [<c0227808>] (ep93xx_open+0x20/0x864)
[<c0227808>] (ep93xx_open+0x20/0x864) from [<c0283144>] (__dev_open+0xb8/0x108)
[<c0283144>] (__dev_open+0xb8/0x108) from [<c0280528>] (__dev_change_flags+0x70/0x128)
[<c0280528>] (__dev_change_flags+0x70/0x128) from [<c0283054>] (dev_change_flags+0x10/0x48)
[<c0283054>] (dev_change_flags+0x10/0x48) from [<c001a720>] (ip_auto_config+0x190/0xf68)
[<c001a720>] (ip_auto_config+0x190/0xf68) from [<c00233b0>] (do_one_initcall+0x34/0x18c)
[<c00233b0>] (do_one_initcall+0x34/0x18c) from [<c0008400>] (kernel_init+0x94/0x134)
[<c0008400>] (kernel_init+0x94/0x134) from [<c0030858>] (kernel_thread_exit+0x0/0x8)
Since there is no restrictions for DMA on ep93xx, we can fix this by just
removing the GFP_DMA flag from the allocations.
Signed-off-by: Mika Westerberg <mika.westerberg@iki.fi>
---
drivers/net/arm/ep93xx_eth.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/arm/ep93xx_eth.c b/drivers/net/arm/ep93xx_eth.c
index 5a77001..e495f76 100644
--- a/drivers/net/arm/ep93xx_eth.c
+++ b/drivers/net/arm/ep93xx_eth.c
@@ -494,7 +494,7 @@ static int ep93xx_alloc_buffers(struct ep93xx_priv *ep)
int i;
ep->descs = dma_alloc_coherent(NULL, sizeof(struct ep93xx_descs),
- &ep->descs_dma_addr, GFP_KERNEL | GFP_DMA);
+ &ep->descs_dma_addr, GFP_KERNEL);
if (ep->descs == NULL)
return 1;
@@ -502,7 +502,7 @@ static int ep93xx_alloc_buffers(struct ep93xx_priv *ep)
void *page;
dma_addr_t d;
- page = (void *)__get_free_page(GFP_KERNEL | GFP_DMA);
+ page = (void *)__get_free_page(GFP_KERNEL);
if (page == NULL)
goto err;
@@ -525,7 +525,7 @@ static int ep93xx_alloc_buffers(struct ep93xx_priv *ep)
void *page;
dma_addr_t d;
- page = (void *)__get_free_page(GFP_KERNEL | GFP_DMA);
+ page = (void *)__get_free_page(GFP_KERNEL);
if (page == NULL)
goto err;
--
1.7.4.4
^ permalink raw reply related
* [PATCH] pxa: don't ask for a buffer from DMA zone
From: Dmitry Eremin-Solenikov @ 2011-05-29 8:42 UTC (permalink / raw)
To: Samuel Ortiz; +Cc: netdev, Eric Miao, linux-arm-kernel
PXA don't have special DMA zone. And since
197b59ae6e8bee56fcef37ea2482dc08414e2ac (mm: fail GFP_DMA allocations
when ZONE_DMA is not configured) allocation with GFP_DMA set will fail
with a trace like this:
------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2251
__alloc_pages_nodemask+0xa0/0x5ac()
Modules linked in:
[<c00385b0>] (unwind_backtrace+0x0/0xf0) from [<c0050b1c>] (warn_slowpath_common+0x4c/0x64)
[<c0050b1c>] (warn_slowpath_common+0x4c/0x64) from [<c0050b4c>] (warn_slowpath_null+0x18/0x1c)
[<c0050b4c>] (warn_slowpath_null+0x18/0x1c) from [<c00908ec>] (__alloc_pages_nodemask+0xa0/0x5ac)
[<c00908ec>] (__alloc_pages_nodemask+0xa0/0x5ac) from [<c0090e74>] (__get_free_pages+0x10/0x3c)
[<c0090e74>] (__get_free_pages+0x10/0x3c) from [<c01d608c>] (pxa_irda_init_iobuf+0x18/0x48)
[<c01d608c>] (pxa_irda_init_iobuf+0x18/0x48) from [<c01d61d8>] (pxa_irda_probe+0x11c/0x32c)
[<c01d61d8>] (pxa_irda_probe+0x11c/0x32c) from [<c019474c>] (platform_drv_probe+0x14/0x18)
[<c019474c>] (platform_drv_probe+0x14/0x18) from [<c0193508>] (really_probe+0xa0/0x158)
[<c0193508>] (really_probe+0xa0/0x158) from [<c019360c>] (driver_probe_device+0x4c/0x64)
[<c019360c>] (driver_probe_device+0x4c/0x64) from [<c0193684>] (__driver_attach+0x60/0x84)
[<c0193684>] (__driver_attach+0x60/0x84) from [<c0192d78>] (bus_for_each_dev+0x48/0x84)
[<c0192d78>] (bus_for_each_dev+0x48/0x84) from [<c01926b8>] (bus_add_driver+0xa8/0x220)
[<c01926b8>] (bus_add_driver+0xa8/0x220) from [<c0193c7c>] (driver_register+0xac/0x13c)
[<c0193c7c>] (driver_register+0xac/0x13c) from [<c0033440>] (do_one_initcall+0x94/0x16c)
[<c0033440>] (do_one_initcall+0x94/0x16c) from [<c00083f4>] (kernel_init+0x94/0x140)
[<c00083f4>] (kernel_init+0x94/0x140) from [<c00348d0>] (kernel_thread_exit+0x0/0x8)
---[ end trace 0b8bf08f70147098 ]---
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
drivers/net/irda/pxaficp_ir.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/irda/pxaficp_ir.c b/drivers/net/irda/pxaficp_ir.c
index 001ed0a..a5ebaef 100644
--- a/drivers/net/irda/pxaficp_ir.c
+++ b/drivers/net/irda/pxaficp_ir.c
@@ -805,7 +805,7 @@ static int pxa_irda_resume(struct platform_device *_dev)
static int pxa_irda_init_iobuf(iobuff_t *io, int size)
{
- io->head = kmalloc(size, GFP_KERNEL | GFP_DMA);
+ io->head = kmalloc(size, GFP_KERNEL);
if (io->head != NULL) {
io->truesize = size;
io->in_frame = FALSE;
--
1.7.4.4
^ permalink raw reply related
* Re: Kernel crash after using new Intel NIC (igb)
From: Eric Dumazet @ 2011-05-29 7:43 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arun Sharma, David Miller, Maximilian Engelhardt, linux-kernel,
netdev, StuStaNet Vorstand, Yann Dupont, Denys Fedoryshchenko,
Thomas Gleixner
In-Reply-To: <20110529073840.GB21254@elte.hu>
Le dimanche 29 mai 2011 à 09:38 +0200, Ingo Molnar a écrit :
> Yes and i do this all the time as well, to make life easier for the
> stable team.
>
> What wasnt fine was to not follow up the backportable hack with the
> proper patch though. Or did you plan to do that yourself if Arun
> fails to complete the generic variant? (in which case my remark is
> moot)
>
I am currently working on seqlock.h split into seqcount.h and seqlock.h,
because I really need this split, but can take care of this work too, no
problem :
I asked Arun if he wanted to make this himself, because initial idea was
coming from him, not because I did not want to make it ;)
^ permalink raw reply
* Re: Kernel crash after using new Intel NIC (igb)
From: Ingo Molnar @ 2011-05-29 7:38 UTC (permalink / raw)
To: Eric Dumazet
Cc: Arun Sharma, David Miller, Maximilian Engelhardt, linux-kernel,
netdev, StuStaNet Vorstand, Yann Dupont, Denys Fedoryshchenko,
Thomas Gleixner
In-Reply-To: <1306654424.30021.7.camel@edumazet-laptop>
* Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le samedi 28 mai 2011 à 20:04 +0200, Ingo Molnar a écrit :
> > * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > > > +static inline int atomic_add_unless(atomic_t *v, int a, int u)
> > > > +{
> > > > + return __atomic_add_unless(v, a, u) != u;
> > > > }
> > > >
> > > > #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0)
> > >
> > > As I said, atomic_add_unless() has several implementations in
> > > various arches. You must take care of all, not only x86.
> >
> > It's a bit sad to see local hacks to generic facilities committed
> > upstream like that.
> >
>
> Again, its a stable candidate patch, on a file that had many changes in
> recent kernels.
>
> I felt this was the right thing to do, to ease stable teams work.
Yes and i do this all the time as well, to make life easier for the
stable team.
What wasnt fine was to not follow up the backportable hack with the
proper patch though. Or did you plan to do that yourself if Arun
fails to complete the generic variant? (in which case my remark is
moot)
Thanks,
Ingo
^ permalink raw reply
* Re: Kernel crash after using new Intel NIC (igb)
From: Eric Dumazet @ 2011-05-29 7:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Arun Sharma, David Miller, Maximilian Engelhardt, linux-kernel,
netdev, StuStaNet Vorstand, Yann Dupont, Denys Fedoryshchenko,
Thomas Gleixner
In-Reply-To: <20110528180434.GB12530@elte.hu>
Le samedi 28 mai 2011 à 20:04 +0200, Ingo Molnar a écrit :
> * Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > > +static inline int atomic_add_unless(atomic_t *v, int a, int u)
> > > +{
> > > + return __atomic_add_unless(v, a, u) != u;
> > > }
> > >
> > > #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0)
> >
> > As I said, atomic_add_unless() has several implementations in
> > various arches. You must take care of all, not only x86.
>
> It's a bit sad to see local hacks to generic facilities committed
> upstream like that.
>
Again, its a stable candidate patch, on a file that had many changes in
recent kernels.
I felt this was the right thing to do, to ease stable teams work.
Its not like there is an urgent need for this atomic_add_unless_return()
thing that we have to worry right now.
^ permalink raw reply
* Re: [PATCH 1/1] IPVS : bug in ip_vs_ftp, same list heaad used in all netns.
From: Simon Horman @ 2011-05-29 2:10 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: Hans Schillstrom, ja@ssi.bg, wensong@linux-vs.org,
lvs-devel@vger.kernel.org, netdev@vger.kernel.org,
netfilter-devel@vger.kernel.org, hans@schillstrom.com,
David Miller
In-Reply-To: <4DDF6FE5.507@netfilter.org>
On Fri, May 27, 2011 at 11:33:25AM +0200, Pablo Neira Ayuso wrote:
> On 27/05/11 08:04, Simon Horman wrote:
> > On Fri, May 27, 2011 at 07:24:22AM +0200, Hans Schillstrom wrote:
> >> On Friday 27 May 2011 01:37:45 Simon Horman wrote:
> >>> On Thu, May 26, 2011 at 07:17:50PM +0200, Pablo Neira Ayuso wrote:
> >>>> On 24/05/11 14:11, Hans Schillstrom wrote:
> >>>>> When ip_vs was adapted to netns the ftp application was not adapted
> >>>>> in a correct way.
> >>>>> However this is a fix to avoid kernel errors. In the long term another solution
> >>>>> might be chosen. I.e the ports that the ftp appl, uses should be per netns.
> >>>>
> >>>> applied, thanks.
> >>>
> >>> Thanks Pablo.
> >>>
> >>> Hans, is this appropriate for -stable (i.e. 2.6.39.x) ?
> >>
> >> Yes it is.
> >
> > Thanks.
> >
> > Dave, can you handle that once this change makes
> > it into your tree?
>
> http://permalink.gmane.org/gmane.linux.kernel.wireless.general/70374
Thanks, I missed that.
^ permalink raw reply
* Re: [PATCH] ipv4: fix fib metrics
From: Alessandro Suardi @ 2011-05-28 22:00 UTC (permalink / raw)
To: David Woodhouse
Cc: Hiroyuki Kawakatsu, Kyle Moffett, Eric Dumazet, David Miller,
linux-kernel, netdev
In-Reply-To: <1306488451.2029.107.camel@i7.infradead.org>
[-- Attachment #1: Type: text/plain, Size: 1309 bytes --]
On Fri, May 27, 2011 at 11:27 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Fri, 2011-03-25 at 02:25 +0100, Alessandro Suardi wrote:
>>
>> ...which didn't take that long - one last bugging question and I'm happily
>> off to sleep; does ipid always come in the form of 0x followed by four
>> bytes representing hex values ? In a perhaps inelegant but working way
>> (I'm now writing through the VPN tunnel),
>>
>> sed 's/cache//;s/metric \?[0-9]\+ [0-9]\+//g;s/hoplimit
>> [0-9]\+//g;s/ipid 0x....//g'
>>
>> appears to be Work For Me (TM).
>
> Please could I have a tested patch for vpnc-script?
>
> It now lives in its own repository at
> git://, http://git.infradead.org/users/dwmw2/vpnc-scripts.git because
> it's used by openconnect too, and has had various bug fixes for
> cross-platform support and IPv6 since it was forked from vpnc.
I downloaded the git version and checked - the one I use is the Fedora
version which seems updated to perhaps two revisions behind git...
anyway, attaching (in order to not mangle whitespace) the one-liner
change that I've been using since without issues - to the point that I
actually forgot having patched the script...
--alessandro
"There's always a siren singing you to shipwreck"
(Radiohead, "There There")
[-- Attachment #2: vpnc-script.diff --]
[-- Type: application/octet-stream, Size: 410 bytes --]
--- vpnc-scripts-9239bd8/vpnc-script 2010-02-23 19:11:53.000000000 +0100
+++ vpnc-scripts-new/vpnc-script 2011-05-28 23:49:28.000000000 +0200
@@ -139,7 +139,7 @@
if [ -n "$IPROUTE" ]; then
fix_ip_get_output () {
- sed 's/cache//;s/metric \?[0-9]\+ [0-9]\+//g;s/hoplimit [0-9]\+//g'
+ sed 's/cache//;s/metric \?[0-9]\+ [0-9]\+//g;s/hoplimit [0-9]\+//g;s/ipid 0x....//g'
}
set_vpngateway_route() {
^ permalink raw reply
* Re: [PATCHv2 10/14] virtio_net: limit xmit polling
From: Michael S. Tsirkin @ 2011-05-28 20:02 UTC (permalink / raw)
To: Rusty Russell
Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <87vcwyjg2w.fsf-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
On Thu, May 26, 2011 at 12:58:23PM +0930, Rusty Russell wrote:
> On Wed, 25 May 2011 09:07:59 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Wed, May 25, 2011 at 11:05:04AM +0930, Rusty Russell wrote:
> > Hmm I'm not sure I got it, need to think about this.
> > I'd like to go back and document how my design was supposed to work.
> > This really should have been in commit log or even a comment.
> > I thought we need a min, not a max.
> > We start with this:
> >
> > while ((c = (virtqueue_get_capacity(vq) < 2 + MAX_SKB_FRAGS) &&
> > (skb = get_buf)))
> > kfree_skb(skb);
> > return !c;
> >
> > This is clean and simple, right? And it's exactly asking for what we need.
>
> No, I started from the other direction:
>
> for (i = 0; i < 2; i++) {
> skb = get_buf();
> if (!skb)
> break;
> kfree_skb(skb);
> }
>
> ie. free two packets for every one we're about to add. For steady state
> that would work really well.
Sure, with indirect buffers, but if we
don't use indirect (and we discussed switching indirect off
dynamically in the past) this becomes harder to
be sure about. I think I understand why but
does not a simple capacity check make it more obvious?
> Then we hit the case where the ring
> seems full after we do the add: at that point, screw latency, and just
> try to free all the buffers we can.
I see. But the code currently does this:
for(..)
get_buf
add_buf
if (capacity < max_sk_frags+2) {
if (!enable_cb)
for(..)
get_buf
}
In other words the second get_buf is only called
in the unlikely case of race condition.
So we'll need to add *another* call to get_buf.
Is it just me or is this becoming messy?
I was also be worried that we are adding more
"modes" to the code: high and low latency
depending on different speeds between host and guest,
which would be hard to trigger and test.
That's why I tried hard to make the code behave the
same all the time and free up just a bit more than
the minimum necessary.
> > on the normal path min == 2 so we're low latency but we keep ahead on
> > average. min == 0 for the "we're out of capacity, we may have to stop
> > the queue".
> >
> > Does the above make sense at all?
>
> It makes sense, but I think it's a classic case where incremental
> improvements aren't as good as starting from scratch.
>
> Cheers,
> Rusty.
The only difference on good path seems an extra capacity check,
so I don't expect the difference will be testable, do you?
--
MST
^ permalink raw reply
* Re: [PATCH V7 1/4 net-next] sock.h: Add a new sock zero-copy flag
From: Shirley Ma @ 2011-05-28 19:37 UTC (permalink / raw)
To: David Miller
Cc: mst, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev, kvm,
linux-kernel
In-Reply-To: <1306610157.5180.80.camel@localhost.localdomain>
On Sat, 2011-05-28 at 12:15 -0700, Shirley Ma wrote:
> include/net/sock.h | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index f2046e4..2229bd1 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -563,6 +563,7 @@ enum sock_flags {
> SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
> SOCK_FASYNC, /* fasync() active */
> SOCK_RXQ_OVFL,
> + SOCK_ZEROCOPY, /* buffers from userspace */
> };
>
> static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Forgot about this:
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
^ permalink raw reply
* [PATCH V7 4/4 net-next] vhost: vhost TX zero-copy support
From: Shirley Ma @ 2011-05-28 19:34 UTC (permalink / raw)
To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
Cc: netdev, kvm, linux-kernel
Hello Michael,
In order to use wait for completion in shutting down, seems to me
another work thread is needed to call vhost_zerocopy_add_used, it seems
too much work to address a minor issue here. Do we really need it?
Right now, the approach I am using is to ignore outstanding userspace
buffers during shutting down if any, the device might DMAed some wrong
data to the wire, do we really care?
Thanks
Shirley
------------
This patch maintains the outstanding userspace buffers in the
sequence it is delivered to vhost. The outstanding userspace buffers
will be marked as done once the lower device buffers DMA has finished.
This is monitored through last reference of kfree_skb callback. Two
buffer index are used for this purpose.
The vhost passes the userspace buffers info to lower device skb
through message control. Since there will be some done DMAs when
entering vhost handle_tx. The worse case is all buffers in the vq are
in pending/done status, so we need to notify guest to release DMA done
buffers first before get any new buffers from the vq.
Signed-off-by: Shirley <xma@us.ibm.com>
---
drivers/vhost/net.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
drivers/vhost/vhost.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/vhost/vhost.h | 15 +++++++++++++++
3 files changed, 107 insertions(+), 1 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2f7c76a..e2eaba6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,11 @@
* Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000
+/* MAX number of TX used buffers for outstanding zerocopy */
+#define VHOST_MAX_PEND 128
+/* change it to 256 when small message size performance issue is addressed */
+#define VHOST_GOODCOPY_LEN 2048
+
enum {
VHOST_NET_VQ_RX = 0,
VHOST_NET_VQ_TX = 1,
@@ -151,6 +156,10 @@ static void handle_tx(struct vhost_net *net)
hdr_size = vq->vhost_hlen;
for (;;) {
+ /* Release DMAs done buffers first */
+ if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND)
+ vhost_zerocopy_signal_used(vq, false);
+
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
&out, &in,
@@ -166,6 +175,12 @@ static void handle_tx(struct vhost_net *net)
set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
break;
}
+ /* If more outstanding DMAs, queue the work */
+ if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) {
+ tx_poll_start(net, sock);
+ set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ break;
+ }
if (unlikely(vhost_enable_notify(vq))) {
vhost_disable_notify(vq);
continue;
@@ -188,6 +203,26 @@ static void handle_tx(struct vhost_net *net)
iov_length(vq->hdr, s), hdr_size);
break;
}
+ /* use msg_control to pass vhost zerocopy ubuf info to skb */
+ if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
+ vq->heads[vq->upend_idx].id = head;
+ if (len < VHOST_GOODCOPY_LEN)
+ /* copy don't need to wait for DMA done */
+ vq->heads[vq->upend_idx].len =
+ VHOST_DMA_DONE_LEN;
+ else {
+ struct ubuf_info *ubuf = &vq->ubuf_info[head];
+
+ vq->heads[vq->upend_idx].len = len;
+ ubuf->callback = vhost_zerocopy_callback;
+ ubuf->arg = vq;
+ ubuf->desc = vq->upend_idx;
+ msg.msg_control = ubuf;
+ msg.msg_controllen = sizeof(ubuf);
+ }
+ atomic_inc(&vq->refcnt);
+ vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
+ }
/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(NULL, sock, &msg, len);
if (unlikely(err < 0)) {
@@ -198,12 +233,21 @@ static void handle_tx(struct vhost_net *net)
if (err != len)
pr_debug("Truncated TX packet: "
" len %d != %zd\n", err, len);
- vhost_add_used_and_signal(&net->dev, vq, head, 0);
+ if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+ vhost_add_used_and_signal(&net->dev, vq, head, 0);
total_len += len;
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
vhost_poll_queue(&vq->poll);
break;
}
+ /* if upend_idx is full, then wait for free more */
+/*
+ if (unlikely(vq->upend_idx == vq->done_idx)) {
+ tx_poll_start(net, sock);
+ set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ break;
+ }
+*/
}
mutex_unlock(&vq->mutex);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7aa4eea..b76e42a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->call_ctx = NULL;
vq->call = NULL;
vq->log_ctx = NULL;
+ vq->upend_idx = 0;
+ vq->done_idx = 0;
+ atomic_set(&vq->refcnt, 0);
}
static int vhost_worker(void *data)
@@ -232,6 +235,8 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
GFP_KERNEL);
dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
UIO_MAXIOV, GFP_KERNEL);
+ dev->vqs[i].ubuf_info = kmalloc(sizeof *dev->vqs[i].ubuf_info *
+ UIO_MAXIOV, GFP_KERNEL);
if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
!dev->vqs[i].heads)
@@ -244,6 +249,7 @@ err_nomem:
kfree(dev->vqs[i].indirect);
kfree(dev->vqs[i].log);
kfree(dev->vqs[i].heads);
+ kfree(dev->vqs[i].ubuf_info);
}
return -ENOMEM;
}
@@ -385,6 +391,30 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
return 0;
}
+/* In case of DMA done not in order in lower device driver for some reason.
+ * upend_idx is used to track end of used idx, done_idx is used to track head
+ * of used idx. Once lower device DMA done contiguously, we will signal KVM
+ * guest used idx.
+ */
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
+{
+ int i, j = 0;
+
+ for (i = vq->done_idx; i != vq->upend_idx; i = ++i % UIO_MAXIOV) {
+ if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
+ vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+ vhost_add_used_and_signal(vq->dev, vq,
+ vq->heads[i].id, 0);
+ ++j;
+ } else
+ break;
+ }
+ if (j) {
+ vq->done_idx = i;
+ atomic_sub(j, &vq->refcnt);
+ }
+}
+
/* Caller should have device mutex */
void vhost_dev_cleanup(struct vhost_dev *dev)
{
@@ -395,6 +425,9 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
vhost_poll_stop(&dev->vqs[i].poll);
vhost_poll_flush(&dev->vqs[i].poll);
}
+ /* Wait for all lower device DMAs done */
+ if (atomic_read(&dev->vqs[i].refcnt))
+ vhost_zerocopy_signal_used(&dev->vqs[i], true);
if (dev->vqs[i].error_ctx)
eventfd_ctx_put(dev->vqs[i].error_ctx);
if (dev->vqs[i].error)
@@ -603,6 +636,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
mutex_lock(&vq->mutex);
+ /* clean up lower device outstanding DMAs, before setting ring */
+ if (atomic_read(&vq->refcnt))
+ vhost_zerocopy_signal_used(vq, true);
+
switch (ioctl) {
case VHOST_SET_VRING_NUM:
/* Resizing ring with an active backend?
@@ -1416,3 +1453,13 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
vq_err(vq, "Failed to enable notification at %p: %d\n",
&vq->used->flags, r);
}
+
+void vhost_zerocopy_callback(void *arg)
+{
+ struct ubuf_info *ubuf = (struct ubuf_info *)arg;
+ struct vhost_virtqueue *vq;
+
+ vq = (struct vhost_virtqueue *)ubuf->arg;
+ /* set len = 1 to mark this desc buffers done DMA */
+ vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
+}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3363ae..0b11734 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,6 +13,11 @@
#include <linux/virtio_ring.h>
#include <asm/atomic.h>
+/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
+ * done */
+#define VHOST_DMA_DONE_LEN 1
+#define VHOST_DMA_CLEAR_LEN 0
+
struct vhost_device;
struct vhost_work;
@@ -108,6 +113,14 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
+ /* vhost zerocopy support */
+ atomic_t refcnt; /* num of outstanding zerocopy DMAs */
+ /* last used idx for outstanding DMA zerocopy buffers */
+ int upend_idx;
+ /* first used idx for DMA done zerocopy buffers */
+ int done_idx;
+ /* an array of userspace buffers info */
+ struct ubuf_info *ubuf_info;
};
struct vhost_dev {
@@ -154,6 +167,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
unsigned int log_num, u64 len);
+void vhost_zerocopy_callback(void *arg);
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
#define vq_err(vq, fmt, ...) do { \
pr_debug(pr_fmt(fmt), ##__VA_ARGS__); \
^ permalink raw reply related
* [PATCH V7 3/4]macvtap: macvtap TX zero-copy support
From: Shirley Ma @ 2011-05-28 19:25 UTC (permalink / raw)
To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
Cc: netdev, kvm, linux-kernel
Only when buffer size is greater than or equal PAGE_SIZE, macvtap
enables zero-copy.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/net/macvtap.c | 131 ++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 120 insertions(+), 11 deletions(-)
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..a0fe315 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
*/
static dev_t macvtap_major;
#define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN 256
static struct class *macvtap_class;
static struct cdev macvtap_cdev;
@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
{
struct net *net = current->nsproxy->net_ns;
struct net_device *dev = dev_get_by_index(net, iminor(inode));
+ struct macvlan_dev *vlan = netdev_priv(dev);
struct macvtap_queue *q;
int err;
@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
+ /*
+ * so far only KVM virtio_net uses macvtap, enable zero copy between
+ * guest kernel and host kernel when lower device supports zerocopy
+ */
+ if (vlan) {
+ if ((vlan->lowerdev->features & NETIF_F_HIGHDMA) &&
+ (vlan->lowerdev->features & NETIF_F_SG))
+ sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+ }
+
err = macvtap_set_queue(dev, file, q);
if (err)
sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
return skb;
}
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+ int offset, size_t count)
+{
+ int len = iov_length(from, count) - offset;
+ int copy = skb_headlen(skb);
+ int size, offset1 = 0;
+ int i = 0;
+ skb_frag_t *f;
+
+ /* Skip over from offset */
+ while (count && (offset >= from->iov_len)) {
+ offset -= from->iov_len;
+ ++from;
+ --count;
+ }
+
+ /* copy up to skb headlen */
+ while (count && (copy > 0)) {
+ size = min_t(unsigned int, copy, from->iov_len - offset);
+ if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+ size))
+ return -EFAULT;
+ if (copy > size) {
+ ++from;
+ --count;
+ }
+ copy -= size;
+ offset1 += size;
+ offset = 0;
+ }
+
+ if (len == offset1)
+ return 0;
+
+ while (count--) {
+ struct page *page[MAX_SKB_FRAGS];
+ int num_pages;
+ unsigned long base;
+
+ len = from->iov_len - offset1;
+ if (!len) {
+ offset1 = 0;
+ ++from;
+ continue;
+ }
+ base = (unsigned long)from->iov_base + offset1;
+ size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+ num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+ if ((num_pages != size) ||
+ (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+ /* put_page is in skb free */
+ return -EFAULT;
+ skb->data_len += len;
+ skb->len += len;
+ skb->truesize += len;
+ atomic_add(len, &skb->sk->sk_wmem_alloc);
+ while (len) {
+ f = &skb_shinfo(skb)->frags[i];
+ f->page = page[i];
+ f->page_offset = base & ~PAGE_MASK;
+ f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+ skb_shinfo(skb)->nr_frags++;
+ /* increase sk_wmem_alloc */
+ base += f->size;
+ len -= f->size;
+ i++;
+ }
+ offset1 = 0;
+ ++from;
+ }
+ return 0;
+}
+
/*
* macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
* be shared with the tun/tap driver.
@@ -515,16 +601,18 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
/* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
- const struct iovec *iv, size_t count,
- int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+ const struct iovec *iv, unsigned long total_len,
+ size_t count, int noblock)
{
struct sk_buff *skb;
struct macvlan_dev *vlan;
- size_t len = count;
+ unsigned long len = total_len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+ int copylen;
+ bool zerocopy = false;
if (q->flags & IFF_VNET_HDR) {
vnet_hdr_len = q->vnet_hdr_sz;
@@ -552,12 +640,30 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
if (unlikely(len < ETH_HLEN))
goto err;
- skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
- noblock, &err);
+ if (m && m->msg_control)
+ zerocopy = true;
+
+ if (zerocopy) {
+ /* There are 256 bytes to be copied in skb, so there is enough
+ * room for skb expand head in case it is used.
+ * The rest buffer is mapped from userspace.
+ */
+ copylen = vnet_hdr.hdr_len;
+ if (!copylen)
+ copylen = GOODCOPY_LEN;
+ } else
+ copylen = len;
+
+ skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
+ vnet_hdr.hdr_len, noblock, &err);
if (!skb)
goto err;
- err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+ if (zerocopy)
+ err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+ else
+ err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+ len);
if (err)
goto err_kfree;
@@ -573,13 +679,16 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
rcu_read_lock_bh();
vlan = rcu_dereference_bh(q->vlan);
+ /* copy skb_ubuf_info for callback when skb has no error */
+ if (zerocopy)
+ skb_shinfo(skb)->ubuf_arg = m->msg_control;
if (vlan)
macvlan_start_xmit(skb, vlan->dev);
else
kfree_skb(skb);
rcu_read_unlock_bh();
- return count;
+ return total_len;
err_kfree:
kfree_skb(skb);
@@ -601,8 +710,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
ssize_t result = -ENOLINK;
struct macvtap_queue *q = file->private_data;
- result = macvtap_get_user(q, iv, iov_length(iv, count),
- file->f_flags & O_NONBLOCK);
+ result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+ file->f_flags & O_NONBLOCK);
return result;
}
@@ -815,7 +924,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len)
{
struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
- return macvtap_get_user(q, m->msg_iov, total_len,
+ return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
m->msg_flags & MSG_DONTWAIT);
}
^ permalink raw reply related
* [PATCH V7 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
From: Shirley Ma @ 2011-05-28 19:23 UTC (permalink / raw)
To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
Cc: netdev, kvm, linux-kernel
This patch adds userspace buffers support in skb shared info. A new
struct skb_ubuf_info is needed to maintain the userspace buffers
argument and index, a callback is used to notify userspace to release
the buffers once lower device has done DMA (Last reference to that skb
has gone).
If there is any userspace apps to reference these userspace buffers,
then these userspaces buffers will be copied into kernel. This way we
can prevent userspace apps to hold these userspace buffers too long.
One userspace buffer info pointer is added in skb_shared_info. Is that
safe to use destructor_arg? From the comments, this destructor_arg is
used for destructor, destructor calls doesn't wait for last reference to
that skb is gone.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
include/linux/skbuff.h | 24 ++++++++++++++++
net/core/skbuff.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 97 insertions(+), 0 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e8b78ce..37a2cb4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -189,6 +189,17 @@ enum {
SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
};
+/*
+ * The callback notifies userspace to release buffers when skb DMA is done in
+ * lower device, the skb last reference should be 0 when calling this.
+ * The desc is used to track userspace buffer index.
+ */
+struct ubuf_info {
+ void (*callback)(void *);
+ void *arg;
+ unsigned long desc;
+};
+
/* This data is invariant across clones and lives at
* the end of the header data, ie. at skb->end.
*/
@@ -203,6 +214,9 @@ struct skb_shared_info {
struct sk_buff *frag_list;
struct skb_shared_hwtstamps hwtstamps;
+ /* DMA mapping from/to userspace buffers info */
+ void *ubuf_arg;
+
/*
* Warning : all fields before dataref are cleared in __alloc_skb()
*/
@@ -2261,5 +2275,15 @@ static inline void skb_checksum_none_assert(struct sk_buff *skb)
}
bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
+
+/*
+ * skb *uarg - is the buffer from userspace
+ * @skb: buffer to check
+ */
+static inline int skb_ubuf(const struct sk_buff *skb)
+{
+ return skb_shinfo(skb)->ubuf_arg != NULL;
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_SKBUFF_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 46cbd28..115c5ca 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -329,6 +329,19 @@ static void skb_release_data(struct sk_buff *skb)
put_page(skb_shinfo(skb)->frags[i].page);
}
+ /*
+ * If skb buf is from userspace, we need to notify the caller
+ * the lower device DMA has done;
+ */
+ if (skb_ubuf(skb)) {
+ struct ubuf_info *uarg;
+
+ uarg = (struct ubuf_info *)skb_shinfo(skb)->ubuf_arg;
+
+ if (uarg->callback)
+ uarg->callback(uarg);
+ }
+
if (skb_has_frag_list(skb))
skb_drop_fraglist(skb);
@@ -481,6 +494,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
if (irqs_disabled())
return false;
+ if (skb_ubuf(skb))
+ return false;
+
if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
return false;
@@ -573,6 +589,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
atomic_set(&n->users, 1);
atomic_inc(&(skb_shinfo(skb)->dataref));
+ skb_shinfo(skb)->ubuf_arg = NULL;
skb->cloned = 1;
return n;
@@ -596,6 +613,51 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
}
EXPORT_SYMBOL_GPL(skb_morph);
+/* skb frags copy userspace buffers to kernel */
+static int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+{
+ int i;
+ int num_frags = skb_shinfo(skb)->nr_frags;
+ struct page *page, *head = NULL;
+ struct ubuf_info *uarg = skb_shinfo(skb)->ubuf_arg;
+
+ for (i = 0; i < num_frags; i++) {
+ u8 *vaddr;
+ skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+
+ page = alloc_page(GFP_ATOMIC);
+ if (!page) {
+ while (head) {
+ put_page(head);
+ head = (struct page *)head->private;
+ }
+ return -ENOMEM;
+ }
+ vaddr = kmap_skb_frag(&skb_shinfo(skb)->frags[i]);
+ memcpy(page_address(page),
+ vaddr + f->page_offset, f->size);
+ kunmap_skb_frag(vaddr);
+ page->private = (unsigned long)head;
+ head = page;
+ }
+
+ /* skb frags release userspace buffers */
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+ put_page(skb_shinfo(skb)->frags[i].page);
+
+ uarg->callback(uarg);
+ skb_shinfo(skb)->ubuf_arg = NULL;
+
+ /* skb frags point to kernel buffers */
+ for (i = skb_shinfo(skb)->nr_frags; i > 0; i--) {
+ skb_shinfo(skb)->frags[i - 1].page_offset = 0;
+ skb_shinfo(skb)->frags[i - 1].page = head;
+ head = (struct page *)head->private;
+ }
+ return 0;
+}
+
+
/**
* skb_clone - duplicate an sk_buff
* @skb: buffer to clone
@@ -614,6 +676,11 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
{
struct sk_buff *n;
+ if (skb_ubuf(skb)) {
+ if (skb_copy_ubufs(skb, gfp_mask))
+ return NULL;
+ }
+
n = skb + 1;
if (skb->fclone == SKB_FCLONE_ORIG &&
n->fclone == SKB_FCLONE_UNAVAILABLE) {
@@ -731,6 +798,12 @@ struct sk_buff *pskb_copy(struct sk_buff *skb, gfp_t gfp_mask)
if (skb_shinfo(skb)->nr_frags) {
int i;
+ if (skb_ubuf(skb)) {
+ if (skb_copy_ubufs(skb, gfp_mask)) {
+ kfree(n);
+ goto out;
+ }
+ }
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
get_page(skb_shinfo(n)->frags[i].page);
^ permalink raw reply related
* [PATCH V7 1/4 net-next] sock.h: Add a new sock zero-copy flag
From: Shirley Ma @ 2011-05-28 19:15 UTC (permalink / raw)
To: David Miller, mst, Eric Dumazet, Avi Kivity, Arnd Bergmann
Cc: netdev, kvm, linux-kernel
include/net/sock.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index f2046e4..2229bd1 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -563,6 +563,7 @@ enum sock_flags {
SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
SOCK_FASYNC, /* fasync() active */
SOCK_RXQ_OVFL,
+ SOCK_ZEROCOPY, /* buffers from userspace */
};
static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox