Netdev List
* Re: [IPSEC]: Fix transport-mode async resume on input without netfilter
From: David Miller @ 2007-12-31  5:11 UTC
  To: herbert; +Cc: netdev
In-Reply-To: <20071231003117.GA595@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 31 Dec 2007 11:31:18 +1100

> [IPSEC]: Fix transport-mode async resume on input without netfilter
> 
> When netfilter is off the transport-mode async resumption doesn't work
> because we don't push back the IP header.  This patch fixes that by
> moving most of the code outside of ifdef NETFILTER since the only part
> that's not common is the short-circuit in the protocol handler.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied.


* Re: [IPSEC]: Fix double free on skb on async output
From: David Miller @ 2007-12-31  5:10 UTC
  To: herbert; +Cc: netdev
In-Reply-To: <20071230123603.GA24629@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 30 Dec 2007 23:36:04 +1100

> [IPSEC]: Fix double free on skb on async output
> 
> When the output transform returns EINPROGRESS due to async operation we'll
> free the skb straight away as if it were an error.  This patch fixes
> that so that the skb is freed when the async operation completes.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied.
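
The ownership rule this fix relies on can be modelled in a few lines of C. This is only a user-space sketch under stated assumptions: `struct buf`, `buf_free`, `async_done` and `output_one` are hypothetical names, not the xfrm output code. It shows that a return of -EINPROGRESS leaves the buffer with the async operation, so the caller must not free it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical packet buffer that counts how often it is freed. */
struct buf { int frees; };

void buf_free(struct buf *b) { b->frees++; }

/* Called by the (simulated) async engine when the work is done. */
void async_done(struct buf *b) { buf_free(b); }

/*
 * 'err' is what the transform returned. -EINPROGRESS means the
 * transform still owns the buffer and its completion callback will
 * release it. Any other non-zero value is a real error that the
 * caller must clean up.
 */
int output_one(struct buf *b, int err)
{
	if (err == -EINPROGRESS)
		return err;      /* not an error: do not free here */
	if (err) {
		buf_free(b);     /* genuine failure: caller frees */
		return err;
	}
	return 0;                /* success: buffer is passed on */
}
```

With this split, each buffer is freed exactly once whichever path it takes.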


* Re: [PATCH][ROSE][AX25] af_ax25: possible circular locking
From: David Miller @ 2007-12-31  5:00 UTC
  To: jarkao2; +Cc: f6bvp, ralf, adobriyan, netdev
In-Reply-To: <20071230141323.GA3377@ami.dom.local>

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 30 Dec 2007 15:13:23 +0100

> On Sat, Dec 29, 2007 at 07:14:43PM -0800, David Miller wrote:
> ...
> "again" loop should skip this entry next time: s->ax25_dev should
> be NULL, so not equal to ax25dev. (But of course there could be
> some place unknown to me that changes this back in the meantime.)

Indeed you are right.

> > You'll thus need to resolve this locking conflict more properly.
> > I know it's hard, but your current fix is worse because it adds
> > a new known bug.
> 
> Sorry, it seems this will have to wait until Ralf finds some time,
> because I really can't give this more time, considering I have never
> used this code and have no plans to, and this bug could suggest
> that not many people are interested in it either, besides
> Bernard.

I'll try to look at this.
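
For readers following the "again" loop argument, the restart pattern can be sketched in plain C. This is a hypothetical user-space model, not the af_ax25 code: `struct sock_entry`, `kill_by_device` and the array standing in for the socket list are all assumptions. It shows only why a restarted scan does not revisit an entry once its device pointer has been cleared.

```c
#include <assert.h>
#include <stddef.h>

struct device { int id; };

/* Hypothetical stand-in for a socket control block. */
struct sock_entry {
	struct device *dev;   /* NULL once the entry has been detached */
};

/*
 * Detach every entry bound to 'dev'. After each hit the scan restarts
 * from the top, as it would have to if a lock had been dropped.
 * Returns the number of restarts, which is bounded by the number of
 * matching entries because a detached entry no longer matches.
 */
int kill_by_device(struct sock_entry *list, size_t n, struct device *dev)
{
	int restarts = 0;
	size_t i;

again:
	for (i = 0; i < n; i++) {
		if (list[i].dev == dev) {
			list[i].dev = NULL;   /* entry is skipped next time */
			restarts++;
			goto again;
		}
	}
	return restarts;
}
```

The sketch deliberately leaves out the case raised in the thread, where some other code path sets the pointer back while the lock is dropped.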


* net-2.6.25 rebased...
From: David Miller @ 2007-12-31  4:54 UTC
  To: netdev


Since I knew there would be some TCP conflicts etc.
I worked last night and today on a rebase, and I just
pushed that out at the usual location:

	kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.25.git

Don't worry about respinning patches, I'll take care of things.


* [IPSEC]: Move all calls to xfrm_audit_state_icvfail to xfrm_input
From: Herbert Xu @ 2007-12-31  4:23 UTC
  To: David S. Miller, netdev

Hi Dave:

While refreshing the async IPsec patches I noticed some fresh code
duplication.

[IPSEC]: Move all calls to xfrm_audit_state_icvfail to xfrm_input
    
Let's nip the code duplication in the bud :)
    
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index ec8de0a..d76803a 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -179,10 +179,8 @@ static int ah_input(struct xfrm_state *x, struct sk_buff *skb)
 		err = ah_mac_digest(ahp, skb, ah->auth_data);
 		if (err)
 			goto unlock;
-		if (memcmp(ahp->work_icv, auth_data, ahp->icv_trunc_len)) {
-			xfrm_audit_state_icvfail(x, skb, IPPROTO_AH);
+		if (memcmp(ahp->work_icv, auth_data, ahp->icv_trunc_len))
 			err = -EBADMSG;
-		}
 	}
 unlock:
 	spin_unlock(&x->lock);
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index b334c76..28ea5c7 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -191,7 +191,6 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
 			BUG();
 
 		if (unlikely(memcmp(esp->auth.work_icv, sum, alen))) {
-			xfrm_audit_state_icvfail(x, skb, IPPROTO_ESP);
 			err = -EBADMSG;
 			goto unlock;
 		}
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index 2d32772..fb0d07a 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -380,10 +380,8 @@ static int ah6_input(struct xfrm_state *x, struct sk_buff *skb)
 		err = ah_mac_digest(ahp, skb, ah->auth_data);
 		if (err)
 			goto unlock;
-		if (memcmp(ahp->work_icv, auth_data, ahp->icv_trunc_len)) {
-			xfrm_audit_state_icvfail(x, skb, IPPROTO_AH);
+		if (memcmp(ahp->work_icv, auth_data, ahp->icv_trunc_len))
 			err = -EBADMSG;
-		}
 	}
 unlock:
 	spin_unlock(&x->lock);
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index e10f10b..5bd5292 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -186,7 +186,6 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
 			BUG();
 
 		if (unlikely(memcmp(esp->auth.work_icv, sum, alen))) {
-			xfrm_audit_state_icvfail(x, skb, IPPROTO_ESP);
 			ret = -EBADMSG;
 			goto unlock;
 		}
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 1b250f3..039e701 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -186,8 +186,11 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 resume:
 		spin_lock(&x->lock);
 		if (nexthdr <= 0) {
-			if (nexthdr == -EBADMSG)
+			if (nexthdr == -EBADMSG) {
+				xfrm_audit_state_icvfail(x, skb,
+							 x->type->proto);
 				x->stats.integrity_failed++;
+			}
 			XFRM_INC_STATS(LINUX_MIB_XFRMINSTATEPROTOERROR);
 			goto drop_unlock;
 		}

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
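
The shape of this cleanup can be sketched outside the kernel. The following is a minimal model under stated assumptions: `struct state`, `audit_icvfail`, `handler_input` and `common_input` are hypothetical stand-ins, not the real xfrm structures or functions. It shows protocol handlers reporting an ICV mismatch only through -EBADMSG, with the common input path doing the audit and the counter update in one place.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-state counters; not the kernel's struct xfrm_state. */
struct state {
	int proto;              /* protocol number of the transform */
	int audit_calls;        /* how many ICV failures were audited */
	int last_audit_proto;   /* protocol passed to the last audit call */
	int integrity_failed;   /* statistics counter */
};

void audit_icvfail(struct state *x, int proto)
{
	x->audit_calls++;
	x->last_audit_proto = proto;
}

/* A protocol handler only reports the failure through its return value. */
int handler_input(int icv_matches)
{
	return icv_matches ? 0 : -EBADMSG;
}

/* The common input path audits and counts in one place. */
int common_input(struct state *x, int icv_matches)
{
	int err = handler_input(icv_matches);

	if (err == -EBADMSG) {
		audit_icvfail(x, x->proto);
		x->integrity_failed++;
	}
	return err;
}
```

Because the protocol number comes from the state itself, the handlers no longer need to know which constant to pass to the audit call.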


* [IPSEC]: Fix transport-mode async resume on input without netfilter
From: Herbert Xu @ 2007-12-31  0:31 UTC
  To: David S. Miller, netdev

Hi Dave:

Found another bug on the input path when NETFILTER is off.

[IPSEC]: Fix transport-mode async resume on input without netfilter

When netfilter is off the transport-mode async resumption doesn't work
because we don't push back the IP header.  This patch fixes that by
moving most of the code outside of ifdef NETFILTER since the only part
that's not common is the short-circuit in the protocol handler.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index 33f990d..390dcb1 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -51,7 +51,11 @@ int xfrm4_transport_finish(struct sk_buff *skb, int async)
 
 	iph->protocol = XFRM_MODE_SKB_CB(skb)->protocol;
 
-#ifdef CONFIG_NETFILTER
+#ifndef CONFIG_NETFILTER
+	if (!async)
+		return -iph->protocol;
+#endif
+
 	__skb_push(skb, skb->data - skb_network_header(skb));
 	iph->tot_len = htons(skb->len);
 	ip_send_check(iph);
@@ -59,12 +63,6 @@ int xfrm4_transport_finish(struct sk_buff *skb, int async)
 	NF_HOOK(PF_INET, NF_INET_PRE_ROUTING, skb, skb->dev, NULL,
 		xfrm4_rcv_encap_finish);
 	return 0;
-#else
-	if (async)
-		return xfrm4_rcv_encap_finish(skb);
-
-	return -iph->protocol;
-#endif
 }
 
 /* If it's a keepalive packet, then just eat it.
diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c
index 063ce6e..a4714d7 100644
--- a/net/ipv6/xfrm6_input.c
+++ b/net/ipv6/xfrm6_input.c
@@ -34,19 +34,17 @@ int xfrm6_transport_finish(struct sk_buff *skb, int async)
 	skb_network_header(skb)[IP6CB(skb)->nhoff] =
 		XFRM_MODE_SKB_CB(skb)->protocol;
 
-#ifdef CONFIG_NETFILTER
+#ifndef CONFIG_NETFILTER
+	if (!async)
+		return 1;
+#endif
+
 	ipv6_hdr(skb)->payload_len = htons(skb->len);
 	__skb_push(skb, skb->data - skb_network_header(skb));
 
 	NF_HOOK(PF_INET6, NF_INET_PRE_ROUTING, skb, skb->dev, NULL,
 		ip6_rcv_finish);
 	return -1;
-#else
-	if (async)
-		return ip6_rcv_finish(skb);
-
-	return 1;
-#endif
 }
 
 int xfrm6_rcv(struct sk_buff *skb)

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
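
The control flow after this patch can be modelled in user space. This is a sketch under stated assumptions: `struct pkt`, `push_back_header` and `transport_finish` are hypothetical, and the compile-time CONFIG_NETFILTER switch is represented by a run-time `have_netfilter` argument so both configurations can be exercised. It shows that only the synchronous, no-netfilter case short-circuits; every other case restores the IP header first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced packet buffer. */
struct pkt {
	size_t network_header;  /* offset of the IP header in the buffer */
	size_t data;            /* offset of the current payload start */
	size_t len;             /* bytes from 'data' to the end */
	int pushed;             /* set once the IP header is exposed again */
};

/* Move 'data' back to the network header and grow 'len' to match. */
void push_back_header(struct pkt *p)
{
	size_t hdr = p->data - p->network_header;

	p->data -= hdr;
	p->len += hdr;
	p->pushed = 1;
}

/*
 * Returns a negative protocol number when the caller can hand the
 * payload straight to the protocol handler, and 0 when the packet
 * continues with its header restored.
 */
int transport_finish(struct pkt *p, int protocol, int async,
		     int have_netfilter)
{
	if (!have_netfilter && !async)
		return -protocol;   /* the only part that is not common */

	push_back_header(p);        /* common to every other case */
	return 0;
}
```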


* Re: [Patch 2.6.22.2 ] : drivers/net/via-rhine.c:   Offload checksum handling to VT6105M
From: Willy Tarreau @ 2007-12-30 22:33 UTC
  To: K Naru; +Cc: rl, linux-kernel, netdev
In-Reply-To: <20070818052351.GN6002@1wt.eu>

Hi Kim,

On Fri, Aug 17, 2007 at 11:34:37AM -0700, K Naru wrote:
> From: Kim Naru (squat_rack@yahoo.com)
> 
> Added support to offload TCP/UDP/IP checksum to the
> VIA Technologies VT6105M chip.
> Firstly, let the stack know this chip is capable of
> doing its own checksum(IPV4 only).
> Secondly offload checksum to VT6105M, if necessary. 
> 
> 
> Verbose Mode:
> 
> #1. Define 3 bits(18,19,20) in Transmit Descriptor 1
> of chip, which affect checksum processing.
> The prefix(TDES1) for the 3 variables is the short
> name for Transmit Descriptor 1.
> #2. In rhine_init_one(), if pci_rev >=  VT6105M then
> set  NETIF_F_IP_CSUM(see skbuff.h for details).
> #3. In rhine_start_tx() if NETIF_F_IP_CSUM is set AND
> the stack requires a checksum then
> set either bit 19(UDP),20(TCP) AND bit 18(IP).
> 
> Note : The numbered items above(i.e.#1,#2,#3) denote
> pseudo code.
> 
> This patch was developed and tested on Imedia
> linux-2.6.20 under a PC-Engines Alix System board
> (www.pcengines.ch/alix.htm). It was tested(compilation
> only) on linux-2.6.22.2. The minor code change between
> 2.6.20 and 2.6.22 is the use of ip_hdr() in 2.6.22.
> 
> In 2.6.20 :
>                 struct iphdr *ip = skb->nh.iph;
> In 2.6.22 :
>                 const struct iphdr *ip = ip_hdr(skb);
> 
> Testing:
> 
> 
> ttcp, netperf, ftp and top were used. There appears to
> be a small CPU utilization gain. Throughput results
> were more inconclusive.
> 
> The data sheet used to get information is 'VT6105M
> Data Sheet, Revision 1.63  June21,2006'.
> 
> Signed-off-by: Kim Naru (squat_rack@yahoo.com)
> 
> ---

Well, I've reformatted your patch so that it can be applied, and very
slightly arranged it in order to save 13 bytes of code and a few CPU
cycles.

Also, I moved the if block before the spinlock as there is no reason
for this code to be run with the lock held.

I have run some performance measurements on an ALIX 3C2 motherboard
with a 2.6.22-stable kernel. What I see is a reduction of CPU usage
by about 20% when the network is saturated, but also a reduction of
the network speed by 8%!

Without the patch, I can produce a continuous traffic of about 99 Mbps with
about 11% CPU (system only, 89% idle).

With the patch, the traffic drops to 91 Mbps but CPU usage decreases to 9%.

Now, if I reduce the MTU to exactly 1000, then the traffic increases to about
98 Mbps, but it progressively reduces when the MTU moves away from 1000.

So I have run some deeper tests consisting of leaving NETIF_F_IP_CSUM unset
and still asking the NIC to compute the checksums. The conclusion is very
clear: as soon as *any* checksum bit is set (IP, TCP, UDP), the traffic
immediately drops.

I think that what happens is that the NIC is not pipelined at all and that
no data is transferred while a checksum is being computed. This would also
explain why reducing the MTU increases performance, since it reduces the
time required to compute a checksum, reducing the off time. And the more I
think about it, the more I think this is the problem, because the VT6105M
has a 2kB transmit buffer, so it cannot checksum a 1.5kB frame while sending
another one if it does it inside the buffer.

And I'm pretty sure that the checksum is computed in the buffer and that the
data is not transferred twice on the bus, because playing with PCI latency
timer and other parameters does not change anything.

So basically, we're there with a chip which can offload the CPU by performing
the checksums itself, but it reduces performance for packets larger than 1kB
(or possibly 500 bytes if there's a 1.5kB packet being transferred).

The driver should be adjusted to permit the user to enable and disable this
feature with ethtool. Right now, its status can only be consulted, and I'm
using dd on /dev/mem and /dev/kmem to change the values on the fly.

Given the fact that a 20% reduction on CPU usage which was already 10% only
leaves a net gain of about 2% more CPU available, I'm not convinced that there
is any advantage in enabling this feature by default with this NIC.
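
The trade-off in the figures above can be checked with integer arithmetic. This is a small illustrative sketch; the helper names `cpu_gain_tenths` and `throughput_loss_pct` are made up for the purpose. It uses the numbers quoted in this mail: roughly 10% baseline CPU, a 20% relative reduction, and 99 Mbps dropping to 91 Mbps.

```c
#include <assert.h>

/*
 * Absolute CPU share freed, in tenths of a percentage point, given the
 * baseline usage and the relative reduction (both in whole percent).
 */
int cpu_gain_tenths(int baseline_pct, int reduction_pct)
{
	return baseline_pct * reduction_pct / 10;
}

/* Relative throughput loss in whole percent, rounded down. */
int throughput_loss_pct(int before_mbps, int after_mbps)
{
	return (before_mbps - after_mbps) * 100 / before_mbps;
}
```

With a 10% baseline and a 20% reduction the first helper yields 20 tenths, that is 2 percentage points of CPU, set against a throughput loss of about 8%.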

Here's the updated patch for reference (maybe you'd want to enhance it).

--- linux-2.6.22-wt3/drivers/net/via-rhine.c	2007-11-22 17:48:34 +0100
+++ linux-2.6.22-wt3.via-cksum/drivers/net/via-rhine.c	2007-12-30 20:53:30 +0100
@@ -95,6 +95,8 @@
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
 #include <linux/skbuff.h>
+#include <linux/in.h>
+#include <linux/ip.h>
 #include <linux/init.h>
 #include <linux/delay.h>
 #include <linux/mii.h>
@@ -343,6 +345,9 @@
 
 /* Initial value for tx_desc.desc_length, Buffer size goes to bits 0-10 */
 #define TXDESC		0x00e08000
+#define TDES1_TCPCK	0x00100000  /* Bit 20, Transmit Desc 1  */
+#define TDES1_UDPCK	0x00080000  /* Bit 19, Transmit Desc 1  */
+#define TDES1_IPCK	0x00040000  /* Bit 18, Transmit Desc 1  */
 
 enum rx_status_bits {
 	RxOK=0x8000, RxWholePkt=0x0300, RxErr=0x008F
@@ -788,6 +793,9 @@
 	if (rp->quirks & rqRhineI)
 		dev->features |= NETIF_F_SG|NETIF_F_HW_CSUM;
 
+	if (pci_rev >=  VT6105M)
+		dev->features |= NETIF_F_IP_CSUM;   /* chip does checksum */
+
 	/* dev->name not defined before register_netdev()! */
 	rc = register_netdev(dev);
 	if (rc)
@@ -1260,6 +1268,18 @@
 	rp->tx_ring[entry].desc_length =
 		cpu_to_le32(TXDESC | (skb->len >= ETH_ZLEN ? skb->len : ETH_ZLEN));
 
+	if ((dev->features & NETIF_F_IP_CSUM) &&
+	    (skb->ip_summed == CHECKSUM_PARTIAL)) {
+		/* Offload checksum to chip. */
+		const struct iphdr *ip = ip_hdr(skb);
+		unsigned long flag;
+
+		flag = (ip->protocol == IPPROTO_TCP) ? TDES1_TCPCK|TDES1_IPCK :
+		       (ip->protocol == IPPROTO_UDP) ? TDES1_UDPCK|TDES1_IPCK :
+		       TDES1_IPCK;
+		rp->tx_ring[entry].desc_length |= flag;
+	}
+
 	/* lock eth irq */
 	spin_lock_irq(&rp->lock);
 	wmb();

Best regards,
Willy
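
The flag selection in the patch can be tried out in isolation. The sketch below copies the three TDES1 bit values from the patch and rewrites the nested conditional as a switch; `csum_flags`, `PROTO_TCP` and `PROTO_UDP` are names chosen for this example (6 and 17 are the standard IP protocol numbers for TCP and UDP), and nothing here touches real hardware.

```c
#include <assert.h>
#include <stdint.h>

#define TDES1_TCPCK 0x00100000u  /* bit 20, Transmit Desc 1 */
#define TDES1_UDPCK 0x00080000u  /* bit 19, Transmit Desc 1 */
#define TDES1_IPCK  0x00040000u  /* bit 18, Transmit Desc 1 */

#define PROTO_TCP 6
#define PROTO_UDP 17

/* Descriptor bits to request for a packet of the given IP protocol. */
uint32_t csum_flags(int protocol)
{
	switch (protocol) {
	case PROTO_TCP:
		return TDES1_TCPCK | TDES1_IPCK;
	case PROTO_UDP:
		return TDES1_UDPCK | TDES1_IPCK;
	default:
		return TDES1_IPCK;   /* IP header checksum only */
	}
}
```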



* Re: 2.6.24-rc6-mm1
From: Torsten Kaiser @ 2007-12-30 21:35 UTC
  To: J. Bruce Fields
  Cc: Andrew Morton, linux-kernel, Neil Brown, netdev, Tom Tucker
In-Reply-To: <20071230212443.GA23320@fieldses.org>

On Dec 30, 2007 10:24 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>
> On Fri, Dec 28, 2007 at 03:07:46PM -0800, Andrew Morton wrote:
> > On Fri, 28 Dec 2007 23:53:49 +0100 "Torsten Kaiser" <just.for.lkml@googlemail.com> wrote:
> >
> > > On Dec 23, 2007 5:27 PM, Torsten Kaiser <just.for.lkml@googlemail.com> wrote:
> > > > On Dec 23, 2007 8:30 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > >
> > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/
> > > > I have finally given up on using 2.6.24-rc3-mm2 with slub_debug=FZP to
> > > > get more information out of the random crashes I had seen with that
> > > > version. (Did not crash once with slub_debug, so no new information on
> > > > what the cause was)
> > >
> > > Murphy: Just after sending that mail the system crashed two times with
> > > slub_debug=FZP, but did not show any new information.
> > > No debug output from slub, only this stacktrace: (It's the same I
> > > already reported in the 2.6.24-rc3-mm2 thread)
> > >
[snip]
> > > [ 7620.708561] Pid: 5698, comm: nfsv4-svc Not tainted 2.6.24-rc3-mm2 #2
[snip]
> >
> > That looks like a sunrpc bug.  git-nfsd has been mucking around in there a
> > bit.
>
> Can you still reproduce this?  Tom thought there was a chance the
> following could fix it.

Please see also http://lkml.org/lkml/2007/12/29/76

Just wanted to say that slub_debug did not help to get more information.

I will try to reproduce this with rc3-mm2 and the below patch tomorrow.
Without slub_debug this seemed to trigger rather reliably when trying
to update/upgrade packages on my system.

> From:   Tom Tucker <tom@opengridcomputing.com>
> Date:   Sun, 30 Dec 2007 10:07:17 -0600
>
> Bruce/Aime:
>
> Here is what I believe to be the fix for the crashes/svc_xprt BUG_ON
> that people are seeing. It would be great if those who have seen this
> problem could apply this patch and see if it resolves their problem.
>
> The common code calls svc_xprt_received on behalf of the transport.
> Since the provider was calling it as well, this resulted in clearing the
> busy bit/resetting xpt_pool when the BUSY bit wasn't held.
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 4628881..4d39db1 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1272,7 +1272,6 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
>
>         if ((svsk = svc_setup_socket(serv, sock, &error, flags)) != NULL) {
>                 svc_xprt_set_local(&svsk->sk_xprt, newsin, newlen);
> -               svc_xprt_received(&svsk->sk_xprt);
>                 return (struct svc_xprt *)svsk;
>         }

I will send a mail, when I'm done with testing this...

Thanks for the patch.

Torsten


* Re: 2.6.24-rc6-mm1
From: J. Bruce Fields @ 2007-12-30 21:24 UTC
  To: Andrew Morton
  Cc: Torsten Kaiser, linux-kernel, Neil Brown, netdev, Tom Tucker
In-Reply-To: <20071228150746.42b3bbc0.akpm@linux-foundation.org>

On Fri, Dec 28, 2007 at 03:07:46PM -0800, Andrew Morton wrote:
> On Fri, 28 Dec 2007 23:53:49 +0100 "Torsten Kaiser" <just.for.lkml@googlemail.com> wrote:
> 
> > On Dec 23, 2007 5:27 PM, Torsten Kaiser <just.for.lkml@googlemail.com> wrote:
> > > On Dec 23, 2007 8:30 AM, Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/
> > > I have finally given up on using 2.6.24-rc3-mm2 with slub_debug=FZP to
> > > get more information out of the random crashes I had seen with that
> > > version. (Did not crash once with slub_debug, so no new information on
> > > what the cause was)
> > 
> > Murphy: Just after sending that mail the system crashed two times with
> > slub_debug=FZP, but did not show any new information.
> > No debug output from slub, only this stacktrace: (It's the same I
> > already reported in the 2.6.24-rc3-mm2 thread)
> > 
> > [ 7620.673012] ------------[ cut here ]------------
> > [ 7620.676291] kernel BUG at lib/list_debug.c:33!
> > [ 7620.679440] invalid opcode: 0000 [1] SMP
> > [ 7620.682319] last sysfs file:
> > /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
> > [ 7620.687845] CPU 0
> > [ 7620.689300] Modules linked in: radeon drm nfsd exportfs w83792d
> > ipv6 tuner tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx
> > tea5761 tvaudio msp3400 bttv ir_common compat_ioctl32 videobuf_dma_sg
> > videobuf_core btcx_risc tveeprom videodev usbhid v4l2_common
> > v4l1_compat hid i2c_nforce2 sg pata_amd
> > [ 7620.708561] Pid: 5698, comm: nfsv4-svc Not tainted 2.6.24-rc3-mm2 #2
> > [ 7620.713080] RIP: 0010:[<ffffffff803bae54>]  [<ffffffff803bae54>]
> > __list_add+0x54/0x60
> > [ 7620.718667] RSP: 0018:ffff81011bca1dc0  EFLAGS: 00010282
> > [ 7620.722439] RAX: 0000000000000088 RBX: ffff81011c862c48 RCX: 0000000000000002
> > [ 7620.727504] RDX: ffff81011bc82ef0 RSI: 0000000000000001 RDI: ffffffff807590c0
> > [ 7620.732581] RBP: ffff81011bca1dc0 R08: 0000000000000001 R09: 0000000000000000
> > [ 7620.737658] R10: ffff810080058d48 R11: 0000000000000001 R12: ffff81011ed8d1c8
> > [ 7620.742711] R13: ffff81011ed8d200 R14: ffff81011ed8d200 R15: ffff81011cc0e578
> > [ 7620.747806] FS:  00007ffe400116f0(0000) GS:ffffffff807d4000(0000)
> > knlGS:00000000f73558e0
> > [ 7620.753535] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [ 7620.757607] CR2: 00000000017071dc CR3: 00000001188b5000 CR4: 00000000000006e0
> > [ 7620.762677] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 7620.767748] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 7620.772808] Process nfsv4-svc (pid: 5698, threadinfo
> > FFFF81011BCA0000, task FFFF81011BC82EF0)
> > [ 7620.778872] Stack:  ffff81011bca1e00 ffffffff805be26e
> > ffff81011ed8d1d0 ffff81011cc0e578
> > [ 7620.784626]  ffff81011c862c48 ffff81011c8be000 ffff810054a8b060
> > ffff81011cc0e588
> > [ 7620.789913]  ffff81011bca1e10 ffffffff805be367 ffff81011bca1ee0
> > ffffffff805bf0ac
> > [ 7620.795062] Call Trace:
> > [ 7620.796941]  [<ffffffff805be26e>] svc_xprt_enqueue+0x1ae/0x250
> > [ 7620.801087]  [<ffffffff805be367>] svc_xprt_received+0x17/0x20
> > [ 7620.805199]  [<ffffffff805bf0ac>] svc_recv+0x39c/0x840
> > [ 7620.808851]  [<ffffffff805bea3f>] svc_send+0xaf/0xd0
> > [ 7620.812374]  [<ffffffff8022f590>] default_wake_function+0x0/0x10
> > [ 7620.816637]  [<ffffffff803163ea>] nfs_callback_svc+0x7a/0x130
> > [ 7620.820712]  [<ffffffff805cfea2>] trace_hardirqs_on_thunk+0x35/0x3a
> > [ 7620.825174]  [<ffffffff80259f8f>] trace_hardirqs_on+0xbf/0x160
> > [ 7620.829335]  [<ffffffff8020cbc8>] child_rip+0xa/0x12
> > [ 7620.832842]  [<ffffffff8020c2df>] restore_args+0x0/0x30
> > [ 7620.836554]  [<ffffffff80316370>] nfs_callback_svc+0x0/0x130
> > [ 7620.840564]  [<ffffffff8020cbbe>] child_rip+0x0/0x12
> > [ 7620.844102]
> > [ 7620.845168] INFO: lockdep is turned off.
> > [ 7620.847964]
> > [ 7620.847965] Code: 0f 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 8b 16
> > 48 89 e5 e8
> > [ 7620.854334] RIP  [<ffffffff803bae54>] __list_add+0x54/0x60
> > [ 7620.858255]  RSP <ffff81011bca1dc0>
> > [ 7620.860724] Kernel panic - not syncing: Aiee, killing interrupt handler!
> > 
> 
> That looks like a sunrpc bug.  git-nfsd has been mucking around in there a
> bit.

Can you still reproduce this?  Tom thought there was a chance the
following could fix it.

--b.

From:	Tom Tucker <tom@opengridcomputing.com>
Date:	Sun, 30 Dec 2007 10:07:17 -0600

Bruce/Aime:

Here is what I believe to be the fix for the crashes/svc_xprt BUG_ON
that people are seeing. It would be great if those who have seen this
problem could apply this patch and see if it resolves their problem. 

The common code calls svc_xprt_received on behalf of the transport.
Since the provider was calling it as well, this resulted in clearing the
busy bit/resetting xpt_pool when the BUSY bit wasn't held. 
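
The failure mode described above can be illustrated with a toy model. This is a hedged sketch, not the sunrpc code: `struct xprt`, `xprt_acquire` and `xprt_received` are hypothetical, and the real BUSY bit is handled with atomic bit operations rather than a plain int. It shows that a second release, issued by a caller that does not hold the flag, is the kind of event a BUG_ON would catch.

```c
#include <assert.h>

/* Hypothetical transport with a single BUSY flag. */
struct xprt {
	int busy;           /* 1 while some thread owns the transport */
	int bad_releases;   /* releases attempted without ownership */
};

/* Try to take ownership; returns 1 on success, 0 if already busy. */
int xprt_acquire(struct xprt *x)
{
	if (x->busy)
		return 0;
	x->busy = 1;
	return 1;
}

/* Release ownership; a release without ownership is recorded as a bug. */
void xprt_received(struct xprt *x)
{
	if (!x->busy)
		x->bad_releases++;
	x->busy = 0;
}
```

Calling `xprt_received` once from the common code and once more from the provider produces exactly one recorded bad release in this model.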

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 4628881..4d39db1 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1272,7 +1272,6 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
 
 	if ((svsk = svc_setup_socket(serv, sock, &error, flags)) != NULL) {
 		svc_xprt_set_local(&svsk->sk_xprt, newsin, newlen);
-		svc_xprt_received(&svsk->sk_xprt);
 		return (struct svc_xprt *)svsk;
 	}
 

-


* Re: [usb regression] Re: [PATCH 2.6.24-rc3] Fix /proc/net breakage
From: Alan Stern @ 2007-12-30 20:34 UTC
  To: Andreas Mohr
  Cc: Ingo Molnar, Alexey Dobriyan, Andrew Morton, David Woodhouse,
	Eric W. Biederman, Linus Torvalds, Rafael J. Wysocki,
	Pavel Machek, kernel list, netdev, Pavel Emelyanov,
	Denis V. Lunev, Greg KH
In-Reply-To: <20071230161405.GA27788@elte.hu>

On Sun, 30 Dec 2007, Ingo Molnar wrote:

> * Andreas Mohr <andi@lisas.de> wrote:
> 
> > (yes, that's all there is, despite CONFIG_USB_DEBUG being set)
> > 
> > The LED of a usb stick isn't active either, for obvious reasons.
> > 
> > And keep in mind that this is a (relatively old) OHCI-only machine... 
> > (which had the 2.6.19 lsmod showing ohci-hcd just fine and working 
> > fine with WLAN USB)
> > 
> > Now pondering whether to try -rc6 proper or whether to revert specific 
> > guilty-looking USB changes... And wondering how to properly elevate 
> > this issue (prompt Greg about it, new thread, bug #, ...?)

It looks like Greg misused the debugfs API -- which is ironic, because
he wrote debugfs in the first place!  :-)

Let me know if this patch fixes the problem.  If it does, I'll submit 
it to Greg with all the proper accoutrements.

Alan Stern
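
The difference between the two checks can be reproduced in user space. The sketch below rests on an assumption taken from the debugfs documentation of that period rather than from this thread: the create functions return NULL on failure, and an error pointer carrying -ENODEV only when debugfs is not configured in. `err_ptr`, `is_err`, `dir_failed_strict` and `dir_failed_null_only` are imitations written for this example, not the kernel helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel's error-pointer helpers. */
void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-MAX_ERRNO);
}

/* Check that rejects error pointers as well as NULL (the removed form). */
int dir_failed_strict(const void *dentry)
{
	return !dentry || is_err(dentry);
}

/* Check that treats only NULL as failure (the form the patch keeps). */
int dir_failed_null_only(const void *dentry)
{
	return !dentry;
}
```

Under that assumption, the strict check makes driver initialisation fail on a kernel built without debugfs, while the NULL-only check lets it proceed.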


Index: 2.6.24-rc6-mm1/drivers/usb/host/ohci-hcd.c
===================================================================
--- 2.6.24-rc6-mm1.orig/drivers/usb/host/ohci-hcd.c
+++ 2.6.24-rc6-mm1/drivers/usb/host/ohci-hcd.c
@@ -1067,14 +1067,8 @@ static int __init ohci_hcd_mod_init(void
 
 #ifdef DEBUG
 	ohci_debug_root = debugfs_create_dir("ohci", NULL);
-	if (!ohci_debug_root || IS_ERR(ohci_debug_root)) {
-		if (!ohci_debug_root)
-			retval = -ENOENT;
-		else
-			retval = PTR_ERR(ohci_debug_root);
-
-		goto error_debug;
-	}
+	if (!ohci_debug_root)
+		return -ENOENT;
 #endif
 
 #ifdef PS3_SYSTEM_BUS_DRIVER
@@ -1142,7 +1136,6 @@ static int __init ohci_hcd_mod_init(void
 #ifdef DEBUG
 	debugfs_remove(ohci_debug_root);
 	ohci_debug_root = NULL;
- error_debug:
 #endif
 
 	return retval;
Index: 2.6.24-rc6-mm1/drivers/usb/host/ohci-dbg.c
===================================================================
--- 2.6.24-rc6-mm1.orig/drivers/usb/host/ohci-dbg.c
+++ 2.6.24-rc6-mm1/drivers/usb/host/ohci-dbg.c
@@ -813,30 +813,29 @@ static inline void create_debug_files (s
 	struct device *dev = bus->dev;
 
 	ohci->debug_dir = debugfs_create_dir(bus->bus_name, ohci_debug_root);
-	if (!ohci->debug_dir || IS_ERR(ohci->debug_dir)) {
-		ohci->debug_dir = NULL;
-		goto done;
-	}
+	if (!ohci->debug_dir)
+		return;
 
 	ohci->debug_async = debugfs_create_file("async", S_IRUGO,
 						ohci->debug_dir, dev,
 						&debug_async_fops);
-	if (!ohci->debug_async || IS_ERR(ohci->debug_async))
+	if (!ohci->debug_async)
 		goto async_error;
 
 	ohci->debug_periodic = debugfs_create_file("periodic", S_IRUGO,
 						   ohci->debug_dir, dev,
 						   &debug_periodic_fops);
-	if (!ohci->debug_periodic || IS_ERR(ohci->debug_periodic))
+	if (!ohci->debug_periodic)
 		goto periodic_error;
 
 	ohci->debug_registers = debugfs_create_file("registers", S_IRUGO,
 						    ohci->debug_dir, dev,
 						    &debug_registers_fops);
-	if (!ohci->debug_registers || IS_ERR(ohci->debug_registers))
+	if (!ohci->debug_registers)
 		goto registers_error;
 
-	goto done;
+	ohci_dbg(ohci, "created debug files\n");
+	return;
 
 registers_error:
 	debugfs_remove(ohci->debug_periodic);
@@ -847,10 +846,6 @@ periodic_error:
 async_error:
 	debugfs_remove(ohci->debug_dir);
 	ohci->debug_dir = NULL;
-done:
-	return;
-
-	ohci_dbg (ohci, "created debug files\n");
 }
 
 static inline void remove_debug_files (struct ohci_hcd *ohci)
Index: 2.6.24-rc6-mm1/drivers/usb/host/ehci-hcd.c
===================================================================
--- 2.6.24-rc6-mm1.orig/drivers/usb/host/ehci-hcd.c
+++ 2.6.24-rc6-mm1/drivers/usb/host/ehci-hcd.c
@@ -1019,14 +1019,8 @@ static int __init ehci_hcd_init(void)
 
 #ifdef DEBUG
 	ehci_debug_root = debugfs_create_dir("ehci", NULL);
-	if (!ehci_debug_root || IS_ERR(ehci_debug_root)) {
-		if (!ehci_debug_root)
-			retval = -ENOENT;
-		else
-			retval = PTR_ERR(ehci_debug_root);
-
-		return retval;
-	}
+	if (!ehci_debug_root)
+		return -ENOENT;
 #endif
 
 #ifdef PLATFORM_DRIVER
Index: 2.6.24-rc6-mm1/drivers/usb/host/ehci-dbg.c
===================================================================
--- 2.6.24-rc6-mm1.orig/drivers/usb/host/ehci-dbg.c
+++ 2.6.24-rc6-mm1/drivers/usb/host/ehci-dbg.c
@@ -914,30 +914,28 @@ static inline void create_debug_files (s
 	struct usb_bus *bus = &ehci_to_hcd(ehci)->self;
 
 	ehci->debug_dir = debugfs_create_dir(bus->bus_name, ehci_debug_root);
-	if (!ehci->debug_dir || IS_ERR(ehci->debug_dir)) {
-		ehci->debug_dir = NULL;
-		goto done;
-	}
+	if (!ehci->debug_dir)
+		return;
 
 	ehci->debug_async = debugfs_create_file("async", S_IRUGO,
 						ehci->debug_dir, bus,
 						&debug_async_fops);
-	if (!ehci->debug_async || IS_ERR(ehci->debug_async))
+	if (!ehci->debug_async)
 		goto async_error;
 
 	ehci->debug_periodic = debugfs_create_file("periodic", S_IRUGO,
 						   ehci->debug_dir, bus,
 						   &debug_periodic_fops);
-	if (!ehci->debug_periodic || IS_ERR(ehci->debug_periodic))
+	if (!ehci->debug_periodic)
 		goto periodic_error;
 
 	ehci->debug_registers = debugfs_create_file("registers", S_IRUGO,
 						    ehci->debug_dir, bus,
 						    &debug_registers_fops);
-	if (!ehci->debug_registers || IS_ERR(ehci->debug_registers))
+	if (!ehci->debug_registers)
 		goto registers_error;
 
-	goto done;
+	return;
 
 registers_error:
 	debugfs_remove(ehci->debug_periodic);
@@ -948,9 +946,6 @@ periodic_error:
 async_error:
 	debugfs_remove(ehci->debug_dir);
 	ehci->debug_dir = NULL;


* Re: [PATCH] Add Linksys WCF11 to hostap driver
From: Pavel Roskin @ 2007-12-30 16:39 UTC
  To: Marcin Juszkiewicz
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA, Jouni Malinen,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <200712301600.46342.openembedded-sbG99h22wavzs492ZWNkEA@public.gmane.org>

Quoting Marcin Juszkiewicz <openembedded-sbG99h22wavzs492ZWNkEA@public.gmane.org>:

> +       PCMCIA_DEVICE_PROD_ID123(
> +               "The Linksys Group, Inc.", "Wireless Network CF Card", "ISL37300P",
> +               0xa5f472c2, 0x9c05598d, 0xc9049a39),

Acked-by: Pavel Roskin <proski-mXXj517/zsQ@public.gmane.org>

-- 
Regards,
Pavel Roskin


* [usb regression] Re: [PATCH 2.6.24-rc3] Fix /proc/net breakage
From: Ingo Molnar @ 2007-12-30 16:14 UTC
  To: Andreas Mohr
  Cc: Alexey Dobriyan, Andrew Morton, David Woodhouse,
	Eric W. Biederman, Linus Torvalds, Rafael J. Wysocki,
	Pavel Machek, kernel list, netdev, Pavel Emelyanov,
	Denis V. Lunev, Alan Stern
In-Reply-To: <20071228072116.GA16553@rhlx01.hs-esslingen.de>


* Andreas Mohr <andi@lisas.de> wrote:

> (yes, that's all there is, despite CONFIG_USB_DEBUG being set)
> 
> The LED of a usb stick isn't active either, for obvious reasons.
> 
> And keep in mind that this is a (relatively old) OHCI-only machine... 
> (which had the 2.6.19 lsmod showing ohci-hcd just fine and working 
> fine with WLAN USB)
> 
> Now pondering whether to try -rc6 proper or whether to revert specific 
> guilty-looking USB changes... And wondering how to properly elevate 
> this issue (prompt Greg about it, new thread, bug #, ...?)

Cc:-ed Alan Stern for the USB regression.

	Ingo


* Re: A question about hh_cache
From: Andi Kleen @ 2007-12-30 15:36 UTC
  To: Andy Johnson; +Cc: Andi Kleen, netdev
In-Reply-To: <147a89290712300712p5c586f57wff1b29e9fa807dee@mail.gmail.com>

On Sun, Dec 30, 2007 at 05:12:05PM +0200, Andy Johnson wrote:
> Can there be a case in such environment (using IPv4 only) where
> hh_cache of a neighbour instance is a
> list with more than one entry ?

In theory, if some device supplies multiple dst_ops of its own with different
protocol numbers.  From a quick grep, that doesn't happen in tree.

-Andi
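
To make the question concrete, a per-neighbour list of cached headers keyed by protocol can be sketched as follows. This is a hypothetical model, not the kernel's struct hh_cache or neighbour code: `struct hh_entry`, `struct neigh`, `hh_get`, `hh_count` and `hh_free` are names invented here. It shows the list growing past one entry only when a second protocol type is requested for the same neighbour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cached link-layer header, one per protocol type. */
struct hh_entry {
	unsigned short type;       /* e.g. 0x0800 for IPv4, 0x86DD for IPv6 */
	struct hh_entry *next;
};

struct neigh { struct hh_entry *hh; };

/* Return the entry for 'type', creating and linking it if absent. */
struct hh_entry *hh_get(struct neigh *n, unsigned short type)
{
	struct hh_entry *hh;

	for (hh = n->hh; hh; hh = hh->next)
		if (hh->type == type)
			return hh;

	hh = malloc(sizeof(*hh));
	if (!hh)
		return NULL;
	hh->type = type;
	hh->next = n->hh;
	n->hh = hh;
	return hh;
}

/* Number of cached headers currently attached to the neighbour. */
size_t hh_count(const struct neigh *n)
{
	const struct hh_entry *hh;
	size_t c = 0;

	for (hh = n->hh; hh; hh = hh->next)
		c++;
	return c;
}

/* Release every cached header. */
void hh_free(struct neigh *n)
{
	while (n->hh) {
		struct hh_entry *next = n->hh->next;

		free(n->hh);
		n->hh = next;
	}
}
```

With IPv4 as the only protocol in use, repeated lookups in this model always return the same single entry.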


* Re: A question about hh_cache
From: Andy Johnson @ 2007-12-30 15:12 UTC
  To: Andi Kleen; +Cc: netdev
In-Reply-To: <p73abns1be1.fsf@bingen.suse.de>

Hello,
First, thanks for your reply.

Second, apart from when using both IPV4 and IPV6:
Suppose I work only in IPv4 (and this is indeed the case).

Can there be a case in such environment (using IPv4 only) where
hh_cache of a neighbour instance is a
list with more than one entry ?

Regards,
Andy Johnson



On Dec 30, 2007 4:54 PM, Andi Kleen <andi@firstfloor.org> wrote:
> "Andy Johnson" <johnsonzjo@gmail.com> writes:
>
> > Can anybody give an example when hh_cache of a neighbour instance is a
> > list with more than
> > one entry ?
>
> When you're talking to the same host on a local ethernet with both native IPv4
> and native IPv6.
>
> -Andi
>


* Re: [PATCH] Add Linksys WCF11 to hostap driver
From: Marcin Juszkiewicz @ 2007-12-30 15:00 UTC
  To: linux-wireless-u79uwXL29TY76Z2rM5mHXA
  Cc: Pavel Roskin, Jouni Malinen, netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1198901689.2221.3.camel@dv>

On Saturday, 29 December 2007, Pavel Roskin wrote:
> Hello!
>
> On Fri, 2007-12-28 at 14:16 +0100, Marcin Juszkiewicz wrote:
> > +       PCMCIA_DEVICE_PROD_ID1234(
> > +               "The Linksys Group, Inc.", "Wireless Network CF
> > Card", "ISL37300P", +               "RevA",
> > +               0xa5f472c2, 0x9c05598d, 0xc9049a39, 0x57a66194),
>
> I would prefer that revisions are not used for card identification in
> presence of the chip name.  It's not like RevB would need another
> driver as long as "ISL37300P" is still a part of the PCMCIA ID.
> PCMCIA_DEVICE_PROD_ID123 should be enough.

OK - updated patch below:

From: Marcin Juszkiewicz <openembedded-sbG99h22wavzs492ZWNkEA@public.gmane.org>

Linksys WCF11 submitted by Ångström user.

"The Linksys Group, Inc.", "Wireless Network CF Card", "ISL37300P", "RevA",
0xa5f472c2, 0x9c05598d, 0xc9049a39, 0x57a66194 
manfid: 0x0274, 0x3301

Signed-off-by: Marcin Juszkiewicz <openembedded-sbG99h22wavzs492ZWNkEA@public.gmane.org>

---
 drivers/net/wireless/hostap/hostap_cs.c |    3 +++
 1 file changed, 3 insertions(+)

--- linux-2.6.23.orig/drivers/net/wireless/hostap/hostap_cs.c
+++ linux-2.6.23/drivers/net/wireless/hostap/hostap_cs.c
@@ -887,10 +887,13 @@ static struct pcmcia_device_id hostap_cs
                "Ver. 1.00",
                0x5cd01705, 0x4271660f, 0x9d08ee12),
        PCMCIA_DEVICE_PROD_ID123(
                "corega", "WL PCCL-11", "ISL37300P",
                0xa21501a, 0x59868926, 0xc9049a39),
+       PCMCIA_DEVICE_PROD_ID123(
+               "The Linksys Group, Inc.", "Wireless Network CF Card", "ISL37300P",
+               0xa5f472c2, 0x9c05598d, 0xc9049a39),
        PCMCIA_DEVICE_NULL
 };
 MODULE_DEVICE_TABLE(pcmcia, hostap_cs_ids);


-- 
JID: hrw-jabber.org
OpenEmbedded developer/consultant

Revenge is the delight of mediocre, weak and petty minds. And yet it
does not cease to be a delight

^ permalink raw reply

* Re: A question about hh_cache
From: Andi Kleen @ 2007-12-30 14:54 UTC (permalink / raw)
  To: Andy Johnson; +Cc: netdev
In-Reply-To: <147a89290712300531m4a8845betd23ef71f6d944799@mail.gmail.com>

"Andy Johnson" <johnsonzjo@gmail.com> writes:

> Can anybody give an example when hh_cache of a neighbour instance is a
> list with more than
> one entry ?

When you're talking to the same host on a local ethernet with both native IPv4 
and native IPv6.

-Andi
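
A minimal user-space sketch of the idea in this answer, assuming the list keeps one cached header per protocol type. The names hh_entry, hh_lookup_or_add, hh_count and the PROTO_* constants are hypothetical stand-ins for illustration, not the kernel's definitions:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the kernel's struct hh_cache. */
struct hh_entry {
	struct hh_entry *hh_next;	/* next cached header of this neighbour */
	unsigned short hh_type;		/* protocol the header was built for */
};

#define PROTO_IPV4 0x0800	/* EtherType for IPv4 */
#define PROTO_IPV6 0x86DD	/* EtherType for IPv6 */

/* Return the entry cached for 'type', linking a new one if none exists. */
static struct hh_entry *hh_lookup_or_add(struct hh_entry **head,
					 unsigned short type)
{
	struct hh_entry *hh;

	assert(head != NULL);
	for (hh = *head; hh; hh = hh->hh_next)
		if (hh->hh_type == type)
			return hh;	/* already cached for this protocol */

	hh = malloc(sizeof(*hh));
	if (!hh)
		return NULL;
	hh->hh_type = type;
	hh->hh_next = *head;		/* push onto the front of the list */
	*head = hh;
	return hh;
}

/* Count the entries on the list. */
static int hh_count(const struct hh_entry *head)
{
	int n = 0;

	for (; head; head = head->hh_next)
		n++;
	return n;
}
```

Calling hh_lookup_or_add() twice with PROTO_IPV4 leaves one entry on the list; a later call with PROTO_IPV6 adds a second one, which is the only way this sketch ever grows past a single entry.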

^ permalink raw reply

* Re: [PATCH][ROSE][AX25] af_ax25: possible circular locking
From: Jarek Poplawski @ 2007-12-30 14:13 UTC (permalink / raw)
  To: David Miller; +Cc: f6bvp, ralf, adobriyan, netdev
In-Reply-To: <20071229.191443.10947724.davem@davemloft.net>

On Sat, Dec 29, 2007 at 07:14:43PM -0800, David Miller wrote:
...
> You can't just drop this linked list lock and expect it to stay
> consistent like that.
> 
> Once you drop it, any thread of control can get in there and delete
> entries from the list.
> 
> Since we know it can happen, using a WARN_ON_ONCE(1) is not
> appropriate.

The problem is 'we' don't know if it can happen... In the first
message with this patch I've tried to get this information, and
now it seems you are the only one with this knowledge, but of
course this is more than enough for me to agree with your decision
to dump this patch.

> [...]  And if it triggers it will do the wrong thing, because
> by branching back to "again" we can call ax25_disconnect() multiple
> times on the same entry which isn't right.

This is an equivalent of list_for_each_entry_safe(), and if such
WARN_ON is ever reported this would confirm the solution is wrong.
But, it seems Bernard's case was too simple to trigger this bug.
Alas it was complex enough to trigger this other bug...

"again" loop should skip this entry next time: s->ax25_dev should
be NULL, so not equal to ax25dev. (But of course there could be
this unknown to me place, which changes this back in the meantime.)

> You'll thus need to resolve this locking conflict more properly.
> I know it's hard, but your current fix is worse because it adds
> a new known bug.

Sorry, it seems this will have to wait until Ralf finds some time,
because I really can't give this more time, considering I have never
used, nor have any plans to use, this code, and this bug could
suggest there are not so many people interested in it, besides Bernard,
either.

Regards,
Jarek P.

^ permalink raw reply

* A question about hh_cache
From: Andy Johnson @ 2007-12-30 13:31 UTC (permalink / raw)
  To: netdev

Hi,

struct neighbour has a member named hh_cache; this
hh_cache caches link layer headers to speed up transmission
(instead of filling them again and again for each packet).

The hh_cache struct is defined in linux/netdevice.h.
Its first member is hh_next, which is a pointer to
hh_cache.
I tried to think of scenarios where this hh_next pointer is used,
and could not think of anything.

Each neighbour has one ip address (the primary key), so it has one
dev member (which is the net_device on which that primary key is configured),
so it has one mac address (which is the dev_addr of net_device).
So there should be only one entry in that hh_cache; it should be
the entry whose hh_data (the ethernet header) is composed of the
dev_addr, the source MAC address, and the type.

Can anybody give an example when hh_cache of a neighbour instance is a
list with more than
one entry ?

Regards,
Johnson

^ permalink raw reply

* [IPSEC]: Fix double free on skb on async output
From: Herbert Xu @ 2007-12-30 12:36 UTC (permalink / raw)
  To: David S. Miller, netdev

Hi Dave:

Here's a fix for a silly bug on the async output path for IPsec.

[IPSEC]: Fix double free on skb on async output

When the output transform returns EINPROGRESS due to async operation we'll
free the skb straight away as if it were an error.  This patch fixes
that so that the skb is freed when the async operation completes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index c1ba63e..81ad8eb 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -78,6 +78,8 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 		spin_unlock_bh(&x->lock);
 
 		err = x->type->output(x, skb);
+		if (err == -EINPROGRESS)
+			goto out_exit;
 
 resume:
 		if (err) {
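
The ownership rule behind this fix can be sketched in user space as follows. This is only an illustration under stated assumptions: struct pkt, pkt_free, output_one, async_complete and the two transform functions are hypothetical names, and pkt_free merely counts calls so that a double free would be visible as a count of 2.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb; counts how often it was "freed". */
struct pkt {
	int free_count;
};

static void pkt_free(struct pkt *p)
{
	assert(p != NULL);
	p->free_count++;
}

/* A transform returns 0, a negative error, or -EINPROGRESS when it
 * has queued the packet for asynchronous processing. */
typedef int (*transform_fn)(struct pkt *p);

static int transform_async(struct pkt *p) { (void)p; return -EINPROGRESS; }
static int transform_fail(struct pkt *p)  { (void)p; return -EINVAL; }

/* Submit one packet; free it only if we still own it. */
static int output_one(struct pkt *p, transform_fn transform)
{
	int err = transform(p);

	if (err == -EINPROGRESS)
		return err;	/* the async path owns p now: do not free */
	if (err)
		pkt_free(p);	/* synchronous failure: still ours to free */
	return err;
}

/* Completion handler for the async path; frees the packet on error. */
static void async_complete(struct pkt *p, int err)
{
	if (err)
		pkt_free(p);
}
```

Without the early return for -EINPROGRESS, a failing async operation would free the packet once in output_one() and once more in async_complete().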

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: [PATCH/RFC] [v3] TCP: use non-delayed ACK for congestion control RTT
From: Gavin McCullagh @ 2007-12-30 12:35 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: David Miller, Netdev
In-Reply-To: <Pine.LNX.4.64.0712301139150.31698@kivilampi-30.cs.helsinki.fi>

On Sun, 30 Dec 2007, Ilpo Järvinen wrote:

> I guess the non-minimum TSO delays are only significant in case there was 
> something unexpected happening and in such case we definitely want to 
> have some measurements taken.

Broadly speaking, delay-based schemes need as much information about the
queueing delay as possible so more points are generally useful,
particularly if the delay is fluctuating rapidly.

When we started looking at delay-based schemes we had trouble with delay
information fluctuating wildly due to TSO.  John Heffner made a change
(that I can't find at the minute) which reduced a TSO timeout and seemed to
reduce this problem greatly.  It's worth bearing in mind though that TSO
may cause spuriously high delay measurements.  

Gavin


^ permalink raw reply

* Re: [PATCH/RFC] [v3] TCP: use non-delayed ACK for congestion control RTT
From: Gavin McCullagh @ 2007-12-30 12:20 UTC (permalink / raw)
  To: David Miller; +Cc: ilpo.jarvinen, netdev
In-Reply-To: <20071229.190917.130399687.davem@davemloft.net>

Hi,

On Sat, 29 Dec 2007, David Miller wrote:

> Never mind about making the relative patch, I didn't want to have
> to wait for you to send me that and have it block my merge of
> fixes with Linus this evening.

Ah, sorry for the inconvenience.  I didn't realise you had merged it yet.

> The following is what I applied on top of your other patch:

That looks fine.

Gavin


^ permalink raw reply

* Re: [PATCH/RFC] [v3] TCP: use non-delayed ACK for congestion control RTT
From: Ilpo Järvinen @ 2007-12-30  9:43 UTC (permalink / raw)
  To: Gavin McCullagh; +Cc: David Miller, Netdev
In-Reply-To: <20071230011500.GA30997@nuim.ie>

On Sun, 30 Dec 2007, Gavin McCullagh wrote:

> Hi,
> 
> On Fri, 21 Dec 2007, Ilpo Järvinen wrote:
> 
> > > I need to re-read properly, but I think the same problem affects the
> > > microsecond values where TCP_CONG_RTT_STAMP is set (used by vegas, veno,
> > > yeah, illinois).  I might follow up with another patch which changes the
> > > behaviour where TCP_CONG_RTT_STAMP when I'm more sure of that.
> >
> > Please do, you might have to remove fully_acked checks to do that right 
> > though so it won't be as straight-forward change as this one and requires 
> > some amount of thinking to result in a right thing.
> 
> The TCP_CONG_RTT_STAMP code does need to be fixed similarly.  A combined
> patch will follow this mail.  As I understand it, the fully_acked checks
> kick in where the ACK is a portion of a TSO chunk and doesn't completely
> ACK that chunk.  Leaving the checks in place means you get one rtt for each
> TSO chunk, based on the ACK for the last byte of the chunk.  This rtt
> therefore is the maximum available and includes the time-lag between the
> first and last chunk being acked.  Removing the tests gives you an RTT
> value for each ACK in a tso chunk, including the minimum and maximum.  It
> seems the minimum rtt is the best indicator of congestion.  On the other
> hand having all available RTTs gives the congestion avoidance greater
> knowledge of how the RTT is evolving (albeit somewhat coloured by TSO
> delays which don't seem too severe in my tests).

I guess the non-minimum TSO delays are only significant in case there was 
something unexpected happening and in such case we definitely want to 
have some measurements taken.
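
To make the trade-off discussed above concrete, here is a small sketch, assuming a single TSO chunk sent at one timestamp and acknowledged in pieces. The helper names and the microsecond timestamps are hypothetical; the real kernel bookkeeping is more involved.

```c
#include <assert.h>
#include <stddef.h>

struct rtt_summary {
	long min_us;		/* smallest RTT sample seen */
	long max_us;		/* largest RTT sample seen */
	size_t samples;		/* number of samples taken */
};

/* One RTT sample per ACK of the chunk (no fully-acked check).
 * ack_us[] holds ACK arrival times in ascending order, n >= 1. */
static struct rtt_summary rtt_every_ack(long sent_us, const long *ack_us,
					size_t n)
{
	struct rtt_summary s;
	size_t i;

	assert(ack_us != NULL && n >= 1);
	s.min_us = s.max_us = ack_us[0] - sent_us;
	s.samples = n;
	for (i = 1; i < n; i++) {
		long rtt = ack_us[i] - sent_us;

		if (rtt < s.min_us)
			s.min_us = rtt;
		if (rtt > s.max_us)
			s.max_us = rtt;
	}
	return s;
}

/* One RTT sample per chunk, taken only when the last byte is ACKed. */
static long rtt_fully_acked_only(long sent_us, const long *ack_us, size_t n)
{
	assert(ack_us != NULL && n >= 1);
	return ack_us[n - 1] - sent_us;
}
```

With a chunk sent at 1000 us and ACKs arriving at 1500, 1520 and 1560 us, the per-chunk variant reports only 560 us, while the per-ACK variant also exposes the 500 us minimum.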


-- 
 i.

^ permalink raw reply

* Re: [PATCH 2/3] [UDP]: memory accounting in IPv4
From: Eric Dumazet @ 2007-12-30  9:28 UTC (permalink / raw)
  To: Hideo AOKI
  Cc: David Miller, Herbert Xu, netdev, Takahiro Yasui,
	Masami Hiramatsu, Satoshi Oshima, billfink, Andi Kleen,
	Evgeniy Polyakov, Stephen Hemminger, yoshfuji, Yumiko Sugita
In-Reply-To: <47775E9A.40105@redhat.com>

Hideo AOKI wrote:
> This patch adds UDP memory usage accounting in IPv4. Currently,
> only receive buffer accounting is supported.
> 
> This patch also introduces the memory_allocated variable for the UDP protocol.

OK, but sockstat_seq_show() should be updated so that this 
udp_memory_allocated value can be read from /proc/net/sockstat ?

Thank you


^ permalink raw reply

* [PATCH 3/3] [UDP]: memory accounting in IPv6
From: Hideo AOKI @ 2007-12-30  9:02 UTC (permalink / raw)
  To: David Miller, Herbert Xu, netdev
  Cc: Takahiro Yasui, Masami Hiramatsu, Satoshi Oshima, billfink,
	Andi Kleen, Evgeniy Polyakov, Stephen Hemminger, yoshfuji,
	Yumiko Sugita, haoki
In-Reply-To: <47775D8C.5010104@redhat.com>

This patch adds UDP memory usage accounting in IPv6. Currently,
only receive buffer accounting is supported.

This patch also introduces the memory_allocated variable for the UDP protocol.

Cc: Satoshi Oshima <satoshi.oshima.fk@hitachi.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Signed-off-by: Hideo Aoki <haoki@redhat.com>
---

 udp.c |   32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff -pruN net-2.6.25-t12t19m-p6/net/ipv6/udp.c net-2.6.25-t12t19m-p7/net/ipv6/udp.c
--- net-2.6.25-t12t19m-p6/net/ipv6/udp.c	2007-12-27 10:19:02.000000000 -0500
+++ net-2.6.25-t12t19m-p7/net/ipv6/udp.c	2007-12-29 21:57:41.000000000 -0500
@@ -204,13 +204,17 @@ try_again:
 		err = ulen;

 out_free:
+	lock_sock(sk);
 	skb_free_datagram(sk, skb);
+	release_sock(sk);
 out:
 	return err;

 csum_copy_err:
+	lock_sock(sk);
 	if (!skb_kill_datagram(sk, skb, flags))
 		UDP6_INC_STATS_USER(UDP_MIB_INERRORS, is_udplite);
+	release_sock(sk);

 	if (flags & MSG_DONTWAIT)
 		return -EAGAIN;
@@ -366,10 +370,21 @@ static int __udp6_lib_mcast_deliver(stru
 	while ((sk2 = udp_v6_mcast_next(sk_next(sk2), uh->dest, daddr,
 					uh->source, saddr, dif))) {
 		struct sk_buff *buff = skb_clone(skb, GFP_ATOMIC);
-		if (buff)
-			udpv6_queue_rcv_skb(sk2, buff);
+		if (buff) {
+			bh_lock_sock_nested(sk2);
+			if (!sock_owned_by_user(sk2))
+				udpv6_queue_rcv_skb(sk2, buff);
+			else
+				sk_add_backlog(sk2, buff);
+			bh_unlock_sock(sk2);
+		}
 	}
-	udpv6_queue_rcv_skb(sk, skb);
+	bh_lock_sock_nested(sk);
+	if (!sock_owned_by_user(sk))
+		udpv6_queue_rcv_skb(sk, skb);
+	else
+		sk_add_backlog(sk, skb);
+	bh_unlock_sock(sk);
 out:
 	read_unlock(&udp_hash_lock);
 	return 0;
@@ -482,7 +497,12 @@ int __udp6_lib_rcv(struct sk_buff *skb,

 	/* deliver */

-	udpv6_queue_rcv_skb(sk, skb);
+	bh_lock_sock_nested(sk);
+	if (!sock_owned_by_user(sk))
+		udpv6_queue_rcv_skb(sk, skb);
+	else
+		sk_add_backlog(sk, skb);
+	bh_unlock_sock(sk);
 	sock_put(sk);
 	return 0;

@@ -994,6 +1014,10 @@ struct proto udpv6_prot = {
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
 	.get_port	   = udp_v6_get_port,
+	.memory_allocated  = &udp_memory_allocated,
+	.sysctl_mem	   = sysctl_udp_mem,
+	.sysctl_wmem	   = &sysctl_udp_wmem_min,
+	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp6_sock),
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_udpv6_setsockopt,
-- 
Hitachi Computer Products (America) Inc.

^ permalink raw reply

* [PATCH 1/3] [UDP]: add udp_mem, udp_rmem_min and udp_wmem_min
From: Hideo AOKI @ 2007-12-30  9:01 UTC (permalink / raw)
  To: David Miller, Herbert Xu, netdev
  Cc: Takahiro Yasui, Masami Hiramatsu, Satoshi Oshima, billfink,
	Andi Kleen, Evgeniy Polyakov, Stephen Hemminger, yoshfuji,
	Yumiko Sugita, haoki
In-Reply-To: <47775D8C.5010104@redhat.com>

This patch adds sysctl parameters for customizing UDP memory accounting:
     /proc/sys/net/ipv4/udp_mem
     /proc/sys/net/ipv4/udp_rmem_min
     /proc/sys/net/ipv4/udp_wmem_min

udp_mem indicates the number of pages that can be used by all UDP sockets.
A UDP packet is dropped when the number of pages used for socket buffers
exceeds udp_mem and the socket already consumes its minimum buffer.
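
The drop rule described above might be sketched like this. It is a reading of the description only, not the patch's code: udp_would_drop and its parameters are hypothetical, the global limit is assumed to be the "max" element of udp_mem, and the exact boundary comparisons are a guess.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: decide whether an incoming packet is dropped.
 *   pages_all_udp  pages currently used by all UDP sockets
 *   udp_mem_max    assumed global limit (the "max" element of udp_mem)
 *   sock_bytes     bytes already queued on this socket
 *   udp_rmem_min   per-socket minimum receive buffer, in bytes
 */
static bool udp_would_drop(long pages_all_udp, long udp_mem_max,
			   long sock_bytes, long udp_rmem_min)
{
	assert(pages_all_udp >= 0 && sock_bytes >= 0);

	if (pages_all_udp <= udp_mem_max)
		return false;	/* global usage is within the limit */

	/* Over the limit: a socket still below its guaranteed
	 * minimum may keep receiving; any other socket drops. */
	return sock_bytes >= udp_rmem_min;
}
```

The per-socket minimum is what keeps a mostly idle socket usable while other sockets have pushed total usage past the global limit.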

Cc: Satoshi Oshima <satoshi.oshima.fk@hitachi.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Signed-off-by: Hideo Aoki <haoki@redhat.com>
---

 Documentation/networking/ip-sysctl.txt |   27 +++++++++++++++++++++++++++
 include/net/udp.h                      |    7 +++++++
 net/ipv4/af_inet.c                     |    3 +++
 net/ipv4/proc.c                        |    3 ++-
 net/ipv4/sysctl_net_ipv4.c             |   31 +++++++++++++++++++++++++++++++
 net/ipv4/udp.c                         |   31 +++++++++++++++++++++++++++++++
 6 files changed, 101 insertions(+), 1 deletion(-)

diff -pruN net-2.6.25-t12t19m-p4/Documentation/networking/ip-sysctl.txt net-2.6.25-t12t19m-p5/Documentation/networking/ip-sysctl.txt
--- net-2.6.25-t12t19m-p4/Documentation/networking/ip-sysctl.txt	2007-12-27 10:18:41.000000000 -0500
+++ net-2.6.25-t12t19m-p5/Documentation/networking/ip-sysctl.txt	2007-12-29 21:09:21.000000000 -0500
@@ -446,6 +446,33 @@ tcp_dma_copybreak - INTEGER
 	and CONFIG_NET_DMA is enabled.
 	Default: 4096

+UDP variables:
+
+udp_mem - vector of 3 INTEGERs: min, pressure, max
+	Number of pages allowed for queueing by all UDP sockets.
+
+	min: Below this number of pages UDP is not bothered about its
+	memory appetite. When amount of memory allocated by UDP exceeds
+	this number, UDP starts to moderate memory usage.
+
+	pressure: This value was introduced to follow format of tcp_mem.
+
+	max: Number of pages allowed for queueing by all UDP sockets.
+
+	Default is calculated at boot time from amount of available memory.
+
+udp_rmem_min - INTEGER
+	Minimal size of receive buffer used by UDP sockets in moderation.
+	Each UDP socket is able to use the size for receiving data, even if
+	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
+	Default: 4096
+
+udp_wmem_min - INTEGER
+	Minimal size of send buffer used by UDP sockets in moderation.
+	Each UDP socket is able to use the size for sending data, even if
+	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
+	Default: 4096
+
 CIPSOv4 Variables:

 cipso_cache_enable - BOOLEAN
diff -pruN net-2.6.25-t12t19m-p4/include/net/udp.h net-2.6.25-t12t19m-p5/include/net/udp.h
--- net-2.6.25-t12t19m-p4/include/net/udp.h	2007-12-27 10:18:58.000000000 -0500
+++ net-2.6.25-t12t19m-p5/include/net/udp.h	2007-12-29 21:10:48.000000000 -0500
@@ -65,6 +65,11 @@ extern rwlock_t udp_hash_lock;

 extern struct proto udp_prot;

+/* sysctl variables for udp */
+extern int sysctl_udp_mem[3];
+extern int sysctl_udp_rmem_min;
+extern int sysctl_udp_wmem_min;
+
 struct sk_buff;

 /*
@@ -198,4 +203,6 @@ extern void udp_proc_unregister(struct u
 extern int  udp4_proc_init(void);
 extern void udp4_proc_exit(void);
 #endif
+
+extern void udp_init(void);
 #endif	/* _UDP_H */
diff -pruN net-2.6.25-t12t19m-p4/net/ipv4/af_inet.c net-2.6.25-t12t19m-p5/net/ipv4/af_inet.c
--- net-2.6.25-t12t19m-p4/net/ipv4/af_inet.c	2007-12-27 10:19:02.000000000 -0500
+++ net-2.6.25-t12t19m-p5/net/ipv4/af_inet.c	2007-12-29 21:09:21.000000000 -0500
@@ -1417,6 +1417,9 @@ static int __init inet_init(void)
 	/* Setup TCP slab cache for open requests. */
 	tcp_init();

+	/* Setup UDP memory threshold */
+	udp_init();
+
 	/* Add UDP-Lite (RFC 3828) */
 	udplite4_register();

diff -pruN net-2.6.25-t12t19m-p4/net/ipv4/proc.c net-2.6.25-t12t19m-p5/net/ipv4/proc.c
--- net-2.6.25-t12t19m-p4/net/ipv4/proc.c	2007-12-27 10:19:02.000000000 -0500
+++ net-2.6.25-t12t19m-p5/net/ipv4/proc.c	2007-12-29 21:09:21.000000000 -0500
@@ -56,7 +56,8 @@ static int sockstat_seq_show(struct seq_
 		   sock_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count),
 		   tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated),
 		   atomic_read(&tcp_memory_allocated));
-	seq_printf(seq, "UDP: inuse %d\n", sock_prot_inuse(&udp_prot));
+	seq_printf(seq, "UDP: inuse %d mem %d\n", sock_prot_inuse(&udp_prot),
+		   atomic_read(&udp_memory_allocated));
 	seq_printf(seq, "UDPLITE: inuse %d\n", sock_prot_inuse(&udplite_prot));
 	seq_printf(seq, "RAW: inuse %d\n", sock_prot_inuse(&raw_prot));
 	seq_printf(seq,  "FRAG: inuse %d memory %d\n",
diff -pruN net-2.6.25-t12t19m-p4/net/ipv4/sysctl_net_ipv4.c net-2.6.25-t12t19m-p5/net/ipv4/sysctl_net_ipv4.c
--- net-2.6.25-t12t19m-p4/net/ipv4/sysctl_net_ipv4.c	2007-12-27 10:19:02.000000000 -0500
+++ net-2.6.25-t12t19m-p5/net/ipv4/sysctl_net_ipv4.c	2007-12-29 21:09:21.000000000 -0500
@@ -19,6 +19,7 @@
 #include <net/ip.h>
 #include <net/route.h>
 #include <net/tcp.h>
+#include <net/udp.h>
 #include <net/cipso_ipv4.h>
 #include <net/inet_frag.h>

@@ -812,6 +813,36 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "udp_mem",
+		.data		= &sysctl_udp_mem,
+		.maxlen		= sizeof(sysctl_udp_mem),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "udp_rmem_min",
+		.data		= &sysctl_udp_rmem_min,
+		.maxlen		= sizeof(sysctl_udp_rmem_min),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "udp_wmem_min",
+		.data		= &sysctl_udp_wmem_min,
+		.maxlen		= sizeof(sysctl_udp_wmem_min),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero
+	},
 	{ .ctl_name = 0 }
 };

diff -pruN net-2.6.25-t12t19m-p4/net/ipv4/udp.c net-2.6.25-t12t19m-p5/net/ipv4/udp.c
--- net-2.6.25-t12t19m-p4/net/ipv4/udp.c	2007-12-27 10:19:02.000000000 -0500
+++ net-2.6.25-t12t19m-p5/net/ipv4/udp.c	2007-12-29 21:12:03.000000000 -0500
@@ -82,6 +82,7 @@
 #include <asm/system.h>
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
+#include <linux/bootmem.h>
 #include <linux/types.h>
 #include <linux/fcntl.h>
 #include <linux/module.h>
@@ -118,6 +119,14 @@ EXPORT_SYMBOL(udp_stats_in6);
 struct hlist_head udp_hash[UDP_HTABLE_SIZE];
 DEFINE_RWLOCK(udp_hash_lock);

+int sysctl_udp_mem[3] __read_mostly;
+int sysctl_udp_rmem_min __read_mostly;
+int sysctl_udp_wmem_min __read_mostly;
+
+EXPORT_SYMBOL(sysctl_udp_mem);
+EXPORT_SYMBOL(sysctl_udp_rmem_min);
+EXPORT_SYMBOL(sysctl_udp_wmem_min);
+
 static inline int __udp_lib_lport_inuse(__u16 num,
 					const struct hlist_head udptable[])
 {
@@ -1460,6 +1469,9 @@ struct proto udp_prot = {
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
 	.get_port	   = udp_v4_get_port,
+	.sysctl_mem	   = sysctl_udp_mem,
+	.sysctl_wmem	   = &sysctl_udp_wmem_min,
+	.sysctl_rmem	   = &sysctl_udp_rmem_min,
 	.obj_size	   = sizeof(struct udp_sock),
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_udp_setsockopt,
@@ -1655,6 +1667,25 @@ void udp4_proc_exit(void)
 }
 #endif /* CONFIG_PROC_FS */

+void __init udp_init(void)
+{
+	unsigned long limit;
+
+	/* Set the pressure threshold up by the same strategy of TCP. It is a
+	 * fraction of global memory that is up to 1/2 at 256 MB, decreasing
+	 * toward zero with the amount of memory, with a floor of 128 pages.
+	 */
+	limit = min(nr_all_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
+	limit = (limit * (nr_all_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
+	limit = max(limit, 128UL);
+	sysctl_udp_mem[0] = limit / 4 * 3;
+	sysctl_udp_mem[1] = limit;
+	sysctl_udp_mem[2] = sysctl_udp_mem[0] * 2;
+
+	sysctl_udp_rmem_min = SK_MEM_QUANTUM;
+	sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+}
+
 EXPORT_SYMBOL(udp_disconnect);
 EXPORT_SYMBOL(udp_hash);
 EXPORT_SYMBOL(udp_hash_lock);
-- 
Hitachi Computer Products (America) Inc.
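
As a worked example of the arithmetic in udp_init() above, the following user-space sketch repeats the same shifts with the page count and page shift passed in as parameters. struct udp_mem and udp_mem_defaults are hypothetical names used only for this illustration.

```c
#include <assert.h>

struct udp_mem {
	unsigned long min;
	unsigned long pressure;
	unsigned long max;
};

/* Repeat the udp_init() computation for a given number of pages. */
static struct udp_mem udp_mem_defaults(unsigned long nr_all_pages,
				       unsigned int page_shift)
{
	unsigned long cap, limit;
	struct udp_mem m;

	/* The shifts below are only meaningful in this range. */
	assert(page_shift >= 11 && page_shift <= 20);

	cap = 1UL << (28 - page_shift);		/* 256 MB worth of pages */
	limit = (nr_all_pages < cap ? nr_all_pages : cap) >> (20 - page_shift);
	limit = (limit * (nr_all_pages >> (20 - page_shift))) >> (page_shift - 11);
	if (limit < 128UL)
		limit = 128UL;			/* floor of 128 pages */

	m.min = limit / 4 * 3;
	m.pressure = limit;
	m.max = m.min * 2;
	return m;
}
```

For 4 KB pages (page_shift 12) and 1 GB of memory (262144 pages), this gives min 98304, pressure 131072 and max 196608 pages. For 8 MB (2048 pages) the floor applies, giving 96, 128 and 192.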

^ permalink raw reply

