Netdev List
 help / color / mirror / Atom feed
* (unknown)
From: Eugene Teo @ 2007-08-24  2:42 UTC (permalink / raw)
  To: netdev

subscribe netdev

^ permalink raw reply

* Re: [PATCH v2] [02/10] pasemi_mac: Stop using the pci config space accessors for register read/writes
From: Stephen Rothwell @ 2007-08-24  4:05 UTC (permalink / raw)
  To: Olof Johansson; +Cc: jgarzik, netdev, linuxppc-dev
In-Reply-To: <20070823181310.GB31882@lixom.net>

[-- Attachment #1: Type: text/plain, Size: 593 bytes --]

On Thu, 23 Aug 2007 13:13:10 -0500 Olof Johansson <olof@lixom.net> wrote:
>
>  out:
> -	pci_dev_put(mac->iob_pdev);
> -out_put_dma_pdev:
> -	pci_dev_put(mac->dma_pdev);
> -out_free_netdev:
> +	if (mac->iob_pdev)
> +		pci_dev_put(mac->iob_pdev);
> +	if (mac->dma_pdev)
> +		pci_dev_put(mac->dma_pdev);

It is not documented as such (as far as I can see), but pci_dev_put is
safe to call with NULL. And there are other places in the kernel that
explicitly use that fact.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* Re: Problem with implementation of TCP_DEFER_ACCEPT?
From: John Heffner @ 2007-08-24  4:40 UTC (permalink / raw)
  To: TJ; +Cc: netdev
In-Reply-To: <1187914105.26866.41.camel@bagoas.tjworld.net>

[-- Attachment #1: Type: text/plain, Size: 2665 bytes --]

TJ wrote:
> client SYN > server LISTENING
> client < SYN ACK server SYN_RECEIVED (time-out 3s)
>                  server: inet_rsk(req)->acked = 1
> 
> client ACK > server (discarded)
> 
> client < SYN ACK (DUP) server (time-out 6s)
> client ACK (DUP) > server (discarded)
> 
> client < SYN ACK (DUP) server (time-out 12s)
> client ACK (DUP) > server (discarded)
> 
> client < SYN ACK (DUP) server (time-out 24s)
> client ACK (DUP) > server (discarded)
> 
> client < SYN ACK (DUP) server (time-out 48s)
> client ACK (DUP) > server (discarded)
> 
> client < SYN ACK (DUP) server (time-out 96s)
> client ACK (DUP) > server (discarded)
> 
> server: half-open socket closed.
> 
> With each client ACK being dropped by the kernel's TCP_DEFER_ACCEPT
> mechanism eventually the handshake fails after the 'SYN ACK' retries and
> time-outs expire.
> 
> There is a case for arguing the kernel should be operating in an
> enhanced handshaking mode when TCP_DEFER_ACCEPT is enabled, not an
> alternative mode, and therefore should accept *both* RFC 793 and
> TCP_DEFER_ACCEPT. I've been unable to find a specification or RFC for
> implementing TCP_DEFER_ACCEPT aka BSD's SO_ACCEPTFILTER to give me firm
> guidance.
> 
> It seems incorrect to penalise a client that is trying to complete the
> handshake according to the RFC 793 specification, especially as the
> client has no way of knowing ahead of time whether or not the server is
> operating deferred accept.

Interesting problem.  TCP_DEFER_ACCEPT does not conform to any standard 
I'm aware of.  (In fact, I'd say it's in violation of RFC 793.)  The 
implementation does exactly what it claims, though -- it "allows a 
listener to be awakened only  when  data  arrives  on  the  socket."

I think a more useful spec might have been "allows a listener to be 
awakened only when data arrives on the socket, unless the specified 
timeout has expired."  Once the timeout expires, it should process the 
embryonic connection as if TCP_DEFER_ACCEPT is not set.  Unfortunately, 
I don't think we can retroactively change this definition, as an 
application might depend on data being available and do a non-blocking 
read() after the accept(), expecting data to be there.  Is this worth 
trying to fix?

Also, a listen socket with a backlog and TCP_DEFER_ACCEPT will have reqs 
sit in the backlog for the full defer timeout, even if they've received 
data, which is not really the right thing to do.

I've attached a patch implementing this suggestion (compile tested only 
-- I think I got the logic right but it's late ;).  Kind of ugly, and 
uses up a bit in struct inet_request_sock.  Maybe can be done better...

   -John

[-- Attachment #2: tcp_defer_accept.patch --]
[-- Type: text/plain, Size: 2737 bytes --]

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 62daf21..f9f64a5 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -72,7 +72,8 @@ struct inet_request_sock {
 				sack_ok	   : 1,
 				wscale_ok  : 1,
 				ecn_ok	   : 1,
-				acked	   : 1;
+				acked	   : 1,
+				deferred   : 1;
 	struct ip_options	*opt;
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 185c7ec..cad2490 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -978,6 +978,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	ireq->snd_wscale = rx_opt->snd_wscale;
 	ireq->wscale_ok = rx_opt->wscale_ok;
 	ireq->acked = 0;
+	ireq->deferred = 0;
 	ireq->ecn_ok = 0;
 	ireq->rmt_port = tcp_hdr(skb)->source;
 }
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fbe7714..1207fb8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -444,9 +444,6 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
 		}
 	}
 
-	if (queue->rskq_defer_accept)
-		max_retries = queue->rskq_defer_accept;
-
 	budget = 2 * (lopt->nr_table_entries / (timeout / interval));
 	i = lopt->clock_hand;
 
@@ -455,7 +452,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
 		while ((req = *reqp) != NULL) {
 			if (time_after_eq(now, req->expires)) {
 				if ((req->retrans < thresh ||
-				     (inet_rsk(req)->acked && req->retrans < max_retries))
+				     (inet_rsk(req)->acked && req->retrans < max_retries) ||
+				     (inet_rsk(req)->deferred && req->retrans <
+				      queue->rskq_defer_accept + max_retries))
 				    && !req->rsk_ops->rtx_syn_ack(parent, req, NULL)) {
 					unsigned long timeo;
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index a12b08f..c4867f3 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -637,8 +637,10 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
 
 		/* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
 		if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
-		    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
-			inet_rsk(req)->acked = 1;
+		    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1 &&
+		    !inet_rsk(req)->acked && req->retrans <
+		    inet_csk(sk)->icsk_accept_queue.rskq_defer_accept) {
+			inet_rsk(req)->deferred = 1;
 			return NULL;
 		}
 
@@ -686,6 +688,9 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
 	listen_overflow:
 		if (!sysctl_tcp_abort_on_overflow) {
 			inet_rsk(req)->acked = 1;
+			/* If deferred, ACK must contain data.  Shortcut defer. */
+			if (inet_rsk(req)->deferred)
+				req->retrans = inet_csk(sk)->icsk_accept_queue.rskq_defer_accept;
 			return NULL;
 		}
 

^ permalink raw reply related

* Re: 2.6.22.5 forcedeth timeout hang
From: Willy Tarreau @ 2007-08-24  5:06 UTC (permalink / raw)
  To: Mr. Berkley Shands; +Cc: linux-kernel, Net Dev
In-Reply-To: <46CE1CC7.6000507@exegy.com>

On Thu, Aug 23, 2007 at 06:48:23PM -0500, Mr. Berkley Shands wrote:
> 100% reproducible hang on xmit timeout.
> Just do a "make -j4 modules" on an nfs mounted kernel source.

Most likely you also had the problem with 2.6.22.2 (maybe you have not
tested this one, though). There were bug fixes for forcedeth introduced
in this version, one of them being buggy. The patch below fixes it. Can
you please give it a try ? If it does not fix the problem, please try
2.6.22.1 which does not include those changes. I'm interested because
I have those changes pending for 2.6.20.17 too.

diff --git a/drivers/net/forcedeth.c b/drivers/net/forcedeth.c
index 10f4e3b..1938d6d 100644
--- a/drivers/net/forcedeth.c
+++ b/drivers/net/forcedeth.c
@@ -552,7 +552,7 @@ union ring_type {
 #define PHY_OUI_MARVELL	0x5043
 #define PHY_OUI_CICADA	0x03f1
 #define PHY_OUI_VITESSE	0x01c1
-#define PHY_OUI_REALTEK	0x01c1
+#define PHY_OUI_REALTEK	0x0732
 #define PHYID1_OUI_MASK	0x03ff
 #define PHYID1_OUI_SHFT	6
 #define PHYID2_OUI_MASK	0xfc00

Thanks,
Willy


^ permalink raw reply related

* Re: [PATCH (take 2)] request_irq fix DEBUG_SHIRQ handling Re: 2.6.23-rc2-mm1: rtl8139 inconsistent lock state
From: Jarek Poplawski @ 2007-08-24  5:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mariusz Kozlowski, netdev, Jeff Garzik, David Woodhouse,
	Ingo Molnar, Thomas Gleixner, linux-kernel
In-Reply-To: <20070823084430.GB1980@ff.dom.local>

On Thu, Aug 23, 2007 at 10:44:30AM +0200, Jarek Poplawski wrote:
> Andrew Morton pointed out that my changelog was unusable. Sorry!
> Here is a second try with the changelog and kernel version changed.
...
> ------------>(take 2)
> 
> Subject: request_irq() - fix DEBUG_SHIRQ handling
...
> Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
> 
> ---
> 
> diff -Nurp 2.6.23-rc3-git6-/kernel/irq/manage.c 2.6.23-rc3-git6/kernel/irq/manage.c
> --- 2.6.23-rc3-git6-/kernel/irq/manage.c	2007-08-23 10:11:35.000000000 +0200
> +++ 2.6.23-rc3-git6/kernel/irq/manage.c	2007-08-23 10:16:29.000000000 +0200

So, this time I f-ed the diff part: it's not exactly against 2.6.23-rc-git6.
But, it's Andrew to blame: he should've known that some old & slow chips
can't do science and poetry at the same time. Sorry (for him)!

Anyway, beside an offset, should be OK...

Regards,
Jarek P.

^ permalink raw reply

* [PATCH 0/2] myri10ge updates for 2.6.23
From: Brice Goglin @ 2007-08-24  6:56 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: netdev

Hi Jeff,

Now that Greg pushed my fix to expose pcie_get_readrq() prototype in
linux/pci.h, I am resending my rework of Peter Oruba's patch to use
pcie_get/set_readrq() in myri10ge. Please apply for 2.6.23.

1. use pcie_get/set_readrq
2. update driver version to 1.3.2-1.269



Also, we noticed that packet forwarding is faster on our hardware when
receiving in linear skb instead of pages (about 10Gb/s vs. 7), so we are
thinking of submitting a Kconfig option to switch to the old linear skb
RX code. Would this be acceptable?

Thanks,
Brice


^ permalink raw reply

* [PATCH 1/2] myri10ge: use pcie_get/set_readrq
From: Brice Goglin @ 2007-08-24  6:57 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: netdev
In-Reply-To: <46CE8103.10109@myri.com>

Based on a patch from Peter Oruba, convert myri10ge to use pcie_get_readrq()
and pcie_set_readrq() instead of our own PCI calls and arithmetics.

These driver changes incorporate the proposed PCI-X / PCI-Express read byte
count interface.  Reading and setting those values doesn't take place
"manually", instead wrapping functions are called to allow quirks for some
PCI bridges.

Signed-off-by: Brice Goglin <brice@myri.com>
Signed-off by: Peter Oruba <peter.oruba@amd.com>
Based on work by Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 drivers/net/myri10ge/myri10ge.c |   32 ++++++--------------------------
 1 file changed, 6 insertions(+), 26 deletions(-)

Index: linux-2.6.git/drivers/net/myri10ge/myri10ge.c
===================================================================
--- linux-2.6.git.orig/drivers/net/myri10ge/myri10ge.c	2007-08-24 08:43:01.000000000 +0200
+++ linux-2.6.git/drivers/net/myri10ge/myri10ge.c	2007-08-24 08:45:29.000000000 +0200
@@ -2514,26 +2514,20 @@
 {
 	struct pci_dev *pdev = mgp->pdev;
 	struct device *dev = &pdev->dev;
-	int cap, status;
-	u16 val;
+	int status;
 
 	mgp->tx.boundary = 4096;
 	/*
 	 * Verify the max read request size was set to 4KB
 	 * before trying the test with 4KB.
 	 */
-	cap = pci_find_capability(pdev, PCI_CAP_ID_EXP);
-	if (cap < 64) {
-		dev_err(dev, "Bad PCI_CAP_ID_EXP location %d\n", cap);
-		goto abort;
-	}
-	status = pci_read_config_word(pdev, cap + PCI_EXP_DEVCTL, &val);
-	if (status != 0) {
+	status = pcie_get_readrq(pdev);
+	if (status < 0) {
 		dev_err(dev, "Couldn't read max read req size: %d\n", status);
 		goto abort;
 	}
-	if ((val & (5 << 12)) != (5 << 12)) {
-		dev_warn(dev, "Max Read Request size != 4096 (0x%x)\n", val);
+	if (status != 4096) {
+		dev_warn(dev, "Max Read Request size != 4096 (%d)\n", status);
 		mgp->tx.boundary = 2048;
 	}
 	/*
@@ -2850,9 +2844,7 @@
 	size_t bytes;
 	int i;
 	int status = -ENXIO;
-	int cap;
 	int dac_enabled;
-	u16 val;
 
 	netdev = alloc_etherdev(sizeof(*mgp));
 	if (netdev == NULL) {
@@ -2884,19 +2876,7 @@
 	    = pci_find_capability(pdev, PCI_CAP_ID_VNDR);
 
 	/* Set our max read request to 4KB */
-	cap = pci_find_capability(pdev, PCI_CAP_ID_EXP);
-	if (cap < 64) {
-		dev_err(&pdev->dev, "Bad PCI_CAP_ID_EXP location %d\n", cap);
-		goto abort_with_netdev;
-	}
-	status = pci_read_config_word(pdev, cap + PCI_EXP_DEVCTL, &val);
-	if (status != 0) {
-		dev_err(&pdev->dev, "Error %d reading PCI_EXP_DEVCTL\n",
-			status);
-		goto abort_with_netdev;
-	}
-	val = (val & ~PCI_EXP_DEVCTL_READRQ) | (5 << 12);
-	status = pci_write_config_word(pdev, cap + PCI_EXP_DEVCTL, val);
+	status = pcie_set_readrq(pdev, 4096);
 	if (status != 0) {
 		dev_err(&pdev->dev, "Error %d writing PCI_EXP_DEVCTL\n",
 			status);



^ permalink raw reply

* [PATCH 2/2] myri10ge: update driver version to 1.3.2-1.269
From: Brice Goglin @ 2007-08-24  6:57 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: netdev
In-Reply-To: <46CE8103.10109@myri.com>

Update myri10ge driver version to 1.3.2-1.269.

Signed-off-by: Brice Goglin <brice@myri.com>
---
 drivers/net/myri10ge/myri10ge.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.git/drivers/net/myri10ge/myri10ge.c
===================================================================
--- linux-2.6.git.orig/drivers/net/myri10ge/myri10ge.c	2007-08-24 08:45:29.000000000 +0200
+++ linux-2.6.git/drivers/net/myri10ge/myri10ge.c	2007-08-24 08:45:38.000000000 +0200
@@ -72,7 +72,7 @@
 #include "myri10ge_mcp.h"
 #include "myri10ge_mcp_gen_header.h"
 
-#define MYRI10GE_VERSION_STR "1.3.1-1.248"
+#define MYRI10GE_VERSION_STR "1.3.2-1.269"
 
 MODULE_DESCRIPTION("Myricom 10G driver (10GbE)");
 MODULE_AUTHOR("Maintainer: help@myri.com");



^ permalink raw reply

* Re: Problem with implementation of TCP_DEFER_ACCEPT?
From: Lennert Buytenhek @ 2007-08-24  7:15 UTC (permalink / raw)
  To: TJ; +Cc: netdev, kuznet
In-Reply-To: <1187914105.26866.41.camel@bagoas.tjworld.net>

On Fri, Aug 24, 2007 at 01:08:25AM +0100, TJ wrote:

> An RFC 793 standard TCP handshake requires three packets:
> 
> client SYN > server LISTENING
> client < SYN ACK server SYN_RECEIVED
> client ACK > server ESTABLISHED
> 
> client PSH ACK + data > server
> 
> TCP_DEFER_ACCEPT is designed to increase performance by reducing the
> number of TCP packets exchanged before the client can pass data:
> 
> client SYN > server LISTENING
> client < SYN ACK server SYN_RECEIVED
> 
> client PSH ACK + data > server ESTABLISHED
> 
> At present with TCP_DEFER_ACCEPT the kernel treats the RFC 793 handshake
> as invalid; dropping the ACK from the client without replying so the
> client doesn't know the server has in fact set it's internal ACKed flag.
> 
> If the client doesn't send a packet containing data before the SYN_ACK
> time-outs finally expire the connection will be dropped.

A brought this up a long, long time ago, and I seem to remember
Alexey Kuznetsov explained me at the time that this was intentional.

I can't find the thread in the mailing list archives anymore, though
-- and my memory might be failing me.

^ permalink raw reply

* Re: [PATCH 12/30] net: No point in casting kmalloc return values in Gianfar Ethernet Driver
From: Kumar Gala @ 2007-08-24  7:25 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: Linux Kernel Mailing List, netdev, David S. Miller, Andy Fleming
In-Reply-To: <c4e095af9df5ab47e7d1c8e4f3d256d74576a0c9.1187912217.git.jesper.juhl@gmail.com>


On Aug 23, 2007, at 6:59 PM, Jesper Juhl wrote:

> kmalloc() returns a void ptr, so there's no need to cast its return
> value in drivers/net/gianfar.c .
>
> Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>

Acked-by: Kumar Gala <galak@kernel.crashing.org>

- k

^ permalink raw reply

* Re: [PATCH] bridge: sysfs locking fix.
From: Daniel Lezcano @ 2007-08-24  8:18 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Lennert Buytenhek, Andrew Morton, David S. Miller, netdev, bridge
In-Reply-To: <20070814145052.2d135d2f@oldman.hamilton.local>

Stephen Hemminger wrote:
> Forget earlier patch, it is wrong...
> 
> The stp change code generates "sleeping function called from invalid context"
> because rtnl_lock() called with BH disabled. This fixes it by not acquiring then
> dropping the bridge lock.
> 
> Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
> 
> --- a/net/bridge/br_sysfs_br.c	2007-08-06 09:26:48.000000000 +0100
> +++ b/net/bridge/br_sysfs_br.c	2007-08-14 14:29:52.000000000 +0100
> @@ -147,20 +147,26 @@ static ssize_t show_stp_state(struct dev
>  	return sprintf(buf, "%d\n", br->stp_enabled);
>  }
> 
> -static void set_stp_state(struct net_bridge *br, unsigned long val)
> -{
> -	rtnl_lock();
> -	spin_unlock_bh(&br->lock);
> -	br_stp_set_enabled(br, val);
> -	spin_lock_bh(&br->lock);
> -	rtnl_unlock();
> -}
> 
>  static ssize_t store_stp_state(struct device *d,
>  			       struct device_attribute *attr, const char *buf,
>  			       size_t len)
>  {
> -	return store_bridge_parm(d, buf, len, set_stp_state);
> +	struct net_bridge *br = to_bridge(d);
> +	char *endp;
> +	unsigned long val;
> +
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	val = simple_strtoul(buf, &endp, 0);
> +	if (endp == buf)
> +		return -EINVAL;
> +
> +	rtnl_lock();
> +	br_stp_set_enabled(br, val);
> +	rtnl_unlock();
> +

Shouldn't len value be returned at the end of the function ?

>  }
>  static DEVICE_ATTR(stp_state, S_IRUGO | S_IWUSR, show_stp_state,
>  		   store_stp_state);
> -


^ permalink raw reply

* [Question] the precondition of calling alloc_skb()/kfree_skb()?
From: Li Yu @ 2007-08-24  8:58 UTC (permalink / raw)
  To: netdev, linux-kernel

Hi, all:

I encountered a problem of using sk_buff.

I used 2.4.20 kernel, when burst traffic come, the kernel will complain
a bug report at skbuff.c:316 later:

311 void __kfree_skb(struct sk_buff *skb)
312 {
313 if (skb->list) {
314 printk(KERN_WARNING "Warning: kfree_skb passed an skb still "
315 "on a list (from %p).\n", NET_CALLER(skb));
316 BUG(); /* HERE!!! */
317 }
/* snip some code here */
332 }


I saw the dev_kfree_skb_irq() and dev_kfree_skb_irq(), and how to use
them. even, in fact, we work in pure poll I/O model, so the NIC can not
issue any interrupt.

And, I searched google, there are many similar reports like above, but
almost of them have no reply. I think there may have some unknown thing
while sk_buff API.

Well, I hope know, are there some preconditions of calling alloc_skb()
or *_kfree_skb_*() ? Thank in advanced.

Good luck.

- Yu Li


^ permalink raw reply

* Re: Problem with implementation of TCP_DEFER_ACCEPT?
From: Alexey Kuznetsov @ 2007-08-24  8:40 UTC (permalink / raw)
  To: Lennert Buytenhek; +Cc: TJ, netdev
In-Reply-To: <20070824071530.GB26447@xi.wantstofly.org>

Hello!

> > At present with TCP_DEFER_ACCEPT the kernel treats the RFC 793 handshake
> > as invalid; dropping the ACK from the client without replying so the
> > client doesn't know the server has in fact set it's internal ACKed flag.
> > 
> > If the client doesn't send a packet containing data before the SYN_ACK
> > time-outs finally expire the connection will be dropped.
> 
> A brought this up a long, long time ago, and I seem to remember
> Alexey Kuznetsov explained me at the time that this was intentional.

Obviously, I said something like "it is exactly what TCP_DEFER_ACCEPT does".


There is no protocol violation here, ACK from client is considered as lost,
it is quite normal and happens all the time. Handshake is not complete,
server remains in SYN-RECV state and continues to retransmit SYN-ACK.
If client tried to cheat and is not going to send its request,
connection will time out.

Alexey


^ permalink raw reply

* Re: Problem with implementation of TCP_DEFER_ACCEPT?
From: TJ @ 2007-08-24  9:31 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20070824084048.GA8346@ms2.inr.ac.ru>

On Fri, 2007-08-24 at 12:40 +0400, Alexey Kuznetsov wrote:

> There is no protocol violation here, ACK from client is considered as lost,
> it is quite normal and happens all the time. Handshake is not complete,
> server remains in SYN-RECV state and continues to retransmit SYN-ACK.
> If client tried to cheat and is not going to send its request,
> connection will time out.

Thanks for the responses.

Do we have any authoritative references on this? Who implemented it
originally?

Right now Juniper are claiming the issue that brought this to the
surface (the bug linked to in my original post) is a problem with the
implementation of TCP_DEFER_ACCEPT.

My position so far is that the Juniper DX OS is not following the HTTP
standard because it doesn't send a request with the connection, and as I
read the end of section 1.4 of RFC2616, an HTTP connection should be
accompanied by a request.

Can anyone confirm my interpretation or provide references to firm it
up, or refute it?

There is also a very real practical problem here:

Since version 2.1.5 apache enables TCP_DEFER_ACCEPT *by default* without
mention of it in the configuration file.

As time goes on the number of apache v2.1.5+ deployments is only going
to rise, and I'd hate for anyone else to go through the 5+ weeks of pain
the system admins at the e-commerce operation I was helping went
through, not to mention the last 2 weeks feeling like I was chasing
ghosts - it's an absolute pain to track down and identify!

Therefore, anyone deploying apache web servers in a web-farm behind the
Juniper DX load-balanders and using TCP multiplexing (for which they pay
a hefty licence fee!) is liable to suffer the random drop effects
described in my bug report.

Because several other HTTP load-balancers deploy similar methods of
holding open connections to the servers and pipe-lining requests, this
could affect more than just Juniper.

Any other suggestions/reactions on the Linux kernel side? I'm intending
posting a comment to the apache-dev mailing list once I've gathered the
strands together.

Thanks again.

TJ.
Ubuntu ACPI Kernel Team.


^ permalink raw reply

* [PATCH] [XFRM] : Fix pointer copy size for encap_tmpl and coaddr.
From: Masahide NAKAMURA @ 2007-08-24 10:05 UTC (permalink / raw)
  To: tgraf, davem; +Cc: netdev, Masahide NAKAMURA
In-Reply-To: <20070822145632.087638328@lsx.localdomain>

This is minor fix about sizeof argument using with kmemdup().

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
---
 net/xfrm/xfrm_user.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 0b8491f..46076f5 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -299,14 +299,14 @@ static struct xfrm_state *xfrm_state_construct(struct xfrm_usersa_info *p,
 
 	if (attrs[XFRMA_ENCAP]) {
 		x->encap = kmemdup(nla_data(attrs[XFRMA_ENCAP]),
-				   sizeof(x->encap), GFP_KERNEL);
+				   sizeof(*x->encap), GFP_KERNEL);
 		if (x->encap == NULL)
 			goto error;
 	}
 
 	if (attrs[XFRMA_COADDR]) {
 		x->coaddr = kmemdup(nla_data(attrs[XFRMA_COADDR]),
-				    sizeof(x->coaddr), GFP_KERNEL);
+				    sizeof(*x->coaddr), GFP_KERNEL);
 		if (x->coaddr == NULL)
 			goto error;
 	}
-- 
1.4.4.2


^ permalink raw reply related

* [PATCH] [IPV6] XFRM: Fix connected socket to use transformation.
From: Masahide NAKAMURA @ 2007-08-24 10:08 UTC (permalink / raw)
  To: davem; +Cc: netdev, Masahide NAKAMURA, Noriaki TAKAMIYA

When XFRM policy and state are ready after TCP connection is started,
the traffic should be transformed immediately, however it does not
on IPv6 TCP.

It depends on a dst cache replacement policy with connected socket.
It seems that the replacement is always done for IPv4, however, on
IPv6 case it is done only when routing cookie is changed.

This patch fix that non-transformation dst can be changed to
transformation one.
This behavior is required by MIPv6 and improves IPv6 IPsec.

Signed-off-by: Noriaki TAKAMIYA <takamiya@po.ntts.co.jp>
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
---
 include/net/ip6_fib.h            |    2 ++
 net/ipv6/inet6_connection_sock.c |   34 ++++++++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index c48ea87..85d6d9f 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -105,6 +105,8 @@ struct rt6_info
 	struct rt6key			rt6i_src;
 
 	u8				rt6i_protocol;
+
+	u32				rt6i_flow_cache_genid;
 };
 
 static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst)
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 116f94a..f389322 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -139,6 +139,36 @@ void inet6_csk_addr2sockaddr(struct sock *sk, struct sockaddr * uaddr)
 
 EXPORT_SYMBOL_GPL(inet6_csk_addr2sockaddr);
 
+static inline
+void __inet6_csk_dst_store(struct sock *sk, struct dst_entry *dst,
+			   struct in6_addr *daddr, struct in6_addr *saddr)
+{
+	struct rt6_info *rt = (struct rt6_info *)dst;
+
+	__ip6_dst_store(sk, dst, daddr, saddr);
+	rt->rt6i_flow_cache_genid = atomic_read(&flow_cache_genid);
+}
+
+static inline
+struct dst_entry *__inet6_csk_dst_check(struct sock *sk, u32 cookie)
+{
+	struct dst_entry *dst;
+	struct rt6_info *rt;
+
+	dst = __sk_dst_check(sk, cookie);
+	if (!dst)
+		goto end;
+
+	rt = (struct rt6_info *)dst;
+	if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
+		sk->sk_dst_cache = NULL;
+		dst_release(dst);
+		dst = NULL;
+	}
+ end:
+	return dst;
+}
+
 int inet6_csk_xmit(struct sk_buff *skb, int ipfragok)
 {
 	struct sock *sk = skb->sk;
@@ -166,7 +196,7 @@ int inet6_csk_xmit(struct sk_buff *skb, int ipfragok)
 		final_p = &final;
 	}
 
-	dst = __sk_dst_check(sk, np->dst_cookie);
+	dst = __inet6_csk_dst_check(sk, np->dst_cookie);
 
 	if (dst == NULL) {
 		int err = ip6_dst_lookup(sk, &dst, &fl);
@@ -186,7 +216,7 @@ int inet6_csk_xmit(struct sk_buff *skb, int ipfragok)
 			return err;
 		}
 
-		__ip6_dst_store(sk, dst, NULL, NULL);
+		__inet6_csk_dst_store(sk, dst, NULL, NULL);
 	}
 
 	skb->dst = dst_clone(dst);
-- 
1.4.4.2


^ permalink raw reply related

* [PATCH 1/2] [IPV6] IPSEC: Omit redirect for tunnelled packet.
From: Masahide NAKAMURA @ 2007-08-24 10:08 UTC (permalink / raw)
  To: davem; +Cc: netdev, Masahide NAKAMURA

IPv6 IPsec tunnel gateway incorrectly sends redirect to
router or sender when network device the IPsec tunnelled packet
is arrived is the same as the one the decapsulated packet
is sent.

With this patch, it omits to send the redirect when the forwarding
skbuff carries secpath, since such skbuff should be assumed as
a decapsulated packet from IPsec tunnel by own.

It may be a rare case for an IPsec security gateway, however
it is not rare when the gateway is MIPv6 Home Agent since
the another tunnel end-point is Mobile Node and it changes
the attached network.

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
---
 net/ipv6/ip6_output.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 5dead39..07b82c2 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -441,8 +441,10 @@ int ip6_forward(struct sk_buff *skb)
 
 	/* IPv6 specs say nothing about it, but it is clear that we cannot
 	   send redirects to source routed frames.
+	   We don't send redirects to frames decapsulated from IPsec.
 	 */
-	if (skb->dev == dst->dev && dst->neighbour && opt->srcrt == 0) {
+	if (skb->dev == dst->dev && dst->neighbour && opt->srcrt == 0 &&
+	    !skb->sp) {
 		struct in6_addr *target = NULL;
 		struct rt6_info *rt;
 		struct neighbour *n = dst->neighbour;
-- 
1.4.4.2


^ permalink raw reply related

* [PATCH 2/2] [IPV4] IPSEC: Omit redirect for tunnelled packet.
From: Masahide NAKAMURA @ 2007-08-24 10:09 UTC (permalink / raw)
  To: davem; +Cc: netdev, Masahide NAKAMURA

IPv4 IPsec tunnel gateway incorrectly sends redirect to
sender if it is onlink host when network device the IPsec tunnelled
packet is arrived is the same as the one the decapsulated packet
is sent.

With this patch, it omits to send the redirect when the forwarding
skbuff carries secpath, since such skbuff should be assumed as
a decapsulated packet from IPsec tunnel by own.

Request for comments:
Alternatively we'd have another way to change net/ipv4/route.c
(__mkroute_input) to use RTCF_DOREDIRECT flag unless skbuff
has no secpath. It is better than this patch at performance
point of view because IPv4 redirect judgement is done at
routing slow-path. However, it should be taken care of resource
changes between SAD(XFRM states) and routing table. In other words,
When IPv4 SAD is changed does the related routing entry go to its
slow-path? If not, it is reasonable to apply this patch.

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
---
 net/ipv4/ip_forward.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 8c95cf0..afbf938 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -105,7 +105,7 @@ int ip_forward(struct sk_buff *skb)
 	 *	We now generate an ICMP HOST REDIRECT giving the route
 	 *	we calculated.
 	 */
-	if (rt->rt_flags&RTCF_DOREDIRECT && !opt->srr)
+	if (rt->rt_flags&RTCF_DOREDIRECT && !opt->srr && !skb->sp)
 		ip_rt_send_redirect(skb);
 
 	skb->priority = rt_tos2priority(iph->tos);
-- 
1.4.4.2


^ permalink raw reply related

* Re: [ANNOUNCE] iproute2-2.6.23-rc3
From: Jarek Poplawski @ 2007-08-24 10:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20070822110804.30c2e803@freepuppy.rosehill.hemminger.net>

On 22-08-2007 20:08, Stephen Hemminger wrote:
> There have been a lot of changes for 2.6.23, so here is a test release
> of iproute2 that should capture all the submitted patches
> 
> 
> http://developer.osdl.org/shemminger/iproute2/download/iproute2-2.6.23-rc3.tar.gz

But... isn't it forged, btw?!

Cheers,
Jarek P.

^ permalink raw reply

* Re: [IPv6] Add v4mapped address inline
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2007-08-24 10:19 UTC (permalink / raw)
  To: brian.haley, davem; +Cc: netdev, lksctp-developers, yoshfuji
In-Reply-To: <46CDCE8B.6010909@hp.com>

In article <46CDCE8B.6010909@hp.com> (at Thu, 23 Aug 2007 14:14:35 -0400), Brian Haley <brian.haley@hp.com> says:

> YOSHIFUJI Hideaki / ???? wrote:
> > Please put this just after ipv6_addr_any(), not after
> > ipv6_addr_diff().
> 
> Ok, updated patch attached.
> 
> -Brian
> 
> 
> Add v4mapped address inline to avoid calls to ipv6_addr_type().
> 
> Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

--yoshfuji

^ permalink raw reply

* Re: [PATCH] [XFRM] : Fix pointer copy size for encap_tmpl and coaddr.
From: Thomas Graf @ 2007-08-24 10:35 UTC (permalink / raw)
  To: Masahide NAKAMURA; +Cc: davem, netdev
In-Reply-To: <11879499202750-git-send-email-nakam@linux-ipv6.org>

* Masahide NAKAMURA <nakam@linux-ipv6.org> 2007-08-24 19:05
> This is minor fix about sizeof argument using with kmemdup().

Thanks for catching this!

^ permalink raw reply

* Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()
From: Denys Vlasenko @ 2007-08-24 11:59 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Heiko Carstens, Herbert Xu, Chris Snook, clameter,
	Linux Kernel Mailing List, linux-arch, Linus Torvalds, netdev,
	Andrew Morton, ak, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <alpine.LFD.0.999.0708160606580.24380@enigma.security.iitk.ac.in>

On Thursday 16 August 2007 01:39, Satyam Sharma wrote:
>
>  static inline void wait_for_init_deassert(atomic_t *deassert)
>  {
> -	while (!atomic_read(deassert));
> +	while (!atomic_read(deassert))
> +		cpu_relax();
>  	return;
>  }

For less-than-briliant people like me, it's totally non-obvious that
cpu_relax() is needed for correctness here, not just to make P4 happy.

IOW: "atomic_read" name quite unambiguously means "I will read
this variable from main memory". Which is not true and creates
potential for confusion and bugs.
--
vda

^ permalink raw reply

* Re: [PATCH] i386: Fix a couple busy loops in mach_wakecpu.h:wait_for_init_deassert()
From: Andi Kleen @ 2007-08-24 12:07 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Satyam Sharma, Heiko Carstens, Herbert Xu, Chris Snook, clameter,
	Linux Kernel Mailing List, linux-arch, Linus Torvalds, netdev,
	Andrew Morton, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <200708241259.33659.vda.linux@googlemail.com>

On Friday 24 August 2007 13:59:32 Denys Vlasenko wrote:
> On Thursday 16 August 2007 01:39, Satyam Sharma wrote:
> >
> >  static inline void wait_for_init_deassert(atomic_t *deassert)
> >  {
> > -	while (!atomic_read(deassert));
> > +	while (!atomic_read(deassert))
> > +		cpu_relax();
> >  	return;
> >  }
> 
> For less-than-briliant people like me, it's totally non-obvious that
> cpu_relax() is needed for correctness here, not just to make P4 happy.

I find it also non obvious. It would be really better to have a barrier
or equivalent (volatile or variable clobber) in the atomic_read()
 
> IOW: "atomic_read" name quite unambiguously means "I will read
> this variable from main memory". Which is not true and creates
> potential for confusion and bugs.

Agreed.

-Andi

^ permalink raw reply

* [ofa-general] Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: jamal @ 2007-08-24 12:14 UTC (permalink / raw)
  To: Bill Fink
  Cc: jagana, peter.p.waskiewicz.jr, herbert, gaagaan, Robert.Olsson,
	netdev, rdreier, mcarlson, kaber, jeff, general, mchan, tgraf,
	johnpol, shemminger, David Miller, sri
In-Reply-To: <20070823231820.2ae52cc0.billfink@mindspring.com>

On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:

[..]
> Here you can see there is a major difference in the TX CPU utilization
> (99 % with TSO disabled versus only 39 % with TSO enabled), although
> the TSO disabled case was able to squeeze out a little extra performance
> from its extra CPU utilization.  

Good stuff. What kind of machine? SMP?
Seems the receive side of the sender is also consuming a lot more cpu
i suspect because receiver is generating a lot more ACKs with TSO.
Does the choice of the tcp congestion control algorithm affect results?
it would be interesting to see both MTUs with either TCP BIC vs good old
reno on sender (probably without changing what the receiver does). BIC
seems to be the default lately.

> Interestingly, with TSO enabled, the
> receiver actually consumed more CPU than with TSO disabled, 

I would suspect the fact that a lot more packets making it into the
receiver for TSO contributes.

> so I guess
> the receiver CPU saturation in that case (99 %) was what restricted
> its performance somewhat (this was consistent across a few test runs).

Unfortunately the receiver plays a big role in such tests - if it is
bottlenecked then you are not really testing the limits of the
transmitter. 

cheers,
jamal

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Denys Vlasenko @ 2007-08-24 12:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Satyam Sharma, Christoph Lameter, Paul E. McKenney, Herbert Xu,
	Nick Piggin, Paul Mackerras, Segher Boessenkool, heiko.carstens,
	horms, linux-kernel, rpjday, ak, netdev, cfriesen, akpm,
	jesper.juhl, linux-arch, zlynx, schwidefsky, Chris Snook, davem,
	wensong, wjiang
In-Reply-To: <alpine.LFD.0.999.0708172110400.30176@woody.linux-foundation.org>

On Saturday 18 August 2007 05:13, Linus Torvalds wrote:
> On Sat, 18 Aug 2007, Satyam Sharma wrote:
> > No code does (or would do, or should do):
> >
> > 	x.counter++;
> >
> > on an "atomic_t x;" anyway.
>
> That's just an example of a general problem.
>
> No, you don't use "x.counter++". But you *do* use
>
> 	if (atomic_read(&x) <= 1)
>
> and loading into a register is stupid and pointless, when you could just
> do it as a regular memory-operand to the cmp instruction.

It doesn't mean that (volatile int*) cast is bad, it means that current gcc
is bad (or "not good enough"). IOW: instead of avoiding volatile cast,
it's better to fix the compiler.

> And as far as the compiler is concerned, the problem is the 100% same:
> combining operations with the volatile memop.
>
> The fact is, a compiler that thinks that
>
> 	movl mem,reg
> 	cmpl $val,reg
>
> is any better than
>
> 	cmpl $val,mem
>
> is just not a very good compiler.

Linus, in all honesty gcc has many more cases of suboptimal code,
case of "volatile" is just one of many.

Off the top of my head:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28417

unsigned v;
void f(unsigned A) { v = ((unsigned long long)A) * 365384439 >> (27+32); }

gcc-4.1.1 -S -Os -fomit-frame-pointer t.c

f:
        movl    $365384439, %eax
        mull    4(%esp)
        movl    %edx, %eax <===== ?
        shrl    $27, %eax
        movl    %eax, v
        ret

Why is it moving %edx to %eax?

gcc-4.2.1 -S -Os -fomit-frame-pointer t.c

f:
        movl    $365384439, %eax
        mull    4(%esp)
        movl    %edx, %eax <===== ?
        xorl    %edx, %edx <===== ??!
        shrl    $27, %eax
        movl    %eax, v
        ret

Progress... Now we also zero out %edx afterwards for no apparent reason.
--
vda

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox