fixing sk_stream

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* fixing sk_stream_rfree()
@ 2006-04-15  3:59 David S. Miller
  2006-04-15  8:23 ` David S. Miller
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: David S. Miller @ 2006-04-15  3:59 UTC (permalink / raw)
  To: netdev; +Cc: herbert

Herbert, as you may have noticed we found some missing
locking in sk_stream_rfree().  It could explain the
"!sk_forward_alloc" BUG() we thought was caused by e1000's
TSO implementation and the Intel folks have provided enough
datapoints to prove that it is indeed not an e1000 specific
problem.

sk_stream_rfree() modifies sk->sk_forward_alloc without holding any
locks from the __kfree_skb() callback.

Now, _most_ of the time it is OK because we're being invoked with the
socket lock held as ACKs come back to liberate the transmit queue or
during socket shutdown which also holds the socket locked.  (I did
audit this, but I may have missed something, please double check this
assertion)

However, if skb->users is incremented or similar, that entity could
end up being the context in which the skb is freed up and that will
not necessarily have the socket locked correctly when the
skb->destructor callback invokes sk_stream_rfree().  Consider tcpdump
or similar.

But we're stuck if we try to cure this:

1) We can't take bh_lock_sock().  Well, we could if we disabled
   BH explicitly first but even then if the socket is locked
   by the user we can't tell if that's the current cpu and there
   is no easy way to handle deferring this work.

   We really can't consider using a workqueue or something like
   that to handle this, that's way too heavy handed.

2) We can't turn sk_forward_alloc easily into an atomic_t.  The
   masking operation in __sk_stream_mem_reclaim() does not translate
   readily into an atomic_t op:

		sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;

   That line has always driven me nuts, but I know that it is there
   to handle partial page allocations existing when that function
   is called.

I guess for #2 we could change those two lines into:

		int n = atomic_read(&sk->sk_forward_alloc) /
				SK_STREAM_MEM_QUANTUM;

		atomic_sub(n, sk->sk_prot->memory_allocated);
		sk->sk_forward_alloc -= n * SK_STREAM_MEM_QUANTUM;

in order to make it "atomic_t op" translatable.

But even if we did this with sk->sk_forward_alloc as an atomic_t
it would have the same kinds of races we're trying to cure here,
namely that we test the value and then act upon it while an unlocked
asynchronous context can modify the value in another thread of
execution in between the test and the action.

Any ideas?  Come to think of it, __sk_stream_mem_reclaim() looks
really racey even as-is.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-15  3:59 fixing sk_stream_rfree() David S. Miller
@ 2006-04-15  8:23 ` David S. Miller
  2006-04-15  9:03   ` David S. Miller
  2006-04-16 11:18 ` Herbert Xu
  2006-04-18 11:49 ` Ingo Oeser
  2 siblings, 1 reply; 12+ messages in thread
From: David S. Miller @ 2006-04-15  8:23 UTC (permalink / raw)
  To: netdev; +Cc: herbert

From: "David S. Miller" <davem@davemloft.net>
Date: Fri, 14 Apr 2006 20:59:27 -0700 (PDT)

> Any ideas?  Come to think of it, __sk_stream_mem_reclaim() looks
> really racey even as-is.

I came up with one more possible approach, it goes something like
this:

1) Get rid of the skb->destructor callback for TCP receive
   data.

2) Adjust sk_rmem_alloc and sk_forward_alloc explicitly when we add
   packets to sk_receive_queue/out_of_order_queue, and when we unlink
   them from sk_receive_queue and __kfree_skb() them up.

I think this would work and get rid of the unlocked accesses to
sk_forward_alloc.  And it would do so without adding new overheads
(the other two ideas did, by adding locks or turning sk_forward_alloc
into an atomic_t).

The implementation needs a bit of thinking in order to not mess up all
the nice abstractions Arnaldo has created for stream sockets and the
interfaces we share between TCP and DCCP.

But actually DCCP needs this fix too, since it also uses
sk_forward_alloc, so building the solution into the shared interfaces
is desirable.

Too bad this will take a little bit of time to implement, I really
want to be able to cook up a much simpler test patch that the people
who can reproduce the BUG() can try out.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-15  8:23 ` David S. Miller
@ 2006-04-15  9:03   ` David S. Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David S. Miller @ 2006-04-15  9:03 UTC (permalink / raw)
  To: netdev; +Cc: herbert

From: "David S. Miller" <davem@davemloft.net>
Date: Sat, 15 Apr 2006 01:23:27 -0700 (PDT)

> I came up with one more possible approach, it goes something like
> this:
> 
> 1) Get rid of the skb->destructor callback for TCP receive
>    data.
> 
> 2) Adjust sk_rmem_alloc and sk_forward_alloc explicitly when we add
>    packets to sk_receive_queue/out_of_order_queue, and when we unlink
>    them from sk_receive_queue and __kfree_skb() them up.

It turns out DCCP doesn't use the sk_forward_alloc memory
scheduling at least not for receive.

So, as food for thought, below is a comile-tested-only
implementation of the above idea.  The basic transformations
are:

1) sk_eat_skb --> sk_stream_eat_skb which does the sk_forward_alloc
   et al. accounting by hand which would have been done by the
   skb->destructor sk_stream_rfree()

2) __skb_queue_purge --> sk_stream_eat_queue

3) sk_stream_set_owner_r --> sk_stream_charge_r which does
   the accounting for a new receive skb but does not hook up
   the destructor

4) unlink and free SKB not received to userspace yet -->
   sk_stream_eat_skb()

... and then I noticed that ip_rcv() unshares the SKB, so
skb->users should always be 1 and therefore it is impossible
for sk_stream_rfree() to be called from a non socket locked
context.

And all of this was a wild goose chase, oh well :-)
So the BUG_ON(!sk_forward_alloc) problem is elsewhere. :-/

diff --git a/arch/x86_64/kernel/functionlist b/arch/x86_64/kernel/functionlist
index 2bcebdc..da3379a 100644
--- a/arch/x86_64/kernel/functionlist
+++ b/arch/x86_64/kernel/functionlist
@@ -751,7 +751,6 @@
 *(.text.strcmp)
 *(.text.steal_locks)
 *(.text.sock_create)
-*(.text.sk_stream_rfree)
 *(.text.sk_stream_mem_schedule)
 *(.text.skip_atoi)
 *(.text.sk_alloc)
diff --git a/include/net/sock.h b/include/net/sock.h
index af2b054..54d5a2f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -442,14 +442,56 @@ static inline int sk_stream_memory_free(
 	return sk->sk_wmem_queued < sk->sk_sndbuf;
 }
 
-extern void sk_stream_rfree(struct sk_buff *skb);
-
-static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
+/* Charge SKB as receive queue memory to SK.  The socket must be locked
+ * when we get here.
+ */
+static inline void sk_stream_charge_r(struct sk_buff *skb, struct sock *sk)
 {
-	skb->sk = sk;
-	skb->destructor = sk_stream_rfree;
 	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
 	sk->sk_forward_alloc -= skb->truesize;
+}
+
+/* Release SKB as receive queue memory from SK.  The socket must be
+ * locked when we get here.
+ */
+static inline void sk_stream_release_r(struct sk_buff *skb, struct sock *sk)
+{
+	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+	sk->sk_forward_alloc += skb->truesize;
+}
+
+/**
+ * sk_stream_eat_skb - Release a skb if it is no longer needed
+ * @sk: socket to eat this skb from
+ * @skb: socket buffer to eat
+ * @queue: skb queue to unlink from
+ *
+ * This routine must be called with interrupts disabled or with the socket
+ * locked so that the sk_buff queue operation is ok.
+ */
+static inline void sk_stream_eat_skb(struct sk_buff *skb, struct sock *sk, struct sk_buff_head *queue)
+{
+	__skb_unlink(skb, queue);
+	sk_stream_release_r(skb, sk);
+	__kfree_skb(skb);
+}
+
+/*
+ * sk_stream_eat_queue - Release an entire queue of skbs, when no longer needed
+ * @sk: socket to eat these skbs from
+ * @queue: queue of skbs to eat
+ *
+ * This routine must be called with interrupts disabled or with the socket
+ * locked so that the sk_buff queue operation is ok.
+ */
+static inline void sk_stream_eat_queue(struct sock *sk, struct sk_buff_head *queue)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(queue)) != NULL) {
+		sk_stream_release_r(skb, sk);
+		__kfree_skb(skb);
+	}
 }
 
 static inline void sk_stream_free_skb(struct sock *sk, struct sk_buff *skb)
diff --git a/net/core/stream.c b/net/core/stream.c
index 35e2525..a034f2d 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -172,16 +172,6 @@ do_interrupted:
 
 EXPORT_SYMBOL(sk_stream_wait_memory);
 
-void sk_stream_rfree(struct sk_buff *skb)
-{
-	struct sock *sk = skb->sk;
-
-	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
-	sk->sk_forward_alloc += skb->truesize;
-}
-
-EXPORT_SYMBOL(sk_stream_rfree);
-
 int sk_stream_error(struct sock *sk, int flags, int err)
 {
 	if (err == -EPIPE)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 87f68e7..c45a9c2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1072,11 +1072,11 @@ int tcp_read_sock(struct sock *sk, read_
 				break;
 		}
 		if (skb->h.th->fin) {
-			sk_eat_skb(sk, skb);
+			sk_stream_eat_skb(skb, sk, &sk->sk_receive_queue);
 			++seq;
 			break;
 		}
-		sk_eat_skb(sk, skb);
+		sk_stream_eat_skb(skb, sk, &sk->sk_receive_queue);
 		if (!desc->count)
 			break;
 	}
@@ -1355,15 +1355,17 @@ skip_copy:
 
 		if (skb->h.th->fin)
 			goto found_fin_ok;
-		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb);
+		if (!(flags & MSG_PEEK)) {
+			sk_stream_eat_skb(skb, sk, &sk->sk_receive_queue);
+		}
 		continue;
 
 	found_fin_ok:
 		/* Process the FIN. */
 		++*seq;
-		if (!(flags & MSG_PEEK))
-			sk_eat_skb(sk, skb);
+		if (!(flags & MSG_PEEK)) {
+			sk_stream_eat_skb(skb, sk, &sk->sk_receive_queue);
+		}
 		break;
 	} while (len > 0);
 
@@ -1489,6 +1491,7 @@ void tcp_close(struct sock *sk, long tim
 		u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
 			  skb->h.th->fin;
 		data_was_unread += len;
+		sk_stream_release_r(skb, sk);
 		__kfree_skb(skb);
 	}
 
@@ -1650,9 +1653,9 @@ int tcp_disconnect(struct sock *sk, int 
 		sk->sk_err = ECONNRESET;
 
 	tcp_clear_xmit_timers(sk);
-	__skb_queue_purge(&sk->sk_receive_queue);
+	sk_stream_eat_queue(sk, &sk->sk_receive_queue);
 	sk_stream_writequeue_purge(sk);
-	__skb_queue_purge(&tp->out_of_order_queue);
+	sk_stream_eat_queue(sk, &tp->out_of_order_queue);
 
 	inet->dport = 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9f0cca4..5a30648 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2875,7 +2875,7 @@ static void tcp_fin(struct sk_buff *skb,
 	/* It _is_ possible, that we have something out-of-order _after_ FIN.
 	 * Probably, we should reset in this case. For now drop them.
 	 */
-	__skb_queue_purge(&tp->out_of_order_queue);
+	sk_stream_eat_queue(sk, &tp->out_of_order_queue);
 	if (tp->rx_opt.sack_ok)
 		tcp_sack_reset(&tp->rx_opt);
 	sk_stream_mem_reclaim(sk);
@@ -3093,8 +3093,7 @@ static void tcp_ofo_queue(struct sock *s
 
 		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
 			SOCK_DEBUG(sk, "ofo packet was already received \n");
-			__skb_unlink(skb, &tp->out_of_order_queue);
-			__kfree_skb(skb);
+			sk_stream_eat_skb(skb, sk, &tp->out_of_order_queue);
 			continue;
 		}
 		SOCK_DEBUG(sk, "ofo requeuing : rcv_next %X seq %X - %X\n",
@@ -3166,7 +3165,7 @@ queue_and_out:
 				    !sk_stream_rmem_schedule(sk, skb))
 					goto drop;
 			}
-			sk_stream_set_owner_r(skb, sk);
+			sk_stream_charge_r(skb, sk);
 			__skb_queue_tail(&sk->sk_receive_queue, skb);
 		}
 		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
@@ -3248,7 +3247,7 @@ drop:
 	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
 		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
 
-	sk_stream_set_owner_r(skb, sk);
+	sk_stream_charge_r(skb, sk);
 
 	if (!skb_peek(&tp->out_of_order_queue)) {
 		/* Initial out of order segment, build 1 SACK. */
@@ -3290,6 +3289,7 @@ drop:
 		    before(seq, TCP_SKB_CB(skb1)->end_seq)) {
 			if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
 				/* All the bits are present. Drop. */
+				sk_stream_release_r(skb, sk);
 				__kfree_skb(skb);
 				tcp_dsack_set(tp, seq, end_seq);
 				goto add_sack;
@@ -3311,9 +3311,8 @@ drop:
 			       tcp_dsack_extend(tp, TCP_SKB_CB(skb1)->seq, end_seq);
 			       break;
 		       }
-		       __skb_unlink(skb1, &tp->out_of_order_queue);
 		       tcp_dsack_extend(tp, TCP_SKB_CB(skb1)->seq, TCP_SKB_CB(skb1)->end_seq);
-		       __kfree_skb(skb1);
+		       sk_stream_eat_skb(skb1, sk, &tp->out_of_order_queue);
 		}
 
 add_sack:
@@ -3340,8 +3339,7 @@ tcp_collapse(struct sock *sk, struct sk_
 		/* No new bits? It is possible on ofo queue. */
 		if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
 			struct sk_buff *next = skb->next;
-			__skb_unlink(skb, list);
-			__kfree_skb(skb);
+			sk_stream_eat_skb(skb, sk, list);
 			NET_INC_STATS_BH(LINUX_MIB_TCPRCVCOLLAPSED);
 			skb = next;
 			continue;
@@ -3387,7 +3385,7 @@ tcp_collapse(struct sock *sk, struct sk_
 		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
 		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
 		__skb_insert(nskb, skb->prev, skb, list);
-		sk_stream_set_owner_r(nskb, sk);
+		sk_stream_charge_r(nskb, sk);
 
 		/* Copy data, releasing collapsed skbs. */
 		while (copy > 0) {
@@ -3405,8 +3403,7 @@ tcp_collapse(struct sock *sk, struct sk_
 			}
 			if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
 				struct sk_buff *next = skb->next;
-				__skb_unlink(skb, list);
-				__kfree_skb(skb);
+				sk_stream_eat_skb(skb, sk, list);
 				NET_INC_STATS_BH(LINUX_MIB_TCPRCVCOLLAPSED);
 				skb = next;
 				if (skb == tail || skb->h.th->syn || skb->h.th->fin)
@@ -3494,7 +3491,7 @@ static int tcp_prune_queue(struct sock *
 	/* First, purge the out_of_order queue. */
 	if (!skb_queue_empty(&tp->out_of_order_queue)) {
 		NET_INC_STATS_BH(LINUX_MIB_OFOPRUNED);
-		__skb_queue_purge(&tp->out_of_order_queue);
+		sk_stream_eat_queue(sk, &tp->out_of_order_queue);
 
 		/* Reset SACK state.  A conforming SACK implementation will
 		 * do the same at a timeout based retransmit.  When a connection
@@ -3703,10 +3700,8 @@ static void tcp_check_urg(struct sock * 
 	    tp->copied_seq != tp->rcv_nxt) {
 		struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);
 		tp->copied_seq++;
-		if (skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq)) {
-			__skb_unlink(skb, &sk->sk_receive_queue);
-			__kfree_skb(skb);
-		}
+		if (skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq))
+			sk_stream_eat_skb(skb, sk, &sk->sk_receive_queue);
 	}
 
 	tp->urg_data   = TCP_URG_NOTYET;
@@ -3950,7 +3945,7 @@ int tcp_rcv_established(struct sock *sk,
 				/* Bulk data transfer: receiver */
 				__skb_pull(skb,tcp_header_len);
 				__skb_queue_tail(&sk->sk_receive_queue, skb);
-				sk_stream_set_owner_r(skb, sk);
+				sk_stream_charge_r(skb, sk);
 				tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 			}
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 672950e..e518d97 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1294,7 +1294,7 @@ int tcp_v4_destroy_sock(struct sock *sk)
   	sk_stream_writequeue_purge(sk);
 
 	/* Cleans up our, hopefully empty, out_of_order_queue. */
-  	__skb_queue_purge(&tp->out_of_order_queue);
+	sk_stream_eat_queue(sk, &tp->out_of_order_queue);
 
 	/* Clean prequeue, it must be empty really */
 	__skb_queue_purge(&tp->ucopy.prequeue);

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-15  3:59 fixing sk_stream_rfree() David S. Miller
  2006-04-15  8:23 ` David S. Miller
@ 2006-04-16 11:18 ` Herbert Xu
  2006-04-17  5:32   ` David S. Miller
  2006-04-18 11:49 ` Ingo Oeser
  2 siblings, 1 reply; 12+ messages in thread
From: Herbert Xu @ 2006-04-16 11:18 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

Hi David:

I've been flying on and off for the last two days so I only saw this now.

On Fri, Apr 14, 2006 at 08:59:27PM -0700, David S. Miller wrote:
> 
> Herbert, as you may have noticed we found some missing
> locking in sk_stream_rfree().  It could explain the
> "!sk_forward_alloc" BUG() we thought was caused by e1000's
> TSO implementation and the Intel folks have provided enough
> datapoints to prove that it is indeed not an e1000 specific
> problem.

Great eyes! Yes that bug seems to have been around forever.  I'll
look over the patch tomorrow as right now I'm still on west-coast
time :)

I have one niggling doubt though as to why turning off TSO seems
to cure the problem for people.  Perhaps it'll become clearer
tomorrow morning when I look at it again :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-16 11:18 ` Herbert Xu
@ 2006-04-17  5:32   ` David S. Miller
  2006-04-17  6:17     ` Herbert Xu
  0 siblings, 1 reply; 12+ messages in thread
From: David S. Miller @ 2006-04-17  5:32 UTC (permalink / raw)
  To: herbert; +Cc: netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 16 Apr 2006 21:18:31 +1000

> Great eyes! Yes that bug seems to have been around forever.  I'll
> look over the patch tomorrow as right now I'm still on west-coast
> time :)

Let me save you some time, later in the thread you'll find
out that this whole thing is a dead end.

The reason is that all SKBs are unshared when we receive them
into the TCP stack for sk->sk_receive_queue.  (this is done
by ip_rcv()).

This means every __kfree_skb() call would happen in the context
of the TCP stack with the socket locked, so no concurrency issues
wrt. sk_forward_alloc as I originally thought.

In fact, this is what makes using __kfree_skb() explicitly legal
in the first place.  If skb->users could be non-zero, from tcpdump
etc., then we'd be required to call kfree_skb() in the TCP code
that frees up used packets on the receive queue.

But all is not lost, this is an important datapoint.  We've audited
the receive side accesses to sk_forward_alloc and it all looks good.

And again, as you mention, we go back again to TSO being the deciding
factor in triggering this or not.

So it nearly has to be a send side issue that can only trigger with
TSO enabled, and my next planned chore is to audit the TSO splitting
during ACK processing.  We may be doing something stupid there.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-17  5:32   ` David S. Miller
@ 2006-04-17  6:17     ` Herbert Xu
  2006-04-17 20:27       ` Jesse Brandeburg
  2006-04-17 20:29       ` David S. Miller
  0 siblings, 2 replies; 12+ messages in thread
From: Herbert Xu @ 2006-04-17  6:17 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

On Sun, Apr 16, 2006 at 10:32:03PM -0700, David S. Miller wrote:
> 
> Let me save you some time, later in the thread you'll find
> out that this whole thing is a dead end.

Thanks.  I even read the message but managed to miss your conclusion :)

> So it nearly has to be a send side issue that can only trigger with
> TSO enabled, and my next planned chore is to audit the TSO splitting
> during ACK processing.  We may be doing something stupid there.

BTW, do we have a dmesg log from the machine that did this with tg3?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-17  6:17     ` Herbert Xu
@ 2006-04-17 20:27       ` Jesse Brandeburg
  2006-04-17 20:29       ` David S. Miller
  1 sibling, 0 replies; 12+ messages in thread
From: Jesse Brandeburg @ 2006-04-17 20:27 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, netdev

On 4/16/06, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Sun, Apr 16, 2006 at 10:32:03PM -0700, David S. Miller wrote:
> >
> > Let me save you some time, later in the thread you'll find
> > out that this whole thing is a dead end.
>
> Thanks.  I even read the message but managed to miss your conclusion :)
>
> > So it nearly has to be a send side issue that can only trigger with
> > TSO enabled, and my next planned chore is to audit the TSO splitting
> > during ACK processing.  We may be doing something stupid there.
>
> BTW, do we have a dmesg log from the machine that did this with tg3?

Unfortunately doesn't seem to be any difference between the failures. 
Here were the lines:
2.6.16 kernel

KERNEL: assertion (!sk->sk_forward_alloc) failed at /net/core/stream.c (283)
KERNEL: assertion (!sk->sk_forward_alloc) failed at /net/ipv4/af_inet.c (150)

I'll be glad to test anything to help nail this down.  Herbert did you
see that we used tbench (from dbench package) to reproduce this
reliably?

Jesse

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-17  6:17     ` Herbert Xu
  2006-04-17 20:27       ` Jesse Brandeburg
@ 2006-04-17 20:29       ` David S. Miller
  2006-04-17 21:09         ` Jesse Brandeburg
  1 sibling, 1 reply; 12+ messages in thread
From: David S. Miller @ 2006-04-17 20:29 UTC (permalink / raw)
  To: herbert; +Cc: netdev, jesse.brandeburg

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 17 Apr 2006 16:17:00 +1000

> On Sun, Apr 16, 2006 at 10:32:03PM -0700, David S. Miller wrote:
> > So it nearly has to be a send side issue that can only trigger with
> > TSO enabled, and my next planned chore is to audit the TSO splitting
> > during ACK processing.  We may be doing something stupid there.
> 
> BTW, do we have a dmesg log from the machine that did this with tg3?

I don't know, Jesse do you have it?

Jesse, my previously lead was a dead end, we're looking into other
possibilities, so if you have the kernel log from reproducing this
BUG(!sk_forward_alloc) problem with tg3 it would be very valuable at
this point.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-17 20:29       ` David S. Miller
@ 2006-04-17 21:09         ` Jesse Brandeburg
  2006-04-17 21:40           ` Herbert Xu
  0 siblings, 1 reply; 12+ messages in thread
From: Jesse Brandeburg @ 2006-04-17 21:09 UTC (permalink / raw)
  To: David S. Miller; +Cc: herbert, netdev, Brandeburg, Jesse

[-- Attachment #1: Type: TEXT/PLAIN, Size: 908 bytes --]


On Mon, 17 Apr 2006, David S. Miller wrote:

> 
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 17 Apr 2006 16:17:00 +1000
> 
> > On Sun, Apr 16, 2006 at 10:32:03PM -0700, David S. Miller wrote:
> > > So it nearly has to be a send side issue that can only trigger with
> > > TSO enabled, and my next planned chore is to audit the TSO splitting
> > > during ACK processing.  We may be doing something stupid there.
> >
> > BTW, do we have a dmesg log from the machine that did this with tg3?
> 
> I don't know, Jesse do you have it?
> 
> Jesse, my previously lead was a dead end, we're looking into other
> possibilities, so if you have the kernel log from reproducing this
> BUG(!sk_forward_alloc) problem with tg3 it would be very valuable at
> this point.

we explicitly enabled TSO before performing the test.
attached bz2 due to size:

Let me know if there is anything else I can do,
Jesse

[-- Attachment #2: Type: APPLICATION/octet-stream, Size: 6806 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-17 21:09         ` Jesse Brandeburg
@ 2006-04-17 21:40           ` Herbert Xu
  2006-04-17 21:56             ` Jesse Brandeburg
  0 siblings, 1 reply; 12+ messages in thread
From: Herbert Xu @ 2006-04-17 21:40 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: David S. Miller, netdev

On Mon, Apr 17, 2006 at 02:09:43PM -0700, Jesse Brandeburg wrote:
> 
> we explicitly enabled TSO before performing the test.
> attached bz2 due to size:

Thanks a lot.  Are those promiscuous mode messages caused by tcpdump?
If so could you try it without doing a tcpdump or any other af_packet
applications?

> Let me know if there is anything else I can do,

Since you can reproduce this at will now, please modify one of those
warning messages to print out the actual value of sk_forward_alloc.
It can gives a clue.  For instance if it happens to be a multiple of
PAGE_SIZE, then we know it may be related to the reclaim stuff.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-17 21:40           ` Herbert Xu
@ 2006-04-17 21:56             ` Jesse Brandeburg
  0 siblings, 0 replies; 12+ messages in thread
From: Jesse Brandeburg @ 2006-04-17 21:56 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Brandeburg, Jesse, David S. Miller, netdev



On Mon, 17 Apr 2006, Herbert Xu wrote:

> 
> On Mon, Apr 17, 2006 at 02:09:43PM -0700, Jesse Brandeburg wrote:
> >
> > we explicitly enabled TSO before performing the test.
> > attached bz2 due to size:
> 
> Thanks a lot.  Are those promiscuous mode messages caused by tcpdump?
> If so could you try it without doing a tcpdump or any other af_packet
> applications?

yes, the promiscuous up/down messages are from tcpdump. We were checking 
whether TSO was enabled by looking at the traffic. The interface was NOT 
in promiscuous mode while running the test.

> > Let me know if there is anything else I can do,
> 
> Since you can reproduce this at will now, please modify one of those
> warning messages to print out the actual value of sk_forward_alloc.
> It can gives a clue.  For instance if it happens to be a multiple of
> PAGE_SIZE, then we know it may be related to the reclaim stuff.

will do.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fixing sk_stream_rfree()
  2006-04-15  3:59 fixing sk_stream_rfree() David S. Miller
  2006-04-15  8:23 ` David S. Miller
  2006-04-16 11:18 ` Herbert Xu
@ 2006-04-18 11:49 ` Ingo Oeser
  2 siblings, 0 replies; 12+ messages in thread
From: Ingo Oeser @ 2006-04-18 11:49 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

Hi David,

You wrote:
> 2) We can't turn sk_forward_alloc easily into an atomic_t.  The
>    masking operation in __sk_stream_mem_reclaim() does not translate
>    readily into an atomic_t op:
> 
> 		sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1;
> 
>    That line has always driven me nuts, but I know that it is there
>    to handle partial page allocations existing when that function
>    is called.
> 
> I guess for #2 we could change those two lines into:
> 
> 		int n = atomic_read(&sk->sk_forward_alloc) /
> 				SK_STREAM_MEM_QUANTUM;
> 
> 		atomic_sub(n, sk->sk_prot->memory_allocated);
> 		sk->sk_forward_alloc -= n * SK_STREAM_MEM_QUANTUM;
> 
> in order to make it "atomic_t op" translatable.

It is possible. Just view this as an register with size SK_STREAM_MEM_QUANTUM.

Mask on every read and always after the read.
Always add/subtract the intended value.
Wraparound happens within the modulo/mask value.

This works without problems as long as SK_STREAM_MEM_QUANTUM 
is a power of two. If not, just use modulo arithmetics.


The non-atomic wrapround doesn't change the atomic nature of
read, add, sub. Only problem is atomic_dec_and_test() or similiar ops.


Regards

Ingo Oeser

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2006-04-18 11:49 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-15  3:59 fixing sk_stream_rfree() David S. Miller
2006-04-15  8:23 ` David S. Miller
2006-04-15  9:03   ` David S. Miller
2006-04-16 11:18 ` Herbert Xu
2006-04-17  5:32   ` David S. Miller
2006-04-17  6:17     ` Herbert Xu
2006-04-17 20:27       ` Jesse Brandeburg
2006-04-17 20:29       ` David S. Miller
2006-04-17 21:09         ` Jesse Brandeburg
2006-04-17 21:40           ` Herbert Xu
2006-04-17 21:56             ` Jesse Brandeburg
2006-04-18 11:49 ` Ingo Oeser

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).