Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory useage and napi jitter
From: Eric Dumazet @ 2011-10-27 22:55 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, David S. Miller
In-Reply-To: <1319745221-30880-1-git-send-email-nhorman@tuxdriver.com>

Le jeudi 27 octobre 2011 à 15:53 -0400, Neil Horman a écrit :
> I had this idea awhile ago while I was looking at the receive path for multicast
> frames.   The top of the mcast recieve path (in __udp4_lib_mcast_deliver, has a
> loop in which we traverse a hash list linearly, looking for sockets that are
> listening to a given multicast group.  For each matching socket we clone the skb
> to enqueue it to the corresponding socket.  This creates two problems:
> 
> 1) Application driven jitter in the receive path
>    As you add processes that listen to the same multcast group, you increase the
> number of iterations you have to preform in this loop, which can lead to
> increases in the amount of time you spend processing each frame in softirq
> context, expecially if you are memory constrained, and the skb_clone operation
> has to call all the way back into the buddy allocator for more ram.  This can
> lead to needlessly dropped frames as rx latency increases in the stack.
> 

Hmm... time to perform this loop not depends on memory constraints,
since GFP_ATOMIC allocations are done. It succeed or not, immediately.

Time is consumed on the copy of the skb head, and refcnt
increases/decreases on datarefcnt. Your patch doesnt avoid this.

When application calls recvmsg() we then perform the two atomics on skb
refcnt and data refcnt and free them, with cache line false sharing...

> 2) Increased memory usage
>    As you increase the number of listeners to a multicast group, you directly
> increase the number of times you clone and skb, putting increased memory
> pressure on the system.
> 

One skb_head is about 256 bytes (232 bytes on 64bit arches)

> while neither of these problems is a huge concern, I thought it would be nice if
> we could mitigate the effects of increased application instances on performance
> in this area.  As such I came up with this patch set.  I created a new skb
> fclone type called FCLONE_SCRATCH.  When available, it commandeers the
> internally fragmented space of an skb data buffer and uses that to allocate
> additional skbs during the clone operation. Since the skb->data area is
> allocated with a kmalloc operation (and is therefore nominally a power of 2 in
> size), and nominally network interfaces tend to have an mtu of around 1500
> bytes, we typically can reclaim several hundred bytes of space at the end of an
> skb (more if the incomming packet is not a full MTU in size).  This space, being
> exclusively accessible to the softirq doing the reclaim, can be quickly accesed
> without the need for additional locking, potntially providing lower jitter in
> napi context per frame during a receive operation, as well as some memory
> savings.
> 
> I'm still collecting stats on its performance, but I thought I would post now to
> get some early reviews and feedback on it.
> 

I really doubt you'll find a significative performance increase.

I do believe its a too complex : skb code is already a nightmare if you
ask me.

And your hack/idea wont work quite well if you have 8 receivers for each
frame.

What about finding another way to queue one skb to N receive queue(s),
so that several multicast sockets can share same skb head ?

I always found sk_receive queue being very inefficient, since a queue or
dequeue must dirty a lot of cache lines.

This forces us to use a spinlock to protect queue/enqueue operations,
but also the socket lock (because of the MSG_PEEK stuff and
sk_rmem_alloc / sk_forward_alloc)

sk_receive_queue.lock is the real jitter source.

Ideally, we could have a fast path using a small circular array per
socket, of say 8 or 16 pointers to skbs, or allow application or
sysadmin to size this array.

A circular buffer can be handled without any lock, using atomic
operations (cmpxchg()) on load/unload indexes. The array of pointers is
written only by the softirq handler cpu, read by consumers.

Since this array is small [and finite size], and skb shared, we dont
call skb_set_owner_r() anymore, avoiding expensive atomic ops on
sk->sk_rmem_alloc.

UDP receive path could become lockless, allowing the softirq handler to
run without being slowed down by concurrent recvmsg()

At recvmsg() time, N-1 threads would only perform the skb->refcnt
decrement, and the last one would free the skb and data as well.

^ permalink raw reply

* [RFC] tcp: Export TCP Delayed ACK parameters to user
From: Daniel Baluta @ 2011-10-27 23:07 UTC (permalink / raw)
  To: davem; +Cc: kuznet, jmorris, yoshfuji, kaber, netdev, eric.dumazet,
	Daniel Baluta

RFC2581 ($4.2) specifies when an ACK should be generated as follows:

" .. an ACK SHOULD be generated for at least every second
  full-sized segment, and MUST be generated within 500 ms
  of the arrival of the first unacknowledged packet.
"

We export the number of segments and the timeout limits
specified above, so that a user can tune them according
to its needs.

Specifically:
	* /proc/sys/net/ipv4/tcp_delack_segs, represents
	the threshold for the number of segments.
	* /proc/sys/net/ipv4/tcp_delack_min, specifies
	the minimum timeout value
	* /proc/sys/net/ipv4/tcp_delack_max, specifies
	the maximum timeout value.

Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
---
 include/net/tcp.h          |   20 +++++++++++++++++---
 net/ipv4/sysctl_net_ipv4.c |   21 +++++++++++++++++++++
 net/ipv4/tcp.c             |    5 +++--
 net/ipv4/tcp_input.c       |    7 +++++--
 net/ipv4/tcp_output.c      |    4 +++-
 5 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e147f42..f3b0c17 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -111,14 +111,21 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 				  * TIME-WAIT timer.
 				  */
 
-#define TCP_DELACK_MAX	((unsigned)(HZ/5))	/* maximal time to delay before sending an ACK */
+/* default maximum time to delay before sending an ACK */
+#define TCP_DELACK_MAX_DEFAULT	((unsigned)(HZ/5))
+
 #if HZ >= 100
-#define TCP_DELACK_MIN	((unsigned)(HZ/25))	/* minimal time to delay before sending an ACK */
+/* default minimum time to delay before sending an ACK */
+#define TCP_DELACK_MIN_DEFAULT	((unsigned)(HZ/25))
 #define TCP_ATO_MIN	((unsigned)(HZ/25))
 #else
-#define TCP_DELACK_MIN	4U
+#define TCP_DELACK_MIN_DEFAULT	4U
 #define TCP_ATO_MIN	4U
 #endif
+
+#define TCP_DELACK_MIN sysctl_tcp_delack_min
+#define TCP_DELACK_MAX sysctl_tcp_delack_max
+
 #define TCP_RTO_MAX	((unsigned)(120*HZ))
 #define TCP_RTO_MIN	((unsigned)(HZ/5))
 #define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))	/* RFC2988bis initial RTO value	*/
@@ -251,6 +258,9 @@ extern int sysctl_tcp_max_ssthresh;
 extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_delack_segs;
+extern int sysctl_tcp_delack_min;
+extern int sysctl_tcp_delack_max;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -1557,6 +1567,10 @@ static inline struct tcp_extend_values *tcp_xv(struct request_values *rvp)
 {
 	return (struct tcp_extend_values *)rvp;
 }
+static inline int tcp_snd_thresh(struct sock *sk)
+{
+	return inet_csk(sk)->icsk_ack.rcv_mss * sysctl_tcp_delack_segs;
+}
 
 extern void tcp_v4_init(void);
 extern void tcp_init(void);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 69fd720..c22c4c5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -639,6 +639,27 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_delack_segs",
+		.data		= &sysctl_tcp_delack_segs,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
+		.procname	= "tcp_delack_min",
+		.data		= &sysctl_tcp_delack_min,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_ms_jiffies
+	},
+	{
+		.procname	= "tcp_delack_max",
+		.data		= &sysctl_tcp_delack_max,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_ms_jiffies
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 34f5db1..0aad29b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1204,8 +1204,9 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		   /* Delayed ACKs frequently hit locked sockets during bulk
 		    * receive. */
 		if (icsk->icsk_ack.blocked ||
-		    /* Once-per-two-segments ACK was not sent by tcp_input.c */
-		    tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
+		    /* More than once-per-tcp_delack_segs-segments ACK
+		     * was not sent by tcp_input.c */
+		    tp->rcv_nxt - tp->rcv_wup > tcp_snd_thresh(sk) ||
 		    /*
 		     * If this read emptied read buffer, we send ACK, if
 		     * connection is not bidirectional, user drained
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 52b5c2d..1e02a80 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -98,6 +98,9 @@ int sysctl_tcp_thin_dupack __read_mostly;
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_abc __read_mostly;
 
+int sysctl_tcp_delack_segs __read_mostly = 1;
+
+
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
 #define FLAG_DATA_ACKED		0x04 /* This ACK acknowledged new data.		*/
@@ -4993,8 +4996,8 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	    /* More than one full frame received... */
-	if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
+	    /* More than tcp_delack_segs full frame(s) received... */
+	if (((tp->rcv_nxt - tp->rcv_wup) > tcp_snd_thresh(sk) &&
 	     /* ... and right edge of window advances far enough.
 	      * (tcp_recvmsg() will send ACK otherwise). Or...
 	      */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 980b98f..0ec31af 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -63,6 +63,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
 EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
 
+int sysctl_tcp_delack_min __read_mostly = TCP_DELACK_MIN_DEFAULT;
+int sysctl_tcp_delack_max __read_mostly = TCP_DELACK_MAX_DEFAULT;
 
 /* Account for new data that has been sent to the network. */
 static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
@@ -2685,7 +2687,7 @@ void tcp_send_delayed_ack(struct sock *sk)
 		 * directly.
 		 */
 		if (tp->srtt) {
-			int rtt = max(tp->srtt >> 3, TCP_DELACK_MIN);
+			int rtt = max_t(unsigned, tp->srtt >> 3, TCP_DELACK_MIN);
 
 			if (rtt < max_ato)
 				max_ato = rtt;
-- 
1.7.2.5

^ permalink raw reply related

* Re: [PATCH net-next] bnx2x: Disable LRO on FCoE or iSCSI boot device
From: Michael Chan @ 2011-10-27 23:30 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, netdev@vger.kernel.org, Dmitry Kravkov,
	Eilon Greenstein
In-Reply-To: <4E9F38D6.6030700@intel.com>

On Wed, 2011-10-19 at 13:53 -0700, John Fastabend wrote:
> As a reference point this works fine in both FCoE and iSCSI stacks
> today. The device is reset or link is lost for whatever reason
> when the link comes back up the stack logs back in, enumerates
> the luns and the scsi stack recovers as expected.
> 
> Firmware should do the equivalent login, lun enumeration, etc as
> needed.

Just a quick follow-up on this issue.  Our firmware actually performs
the same logout before the reset and login after the reset.  For iSCSI,
the problem on our device was actually caused by our userspace daemon
logging events to a log file in the root fs.  The file I/O was blocked
and the daemon could not proceed to do the important operations during
the reset, and this caused filesystem I/O errors.  We have now fixed the
problem in the userspace daemon.

For FCoE, there is no logging issue and the root fs failure seems to
happen only in a multipath configuration with all paths going down for a
short time (caused by reset in this case).  We believe this also affects
other devices and not just ours.  We are now working with the multipath
maintainer to understand this issue.

So this confirms that the original patch for bnx2x is not needed.

Thanks.

^ permalink raw reply

* Re: [RFC] tcp: Export TCP Delayed ACK parameters to user
From: Eric Dumazet @ 2011-10-28  0:01 UTC (permalink / raw)
  To: Daniel Baluta; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1319756841-2051-1-git-send-email-dbaluta@ixiacom.com>

Le vendredi 28 octobre 2011 à 02:07 +0300, Daniel Baluta a écrit :
> RFC2581 ($4.2) specifies when an ACK should be generated as follows:
> 
> " .. an ACK SHOULD be generated for at least every second
>   full-sized segment, and MUST be generated within 500 ms
>   of the arrival of the first unacknowledged packet.
> "
> 
> We export the number of segments and the timeout limits
> specified above, so that a user can tune them according
> to its needs.
> 

Well, this requires user has a machine exclusive use :)

> Specifically:
> 	* /proc/sys/net/ipv4/tcp_delack_segs, represents
> 	the threshold for the number of segments.
> 	* /proc/sys/net/ipv4/tcp_delack_min, specifies
> 	the minimum timeout value
> 	* /proc/sys/net/ipv4/tcp_delack_max, specifies
> 	the maximum timeout value.
> 


> Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
> ---
>  include/net/tcp.h          |   20 +++++++++++++++++---
>  net/ipv4/sysctl_net_ipv4.c |   21 +++++++++++++++++++++
>  net/ipv4/tcp.c             |    5 +++--
>  net/ipv4/tcp_input.c       |    7 +++++--
>  net/ipv4/tcp_output.c      |    4 +++-
>  5 files changed, 49 insertions(+), 8 deletions(-)
> 

Missing Documentation changes

> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index e147f42..f3b0c17 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -111,14 +111,21 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
>  				  * TIME-WAIT timer.
>  				  */
>  
> -#define TCP_DELACK_MAX	((unsigned)(HZ/5))	/* maximal time to delay before sending an ACK */
> +/* default maximum time to delay before sending an ACK */
> +#define TCP_DELACK_MAX_DEFAULT	((unsigned)(HZ/5))
> +
>  #if HZ >= 100
> -#define TCP_DELACK_MIN	((unsigned)(HZ/25))	/* minimal time to delay before sending an ACK */
> +/* default minimum time to delay before sending an ACK */
> +#define TCP_DELACK_MIN_DEFAULT	((unsigned)(HZ/25))
>  #define TCP_ATO_MIN	((unsigned)(HZ/25))
>  #else
> -#define TCP_DELACK_MIN	4U
> +#define TCP_DELACK_MIN_DEFAULT	4U
>  #define TCP_ATO_MIN	4U
>  #endif
> +
> +#define TCP_DELACK_MIN sysctl_tcp_delack_min
> +#define TCP_DELACK_MAX sysctl_tcp_delack_max

Hmm, please try to compile dccp as a module :)

You need some EXPORT_SYMBOL() definitions.

Frankly, I suggest removing TCP_DELACK_{MIN|MAX} to avoid unecessary
layer, and use sysctl_tcp_delack_{min|max} instead


> +
>  #define TCP_RTO_MAX	((unsigned)(120*HZ))
>  #define TCP_RTO_MIN	((unsigned)(HZ/5))
>  #define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))	/* RFC2988bis initial RTO value	*/
> @@ -251,6 +258,9 @@ extern int sysctl_tcp_max_ssthresh;
>  extern int sysctl_tcp_cookie_size;
>  extern int sysctl_tcp_thin_linear_timeouts;
>  extern int sysctl_tcp_thin_dupack;
> +extern int sysctl_tcp_delack_segs;
> +extern int sysctl_tcp_delack_min;
> +extern int sysctl_tcp_delack_max;
>  
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> @@ -1557,6 +1567,10 @@ static inline struct tcp_extend_values *tcp_xv(struct request_values *rvp)
>  {
>  	return (struct tcp_extend_values *)rvp;
>  }


> +static inline int tcp_snd_thresh(struct sock *sk)

I am not sure name is properly chosen, its about delack or not ?

const struct *sk

> +{
> +	return inet_csk(sk)->icsk_ack.rcv_mss * sysctl_tcp_delack_segs;
> +}
>  

Thanks !

^ permalink raw reply

* Re: [PATCH] xfrm: fix error checking in xfrm_output_gso
From: Yan, Zheng @ 2011-10-28  0:47 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: Herbert Xu, davem@davemloft.net
In-Reply-To: <4EA8ED1E.6080808@intel.com>

On 10/27/2011 01:33 PM, Yan, Zheng wrote:
> xfrm_output2() returns 1 on success. This bug makes xfrm_output_gso()
> drop all segments except the first one.
> 
> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
> ---
> diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> index 47bacd8..04e963a 100644
> --- a/net/xfrm/xfrm_output.c
> +++ b/net/xfrm/xfrm_output.c
> @@ -159,7 +159,7 @@ static int xfrm_output_gso(struct sk_buff *skb)
>  		segs->next = NULL;
>  		err = xfrm_output2(segs);
>  
> -		if (unlikely(err)) {
> +		if (unlikely(err < 0)) {
>  			while ((segs = nskb)) {
>  				nskb = segs->next;
>  				segs->next = NULL;

Return 1 means collision, please ignore this patch. Thanks

^ permalink raw reply

* [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Eric Dumazet @ 2011-10-28  1:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Andi Kleen, netdev, Andrew Morton

In commit 4e60c86bd9e (gcc-4.6: mm: fix unused but set warnings)
Andi forced VM_BUG_ON(cond) to evaluate cond, even if CONFIG_DEBUG_VM is
not set :

#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON(cond) BUG_ON(cond)
#else
#define VM_BUG_ON(cond) do { (void)(cond); } while (0)
#endif

As a side effect, get_page()/put_page_testzero() are performing more bus
transactions on contended cache line on some workloads (tcp on loopback
for example, where a page is acting as a shared buffer)

0,05 :  ffffffff815e4775:       je     ffffffff815e4970 <tcp_sendmsg+0xc80>
0,05 :  ffffffff815e477b:       mov    0x1c(%r9),%eax    // useless                  
3,32 :  ffffffff815e477f:       mov    (%r9),%rax        // useless                  
0,51 :  ffffffff815e4782:       lock incl 0x1c(%r9)                        
3,87 :  ffffffff815e4787:       mov    (%r9),%rax                          
0,00 :  ffffffff815e478a:       test   $0x80,%ah                           
0,00 :  ffffffff815e478d:       jne    ffffffff815e49f2 <tcp_sendmsg+0xd02>      

Of course, we have to understand why

(void) (atomic_read(&some_atomic) == 1);

generates asm code...

	mov  some_atomic,%eax

Ah yes, this is because of the volatile...

static inline int atomic_read(const atomic_t *v)
{
	return (*(volatile int *)&(v)->counter);
}

So maybe a fix would be to introduce an atomic_read_stable() variant ?

static inline int atomic_read_stable(const atomic_t *v)
{
	return v->counter;
}

Thanks !

^ permalink raw reply

* Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Andi Kleen @ 2011-10-28  1:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, linux-kernel, Andi Kleen, netdev, Andrew Morton
In-Reply-To: <1319764761.23112.14.camel@edumazet-laptop>

On Fri, Oct 28, 2011 at 03:19:21AM +0200, Eric Dumazet wrote:
> In commit 4e60c86bd9e (gcc-4.6: mm: fix unused but set warnings)
> Andi forced VM_BUG_ON(cond) to evaluate cond, even if CONFIG_DEBUG_VM is
> not set :
> 
> #ifdef CONFIG_DEBUG_VM
> #define VM_BUG_ON(cond) BUG_ON(cond)
> #else
> #define VM_BUG_ON(cond) do { (void)(cond); } while (0)
> #endif

Eventually the warnings were disabled in the Makefile.
So it would be reasonable to just revert that patch now, at least
for VM_BUG_ON, if it costs performance.

> 
> So maybe a fix would be to introduce an atomic_read_stable() variant ?
> 
> static inline int atomic_read_stable(const atomic_t *v)
> {
> 	return v->counter;
> }

Seems reasonable too. In fact we usually should have memory barriers
for this anyways which obsolete the volatile.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Linus Torvalds @ 2011-10-28  1:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Eric Dumazet, linux-kernel, netdev, Andrew Morton
In-Reply-To: <20111028012521.GF25795@one.firstfloor.org>

On Thu, Oct 27, 2011 at 6:25 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Seems reasonable too. In fact we usually should have memory barriers
> for this anyways which obsolete the volatile.

No we shouldn't. Memory barriers are insanely expensive, and pointless
for atomics - that aren't ordered anyway.

You may mean compiler barriers.

That said, removing the volatile entirely might be a good idea, and
never mind any barriers at all. The ordering for atomics really isn't
well enough specified that we should care. So I wouldn't object to a
patch that just removes the volatile entirely, but it would have to be
accompanied with quite a bit of testing, in case some odd case ends up
depending on it. But nothing *should* be looping on those things
anyway.

                       Linus

^ permalink raw reply

* Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory useage and napi jitter
From: Neil Horman @ 2011-10-28  1:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller
In-Reply-To: <1319756146.19125.42.camel@edumazet-laptop>

On Fri, Oct 28, 2011 at 12:55:46AM +0200, Eric Dumazet wrote:
> Le jeudi 27 octobre 2011 à 15:53 -0400, Neil Horman a écrit :
> > I had this idea awhile ago while I was looking at the receive path for multicast
> > frames.   The top of the mcast recieve path (in __udp4_lib_mcast_deliver, has a
> > loop in which we traverse a hash list linearly, looking for sockets that are
> > listening to a given multicast group.  For each matching socket we clone the skb
> > to enqueue it to the corresponding socket.  This creates two problems:
> > 
> > 1) Application driven jitter in the receive path
> >    As you add processes that listen to the same multcast group, you increase the
> > number of iterations you have to preform in this loop, which can lead to
> > increases in the amount of time you spend processing each frame in softirq
> > context, expecially if you are memory constrained, and the skb_clone operation
> > has to call all the way back into the buddy allocator for more ram.  This can
> > lead to needlessly dropped frames as rx latency increases in the stack.
> > 
> 
> Hmm... time to perform this loop not depends on memory constraints,
> since GFP_ATOMIC allocations are done. It succeed or not, immediately.
> 
Thats not entirely correct.  The time it takes to allocate a new object from the
slab can be very much affected by memory contraints.  Systems under memory
pressue may need to shrink slab caches, allocate from remote nodes, etc, all of
which contributes to additional allocation time.  This time is multiplied by the
number of sockets listening to a particular group.  I'm not saying its a huge
change, but its a change that I saw the ability to address.   This change (I
hope) tries, when able, to eliminate the jitter from that path.

> Time is consumed on the copy of the skb head, and refcnt
> increases/decreases on datarefcnt. Your patch doesnt avoid this.
> 
No it doesn't, you're correct.  I was really looking at lower hanging fruit with
this change.

> When application calls recvmsg() we then perform the two atomics on skb
> refcnt and data refcnt and free them, with cache line false sharing...
> 
> > 2) Increased memory usage
> >    As you increase the number of listeners to a multicast group, you directly
> > increase the number of times you clone and skb, putting increased memory
> > pressure on the system.
> > 
> 
> One skb_head is about 256 bytes (232 bytes on 64bit arches)
> 
Yes, thats right.  Depending on the size of the received frame I can allocate
anywhere from 1 to 4 additional skbs using otherwise unused memory at the tail
of the data segment with this patch.

> > while neither of these problems is a huge concern, I thought it would be nice if
> > we could mitigate the effects of increased application instances on performance
> > in this area.  As such I came up with this patch set.  I created a new skb
> > fclone type called FCLONE_SCRATCH.  When available, it commandeers the
> > internally fragmented space of an skb data buffer and uses that to allocate
> > additional skbs during the clone operation. Since the skb->data area is
> > allocated with a kmalloc operation (and is therefore nominally a power of 2 in
> > size), and nominally network interfaces tend to have an mtu of around 1500
> > bytes, we typically can reclaim several hundred bytes of space at the end of an
> > skb (more if the incomming packet is not a full MTU in size).  This space, being
> > exclusively accessible to the softirq doing the reclaim, can be quickly accesed
> > without the need for additional locking, potntially providing lower jitter in
> > napi context per frame during a receive operation, as well as some memory
> > savings.
> > 
> > I'm still collecting stats on its performance, but I thought I would post now to
> > get some early reviews and feedback on it.
> > 
> 
> I really doubt you'll find a significative performance increase.
> 
With the perf script I included in this set, I'm currently reducing the napi rx
path runtime by about 2 microsecond per iteration when multiple processes are
listening to the same group, and we can allocate skbs from the data area of the
received skb.  I'm not sure how accurate (or valuable) that is, but thats what
I'm getting.  While I've not measured, theres also the fact that kfree_skb doesn't
have any work to do with FCLONE_SCRATCH skbs, since they all get returned when
the data area gets kfreed.  Lastly, theres the savings of 256 bytes per
listening app that uses an FCLONE_SCRATCH skb, since that memory is already
reserved, and effectively free.  I can't really put a number on how significant
that is, but I think its valuable.


> I do believe its a too complex : skb code is already a nightmare if you
> ask me.
> 
I'm not sure I agree with that.  Its certainly complex in the sense that it adds
additional state to be checked in the skb_clone and kfree_skb paths, but beyond
that, it doesn't require any semantic changes to any of the other parts of the
skb api(s).  Is there something specific about this change that you find
unacceptibly complex, or something I can do to make it simpler?

> And your hack/idea wont work quite well if you have 8 receivers for each
> frame.
> 
It will work fine, to whatever degree its able, and this it will fall back on
the standard skb_clone behavior.  The performance gains will be reduced, of
course, which I think is what you are referring to, but the memory advantages
still exit.  If you have 8 receivers, but a given skb has only enough free space
at the end of the data segment to allocate 4 additional skbs, we'll allocate
those 4, and do a real kmem_cache_alloc of 4 more in skb_clone.  The last 4 have
to use the slab_allocation path as we otherwise would, so we're no worse off
than before, but we still get the memory savings (which would be about 1k per
received frame for this mcast group, which I think is significant).

> What about finding another way to queue one skb to N receive queue(s),
> so that several multicast sockets can share same skb head ?
> 
Yes, I'd experimented with some of this, and had mixed results.  Specifically I
came up with a way to use the additional space as an array of list heads, with
backpointers to the socket_queue each list_head was owned by and the skb that
owned the entry.  It was nice because it let me enqueue the same skb to hundreds
of sockets at the same time, which was really nice.  It was bad however, because
it completely broke socket accounting (any likely anything else that required on
any part of the socket state).  Any thoughts on ways I might improve on that.
If I could make some sort of reduced sk_buff so that I didn't have to allocate
all 256 bytes of a full sk_buff, that would be great!

> I always found sk_receive queue being very inefficient, since a queue or
> dequeue must dirty a lot of cache lines.
> 
> This forces us to use a spinlock to protect queue/enqueue operations,
> but also the socket lock (because of the MSG_PEEK stuff and
> sk_rmem_alloc / sk_forward_alloc)
> 
> sk_receive_queue.lock is the real jitter source.
> 
> Ideally, we could have a fast path using a small circular array per
> socket, of say 8 or 16 pointers to skbs, or allow application or
> sysadmin to size this array.
> 
> A circular buffer can be handled without any lock, using atomic
> operations (cmpxchg()) on load/unload indexes. The array of pointers is
> written only by the softirq handler cpu, read by consumers.
> 
> Since this array is small [and finite size], and skb shared, we dont
> call skb_set_owner_r() anymore, avoiding expensive atomic ops on
> sk->sk_rmem_alloc.
> 
> UDP receive path could become lockless, allowing the softirq handler to
> run without being slowed down by concurrent recvmsg()
> 
> At recvmsg() time, N-1 threads would only perform the skb->refcnt
> decrement, and the last one would free the skb and data as well.
> 
This seems like a reasonable idea, and I'm happy to experiment with it, but I
still like the idea of using that often available space at the end of an skb,
just because I think the memory savings is attractive.  Shall I roll them into
the same patch series, or do them separately?
Neil

> 
> 
> 

^ permalink raw reply

* Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Ben Hutchings @ 2011-10-28  1:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Eric Dumazet, linux-kernel, netdev, Andrew Morton
In-Reply-To: <CA+55aFy9oJSK99K=brRzSfSnLocO=PuS7yuKkAcw3t5hATJrfw@mail.gmail.com>

On Thu, 2011-10-27 at 18:34 -0700, Linus Torvalds wrote:
> On Thu, Oct 27, 2011 at 6:25 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > Seems reasonable too. In fact we usually should have memory barriers
> > for this anyways which obsolete the volatile.
> 
> No we shouldn't. Memory barriers are insanely expensive, and pointless
> for atomics - that aren't ordered anyway.
> 
> You may mean compiler barriers.
> 
> That said, removing the volatile entirely might be a good idea, and
> never mind any barriers at all. The ordering for atomics really isn't
> well enough specified that we should care. So I wouldn't object to a
> patch that just removes the volatile entirely, but it would have to be
> accompanied with quite a bit of testing, in case some odd case ends up
> depending on it. But nothing *should* be looping on those things
> anyway.

Whether or not it needs to provide any ordering guarantee, atomic_read()
must never read more than once, and I think that requires the volatile
qualification.  It might be clearer to use the ACCESS_ONCE macro,
however.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH net -v2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close
From: HAYASAKA Mitsuo @ 2011-10-28  1:52 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: netdev, Américo Wang, Stephen Hemminger, Andy Gospodarek,
	linux-kernel, yrl.pp-manager.tt
In-Reply-To: <2216.1319650260@death>

Hi Joy,

I checked your patch, and any problems have not been observed for
almost one day although I could reproduce the BUG in the latest net-next
without it. 

I think this patch works well.

Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>


(2011/10/27 2:31), Jay Vosburgh wrote:
> HAYASAKA Mitsuo <mitsuo.hayasaka.hu@hitachi.com> wrote:
> [...]
>> The interval of mii_mon was set to 1 to reproduce this bug easily and 
>> the 802.3ad mode was used. Then, I executed the following command.
>>
>> # while true; do ifconfig bond0 down; done &
>> # while true; do ifconfig bond0 up; done &
>>
>> This bug rarely occurs since it is the severe timing problem.
>> I found that it is more easily to reproduce this bug when using guest OS.
>>
>> For example, it took one to three days for me to reproduce it on host OS, 
>> but some hours on guest OS.
> 
> 	Could you test this patch and see if it resolves the problem?
> 
> 	This patch does a few things:
> 
> 	All of the monitor functions that run on work queues are
> modified to never unconditionally acquire RTNL; all will reschedule the
> work and return if rtnl_trylock fails.  This covers bond_mii_monitor,
> bond_activebackup_arp_mon, and bond_alb_monitor.
> 
> 	The "clear out the work queues" calls in bond_close and
> bond_uninit now call cancel_delayed_work_sync, which should either
> delete a pending work item, or wait for an executing item to complete.
> I chose cancel_ over the original patch's flush_ because we just want
> the work queue stopped.  We don't need to have any pending items execute
> if they're not already running.
> 
> 	Also in reference to the previous, I'm not sure if we still need
> to check for delayed_work_pending, but I've left those checks in place.
> 
> 	Remove the "kill_timers" field and all references to it.  If
> cancel_delayed_work_sync is safe to use, we do not need an extra
> sentinel.
> 
> 	Lastly, for testing purposes only, the bond_alb_monitor in this
> patch includes an unconditional call to rtnl_trylock(); this is an
> artifical way to make the race in that function easier to test for,
> because the real race is very difficult to hit.
> 
> 	This patch is against net-next as of yesterday.
> 
> 	Comments?
> 
> 	-J
> 
> 
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index b33c099..0ae0d7c 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -2110,9 +2110,6 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
>  
>  	read_lock(&bond->lock);
>  
> -	if (bond->kill_timers)
> -		goto out;
> -
>  	//check if there are any slaves
>  	if (bond->slave_cnt == 0)
>  		goto re_arm;
> @@ -2161,9 +2158,8 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
>  	}
>  
>  re_arm:
> -	if (!bond->kill_timers)
> -		queue_delayed_work(bond->wq, &bond->ad_work, ad_delta_in_ticks);
> -out:
> +	queue_delayed_work(bond->wq, &bond->ad_work, ad_delta_in_ticks);
> +
>  	read_unlock(&bond->lock);
>  }
>  
> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
> index d4fbd2e..13d1bf9 100644
> --- a/drivers/net/bonding/bond_alb.c
> +++ b/drivers/net/bonding/bond_alb.c
> @@ -1343,10 +1343,6 @@ void bond_alb_monitor(struct work_struct *work)
>  
>  	read_lock(&bond->lock);
>  
> -	if (bond->kill_timers) {
> -		goto out;
> -	}
> -
>  	if (bond->slave_cnt == 0) {
>  		bond_info->tx_rebalance_counter = 0;
>  		bond_info->lp_counter = 0;
> @@ -1394,6 +1390,14 @@ void bond_alb_monitor(struct work_struct *work)
>  		bond_info->tx_rebalance_counter = 0;
>  	}
>  
> +	/* XXX - unconditional attempt at RTNL for testing purposes, as
> +	 * normal case, below, is difficult to induce.
> +	 */
> +	read_unlock(&bond->lock);
> +	if (rtnl_trylock())
> +		rtnl_unlock();
> +	read_lock(&bond->lock);
> +
>  	/* handle rlb stuff */
>  	if (bond_info->rlb_enabled) {
>  		if (bond_info->primary_is_promisc &&
> @@ -1404,7 +1408,10 @@ void bond_alb_monitor(struct work_struct *work)
>  			 * nothing else.
>  			 */
>  			read_unlock(&bond->lock);
> -			rtnl_lock();
> +			if (!rtnl_trylock()) {
> +				read_lock(&bond->lock);
> +				goto re_arm;
> +			}
>  
>  			bond_info->rlb_promisc_timeout_counter = 0;
>  
> @@ -1440,9 +1447,8 @@ void bond_alb_monitor(struct work_struct *work)
>  	}
>  
>  re_arm:
> -	if (!bond->kill_timers)
> -		queue_delayed_work(bond->wq, &bond->alb_work, alb_delta_in_ticks);
> -out:
> +	queue_delayed_work(bond->wq, &bond->alb_work, alb_delta_in_ticks);
> +
>  	read_unlock(&bond->lock);
>  }
>  
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 71efff3..e6fefff 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -774,9 +774,6 @@ static void bond_resend_igmp_join_requests(struct bonding *bond)
>  
>  	read_lock(&bond->lock);
>  
> -	if (bond->kill_timers)
> -		goto out;
> -
>  	/* rejoin all groups on bond device */
>  	__bond_resend_igmp_join_requests(bond->dev);
>  
> @@ -790,9 +787,9 @@ static void bond_resend_igmp_join_requests(struct bonding *bond)
>  			__bond_resend_igmp_join_requests(vlan_dev);
>  	}
>  
> -	if ((--bond->igmp_retrans > 0) && !bond->kill_timers)
> +	if (--bond->igmp_retrans > 0)
>  		queue_delayed_work(bond->wq, &bond->mcast_work, HZ/5);
> -out:
> +
>  	read_unlock(&bond->lock);
>  }
>  
> @@ -2518,19 +2515,26 @@ void bond_mii_monitor(struct work_struct *work)
>  	struct bonding *bond = container_of(work, struct bonding,
>  					    mii_work.work);
>  	bool should_notify_peers = false;
> +	unsigned long delay;
>  
>  	read_lock(&bond->lock);
> -	if (bond->kill_timers)
> -		goto out;
> +
> +	delay = msecs_to_jiffies(bond->params.miimon);
>  
>  	if (bond->slave_cnt == 0)
>  		goto re_arm;
>  
> -	should_notify_peers = bond_should_notify_peers(bond);
> -
>  	if (bond_miimon_inspect(bond)) {
>  		read_unlock(&bond->lock);
> -		rtnl_lock();
> +
> +		/* Race avoidance with bond_close flush of workqueue */
> +		if (!rtnl_trylock()) {
> +			read_lock(&bond->lock);
> +			delay = 1;
> +			should_notify_peers = false;
> +			goto re_arm;
> +		}
> +
>  		read_lock(&bond->lock);
>  
>  		bond_miimon_commit(bond);
> @@ -2540,15 +2544,21 @@ void bond_mii_monitor(struct work_struct *work)
>  		read_lock(&bond->lock);
>  	}
>  
> +	should_notify_peers = bond_should_notify_peers(bond);
> +
>  re_arm:
> -	if (bond->params.miimon && !bond->kill_timers)
> -		queue_delayed_work(bond->wq, &bond->mii_work,
> -				   msecs_to_jiffies(bond->params.miimon));
> -out:
> +	if (bond->params.miimon)
> +		queue_delayed_work(bond->wq, &bond->mii_work, delay);
> +
>  	read_unlock(&bond->lock);
>  
>  	if (should_notify_peers) {
> -		rtnl_lock();
> +		if (!rtnl_trylock()) {
> +			read_lock(&bond->lock);
> +			bond->send_peer_notif++;
> +			read_unlock(&bond->lock);
> +			return;
> +		}
>  		netdev_bonding_change(bond->dev, NETDEV_NOTIFY_PEERS);
>  		rtnl_unlock();
>  	}
> @@ -2790,9 +2800,6 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
>  
>  	delta_in_ticks = msecs_to_jiffies(bond->params.arp_interval);
>  
> -	if (bond->kill_timers)
> -		goto out;
> -
>  	if (bond->slave_cnt == 0)
>  		goto re_arm;
>  
> @@ -2889,9 +2896,9 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
>  	}
>  
>  re_arm:
> -	if (bond->params.arp_interval && !bond->kill_timers)
> +	if (bond->params.arp_interval)
>  		queue_delayed_work(bond->wq, &bond->arp_work, delta_in_ticks);
> -out:
> +
>  	read_unlock(&bond->lock);
>  }
>  
> @@ -3132,9 +3139,6 @@ void bond_activebackup_arp_mon(struct work_struct *work)
>  
>  	read_lock(&bond->lock);
>  
> -	if (bond->kill_timers)
> -		goto out;
> -
>  	delta_in_ticks = msecs_to_jiffies(bond->params.arp_interval);
>  
>  	if (bond->slave_cnt == 0)
> @@ -3144,7 +3148,15 @@ void bond_activebackup_arp_mon(struct work_struct *work)
>  
>  	if (bond_ab_arp_inspect(bond, delta_in_ticks)) {
>  		read_unlock(&bond->lock);
> -		rtnl_lock();
> +
> +		/* Race avoidance with bond_close flush of workqueue */
> +		if (!rtnl_trylock()) {
> +			read_lock(&bond->lock);
> +			delta_in_ticks = 1;
> +			should_notify_peers = false;
> +			goto re_arm;
> +		}
> +
>  		read_lock(&bond->lock);
>  
>  		bond_ab_arp_commit(bond, delta_in_ticks);
> @@ -3157,13 +3169,18 @@ void bond_activebackup_arp_mon(struct work_struct *work)
>  	bond_ab_arp_probe(bond);
>  
>  re_arm:
> -	if (bond->params.arp_interval && !bond->kill_timers)
> +	if (bond->params.arp_interval)
>  		queue_delayed_work(bond->wq, &bond->arp_work, delta_in_ticks);
> -out:
> +
>  	read_unlock(&bond->lock);
>  
>  	if (should_notify_peers) {
> -		rtnl_lock();
> +		if (!rtnl_trylock()) {
> +			read_lock(&bond->lock);
> +			bond->send_peer_notif++;
> +			read_unlock(&bond->lock);
> +			return;
> +		}
>  		netdev_bonding_change(bond->dev, NETDEV_NOTIFY_PEERS);
>  		rtnl_unlock();
>  	}
> @@ -3425,8 +3442,6 @@ static int bond_open(struct net_device *bond_dev)
>  	struct slave *slave;
>  	int i;
>  
> -	bond->kill_timers = 0;
> -
>  	/* reset slave->backup and slave->inactive */
>  	read_lock(&bond->lock);
>  	if (bond->slave_cnt > 0) {
> @@ -3495,33 +3510,30 @@ static int bond_close(struct net_device *bond_dev)
>  
>  	bond->send_peer_notif = 0;
>  
> -	/* signal timers not to re-arm */
> -	bond->kill_timers = 1;
> -
>  	write_unlock_bh(&bond->lock);
>  
>  	if (bond->params.miimon) {  /* link check interval, in milliseconds. */
> -		cancel_delayed_work(&bond->mii_work);
> +		cancel_delayed_work_sync(&bond->mii_work);
>  	}
>  
>  	if (bond->params.arp_interval) {  /* arp interval, in milliseconds. */
> -		cancel_delayed_work(&bond->arp_work);
> +		cancel_delayed_work_sync(&bond->arp_work);
>  	}
>  
>  	switch (bond->params.mode) {
>  	case BOND_MODE_8023AD:
> -		cancel_delayed_work(&bond->ad_work);
> +		cancel_delayed_work_sync(&bond->ad_work);
>  		break;
>  	case BOND_MODE_TLB:
>  	case BOND_MODE_ALB:
> -		cancel_delayed_work(&bond->alb_work);
> +		cancel_delayed_work_sync(&bond->alb_work);
>  		break;
>  	default:
>  		break;
>  	}
>  
>  	if (delayed_work_pending(&bond->mcast_work))
> -		cancel_delayed_work(&bond->mcast_work);
> +		cancel_delayed_work_sync(&bond->mcast_work);
>  
>  	if (bond_is_lb(bond)) {
>  		/* Must be called only after all
> @@ -4368,26 +4380,22 @@ static void bond_setup(struct net_device *bond_dev)
>  
>  static void bond_work_cancel_all(struct bonding *bond)
>  {
> -	write_lock_bh(&bond->lock);
> -	bond->kill_timers = 1;
> -	write_unlock_bh(&bond->lock);
> -
>  	if (bond->params.miimon && delayed_work_pending(&bond->mii_work))
> -		cancel_delayed_work(&bond->mii_work);
> +		cancel_delayed_work_sync(&bond->mii_work);
>  
>  	if (bond->params.arp_interval && delayed_work_pending(&bond->arp_work))
> -		cancel_delayed_work(&bond->arp_work);
> +		cancel_delayed_work_sync(&bond->arp_work);
>  
>  	if (bond->params.mode == BOND_MODE_ALB &&
>  	    delayed_work_pending(&bond->alb_work))
> -		cancel_delayed_work(&bond->alb_work);
> +		cancel_delayed_work_sync(&bond->alb_work);
>  
>  	if (bond->params.mode == BOND_MODE_8023AD &&
>  	    delayed_work_pending(&bond->ad_work))
> -		cancel_delayed_work(&bond->ad_work);
> +		cancel_delayed_work_sync(&bond->ad_work);
>  
>  	if (delayed_work_pending(&bond->mcast_work))
> -		cancel_delayed_work(&bond->mcast_work);
> +		cancel_delayed_work_sync(&bond->mcast_work);
>  }
>  
>  /*
> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> index 82fec5f..1aecc37 100644
> --- a/drivers/net/bonding/bonding.h
> +++ b/drivers/net/bonding/bonding.h
> @@ -222,7 +222,6 @@ struct bonding {
>  			       struct slave *);
>  	rwlock_t lock;
>  	rwlock_t curr_slave_lock;
> -	s8       kill_timers;
>  	u8	 send_peer_notif;
>  	s8	 setup_by_slave;
>  	s8       igmp_retrans;
> 
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 

^ permalink raw reply

* Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory useage and napi jitter
From: Eric Dumazet @ 2011-10-28  2:37 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, David S. Miller
In-Reply-To: <20111028013729.GA6524@neilslaptop.think-freely.org>

Le jeudi 27 octobre 2011 à 21:37 -0400, Neil Horman a écrit :
> > 
> Yes, I'd experimented with some of this, and had mixed results.  Specifically I
> came up with a way to use the additional space as an array of list heads, with
> backpointers to the socket_queue each list_head was owned by and the skb that
> owned the entry.  It was nice because it let me enqueue the same skb to hundreds
> of sockets at the same time, which was really nice.  It was bad however, because
> it completely broke socket accounting (any likely anything else that required on
> any part of the socket state).  Any thoughts on ways I might improve on that.
> If I could make some sort of reduced sk_buff so that I didn't have to allocate
> all 256 bytes of a full sk_buff, that would be great!

It seems your objectives are contradictory (memory saving and cpu
saving). Most of the time we have to fight false sharing first.

I dont want to try to use the available space from skb, since its cache
unfriendly, in case you have one CPU 100% in softirq handling,
dispatching messages to XX other application cpus.

You dont want each application cpu slowing down other cpus by
manipulating a list anchor, contained in a shared cache line, highly
contended. We already have one high contention point (skb refcount or
data refcount).

Another point of interest is that modern NIC tend to use paged frag
skbs, with very litle space available in skb head. frag part is
readonly, and sometime with no available space neither.

So your idea only works on some (old gen) nics and some narrow
conditions.

I believe it adds too much complex code for this.

I understand you are disappointed, but maybe you should have shared your
ideas before coding them !

^ permalink raw reply

* Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Eric Dumazet @ 2011-10-28  2:52 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Linus Torvalds, Andi Kleen, linux-kernel, netdev, Andrew Morton
In-Reply-To: <1319766293.6759.17.camel@deadeye>

Le vendredi 28 octobre 2011 à 02:44 +0100, Ben Hutchings a écrit :

> Whether or not it needs to provide any ordering guarantee, atomic_read()
> must never read more than once, and I think that requires the volatile
> qualification.  It might be clearer to use the ACCESS_ONCE macro,
> however.
> 

Where this requirement comes from ?

Maybe then introduce atomic_read_once() for users really needing it :)

ACCESS_ONCE will force the read/move instruction I try to avoid :(

By the way, removing volatile on my x86_64 defconfig saves a bit of
space :

# size vmlinux vmlinux.old
   text	   data	    bss	    dec	    hex	filename
10907585	2890992	1540096	15338673	 ea0cb1	vmlinux
10912342	2891056	1540096	15343494	 ea1f86	vmlinux.old

booted and everything seems fine, but obviously lightly tested...

Note : There is still one useless access to page-flags,
'fixing' atomic_read() is not enough.


    0,04 :        ffffffff815e4675:       je     ffffffff815e4870 <tcp_sendmsg+0xc80>
    0,08 :        ffffffff815e467b:       mov    (%r9),%rax   // useless
    2,08 :        ffffffff815e467e:       lock incl 0x1c(%r9)
    3,58 :        ffffffff815e4683:       mov    (%r9),%rax
    0,04 :        ffffffff815e4686:       test   $0x80,%ah
    0,00 :        ffffffff815e4689:       jne    ffffffff815e48ee <tcp_sendmsg+0xcfe>

^ permalink raw reply

* Re: [patch net-next]alx: Atheros AR8131/AR8151/AR8152/AR8161 Ethernet driver
From: Joe Perches @ 2011-10-28  2:56 UTC (permalink / raw)
  To: Francois Romieu; +Cc: cloud.ren, davem, Luis.Rodriguez, netdev, linux-kernel
In-Reply-To: <20111019222140.GA9937@electric-eye.fr.zoreil.com>

On Thu, 2011-10-20 at 00:21 +0200, Francois Romieu wrote:
> cloud.ren@atheros.com <cloud.ren@atheros.com> :
> > diff --git a/drivers/net/ethernet/atheros/alx/alf_hw.c b/drivers/net/ethernet/atheros/alx/alf_hw.c
[]
> > +	if (en) { /* enable */
> > +		MEM_W32(ctx, L1F_RXQ0, rxq | L1F_RXQ0_EN);
> > +		MEM_W32(ctx, L1F_TXQ0, txq | L1F_TXQ0_EN);
> > +		if (0 != (en_ctrl & LX_MACSPEED_1000)) {
> > +			FIELD_SETL(mac, L1F_MAC_CTRL_SPEED,
> > +			    L1F_MAC_CTRL_SPEED_1000);
> > +		} else {
> > +			FIELD_SETL(mac, L1F_MAC_CTRL_SPEED,
> > +			    L1F_MAC_CTRL_SPEED_10_100);
> > +		}
> > +		if (0 != (en_ctrl & LX_MACDUPLEX_FULL)) {
> > +			mac |= L1F_MAC_CTRL_FULLD;
> > +		} else {
> > +			mac &= ~L1F_MAC_CTRL_FULLD;
> > +		}
> > +		/* rx filter */
> > +		if (0 != (en_ctrl & LX_FLT_PROMISC)) {
> > +			mac |= L1F_MAC_CTRL_PROMISC_EN;
> > +		} else {
> > +			mac &= ~L1F_MAC_CTRL_PROMISC_EN;
> > +		}
> > +		if (0 != (en_ctrl & LX_FLT_MULTI_ALL)) {
> > +			mac |= L1F_MAC_CTRL_MULTIALL_EN;
> > +		} else {
> > +			mac &= ~L1F_MAC_CTRL_MULTIALL_EN;
> > +		}
> > +		if (0 != (en_ctrl & LX_FLT_BROADCAST)) {
> > +			mac |= L1F_MAC_CTRL_BRD_EN;
> > +		} else {
> > +			mac &= ~L1F_MAC_CTRL_BRD_EN;
> > +		}
> Code duplication. Who cares ?

Maybe add a macro like:

#define mac_ctrl(mac, ctrl, flag, bit)		\
do {						\
	if ((ctrl) & (flag))			\
		mac |= (bit);			\
	else					\
		mac &= ~(bit);			\
} while (0)

so these become

	mac_ctrl(mac, en_ctrl, LX_MACDUPLEX_FULL, L1F_MAC_CTRL_FULLD);
	mac_ctrl(mac, en_ctrl, LX_FLT_PROMISC, L1F_MAC_CTRL_PROMISC_EN);
	mac_ctrl(mac, en_ctrl, LX_FLT_MULTI_ALL, L1F_MAC_CTRL_MULTIALL_EN);
	mac_ctrl(mac, en_ctrl, LX_FLT_BROADCAST, L1F_MAC_CTRL_BRD_EN);
	mac_ctrl(mac, en_ctrl, LX_FLT_DIRECT, L1F_MAC_CTRL_RX_EN);
	mac_ctrl(mac, en_ctrl, LX_FC_TXEN, L1F_MAC_CTRL_TXFC_EN);
	mac_ctrl(mac, en_ctrl, LX_FC_RXEN, L1F_MAC_CTRL_RXFC_EN);
	etc.

Or maybe add mac, en_ctrl and L1F_MAC_CTRL_ to the macro if you
really want to shorten it up.

	mac_ctrl(MACDUPLEX_FULL, FULLD);
	mac_ctrl(FLT_PROMISC, PROMISC_EN);
	mac_ctrl(FLT_MULTI_ALL, MULTIALL_EN);
	etc.

> It may make some sense to move these ~60 loc in a xyz_enable_something
> method...

2 macros?

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Remove quotes of maintaners's names
From: David Miller @ 2011-10-28  3:01 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: marcos.mage, akpm, netdev, linux-kernel
In-Reply-To: <610B86C6-622E-4EFE-89DF-BF09BDB4D765@intel.com>

From: "Kirsher, Jeffrey T" <jeffrey.t.kirsher@intel.com>
Date: Thu, 27 Oct 2011 14:35:54 -0700

> The reason these names are in quotes is because they need to be
> wrapped in quotes when used in a git patch (I.e. CC: "David
> S. Miller" <davem@davemloft.net>) because of the . in the friendly
> name.

That's correct, if a character such as "." appears in an email
header field it must be surrounded by double quotes, as per
SMTP rules.

VGER rejects any email where this rule is not followed properly.

^ permalink raw reply

* Re: [PATCH net -v2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close
From: David Miller @ 2011-10-28  3:15 UTC (permalink / raw)
  To: mitsuo.hayasaka.hu
  Cc: fubar, netdev, xiyou.wangcong, shemminger, andy, linux-kernel,
	yrl.pp-manager.tt
In-Reply-To: <4EAA0ADF.5020504@hitachi.com>

From: HAYASAKA Mitsuo <mitsuo.hayasaka.hu@hitachi.com>
Date: Fri, 28 Oct 2011 10:52:31 +0900

> I checked your patch, and any problems have not been observed for
> almost one day although I could reproduce the BUG in the latest net-next
> without it. 
> 
> I think this patch works well.
> 
> Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>

Jay, please formally submit this patch with proper signoffs etc.

Thanks!

^ permalink raw reply

* Re: [PATCH net-next-2.6 0/2] be2net: fixes
From: David Miller @ 2011-10-28  3:17 UTC (permalink / raw)
  To: somnath.kotur; +Cc: netdev
In-Reply-To: <900362ce-e840-44cb-bfcf-72eba609edf3@exht2.ad.emulex.com>

From: Somnath Kotur <somnath.kotur@Emulex.Com>
Date: Thu, 27 Oct 2011 22:41:47 +0530

> Somnath Kotur (2):
>   be2net: Refactored be_cmds.c file.
>   be2net: Changing MAC Address of a VF (in SR-IOV case) was broken.

Applied.

^ permalink raw reply

* Re: [PATCH 1/3] stmmac: fix a bug while checking the HW cap reg (v2)
From: David Miller @ 2011-10-28  3:17 UTC (permalink / raw)
  To: peppe.cavallaro; +Cc: netdev
In-Reply-To: <1319694189-25223-1-git-send-email-peppe.cavallaro@st.com>


All 3 patches applied, thanks.

^ permalink raw reply

* Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Ben Hutchings @ 2011-10-28  3:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Andi Kleen, linux-kernel, netdev, Andrew Morton
In-Reply-To: <1319770376.23112.58.camel@edumazet-laptop>

On Fri, 2011-10-28 at 04:52 +0200, Eric Dumazet wrote:
> Le vendredi 28 octobre 2011 à 02:44 +0100, Ben Hutchings a écrit :
> 
> > Whether or not it needs to provide any ordering guarantee, atomic_read()
> > must never read more than once, and I think that requires the volatile
> > qualification.  It might be clearer to use the ACCESS_ONCE macro,
> > however.
> > 
> 
> Where this requirement comes from ?

That is the conventional behaviour of 'atomic' operations, and callers
may depend on it.

> Maybe then introduce atomic_read_once() for users really needing it :)
>
> ACCESS_ONCE will force the read/move instruction I try to avoid :(
[...]

I'm sure you can find some other way to avoid it.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH] ipv6: fix error propagation in ip6_ufo_append_data()
From: Yan, Zheng @ 2011-10-28  4:24 UTC (permalink / raw)
  To: netdev@vger.kernel.org, davem@davemloft.net

We should return errcode from sock_alloc_send_skb()

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
---
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff30047..84d0bd5 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1123,7 +1123,7 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 			hh_len + fragheaderlen + transhdrlen + 20,
 			(flags & MSG_DONTWAIT), &err);
 		if (skb == NULL)
-			return -ENOMEM;
+			return err;
 
 		/* reserve space for Hardware header */
 		skb_reserve(skb, hh_len);

^ permalink raw reply related

* Re: [PATCH] ipv6: fix error propagation in ip6_ufo_append_data()
From: David Miller @ 2011-10-28  4:26 UTC (permalink / raw)
  To: zheng.z.yan; +Cc: netdev
In-Reply-To: <4EAA2E74.2000703@intel.com>

From: "Yan, Zheng" <zheng.z.yan@intel.com>
Date: Fri, 28 Oct 2011 12:24:20 +0800

> We should return errcode from sock_alloc_send_skb()
> 
> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>

Applied, thanks!

^ permalink raw reply

* >Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Eric Dumazet @ 2011-10-28  4:43 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Linus Torvalds, Andi Kleen, linux-kernel, netdev, Andrew Morton
In-Reply-To: <1319772566.6759.27.camel@deadeye>

Le vendredi 28 octobre 2011 à 04:29 +0100, Ben Hutchings a écrit :
> On Fri, 2011-10-28 at 04:52 +0200, Eric Dumazet wrote:
> > Le vendredi 28 octobre 2011 à 02:44 +0100, Ben Hutchings a écrit :
> > 
> > > Whether or not it needs to provide any ordering guarantee, atomic_read()
> > > must never read more than once, and I think that requires the volatile
> > > qualification.  It might be clearer to use the ACCESS_ONCE macro,
> > > however.
> > > 
> > 
> > Where this requirement comes from ?
> 
> That is the conventional behaviour of 'atomic' operations, and callers
> may depend on it.
> 

Thats your opinion, not a fact documented in
Documentation/atomic_ops.txt

<QUOTE>
---------------------------------------------------------
Next, we have:

        #define atomic_read(v)  ((v)->counter)

which simply reads the counter value currently visible to the calling thread.
The read is atomic in that the return value is guaranteed to be one of the
values initialized or modified with the interface operations if a proper
implicit or explicit memory barrier is used after possible runtime
initialization by any other thread and the value is modified only with the
interface operations.  atomic_read does not guarantee that the runtime
initialization by any other thread is visible yet, so the user of the
interface must take care of that with a proper implicit or explicit memory
barrier.

*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***

Some architectures may choose to use the volatile keyword, barriers, or inline
assembly to guarantee some degree of immediacy for atomic_read() and
atomic_set().  This is not uniformly guaranteed, and may change in the future,
so all users of atomic_t should treat atomic_read() and atomic_set() as simple
C statements that may be reordered or optimized away entirely by the compiler
or processor, and explicitly invoke the appropriate compiler and/or memory
barrier for each use case.  Failure to do so will result in code that may
suddenly break when used with different architectures or compiler
optimizations, or even changes in unrelated code which changes how the
compiler optimizes the section accessing atomic_t variables.

*** YOU HAVE BEEN WARNED! ***
---------------------------------------------------------------
</QUOTE>

The only requirement of atomic_read() is that it must return value
before or after an atomic_write(), not a garbled value.

So the ACCESS_ONCE semantic is on the write side (the atomic itself
cannot be used as a temporary variable. It must contain only two value
of atomic_{write|add|sub}() [ prior to the operation, after the
operation ]

In fact, if a compiler is stupid enough to issue two reads on following
code :

static inline atomic_read(const atomic_t *p)
{
	return p->value;
}

Then its fine, because in the end the returned value will be the old or
new one.

^ permalink raw reply

* Re: [RFC] tcp: Export TCP Delayed ACK parameters to user
From: Daniel Baluta @ 2011-10-28  8:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1319760097.19125.55.camel@edumazet-laptop>

On Fri, Oct 28, 2011 at 3:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 28 octobre 2011 à 02:07 +0300, Daniel Baluta a écrit :
>> RFC2581 ($4.2) specifies when an ACK should be generated as follows:
>>
>> " .. an ACK SHOULD be generated for at least every second
>>   full-sized segment, and MUST be generated within 500 ms
>>   of the arrival of the first unacknowledged packet.
>> "
>>
>> We export the number of segments and the timeout limits
>> specified above, so that a user can tune them according
>> to its needs.
>>
>
> Well, this requires user has a machine exclusive use :)

So, this means that setting parameters system wide
isn't an option?

On Windows there is a global setting TcpAckFrequency [1],
which is similar with our tcp_delack_{min,max}.

On Solaris there is a global option tcp_deferred_acks_max [2],
which is similar with our tcp_delack_segs.

Thanks for your comments, I will post an updated patch asap.

Daniel.

[1] http://support.microsoft.com/kb/328890
[2] http://www.sean.de/Solaris/soltune.html

^ permalink raw reply

* Re: [RFC] tcp: Export TCP Delayed ACK parameters to user
From: Eric Dumazet @ 2011-10-28  8:44 UTC (permalink / raw)
  To: Daniel Baluta; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <CAEnQRZAj4H9neNsGB8hkrQO5trvP8QimhKhpnwchjTVrAo2LGQ@mail.gmail.com>

Le vendredi 28 octobre 2011 à 11:01 +0300, Daniel Baluta a écrit :
> On Fri, Oct 28, 2011 at 3:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le vendredi 28 octobre 2011 à 02:07 +0300, Daniel Baluta a écrit :
> >> RFC2581 ($4.2) specifies when an ACK should be generated as follows:
> >>
> >> " .. an ACK SHOULD be generated for at least every second
> >>   full-sized segment, and MUST be generated within 500 ms
> >>   of the arrival of the first unacknowledged packet.
> >> "
> >>
> >> We export the number of segments and the timeout limits
> >> specified above, so that a user can tune them according
> >> to its needs.
> >>
> >
> > Well, this requires user has a machine exclusive use :)
> 
> So, this means that setting parameters system wide
> isn't an option?
> 

It is a first step, but we can notice a global setting might please one
application but negatively impact other applications.

I guess some users will want a per socket option, but this can come
later. An other idea to save space on socket structures would be to
select two set of values depending on TOS/TCLASS.

I can imagine ssh (lowdelay) and scp (throughput) wanting different
behavior here.

> On Windows there is a global setting TcpAckFrequency [1],
> which is similar with our tcp_delack_{min,max}.
> 
> On Solaris there is a global option tcp_deferred_acks_max [2],
> which is similar with our tcp_delack_segs.
> 

and also has tcp_deferred_ack_interval

> Thanks for your comments, I will post an updated patch asap.
> 
> Daniel.
> 
> [1] http://support.microsoft.com/kb/328890
> [2] http://www.sean.de/Solaris/soltune.html

Dont forget to CC Andy Lutomirski <luto@amacapital.net>, he might be
interested being part of the process.

Thanks

^ permalink raw reply

* Re: [PATCH v3 0/3] SUNRPC: rcbind clients virtualization
From: Stanislav Kinsbursky @ 2011-10-28  9:24 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond.Myklebust@netapp.com, linux-nfs@vger.kernel.org,
	Pavel Emelianov, neilb@suse.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	devel@openvz.org
In-Reply-To: <20111027202514.GA31669@fieldses.org>

28.10.2011 00:25, J. Bruce Fields пишет:
> On Thu, Oct 27, 2011 at 10:10:43PM +0300, Stanislav Kinsbursky wrote:
>> v3:
>> 1) First two patches from previous version were squashed.
>>
>> This patch-set was created in context of clone of git branch:
>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>> and rebased on tag "v3.1".
>>
>> This patch-set virtualizes rpcbind clients per network namespace context. IOW,
>> each network namespace will have its own pair of rpcbind clients (if they would
>> be created by request).
>>
>> Note:
>> 1) this patch-set depends on "SUNRPC: make rpcbind clients allocated and
>> destroyed dynamically" patch-set which has been send earlier.
>> 2) init_net pointer is still used instead of current->nsproxy->net_ns,
>> because I'm not sure yet about how to virtualize services. I.e. NFS callback
>> services will be per netns. NFSd service will be per netns too from my pow. But
>> Lockd can be per netns or one for all.
>
> I'm not sure what you mean by that; could you explain?
>

This patch-set was created before you've sent your NFSd plan and we disacussed 
Lockd per netns.
So, this sentence: "NFSd service will be per netns too from my pow" is obsolete. 
And Lockd will be one for all.
Or you are asking about something else?

> --b.
>
>> And also we have NFSd file system, which
>> is not virtualized yet.
>>
>> The following series consists of:
>>
>> ---
>>
>> Stanislav Kinsbursky (3):
>>        SUNRPC: move rpcbind internals to sunrpc part of network namespace context
>>        SUNRPC: optimize net_ns dereferencing in rpcbind creation calls
>>        SUNRPC: optimize net_ns dereferencing in rpcbind registering calls
>>
>>
>>   net/sunrpc/netns.h     |    5 ++
>>   net/sunrpc/rpcb_clnt.c |  103 ++++++++++++++++++++++++++----------------------
>>   2 files changed, 61 insertions(+), 47 deletions(-)
>>
>> --
>> Signature


-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox