Netdev List

* [PATCH 00/13] Swap-over-NBD without deadlocking
From: Mel Gorman @ 2011-04-26  7:36 UTC (permalink / raw)
  To: Linux-MM, Linux-Netdev
  Cc: LKML, David Miller, Neil Brown, Peter Zijlstra, Mel Gorman

Changelog since V1
  o Rebase on top of mmotm
  o Use atomic_t for memalloc_socks		(David Miller)
  o Remove use of sk_memalloc_socks in vmscan	(Neil Brown)
  o Check throttle within prepare_to_wait	(Neil Brown)
  o Add statistics on throttling instead of printk

Swapping over NBD is something that is technically possible but not
often advised. While there are a number of guides on the internet
on how to configure it and nbd-client supports a -swap switch to
"prevent deadlocks", the fact of the matter is a machine using NBD
for swap can be locked up within minutes if swap is used intensively.

The problem is that network block devices do not use mempools like
normal block devices do. As the host cannot control where they receive
packets from, they cannot reliably work out in advance how much memory
they might need.

Some years ago, Peter Zijlstra developed a series of patches that
supported swap over NFS that some distributions are carrying in
their kernels. This patch series borrows very heavily from Peter's work
to support swapping over NBD (the relatively straight-forward case)
and uses throttling instead of dynamically resized memory reserves
so the series is not too unwieldy for review.

Patch 1 serialises access to min_free_kbytes. It's not strictly needed
	by this series but as the series cares about watermarks in
	general, it's a harmless fix. It could be merged independently.

Patch 2 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
	preserve access to pages allocated under low memory situations
	to callers that are freeing memory.

Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
	reserves without setting PFMEMALLOC.

Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
	for later use by network packet processing.

Patch 5 ignores memory policies when ALLOC_NO_WATERMARKS is set.

Patches 6-9 allow network processing to use PFMEMALLOC reserves when
	the socket has been marked as being used by the VM to clean
	pages. If packets are received and stored in pages that were
	allocated under low-memory situations and are unrelated to
	the VM, the packets are dropped.

Patch 10 is a micro-optimisation to avoid a function call in the
	common case.

Patch 11 tags NBD sockets as being SOCK_MEMALLOC so they can use
	PFMEMALLOC if necessary.

Patch 12 notes that it is still possible for the PFMEMALLOC reserve
	to be depleted. To prevent this, direct reclaimers get
	throttled on a waitqueue if 50% of the PFMEMALLOC reserves are
	depleted.  It is expected that kswapd and the direct reclaimers
	already running will clean enough pages for the low watermark
	to be reached and the throttled processes are woken up.

Patch 13 adds a statistic to track how often processes get throttled
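
The throttling rule in patch 12 can be pictured roughly as follows. This is a minimal user-space sketch, not code from the series: `struct reserve_state`, `should_throttle()` and the page counts are hypothetical names chosen for illustration, and the real check operates on zone watermarks inside the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the PFMEMALLOC reserve, counted in pages. */
struct reserve_state {
	unsigned long reserve_pages;	/* total size of the reserve */
	unsigned long free_pages;	/* pages of the reserve still free */
};

/*
 * Return true when at least half of the reserve has been consumed,
 * i.e. when a direct reclaimer should go to sleep on the waitqueue.
 */
static bool should_throttle(const struct reserve_state *rs)
{
	assert(rs->free_pages <= rs->reserve_pages);
	return rs->free_pages <= rs->reserve_pages / 2;
}
```

With a 1000-page reserve, a caller that finds 501 pages free proceeds, while one that finds 500 or fewer is throttled until reclaim makes progress.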

Some basic performance testing was run using kernel builds, netperf
on loopback for UDP and TCP, hackbench (pipes and sockets), iozone
and sysbench. Each of them was expected to use the sl*b allocators
reasonably heavily but there did not appear to be significant
performance variances. Here are the results from netperf using
slab as an example

NETPERF UDP
                   netperf-udp       udp-swapnbd
                  vanilla-slab        v1r17-slab
      64   178.06 ( 0.00%)*   189.46 ( 6.02%) 
             1.02%             1.00%        
     128   355.06 ( 0.00%)    370.75 ( 4.23%) 
     256   662.47 ( 0.00%)    721.62 ( 8.20%) 
    1024  2229.39 ( 0.00%)   2567.04 (13.15%) 
    2048  3974.20 ( 0.00%)   4114.70 ( 3.41%) 
    3312  5619.89 ( 0.00%)   5800.09 ( 3.11%) 
    4096  6460.45 ( 0.00%)   6702.45 ( 3.61%) 
    8192  9580.24 ( 0.00%)   9927.97 ( 3.50%) 
   16384 13259.14 ( 0.00%)  13493.88 ( 1.74%) 
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       2960.17   2540.14
Total Elapsed Time (seconds)               3554.10   3050.10

NETPERF TCP
                   netperf-tcp       tcp-swapnbd
                  vanilla-slab        v1r17-slab
      64  1230.29 ( 0.00%)   1273.17 ( 3.37%) 
     128  2309.97 ( 0.00%)   2375.22 ( 2.75%) 
     256  3659.32 ( 0.00%)   3704.87 ( 1.23%) 
    1024  7267.80 ( 0.00%)   7251.02 (-0.23%) 
    2048  8358.26 ( 0.00%)   8204.74 (-1.87%) 
    3312  8631.07 ( 0.00%)   8637.62 ( 0.08%) 
    4096  8770.95 ( 0.00%)   8704.08 (-0.77%) 
    8192  9749.33 ( 0.00%)   9769.06 ( 0.20%) 
   16384 11151.71 ( 0.00%)  11135.32 (-0.15%) 
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       1245.04   1619.89
Total Elapsed Time (seconds)               1250.66   1622.18

Here is the equivalent test for SLUB

NETPERF UDP
                   netperf-udp       udp-swapnbd
                  vanilla-slub        v1r17-slub
      64   180.83 ( 0.00%)    183.68 ( 1.55%) 
     128   357.29 ( 0.00%)    367.11 ( 2.67%) 
     256   679.64 ( 0.00%)*   724.03 ( 6.13%) 
             1.15%             1.00%        
    1024  2343.40 ( 0.00%)*  2610.63 (10.24%) 
             1.68%             1.00%        
    2048  3971.53 ( 0.00%)   4102.21 ( 3.19%)*
             1.00%             1.40%        
    3312  5677.04 ( 0.00%)   5748.69 ( 1.25%) 
    4096  6436.75 ( 0.00%)   6549.41 ( 1.72%) 
    8192  9698.56 ( 0.00%)   9808.84 ( 1.12%) 
   16384 13337.06 ( 0.00%)  13404.38 ( 0.50%) 
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       2880.15   2180.13
Total Elapsed Time (seconds)               3458.10   2618.09

NETPERF TCP
                   netperf-tcp       tcp-swapnbd
                  vanilla-slub        v1r17-slub
      64  1256.79 ( 0.00%)   1287.32 ( 2.37%) 
     128  2308.71 ( 0.00%)   2371.09 ( 2.63%) 
     256  3672.03 ( 0.00%)   3771.05 ( 2.63%) 
    1024  7245.08 ( 0.00%)   7261.60 ( 0.23%) 
    2048  8315.17 ( 0.00%)   8244.14 (-0.86%) 
    3312  8611.43 ( 0.00%)   8616.90 ( 0.06%) 
    4096  8711.64 ( 0.00%)   8695.97 (-0.18%) 
    8192  9795.71 ( 0.00%)   9774.11 (-0.22%) 
   16384 11145.48 ( 0.00%)  11225.70 ( 0.71%) 
MMTests Statistics: duration
User/Sys Time Running Test (seconds)       1345.05   1425.06
Total Elapsed Time (seconds)               1350.61   1430.66

Time to completion varied a lot but this can happen with netperf as
it tries to find results within a sufficiently high confidence. I
wouldn't read too much into the performance gains of netperf-udp
as it can sometimes be affected by code just shuffling around for
whatever reason.

For testing swap-over-NBD, a machine was booted with 2G of RAM with a
swapfile backed by NBD. 16*NUM_CPU processes were started that create
anonymous memory mappings and read them linearly in a loop. The total
size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
memory pressure. Without the patches, the machine locks up within
minutes; with them applied, the test runs to completion.
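
A single worker of the stress test described above might look roughly like this. It is a sketch under assumptions, not the program used for these results: `per_process_bytes()` and `read_mapping()` are hypothetical helper names, and the initial write to each page is an assumption added so that the pages are really backed by memory (reads of never-written anonymous pages would be served by the shared zero page and would not push anything to swap).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Share of 4 * physical memory given to each of 16 * ncpus workers. */
static size_t per_process_bytes(size_t phys_bytes, unsigned int ncpus)
{
	assert(ncpus > 0);
	return phys_bytes / (4 * (size_t)ncpus);	/* 4 * phys / (16 * ncpus) */
}

/*
 * Map `bytes` of anonymous memory, dirty one byte per page, then read
 * the mapping linearly `passes` times. Returns the sum of the bytes
 * read, or -1 if the mapping could not be created.
 */
static long read_mapping(size_t bytes, int passes)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	volatile unsigned char *p;
	long sum = 0;

	p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Assumption: dirty every page once so it has to be swapped out. */
	for (size_t off = 0; off < bytes; off += page)
		p[off] = 1;

	for (int i = 0; i < passes; i++)
		for (size_t off = 0; off < bytes; off += page)
			sum += p[off];

	munmap((void *)p, bytes);
	return sum;
}
```

A driver would fork 16*NUM_CPU children and have each call `read_mapping(per_process_bytes(phys, ncpus), passes)` with a large pass count.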

Comments?

 drivers/block/nbd.c           |    7 +-
 include/linux/gfp.h           |    7 +-
 include/linux/mm_types.h      |    8 ++
 include/linux/mmzone.h        |    1 +
 include/linux/sched.h         |    7 ++
 include/linux/skbuff.h        |   19 +++-
 include/linux/slub_def.h      |    1 +
 include/linux/vm_event_item.h |    1 +
 include/net/sock.h            |   19 ++++
 kernel/softirq.c              |    3 +
 mm/page_alloc.c               |   57 ++++++++--
 mm/slab.c                     |  240 +++++++++++++++++++++++++++++++++++------
 mm/slub.c                     |   35 +++++-
 mm/vmscan.c                   |   58 ++++++++++
 mm/vmstat.c                   |    1 +
 net/core/dev.c                |   52 ++++++++-
 net/core/filter.c             |    8 ++
 net/core/skbuff.c             |   95 ++++++++++++++---
 net/core/sock.c               |   42 +++++++
 net/ipv4/tcp.c                |    3 +-
 net/ipv4/tcp_output.c         |   13 ++-
 net/ipv6/tcp_ipv6.c           |   12 ++-
 22 files changed, 603 insertions(+), 86 deletions(-)

-- 
1.7.3.4


^ permalink raw reply

* Re: [PATCH net-next 0/6] tg3: TSO loopback and EEH support
From: David Miller @ 2011-04-26  7:25 UTC (permalink / raw)
  To: mcarlson; +Cc: netdev
In-Reply-To: <1303771370-32579-1-git-send-email-mcarlson@broadcom.com>

From: "Matt Carlson" <mcarlson@broadcom.com>
Date: Mon, 25 Apr 2011 15:42:44 -0700

> This patchset implements TSO loopback support into the selftest.  It also
> adds EEH support.

Series applied to net-next-2.6, thanks.

^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 0/5] sctp: Patch series
From: David Miller @ 2011-04-26  7:24 UTC (permalink / raw)
  To: micchie; +Cc: netdev, lksctp-developers
In-Reply-To: <20110426.001247.39188691.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 26 Apr 2011 00:12:47 -0700 (PDT)

> From: Michio Honda <micchie@sfc.wide.ad.jp>
> Date: Tue, 26 Apr 2011 13:28:40 +0900
> 
>> Series of 5 patches to support auto_asconf and the other related functionalities that auto_asconf relies on. 
>> 
>> Cheers,
>> - Michio
>> 
>> [1/5] Add Auto-ASCONF support
>> [2/5] Add sysctl support for Auto-ASCONF
>> [3/5] Add socket option operation for Auto-ASCONF
>> [4/5] Add ADD/DEL ASCONF handling at the receiver
>> [5/5] Add ASCONF operation on the single-homed host--
> 
> Series applied.

Actually, I'm reverting this too.

Do you SCTP guys look at your test builds AT ALL!?!?!

net/sctp/protocol.c:711: warning: function declaration isn’t a prototype

Please fix this up and resubmit everything.


^ permalink raw reply

* Re: [PATCH net-next-2.6 4/7] sctp: remove useless arguments from get_saddr() call
From: David Miller @ 2011-04-26  7:20 UTC (permalink / raw)
  To: yjwei; +Cc: netdev, linux-sctp
In-Reply-To: <20110426.001227.112589346.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 26 Apr 2011 00:12:27 -0700 (PDT)

> I really get grumpy when I have to fix up stuff like this:
> 
> net/sctp/ipv6.c: In function ‘sctp_v6_get_saddr’:
> net/sctp/ipv6.c:382: warning: unused variable ‘daddr’

Now I'm even more grumpy, if I only apply the first 3 patches:

net/sctp/ipv6.c: In function ‘sctp_v6_dst_lookup’:
net/sctp/ipv6.c:259: warning: assignment makes integer from pointer without a cast

it goes away later but you're not testing the intermediate steps
of your SCTP patch backports and as a result you are going to
break bisecting.

I'm reverting this patch series, fix this up and actually test
the intermediate builds and functionality before resubmission.

^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 0/5] sctp: Patch series
From: David Miller @ 2011-04-26  7:12 UTC (permalink / raw)
  To: micchie; +Cc: netdev, lksctp-developers
In-Reply-To: <28B01B73-1E94-4E0A-BA9B-82122A6726E4@sfc.wide.ad.jp>

From: Michio Honda <micchie@sfc.wide.ad.jp>
Date: Tue, 26 Apr 2011 13:28:40 +0900

> Series of 5 patches to support auto_asconf and the other related functionalities that auto_asconf relies on. 
> 
> Cheers,
> - Michio
> 
> [1/5] Add Auto-ASCONF support
> [2/5] Add sysctl support for Auto-ASCONF
> [3/5] Add socket option operation for Auto-ASCONF
> [4/5] Add ADD/DEL ASCONF handling at the receiver
> [5/5] Add ASCONF operation on the single-homed host--

Series applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6 0/7] SCTP updates for net-next-2.6
From: David Miller @ 2011-04-26  7:12 UTC (permalink / raw)
  To: yjwei; +Cc: netdev, linux-sctp
In-Reply-To: <4DB63F85.2090609@cn.fujitsu.com>

From: Wei Yongjun <yjwei@cn.fujitsu.com>
Date: Tue, 26 Apr 2011 11:44:05 +0800

> Hi David
> 
> Here is a set of SCTP patches for net-next-2.6, the last part
> from vlad's lksctp-dev tree, update SCTP IPv6 routing and IPSec
> issues. Please apply.
> 
> Vlad Yasevich (4):
>       sctp: cache the ipv6 source after route lookup
>       sctp: make sctp over IPv6 work with IPsec
>       sctp: remove useless arguments from get_saddr() call
>       sctp: clean up route lookup calls
> 
> Wei Yongjun (2):
>       sctp: clean up IPv6 route and XFRM lookups
>       sctp: fix IPv6 source address output routing with IPsec
> 
> Weixing Shi (1):
>       sctp: fix sctp to work with ipv6 source address routing

All applied, with the warning fixed.

^ permalink raw reply

* Re: [PATCH net-next-2.6 4/7] sctp: remove useless arguments from get_saddr() call
From: David Miller @ 2011-04-26  7:12 UTC (permalink / raw)
  To: yjwei; +Cc: netdev, linux-sctp
In-Reply-To: <4DB6405B.2060200@cn.fujitsu.com>

From: Wei Yongjun <yjwei@cn.fujitsu.com>
Date: Tue, 26 Apr 2011 11:47:39 +0800

> @@ -392,11 +392,11 @@ static inline int sctp_v6_addr_match_len(union sctp_addr *s1,
>   */
>  static void sctp_v6_get_saddr(struct sctp_sock *sk,
>  			      struct sctp_transport *t,
> -			      union sctp_addr *daddr,
>  			      struct flowi *fl)
>  {
>  	struct flowi6 *fl6 = &fl->u.ip6;
>  	union sctp_addr *saddr = &t->saddr;
> +	union sctp_addr *daddr = &t->ipaddr;
>  
>  	SCTP_DEBUG_PRINTK("%s: asoc:%p dst:%p daddr:%pI6 ",
>  			  __func__, t->asoc, t->dst, &daddr->v6.sin6_addr);

I really get grumpy when I have to fix up stuff like this:

net/sctp/ipv6.c: In function ‘sctp_v6_get_saddr’:
net/sctp/ipv6.c:382: warning: unused variable ‘daddr’

You guys know I'm going to immediately run make on any patch you send
me and look for new warnings.

Why waste my time and not look for them yourselves before posting the
patch?

This wasn't even one of those cases where the warning goes away at
the end of the patch series, and only exists somewhere in the middle.


^ permalink raw reply

* Re: [RFC PATCH] netlink: Increase netlink dump skb message size
From: David Miller @ 2011-04-26  6:56 UTC (permalink / raw)
  To: eric.dumazet; +Cc: gregory.v.rose, netdev, bhutchings
In-Reply-To: <1303799597.2747.214.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 26 Apr 2011 08:33:17 +0200

> Le lundi 25 avril 2011 à 15:01 -0700, Greg Rose a écrit :
>> The message size allocated for rtnl info dumps was limited to a single page.
>> This is not enough for additional interface info available with devices
>> that support SR-IOV.  Check that the amount of data allocated is sufficient
>> for the amount of data requested.
>> 
>> Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
>> ---
>> 
>>  include/linux/rtnetlink.h |    1 +
>>  net/core/rtnetlink.c      |    6 ++++++
>>  net/netlink/af_netlink.c  |   37 +++++++++++++++++++++++++++++++------
>>  3 files changed, 38 insertions(+), 6 deletions(-)
>> 
> 
> Hmm, that's a hack, because netlink_dump() is generic and you add
> something very specific.

You also can't do this without breaking applications.

We've trained every single netlink library out there about this message size
formula, so they know that if you allocate at least 8192 bytes for a recvmsg()
call they can receive fully any single netlink message.

And they must be able to make assumptions like this, otherwise they
cannot know how to reliably receive a netlink message in its
entirety in a generic fashion.

So one must not attack this problem from this angle.

It is absolutely necessary to find some way to report the VF
information, out of band, instead of trying to make the message
larger.

Needing more than 8K to get a dump of a single device over netlink is
absolutely ridiculous, this VF stuff was poorly designed.  It has to
be fixed and the current stuff marked deprecated.
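
The receive pattern described above can be sketched as follows: a caller hands `recv()` a buffer of at least 8192 bytes and then walks whatever messages arrived in it. `struct nlmsghdr`, `NLMSG_OK()` and `NLMSG_NEXT()` come from `<linux/netlink.h>`; `RECV_BUF_SIZE` and `count_nlmsgs()` are hypothetical names, and the socket setup is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/netlink.h>

/* Buffer size that netlink libraries allocate for one recvmsg() call. */
#define RECV_BUF_SIZE 8192

/*
 * Count the complete netlink messages in buf[0..len). In a real
 * receiver, buf would be a RECV_BUF_SIZE array filled by recv() on an
 * AF_NETLINK socket and len would be recv()'s return value.
 */
static int count_nlmsgs(void *buf, size_t len)
{
	struct nlmsghdr *nlh = buf;
	int remaining = (int)len;
	int count = 0;

	assert(len <= RECV_BUF_SIZE);
	while (NLMSG_OK(nlh, remaining)) {
		count++;
		nlh = NLMSG_NEXT(nlh, remaining);
	}
	return count;
}
```

A message larger than the buffer cannot be walked this way, which is why growing a single dump message past this size breaks existing receivers.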

^ permalink raw reply

* Re: [RFC PATCH] netlink: Increase netlink dump skb message size
From: Eric Dumazet @ 2011-04-26  6:33 UTC (permalink / raw)
  To: Greg Rose; +Cc: netdev, bhutchings, davem
In-Reply-To: <20110425220157.2012.96707.stgit@gitlad.jf.intel.com>

Le lundi 25 avril 2011 à 15:01 -0700, Greg Rose a écrit :
> The message size allocated for rtnl info dumps was limited to a single page.
> This is not enough for additional interface info available with devices
> that support SR-IOV.  Check that the amount of data allocated is sufficient
> for the amount of data requested.
> 
> Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
> ---
> 
>  include/linux/rtnetlink.h |    1 +
>  net/core/rtnetlink.c      |    6 ++++++
>  net/netlink/af_netlink.c  |   37 +++++++++++++++++++++++++++++++------
>  3 files changed, 38 insertions(+), 6 deletions(-)
> 

Hmm, that's a hack, because netlink_dump() is generic and you add
something very specific.

I prefer something that allows one dump() to reallocate a bigger skb.

Maybe changing the ->dump() prototype to take struct sk_buff **pskb
instead of struct sk_buff *skb.

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c8f35b5..7fa6735 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1681,7 +1681,7 @@ static int netlink_dump(struct sock *sk)
 		goto errout_skb;
 	}
 
-	len = cb->dump(skb, cb);
+	len = cb->dump(&skb, cb);
 
 	if (len > 0) {
 		mutex_unlock(nlk->cb_mutex);
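
The shape of this proposal can be illustrated in plain C. The sketch below is not kernel code: `struct buf`, `dump_fn` and `big_dump()` are hypothetical stand-ins for `struct sk_buff` and a `->dump()` callback, and only show why the callback needs the address of the caller's pointer in order to swap in a larger buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff: a sized byte buffer. */
struct buf {
	size_t cap;	/* bytes available in data[] */
	size_t len;	/* bytes used so far */
	char data[];
};

static struct buf *buf_alloc(size_t cap)
{
	struct buf *b = malloc(sizeof(*b) + cap);

	if (b) {
		b->cap = cap;
		b->len = 0;
	}
	return b;
}

/* Callback type modelled on a dump() that takes struct sk_buff **pskb. */
typedef int (*dump_fn)(struct buf **pb, size_t need);

/*
 * Append `need` bytes to *pb. If the caller's buffer is too small,
 * replace it with a bigger one; the caller sees the new buffer because
 * it passed the address of its own pointer.
 */
static int big_dump(struct buf **pb, size_t need)
{
	struct buf *b = *pb;

	if (b->cap - b->len < need) {
		struct buf *bigger = buf_alloc(b->len + need);

		if (!bigger)
			return -1;
		memcpy(bigger->data, b->data, b->len);
		bigger->len = b->len;
		free(b);
		*pb = b = bigger;
	}
	memset(b->data + b->len, 'x', need);
	b->len += need;
	return (int)b->len;
}
```

With a plain `struct buf *` argument the callback could only fill the buffer it was given; the caller would keep pointing at the old, possibly freed, allocation.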



^ permalink raw reply related

* Re: [PATCH 0/3] net: Byte queue limit patch series
From: Eric Dumazet @ 2011-04-26  6:17 UTC (permalink / raw)
  To: Bill Fink; +Cc: Tom Herbert, davem, netdev
In-Reply-To: <20110426015645.c2d19cfe.billfink@mindspring.com>

Le mardi 26 avril 2011 à 01:56 -0400, Bill Fink a écrit :

> I don't quite follow your conclusion from your data.
> While there was a sweet spot for the 1400 rr size, other
> smaller rr took a hit.  Now all the tps changes were
> within 1 %, so perhaps that isn't considered significant
> (I'm not qualified to make that call).  But if that's
> the case, then the effective latency change seen by the
> user isn't significant either, although the amount of
> queuing in the NIC is admittedly significantly reduced
> for a rr size of 1400 or larger.

Tom's point was to show that we can reduce latency (because the size of
the netdevice queue is smaller) without changing tps ;)




^ permalink raw reply

* Re: [PATCH 0/3] net: Byte queue limit patch series
From: Eric Dumazet @ 2011-04-26  6:14 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1104252128001.5889@pokey.mtv.corp.google.com>

Le lundi 25 avril 2011 à 21:38 -0700, Tom Herbert a écrit :
> This patch series implements byte queue limits (bql) for NIC TX queues.
> 
> Byte queue limits are a mechanism to limit the size of the transmit
> hardware queue on a NIC by number of bytes. The goal of these byte
> limits is to reduce latency caused by excessive queuing in hardware
> without sacrificing throughput.
> 
> Hardware queuing limits are typically specified in terms of a number of
> hardware descriptors, each of which has a variable size. The variability
> of the size of individual queued items can have a very wide range. For
> instance with the e1000 NIC the size could range from 64 bytes to 4K
> (with TSO enabled). This variability makes it next to impossible to
> choose a single queue limit that prevents starvation and provides lowest
> possible latency.
> 
> The objective of byte queue limits is to set the limit to be the
> minimum needed to prevent starvation between successive transmissions to
> the hardware. The latency between two transmissions can be variable in a
> system. It is dependent on interrupt frequency, NAPI polling latencies,
> scheduling of the queuing discipline, lock contention, etc. Therefore we
> propose that byte queue limits should be dynamic and change in
> accordance with networking stack latencies a system encounters.
> 
> Patches to implement this:
> Patch 1: Dynamic queue limits (dql) library.  This provides the general
> queuing algorithm.
> Patch 2: netdev changes that use dql to support byte queue limits.
> Patch 3: Support in forcedeth driver for byte queue limits.
> 
> The effects of BQL are demonstrated in the benchmark results below.
> These were made running 200 streams of netperf RR tests:
> 
> 140000 rr size
> BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
> No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu
> 
> 14000 rr size
> BQL: 25-55K bytes in queue, 8500 tps
> No BQL: 1500-1622K bytes in queue, 8523 tps, 4.53% cpu
> 
> 1400 rr size
> BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
> No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu
> 
> 140 rr size
> BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
> No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu
> 
> 1 rr size
> BQL: 0-3K bytes in queue, 338811 tps, 41.41% cpu
> No BQL: 0-3K bytes in queue, 339947 tps, 42.36% cpu
> 
> The amount of queuing in the NIC is reduced up to 90%, and I haven't
> yet seen a consistent negative impact in terms of throughput or
> CPU utilization.

Hi Tom

That's a focus on throughput, adding some extra latency (because of new
fields to access/dirty in tx path and tx completion path), especially on
setups where many cpus are sending data on one device. I suspect this is
the price to pay to fight bufferbloat.

We can try to make this not so expensive.

Maybe try to separate the DQL structure into two parts, one used on the
TX path (inside the already dirtied cache line in netdev_queue structure
(_xmit_lock, xmit_lock_owner, trans_start)), and the other one in the TX
completion path?


This new limit scheme also favors streams using super packets. Your
workload uses 200 identical clients, it would be nice to mix DNS traffic
(small UDP frames) in them, and check how they behave when the queue is
full, while it was almost never full before...
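
For readers following the thread, the difference between limiting a queue by descriptors and limiting it by bytes can be sketched as below. This only models byte accounting against a fixed limit; it is not the dql algorithm from patch 1, which adjusts the limit dynamically, and `struct byte_queue` and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-TX-queue state: bytes handed to hardware vs. a limit. */
struct byte_queue {
	unsigned int limit;	/* maximum bytes allowed in flight */
	unsigned int inflight;	/* bytes queued but not yet completed */
};

/* True while the stack may hand another packet to the hardware. */
static bool bq_can_queue(const struct byte_queue *q)
{
	return q->inflight < q->limit;
}

/* Account for a packet of `bytes` handed to the hardware. */
static void bq_queued(struct byte_queue *q, unsigned int bytes)
{
	q->inflight += bytes;
}

/* Account for `bytes` reported done by TX completion. */
static void bq_completed(struct byte_queue *q, unsigned int bytes)
{
	assert(bytes <= q->inflight);
	q->inflight -= bytes;
}
```

With a limit of 3000 bytes, two 1500-byte frames stop the queue, while it takes 47 frames of 64 bytes to do the same; a limit counted in descriptors would treat both packet sizes alike.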




^ permalink raw reply

* Re: [PATCH net-2.6 4/4] xfrm: Fix integer underrun on zero sized replay windows
From: Herbert Xu @ 2011-04-26  6:01 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20110426054232.GI5495@secunet.com>

On Tue, Apr 26, 2011 at 07:42:32AM +0200, Steffen Klassert wrote:
> The check if the replay window is contained within one subspace or
> spans over two subspaces causes an unwanted integer underrun on
> zero sized replay windows when we subtract one. We fix this by
> changing this check to avoid the subtraction.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/3] net: Byte queue limit patch series
From: Bill Fink @ 2011-04-26  5:56 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1104252128001.5889@pokey.mtv.corp.google.com>

On Mon, 25 Apr 2011, Tom Herbert wrote:

> This patch series implements byte queue limits (bql) for NIC TX queues.
> 
> Byte queue limits are a mechanism to limit the size of the transmit
> hardware queue on a NIC by number of bytes. The goal of these byte
> limits is to reduce latency caused by excessive queuing in hardware
> without sacrificing throughput.
> 
> Hardware queuing limits are typically specified in terms of a number of
> hardware descriptors, each of which has a variable size. The variability
> of the size of individual queued items can have a very wide range. For
> instance with the e1000 NIC the size could range from 64 bytes to 4K
> (with TSO enabled). This variability makes it next to impossible to
> choose a single queue limit that prevents starvation and provides lowest
> possible latency.
> 
> The objective of byte queue limits is to set the limit to be the
> minimum needed to prevent starvation between successive transmissions to
> the hardware. The latency between two transmissions can be variable in a
> system. It is dependent on interrupt frequency, NAPI polling latencies,
> scheduling of the queuing discipline, lock contention, etc. Therefore we
> propose that byte queue limits should be dynamic and change in
> accordance with networking stack latencies a system encounters.
> 
> Patches to implement this:
> Patch 1: Dynamic queue limits (dql) library.  This provides the general
> queuing algorithm.
> Patch 2: netdev changes that use dql to support byte queue limits.
> Patch 3: Support in forcedeth driver for byte queue limits.
> 
> The effects of BQL are demonstrated in the benchmark results below.
> These were made running 200 streams of netperf RR tests:
> 
> 140000 rr size
> BQL: 80-215K bytes in queue, 856 tps, 3.26% cpu
> No BQL: 2700-2930K bytes in queue, 854 tps, 3.71% cpu

	tps	+0.23 %

> 14000 rr size
> BQL: 25-55K bytes in queue, 8500 tps
> No BQL: 1500-1622K bytes in queue, 8523 tps, 4.53% cpu

	tps	-0.27 %

> 1400 rr size
> BQL: 20-38K bytes in queue, 86582 tps, 7.38% cpu
> No BQL: 29-117K bytes in queue, 85738 tps, 7.67% cpu

	tps	+0.98 %

> 140 rr size
> BQL: 1-10K bytes in queue, 320540 tps, 34.6% cpu
> No BQL: 1-13K bytes in queue, 323158 tps, 37.16% cpu

	tps	-0.81 %

> 1 rr size
> BQL: 0-3K bytes in queue, 338811 tps, 41.41% cpu
> No BQL: 0-3K bytes in queue, 339947 tps, 42.36% cpu

	tps	-0.33 %

> The amount of queuing in the NIC is reduced up to 90%, and I haven't
> yet seen a consistent negative impact in terms of throughput or
> CPU utilization.

I don't quite follow your conclusion from your data.
While there was a sweet spot for the 1400 rr size, other
smaller rr took a hit.  Now all the tps changes were
within 1 %, so perhaps that isn't considered significant
(I'm not qualified to make that call).  But if that's
the case, then the effective latency change seen by the
user isn't significant either, although the amount of
queuing in the NIC is admittedly significantly reduced
for a rr size of 1400 or larger.

					-Bill
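
The percentages given inline above are the relative tps change of the BQL run against the no-BQL run, which can be checked with a few lines of C (`tps_change_pct()` is a hypothetical helper name):

```c
#include <assert.h>

/* Relative change, in percent, of the BQL tps against the no-BQL tps. */
static double tps_change_pct(double bql_tps, double no_bql_tps)
{
	assert(no_bql_tps > 0.0);
	return (bql_tps - no_bql_tps) / no_bql_tps * 100.0;
}
```

For the 1400 rr size, `tps_change_pct(86582, 85738)` comes out at roughly +0.98, and for the 140 rr size `tps_change_pct(320540, 323158)` at roughly -0.81.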

^ permalink raw reply

* Re: [PATCH net-2.6 3/4] xfrm: Check for the new replay implementation if an esn state is inserted
From: Herbert Xu @ 2011-04-26  5:43 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20110426054121.GH5495@secunet.com>

On Tue, Apr 26, 2011 at 07:41:21AM +0200, Steffen Klassert wrote:
> IPsec extended sequence numbers can be used only with the new
> anti-replay window implementation. So check if the new implementation
> is used if an esn state is inserted and return an error if it is not.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH net-2.6 2/4] esp6: Fix scatterlist initialization
From: Herbert Xu @ 2011-04-26  5:41 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20110426054023.GG5495@secunet.com>

On Tue, Apr 26, 2011 at 07:40:23AM +0200, Steffen Klassert wrote:
> When we use IPsec extended sequence numbers, we may overwrite
> the last scatterlist of the associated data by the scatterlist
> for the skb. This patch fixes this by placing the scatterlist
> for the skb right behind the last scatterlist of the associated
> data. esp4 does it already like that.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH net-2.6 4/4] xfrm: Fix integer underrun on zero sized replay windows
From: Steffen Klassert @ 2011-04-26  5:42 UTC (permalink / raw)
  To: David Miller, Herbert Xu; +Cc: netdev
In-Reply-To: <20110426053923.GF5495@secunet.com>

The check if the replay window is contained within one subspace or
spans over two subspaces causes an unwanted integer underrun on
zero sized replay windows when we subtract one. We fix this by
changing this check to avoid the subtraction.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_replay.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
index e8a7814..19f94bb 100644
--- a/net/xfrm/xfrm_replay.c
+++ b/net/xfrm/xfrm_replay.c
@@ -32,7 +32,7 @@ u32 xfrm_replay_seqhi(struct xfrm_state *x, __be32 net_seq)
 	seq_hi = replay_esn->seq_hi;
 	bottom = replay_esn->seq - replay_esn->replay_window + 1;
 
-	if (likely(replay_esn->seq >= replay_esn->replay_window - 1)) {
+	if (likely(replay_esn->seq > replay_esn->replay_window)) {
 		/* A. same subspace */
 		if (unlikely(seq < bottom))
 			seq_hi++;
-- 
1.7.0.4
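
The wrap-around being fixed here is ordinary unsigned arithmetic and can be reproduced outside the kernel. In the sketch below, `old_check()` and `new_check()` are hypothetical names that mirror the two conditions in the hunk, with `uint32_t` standing in for the kernel's 32-bit unsigned type:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old condition: window - 1 wraps to UINT32_MAX when window is 0. */
static bool old_check(uint32_t seq, uint32_t window)
{
	return seq >= window - 1;
}

/* New condition: no subtraction, so nothing can wrap. */
static bool new_check(uint32_t seq, uint32_t window)
{
	return seq > window;
}
```

With a zero-sized window and `seq = 5`, `old_check(5, 0)` is false because 5 is compared against 4294967295, while `new_check(5, 0)` is true.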


^ permalink raw reply related

* Re: [PATCH net-2.6 1/4] xfrm: Fix replay window size calculation on initialization
From: Herbert Xu @ 2011-04-26  5:41 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20110426053923.GF5495@secunet.com>

On Tue, Apr 26, 2011 at 07:39:24AM +0200, Steffen Klassert wrote:
> On replay initialization, we compute the size of the replay
> buffer to see if the replay window fits into the buffer.
> This computation lacks a multiplication by 8 because we need
> the size in bits, not in bytes. So we might return an error
> even though the replay window would fit into the buffer.
> This patch fixes this issue.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH net-2.6 3/4] xfrm: Check for the new replay implementation if an esn state is inserted
From: Steffen Klassert @ 2011-04-26  5:41 UTC (permalink / raw)
  To: David Miller, Herbert Xu; +Cc: netdev
In-Reply-To: <20110426053923.GF5495@secunet.com>

IPsec extended sequence numbers can be used only with the new
anti-replay window implementation. So check if the new implementation
is used if an esn state is inserted and return an error if it is not.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_user.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 5d1d60d..c658cb3 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -124,6 +124,9 @@ static inline int verify_replay(struct xfrm_usersa_info *p,
 {
 	struct nlattr *rt = attrs[XFRMA_REPLAY_ESN_VAL];
 
+	if ((p->flags & XFRM_STATE_ESN) && !rt)
+		return -EINVAL;
+
 	if (!rt)
 		return 0;
 
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH net-2.6 2/4] esp6: Fix scatterlist initialization
From: Steffen Klassert @ 2011-04-26  5:40 UTC (permalink / raw)
  To: David Miller, Herbert Xu; +Cc: netdev
In-Reply-To: <20110426053923.GF5495@secunet.com>

When we use IPsec extended sequence numbers, we may overwrite
the last scatterlist of the associated data by the scatterlist
for the skb. This patch fixes this by placing the scatterlist
for the skb right behind the last scatterlist of the associated
data. esp4 does it already like that.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv6/esp6.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 5aa8ec8..59dccfb 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -371,7 +371,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
 	iv = esp_tmp_iv(aead, tmp, seqhilen);
 	req = esp_tmp_req(aead, iv);
 	asg = esp_req_sg(aead, req);
-	sg = asg + 1;
+	sg = asg + sglists;
 
 	skb->ip_summed = CHECKSUM_NONE;
 
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH net-2.6 1/4] xfrm: Fix replay window size calculation on initialization
From: Steffen Klassert @ 2011-04-26  5:39 UTC (permalink / raw)
  To: David Miller, Herbert Xu; +Cc: netdev

On replay initialization, we compute the size of the replay
buffer to see if the replay window fits into the buffer.
This computation lacks a multiplication by 8 because we need
the size in bits, not in bytes. So we might return an error
even though the replay window would fit into the buffer.
This patch fixes this issue.
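
As a rough illustration of the arithmetic, here is a minimal userspace
sketch. `window_fits()` is a hypothetical helper, not the kernel
function; it assumes the bitmap length is counted in 32-bit words and
the replay window is counted in bits.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if a replay window of `replay_window`
 * bits fits into a bitmap of `bmp_len` 32-bit words, 0 otherwise.
 */
static int window_fits(uint32_t replay_window, uint32_t bmp_len)
{
	/* Capacity in bits: words * bytes per word * 8 bits per byte. */
	uint64_t capacity_bits = (uint64_t)bmp_len * sizeof(uint32_t) * 8;

	return replay_window <= capacity_bits;
}
```

For example, a bitmap of 4 words holds 128 bits, so a 64-bit window
fits. A bound computed in bytes (4 * 4 = 16) would have rejected it.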

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/xfrm/xfrm_replay.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
index f218385..e8a7814 100644
--- a/net/xfrm/xfrm_replay.c
+++ b/net/xfrm/xfrm_replay.c
@@ -532,7 +532,7 @@ int xfrm_init_replay(struct xfrm_state *x)
 
 	if (replay_esn) {
 		if (replay_esn->replay_window >
-		    replay_esn->bmp_len * sizeof(__u32))
+		    replay_esn->bmp_len * sizeof(__u32) * 8)
 			return -EINVAL;
 
 	if ((x->props.flags & XFRM_STATE_ESN) && x->replay_esn)
-- 
1.7.0.4


^ permalink raw reply related

* Re: [PATCH 2/3] bql: Byte queue limits
From: Eric Dumazet @ 2011-04-26  5:38 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1104252128290.5895@pokey.mtv.corp.google.com>

On Monday, 25 April 2011 at 21:38 -0700, Tom Herbert wrote:
> Networking stack support for byte queue limits, uses dynamic queue
> limits library.  Byte queue limits are maintained per transmit queue,
> and a bql structure has been added to netdev_queue structure for this
> purpose.
> 
> Configuration of bql is in the tx-<n> sysfs directory for the queue
> under the byte_queue_limits directory.  Configuration includes:
> limit_min, bql minimum limit
> limit_max, bql maximum limit
> hold_time, bql slack hold time
> 
> Also under the directory are:
> limit, current byte limit
> inflight, current number of bytes on the queue
> 

Wow... magical values and very little advice on how to tune them.

Tom, this reminds me you were supposed to provide Documentation/files to
describe RPS, RFS, XPS ...

We receive many questions about these features...

> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>  include/linux/netdevice.h |   46 +++++++++++++++-
>  net/core/net-sysfs.c      |  137 +++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 177 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index cb8178a..0a76b88 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -44,6 +44,7 @@
>  #include <linux/rculist.h>
>  #include <linux/dmaengine.h>
>  #include <linux/workqueue.h>
> +#include <linux/dynamic_queue_limits.h>
>  
>  #include <linux/ethtool.h>
>  #include <net/net_namespace.h>
> @@ -556,8 +557,10 @@ struct netdev_queue {
>  	struct Qdisc		*qdisc;
>  	unsigned long		state;
>  	struct Qdisc		*qdisc_sleeping;
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_XPS
>  	struct kobject		kobj;
> +	bool			do_bql;
> +	struct dql		dql;
>  #endif

I have no idea why you use CONFIG_XPS for BQL (how is BQL related to
SMP ???), and why kobj is now guarded by CONFIG_XPS instead of
CONFIG_RPS.




^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 5/5] sctp: Add ASCONF operation on the single-homed host
From: Wei Yongjun @ 2011-04-26  5:33 UTC (permalink / raw)
  To: Michio Honda; +Cc: netdev, lksctp-developers
In-Reply-To: <856CB69B-767A-4F6C-9DBF-26EEAFCC3B56@sfc.wide.ad.jp>


> SCTP can change the IP address on the single-homed host.  
> In this case, the SCTP association transmits an ASCONF packet including addition of the new IP address and deletion of the old address.  This patch implements this functionality.  
> In this case, the ASCONF chunk is added to the beginning of the queue, because the other chunks cannot be transmitted in this state.  
>
> Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
> ---
>
Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>


^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 4/5] sctp: Add ADD/DEL ASCONF handling at the receiver
From: Wei Yongjun @ 2011-04-26  5:31 UTC (permalink / raw)
  To: Michio Honda; +Cc: netdev, lksctp-developers
In-Reply-To: <BECB6CDC-BC4F-4BC1-B67D-B9F3F02E8D87@sfc.wide.ad.jp>


> This patch fixes the problem that the original code cannot delete the remote address where the corresponding transport is currently directed, even when the ASCONF is sent from the other address (this situation happens when the single-homed sender transmits  ASCONF with ADD and DEL.)  
>
> Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
> ---
>
Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>


^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 3/5] sctp: Add socket option operation for Auto-ASCONF
From: Wei Yongjun @ 2011-04-26  5:31 UTC (permalink / raw)
  To: Michio Honda; +Cc: netdev, lksctp-developers
In-Reply-To: <0B9100AB-44C5-49E7-AA03-8B99180BE7E3@sfc.wide.ad.jp>



> This patch allows the application to operate Auto-ASCONF on/off behavior via setsockopt() and getsockopt().  
>
> Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
> ---
>

Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>


^ permalink raw reply

* Re: [PATCH net-next-2.6 v6 2/5] sctp: Add sysctl support for Auto-ASCONF
From: Wei Yongjun @ 2011-04-26  5:30 UTC (permalink / raw)
  To: Michio Honda; +Cc: netdev, lksctp-developers
In-Reply-To: <4B304B0D-35AC-4372-84F3-EFBC5A4C7BF2@sfc.wide.ad.jp>


> This patch allows the system administrator to change default Auto-ASCONF on/off behavior via a sysctl value.
>
> Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
>

Acked-by: Wei Yongjun <yjwei@cn.fujitsu.com>


^ permalink raw reply
