* [net-next PATCH V2 0/9] net: fragmentation performance scalability on NUMA/SMP systems
@ 2012-11-29 16:10 Jesper Dangaard Brouer
From: Jesper Dangaard Brouer @ 2012-11-29 16:10 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

This patchset implements significant performance improvements for
fragmentation handling in the kernel, with a focus on NUMA and SMP
based systems.

This is V2 of the patchset, previously sent as an RFC patchset.
(Notice an extra patch is inserted after patch-05, thus shifting the
patch numbers.)

To reviewers:

 Please give me your sign-offs or acks, or comment on the code,
 explaining what I should change.

The fragmentation code today:

 The fragmentation code "protects" kernel resources by implementing
 some memory resource limitation code.  This is centered around a
 global readers-writer lock, and (per network namespace) an atomic mem
 counter and an LRU (Least-Recently-Used) list.  (Separate global
 variables and namespace resources are kept for IPv4, IPv6 and
 Netfilter reassembly.)

 The code tries to keep the memory usage between a high and low
 threshold (see: /proc/sys/net/ipv4/ipfrag_{high,low}_thresh).  The
 "evictor" code cleans up fragments when the high threshold is
 exceeded, and stops only when the low threshold is reached.
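
 In pseudo-C the policy amounts to the sketch below.  This is only an
 illustration of the intended behaviour (evict_one_frag_queue() is a
 made-up placeholder, not a real kernel function):

	/* illustration of the high/low threshold policy */
	if (atomic_read(&nf->mem) <= nf->high_thresh)
		return 0;			/* below high: nothing to do */
	while (atomic_read(&nf->mem) > nf->low_thresh)
		evict_one_frag_queue(nf);	/* free oldest queue on the LRU list */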

The scalability problem:

 Having a global/central variable for a resource limit is obviously a
 scalability issue on SMP systems, and the effect is amplified on a
 NUMA based system.

 When profiling the code, the scalability problem appeared to be the
 readers-writer lock.  But, surprisingly, the primary scalability issue
 was caused by the global atomic mem limit counter, which, especially
 on NUMA systems, would prolong the time spent inside the
 readers-writer lock sections.  It is not trivial to remove the
 readers-writer lock, but it is possible to reduce the number of
 writer lock sections.

Testlab:

 My original big-testlab was based on four Intel 10Gbit/s NICs on two
 identical Sandy-Bridge-E NUMA systems.  The testlab available while
 rebasing to net-next was not as powerful.  It is based on a single
 Sandy-Bridge-E NUMA system with the same Intel 10G NICs, but the
 generator machine is an old Core-i7 920 with some older NICs.  This
 means that I have not been able to generate full 4x 10G wirespeed.  I
 have (mostly) chosen to include 2x 10G test results due to the
 generator machine (although the 4x 10G results from the big system
 look more impressive).

 The tests are performed with netperf -t UDP_STREAM (which by default
 sends UDP packets of size 65507 bytes, which get fragmented).  The
 netserver instances get pinned with numactl, and each NIC's
 smp_affinity is aligned to the CPU socket on the NUMA node the NIC is
 connected to.

Performance results:

 For the impressive 4x 10Gbit/s big-testlab results, performance goes
  from (a collective) 496 Mbit/s to 38463 Mbit/s (per stream 9615 Mbit/s)
  (at packet size 65507 bytes)

 For the results to be fair/meaningful, I'll report the packet size
 used, as (after the fixes) bigger UDP packets scale better, because
 smaller packets require/create more frag queues to handle.

 I'll report packet sizes 65507 bytes and 4416 bytes (three fragments,
 3*1472=4416).

 Ethernet Flow Control has been disabled (via ethtool -A).  To show
 the real effect of the patches, the system needs to be in an
 "overload" situation.  When Ethernet Flow Control is enabled, the
 receiver makes the generator back off, and the code path is less
 stressed; thus, I have disabled it.

No patches:
 -------
 Results without any patches, and no flow control:

  2x10G size(65507) result:(7+50)     =57   Mbit/s (gen:9613+9473 Mbit/s)
  2x10G size(4416)  result:(3619+3772)=7391 Mbit/s (gen:8339+9105 Mbit/s)

 The very poor result with large frames is caused by the "evictor"
 code, which gets fixed in patch-01.

Patch-01: net: frag evictor, avoid killing warm frag queues
 -------
 The fragmentation evictor has a very unfortunate strategy for killing
 fragment queues when the system is put under pressure.  The evictor
 code basically kills "warm" fragments too quickly, resulting in a
 massive, DoS-like performance drop, as seen in the above (no-patch)
 results with large packets.

 The solution is to avoid killing "warm" fragments, and rather block
 new incoming fragments when the mem limit is exceeded.  This is
 solved by introducing a creation time-stamp, which is set to
 "jiffies" in inet_frag_alloc().
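
 Condensed from patch-01 (full diff later in this series), the core of
 the change is two checks:

	/* inet_frag_evictor(): don't evict queues created in this jiffie */
	if (!force && q->creation_ts == (u32) jiffies)
		break;

	/* inet_frag_alloc(): refuse new queues while over the mem limit */
	if (atomic_read(&nf->mem) > nf->high_thresh)
		return NULL;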

 UPDATE V2:
  - Drop the INET_FRAG_FIRST_IN idea for detecting dropped "head" packets

  2x10G size(65507) result:(3011+2568)=5579 Mbit/s (gen:9613+9553 Mbit/s)
  2x10G size(4416)  result:(3716+3518)=7234 Mbit/s (gen:9037+8614 Mbit/s)

Patch-02: cache line adjust inet_frag_queue.net (netns)
 -------
 Avoid possible cache-line bounces in struct inet_frag_queue by moving
 the net pointer (struct netns_frags), because it is placed on the
 same write-often cache-line as e.g. refcnt and lock.

  2x10G size(65507) result:(2960+2613)=5573 Mbit/s (gen:9614+9465 Mbit/s)
  2x10G size(4416)  result:(3858+3650)=7508 Mbit/s (gen:8076+7633 Mbit/s)

 The performance benefit looks small. We can discuss if this patch is
 needed or not.

Patch-03: move LRU list maintenance outside of rwlock
 -------
 Updating the fragmentation queue's LRU (Least-Recently-Used) list
 required taking the hash writer lock.  However, the LRU list isn't
 tied to the hash at all, so we can use a separate lock for it.
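
 The LRU manipulation moves into small inline helpers that take a new
 lru_lock in struct netns_frags instead of the hash rwlock (condensed
 from patch-03 later in this series):

	static inline void inet_frag_lru_move(struct inet_frag_queue *q)
	{
		spin_lock(&q->net->lru_lock);
		list_move_tail(&q->lru_list, &q->net->lru_list);
		spin_unlock(&q->net->lru_lock);
	}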

 This patch looks like a performance loss for big packets, but the LRU
 locking changes are needed by later patches.

 UPDATE V2:
  - Don't perform inet_frag_lru_move() outside the q.lock (inet_frag_queue)
    Because there was a theoretical chance of a race between
    inet_frag_lru_move() and fq_unlink(), which is called under the
    q.lock. I have not been able to provoke this though (it should
    result in a list poison error)

  2x10G size(65507) result:(2533+2138)=4671 Mbit/s (gen:9612+9461 Mbit/s)
  2x10G size(4416)  result:(3952+3713)=7665 Mbit/s (gen:9168+8415 Mbit/s)

Patch-04: frag helper functions for mem limit tracking
 -------
 This patch is only meant as a preparation patch for the next patch.
 The performance improvement comes from reducing the number of atomic
 operations during freeing of a frag queue, by summing the mem
 accounting first and doing a single atomic decrement.
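
 Condensed from patch-04 (later in this series), inet_frag_destroy()
 now sums the skb truesize while walking the frag list and subtracts
 it in one go:

	unsigned int sum_truesize = 0;

	while (fp) {
		struct sk_buff *xp = fp->next;

		sum_truesize += fp->truesize;
		frag_kfree_skb(nf, f, fp);
		fp = xp;
	}
	sub_frag_mem_limit(q, sum_truesize + f->qsize);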

  2x10G size(65507) result:(2475+3101)=5576 Mbit/s (gen:9614+9439 Mbit/s)
  2x10G size(4416)  result:(3928+4129)=8057 Mbit/s (gen:7259+8131 Mbit/s)

Patch-05: per CPU resource, mem limit and LRU list accounting
 -------
 The major performance bottleneck on NUMA systems is the mem limit
 counter, which is based on an atomic counter.  This patch removes the
 cache-bouncing of the atomic counter, by moving this accounting to be
 bound to each CPU.  The LRU list also needs to be kept per CPU,
 in order to keep the accounting straight.
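
 Condensed from patch-05 (later in this series), the per CPU resource
 state is grouped in one cache-line aligned struct, and each frag
 queue records the CPU it was accounted on:

	struct frag_cpu_limit {
		atomic_t		mem;
		struct list_head	lru_list;
		spinlock_t		lru_lock;
	} ____cacheline_aligned_in_smp;

	/* in struct inet_frag_queue */
	u32			cpu_alloc;  /* used for mem limit accounting */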

 UPDATE V2:
   - Rename struct cpu_resource -> frag_cpu_limit
   - Move init functions from inet_frag.h to inet_fragment.c
   - Cleanup per CPU in inet_frags_exit_net()

  2x10G size(65507) result:(9603+9458)=19061 Mbit/s (gen:9614+9458 Mbit/s)
  2x10G size(4416)  result:(4871+4848)= 9719 Mbit/s (gen:9107+8378 Mbit/s)

 To compare the benefit of the next patches, it is necessary to
 increase the stress on the code, by doing 4x 10Gbit/s tests.

  4x10G size(65507) result:(8631+9337+7534+6928)=32430 Mbit/s
                       (gen:8646+9613+7547+6937 =32743 Mbit/s)
  4x10G size(4416)  result:(2870+2990+2993+3016)=11869 Mbit/s
                       (gen:4819+7767+6893+5043 =24522 Mbit/s)

Patch-06: implement dynamic percpu alloc of frag_cpu_limit
 -------
 Use the percpu API to implement dynamic per CPU allocation of the
 frag_cpu_limit in struct netns_frags.  This replaces the static array
 percpu[NR_CPUS].
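
 Condensed from patch-06 (later in this series), these are the percpu
 API calls involved (error handling omitted):

	nf->percpu = alloc_percpu(struct frag_cpu_limit);		/* init */
	struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);	/* access */
	free_percpu(nf->percpu);			/* in inet_frags_exit_net() */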

 UPDATE V2:
  - This is a new patch.
  - Keeping it separate to get explicit review of this
  - (as this is the first time I use the percpu API)

 2x10G size(65507) result:(9603+9367)=18970 Mbit/s (gen: 9614+9379=18993 Mbit/s)
 2x10G size(4416)  result:(4887+4773)= 9660 Mbit/s (gen: 7966+7412=15378 Mbit/s)

 4x10G size(65507) result:(7821+7723+6784+7859)=30187 Mbit/s
                     (gen: 8017+9545+6798+7863 =32223 Mbit/s)
 4x10G size(4416)  result:(2706+2684+2647+2669)=10706 Mbit/s
                     (gen: 4943+7483+7291+4271 =23988 Mbit/s)

 At first sight it looks like performance went down a bit, but as you
 can see in the next patches, my V2 results are (almost) the same.

Patch-07: nqueues_under_LRU_lock
 -------
 This patch just moves the nqueues counter under the LRU lock (and
 per CPU), instead of the write lock, to prepare for the next patch.
 No need to performance test this part.

Patch-08: hash_bucket_locking
 -------
 This patch implements per hash bucket locking for the frag queue
 hash.  This removes two write locks, and the only remaining write
 lock is for protecting hash rebuild.  This essentially reduces the
 readers-writer lock to a rebuild lock.
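
 A minimal sketch of the idea (my shorthand for this cover letter, see
 the actual patch for the real struct and field names): each hash
 bucket carries its own spinlock, so insert/unlink only serialize per
 bucket, while the rwlock is only taken to guard against hash rebuild:

	struct inet_frag_bucket {
		struct hlist_head	chain;
		spinlock_t		chain_lock;
	};

	read_lock(&f->lock);		/* protects against hash rebuild */
	hb = &f->hash[hash];		/* hash[] becomes an array of buckets */
	spin_lock(&hb->chain_lock);
	hlist_add_head(&qp->list, &hb->chain);
	spin_unlock(&hb->chain_lock);
	read_unlock(&f->lock);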

 UPDATE V2:
  - Fixed two bugs
  - 1) Missed/too-late read_lock() for protecting hashfn in fq_unlink()
  - 2) Used old hash bucket instead of new dest bucket in inet_frag_secret_rebuild()

  2x10G size(65507) result:(9602+9466)=19068 Mbit/s (gen:9613+9472 Mbit/s)
                 V2 result:(9521+9505)=19026 Mbit/s

  2x10G size(4416)  result:(5024+4925)= 9949 Mbit/s (gen:8581+8957 Mbit/s)
                 V2 result:(5140+5206)=10346 Mbit/s

 To see the real benefit of this patch, we need to crank up the load
 and stress on the code, with 4x 10Gbit/s at small packets.  The
 improvement at size(4416): before 11869 Mbit/s, now 17155 Mbit/s.
 Also note the regression at size(65507): 32430 -> 31021.

  4x10G size(65507) result:(7618+8708+7381+7314)=31021 Mbit/s
                 V2 result:(7488+8350+6834+8562)=31234 Mbit/s
                       (gen:7628+9501+8728+7321 =33178 Mbit/s)

  4x10G size(4416)  result:(4156+4714+4300+3985)=17155 Mbit/s
                 V2 result:(4341+4607+3963+4450)=17361 Mbit/s
                       (gen:6614+5330+7745+5366 =25055 Mbit/s)

 At 4x10G size(4416) I have seen 206 frag queues in use, and hash size is 64.

Patch-09: cache_align_hash_bucket
 -------
 Increase the frag queue hash size and ensure cache-line alignment to
 avoid false sharing.  The hash size is set to 256, because I have
 observed 206 frag queues in use at 4x10G with packet size 4416 bytes.
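
 A minimal sketch of what is meant (again my shorthand, not the
 literal patch): bump the hash size and pad each bucket to its own
 cache line, so neighbouring bucket locks do not share a line:

	#define INETFRAGS_HASHSZ	256	/* was 64 */

	struct inet_frag_bucket {
		struct hlist_head	chain;
		spinlock_t		chain_lock;
	} ____cacheline_aligned_in_smp;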

  2x10G size(65507) result:(9601+9414)=19015 Mbit/s (gen:9614+9434 Mbit/s)
                 V2 result:(9599+9427)=19026 Mbit/s

  2x10G size(4416)  result:(5421+5268)=10689 Mbit/s (gen:8028+7457 Mbit/s)
                 V2 result:(5377+5336)=10713 Mbit/s

 This does introduce an improvement (although not as big as I
 expected), but most importantly the regression seen in patch-08 4x10G
 at size(65507) is gone (patch-05: 32430 Mbit/s -> 32676 Mbit/s).

  4x10G size(65507) result:(7604+8307+9593+7172)=32676 Mbit/s
                 V2 result:(7612+8063+9580+7265)=32520 Mbit/s
                       (gen:7615+8713+9606+7184 =33118 Mbit/s)

  4x10G size(4416)  result:(4890+4364+4139+4530)=17923 Mbit/s
                 V2 result:(3860+4533+4936+4519)=17848 Mbit/s
                       (gen:5170+6873+5215+7632 =24890 Mbit/s)

 After this patch it looks like the read lock is now the new
 contention point.

DROPPED Patch-10: Hack disable rebuild and remove rw_lock
 -------
 I've done a quick hack patch that removes the readers-writer lock, by
 disabling/breaking hash rebuilding, just to see how big the
 performance gain would be.

 UPDATE V2:
  - I have dropped this patch.  It was just to show the potential.
  - Let's first integrate the other patches, and leave this for the future

  2x10G size(4416) result: 6481+6764 = 13245 Mbit/s (gen: 7652+8077 Mbit/s)

  4x10G size(4416) result:(5610+6283+5735+5238)=22866 Mbit/s
                     (gen: 6530+7860+5967+5238 =25595 Mbit/s)

 And the results show that it is a big win.  With 4x10G size(4416):
 before 17923 Mbit/s -> now 22866 Mbit/s, an increase of 4943 Mbit/s.
 With 2x10G size(4416): before 10689 Mbit/s -> now 13245 Mbit/s, an
 increase of 2556 Mbit/s.

 In the future, I'll work on a real solution for removing the rw_lock
 while still supporting hash rebuilding.  Suggestions and ideas are
 welcome.


This patchset is based upon:
  Davem's net-next tree:
    git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
  On top of:
    commit ff33c0e1885cda44dd14c79f70df4706f83582a0
    (net: Remove bogus dependencies on INET)

---

Jesper Dangaard Brouer (9):
      net: increase frag queue hash size and cache-line
      net: frag queue locking per hash bucket
      net: frag, move nqueues counter under LRU lock protection
      net: frag, implement dynamic percpu alloc of frag_cpu_limit
      net: frag, per CPU resource, mem limit and LRU list accounting
      net: frag helper functions for mem limit tracking
      net: frag, move LRU list maintenance outside of rwlock
      net: frag cache line adjust inet_frag_queue.net
      net: frag evictor, avoid killing warm frag queues


 include/net/inet_frag.h                 |  114 ++++++++++++++++++++--
 include/net/ipv6.h                      |    4 -
 net/ipv4/inet_fragment.c                |  162 ++++++++++++++++++++++++-------
 net/ipv4/ip_fragment.c                  |   39 +++----
 net/ipv6/netfilter/nf_conntrack_reasm.c |   13 +-
 net/ipv6/reassembly.c                   |   12 +-
 6 files changed, 260 insertions(+), 84 deletions(-)


--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
From: Jesper Dangaard Brouer @ 2012-11-29 16:11 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

The fragmentation evictor has a very unfortunate strategy for killing
fragment queues when the system is put under pressure.  If packets are
coming in too fast, the evictor code kills "warm" fragments too
quickly, resulting in a massive performance drop, because we drop frag
lists where we have already queued up a lot of fragments/work, which
get killed before they have a chance to complete.

This is perhaps amplified by the CPUs fighting for the same lru_list
element q in inet_frag_evictor(), and the atomic_dec_and_test(&q->refcnt)
which will cause another trip round the loop.

The solution idea (originally conceived by Florian Westphal) is to
avoid killing "warm" fragments, and rather block new incoming
fragments in case the mem limit is exceeded.  This is solved by
introducing a creation time-stamp (creation_ts) in inet_frag_queue,
which is set to "jiffies" in inet_frag_alloc().

In inet_frag_evictor() we don't kill "warm" queue elements. A "warm"
queue element is an element reaching inet_frag_evictor() within the
same "jiffie".

To maintain the mem limit (/proc/sys/net/ipv4/ipfrag_high_thresh) we
don't allow new frag queues to be allocated in inet_frag_alloc().
This will result in kernel log messages, which have been adjusted to
"ip_frag_create: mem limit reached".  We should consider dropping this
text, to avoid confusing end-users.

Notice, this is not the complete solution to fixing fragmentation
performance, but it allows us to see what is going on.

Original-idea-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jesper Dangaard Brouer <jbrouer@redhat.com>

---
V2:
 - Drop the INET_FRAG_FIRST_IN idea for detecting dropped "head" packets

 include/net/inet_frag.h  |    1 +
 net/ipv4/inet_fragment.c |   19 +++++++++++++++++++
 net/ipv4/ip_fragment.c   |    6 +++---
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 32786a0..7b897b2 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -24,6 +24,7 @@ struct inet_frag_queue {
 	ktime_t			stamp;
 	int			len;        /* total length of orig datagram */
 	int			meat;
+	u32			creation_ts;/* jiffies when queue was created*/
 	__u8			last_in;    /* first/last segment arrived? */
 
 #define INET_FRAG_COMPLETE	4
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 4750d2b..9bb6237 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -178,6 +178,18 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
 
 		q = list_first_entry(&nf->lru_list,
 				struct inet_frag_queue, lru_list);
+
+		/* queue entry is warm, i.e. new frags are arriving
+		 * too fast.  instead of evicting warm entry, give it
+		 * a chance to complete.  Instead, inet_frag_alloc()
+		 * will fail until more time has elapsed or queue
+		 * completes.
+		 */
+		if (!force && q->creation_ts == (u32) jiffies) {
+			read_unlock(&f->lock);
+			break;
+		}
+
 		atomic_inc(&q->refcnt);
 		read_unlock(&f->lock);
 
@@ -244,10 +256,17 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 {
 	struct inet_frag_queue *q;
 
+	/* Guard creations of new frag queues if mem limit reached, as
+	 * we allow warm/recent elements to survive in inet_frag_evictor()
+	 */
+	if (atomic_read(&nf->mem) > nf->high_thresh)
+		return NULL;
+
 	q = kzalloc(f->qsize, GFP_ATOMIC);
 	if (q == NULL)
 		return NULL;
 
+	q->creation_ts = (u32) jiffies;
 	q->net = nf;
 	f->constructor(q, arg);
 	atomic_add(f->qsize, &nf->mem);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 1cf6a76..ef00d0a 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -300,12 +300,12 @@ static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
 
 	q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
 	if (q == NULL)
-		goto out_nomem;
+		goto out_memlimit;
 
 	return container_of(q, struct ipq, q);
 
-out_nomem:
-	LIMIT_NETDEBUG(KERN_ERR pr_fmt("ip_frag_create: no memory left !\n"));
+out_memlimit:
+	LIMIT_NETDEBUG(KERN_ERR pr_fmt("ip_frag_create: mem limit reached!\n"));
 	return NULL;
 }
 


* [net-next PATCH V2 2/9] net: frag cache line adjust inet_frag_queue.net
From: Jesper Dangaard Brouer @ 2012-11-29 16:11 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

In inet_frag_find(), unfortunate cache-line bounces can occur with
struct inet_frag_queue, as the net pointer (struct netns_frags) is
placed on the same write-often cache-line as e.g. refcnt and lock,
and the hash match check always checks (q->net == nf).

This (of course) only happens on hash bucket collisions, but as the
current hash size is only 64, this makes collisions more likely.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 7b897b2..1f75316 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -14,7 +14,6 @@ struct netns_frags {
 
 struct inet_frag_queue {
 	struct hlist_node	list;
-	struct netns_frags	*net;
 	struct list_head	lru_list;   /* lru list member */
 	spinlock_t		lock;
 	atomic_t		refcnt;
@@ -24,6 +23,7 @@ struct inet_frag_queue {
 	ktime_t			stamp;
 	int			len;        /* total length of orig datagram */
 	int			meat;
+	struct netns_frags	*net;
 	u32			creation_ts;/* jiffies when queue was created*/
 	__u8			last_in;    /* first/last segment arrived? */
 


* [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
From: Jesper Dangaard Brouer @ 2012-11-29 16:12 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

Updating the fragmentation queue's LRU (Least-Recently-Used) list
required taking the hash writer lock.  However, the LRU list isn't
tied to the hash at all, so we can use a separate lock for it.

This change, in itself, does not improve performance significantly,
but it is part of making the fragmentation code scale.

Original-idea-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
V2:
 - Don't perform inet_frag_lru_move() outside the q.lock (inet_frag_queue)
   Because there was a theoretical chance of a race between
   inet_frag_lru_move() and fq_unlink(), which is called under the
   q.lock. I have not been able to provoke this though (it should
   result in a list poison error)

 include/net/inet_frag.h                 |   22 ++++++++++++++++++++++
 net/ipv4/inet_fragment.c                |   14 ++++++++------
 net/ipv4/ip_fragment.c                  |    4 +---
 net/ipv6/netfilter/nf_conntrack_reasm.c |    5 ++---
 net/ipv6/reassembly.c                   |    4 +---
 5 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1f75316..312a3fa 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -5,6 +5,7 @@ struct netns_frags {
 	int			nqueues;
 	atomic_t		mem;
 	struct list_head	lru_list;
+	spinlock_t		lru_lock;
 
 	/* sysctls */
 	int			timeout;
@@ -73,4 +74,25 @@ static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f
 		inet_frag_destroy(q, f, NULL);
 }
 
+static inline void inet_frag_lru_move(struct inet_frag_queue *q)
+{
+	spin_lock(&q->net->lru_lock);
+	list_move_tail(&q->lru_list, &q->net->lru_list);
+	spin_unlock(&q->net->lru_lock);
+}
+
+static inline void inet_frag_lru_del(struct inet_frag_queue *q)
+{
+	spin_lock(&q->net->lru_lock);
+	list_del(&q->lru_list);
+	spin_unlock(&q->net->lru_lock);
+}
+
+static inline void inet_frag_lru_add(struct netns_frags *nf,
+				     struct inet_frag_queue *q)
+{
+	spin_lock(&nf->lru_lock);
+	list_add_tail(&q->lru_list, &nf->lru_list);
+	spin_unlock(&nf->lru_lock);
+}
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 9bb6237..4e56587 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -75,6 +75,7 @@ void inet_frags_init_net(struct netns_frags *nf)
 	nf->nqueues = 0;
 	atomic_set(&nf->mem, 0);
 	INIT_LIST_HEAD(&nf->lru_list);
+	spin_lock_init(&nf->lru_lock);
 }
 EXPORT_SYMBOL(inet_frags_init_net);
 
@@ -98,9 +99,9 @@ static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
 {
 	write_lock(&f->lock);
 	hlist_del(&fq->list);
-	list_del(&fq->lru_list);
 	fq->net->nqueues--;
 	write_unlock(&f->lock);
+	inet_frag_lru_del(fq);
 }
 
 void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
@@ -170,9 +171,10 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
 
 	work = atomic_read(&nf->mem) - nf->low_thresh;
 	while (work > 0) {
-		read_lock(&f->lock);
+		spin_lock(&nf->lru_lock);
+
 		if (list_empty(&nf->lru_list)) {
-			read_unlock(&f->lock);
+			spin_unlock(&nf->lru_lock);
 			break;
 		}
 
@@ -186,12 +188,12 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
 		 * completes.
 		 */
 		if (!force && q->creation_ts == (u32) jiffies) {
-			read_unlock(&f->lock);
+			spin_unlock(&nf->lru_lock);
 			break;
 		}
 
 		atomic_inc(&q->refcnt);
-		read_unlock(&f->lock);
+		spin_unlock(&nf->lru_lock);
 
 		spin_lock(&q->lock);
 		if (!(q->last_in & INET_FRAG_COMPLETE))
@@ -245,9 +247,9 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 
 	atomic_inc(&qp->refcnt);
 	hlist_add_head(&qp->list, &f->hash[hash]);
-	list_add_tail(&qp->lru_list, &nf->lru_list);
 	nf->nqueues++;
 	write_unlock(&f->lock);
+	inet_frag_lru_add(nf, qp);
 	return qp;
 }
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index ef00d0a..b2425bf 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -531,9 +531,7 @@ found:
 	    qp->q.meat == qp->q.len)
 		return ip_frag_reasm(qp, prev, dev);
 
-	write_lock(&ip4_frags.lock);
-	list_move_tail(&qp->q.lru_list, &qp->q.net->lru_list);
-	write_unlock(&ip4_frags.lock);
+	inet_frag_lru_move(&qp->q);
 	return -EINPROGRESS;
 
 err:
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 22c8ea9..b0a1c96 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -325,9 +325,8 @@ found:
 		fq->nhoffset = nhoff;
 		fq->q.last_in |= INET_FRAG_FIRST_IN;
 	}
-	write_lock(&nf_frags.lock);
-	list_move_tail(&fq->q.lru_list, &fq->q.net->lru_list);
-	write_unlock(&nf_frags.lock);
+
+	inet_frag_lru_move(&fq->q);
 	return 0;
 
 discard_fq:
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e5253ec..b373309 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -341,9 +341,7 @@ found:
 	    fq->q.meat == fq->q.len)
 		return ip6_frag_reasm(fq, prev, dev);
 
-	write_lock(&ip6_frags.lock);
-	list_move_tail(&fq->q.lru_list, &fq->q.net->lru_list);
-	write_unlock(&ip6_frags.lock);
+	inet_frag_lru_move(&fq->q);
 	return -1;
 
 discard_fq:


* [net-next PATCH V2 4/9] net: frag helper functions for mem limit tracking
From: Jesper Dangaard Brouer @ 2012-11-29 16:12 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

This change is primarily a preparation to ease the extension of memory
limit tracking.

The change does reduce the number of atomic operations during freeing
of a frag queue.  This does introduce a fairly good performance
improvement, as these atomic operations are at the core of the
performance problems seen on NUMA systems.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h                 |   28 ++++++++++++++++++++++++++++
 include/net/ipv6.h                      |    2 +-
 net/ipv4/inet_fragment.c                |   27 +++++++++++++--------------
 net/ipv4/ip_fragment.c                  |   24 +++++++++++-------------
 net/ipv6/netfilter/nf_conntrack_reasm.c |    6 +++---
 net/ipv6/reassembly.c                   |    6 +++---
 6 files changed, 59 insertions(+), 34 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 312a3fa..9bbef17 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -95,4 +95,32 @@ static inline void inet_frag_lru_add(struct netns_frags *nf,
 	list_add_tail(&q->lru_list, &nf->lru_list);
 	spin_unlock(&nf->lru_lock);
 }
+
+/* Memory Tracking Functions. */
+
+static inline int frag_mem_limit(struct netns_frags *nf)
+{
+	return atomic_read(&nf->mem);
+}
+
+static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
+{
+	atomic_sub(i, &q->net->mem);
+}
+
+static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
+{
+	atomic_add(i, &q->net->mem);
+}
+
+static inline void init_frag_mem_limit(struct netns_frags *nf)
+{
+	atomic_set(&nf->mem, 0);
+}
+
+static inline int sum_frag_mem_limit(struct netns_frags *nf)
+{
+	return atomic_read(&nf->mem);
+}
+
 #endif
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 979bf6c..a5c1cf1 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -279,7 +279,7 @@ static inline int ip6_frag_nqueues(struct net *net)
 
 static inline int ip6_frag_mem(struct net *net)
 {
-	return atomic_read(&net->ipv6.frags.mem);
+	return sum_frag_mem_limit(&net->ipv6.frags);
 }
 #endif
 
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 4e56587..0ecacbd 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -73,7 +73,7 @@ EXPORT_SYMBOL(inet_frags_init);
 void inet_frags_init_net(struct netns_frags *nf)
 {
 	nf->nqueues = 0;
-	atomic_set(&nf->mem, 0);
+	init_frag_mem_limit(nf);
 	INIT_LIST_HEAD(&nf->lru_list);
 	spin_lock_init(&nf->lru_lock);
 }
@@ -118,12 +118,8 @@ void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
 EXPORT_SYMBOL(inet_frag_kill);
 
 static inline void frag_kfree_skb(struct netns_frags *nf, struct inet_frags *f,
-		struct sk_buff *skb, int *work)
+		struct sk_buff *skb)
 {
-	if (work)
-		*work -= skb->truesize;
-
-	atomic_sub(skb->truesize, &nf->mem);
 	if (f->skb_free)
 		f->skb_free(skb);
 	kfree_skb(skb);
@@ -134,6 +130,7 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
 {
 	struct sk_buff *fp;
 	struct netns_frags *nf;
+	unsigned int sum, sum_truesize = 0;
 
 	WARN_ON(!(q->last_in & INET_FRAG_COMPLETE));
 	WARN_ON(del_timer(&q->timer) != 0);
@@ -144,13 +141,14 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
 	while (fp) {
 		struct sk_buff *xp = fp->next;
 
-		frag_kfree_skb(nf, f, fp, work);
+		sum_truesize += fp->truesize;
+		frag_kfree_skb(nf, f, fp);
 		fp = xp;
 	}
-
+	sum = sum_truesize + f->qsize;
 	if (work)
-		*work -= f->qsize;
-	atomic_sub(f->qsize, &nf->mem);
+		*work -= sum;
+	sub_frag_mem_limit(q, sum);
 
 	if (f->destructor)
 		f->destructor(q);
@@ -165,11 +163,11 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
 	int work, evicted = 0;
 
 	if (!force) {
-		if (atomic_read(&nf->mem) <= nf->high_thresh)
+		if (frag_mem_limit(nf) <= nf->high_thresh)
 			return 0;
 	}
 
-	work = atomic_read(&nf->mem) - nf->low_thresh;
+	work = frag_mem_limit(nf) - nf->low_thresh;
 	while (work > 0) {
 		spin_lock(&nf->lru_lock);
 
@@ -261,7 +259,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 	/* Guard creations of new frag queues if mem limit reached, as
 	 * we allow warm/recent elements to survive in inet_frag_evictor()
 	 */
-	if (atomic_read(&nf->mem) > nf->high_thresh)
+	if (frag_mem_limit(nf) > nf->high_thresh)
 		return NULL;
 
 	q = kzalloc(f->qsize, GFP_ATOMIC);
@@ -271,7 +269,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 	q->creation_ts = (u32) jiffies;
 	q->net = nf;
 	f->constructor(q, arg);
-	atomic_add(f->qsize, &nf->mem);
+	add_frag_mem_limit(q, f->qsize);
+
 	setup_timer(&q->timer, f->frag_expire, (unsigned long)q);
 	spin_lock_init(&q->lock);
 	atomic_set(&q->refcnt, 1);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index b2425bf..abb5551 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -122,7 +122,7 @@ int ip_frag_nqueues(struct net *net)
 
 int ip_frag_mem(struct net *net)
 {
-	return atomic_read(&net->ipv4.frags.mem);
+	return sum_frag_mem_limit(&net->ipv4.frags);
 }
 
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
@@ -161,13 +161,6 @@ static bool ip4_frag_match(struct inet_frag_queue *q, void *a)
 		qp->user == arg->user;
 }
 
-/* Memory Tracking Functions. */
-static void frag_kfree_skb(struct netns_frags *nf, struct sk_buff *skb)
-{
-	atomic_sub(skb->truesize, &nf->mem);
-	kfree_skb(skb);
-}
-
 static void ip4_frag_init(struct inet_frag_queue *q, void *a)
 {
 	struct ipq *qp = container_of(q, struct ipq, q);
@@ -340,6 +333,7 @@ static inline int ip_frag_too_far(struct ipq *qp)
 static int ip_frag_reinit(struct ipq *qp)
 {
 	struct sk_buff *fp;
+	unsigned int sum_truesize = 0;
 
 	if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
 		atomic_inc(&qp->q.refcnt);
@@ -349,9 +343,12 @@ static int ip_frag_reinit(struct ipq *qp)
 	fp = qp->q.fragments;
 	do {
 		struct sk_buff *xp = fp->next;
-		frag_kfree_skb(qp->q.net, fp);
+
+		sum_truesize += fp->truesize;
+		kfree_skb(fp);
 		fp = xp;
 	} while (fp);
+	sub_frag_mem_limit(&qp->q, sum_truesize);
 
 	qp->q.last_in = 0;
 	qp->q.len = 0;
@@ -496,7 +493,8 @@ found:
 				qp->q.fragments = next;
 
 			qp->q.meat -= free_it->len;
-			frag_kfree_skb(qp->q.net, free_it);
+			sub_frag_mem_limit(&qp->q, free_it->truesize);
+			kfree_skb(free_it);
 		}
 	}
 
@@ -519,7 +517,7 @@ found:
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
-	atomic_add(skb->truesize, &qp->q.net->mem);
+	add_frag_mem_limit(&qp->q, skb->truesize);
 	if (offset == 0)
 		qp->q.last_in |= INET_FRAG_FIRST_IN;
 
@@ -615,7 +613,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		head->len -= clone->len;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
-		atomic_add(clone->truesize, &qp->q.net->mem);
+		add_frag_mem_limit(&qp->q, clone->truesize);
 	}
 
 	skb_push(head, head->data - skb_network_header(head));
@@ -643,7 +641,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		}
 		fp = next;
 	}
-	atomic_sub(sum_truesize, &qp->q.net->mem);
+	sub_frag_mem_limit(&qp->q, sum_truesize);
 
 	head->next = NULL;
 	head->dev = dev;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index b0a1c96..c088831 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -316,7 +316,7 @@ found:
 	fq->q.meat += skb->len;
 	if (payload_len > fq->q.max_size)
 		fq->q.max_size = payload_len;
-	atomic_add(skb->truesize, &fq->q.net->mem);
+	add_frag_mem_limit(&fq->q, skb->truesize);
 
 	/* The first fragment.
 	 * nhoffset is obtained from the first fragment, of course.
@@ -394,7 +394,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 		clone->ip_summed = head->ip_summed;
 
 		NFCT_FRAG6_CB(clone)->orig = NULL;
-		atomic_add(clone->truesize, &fq->q.net->mem);
+		add_frag_mem_limit(&fq->q, clone->truesize);
 	}
 
 	/* We have to remove fragment header from datagram and to relocate
@@ -418,7 +418,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 			head->csum = csum_add(head->csum, fp->csum);
 		head->truesize += fp->truesize;
 	}
-	atomic_sub(head->truesize, &fq->q.net->mem);
+	sub_frag_mem_limit(&fq->q, head->truesize);
 
 	head->local_df = 1;
 	head->next = NULL;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index b373309..bab2c27 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -327,7 +327,7 @@ found:
 	}
 	fq->q.stamp = skb->tstamp;
 	fq->q.meat += skb->len;
-	atomic_add(skb->truesize, &fq->q.net->mem);
+	add_frag_mem_limit(&fq->q, skb->truesize);
 
 	/* The first fragment.
 	 * nhoffset is obtained from the first fragment, of course.
@@ -427,7 +427,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 		head->len -= clone->len;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
-		atomic_add(clone->truesize, &fq->q.net->mem);
+		add_frag_mem_limit(&fq->q, clone->truesize);
 	}
 
 	/* We have to remove fragment header from datagram and to relocate
@@ -465,7 +465,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 		}
 		fp = next;
 	}
-	atomic_sub(sum_truesize, &fq->q.net->mem);
+	sub_frag_mem_limit(&fq->q, sum_truesize);
 
 	head->next = NULL;
 	head->dev = dev;


* [net-next PATCH V2 5/9] net: frag, per CPU resource, mem limit and LRU list accounting
From: Jesper Dangaard Brouer @ 2012-11-29 16:13 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

The major performance bottleneck on NUMA systems is the mem limit
counter, which is based on an atomic counter.  This patch removes the
cache-bouncing of the atomic counter, by moving this accounting to be
bound to each CPU.  The LRU list also needs to be kept per CPU,
in order to keep the accounting straight.

If fragments belonging together are "sprayed" across CPUs, performance
will still suffer, but due to NIC rxhashing this is not very common.
Correct accounting in this situation is maintained by recording and
"assigning" a CPU to a frag queue when it is allocated (determined by
the first associated packet).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
V2:
 - Rename struct cpu_resource -> frag_cpu_limit
 - Move init functions from inet_frag.h to inet_fragment.c
 - Cleanup per CPU in inet_frags_exit_net()

 include/net/inet_frag.h                 |   64 +++++++++++++++++++------------
 net/ipv4/inet_fragment.c                |   50 ++++++++++++++++++------
 net/ipv4/ip_fragment.c                  |    3 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |    2 -
 net/ipv6/reassembly.c                   |    2 -
 5 files changed, 80 insertions(+), 41 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 9bbef17..8421904 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -1,11 +1,22 @@
 #ifndef __NET_FRAG_H__
 #define __NET_FRAG_H__
 
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+
+/* Need to maintain these resource limits per CPU, else we will kill
+ * performance due to cache-line bouncing
+ */
+struct frag_cpu_limit {
+	atomic_t                mem;
+	struct list_head        lru_list;
+	spinlock_t              lru_lock;
+} ____cacheline_aligned_in_smp;
+
 struct netns_frags {
 	int			nqueues;
-	atomic_t		mem;
-	struct list_head	lru_list;
-	spinlock_t		lru_lock;
+
+	struct frag_cpu_limit	percpu[NR_CPUS];
 
 	/* sysctls */
 	int			timeout;
@@ -26,6 +37,7 @@ struct inet_frag_queue {
 	int			meat;
 	struct netns_frags	*net;
 	u32			creation_ts;/* jiffies when queue was created*/
+	u32			cpu_alloc;  /* used for mem limit accounting */
 	__u8			last_in;    /* first/last segment arrived? */
 
 #define INET_FRAG_COMPLETE	4
@@ -63,7 +75,8 @@ void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f);
 void inet_frag_kill(struct inet_frag_queue *q, struct inet_frags *f);
 void inet_frag_destroy(struct inet_frag_queue *q,
 				struct inet_frags *f, int *work);
-int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force);
+int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f,
+		      bool force, int on_cpu);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
 	__releases(&f->lock);
@@ -74,53 +87,54 @@ static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f
 		inet_frag_destroy(q, f, NULL);
 }
 
+/* LRU (Least Recently Used) resource functions */
+
 static inline void inet_frag_lru_move(struct inet_frag_queue *q)
 {
-	spin_lock(&q->net->lru_lock);
-	list_move_tail(&q->lru_list, &q->net->lru_list);
-	spin_unlock(&q->net->lru_lock);
+	int cpu = q->cpu_alloc;
+	spin_lock(&q->net->percpu[cpu].lru_lock);
+	list_move_tail(&q->lru_list, &q->net->percpu[cpu].lru_list);
+	spin_unlock(&q->net->percpu[cpu].lru_lock);
 }
 
 static inline void inet_frag_lru_del(struct inet_frag_queue *q)
 {
-	spin_lock(&q->net->lru_lock);
+	int cpu = q->cpu_alloc;
+	spin_lock(&q->net->percpu[cpu].lru_lock);
 	list_del(&q->lru_list);
-	spin_unlock(&q->net->lru_lock);
+	spin_unlock(&q->net->percpu[cpu].lru_lock);
 }
 
 static inline void inet_frag_lru_add(struct netns_frags *nf,
 				     struct inet_frag_queue *q)
 {
-	spin_lock(&nf->lru_lock);
-	list_add_tail(&q->lru_list, &nf->lru_list);
-	spin_unlock(&nf->lru_lock);
+	int cpu = q->cpu_alloc;
+	spin_lock(&nf->percpu[cpu].lru_lock);
+	list_add_tail(&q->lru_list, &nf->percpu[cpu].lru_list);
+	spin_unlock(&nf->percpu[cpu].lru_lock);
 }
 
 /* Memory Tracking Functions. */
 
-static inline int frag_mem_limit(struct netns_frags *nf)
-{
-	return atomic_read(&nf->mem);
-}
-
 static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
 {
-	atomic_sub(i, &q->net->mem);
+	int cpu = q->cpu_alloc;
+	atomic_sub(i, &q->net->percpu[cpu].mem);
 }
 
 static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
 {
-	atomic_add(i, &q->net->mem);
-}
-
-static inline void init_frag_mem_limit(struct netns_frags *nf)
-{
-	atomic_set(&nf->mem, 0);
+	int cpu = q->cpu_alloc;
+	atomic_add(i, &q->net->percpu[cpu].mem);
 }
 
 static inline int sum_frag_mem_limit(struct netns_frags *nf)
 {
-	return atomic_read(&nf->mem);
+	unsigned int sum = 0;
+	int cpu;
+	for_each_possible_cpu(cpu)
+		sum += atomic_read(&nf->percpu[cpu].mem);
+	return sum;
 }
 
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 0ecacbd..068aabe 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -23,6 +23,17 @@
 
 #include <net/inet_frag.h>
 
+static inline int frag_mem_limit_on_cpu(struct netns_frags *nf, int on_cpu)
+{
+	return atomic_read(&nf->percpu[on_cpu].mem);
+}
+
+static inline int frag_mem_limit(struct netns_frags *nf)
+{
+	int cpu = smp_processor_id();
+	return frag_mem_limit_on_cpu(nf, cpu);
+}
+
 static void inet_frag_secret_rebuild(unsigned long dummy)
 {
 	struct inet_frags *f = (struct inet_frags *)dummy;
@@ -70,12 +81,20 @@ void inet_frags_init(struct inet_frags *f)
 }
 EXPORT_SYMBOL(inet_frags_init);
 
+static void inet_frags_init_percpu_limit(struct netns_frags *nf)
+{
+	int cpu;
+	for_each_possible_cpu(cpu) {
+		INIT_LIST_HEAD(&nf->percpu[cpu].lru_list);
+		spin_lock_init(&nf->percpu[cpu].lru_lock);
+		atomic_set(&nf->percpu[cpu].mem, 0);
+	}
+}
+
 void inet_frags_init_net(struct netns_frags *nf)
 {
 	nf->nqueues = 0;
-	init_frag_mem_limit(nf);
-	INIT_LIST_HEAD(&nf->lru_list);
-	spin_lock_init(&nf->lru_lock);
+	inet_frags_init_percpu_limit(nf);
 }
 EXPORT_SYMBOL(inet_frags_init_net);
 
@@ -87,10 +106,12 @@ EXPORT_SYMBOL(inet_frags_fini);
 
 void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f)
 {
+	int cpu;
 	nf->low_thresh = 0;
 
 	local_bh_disable();
-	inet_frag_evictor(nf, f, true);
+	for_each_possible_cpu(cpu)
+		inet_frag_evictor(nf, f, true, cpu);
 	local_bh_enable();
 }
 EXPORT_SYMBOL(inet_frags_exit_net);
@@ -157,26 +178,28 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
 }
 EXPORT_SYMBOL(inet_frag_destroy);
 
-int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
+int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f,
+		      bool force, int on_cpu)
 {
 	struct inet_frag_queue *q;
 	int work, evicted = 0;
+	int cpu = (likely(on_cpu < 0)) ? smp_processor_id() : on_cpu;
 
 	if (!force) {
-		if (frag_mem_limit(nf) <= nf->high_thresh)
+		if (frag_mem_limit_on_cpu(nf, cpu) <= nf->high_thresh)
 			return 0;
 	}
 
-	work = frag_mem_limit(nf) - nf->low_thresh;
+	work = frag_mem_limit_on_cpu(nf, cpu) - nf->low_thresh;
 	while (work > 0) {
-		spin_lock(&nf->lru_lock);
+		spin_lock(&nf->percpu[cpu].lru_lock);
 
-		if (list_empty(&nf->lru_list)) {
-			spin_unlock(&nf->lru_lock);
+		if (list_empty(&nf->percpu[cpu].lru_list)) {
+			spin_unlock(&nf->percpu[cpu].lru_lock);
 			break;
 		}
 
-		q = list_first_entry(&nf->lru_list,
+		q = list_first_entry(&nf->percpu[cpu].lru_list,
 				struct inet_frag_queue, lru_list);
 
 		/* queue entry is warm, i.e. new frags are arriving
@@ -186,12 +209,12 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
 		 * completes.
 		 */
 		if (!force && q->creation_ts == (u32) jiffies) {
-			spin_unlock(&nf->lru_lock);
+			spin_unlock(&nf->percpu[cpu].lru_lock);
 			break;
 		}
 
 		atomic_inc(&q->refcnt);
-		spin_unlock(&nf->lru_lock);
+		spin_unlock(&nf->percpu[cpu].lru_lock);
 
 		spin_lock(&q->lock);
 		if (!(q->last_in & INET_FRAG_COMPLETE))
@@ -267,6 +290,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 		return NULL;
 
 	q->creation_ts = (u32) jiffies;
+	q->cpu_alloc = (u32) smp_processor_id();
 	q->net = nf;
 	f->constructor(q, arg);
 	add_frag_mem_limit(q, f->qsize);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index abb5551..99944a8 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -18,6 +18,7 @@
  *		John McDonald	:	0 length frag bug.
  *		Alexey Kuznetsov:	SMP races, threading, cleanup.
  *		Patrick McHardy :	LRU queue of frag heads for evictor.
+ *		Jesper Brouer   :	SMP/NUMA scalability
  */
 
 #define pr_fmt(fmt) "IPv4: " fmt
@@ -212,7 +213,7 @@ static void ip_evictor(struct net *net)
 {
 	int evicted;
 
-	evicted = inet_frag_evictor(&net->ipv4.frags, &ip4_frags, false);
+	evicted = inet_frag_evictor(&net->ipv4.frags, &ip4_frags, false, -1);
 	if (evicted)
 		IP_ADD_STATS_BH(net, IPSTATS_MIB_REASMFAILS, evicted);
 }
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index c088831..8cb1710 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -566,7 +566,7 @@ struct sk_buff *nf_ct_frag6_gather(struct sk_buff *skb, u32 user)
 	fhdr = (struct frag_hdr *)skb_transport_header(clone);
 
 	local_bh_disable();
-	inet_frag_evictor(&net->nf_frag.frags, &nf_frags, false);
+	inet_frag_evictor(&net->nf_frag.frags, &nf_frags, false, -1);
 	local_bh_enable();
 
 	fq = fq_find(net, fhdr->identification, user, &hdr->saddr, &hdr->daddr);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index bab2c27..d1e70dd 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -529,7 +529,7 @@ static int ipv6_frag_rcv(struct sk_buff *skb)
 		return 1;
 	}
 
-	evicted = inet_frag_evictor(&net->ipv6.frags, &ip6_frags, false);
+	evicted = inet_frag_evictor(&net->ipv6.frags, &ip6_frags, false, -1);
 	if (evicted)
 		IP6_ADD_STATS_BH(net, ip6_dst_idev(skb_dst(skb)),
 				 IPSTATS_MIB_REASMFAILS, evicted);


* [net-next PATCH V2 6/9] net: frag, implement dynamic percpu alloc of frag_cpu_limit
From: Jesper Dangaard Brouer @ 2012-11-29 16:14 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

Use the percpu API to implement dynamic per CPU allocation
of the frag_cpu_limit in struct netns_frags.  This replaces
the static array percpu[NR_CPUS].

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
It's the first time I use the percpu API; please let me know
if I'm using it correctly.

 include/net/inet_frag.h  |   39 ++++++++++++++++++++++++++-------------
 net/ipv4/inet_fragment.c |   34 +++++++++++++++++++++++-----------
 2 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 8421904..3eadf42 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -3,6 +3,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/atomic.h>
+#include <linux/percpu.h>
 
 /* Need to maintain these resource limits per CPU, else we will kill
  * performance due to cache-line bouncing
@@ -16,7 +17,7 @@ struct frag_cpu_limit {
 struct netns_frags {
 	int			nqueues;
 
-	struct frag_cpu_limit	percpu[NR_CPUS];
+	struct frag_cpu_limit __percpu *percpu;
 
 	/* sysctls */
 	int			timeout;
@@ -92,26 +93,32 @@ static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f
 static inline void inet_frag_lru_move(struct inet_frag_queue *q)
 {
 	int cpu = q->cpu_alloc;
-	spin_lock(&q->net->percpu[cpu].lru_lock);
-	list_move_tail(&q->lru_list, &q->net->percpu[cpu].lru_list);
-	spin_unlock(&q->net->percpu[cpu].lru_lock);
+	struct frag_cpu_limit *percpu = per_cpu_ptr(q->net->percpu, cpu);
+
+	spin_lock(&percpu->lru_lock);
+	list_move_tail(&q->lru_list, &percpu->lru_list);
+	spin_unlock(&percpu->lru_lock);
 }
 
 static inline void inet_frag_lru_del(struct inet_frag_queue *q)
 {
 	int cpu = q->cpu_alloc;
-	spin_lock(&q->net->percpu[cpu].lru_lock);
+	struct frag_cpu_limit *percpu = per_cpu_ptr(q->net->percpu, cpu);
+
+	spin_lock(&percpu->lru_lock);
 	list_del(&q->lru_list);
-	spin_unlock(&q->net->percpu[cpu].lru_lock);
+	spin_unlock(&percpu->lru_lock);
 }
 
 static inline void inet_frag_lru_add(struct netns_frags *nf,
 				     struct inet_frag_queue *q)
 {
 	int cpu = q->cpu_alloc;
-	spin_lock(&nf->percpu[cpu].lru_lock);
-	list_add_tail(&q->lru_list, &nf->percpu[cpu].lru_list);
-	spin_unlock(&nf->percpu[cpu].lru_lock);
+	struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
+
+	spin_lock(&percpu->lru_lock);
+	list_add_tail(&q->lru_list, &percpu->lru_list);
+	spin_unlock(&percpu->lru_lock);
 }
 
 /* Memory Tracking Functions. */
@@ -119,21 +126,27 @@ static inline void inet_frag_lru_add(struct netns_frags *nf,
 static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
 {
 	int cpu = q->cpu_alloc;
-	atomic_sub(i, &q->net->percpu[cpu].mem);
+	struct frag_cpu_limit *percpu = per_cpu_ptr(q->net->percpu, cpu);
+	atomic_sub(i, &percpu->mem);
 }
 
 static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
 {
 	int cpu = q->cpu_alloc;
-	atomic_add(i, &q->net->percpu[cpu].mem);
+	struct frag_cpu_limit *percpu = per_cpu_ptr(q->net->percpu, cpu);
+	atomic_add(i, &percpu->mem);
 }
 
 static inline int sum_frag_mem_limit(struct netns_frags *nf)
 {
 	unsigned int sum = 0;
 	int cpu;
-	for_each_possible_cpu(cpu)
-		sum += atomic_read(&nf->percpu[cpu].mem);
+
+	for_each_possible_cpu(cpu) {
+		struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
+
+		sum += atomic_read(&percpu->mem);
+	}
 	return sum;
 }
 
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 068aabe..0099f0c 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -25,7 +25,8 @@
 
 static inline int frag_mem_limit_on_cpu(struct netns_frags *nf, int on_cpu)
 {
-	return atomic_read(&nf->percpu[on_cpu].mem);
+	struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, on_cpu);
+	return atomic_read(&percpu->mem);
 }
 
 static inline int frag_mem_limit(struct netns_frags *nf)
@@ -81,14 +82,22 @@ void inet_frags_init(struct inet_frags *f)
 }
 EXPORT_SYMBOL(inet_frags_init);
 
-static void inet_frags_init_percpu_limit(struct netns_frags *nf)
+static int inet_frags_init_percpu_limit(struct netns_frags *nf)
 {
 	int cpu;
+
+	nf->percpu = alloc_percpu(struct frag_cpu_limit);
+	if (!nf->percpu)
+		return -ENOMEM;
+
 	for_each_possible_cpu(cpu) {
-		INIT_LIST_HEAD(&nf->percpu[cpu].lru_list);
-		spin_lock_init(&nf->percpu[cpu].lru_lock);
-		atomic_set(&nf->percpu[cpu].mem, 0);
+		struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
+
+		INIT_LIST_HEAD(&percpu->lru_list);
+		spin_lock_init(&percpu->lru_lock);
+		atomic_set(&percpu->mem, 0);
 	}
+	return 1;
 }
 
 void inet_frags_init_net(struct netns_frags *nf)
@@ -113,6 +122,8 @@ void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f)
 	for_each_possible_cpu(cpu)
 		inet_frag_evictor(nf, f, true, cpu);
 	local_bh_enable();
+
+	free_percpu(nf->percpu);
 }
 EXPORT_SYMBOL(inet_frags_exit_net);
 
@@ -184,6 +195,7 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f,
 	struct inet_frag_queue *q;
 	int work, evicted = 0;
 	int cpu = (likely(on_cpu < 0)) ? smp_processor_id() : on_cpu;
+	struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
 
 	if (!force) {
 		if (frag_mem_limit_on_cpu(nf, cpu) <= nf->high_thresh)
@@ -192,14 +204,14 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f,
 
 	work = frag_mem_limit_on_cpu(nf, cpu) - nf->low_thresh;
 	while (work > 0) {
-		spin_lock(&nf->percpu[cpu].lru_lock);
+		spin_lock(&percpu->lru_lock);
 
-		if (list_empty(&nf->percpu[cpu].lru_list)) {
-			spin_unlock(&nf->percpu[cpu].lru_lock);
+		if (list_empty(&percpu->lru_list)) {
+			spin_unlock(&percpu->lru_lock);
 			break;
 		}
 
-		q = list_first_entry(&nf->percpu[cpu].lru_list,
+		q = list_first_entry(&percpu->lru_list,
 				struct inet_frag_queue, lru_list);
 
 		/* queue entry is warm, i.e. new frags are arriving
@@ -209,12 +221,12 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f,
 		 * completes.
 		 */
 		if (!force && q->creation_ts == (u32) jiffies) {
-			spin_unlock(&nf->percpu[cpu].lru_lock);
+			spin_unlock(&percpu->lru_lock);
 			break;
 		}
 
 		atomic_inc(&q->refcnt);
-		spin_unlock(&nf->percpu[cpu].lru_lock);
+		spin_unlock(&percpu->lru_lock);
 
 		spin_lock(&q->lock);
 		if (!(q->last_in & INET_FRAG_COMPLETE))


* [net-next PATCH V2 7/9] net: frag, move nqueues counter under LRU lock protection
From: Jesper Dangaard Brouer @ 2012-11-29 16:15 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

Preparation patch for per hash bucket locking.

This patch moves the nqueues counter under the LRU lock (and makes it
per CPU), instead of under the hash write lock, to prepare for the next
patch.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h  |   19 +++++++++++++++++--
 include/net/ipv6.h       |    2 +-
 net/ipv4/inet_fragment.c |    4 +---
 net/ipv4/ip_fragment.c   |    2 +-
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 3eadf42..f58590f 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -12,11 +12,10 @@ struct frag_cpu_limit {
 	atomic_t                mem;
 	struct list_head        lru_list;
 	spinlock_t              lru_lock;
+	int			nqueues;
 } ____cacheline_aligned_in_smp;
 
 struct netns_frags {
-	int			nqueues;
-
 	struct frag_cpu_limit __percpu *percpu;
 
 	/* sysctls */
@@ -107,6 +106,7 @@ static inline void inet_frag_lru_del(struct inet_frag_queue *q)
 
 	spin_lock(&percpu->lru_lock);
 	list_del(&q->lru_list);
+	percpu->nqueues--;
 	spin_unlock(&percpu->lru_lock);
 }
 
@@ -118,6 +118,7 @@ static inline void inet_frag_lru_add(struct netns_frags *nf,
 
 	spin_lock(&percpu->lru_lock);
 	list_add_tail(&q->lru_list, &percpu->lru_list);
+	percpu->nqueues++;
 	spin_unlock(&percpu->lru_lock);
 }
 
@@ -150,4 +151,18 @@ static inline int sum_frag_mem_limit(struct netns_frags *nf)
 	return sum;
 }
 
+static inline int sum_frag_nqueues(struct netns_frags *nf)
+{
+	unsigned int sum = 0;
+	int cpu;
+	for_each_possible_cpu(cpu) {
+		struct frag_cpu_limit *percpu = per_cpu_ptr(nf->percpu, cpu);
+
+		spin_lock(&percpu->lru_lock);
+		sum += percpu->nqueues;
+		spin_unlock(&percpu->lru_lock);
+	}
+	return sum;
+}
+
 #endif
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index a5c1cf1..27edfcf 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -274,7 +274,7 @@ extern bool ipv6_opt_accepted(const struct sock *sk, const struct sk_buff *skb);
 #if IS_ENABLED(CONFIG_IPV6)
 static inline int ip6_frag_nqueues(struct net *net)
 {
-	return net->ipv6.frags.nqueues;
+	return sum_frag_nqueues(&net->ipv6.frags);
 }
 
 static inline int ip6_frag_mem(struct net *net)
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 0099f0c..9b97f2e 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -96,13 +96,13 @@ static int inet_frags_init_percpu_limit(struct netns_frags *nf)
 		INIT_LIST_HEAD(&percpu->lru_list);
 		spin_lock_init(&percpu->lru_lock);
 		atomic_set(&percpu->mem, 0);
+		percpu->nqueues = 0;
 	}
 	return 1;
 }
 
 void inet_frags_init_net(struct netns_frags *nf)
 {
-	nf->nqueues = 0;
 	inet_frags_init_percpu_limit(nf);
 }
 EXPORT_SYMBOL(inet_frags_init_net);
@@ -131,7 +131,6 @@ static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
 {
 	write_lock(&f->lock);
 	hlist_del(&fq->list);
-	fq->net->nqueues--;
 	write_unlock(&f->lock);
 	inet_frag_lru_del(fq);
 }
@@ -280,7 +279,6 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 
 	atomic_inc(&qp->refcnt);
 	hlist_add_head(&qp->list, &f->hash[hash]);
-	nf->nqueues++;
 	write_unlock(&f->lock);
 	inet_frag_lru_add(nf, qp);
 	return qp;
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 99944a8..26fd2b7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -118,7 +118,7 @@ static struct inet_frags ip4_frags;
 
 int ip_frag_nqueues(struct net *net)
 {
-	return net->ipv4.frags.nqueues;
+	return sum_frag_nqueues(&net->ipv4.frags);
 }
 
 int ip_frag_mem(struct net *net)

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [net-next PATCH V2 8/9] net: frag queue locking per hash bucket
  2012-11-29 16:10 [net-next PATCH V2 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
                   ` (6 preceding siblings ...)
  2012-11-29 16:15 ` [net-next PATCH V2 7/9] net: frag, move nqueues counter under LRU lock protection Jesper Dangaard Brouer
@ 2012-11-29 16:15 ` Jesper Dangaard Brouer
  2012-11-29 17:08   ` Eric Dumazet
  2012-11-29 16:16 ` [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line Jesper Dangaard Brouer
  8 siblings, 1 reply; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-29 16:15 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

This patch implements per hash bucket locking for the frag queue
hash.  This removes two write locks, and the only remaining write
lock is for protecting the hash rebuild.  This essentially reduces the
readers-writer lock to a rebuild lock.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
V2:
 - Fixed two bugs
 - 1) Missed/too-late read_lock() for protecting hashfn in fq_unlink()
 - 2) Uses old hash bucket instead of new dest bucket in inet_frag_secret_rebuild()

 include/net/inet_frag.h  |   10 +++++++-
 net/ipv4/inet_fragment.c |   56 +++++++++++++++++++++++++++++++++++-----------
 2 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f58590f..431f68e 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -49,9 +49,15 @@ struct inet_frag_queue {
 
 #define INETFRAGS_HASHSZ		64
 
+
+struct inet_frag_bucket {
+	struct hlist_head	chain;
+	spinlock_t		chain_lock;
+};
+
 struct inet_frags {
-	struct hlist_head	hash[INETFRAGS_HASHSZ];
-	rwlock_t		lock;
+	struct inet_frag_bucket	hash[INETFRAGS_HASHSZ];
+	rwlock_t		lock; /* Rebuild lock */
 	u32			rnd;
 	int			qsize;
 	int			secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 9b97f2e..59999e6 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -41,20 +41,27 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
 	unsigned long now = jiffies;
 	int i;
 
+	/* Per bucket lock NOT needed here, due to write lock protection */
 	write_lock(&f->lock);
+
 	get_random_bytes(&f->rnd, sizeof(u32));
 	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+		struct inet_frag_bucket *hb;
 		struct inet_frag_queue *q;
 		struct hlist_node *p, *n;
 
-		hlist_for_each_entry_safe(q, p, n, &f->hash[i], list) {
+		hb = &f->hash[i];
+		hlist_for_each_entry_safe(q, p, n, &hb->chain, list) {
 			unsigned int hval = f->hashfn(q);
 
 			if (hval != i) {
+				struct inet_frag_bucket *hb_dest;
+
 				hlist_del(&q->list);
 
 				/* Relink to new hash chain. */
-				hlist_add_head(&q->list, &f->hash[hval]);
+				hb_dest = &f->hash[hval];
+				hlist_add_head(&q->list, &hb->chain);
 			}
 		}
 	}
@@ -67,9 +74,12 @@ void inet_frags_init(struct inet_frags *f)
 {
 	int i;
 
-	for (i = 0; i < INETFRAGS_HASHSZ; i++)
-		INIT_HLIST_HEAD(&f->hash[i]);
+	for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+		struct inet_frag_bucket *hb = &f->hash[i];
 
+		spin_lock_init(&hb->chain_lock);
+		INIT_HLIST_HEAD(&hb->chain);
+	}
 	rwlock_init(&f->lock);
 
 	f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
@@ -129,9 +139,17 @@ EXPORT_SYMBOL(inet_frags_exit_net);
 
 static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
 {
-	write_lock(&f->lock);
+	struct inet_frag_bucket *hb;
+	unsigned int hash;
+
+	read_lock(&f->lock);
+	hash = f->hashfn(fq);
+	hb = &f->hash[hash];
+
+	spin_lock_bh(&hb->chain_lock);
 	hlist_del(&fq->list);
-	write_unlock(&f->lock);
+	spin_unlock_bh(&hb->chain_lock);
+	read_unlock(&f->lock);
 	inet_frag_lru_del(fq);
 }
 
@@ -245,28 +263,33 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		struct inet_frag_queue *qp_in, struct inet_frags *f,
 		void *arg)
 {
+	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *qp;
 #ifdef CONFIG_SMP
 	struct hlist_node *n;
 #endif
 	unsigned int hash;
 
-	write_lock(&f->lock);
+	read_lock(&f->lock); /* Protects against hash rebuild */
 	/*
 	 * While we stayed w/o the lock other CPU could update
 	 * the rnd seed, so we need to re-calculate the hash
 	 * chain. Fortunatelly the qp_in can be used to get one.
 	 */
 	hash = f->hashfn(qp_in);
+	hb = &f->hash[hash];
+	spin_lock_bh(&hb->chain_lock);
+
 #ifdef CONFIG_SMP
 	/* With SMP race we have to recheck hash table, because
 	 * such entry could be created on other cpu, while we
-	 * promoted read lock to write lock.
+	 * promoted read lock to write lock. ***Comment FIXME***
 	 */
-	hlist_for_each_entry(qp, n, &f->hash[hash], list) {
+	hlist_for_each_entry(qp, n, &hb->chain, list) {
 		if (qp->net == nf && f->match(qp, arg)) {
 			atomic_inc(&qp->refcnt);
-			write_unlock(&f->lock);
+			spin_unlock_bh(&hb->chain_lock);
+			read_unlock(&f->lock);
 			qp_in->last_in |= INET_FRAG_COMPLETE;
 			inet_frag_put(qp_in, f);
 			return qp;
@@ -278,8 +301,9 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
 		atomic_inc(&qp->refcnt);
 
 	atomic_inc(&qp->refcnt);
-	hlist_add_head(&qp->list, &f->hash[hash]);
-	write_unlock(&f->lock);
+	hlist_add_head(&qp->list, &hb->chain);
+	spin_unlock_bh(&hb->chain_lock);
+	read_unlock(&f->lock);
 	inet_frag_lru_add(nf, qp);
 	return qp;
 }
@@ -328,16 +352,22 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
 	__releases(&f->lock)
 {
+	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *q;
 	struct hlist_node *n;
 
-	hlist_for_each_entry(q, n, &f->hash[hash], list) {
+	hb = &f->hash[hash];
+
+	spin_lock_bh(&hb->chain_lock);
+	hlist_for_each_entry(q, n, &hb->chain, list) {
 		if (q->net == nf && f->match(q, key)) {
 			atomic_inc(&q->refcnt);
+			spin_unlock_bh(&hb->chain_lock);
 			read_unlock(&f->lock);
 			return q;
 		}
 	}
+	spin_unlock_bh(&hb->chain_lock);
 	read_unlock(&f->lock);
 
 	return inet_frag_create(nf, f, key);

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line
  2012-11-29 16:10 [net-next PATCH V2 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
                   ` (7 preceding siblings ...)
  2012-11-29 16:15 ` [net-next PATCH V2 8/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
@ 2012-11-29 16:16 ` Jesper Dangaard Brouer
  2012-11-29 16:39   ` [net-next PATCH V2 9/9] net: increase frag queue hash size andcache-line David Laight
  2012-11-29 16:55   ` [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line Eric Dumazet
  8 siblings, 2 replies; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-29 16:16 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
	Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu

Increase frag queue hash size and assure cache-line alignment to
avoid false sharing.  Hash size is set to 256, because I have
observed 206 frag queues in use at 4x10G with packet size 4416 bytes
(three fragments).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h  |    5 ++---
 net/ipv4/inet_fragment.c |    2 +-
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 431f68e..c8ad7e4 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -47,13 +47,12 @@ struct inet_frag_queue {
 	u16			max_size;
 };
 
-#define INETFRAGS_HASHSZ		64
-
+#define INETFRAGS_HASHSZ	256
 
 struct inet_frag_bucket {
 	struct hlist_head	chain;
 	spinlock_t		chain_lock;
-};
+} ____cacheline_aligned_in_smp;
 
 struct inet_frags {
 	struct inet_frag_bucket	hash[INETFRAGS_HASHSZ];
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 59999e6..44c9c75 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -61,7 +61,7 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
 
 				/* Relink to new hash chain. */
 				hb_dest = &f->hash[hval];
-				hlist_add_head(&q->list, &hb->chain);
+				hlist_add_head(&q->list, &hb_dest->chain);
 			}
 		}
 	}

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* RE: [net-next PATCH V2 9/9] net: increase frag queue hash size andcache-line
  2012-11-29 16:16 ` [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line Jesper Dangaard Brouer
@ 2012-11-29 16:39   ` David Laight
  2012-11-29 16:55   ` [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line Eric Dumazet
  1 sibling, 0 replies; 57+ messages in thread
From: David Laight @ 2012-11-29 16:39 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
	Florian Westphal
  Cc: netdev, Pablo Neira Ayuso, Thomas Graf, Cong Wang,
	Patrick McHardy, Paul E. McKenney, Herbert Xu

> Increase frag queue hash size and assure cache-line alignment to
> avoid false sharing.  Hash size is set to 256, because I have
> observed 206 frag queues in use at 4x10G with packet size 4416 bytes
> (three fragments).

Since it is a hash list, won't there always be workloads where
there are hash collisions?
I'm not sure I'm in favour of massively padding structures out
to the size of a cache line.

Both these changes add a moderate amount to the kernel data size
(those people who are worried about not being able to discard
code because hot_plug is always configured really ought to
be worried about the footprint of some of these hash tables).

We run Linux on embedded (small) ppc where there might only
be a handful of TCP connections; these sorts of tables use up
precious memory.

While padding to a cache line might reduce the number of cache
snoops when attacking the code, in a real-life situation I
suspect the effect of making another cache line busy is as
likely to flush out some other important data.
This is similar to the reasons that excessive function inlining
and loop unrolling will speed up benchmarks but slow down real
code.

	David


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line
  2012-11-29 16:16 ` [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line Jesper Dangaard Brouer
  2012-11-29 16:39   ` [net-next PATCH V2 9/9] net: increase frag queue hash size andcache-line David Laight
@ 2012-11-29 16:55   ` Eric Dumazet
  2012-11-29 20:53     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-29 16:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu

On Thu, 2012-11-29 at 17:16 +0100, Jesper Dangaard Brouer wrote:
> Increase frag queue hash size and assure cache-line alignment to
> avoid false sharing.  Hash size is set to 256, because I have
> observed 206 frag queues in use at 4x10G with packet size 4416 bytes
> (three fragments).
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> 
>  include/net/inet_frag.h  |    5 ++---
>  net/ipv4/inet_fragment.c |    2 +-
>  2 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index 431f68e..c8ad7e4 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -47,13 +47,12 @@ struct inet_frag_queue {
>  	u16			max_size;
>  };
>  
> -#define INETFRAGS_HASHSZ		64
> -
> +#define INETFRAGS_HASHSZ	256
>  
>  struct inet_frag_bucket {
>  	struct hlist_head	chain;
>  	spinlock_t		chain_lock;
> -};
> +} ____cacheline_aligned_in_smp;
>  

This is a waste of memory.

Most Linux-powered devices don't care at all about fragments.

Just increase hashsz if you really want, and rely on hash dispersion
to avoid false sharing.

You gave no performance results for this patch anyway.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 5/9] net: frag, per CPU resource, mem limit and LRU list accounting
  2012-11-29 16:13 ` [net-next PATCH V2 5/9] net: frag, per CPU resource, mem limit and LRU list accounting Jesper Dangaard Brouer
@ 2012-11-29 17:06   ` Eric Dumazet
  2012-11-29 17:31     ` David Miller
  2012-12-03 14:02     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 57+ messages in thread
From: Eric Dumazet @ 2012-11-29 17:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu

On Thu, 2012-11-29 at 17:13 +0100, Jesper Dangaard Brouer wrote:
> The major performance bottleneck on NUMA systems, is the mem limit
> counter which is based an atomic counter.  This patch removes the
> cache-bouncing of the atomic counter, by moving this accounting to be
> bound to each CPU.  The LRU list also need to be done per CPU,
> in-order to keep the accounting straight.
> 
> If fragments belonging together is "sprayed" across CPUs, performance
> will still suffer, but due to NIC rxhashing this is not very common.
> Correct accounting in this situation is maintained by recording and
> "assigning" a CPU to a frag queue when its allocated (caused by the
> first packet associated packet).
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> ---
> V2:
>  - Rename struct cpu_resource -> frag_cpu_limit
>  - Move init functions from inet_frag.h to inet_fragment.c
>  - Cleanup per CPU in inet_frags_exit_net()
> 
>  include/net/inet_frag.h                 |   64 +++++++++++++++++++------------
>  net/ipv4/inet_fragment.c                |   50 ++++++++++++++++++------
>  net/ipv4/ip_fragment.c                  |    3 +
>  net/ipv6/netfilter/nf_conntrack_reasm.c |    2 -
>  net/ipv6/reassembly.c                   |    2 -
>  5 files changed, 80 insertions(+), 41 deletions(-)
> 
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index 9bbef17..8421904 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -1,11 +1,22 @@
>  #ifndef __NET_FRAG_H__
>  #define __NET_FRAG_H__
>  
> +#include <linux/spinlock.h>
> +#include <linux/atomic.h>
> +
> +/* Need to maintain these resource limits per CPU, else we will kill
> + * performance due to cache-line bouncing
> + */
> +struct frag_cpu_limit {
> +	atomic_t                mem;
> +	struct list_head        lru_list;
> +	spinlock_t              lru_lock;
> +} ____cacheline_aligned_in_smp;
> +

This looks like a big patch introducing a specific infrastructure, while
we already have lib/percpu_counter.c

Not counting the addition of a NR_CPUS array, which is really
unfortunate these days.
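
For reference, a minimal sketch of what percpu_counter based accounting
could look like here (all names below are illustrative, not taken from
the patchset, and percpu_counter init/destroy at netns setup/teardown is
omitted):

#include <linux/types.h>
#include <linux/percpu_counter.h>

/* Sketch only: frag memory accounting built on lib/percpu_counter.c */
struct frag_mem_sketch {
	struct percpu_counter	mem;	/* replaces the atomic_t mem counter */
	int			high_thresh;
};

static inline void frag_mem_add(struct frag_mem_sketch *s, int bytes)
{
	/* Per-cpu deltas are only folded into the shared counter when
	 * they exceed the batch size, so the hot path rarely touches
	 * the shared cache line.
	 */
	percpu_counter_add(&s->mem, bytes);
}

static inline bool frag_mem_over_limit(struct frag_mem_sketch *s)
{
	/* An approximate read is good enough for a threshold check */
	return percpu_counter_read_positive(&s->mem) > s->high_thresh;
}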

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 8/9] net: frag queue locking per hash bucket
  2012-11-29 16:15 ` [net-next PATCH V2 8/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
@ 2012-11-29 17:08   ` Eric Dumazet
  2012-11-30 12:55     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-29 17:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu

On Thu, 2012-11-29 at 17:15 +0100, Jesper Dangaard Brouer wrote:
> This patch implements per hash bucket locking for the frag queue
> hash.  This removes two write locks, and the only remaining write
> lock is for protecting hash rebuild.  This essentially reduce the
> readers-writer lock to a rebuild lock.

So we still have this huge contention on the reader-writer lock cache
line...

I would just remove it. (And remove hash rebuild, or make it RCU
compatible )

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 5/9] net: frag, per CPU resource, mem limit and LRU list accounting
  2012-11-29 17:06   ` Eric Dumazet
@ 2012-11-29 17:31     ` David Miller
  2012-12-03 14:02     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 57+ messages in thread
From: David Miller @ 2012-11-29 17:31 UTC (permalink / raw)
  To: eric.dumazet
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Nov 2012 09:06:16 -0800

> Not counting the addition of a NR_CPUS array, which is really
> unfortunate these days.

That is addressed in patch #6.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
  2012-11-29 16:12 ` [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock Jesper Dangaard Brouer
@ 2012-11-29 17:43   ` Eric Dumazet
  2012-11-29 17:48     ` David Miller
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-29 17:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu

On Thu, 2012-11-29 at 17:12 +0100, Jesper Dangaard Brouer wrote:
> Updating the fragmentation queues LRU (Least-Recently-Used) list,
> required taking the hash writer lock.  However, the LRU list isn't
> tied to the hash at all, so we can use a separate lock for it.
> 
> This change, in it self, does not improve performance significantly.
> But its part of making the fragmentation code scale.


I would just remove the LRU completely.

All this stuff is 20 years old, we dont need it anymore now we can use
more than 64KB of memory. The design was OK on !SMP

Here is what I would suggest :

Use a scheme with a hash table of 256 (or 1024) slots.

Each slot/bucket has : 
  - Its own spinlock.
  - List of items
  - A limit of 5 (or so) elems in the list.

No more LRU, no more rehash (thanks to jhash and the random seed at boot
or first frag created), no more reader-writer lock.

Use a percpu_counter to implement ipfrag_low_thresh/ipfrag_high_thresh
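
To make that concrete, a rough sketch of such a bucket and its insert
path (everything below is illustrative only -- the names, the
drop-on-full policy and the helper are not from any actual patch, and
per-bucket spin_lock_init() at boot is omitted):

#include <linux/types.h>
#include <linux/list.h>
#include <linux/spinlock.h>

#define FRAG_HASHSZ	256
#define FRAG_BUCKET_MAX	5	/* max frag queues chained per bucket */

struct frag_bucket {
	spinlock_t		lock;
	struct hlist_head	chain;
	int			count;
};

static struct frag_bucket frag_hash[FRAG_HASHSZ];

/* Returns true when the queue was linked in; false means the bucket is
 * already full and the caller simply drops the fragment -- no LRU walk,
 * no global evictor.
 */
static bool frag_bucket_insert(unsigned int hash, struct hlist_node *node)
{
	struct frag_bucket *hb = &frag_hash[hash & (FRAG_HASHSZ - 1)];
	bool added = false;

	spin_lock(&hb->lock);
	if (hb->count < FRAG_BUCKET_MAX) {
		hlist_add_head(node, &hb->chain);
		hb->count++;
		added = true;
	}
	spin_unlock(&hb->lock);
	return added;
}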

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-29 16:11 ` [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues Jesper Dangaard Brouer
@ 2012-11-29 17:44   ` David Miller
  2012-11-29 22:17     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2012-11-29 17:44 UTC (permalink / raw)
  To: brouer
  Cc: eric.dumazet, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Thu, 29 Nov 2012 17:11:09 +0100

> The fragmentation evictor system have a very unfortunate eviction
> system for killing fragment, when the system is put under pressure.
> If packets are coming in too fast, the evictor code kills "warm"
> fragments too quickly.  Resulting in a massive performance drop,
> because we drop frag lists where we have already queue up a lot of
> fragments/work, which gets killed before they have a chance to
> complete.

I think this is a trade-off where the decision is somewhat
arbitrary.

If you kill warm entries, the sending of all of the fragments is
wasted.  If you retain warm entries and drop incoming new fragments,
well then the sending of all of those newer fragments is wasted too.

The only way I could see this making sense is if some "probability
of fulfillment" was taken into account.  For example, if you have
more than half of the fragments already, then yes it may be
advisable to retain the warm entry.

Otherwise, as I said, the decision seems arbitrary.
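
A sketch of what such a fulfillment test could look like on the existing
struct inet_frag_queue, where q->meat is the number of bytes queued so
far and q->len is the total datagram length (only known once the last
fragment has been seen); this is an illustration, not code from the
patchset:

#include <linux/types.h>
#include <net/inet_frag.h>

static bool frag_queue_worth_keeping(const struct inet_frag_queue *q)
{
	/* Before the last fragment arrives the total length is unknown,
	 * so we cannot judge how close to completion the queue is.
	 */
	if (!(q->last_in & INET_FRAG_LAST_IN))
		return false;

	/* Keep the queue when more than half of the datagram is queued */
	return q->meat * 2 >= q->len;
}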

Let's take a step back and think about why this is happening at all.

I wonder how reasonable the high and low thresholds really are.  Even
once you move them to per-cpu, I think the limits are far too small.

I'm under the impression that it's common for skb->truesize for 1500
MTU frames to be something rounded up to the next power of 2, so
2048 bytes, or something like that.  Then add in the sk_buff control
overhead, as well as the inet_frag head.

So a 64K fragmented frame probably consumes close to 100K.
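
(Rough arithmetic behind that figure, with the per-skb overhead treated
as an estimate: a 65507-byte datagram split at 1500 MTU is about 45
fragments; at ~2048 bytes of truesize plus a couple of hundred bytes of
sk_buff/control overhead each, that is roughly 45 * ~2.3 KB ~= 100 KB.)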

So once we have three 64K frames in flight, we're already over the
high threshold and will start dropping things.

That's beyond stupid.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
  2012-11-29 17:43   ` Eric Dumazet
@ 2012-11-29 17:48     ` David Miller
  2012-11-29 17:54       ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2012-11-29 17:48 UTC (permalink / raw)
  To: eric.dumazet
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Nov 2012 09:43:24 -0800

> Use a schem with a hash table of 256 (or 1024) slots.
> 
> Each slot/bucket has : 
>   - Its own spinlock.
>   - List of items
>   - A limit of 5 (or so) elems in the list.
> 
> No more LRU, no more rehash (thanks to jhash and the random seed at boot
> or first frag created), no more reader-writer lock.
> 
> Use a percpu_counter to implement ipfrag_low_thresh/ipfrag_high_thresh

If we limit the chain sizes to 5 elements, there is no need for
any thresholds at all.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
  2012-11-29 17:48     ` David Miller
@ 2012-11-29 17:54       ` Eric Dumazet
  2012-11-29 18:05         ` David Miller
  2012-11-29 22:33         ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 57+ messages in thread
From: Eric Dumazet @ 2012-11-29 17:54 UTC (permalink / raw)
  To: David Miller
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert

On Thu, 2012-11-29 at 12:48 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 29 Nov 2012 09:43:24 -0800
> 
> > Use a schem with a hash table of 256 (or 1024) slots.
> > 
> > Each slot/bucket has : 
> >   - Its own spinlock.
> >   - List of items
> >   - A limit of 5 (or so) elems in the list.
> > 
> > No more LRU, no more rehash (thanks to jhash and the random seed at boot
> > or first frag created), no more reader-writer lock.
> > 
> > Use a percpu_counter to implement ipfrag_low_thresh/ipfrag_high_thresh
> 
> If we limit the chain sizes to 5 elements, there is no need for
> any thresholds at all.

One element can hold about 100KB.

I guess some systems could have some worries if we consume 1024 * 5 *
100 KB

So let's call the threshold a limit ;)

I agree the ipfrag_low_thresh should disappear :

Once we hit the ipfrag_high_thresh (under softirq), we really don't want
to scan the table to perform gc, as it might take more than 10 ms

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
  2012-11-29 17:54       ` Eric Dumazet
@ 2012-11-29 18:05         ` David Miller
  2012-11-29 18:24           ` Eric Dumazet
  2012-11-29 22:33         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 57+ messages in thread
From: David Miller @ 2012-11-29 18:05 UTC (permalink / raw)
  To: eric.dumazet
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Nov 2012 09:54:19 -0800

> One element can hold about 100KB.
> 
> I guess some systems could have some worries if we consume 1024 * 5 *
> 100 KB

That's true.

Replace 1024 in your formula with X and the limit is therefore
controlled by X.

So it seems the high_thresh can be replaced with an appropriate
determination of X to size the hash.

If X is 256, that limits us to ~130MB per cpu.

We could also tweak the chain limit, call it Y, as well.
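
(That is, the memory bound is roughly X buckets * Y entries * ~100 KB
per entry; with X = 256 and Y = 5 that gives 256 * 5 * 100 KB ~= 128 MB.)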

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
  2012-11-29 18:05         ` David Miller
@ 2012-11-29 18:24           ` Eric Dumazet
  2012-11-29 18:31             ` David Miller
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-29 18:24 UTC (permalink / raw)
  To: David Miller
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert

On Thu, 2012-11-29 at 13:05 -0500, David Miller wrote:

> Replace 1024 in your formula with X and the limit is therefore
> controlled by X.
> 
> So it seems the high_thresh can be replaced with an appropriate
> determination of X to size the hash.
> 
> If X is 256, that limits us to ~130MB per cpu.
> 

per host, as the table would be shared by all cpus.

Let's say the default mem limit would be 4MB; I believe the
percpu_counter is cheap enough that we can still allow a 1024-bucket table,
and allow receiving full-size IP packets (if not under a frag attack).

(If we divided 4MB by 64, we would have a 64KB limit per bucket, which
is too small because of truesize overhead)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
  2012-11-29 18:24           ` Eric Dumazet
@ 2012-11-29 18:31             ` David Miller
  2012-11-29 18:33               ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: David Miller @ 2012-11-29 18:31 UTC (permalink / raw)
  To: eric.dumazet
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Nov 2012 10:24:52 -0800

> On Thu, 2012-11-29 at 13:05 -0500, David Miller wrote:
> 
>> Replace 1024 in your formula with X and the limit is therefore
>> controlled by X.
>> 
>> So it seems the high_thresh can be replaced with an appropriate
>> determination of X to size the hash.
>> 
>> If X is 256, that limits us to ~130MB per cpu.
>> 
> 
> per host, as the table would be shared by all cpus.

I think a per-cpu hash might make more sense.

This would scale our limits to the size of the system.

I'm willing to be convinced otherwise, but it seems the most
sensible thing to do.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
  2012-11-29 18:31             ` David Miller
@ 2012-11-29 18:33               ` Eric Dumazet
  2012-11-29 18:36                 ` David Miller
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-29 18:33 UTC (permalink / raw)
  To: David Miller
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert

On Thu, 2012-11-29 at 13:31 -0500, David Miller wrote:

> I think a per-cpu hash might make more sense.
> 
> This would scale our limits to the size of the system.
> 
> I'm willing to be convinced otherwise, but it seems the most
> sensible thing to do.

It would break in many cases, when frags are spread across different cpus.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
  2012-11-29 18:33               ` Eric Dumazet
@ 2012-11-29 18:36                 ` David Miller
  0 siblings, 0 replies; 57+ messages in thread
From: David Miller @ 2012-11-29 18:36 UTC (permalink / raw)
  To: eric.dumazet
  Cc: brouer, fw, netdev, pablo, tgraf, amwang, kaber, paulmck, herbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Nov 2012 10:33:13 -0800

> On Thu, 2012-11-29 at 13:31 -0500, David Miller wrote:
> 
>> I think a per-cpu hash might make more sense.
>> 
>> This would scale our limits to the size of the system.
>> 
>> I'm willing to be convinced otherwise, but it seems the most
>> sensible thing to do.
> 
> It would break in many cases, when frags are spreaded on different cpus.

Indeed, ignore my stupid idea.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line
  2012-11-29 16:55   ` [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line Eric Dumazet
@ 2012-11-29 20:53     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-29 20:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu, David Laight

On Thu, 2012-11-29 at 08:55 -0800, Eric Dumazet wrote:
> On Thu, 2012-11-29 at 17:16 +0100, Jesper Dangaard Brouer wrote:
> > Increase frag queue hash size and assure cache-line alignment to
> > avoid false sharing.  Hash size is set to 256, because I have
> > observed 206 frag queues in use at 4x10G with packet size 4416 bytes
> > (three fragments).
> > 
[...]
> >  struct inet_frag_bucket {
> >  	struct hlist_head	chain;
> >  	spinlock_t		chain_lock;
> > -};
> > +} ____cacheline_aligned_in_smp;
> >  
> 
> This is a waste of memory.

Do keep in mind this is only 16 Kbytes (256 * 64 bytes = 16384 bytes).


> Most linux powered devices dont care at all about fragments.
> 
> Just increase hashsz if you really want, and rely on hash dispersion
> to avoid false sharing.

I must agree that it is perhaps a better use of the memory to just
increase the hashsz (and drop ____cacheline_aligned_in_smp), especially
given the measured performance gain.

> You gave no performance results for this patch anyway.

Yes, I did! -- See cover-mail patch 08 vs 09.
But the gain is really too small to argue for this cache alignment.


Patch-08:
  2x10G size(4416)  result:(5024+4925)= 9949 Mbit/s
                 V2 result:(5140+5206)=10346 Mbit/s

  4x10G size(4416)  result:(4156+4714+4300+3985)=17155 Mbit/s
                 V2 result:(4341+4607+3963+4450)=17361 Mbit/s
                       (gen:6614+5330+7745+5366 =25055 Mbit/s)

Patch-09:
  2x10G size(4416)  result:(5421+5268)=10689 Mbit/s
                 V2 result:(5377+5336)=10713 Mbit/s

  4x10G size(4416) result:(4890+4364+4139+4530)=17923 Mbit/s
                V2 result:(3860+4533+4936+4519)=17848 Mbit/s
                      (gen:5170+6873+5215+7632 =24890 Mbit/s)
  
Improvements Patch 08 -> 09:

 2x10G size(4416):
   RunV1 (10689-9949) =740 Mbit/s
   RunV2 (10713-10346)=367 Mbit/s

 4x10G size(4416):
   RunV1 (17923-17155)=768 Mbit/s
   RunV2 (17848-17361)=487 Mbit/s

It's consistently better performance, but given the magnitude of the
other improvements, I don't want to argue over "wasting" 16 Kbytes of
kernel memory.

I have some debug patches for dumping the contents of the hash, which
show that at 4x10G with packet size 4416 (three frags) and 206 frag
queues, cross-CPU collisions occur anyhow.

Lets focus on the other patches instead.

--Jesper

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-29 17:44   ` David Miller
@ 2012-11-29 22:17     ` Jesper Dangaard Brouer
  2012-11-29 23:01       ` Eric Dumazet
                         ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-29 22:17 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Thu, 2012-11-29 at 12:44 -0500, David Miller wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Thu, 29 Nov 2012 17:11:09 +0100
> 
> > The fragmentation evictor system have a very unfortunate eviction
> > system for killing fragment, when the system is put under pressure.
> > If packets are coming in too fast, the evictor code kills "warm"
> > fragments too quickly.  Resulting in a massive performance drop,
> > because we drop frag lists where we have already queue up a lot of
> > fragments/work, which gets killed before they have a chance to
> > complete.
> 
> I think this is a trade-off where the decision is somewhat
> arbitrary.
> 
> If you kill warm entries, the sending of all of the fragments is
> wasted.  If you retain warm entries and drop incoming new fragments,
> well then the sending of all of those newer fragments is wasted too.
> 
> The only way I could see this making sense is if some "probability
> of fulfillment" was taken into account.  For example, if you have
> more than half of the fragments already, then yes it may be
> advisable to retain the warm entry.

I disagree; once we have spent energy on allocating a frag queue and
adding even just a single packet, we must not drop it while the entry is
warm.

I believe we need the concept of "warm" entries.  Without it your scheme
is dangerous, as we would retain entries with missed/dropped fragments
too long.

> Otherwise, as I said, the decision seems arbitrary.

This patch/system actually includes a "promise/probability of
fulfillment". Let me explain.

We allow "warn" entries to complete, by allowing (new) fragments/packets
for these entries (present in the frag queue).  While we don't allow the
system to create new entries.  This creates the selection we interested
in (as we must drop some packets given the arrival rate bigger than the
processing rate).

(This might be hard to follow.)
While blocking new frag queue entries, we are dropping packets.  When
fragments complete, the "window" opens up again.  Creating a frag queue
entry in this situation might be for a datagram which has already
lost some of its fragments, thus wasting resources.  BUT it's not such a big
problem, because our evictor system will quickly kill this fragment
queue, due to the LRU list (and the fq getting "cold").
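
For reference, the "warm" test itself, as seen in the evictor loop
earlier in this thread, is just a timestamp comparison on the queue's
creation time:

		/* A queue created within the current jiffy is "warm";
		 * the LRU list is ordered, so everything behind it is
		 * at least as warm -- stop evicting here.
		 */
		if (!force && q->creation_ts == (u32) jiffies) {
			spin_unlock(&percpu->lru_lock);
			break;
		}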

I have thought of keeping track of the IDs of dropped fragments,
but maintaining such a list is a scalability problem of its own.  And I'm
not sure how effective it's going to be (as we already get some of the
effect "automatically", as described above).


> Let's take a step back and think about why this is happening at all.
> 
> I wonder how reasonable the high and low thresholds really are.  Even
> once you move them to per-cpu, I think the limits are far too small.

It is VERY important that you understand/realize that throwing more
memory at the problem is NOT the solution.  It is a classical queueing-theory
situation where the arrival rate is bigger than the processing
capabilities.  Thus, we must drop packets, or else the NIC will do it for
us... for fragments we need to do this "dropping" more intelligently.

For example lets give a threshold of 2000 MBytes:

[root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*2000)))
net.ipv4.ipfrag_high_thresh = 2097152000

[root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*2000)-655350))
net.ipv4.ipfrag_low_thresh = 2096496650

4x10 Netperf adjusted output:
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

 229376   65507   20.00      298685      0    7826.35
 212992           20.00          27              0.71

 229376   65507   20.00      366668      0    9607.71
 212992           20.00          13              0.34

 229376   65507   20.00      254790      0    6676.20
 212992           20.00          14              0.37

 229376   65507   20.00      309293      0    8104.33
 212992           20.00          15              0.39

Can we agree that the current evictor strategy is broken?


> I'm under the impression that it's common for skb->truesize for 1500
> MTU frames to be something rounded up to the next power of 2, so
> 2048 bytes, or something like that.  Then add in the sk_buff control
> overhead, as well as the inet_frag head.

(Plus, the size of the frag queue struct, which is also being accounted
for).

> So a 64K fragmented frame probably consumes close to 100K.
> 
> So once we have three 64K frames in flight, we're already over the
> high threshold and will start dropping things.
> 
> That's beyond stupid.

Yes, I would say the interval between ipfrag_high_thresh and
ipfrag_low_thresh seems quite a stupid default.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock
  2012-11-29 17:54       ` Eric Dumazet
  2012-11-29 18:05         ` David Miller
@ 2012-11-29 22:33         ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-29 22:33 UTC (permalink / raw)
  To: Eric Dumazet, David Laight
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Thu, 2012-11-29 at 09:54 -0800, Eric Dumazet wrote:
> On Thu, 2012-11-29 at 12:48 -0500, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Thu, 29 Nov 2012 09:43:24 -0800
> > 
> > > Use a schem with a hash table of 256 (or 1024) slots.
> > > 
> > > Each slot/bucket has : 
> > >   - Its own spinlock.
> > >   - List of items
> > >   - A limit of 5 (or so) elems in the list.
> > > 
> > > No more LRU, no more rehash (thanks to jhash and the random seed at boot
> > > or first frag created), no more reader-writer lock.
> > > 
> > > Use a percpu_counter to implement ipfrag_low_thresh/ipfrag_high_thresh
> > 
> > If we limit the chain sizes to 5 elements, there is no need for
> > any thresholds at all.
> 
> One element can hold about 100KB.
> 
> I guess some systems could have some worries if we consume 1024 * 5 *
> 100 KB

1024 * 5 * 100k = 512 MB  -- That's just crazy!
I guess the embedded guys are going to choke reading this!

Look at what I have achieved with 256KBytes per CPU...

--Jesper

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-29 22:17     ` Jesper Dangaard Brouer
@ 2012-11-29 23:01       ` Eric Dumazet
  2012-11-30 10:04         ` Jesper Dangaard Brouer
  2012-11-29 23:32       ` [net-next PATCH V2 1/9] " Eric Dumazet
  2012-11-30 12:01       ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-29 23:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Thu, 2012-11-29 at 23:17 +0100, Jesper Dangaard Brouer wrote:

> For example lets give a threshold of 2000 MBytes:
> 
> [root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*2000)))
> net.ipv4.ipfrag_high_thresh = 2097152000
> 
> [root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*2000)-655350))
> net.ipv4.ipfrag_low_thresh = 2096496650
> 
> 4x10 Netperf adjusted output:
>  Socket  Message  Elapsed      Messages
>  Size    Size     Time         Okay Errors   Throughput
>  bytes   bytes    secs            #      #   10^6bits/sec
> 
>  229376   65507   20.00      298685      0    7826.35
>  212992           20.00          27              0.71
> 
>  229376   65507   20.00      366668      0    9607.71
>  212992           20.00          13              0.34
> 
>  229376   65507   20.00      254790      0    6676.20
>  212992           20.00          14              0.37
> 
>  229376   65507   20.00      309293      0    8104.33
>  212992           20.00          15              0.39
> 
> Can we agree that the current evictor strategy is broken?

Not really, you drop packets because of another limit.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-29 22:17     ` Jesper Dangaard Brouer
  2012-11-29 23:01       ` Eric Dumazet
@ 2012-11-29 23:32       ` Eric Dumazet
  2012-11-30 12:01       ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 57+ messages in thread
From: Eric Dumazet @ 2012-11-29 23:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Thu, 2012-11-29 at 23:17 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 2012-11-29 at 12:44 -0500, David Miller wrote:

> It is VERY important that you understand/realize that throwing more
> memory at the problem is NOT the solution.  It a classical queue theory
> situation where the arrival rate is bigger than the processing
> capabilities. Thus, we must drop packets, or else the NIC will do it for
> us... for fragments we need do this "dropping" more intelligent.

That's the typical head/tail drop choice. There is no bad/good choice.

Implementing head drop or tail drop is not a matter of only dealing with
regular traffic. We also can face DOS attacks, with packets of different
sizes and have to choose a compromise.

For example, it seems we have no protection against an attack using
small frags (but big truesize). So a particular 'packet' could consume
all the truesize we allowed for the whole frag queue.

skb_try_coalesce() can't always deal with this kind of thing.

Basically, that's why using frags in the first place is a bad choice.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-29 23:01       ` Eric Dumazet
@ 2012-11-30 10:04         ` Jesper Dangaard Brouer
  2012-11-30 14:52           ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-30 10:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Thu, 2012-11-29 at 15:01 -0800, Eric Dumazet wrote:
> On Thu, 2012-11-29 at 23:17 +0100, Jesper Dangaard Brouer wrote:
> 
> > For example lets give a threshold of 2000 MBytes:
> > 
> > [root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*2000)))
> > net.ipv4.ipfrag_high_thresh = 2097152000
> > 
> > [root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*2000)-655350))
> > net.ipv4.ipfrag_low_thresh = 2096496650
> > 
> > 4x10 Netperf adjusted output:
> >  Socket  Message  Elapsed      Messages
> >  Size    Size     Time         Okay Errors   Throughput
> >  bytes   bytes    secs            #      #   10^6bits/sec
> > 
> >  229376   65507   20.00      298685      0    7826.35
> >  212992           20.00          27              0.71
> > 
> >  229376   65507   20.00      366668      0    9607.71
> >  212992           20.00          13              0.34
> > 
> >  229376   65507   20.00      254790      0    6676.20
> >  212992           20.00          14              0.37
> > 
> >  229376   65507   20.00      309293      0    8104.33
> >  212992           20.00          15              0.39
> > 
> > Can we agree that the current evictor strategy is broken?
> 
> Not really, you drop packets because of another limit.

Then tell me which limit?
And notice the result is the same for a 200 MBytes threshold.

As I wrote *just* above the section you quoted:

On Thu, 2012-11-29 at 23:17 +0100, Jesper Dangaard Brouer wrote:
[...] Thus, we must drop packets, or else the NIC will do it for
> us... for fragments we need do this "dropping" more intelligent. 

So, I think it is the NIC dropping packets, in this case... what do you
claim?



I still claim that the current evictor strategy is broken!

We need to drop fragments more intelligently in software.  As DaveM
correctly states, the code/algorithm needs some "probability
of fulfillment" taken into account, which is actually what my evictor
code implements (I don't claim it's perfect, as it currently does have
fairness/fair-queue issues; I have a plan for fixing them, but let's not
clutter up this answer).


So, let me instead show, with tests, that the evictor strategy is
broken, while keeping the original default thresh settings:

# grep . /proc/sys/net/ipv4/ipfrag_*_thresh
/proc/sys/net/ipv4/ipfrag_high_thresh:262144
/proc/sys/net/ipv4/ipfrag_low_thresh:196608

Test purpose: I will demonstrate, on a single 10G link, that starting
several ("N") netperf UDP fragmentation flows hurts performance, and
then argue that this is caused by the bad evictor strategy.

Test setup:
 - Disable Ethernet flow control
 - netperf packet size 65507
 - Run netserver on one NUMA node
 - Start netperf clients against a NIC on the other NUMA node
 - (The NUMA imbalance helps the effect occur at lower N) 

Result: N=1  8040 Mbit/s
Result: N=2  9584 Mbit/s (4739+4845)
Result: N=3  4055 Mbit/s (1436+1371+1248)
Result: N=4  2247 Mbit/s (1538+29+54+626)
Result: N=5   879 Mbit/s (78+152+226+125+298)
Result: N=6   293 Mbit/s (85+55+32+57+46+18)
Result: N=7   354 Mbit/s (70+47+33+80+20+72+32)

Can we, now, agree that the current evictor strategy is broken?!?


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-29 22:17     ` Jesper Dangaard Brouer
  2012-11-29 23:01       ` Eric Dumazet
  2012-11-29 23:32       ` [net-next PATCH V2 1/9] " Eric Dumazet
@ 2012-11-30 12:01       ` Jesper Dangaard Brouer
  2012-11-30 14:57         ` Eric Dumazet
  2 siblings, 1 reply; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-30 12:01 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert, David Laight

On Thu, 2012-11-29 at 23:17 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 2012-11-29 at 12:44 -0500, David Miller wrote:
> > 
> > The only way I could see this making sense is if some "probability
> > of fulfillment" was taken into account.  
[...]

> This patch/system actually includes a "promise/probability of
> fulfillment". Let me explain.
> 
> We allow "warn" entries to complete, by allowing (new) fragments/packets
> for these entries (present in the frag queue).  While we don't allow the
> system to create new entries.  This creates the selection we interested
> in (as we must drop some packets given the arrival rate bigger than the
> processing rate).

To help reviewers understand the implications of allowing existing frag
queues to complete/finish.

Let me explain the memory implications:

Remember we only allow (by default) 256 KB of memory to be used, (now) per
CPU, for fragments (raw memory usage, skb->truesize).

 Hint: I violate this!!! -- the embedded lynch mob is gathering support

The existing entries in the frag queues are still being allowed
packets through (even when the memory limit is exceeded).  In the
worst case, as DaveM explained, this can be as much as 100 KBytes per
entry (for 64K fragments).

The highest number of frag queue hash entries I have seen is 308, at
4x10G with packet size 2944 (two fragments).  (It is of course
unrealistic to get this high with 64K frames, due to the link bandwidth
limit; I would approximate a max of 77 entries at 4x10G.)

Now I'm teasing the embedded lynch mob.
Worst case memory usage:

 308 * 100KBytes = 30.8 MBytes (not per CPU, total)

Now the embedded lynch mob is banging at my door, yelling that I'm
opening a remote OOM DoS attack on their small memory boxes.
I'll calm them down, by explaining why we cannot reach this number.

The "warm" fragment code is making sure, this does not get out of hand.
An entry is considered "warn" for only one jiffie (1 HZ), which on
1000HZ systems is 1 ms (and 100HZ = 10 ms). (after-which the fragment
queue is freed)

How much data can we send in 1 ms at 10000 Mbit/s:
  10000 Mbit/s / 8 bits-per-byte * 0.001 sec = 1.25 MBytes

And having 4x10G can result in 5 MBytes (and the raw mem usage
skb->truesize is going to get it a bit higher).

Now, the embedded lynch mob is trying to find a 4x 10Gbit/s embedded system
with less than 10MBytes of memory... they give up and go home.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 8/9] net: frag queue locking per hash bucket
  2012-11-29 17:08   ` Eric Dumazet
@ 2012-11-30 12:55     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-30 12:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu

On Thu, 2012-11-29 at 09:08 -0800, Eric Dumazet wrote:

> So we still have this huge contention on the reader-writer lock cache
> line...
> I would just remove it. (And remove hash rebuild, or make it RCU
> compatible )

I vote for removing the rebuild feature.  It really doesn't make sense
to protect a hash using 256 KB of memory, with very short-lived elements,
against hash collision attacks every 10 minutes.  (Especially as we saw
a 7 Gbit/s performance improvement.)

Making the hash RCU "compatible" is not going to work, because the
elements are too short-lived, and thus we cannot free them fast enough
(call_rcu stuff).  (I actually did the RCU implementation, only to
realize this...)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 10:04         ` Jesper Dangaard Brouer
@ 2012-11-30 14:52           ` Eric Dumazet
  2012-11-30 15:45             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-30 14:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Fri, 2012-11-30 at 11:04 +0100, Jesper Dangaard Brouer wrote:
> So, let me instead show, with tests, that the evictor strategy is
> broken, while keeping the original default thresh settings:
> 
> # grep . /proc/sys/net/ipv4/ipfrag_*_thresh
> /proc/sys/net/ipv4/ipfrag_high_thresh:262144
> /proc/sys/net/ipv4/ipfrag_low_thresh:196608
> 
> Test purpose, I will on a single 10G link demonstrate, that starting
> several "N" netperf UDP fragmentation flows, will hurt performance, and
> then claim this is caused by the bad evictor strategy.
> 
> Test setup:
>  - Disable Ethernet flow control
>  - netperf packet size 65507
>  - Run netserver on one NUMA node
>  - Start netperf clients against a NIC on the other NUMA node
>  - (The NUMA imbalance helps the effect occur at lower N) 
> 
> Result: N=1  8040 Mbit/s
> Result: N=2  9584 Mbit/s (4739+4845)
> Result: N=3  4055 Mbit/s (1436+1371+1248)
> Result: N=4  2247 Mbit/s (1538+29+54+626)
> Result: N=5   879 Mbit/s (78+152+226+125+298)
> Result: N=6   293 Mbit/s (85+55+32+57+46+18)
> Result: N=7   354 Mbit/s (70+47+33+80+20+72+32)
> 
> Can we, now, agree that the current evictor strategy is broken?!?

Your setup is broken for sure. I don't know how you expect that many
datagrams to be correctly reassembled with ipfrag_high_thresh=262144.

No matter what strategy is implemented, an attacker knows it and can send
frags so that the regular workload is denied. The kernel can't decide which
packets are more likely to be completed.

BTW, install fq_codel at the sender side, so that frags are nicely
interleaved, because on real networks the frags of a UDP datagram rarely
arrive at the destination in a single train with no alien packets inside
the train.

You focus on a particular lab setup and a particular workload; you
should consider that if an application _really_ depends on frags, then
the receiver is likely to be a target for various frag attacks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 12:01       ` Jesper Dangaard Brouer
@ 2012-11-30 14:57         ` Eric Dumazet
  0 siblings, 0 replies; 57+ messages in thread
From: Eric Dumazet @ 2012-11-30 14:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert, David Laight

On Fri, 2012-11-30 at 13:01 +0100, Jesper Dangaard Brouer wrote:

> As the existing entries in the frag queues, are still being allowed
> packets through (even when the memory limit is exceeded).   In
> worst-case, as DaveM explained, this can be as much as 100KBytes per
> entry (for 64K fragments).

Some NICs use a plain page to hold an Ethernet frame.

Consider a UDP packet using 512-byte frags: the resulting packet after
reassembly can use 128 pages (64 KB / 512 = 128 fragments, each pinning
a 4 KB page).  That's 512 KB of memory on x86.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 14:52           ` Eric Dumazet
@ 2012-11-30 15:45             ` Jesper Dangaard Brouer
  2012-11-30 16:37               ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-30 15:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Fri, 2012-11-30 at 06:52 -0800, Eric Dumazet wrote:
> On Fri, 2012-11-30 at 11:04 +0100, Jesper Dangaard Brouer wrote:
> > So, let me instead show, with tests, that the evictor strategy is
> > broken, while keeping the original default thresh settings:
> > 
> > # grep . /proc/sys/net/ipv4/ipfrag_*_thresh
> > /proc/sys/net/ipv4/ipfrag_high_thresh:262144
> > /proc/sys/net/ipv4/ipfrag_low_thresh:196608
> > 
> > Test purpose, I will on a single 10G link demonstrate, that starting
> > several "N" netperf UDP fragmentation flows, will hurt performance, and
> > then claim this is caused by the bad evictor strategy.
> > 
> > Test setup:
> >  - Disable Ethernet flow control
> >  - netperf packet size 65507
> >  - Run netserver on one NUMA node
> >  - Start netperf clients against a NIC on the other NUMA node
> >  - (The NUMA imbalance helps the effect occur at lower N) 
> > 
> > Result: N=1  8040 Mbit/s
> > Result: N=2  9584 Mbit/s (4739+4845)
> > Result: N=3  4055 Mbit/s (1436+1371+1248)
> > Result: N=4  2247 Mbit/s (1538+29+54+626)
> > Result: N=5   879 Mbit/s (78+152+226+125+298)
> > Result: N=6   293 Mbit/s (85+55+32+57+46+18)
> > Result: N=7   354 Mbit/s (70+47+33+80+20+72+32)
> > 
> > Can we, now, agree that the current evictor strategy is broken?!?
> 
> Your setup is broken for sure. 

No, it's not.

> I dont know how you expect that many
> datagrams being correctly reassembled with ipfrag_high_thresh=262144 

That's my point... I'm showing that it's not possible with our current
implementation!


> No matter strategy is implemented, an attacker knows it and can send
> frags so that regular workload is denied. Kernel cant decide which
> packets are more likely to be completed.

Our current evictor implementation will allow an attacker to kill ALL
frag traffic to the machine (and cause high CPU load, with CPUs
spinning).
My implementation will guarantee that some fragments are allowed to
finish and complete -- a far better choice than our current situation.


> BTW, install fq_codel at the sender side, so that frags are nicely
> interleaved. Because on real networks, frags of an UDP datagram rarely
> come to destination in a single train with no alien packets inside the
> train.

You are arguing in my favor.  I have taken great care that my test will
interleave UDP datagrams (by CPU pinning the netperf clients on the sender
side).  If I just naively start all netperf clients on one CPU on the
sender, then I will have packet trains... and the result will be:
 Packet trains result N=5  8884 Mbit/s (1775+1775+1790+1789+1755)
That test would be "broken for sure".
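
For concreteness, the pinning described above amounts to starting each
generator with a different netperf CPU binding; a sketch, with the host
and CPU ids as placeholders:

  # one netperf client per sender CPU, so their fragments interleave
  netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 &
  netperf -H 192.168.51.2 -T2,2 -t UDP_STREAM -l 20 &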

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 15:45             ` Jesper Dangaard Brouer
@ 2012-11-30 16:37               ` Eric Dumazet
  2012-11-30 21:37                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-30 16:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Fri, 2012-11-30 at 16:45 +0100, Jesper Dangaard Brouer wrote:
> On Fri, 2012-11-30 at 06:52 -0800, Eric Dumazet wrote:

> 
> > I dont know how you expect that many
> > datagrams being correctly reassembled with ipfrag_high_thresh=262144 
> 
> That's my point... I'm showing that its not possible, with out current
> implementation!

What I was saying is that the limits are too small, and we should
increase them for this particular need.

This has little to do with the underlying algo.

Assuming we have a hash table size of 1024 buckets, you could
easily add the following:

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 448e685..bc1bdf9 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -849,13 +849,13 @@ static inline void ip4_frags_ctl_register(void)
 static int __net_init ipv4_frags_init_net(struct net *net)
 {
 	/*
-	 * Fragment cache limits. We will commit 256K at one time. Should we
-	 * cross that limit we will prune down to 192K. This should cope with
+	 * Fragment cache limits. We will commit 4M at one time. Should we
+	 * cross that limit we will prune down to 3M. This should cope with
 	 * even the most extreme cases without allowing an attacker to
 	 * measurably harm machine performance.
 	 */
-	net->ipv4.frags.high_thresh = 256 * 1024;
-	net->ipv4.frags.low_thresh = 192 * 1024;
+	net->ipv4.frags.high_thresh = 4 << 20;
+	net->ipv4.frags.low_thresh = 3 << 20;
 	/*
 	 * Important NOTE! Fragment queue must be destroyed before MSL expires.
 	 * RFC791 is wrong proposing to prolongate timer each fragment arrival

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 16:37               ` Eric Dumazet
@ 2012-11-30 21:37                 ` Jesper Dangaard Brouer
  2012-11-30 22:25                   ` Eric Dumazet
  0 siblings, 1 reply; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-30 21:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Fri, 2012-11-30 at 08:37 -0800, Eric Dumazet wrote:
> On Fri, 2012-11-30 at 16:45 +0100, Jesper Dangaard Brouer wrote:
> > On Fri, 2012-11-30 at 06:52 -0800, Eric Dumazet wrote:
> 
> > 
> > > I dont know how you expect that many
> > > datagrams being correctly reassembled with ipfrag_high_thresh=262144 
> > 
> > That's my point... I'm showing that its not possible, with out current
> > implementation!
> 
> What I was saying is that the limits are too small, and we should
> increase them for this particular need.
> 
> This has little to do with the underlying algo.

Actual data is an engineer's best friend.

[root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$((4<<20))
net.ipv4.ipfrag_high_thresh = 4194304
[root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$((3<<20))
net.ipv4.ipfrag_low_thresh = 3145728


[jbrouer@firesoul ~]$ netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 &\
 netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l 20
[1] 18573
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.51.2 (192.168.51.2) port 0 AF_INET : cpu bind
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.2 (192.168.31.2) port 0 AF_INET : cpu bind
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   65507   20.00      363315      0    9519.86
212992           20.00        7297            191.20

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

229376   65507   20.00      366927      0    9614.48
212992           20.00       10437            273.48


This test is 2x10G with straight NUMA nodes (meaning optimal NUMA
allocation, where the incoming netperf packets are received by the kernel
and delivered to the netserver on the same NUMA node).


Come on Eric, you are smarter than this.  When will you realize that
dropping partly completed fragment queues is bad for performance? (And
thus a bad algorithmic choice in the evictor.)


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 21:37                 ` Jesper Dangaard Brouer
@ 2012-11-30 22:25                   ` Eric Dumazet
  2012-11-30 23:23                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-30 22:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Fri, 2012-11-30 at 22:37 +0100, Jesper Dangaard Brouer wrote:

> 
> Come on Eric, you are smart than this.  When will you realize, that
> dropping partly completed fragment queue are bad for performance? (And
> thus a bad algorithmic choice in the evictor)

Sorry I must be dumb, so I'll stop commenting.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 22:25                   ` Eric Dumazet
@ 2012-11-30 23:23                     ` Jesper Dangaard Brouer
  2012-11-30 23:47                       ` Stephen Hemminger
  2012-11-30 23:58                       ` Eric Dumazet
  0 siblings, 2 replies; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-30 23:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Fri, 2012-11-30 at 14:25 -0800, Eric Dumazet wrote:
> On Fri, 2012-11-30 at 22:37 +0100, Jesper Dangaard Brouer wrote:
>
> > Come on Eric, you are smart than this.  When will you realize, that
> > dropping partly completed fragment queue are bad for performance? (And
> > thus a bad algorithmic choice in the evictor)
> 
> Sorry I must be dumb, so I'll stop commenting.

Come on Eric, you are one of the smartest and most enlightened persons I
know.

I'm just a little puzzled (and perhaps annoyed) that you don't agree
that the evictor code is a problem, given the tests I have provided and
the discussion we have had.

On this mailing list we challenge and give each other a hard time on the
technical side, as it should be.  This is nothing personal -- I don't
take it personally; I just believe this patch is important and makes a
difference.


I want us to discuss the evictor code as such, not try to come up
with workarounds that avoid the evictor code.

The dropping choice in the evictor code is not sound.

We are dealing with reassembling fragments.  If a single fragment is lost,
the complete datagram is lost.  The evictor code will kill off one or
several frag queues, knowing that this invalidates the remaining
fragments of those datagrams.  Under high load, the LRU list has no effect
and cannot guide the drop choice.  The result is dropping on an "even"/fair
basis, which will basically kill all fragments, letting none complete.
Just as my tests indicate, it severely affects performance, with nearly no
throughput as a result.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 23:23                     ` Jesper Dangaard Brouer
@ 2012-11-30 23:47                       ` Stephen Hemminger
  2012-12-01  0:03                         ` Eric Dumazet
  2012-11-30 23:58                       ` Eric Dumazet
  1 sibling, 1 reply; 57+ messages in thread
From: Stephen Hemminger @ 2012-11-30 23:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David Miller, fw, netdev, pablo, tgraf, amwang,
	kaber, paulmck, herbert

My $.02 is that this would be a good place to introduce lock-free auto
resizing hash lists that are in the userspace RCU library.

It would be a non-trivial effort to put this in the kernel, but
it would let the table grow transparently.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 23:23                     ` Jesper Dangaard Brouer
  2012-11-30 23:47                       ` Stephen Hemminger
@ 2012-11-30 23:58                       ` Eric Dumazet
  2012-12-04 13:30                         ` [net-next PATCH V3-evictor] " Jesper Dangaard Brouer
  1 sibling, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-11-30 23:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

On Sat, 2012-12-01 at 00:23 +0100, Jesper Dangaard Brouer wrote:


> I'm just a little puzzled (and perhaps annoyed) that you don't agree
> that the evictor code is a problem, given the tests I have provided and
> the discussion we have had.
> 
> On this mailing list we challenge and give each other a hard time on the
> technical side, as it should be.  This is nothing personal -- I don't
> take it personal, I just believe this patch is important and makes a
> difference.
> 
> 
> I want us to discuss the evictor code as such.  Not trying to come up
> with, workarounds avoiding the evictor code.
> 
> The dropping choice in the evictor code is not sound.
> 
> We are dealing with assembling fragments.  If a single fragment is lost,
> the complete fragment is lost.  The evictor code, will kill off one or
> several fragments, knowing that this will invalidate the remaining
> fragments.  Under high load, the LRU list has no effect, and cannot
> guide the drop choice.  The result is dropping on an "even"/fair basis,
> which will basically kill all fragments, letting none complete.  Just as
> my tests indicate, it severely affects performance with nearly no
> throughput as a result.

Give me an alternative, I'll tell you how an attacker can hurt you,
knowing the strategy you use.

Keeping around old frags is not good. After a burst of frags, you'll be
unable to recover until they are purged.

Purging old frags is the most natural way to evict incomplete messages.

(That works if your mem limits are high enough to absorb the expected
workload plus a fair amount of extra space -- but your results are biased
by wrong thresholds.)

Or else, an attacker only has to send incomplete messages, and your host
will fill its table and refuse your messages.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-11-30 23:47                       ` Stephen Hemminger
@ 2012-12-01  0:03                         ` Eric Dumazet
  2012-12-01  0:13                           ` Stephen Hemminger
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-12-01  0:03 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jesper Dangaard Brouer, David Miller, fw, netdev, pablo, tgraf,
	amwang, kaber, paulmck, herbert

On Fri, 2012-11-30 at 15:47 -0800, Stephen Hemminger wrote:
> My $.02 is that this would be a good place to introduce lock-free auto
> resizing hash lists that are in the userspace RCU library.
> 
> It would be a non-trivial effort to put this in the kernel, but
> it would let the table grow transparently.

Yes, but no ;)

The current hash is 64 slots; it's not like anybody wants it to be one
million slots.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues
  2012-12-01  0:03                         ` Eric Dumazet
@ 2012-12-01  0:13                           ` Stephen Hemminger
  0 siblings, 0 replies; 57+ messages in thread
From: Stephen Hemminger @ 2012-12-01  0:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, David Miller, fw, netdev, pablo, tgraf,
	amwang, kaber, paulmck, herbert

On Fri, 30 Nov 2012 16:03:29 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Fri, 2012-11-30 at 15:47 -0800, Stephen Hemminger wrote:
> > My $.02 is that this would be a good place to introduce lock-free auto
> > resizing hash lists that are in the userspace RCU library.
> > 
> > It would be a non-trivial effort to put this in the kernel, but
> > it would let the table grow transparently.
> 
> Yes, but no ;)
> 
> The current hash is 64 slots, its not like anybody wants it to be one
> million slots.
> 
> 
> 

userspace rcu lfhash has min/max values.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 5/9] net: frag, per CPU resource, mem limit and LRU list accounting
  2012-11-29 17:06   ` Eric Dumazet
  2012-11-29 17:31     ` David Miller
@ 2012-12-03 14:02     ` Jesper Dangaard Brouer
  2012-12-03 17:25       ` David Miller
  1 sibling, 1 reply; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-12-03 14:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
	Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
	Herbert Xu

On Thu, 2012-11-29 at 09:06 -0800, Eric Dumazet wrote:
> On Thu, 2012-11-29 at 17:13 +0100, Jesper Dangaard Brouer wrote:
> > The major performance bottleneck on NUMA systems, is the mem limit
> > counter which is based an atomic counter.  This patch removes the
> > cache-bouncing of the atomic counter, by moving this accounting to be
> > bound to each CPU.  The LRU list also need to be done per CPU,
> > in-order to keep the accounting straight.
> > 
> > If fragments belonging together is "sprayed" across CPUs, performance
> > will still suffer, but due to NIC rxhashing this is not very common.
> > Correct accounting in this situation is maintained by recording and
> > "assigning" a CPU to a frag queue when its allocated (caused by the
> > first packet associated packet).
> > 
[...]
> > +/* Need to maintain these resource limits per CPU, else we will kill
> > + * performance due to cache-line bouncing
> > + */
> > +struct frag_cpu_limit {
> > +	atomic_t                mem;
> > +	struct list_head        lru_list;
> > +	spinlock_t              lru_lock;
> > +} ____cacheline_aligned_in_smp;
> > +
> 
> This looks like a big patch introducing a specific infrastructure, while
> we already have lib/percpu_counter.c

For the record, I cannot use the lib/percpu_counter, because this
accounting is not kept strictly per CPU, if the fragments are "sprayed"
across CPUs (as described in the commit message above).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V2 5/9] net: frag, per CPU resource, mem limit and LRU list accounting
  2012-12-03 14:02     ` Jesper Dangaard Brouer
@ 2012-12-03 17:25       ` David Miller
  0 siblings, 0 replies; 57+ messages in thread
From: David Miller @ 2012-12-03 17:25 UTC (permalink / raw)
  To: brouer
  Cc: eric.dumazet, fw, netdev, pablo, tgraf, amwang, kaber, paulmck,
	herbert

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Mon, 03 Dec 2012 15:02:41 +0100

> On Thu, 2012-11-29 at 09:06 -0800, Eric Dumazet wrote:
>> On Thu, 2012-11-29 at 17:13 +0100, Jesper Dangaard Brouer wrote:
>> > The major performance bottleneck on NUMA systems, is the mem limit
>> > counter which is based an atomic counter.  This patch removes the
>> > cache-bouncing of the atomic counter, by moving this accounting to be
>> > bound to each CPU.  The LRU list also need to be done per CPU,
>> > in-order to keep the accounting straight.
>> > 
>> > If fragments belonging together is "sprayed" across CPUs, performance
>> > will still suffer, but due to NIC rxhashing this is not very common.
>> > Correct accounting in this situation is maintained by recording and
>> > "assigning" a CPU to a frag queue when its allocated (caused by the
>> > first packet associated packet).
>> > 
> [...]
>> > +/* Need to maintain these resource limits per CPU, else we will kill
>> > + * performance due to cache-line bouncing
>> > + */
>> > +struct frag_cpu_limit {
>> > +	atomic_t                mem;
>> > +	struct list_head        lru_list;
>> > +	spinlock_t              lru_lock;
>> > +} ____cacheline_aligned_in_smp;
>> > +
>> 
>> This looks like a big patch introducing a specific infrastructure, while
>> we already have lib/percpu_counter.c
> 
> For the record, I cannot use the lib/percpu_counter, because this
> accounting is not kept strictly per CPU, if the fragments are "sprayed"
> across CPUs (as described in the commit message above).

The percpu infrastructure allows precise counts and comparisons even
in that case.  It uses the cheap test when possible, and defers to a
more expensive test when necessary.
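
A minimal sketch of how that looks with the stock lib/percpu_counter API
(the frag_mem counter and the two helper names are hypothetical; the
percpu_counter_*() calls are the real ones):

  #include <linux/types.h>
  #include <linux/percpu_counter.h>

  /* Charge memory per CPU; the expensive cross-CPU sum is only paid
   * when the counter is close enough to the threshold to matter.
   */
  static void frag_mem_charge(struct percpu_counter *frag_mem, int size)
  {
  	percpu_counter_add(frag_mem, size);	/* batched per-CPU update */
  }

  static bool frag_mem_over_limit(struct percpu_counter *frag_mem, int high_thresh)
  {
  	/* cheap approximate read first; an exact percpu_counter_sum() is
  	 * only done when the estimate is within the batch error of the limit
  	 */
  	return percpu_counter_compare(frag_mem, high_thresh) > 0;
  }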

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-11-30 23:58                       ` Eric Dumazet
@ 2012-12-04 13:30                         ` Jesper Dangaard Brouer
  2012-12-04 14:32                           ` [net-next PATCH V3-evictor] net: frag evictor,avoid " David Laight
                                             ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-12-04 13:30 UTC (permalink / raw)
  To: Eric Dumazet, David S. Miller, Florian Westphal
  Cc: Jesper Dangaard Brouer, netdev, Thomas Graf, Paul E. McKenney,
	Cong Wang, Herbert Xu

The fragmentation evictor system has a very unfortunate eviction
strategy for killing fragments when the system is put under pressure.

If packets are coming in too fast, the evictor code kills "warm"
fragments too quickly, resulting in close to zero throughput, as
fragments are killed before they have a chance to complete.

This is related to the bad interaction with the LRU (Least Recently
Used) list.  Under load, the LRU list sort-of changes meaning/behavior.
When the LRU head is very new/warm, the head is most likely the
one with the most fragments, and the tail (latest used or added element)
the one with the least.

Solved by introducing a creation "jiffie" timestamp (creation_ts).
If eviction of an element is attempted in the same jiffie it was
created, then perform a tail drop on the LRU list instead.

Signed-off-by: Jesper Dangaard Brouer <jbrouer@redhat.com>

---
V2:
 - Drop the INET_FRAG_FIRST_IN idea for detecting dropped "head" packets

V3:
 - Move the tail drop, from inet_frag_alloc() to inet_frag_evictor()
   This will be close to the same semantics, but at a higher cost.


 include/net/inet_frag.h  |    1 +
 net/ipv4/inet_fragment.c |   12 ++++++++++++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 32786a0..7b897b2 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -24,6 +24,7 @@ struct inet_frag_queue {
 	ktime_t			stamp;
 	int			len;        /* total length of orig datagram */
 	int			meat;
+	u32			creation_ts;/* jiffies when queue was created*/
 	__u8			last_in;    /* first/last segment arrived? */
 
 #define INET_FRAG_COMPLETE	4
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 4750d2b..d8bf59b 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -178,6 +178,16 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
 
 		q = list_first_entry(&nf->lru_list,
 				struct inet_frag_queue, lru_list);
+
+		/* When head of LRU is very new/warm, then the head is
+		 * most likely the one with most fragments and the
+		 * tail with least, thus drop tail
+		 */
+		if (!force && q->creation_ts == (u32) jiffies) {
+			q = list_entry(&nf->lru_list.prev,
+				struct inet_frag_queue, lru_list);
+		}
+
 		atomic_inc(&q->refcnt);
 		read_unlock(&f->lock);
 
@@ -243,11 +253,13 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
 		struct inet_frags *f, void *arg)
 {
 	struct inet_frag_queue *q;
+	// Note: We could also perform the tail drop here
 
 	q = kzalloc(f->qsize, GFP_ATOMIC);
 	if (q == NULL)
 		return NULL;
 
+	q->creation_ts = (u32) jiffies;
 	q->net = nf;
 	f->constructor(q, arg);
 	atomic_add(f->qsize, &nf->mem);

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* RE: [net-next PATCH V3-evictor] net: frag evictor,avoid killing warm frag queues
  2012-12-04 13:30                         ` [net-next PATCH V3-evictor] " Jesper Dangaard Brouer
@ 2012-12-04 14:32                           ` David Laight
  2012-12-04 14:47                           ` [net-next PATCH V3-evictor] net: frag evictor, avoid " Eric Dumazet
  2012-12-05  9:24                           ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 57+ messages in thread
From: David Laight @ 2012-12-04 14:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet, David S. Miller,
	Florian Westphal
  Cc: netdev, Thomas Graf, Paul E. McKenney, Cong Wang, Herbert Xu

> The fragmentation evictor system have a very unfortunate eviction
> system for killing fragment, when the system is put under pressure.
> 
> If packets are coming in too fast, the evictor code kills "warm"
> fragments too quickly.  Resulting in close to zero throughput, as
> fragments are killed before they have a chance to complete
> 
> This is related to the bad interaction with the LRU (Least Recently
> Used) list.  Under load the LRU list sort-of changes meaning/behavior.
> When the LRU head is very new/warm, then the head is most likely the
> one with most fragments and the tail (latest used or added element)
> with least.
> 
> Solved by, introducing a creation "jiffie" timestamp (creation_ts).
> If the element is tried evicted in same jiffie, then perform tail drop
> on the LRU list instead.

I'm not at all sure a 'same tick' test is actually sensible.

Some quick 'ball park' maths:
10Mbps is about 1kB per jiffie (1kHz tick).
So at 10Mbps you'll see a frame every 1.5 jiffies.
So if the source of any fragment stream is some remote ADSL
connection (rather than a local Ge LAN connection) you'll
never manage to assemble the entire datagram.
Anyone connecting from dialup will fail miserably.

This means you need to keep some initial fragments for much
longer than 1 jiffie before discarding them.
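
Spelling that arithmetic out: 10 Mbit/s is about 1.25 kB per 1 ms jiffie, so
a 1500-byte frame lands roughly every 1.2-1.5 jiffies, and the ~45 fragments
of a 65507-byte datagram need on the order of 50-60 ms -- tens of jiffies --
to all arrive.  So any eviction rule keyed to a single jiffie cannot work
for senders at those speeds.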

Anything with more than one in-sequence fragment is a good
candidate for keeping (especially if the inter-fragment time
is being monitored).

Anything with out of sequence fragments is a very good choice
for discard.

But yes, discarding some 'new' fragments is very likely necessary
in order to manage to actually receive anything.

Sometimes I've wondered whether discarding some transmits would
help when the rx side is congested (and v.v.). I've never tried
modelling that one with any traffic flows! But forcing the remote
to time out might reduce the overall traffic; the trouble is the
discards would have to be carefully chosen!
(I was thinking about this mainly with regard to routers that
are having issues feeding outbound data into a narrow pipe.)

	David

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-04 13:30                         ` [net-next PATCH V3-evictor] " Jesper Dangaard Brouer
  2012-12-04 14:32                           ` [net-next PATCH V3-evictor] net: frag evictor,avoid " David Laight
@ 2012-12-04 14:47                           ` Eric Dumazet
  2012-12-04 17:51                             ` Jesper Dangaard Brouer
  2012-12-05  9:24                           ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-12-04 14:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S. Miller, Florian Westphal, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu

On Tue, 2012-12-04 at 14:30 +0100, Jesper Dangaard Brouer wrote:
> The fragmentation evictor system have a very unfortunate eviction
> system for killing fragment, when the system is put under pressure.
> 
> If packets are coming in too fast, the evictor code kills "warm"
> fragments too quickly.  Resulting in close to zero throughput, as
> fragments are killed before they have a chance to complete
> 
> This is related to the bad interaction with the LRU (Least Recently
> Used) list.  Under load the LRU list sort-of changes meaning/behavior.
> When the LRU head is very new/warm, then the head is most likely the
> one with most fragments and the tail (latest used or added element)
> with least.
> 
> Solved by, introducing a creation "jiffie" timestamp (creation_ts).
> If the element is tried evicted in same jiffie, then perform tail drop
> on the LRU list instead.
> 
> Signed-off-by: Jesper Dangaard Brouer <jbrouer@redhat.com>

This would only 'work' if a reassembled packet can be done/completed
under one jiffie.

For 64KB packets, this means a 100Mb link won't be able to deliver a
reassembled packet under IP frag load if HZ=1000.

The LRU goal is to be able to select the oldest inet_frag_queue, because in
typical networks packet losses really do happen, and this is why
some packets won't complete their reassembly. They will naturally be
found at the LRU head, and they are probably very fat (for example, a
single packet was lost for the inet_frag_queue).

Choosing the most recent inet_frag_queue is exactly the opposite
strategy. We pay the huge cost of maintaining a central LRU, and we
exactly misuse it.

As long as an inet_frag_queue receives new fragments and is moved to the
LRU tail, its a candidate for being kept, not a candidate for being
evicted.

Only when an inet_frag_queue is the oldest one, it becomes a candidate
for eviction.

I think you are trying to solve a configuration/tuning problem by
changing a valid strategy.

What's wrong with admitting that the high_thresh/low_thresh default values
should be updated, now that some people apparently want to use IP fragments
in production?

Let's say we allow frags to use 1% of memory, instead of the current
256 KB limit, which was chosen decades ago.

Only in very severe DOS attacks would the LRU head 'creation_ts' possibly
be <= 1ms old. And under severe DOS attacks, I am afraid there is nothing
we can do.

(We could eventually avoid the LRU hassle and instead choose a random drop
strategy)

high_thresh/low_thresh should be changed from 'int' to 'long' as well,
so that a 64bit host could use more than 2GB for frag storage.
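
As a rough illustration of that sizing (a sketch only; the 1% figure is the
one suggested above and the shell arithmetic is just an example):

  # size the frag thresholds to ~1% of system memory
  mem_bytes=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 ))
  high=$(( mem_bytes / 100 ))
  sysctl -w net.ipv4.ipfrag_high_thresh=$high
  sysctl -w net.ipv4.ipfrag_low_thresh=$(( high * 3 / 4 ))

On a box with a few hundred GB of memory the result already exceeds a
32-bit 'int', which is exactly why the thresholds would also need to
become 'long'.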

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-04 14:47                           ` [net-next PATCH V3-evictor] net: frag evictor, avoid " Eric Dumazet
@ 2012-12-04 17:51                             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-12-04 17:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu

On Tue, 2012-12-04 at 06:47 -0800, Eric Dumazet wrote:
> On Tue, 2012-12-04 at 14:30 +0100, Jesper Dangaard Brouer wrote:
> > The fragmentation evictor system have a very unfortunate eviction
> > system for killing fragment, when the system is put under pressure.
> > 
> > If packets are coming in too fast, the evictor code kills "warm"
> > fragments too quickly.  Resulting in close to zero throughput, as
> > fragments are killed before they have a chance to complete
> > 
> > This is related to the bad interaction with the LRU (Least Recently
> > Used) list.  Under load the LRU list sort-of changes meaning/behavior.
> > When the LRU head is very new/warm, then the head is most likely the
> > one with most fragments and the tail (latest used or added element)
> > with least.
> > 
> > Solved by, introducing a creation "jiffie" timestamp (creation_ts).
> > If the element is tried evicted in same jiffie, then perform tail drop
> > on the LRU list instead.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <jbrouer@redhat.com>

First of all, this patch is not the perfect thing; it's a starting point
for a discussion to find a better solution.


> This would only 'work' if a reassembled packet can be done/completed
> under one jiffie.

True, and I'm not happy with this resolution.  Its only purpose is to
help me detect when the LRU list is reversing its functionality.

This is the *only* message I'm trying to convey:

    **The LRU list is misbehaving** (in this situation)


Perhaps the best option is to implement something other than an LRU... I
just haven't found the correct replacement/idea yet.


> For 64KB packets, this means 100Mb link wont be able to deliver a
> reassembled packet under IP frags load if HZ=1000

True, the 1 jiffie check should be increased, but that's not the point.
(Also, I make no promise of fairness; I hope we can address the fairness
issues in a later patch, perhaps in combination with replacing the LRU.)


(Notice: I have run tests with higher high_thresh/low_thresh values, the
results are the same)


> LRU goal is to be able to select the oldest inet_frag_queue, because in
> typical networks, packet losses are really happening and this is why
> some packets wont complete their reassembly. They naturally will be
> found on LRU head, and they probably are very fat (for example a single
> packet was lost for the inet_frag_queue)

Look at what is happening in inet_frag_evictor() when we are under
load.  We will quickly delete all of the oldest inet_frag_queues you are
talking about.  After that, what will the LRU list be filled with? Only
new fragments.

Think about it: what is the order of this list now?  Remember, it only
contains incomplete inet_frag_queues.

My theory -- prove me wrong -- is that when the LRU head is very new/warm,
the head is most likely the one with the most fragments, and the tail
(latest used or added element) the one with the least fragments.


> Choosing the most recent inet_frag_queue is exactly the opposite
> strategy. We pay the huge cost of maintaining a central LRU, and we
> exactly misuse it.

Then the LRU list is perhaps the wrong choice?

> As long as an inet_frag_queue receives new fragments and is moved to the
> LRU tail, its a candidate for being kept, not a candidate for being
> evicted.

Remember, I have shown/proven that all inet_frag_queues in the list
have been touched within 1 jiffie.  Which one do you choose for removal?

(Also remember that if an inet_frag_queue loses one frame at the network
layer, it will not complete, and after 1 jiffie it will be killed by the
evictor.  So this function still "works".)


> Only when an inet_frag_queue is the oldest one, it becomes a candidate
> for eviction.
> 
> I think you are trying to solve a configuration/tuning problem by
> changing a valid strategy.
> 
> Whats wrong with admitting high_thresh/low_thresh default values should
> be updated, now some people apparently want to use IP fragments in
> production ?

I'm not against increasing the high_thresh/low_thresh default values.
I have tested with your 4MB/3MB settings (and 40/39, and 400/399).  The
results are (almost) the same; it's not the problem!  I have shown you
several test results already (and have added some extra tests below).
And yes, the high_thresh/low_thresh default values should be increased;
I just don't want to discuss how much.

I want to discuss the correctness of the evictor and LRU.  You are
trying to avoid calling the evictor code; you cannot, assuming a queueing
system where packets are arriving at a higher rate than you can
process.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


p.s. I'm working on getting better interleaving of the fragments. I'm
depending on running the netperf generators on different CPUs.  I tried
adding an SFQ qdisc, but no egress queueing occurred, so it didn't have
any effect.

RAW tests, with different high_thresh/low_thresh:
-------------------------------------------------
I'm extracting the "FRAG: inuse X memory YYYYYY" with the command:
 [root@dragon ~]# for pid in `ps aux | grep [n]etserver | awk '{print $2}' | tr '\n' ' '`; do echo -e "\nNetserver PID:$pid"; egrep -e 'UDP|FRAG' /proc/$pid/net/sockstat ; done

Default net-next kernel without patches:

[root@dragon ~]# uname -a
Linux dragon 3.7.0-rc6-net-next+ #47 SMP Thu Nov 22 00:06:12 CET 2012 x86_64 x86_64 x86_64 GNU/Linux

----------------------
[root@dragon ~]# grep . /proc/sys/net/ipv4/ipfrag_*_thresh
/proc/sys/net/ipv4/ipfrag_high_thresh:262144
/proc/sys/net/ipv4/ipfrag_low_thresh:196608

FRAG: inuse 4 memory 245152

[jbrouer@firesoul ~]$ netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 & netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l 20
[1] 10580
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.2 (192.168.31.2) port 0 AF_INET : cpu bind
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.51.2 (192.168.51.2) port 0 AF_INET : cpu bind
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      353279      0    9256.89
212992           20.00       10768            282.15

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      350801      0    9191.95
212992           20.00        7283            190.83

------------------------


[root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*4)))
net.ipv4.ipfrag_high_thresh = 4194304
[root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*3)))
net.ipv4.ipfrag_low_thresh = 3145728

FRAG: inuse 41 memory 3867784

[jbrouer@firesoul ~]$ netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 & netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l 20
[1] 10882
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.2 (192.168.31.2) port 0 AF_INET : cpu bind
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.51.2 (192.168.51.2) port 0 AF_INET : cpu bind
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      353379      0    9259.50
212992           20.00       48986           1283.57

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      350488      0    9183.75
212992           20.00       33336            873.50

-------------------------

[root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*40)))
net.ipv4.ipfrag_high_thresh = 41943040
[root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*39)))
net.ipv4.ipfrag_low_thresh = 40894464

FRAG: inuse 442 memory 41693008

[jbrouer@firesoul ~]$ netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 & netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l 20
[1] 10899
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.2 (192.168.31.2) port 0 AF_INET : cpu bind
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.51.2 (192.168.51.2) port 0 AF_INET : cpu bind
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      353097      0    9252.10
212992           20.00       38281           1003.07

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      351708      0    9215.72
212992           20.00       23602            618.44


---------------------------

[root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*400)))
net.ipv4.ipfrag_high_thresh = 419430400
[root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*399)))
net.ipv4.ipfrag_low_thresh = 418381824

FRAG: inuse 4665 memory 418760600

[jbrouer@firesoul ~]$ netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 & netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l 20
[2] 10918
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.2 (192.168.31.2) port 0 AF_INET : cpu bind
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.51.2 (192.168.51.2) port 0 AF_INET : cpu bind
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      352255      0    9230.05
212992           20.00       28048            734.94

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      349842      0    9166.83
212992           20.00       20979            549.71

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-04 13:30                         ` [net-next PATCH V3-evictor] " Jesper Dangaard Brouer
  2012-12-04 14:32                           ` [net-next PATCH V3-evictor] net: frag evictor,avoid " David Laight
  2012-12-04 14:47                           ` [net-next PATCH V3-evictor] net: frag evictor, avoid " Eric Dumazet
@ 2012-12-05  9:24                           ` Jesper Dangaard Brouer
  2012-12-06 12:26                             ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-12-05  9:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu


First of all, this patch contains a small bug (see below), which
resulted in me not testing the correct patch...

Second, this patch does NOT behave as I expected and claimed.  Thus, my
conclusions in my previous response might be wrong!

The previous evictor patch of letting new fragments enter, worked
amazingly well.  But I suspect, this might also be related to a
bug/problem in the evictor loop (which were being hidden by that patch).

My new *theory* is that the evictor loop will be looping too much if
it finds a fragment which is INET_FRAG_COMPLETE... in that case, we
don't advance the LRU list, and thus will pick up the exact same
inet_frag_queue again in the loop... to get out of the loop we need
another CPU or packet to change the LRU list for us... I'll test that
theory... (it could also be CPUs fighting over the same LRU head
element that causes this) ... more to come...


On Tue, 2012-12-04 at 14:30 +0100, Jesper Dangaard Brouer wrote:
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 4750d2b..d8bf59b 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -178,6 +178,16 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
>  
>  		q = list_first_entry(&nf->lru_list,
>  				struct inet_frag_queue, lru_list);
> +
> +		/* When head of LRU is very new/warm, then the head is
> +		 * most likely the one with most fragments and the
> +		 * tail with least, thus drop tail
> +		 */
> +		if (!force && q->creation_ts == (u32) jiffies) {
> +			q = list_entry(&nf->lru_list.prev,

Remove the "&" in &nf->lru_list.prev

> +				struct inet_frag_queue, lru_list);
> +		}
> +
>  		atomic_inc(&q->refcnt);
>  		read_unlock(&f->lock);
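
For readers following the thread, the corrected hunk (with the stray "&"
dropped, exactly as described above) would read:

		if (!force && q->creation_ts == (u32) jiffies) {
			q = list_entry(nf->lru_list.prev,
				struct inet_frag_queue, lru_list);
		}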

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-05  9:24                           ` Jesper Dangaard Brouer
@ 2012-12-06 12:26                             ` Jesper Dangaard Brouer
  2012-12-06 12:32                               ` Florian Westphal
  0 siblings, 1 reply; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-12-06 12:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu

On Wed, 2012-12-05 at 10:24 +0100, Jesper Dangaard Brouer wrote:
> 
> The previous evictor patch of letting new fragments enter, worked
> amazingly well.  But I suspect, this might also be related to a
> bug/problem in the evictor loop (which were being hidden by that
> patch).

The evictor loop does not contain a bug, just an SMP scalability issue
(which is fixed by later patches).  The first evictor patch, which
does not let new fragments enter, only worked amazingly well because
it hides this (and other) scalability issues, and implicitly allows
frags already "in" to exceed the mem usage for 1 jiffie.  Thus the
patch is invalidated, as the improvement was only a side effect.


> My new *theory* is that the evictor loop, will be looping too much, if
> it finds a fragment which is INET_FRAG_COMPLETE ... in that case, we
> don't advance the LRU list, and thus will pickup the exact same
> inet_frag_queue again in the loop... to get out of the loop we need
> another CPU or packet to change the LRU list for us... I'll test that
> theory... (its could also be CPUs fighting over the same LRU head
> element that cause this) ... more to come...

The above theory does happen, but does not cause excessive looping.
The CPUs are just fighting over who gets to free the inet_frag_queue
and who gets to unlink it from its data structures (I guess this
results in cache-line bouncing between CPUs).

CPUs are fighting for the same LRU head (inet_frag_queue) element,
which is bad for scalability.  We could fix this by unlinking the
element once a CPU grabs it, but it would require us to change a
read_lock to a write_lock, thus we might not gain much performance.

I already (implicitly) fix this in a later patch, where I'm moving the
LRU lists to be per CPU.  So, I don't know if it's worth fixing.


(And yes, I'm using thresh 4Mb/3Mb as my default setting now, but I'm
also experimenting with other thresh sizes)

p.s. Thank you, Eric, for being so persistent, so that I realized this
patch was not good.  We can hopefully now move on to the other patches,
which fix the real scalability issues.

--Jesper

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-06 12:26                             ` Jesper Dangaard Brouer
@ 2012-12-06 12:32                               ` Florian Westphal
  2012-12-06 13:29                                 ` David Laight
  2012-12-06 13:55                                 ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 57+ messages in thread
From: Florian Westphal @ 2012-12-06 12:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
	Thomas Graf, Paul E. McKenney, Cong Wang, Herbert Xu

Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
> CPUs are fighting for the same LRU head (inet_frag_queue) element,
> which is bad for scalability.  We could fix this by unlinking the
> element once a CPU graps it, but it would require us to change a
> read_lock to a write_lock, thus we might not gain much performance.
> 
> I already (implicit) fix this is a later patch, where I'm moving the
> LRU lists to be per CPU.  So, I don't know if it's worth fixing.

Do you think its worth trying to remove the lru list altogether and
just evict from the hash in a round-robin fashion instead?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-06 12:32                               ` Florian Westphal
@ 2012-12-06 13:29                                 ` David Laight
  2012-12-06 21:38                                   ` David Miller
  2012-12-06 13:55                                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 57+ messages in thread
From: David Laight @ 2012-12-06 13:29 UTC (permalink / raw)
  To: Florian Westphal, Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S. Miller, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu

> Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
> > CPUs are fighting for the same LRU head (inet_frag_queue) element,
> > which is bad for scalability.  We could fix this by unlinking the
> > element once a CPU graps it, but it would require us to change a
> > read_lock to a write_lock, thus we might not gain much performance.
> >
> > I already (implicit) fix this is a later patch, where I'm moving the
> > LRU lists to be per CPU.  So, I don't know if it's worth fixing.
> 
> Do you think its worth trying to remove the lru list altogether and
> just evict from the hash in a round-robin fashion instead?

Round-robin will be the same as LRU under overload - so have the
same issues.
Random might be better - especially if IP datagrams for which
more than one in-sequence packet have been received are moved
to a second structure.
But you still need something to control the total memory use.

NFS/UDP is about the only thing that generates very large
IP datagrams - and no one in their right mind runs that
over non-local links.

For SMP you might hash to a small array of pointers (to fragments)
each having its own lock. Only evict items with the same hash.
Put the id in the array and you probably won't need to look at
the actual fragment (saving a cache miss) unless it is the one
you want.

	David
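
A minimal sketch of the per-bucket layout David describes (the names and
the array size are made up for illustration; this is not proposed code):

  #include <linux/spinlock.h>
  #include <linux/list.h>
  #include <linux/types.h>

  #define FRAG_IDS_PER_BUCKET 8	/* illustrative only */

  struct frag_bucket {
  	spinlock_t		lock;		/* per-bucket lock */
  	__be16			ids[FRAG_IDS_PER_BUCKET];
  						/* cached IP IDs, scanned before
  						 * touching any inet_frag_queue */
  	struct hlist_head	queues;		/* chain of inet_frag_queue */
  };

Eviction would then only ever touch entries hashed to the same bucket, so a
lookup that needs to free memory can make room under the single lock it
already holds.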

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-06 12:32                               ` Florian Westphal
  2012-12-06 13:29                                 ` David Laight
@ 2012-12-06 13:55                                 ` Jesper Dangaard Brouer
  2012-12-06 14:47                                   ` Eric Dumazet
  1 sibling, 1 reply; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-12-06 13:55 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Eric Dumazet, David S. Miller, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu

On Thu, 2012-12-06 at 13:32 +0100, Florian Westphal wrote:
> Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
> > CPUs are fighting for the same LRU head (inet_frag_queue) element,
> > which is bad for scalability.  We could fix this by unlinking the
> > element once a CPU graps it, but it would require us to change a
> > read_lock to a write_lock, thus we might not gain much performance.
> > 
> > I already (implicit) fix this is a later patch, where I'm moving the
> > LRU lists to be per CPU.  So, I don't know if it's worth fixing.
> 
> Do you think its worth trying to remove the lru list altogether and
> just evict from the hash in a round-robin fashion instead?

Perhaps.  But do note my bashing of the LRU list was wrong.  I planned
to explain that in a separate mail, but basically I was causing a DoS
attack with incomplete fragments on myself, because I had disabled
Ethernet flow-control.  This led me to some false assumptions about the
LRU list behavior (sorry).

The LRU might be the correct solution after all.  If I enable Ethernet
flow-control again, then I have a hard time "activating" the evictor
code (with thresh 4M/3M).  I'll need a separate DoS program, which can
send incomplete fragments (in back-to-back bursts) to provoke the
evictor and LRU.

My cheap DoS reproducer-hack is to disable Ethernet flow-control on only
one interface (out of 3), to cause packet drops and thus incomplete
fragments. The current preliminary result is that the two other
interfaces still get packets through; we don't get the zero-throughput
situation.
 Two interfaces and no DoS: 15342 Mbit/s
 Three interfaces and DoS:   7355 Mbit/s

The reduction might look big, but you have to take into account that
"activating" the evictor code also causes scalability issues of its
own (which could account for the performance drop by itself).
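
For reference, the flow-control toggle used for this reproducer-hack is the
standard ethtool pause-parameter control (the interface name is a
placeholder):

  ethtool -A eth31 rx off tx off autoneg off	# stop pause frames on one NIC
  ethtool -a eth31				# show resulting pause settings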

--Jesper

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-06 13:55                                 ` Jesper Dangaard Brouer
@ 2012-12-06 14:47                                   ` Eric Dumazet
  2012-12-06 15:23                                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Dumazet @ 2012-12-06 14:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Florian Westphal, David S. Miller, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu

On Thu, 2012-12-06 at 14:55 +0100, Jesper Dangaard Brouer wrote:

> Perhaps.  But do note my bashing of the LRU list were wrong.  I planned
> to explain that in a separate mail, but basically I were causing a DoS
> attack with incomplete fragments on my self, because I had disabled
> Ethernet flow-control.  Which led me to some false assumptions on the
> LRU list behavior (sorry).
> 
> The LRU might be the correct solution after all.  If I enable Ethernet
> flow-control again, then I have a hard time "activating" the evictor
> code (with thresh 4M/3M) .  I'll need a separate DoS program, which can
> send incomplete fragments (in back-to-back bursts) to provoke the
> evictor and LRU.
> 
> My cheap DoS reproducer-hack is to disable Ethernet flow-control on only
> one interface (out of 3), to cause packet drops and the incomplete
> fragments. The current preliminary results is that the two other
> interfaces still gets packets through, we don't get the zero throughput
> situation.
>  Two interfaces and no DoS: 15342 Mbit/s
>  Three interfaces and DoS:   7355 Mbit/s
> 
> The reduction might look big, but you have to take into account, that
> "activating" the evictor code, is also causing scalability issues of its
> own (which could account for the performance drop it self).

I would try removing the LRU, but keeping the age information (jiffie of
last valid frag received on one inet_frag_queue)

The eviction would be a function of the current memory used for the
frags (percpu_counter for good SMP scalability), divided by the max
allowed size, and ipfrag_time.

Under load, we would evict inet_frag_queue before the ipfrag_time timer,
without necessarily having to scan whole frags, only the ones we find in
the bucket we need to parse anyway (and lock)

The whole idea of a full garbage collect under softirq is not scalable,
as it locks a CPU in a non preemptible section for too long.
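
To make that direction concrete, here is a rough sketch (the mem_pcpu,
last_frag_time and bucket_list fields and the frag_queue_kill() helper are
hypothetical; percpu_counter and the list/time macros are the stock kernel
ones -- an illustration, not a tested patch):

  #include <linux/percpu_counter.h>
  #include <linux/jiffies.h>
  #include <linux/list.h>

  /* Allowed age shrinks from the full ipfrag_time towards zero as the
   * (per-CPU batched) memory counter approaches the high threshold.
   */
  static unsigned long frag_allowed_age(struct netns_frags *nf)
  {
  	s64 mem = percpu_counter_read_positive(&nf->mem_pcpu);

  	if (mem >= nf->high_thresh)
  		return 0;
  	return nf->timeout - (nf->timeout * mem) / nf->high_thresh;
  }

  /* Called with the bucket lock already held while inserting a new queue,
   * so only this one hash bucket is scanned -- no global LRU walk and no
   * long non-preemptible section in softirq context.
   */
  static void frag_expire_bucket(struct netns_frags *nf, struct list_head *bucket)
  {
  	struct inet_frag_queue *q, *tmp;
  	unsigned long max_age = frag_allowed_age(nf);

  	list_for_each_entry_safe(q, tmp, bucket, bucket_list)
  		if (time_after(jiffies, q->last_frag_time + max_age))
  			frag_queue_kill(nf, q);
  }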

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-06 14:47                                   ` Eric Dumazet
@ 2012-12-06 15:23                                     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 57+ messages in thread
From: Jesper Dangaard Brouer @ 2012-12-06 15:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Westphal, David S. Miller, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu

On Thu, 2012-12-06 at 06:47 -0800, Eric Dumazet wrote:
> On Thu, 2012-12-06 at 14:55 +0100, Jesper Dangaard Brouer wrote:
> 
> > The LRU might be the correct solution after all.  If I enable Ethernet
> > flow-control again, then I have a hard time "activating" the evictor
> > code (with thresh 4M/3M) .  I'll need a separate DoS program, which can
> > send incomplete fragments (in back-to-back bursts) to provoke the
> > evictor and LRU.
> > 
> > My cheap DoS reproducer-hack is to disable Ethernet flow-control on only
> > one interface (out of 3), to cause packet drops and the incomplete
> > fragments. The current preliminary results is that the two other
> > interfaces still gets packets through, we don't get the zero throughput
> > situation.
> >  Two interfaces and no DoS: 15342 Mbit/s
> >  Three interfaces and DoS:   7355 Mbit/s
> > 
> > The reduction might look big, but you have to take into account, that
> > "activating" the evictor code, is also causing scalability issues of its
> > own (which could account for the performance drop it self).
> 
> I would try removing the LRU, but keeping the age information (jiffie of
> last valid frag received on one inet_frag_queue)

I don't think it's worth optimizing further, atm.

Because the test above is without any of my SMP scalability fixes.
With my SMP fixes, the result is full scalability:

 Three interfaces:  (9601+6723+9432) = 25756 Mbit/s

And the 6723 Mbit/s number is because the old 10G NIC cannot generate
any more...

And I basically cannot use the cheap DoS reproducer-hack, as the
machine/code-path is now too fast...

Running with 4 interfaces, and starting 6 netperf's (to cause more
interleaving and higher mem usage):

 4716+8042+8765+6204+2475+4568 = 34770 Mbit/s

I could just barely manage to get it to do IpReasmFails = 14.

[jbrouer@dragon ~]$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives                    2980048            0.0
IpInDelivers                    66217              0.0
IpReasmReqds                    2980040            0.0
IpReasmOKs                      66218              0.0
IpReasmFails                    14                 0.0
UdpInDatagrams                  66218              0.0
IpExtInOctets                   4397976885         0.0
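
Those counters also hang together: with the default 65507 byte UDP messages
and an assumed 1500 byte MTU, each datagram splits into 45 fragments, so the
~66k reassembled datagrams should correspond to roughly 2.98M IpReasmReqds.
A quick back-of-the-envelope check (MTU and header sizes are assumptions,
not measured):

#include <stdio.h>

int main(void)
{
	const long udp_payload = 65507;           /* netperf UDP_STREAM default    */
	const long ip_payload  = udp_payload + 8; /* plus the 8-byte UDP header    */
	const long per_frag    = 1500 - 20;       /* assumed MTU minus IPv4 header */
	const long frags = (ip_payload + per_frag - 1) / per_frag;

	printf("fragments per datagram: %ld\n", frags);         /* 45 */
	printf("expected IpReasmReqds : %ld\n", 66218 * frags); /* ~2.98M, close to above */
	return 0;
}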

So, after the SMP fixes, it's very hard to "activate" the evictor.  We
would need to find a slower (e.g. embedded) box and tune the evictor on
that, as a multi-CPU machine basically scales "too well" now ;-)

--Jesper

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
  2012-12-06 13:29                                 ` David Laight
@ 2012-12-06 21:38                                   ` David Miller
  0 siblings, 0 replies; 57+ messages in thread
From: David Miller @ 2012-12-06 21:38 UTC (permalink / raw)
  To: David.Laight
  Cc: fw, jbrouer, eric.dumazet, netdev, tgraf, paulmck, amwang,
	herbert

From: "David Laight" <David.Laight@ACULAB.COM>
Date: Thu, 6 Dec 2012 13:29:13 -0000

> NFS/UDP is about the only thing that generates very large
> IP datagrams - and no one in their right mind runs that
> over non-local links.

There are people with real applications that use UDP with
large IP datagrams.  As unfortunate as it is, this is the
reality we have to deal with.

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2012-12-06 21:38 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-29 16:10 [net-next PATCH V2 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
2012-11-29 16:11 ` [net-next PATCH V2 1/9] net: frag evictor, avoid killing warm frag queues Jesper Dangaard Brouer
2012-11-29 17:44   ` David Miller
2012-11-29 22:17     ` Jesper Dangaard Brouer
2012-11-29 23:01       ` Eric Dumazet
2012-11-30 10:04         ` Jesper Dangaard Brouer
2012-11-30 14:52           ` Eric Dumazet
2012-11-30 15:45             ` Jesper Dangaard Brouer
2012-11-30 16:37               ` Eric Dumazet
2012-11-30 21:37                 ` Jesper Dangaard Brouer
2012-11-30 22:25                   ` Eric Dumazet
2012-11-30 23:23                     ` Jesper Dangaard Brouer
2012-11-30 23:47                       ` Stephen Hemminger
2012-12-01  0:03                         ` Eric Dumazet
2012-12-01  0:13                           ` Stephen Hemminger
2012-11-30 23:58                       ` Eric Dumazet
2012-12-04 13:30                         ` [net-next PATCH V3-evictor] " Jesper Dangaard Brouer
2012-12-04 14:32                           ` [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues David Laight
2012-12-04 14:47                           ` [net-next PATCH V3-evictor] net: frag evictor, avoid " Eric Dumazet
2012-12-04 17:51                             ` Jesper Dangaard Brouer
2012-12-05  9:24                           ` Jesper Dangaard Brouer
2012-12-06 12:26                             ` Jesper Dangaard Brouer
2012-12-06 12:32                               ` Florian Westphal
2012-12-06 13:29                                 ` David Laight
2012-12-06 21:38                                   ` David Miller
2012-12-06 13:55                                 ` Jesper Dangaard Brouer
2012-12-06 14:47                                   ` Eric Dumazet
2012-12-06 15:23                                     ` Jesper Dangaard Brouer
2012-11-29 23:32       ` [net-next PATCH V2 1/9] " Eric Dumazet
2012-11-30 12:01       ` Jesper Dangaard Brouer
2012-11-30 14:57         ` Eric Dumazet
2012-11-29 16:11 ` [net-next PATCH V2 2/9] net: frag cache line adjust inet_frag_queue.net Jesper Dangaard Brouer
2012-11-29 16:12 ` [net-next PATCH V2 3/9] net: frag, move LRU list maintenance outside of rwlock Jesper Dangaard Brouer
2012-11-29 17:43   ` Eric Dumazet
2012-11-29 17:48     ` David Miller
2012-11-29 17:54       ` Eric Dumazet
2012-11-29 18:05         ` David Miller
2012-11-29 18:24           ` Eric Dumazet
2012-11-29 18:31             ` David Miller
2012-11-29 18:33               ` Eric Dumazet
2012-11-29 18:36                 ` David Miller
2012-11-29 22:33         ` Jesper Dangaard Brouer
2012-11-29 16:12 ` [net-next PATCH V2 4/9] net: frag helper functions for mem limit tracking Jesper Dangaard Brouer
2012-11-29 16:13 ` [net-next PATCH V2 5/9] net: frag, per CPU resource, mem limit and LRU list accounting Jesper Dangaard Brouer
2012-11-29 17:06   ` Eric Dumazet
2012-11-29 17:31     ` David Miller
2012-12-03 14:02     ` Jesper Dangaard Brouer
2012-12-03 17:25       ` David Miller
2012-11-29 16:14 ` [net-next PATCH V2 6/9] net: frag, implement dynamic percpu alloc of frag_cpu_limit Jesper Dangaard Brouer
2012-11-29 16:15 ` [net-next PATCH V2 7/9] net: frag, move nqueues counter under LRU lock protection Jesper Dangaard Brouer
2012-11-29 16:15 ` [net-next PATCH V2 8/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
2012-11-29 17:08   ` Eric Dumazet
2012-11-30 12:55     ` Jesper Dangaard Brouer
2012-11-29 16:16 ` [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line Jesper Dangaard Brouer
2012-11-29 16:39   ` [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line David Laight
2012-11-29 16:55   ` [net-next PATCH V2 9/9] net: increase frag queue hash size and cache-line Eric Dumazet
2012-11-29 20:53     ` Jesper Dangaard Brouer
