* [RFC net-next PATCH V1 2/9] net: frag cache line adjust inet_frag_queue.net
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 4/9] net: frag helper functions for mem limit tracking Jesper Dangaard Brouer
` (6 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
In inet_frag_find() unfortunate cache-line bounces can occur with
struct inet_frag_queue, because the net pointer (struct netns_frags)
is placed on the same write-often cache-line as e.g. refcnt and
lock, and the hash match check always reads it (q->net == nf).
This (of course) only happens on hash bucket collisions, but as the
current hash size is only 64, collisions are fairly likely.
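For reference, a minimal sketch of the hot lookup loop (paraphrased from
inet_frag_find(); not a verbatim copy) showing where the read of q->net
lands next to the write-often fields:

    hlist_for_each_entry(q, n, &f->hash[hash], list) {
            /* The q->net read below touches the same cache-line that
             * other CPUs dirty through q->refcnt and q->lock. */
            if (q->net == nf && f->match(q, key)) {
                    atomic_inc(&q->refcnt);
                    return q;
            }
    }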
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/net/inet_frag.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 7b897b2..1f75316 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -14,7 +14,6 @@ struct netns_frags {
struct inet_frag_queue {
struct hlist_node list;
- struct netns_frags *net;
struct list_head lru_list; /* lru list member */
spinlock_t lock;
atomic_t refcnt;
@@ -24,6 +23,7 @@ struct inet_frag_queue {
ktime_t stamp;
int len; /* total length of orig datagram */
int meat;
+ struct netns_frags *net;
u32 creation_ts;/* jiffies when queue was created*/
__u8 last_in; /* first/last segment arrived? */
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC net-next PATCH V1 4/9] net: frag helper functions for mem limit tracking
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 2/9] net: frag cache line adjust inet_frag_queue.net Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
` (5 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
This change is primarily a preparation to ease the extension of memory
limit tracking.
The change also reduces the number of atomic operations during freeing of
a frag queue. This gives a fairly good performance improvement, as
these atomic operations are at the core of the performance problems
seen on NUMA systems.
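The essence of the change, in inet_frag_destroy(), is to sum up
skb->truesize locally while walking the fragment list and account the
total with a single atomic operation, instead of one atomic_sub() per
skb (a condensed sketch of the hunk below):

    unsigned int sum_truesize = 0;

    while (fp) {
            struct sk_buff *xp = fp->next;
            sum_truesize += fp->truesize;     /* accumulate locally */
            frag_kfree_skb(nf, f, fp);        /* no atomic_sub() per skb */
            fp = xp;
    }
    sub_frag_mem_limit(q, sum_truesize + f->qsize);  /* one atomic op */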
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/net/inet_frag.h | 28 ++++++++++++++++++++++++++++
include/net/ipv6.h | 2 +-
net/ipv4/inet_fragment.c | 27 +++++++++++++--------------
net/ipv4/ip_fragment.c | 24 +++++++++++-------------
net/ipv6/netfilter/nf_conntrack_reasm.c | 6 +++---
net/ipv6/reassembly.c | 6 +++---
6 files changed, 59 insertions(+), 34 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 312a3fa..9bbef17 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -95,4 +95,32 @@ static inline void inet_frag_lru_add(struct netns_frags *nf,
list_add_tail(&q->lru_list, &nf->lru_list);
spin_unlock(&nf->lru_lock);
}
+
+/* Memory Tracking Functions. */
+
+static inline int frag_mem_limit(struct netns_frags *nf)
+{
+ return atomic_read(&nf->mem);
+}
+
+static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
+{
+ atomic_sub(i, &q->net->mem);
+}
+
+static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
+{
+ atomic_add(i, &q->net->mem);
+}
+
+static inline void init_frag_mem_limit(struct netns_frags *nf)
+{
+ atomic_set(&nf->mem, 0);
+}
+
+static inline int sum_frag_mem_limit(struct netns_frags *nf)
+{
+ return atomic_read(&nf->mem);
+}
+
#endif
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 979bf6c..a5c1cf1 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -279,7 +279,7 @@ static inline int ip6_frag_nqueues(struct net *net)
static inline int ip6_frag_mem(struct net *net)
{
- return atomic_read(&net->ipv6.frags.mem);
+ return sum_frag_mem_limit(&net->ipv6.frags);
}
#endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 47141ab..4f1ab8a 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -73,7 +73,7 @@ EXPORT_SYMBOL(inet_frags_init);
void inet_frags_init_net(struct netns_frags *nf)
{
nf->nqueues = 0;
- atomic_set(&nf->mem, 0);
+ init_frag_mem_limit(nf);
INIT_LIST_HEAD(&nf->lru_list);
spin_lock_init(&nf->lru_lock);
}
@@ -118,12 +118,8 @@ void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
EXPORT_SYMBOL(inet_frag_kill);
static inline void frag_kfree_skb(struct netns_frags *nf, struct inet_frags *f,
- struct sk_buff *skb, int *work)
+ struct sk_buff *skb)
{
- if (work)
- *work -= skb->truesize;
-
- atomic_sub(skb->truesize, &nf->mem);
if (f->skb_free)
f->skb_free(skb);
kfree_skb(skb);
@@ -134,6 +130,7 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
{
struct sk_buff *fp;
struct netns_frags *nf;
+ unsigned int sum, sum_truesize = 0;
WARN_ON(!(q->last_in & INET_FRAG_COMPLETE));
WARN_ON(del_timer(&q->timer) != 0);
@@ -144,13 +141,14 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
while (fp) {
struct sk_buff *xp = fp->next;
- frag_kfree_skb(nf, f, fp, work);
+ sum_truesize += fp->truesize;
+ frag_kfree_skb(nf, f, fp);
fp = xp;
}
-
+ sum = sum_truesize + f->qsize;
if (work)
- *work -= f->qsize;
- atomic_sub(f->qsize, &nf->mem);
+ *work -= sum;
+ sub_frag_mem_limit(q, sum);
if (f->destructor)
f->destructor(q);
@@ -165,11 +163,11 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
int work, evicted = 0;
if (!force) {
- if (atomic_read(&nf->mem) <= nf->high_thresh)
+ if (frag_mem_limit(nf) <= nf->high_thresh)
return 0;
}
- work = atomic_read(&nf->mem) - nf->low_thresh;
+ work = frag_mem_limit(nf) - nf->low_thresh;
while (work > 0) {
spin_lock(&nf->lru_lock);
@@ -269,7 +267,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
/* Guard creations of new frag queues if mem limit reached, as
* we allow warm/recent elements to survive in inet_frag_evictor()
*/
- if (atomic_read(&nf->mem) > nf->high_thresh)
+ if (frag_mem_limit(nf) > nf->high_thresh)
return NULL;
q = kzalloc(f->qsize, GFP_ATOMIC);
@@ -279,7 +277,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
q->creation_ts = (u32) jiffies;
q->net = nf;
f->constructor(q, arg);
- atomic_add(f->qsize, &nf->mem);
+ add_frag_mem_limit(q, f->qsize);
+
setup_timer(&q->timer, f->frag_expire, (unsigned long)q);
spin_lock_init(&q->lock);
atomic_set(&q->refcnt, 1);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 04c9e53..cc36a0b 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -122,7 +122,7 @@ int ip_frag_nqueues(struct net *net)
int ip_frag_mem(struct net *net)
{
- return atomic_read(&net->ipv4.frags.mem);
+ return sum_frag_mem_limit(&net->ipv4.frags);
}
static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
@@ -161,13 +161,6 @@ static bool ip4_frag_match(struct inet_frag_queue *q, void *a)
qp->user == arg->user;
}
-/* Memory Tracking Functions. */
-static void frag_kfree_skb(struct netns_frags *nf, struct sk_buff *skb)
-{
- atomic_sub(skb->truesize, &nf->mem);
- kfree_skb(skb);
-}
-
static void ip4_frag_init(struct inet_frag_queue *q, void *a)
{
struct ipq *qp = container_of(q, struct ipq, q);
@@ -340,6 +333,7 @@ static inline int ip_frag_too_far(struct ipq *qp)
static int ip_frag_reinit(struct ipq *qp)
{
struct sk_buff *fp;
+ unsigned int sum_truesize = 0;
if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
atomic_inc(&qp->q.refcnt);
@@ -349,9 +343,12 @@ static int ip_frag_reinit(struct ipq *qp)
fp = qp->q.fragments;
do {
struct sk_buff *xp = fp->next;
- frag_kfree_skb(qp->q.net, fp);
+
+ sum_truesize += fp->truesize;
+ kfree_skb(fp);
fp = xp;
} while (fp);
+ sub_frag_mem_limit(&qp->q, sum_truesize);
qp->q.last_in = 0;
qp->q.len = 0;
@@ -496,7 +493,8 @@ found:
qp->q.fragments = next;
qp->q.meat -= free_it->len;
- frag_kfree_skb(qp->q.net, free_it);
+ sub_frag_mem_limit(&qp->q, free_it->truesize);
+ kfree_skb(free_it);
}
}
@@ -519,7 +517,7 @@ found:
qp->q.stamp = skb->tstamp;
qp->q.meat += skb->len;
qp->ecn |= ecn;
- atomic_add(skb->truesize, &qp->q.net->mem);
+ add_frag_mem_limit(&qp->q, skb->truesize);
if (offset == 0)
qp->q.last_in |= INET_FRAG_FIRST_IN;
@@ -614,7 +612,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
head->len -= clone->len;
clone->csum = 0;
clone->ip_summed = head->ip_summed;
- atomic_add(clone->truesize, &qp->q.net->mem);
+ add_frag_mem_limit(&qp->q, clone->truesize);
}
skb_push(head, head->data - skb_network_header(head));
@@ -642,7 +640,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
}
fp = next;
}
- atomic_sub(sum_truesize, &qp->q.net->mem);
+ sub_frag_mem_limit(&qp->q, sum_truesize);
head->next = NULL;
head->dev = dev;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index b0a1c96..c088831 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -316,7 +316,7 @@ found:
fq->q.meat += skb->len;
if (payload_len > fq->q.max_size)
fq->q.max_size = payload_len;
- atomic_add(skb->truesize, &fq->q.net->mem);
+ add_frag_mem_limit(&fq->q, skb->truesize);
/* The first fragment.
* nhoffset is obtained from the first fragment, of course.
@@ -394,7 +394,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
clone->ip_summed = head->ip_summed;
NFCT_FRAG6_CB(clone)->orig = NULL;
- atomic_add(clone->truesize, &fq->q.net->mem);
+ add_frag_mem_limit(&fq->q, clone->truesize);
}
/* We have to remove fragment header from datagram and to relocate
@@ -418,7 +418,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
head->csum = csum_add(head->csum, fp->csum);
head->truesize += fp->truesize;
}
- atomic_sub(head->truesize, &fq->q.net->mem);
+ sub_frag_mem_limit(&fq->q, head->truesize);
head->local_df = 1;
head->next = NULL;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index b38f290..9cfe047 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -327,7 +327,7 @@ found:
}
fq->q.stamp = skb->tstamp;
fq->q.meat += skb->len;
- atomic_add(skb->truesize, &fq->q.net->mem);
+ add_frag_mem_limit(&fq->q, skb->truesize);
/* The first fragment.
* nhoffset is obtained from the first fragment, of course.
@@ -426,7 +426,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
head->len -= clone->len;
clone->csum = 0;
clone->ip_summed = head->ip_summed;
- atomic_add(clone->truesize, &fq->q.net->mem);
+ add_frag_mem_limit(&fq->q, clone->truesize);
}
/* We have to remove fragment header from datagram and to relocate
@@ -464,7 +464,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
}
fp = next;
}
- atomic_sub(sum_truesize, &fq->q.net->mem);
+ sub_frag_mem_limit(&fq->q, sum_truesize);
head->next = NULL;
head->dev = dev;
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 2/9] net: frag cache line adjust inet_frag_queue.net Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 4/9] net: frag helper functions for mem limit tracking Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-27 9:07 ` Jesper Dangaard Brouer
2012-11-27 15:00 ` Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 8/9] net: increase frag queue hash size and cache-line Jesper Dangaard Brouer
` (4 subsequent siblings)
7 siblings, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
DO NOT apply - patch not finished, can cause an OOPS/PANIC during hash rebuild
This patch implements per hash bucket locking for the frag queue
hash. This removes two write locks, and the only remaining write
lock is for protecting hash rebuild. This essentially reduces the
readers-writer lock to a rebuild lock.
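The resulting lock nesting in the lookup/unlink paths is (condensed from
the diff below): take the rebuild lock shared, compute the hash, then
take only that bucket's chain lock exclusively:

    read_lock(&f->lock);             /* excludes hash rebuild (shared) */
    hash = f->hashfn(fq);
    hb = &f->hash[hash];

    spin_lock_bh(&hb->chain_lock);   /* per-bucket exclusion */
    hlist_del(&fq->list);            /* or add/lookup on hb->chain */
    spin_unlock_bh(&hb->chain_lock);
    read_unlock(&f->lock);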
NOT-Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/net/inet_frag.h | 10 +++++++-
net/ipv4/inet_fragment.c | 56 +++++++++++++++++++++++++++++++++++-----------
2 files changed, 51 insertions(+), 15 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 9938ea4..1efec6b 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -51,9 +51,15 @@ struct inet_frag_queue {
#define INETFRAGS_HASHSZ 64
+
+struct inet_frag_bucket {
+ struct hlist_head chain;
+ spinlock_t chain_lock;
+};
+
struct inet_frags {
- struct hlist_head hash[INETFRAGS_HASHSZ];
- rwlock_t lock;
+ struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
+ rwlock_t lock; /* Rebuild lock */
u32 rnd;
int qsize;
int secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 1620a21..447423f 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -35,20 +35,27 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
unsigned long now = jiffies;
int i;
+ /* Per bucket lock NOT needed here, due to write lock protection */
write_lock(&f->lock);
+
get_random_bytes(&f->rnd, sizeof(u32));
for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+ struct inet_frag_bucket *hb;
struct inet_frag_queue *q;
struct hlist_node *p, *n;
- hlist_for_each_entry_safe(q, p, n, &f->hash[i], list) {
+ hb = &f->hash[i];
+ hlist_for_each_entry_safe(q, p, n, &hb->chain, list) {
unsigned int hval = f->hashfn(q);
if (hval != i) {
+ struct inet_frag_bucket *hb_dest;
+
hlist_del(&q->list);
/* Relink to new hash chain. */
- hlist_add_head(&q->list, &f->hash[hval]);
+ hb_dest = &f->hash[hval];
+ hlist_add_head(&q->list, &hb->chain);
}
}
}
@@ -61,9 +68,12 @@ void inet_frags_init(struct inet_frags *f)
{
int i;
- for (i = 0; i < INETFRAGS_HASHSZ; i++)
- INIT_HLIST_HEAD(&f->hash[i]);
+ for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+ struct inet_frag_bucket *hb = &f->hash[i];
+ spin_lock_init(&hb->chain_lock);
+ INIT_HLIST_HEAD(&hb->chain);
+ }
rwlock_init(&f->lock);
f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
@@ -102,9 +112,17 @@ EXPORT_SYMBOL(inet_frags_exit_net);
static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
{
- write_lock(&f->lock);
+ struct inet_frag_bucket *hb;
+ unsigned int hash;
+
+ read_lock(&f->lock);
+ hash = f->hashfn(fq);
+ hb = &f->hash[hash];
+
+ spin_lock_bh(&hb->chain_lock);
hlist_del(&fq->list);
- write_unlock(&f->lock);
+ spin_unlock_bh(&hb->chain_lock);
+ read_unlock(&f->lock);
inet_frag_lru_del(fq);
}
@@ -224,28 +242,33 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
struct inet_frag_queue *qp_in, struct inet_frags *f,
void *arg)
{
+ struct inet_frag_bucket *hb;
struct inet_frag_queue *qp;
#ifdef CONFIG_SMP
struct hlist_node *n;
#endif
unsigned int hash;
- write_lock(&f->lock);
+ read_lock(&f->lock); /* Protects against hash rebuild */
/*
* While we stayed w/o the lock other CPU could update
* the rnd seed, so we need to re-calculate the hash
* chain. Fortunatelly the qp_in can be used to get one.
*/
hash = f->hashfn(qp_in);
+ hb = &f->hash[hash];
+ spin_lock_bh(&hb->chain_lock);
+
#ifdef CONFIG_SMP
/* With SMP race we have to recheck hash table, because
* such entry could be created on other cpu, while we
- * promoted read lock to write lock.
+ * promoted read lock to write lock. ***Comment FIXME***
*/
- hlist_for_each_entry(qp, n, &f->hash[hash], list) {
+ hlist_for_each_entry(qp, n, &hb->chain, list) {
if (qp->net == nf && f->match(qp, arg)) {
atomic_inc(&qp->refcnt);
- write_unlock(&f->lock);
+ spin_unlock_bh(&hb->chain_lock);
+ read_unlock(&f->lock);
qp_in->last_in |= INET_FRAG_COMPLETE;
inet_frag_put(qp_in, f);
return qp;
@@ -257,8 +280,9 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
atomic_inc(&qp->refcnt);
atomic_inc(&qp->refcnt);
- hlist_add_head(&qp->list, &f->hash[hash]);
- write_unlock(&f->lock);
+ hlist_add_head(&qp->list, &hb->chain);
+ spin_unlock_bh(&hb->chain_lock);
+ read_unlock(&f->lock);
inet_frag_lru_add(nf, qp);
return qp;
}
@@ -307,16 +331,22 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
struct inet_frags *f, void *key, unsigned int hash)
__releases(&f->lock)
{
+ struct inet_frag_bucket *hb;
struct inet_frag_queue *q;
struct hlist_node *n;
- hlist_for_each_entry(q, n, &f->hash[hash], list) {
+ hb = &f->hash[hash];
+
+ spin_lock_bh(&hb->chain_lock);
+ hlist_for_each_entry(q, n, &hb->chain, list) {
if (q->net == nf && f->match(q, key)) {
atomic_inc(&q->refcnt);
+ spin_unlock_bh(&hb->chain_lock);
read_unlock(&f->lock);
return q;
}
}
+ spin_unlock_bh(&hb->chain_lock);
read_unlock(&f->lock);
return inet_frag_create(nf, f, key);
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket
2012-11-23 13:08 ` [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
@ 2012-11-27 9:07 ` Jesper Dangaard Brouer
2012-11-27 15:00 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-27 9:07 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> DO NOT apply - patch not finished, can cause an OOPS/PANIC during hash rebuild
I think I have fixed this issue. It is even fixed in this patch, as the
oops occurred during testing with a slightly older version of this
patch.
> This patch implements per hash bucket locking for the frag queue
> hash. This removes two write locks, and the only remaining write
> lock is for protecting hash rebuild. This essentially reduces the
> readers-writer lock to a rebuild lock.
[... cut ...]
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 1620a21..447423f 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
[...]
> @@ -102,9 +112,17 @@ EXPORT_SYMBOL(inet_frags_exit_net);
>
> static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
> {
> - write_lock(&f->lock);
> + struct inet_frag_bucket *hb;
> + unsigned int hash;
> +
> + read_lock(&f->lock);
> + hash = f->hashfn(fq);
> + hb = &f->hash[hash];
Before, the read_lock was taken _here_, below the hash calculation. That
was wrong and caused the oops, as the hash result can change when rnd is
changed during a hash rebuild.
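Roughly, the older (broken) version did the equivalent of (a hypothetical
reconstruction, not the code quoted above):

    hash = f->hashfn(fq);            /* uses f->rnd ...                */
    read_lock(&f->lock);             /* ... before rebuild is excluded */
    hb = &f->hash[hash];             /* may now index the wrong bucket */

whereas the quoted version takes the read_lock before computing the hash.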
> +
> + spin_lock_bh(&hb->chain_lock);
> hlist_del(&fq->list);
> - write_unlock(&f->lock);
> + spin_unlock_bh(&hb->chain_lock);
> + read_unlock(&f->lock);
> inet_frag_lru_del(fq);
> }
[... cut ...]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket
2012-11-23 13:08 ` [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
2012-11-27 9:07 ` Jesper Dangaard Brouer
@ 2012-11-27 15:00 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-27 15:00 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> DO NOT apply - patch not finished, can cause an OOPS/PANIC during hash rebuild
>
> This patch implements per hash bucket locking for the frag queue
> hash. This removes two write locks, and the only remaining write
> lock is for protecting hash rebuild. This essentially reduces the
> readers-writer lock to a rebuild lock.
>
> NOT-Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
The last bug mentioned was not the only one... hopefully this fixes the last bug in this patch.
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 1620a21..447423f 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -35,20 +35,27 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
> unsigned long now = jiffies;
> int i;
>
> + /* Per bucket lock NOT needed here, due to write lock protection */
> write_lock(&f->lock);
> +
> get_random_bytes(&f->rnd, sizeof(u32));
> for (i = 0; i < INETFRAGS_HASHSZ; i++) {
> + struct inet_frag_bucket *hb;
> struct inet_frag_queue *q;
> struct hlist_node *p, *n;
>
> - hlist_for_each_entry_safe(q, p, n, &f->hash[i], list) {
> + hb = &f->hash[i];
> + hlist_for_each_entry_safe(q, p, n, &hb->chain, list) {
> unsigned int hval = f->hashfn(q);
>
> if (hval != i) {
> + struct inet_frag_bucket *hb_dest;
> +
> hlist_del(&q->list);
>
> /* Relink to new hash chain. */
> - hlist_add_head(&q->list, &f->hash[hval]);
> + hb_dest = &f->hash[hval];
> + hlist_add_head(&q->list, &hb->chain);
The above line was wrong; it should have been:
hlist_add_head(&q->list, &hb_dest->chain);
> }
> }
> }
The patch seems quite stable now. My test is to adjust the rebuild
interval to 2 sec and then run 4x 10G with two fragments (packet size
1472*2) to create as many fragments as possible (approx 300
inet_frag_queue elements).
30 min test run:
3726+3896+3960+3608 = 15190 Mbit/s
(For reproducers, note that changing ipfrag_secret_interval (e.g.
sysctl -w net/ipv4/ipfrag_secret_interval=2) first takes effect after
the first interval/timer expires, which defaults to 10 min.)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC net-next PATCH V1 8/9] net: increase frag queue hash size and cache-line
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
` (2 preceding siblings ...)
2012-11-23 13:08 ` [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack) Jesper Dangaard Brouer
` (3 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
Increase the frag queue hash size and ensure cache-line alignment to
avoid false sharing. The hash size is set to 256 because I have
observed 206 frag queues in use at 4x10G with packet size 4416 bytes.
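(Assuming 64-byte cache lines (the actual line size is arch dependent),
____cacheline_aligned_in_smp pads each inet_frag_bucket to a full line,
so the hash array grows to 256 * 64 bytes = 16 KiB per struct inet_frags
instance.)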
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/net/inet_frag.h | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1efec6b..5054228 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -49,13 +49,12 @@ struct inet_frag_queue {
u16 max_size;
};
-#define INETFRAGS_HASHSZ 64
-
+#define INETFRAGS_HASHSZ 256
struct inet_frag_bucket {
struct hlist_head chain;
spinlock_t chain_lock;
-};
+} ____cacheline_aligned_in_smp;
struct inet_frags {
struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack)
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
` (3 preceding siblings ...)
2012-11-23 13:08 ` [RFC net-next PATCH V1 8/9] net: increase frag queue hash size and cache-line Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-26 6:03 ` Stephen Hemminger
2012-11-26 9:18 ` Florian Westphal
[not found] ` <20121123130806.18764.41854.stgit@dragon>
` (2 subsequent siblings)
7 siblings, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
Do NOT APPLY this patch.
After all the other patches, the rw_lock is now the contention point.
This is a quick hack that removes the readers-writer lock by
disabling/breaking hash rebuilding, just to see how big the
performance gain would be.
2x10G size(4416) result: 6481+6764 = 13245 Mbit/s (gen: 7652+8077 Mbit/s)
4x10G size(4416) result:(5610+6283+5735+5238)=22866 Mbit/s
(gen: 6530+7860+5967+5238 =25595 Mbit/s)
And the results show that it's a big win. With 4x10G size(4416):
before 17923 Mbit/s -> now 22866 Mbit/s, an increase of 4943 Mbit/s.
With 2x10G size(4416): before 10689 Mbit/s -> now 13245 Mbit/s,
an increase of 2556 Mbit/s.
I'll work on a real solution for removing the rw_lock while still
supporting hash rebuilding. Suggestions and ideas are welcome.
NOT-signed-off
---
include/net/inet_frag.h | 2 +-
net/ipv4/inet_fragment.c | 23 +++++++++++++----------
net/ipv4/ip_fragment.c | 2 +-
net/ipv6/netfilter/nf_conntrack_reasm.c | 2 +-
net/ipv6/reassembly.c | 2 +-
5 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 5054228..2fb8578 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -58,7 +58,7 @@ struct inet_frag_bucket {
struct inet_frags {
struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
- rwlock_t lock; /* Rebuild lock */
+// rwlock_t lock; /* Rebuild lock */
u32 rnd;
int qsize;
int secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 447423f..63227d6 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -35,8 +35,11 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
unsigned long now = jiffies;
int i;
+ //HACK don't rebuild
+ return;
+
/* Per bucket lock NOT needed here, due to write lock protection */
- write_lock(&f->lock);
+// write_lock(&f->lock);
get_random_bytes(&f->rnd, sizeof(u32));
for (i = 0; i < INETFRAGS_HASHSZ; i++) {
@@ -59,7 +62,7 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
}
}
}
- write_unlock(&f->lock);
+// write_unlock(&f->lock);
mod_timer(&f->secret_timer, now + f->secret_interval);
}
@@ -74,7 +77,7 @@ void inet_frags_init(struct inet_frags *f)
spin_lock_init(&hb->chain_lock);
INIT_HLIST_HEAD(&hb->chain);
}
- rwlock_init(&f->lock);
+// rwlock_init(&f->lock);
f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
(jiffies ^ (jiffies >> 6)));
@@ -115,14 +118,14 @@ static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
struct inet_frag_bucket *hb;
unsigned int hash;
- read_lock(&f->lock);
+ //read_lock(&f->lock);
hash = f->hashfn(fq);
hb = &f->hash[hash];
spin_lock_bh(&hb->chain_lock);
hlist_del(&fq->list);
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
inet_frag_lru_del(fq);
}
@@ -249,7 +252,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
#endif
unsigned int hash;
- read_lock(&f->lock); /* Protects against hash rebuild */
+ //read_lock(&f->lock); /* Protects against hash rebuild */
/*
* While we stayed w/o the lock other CPU could update
* the rnd seed, so we need to re-calculate the hash
@@ -268,7 +271,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
if (qp->net == nf && f->match(qp, arg)) {
atomic_inc(&qp->refcnt);
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
qp_in->last_in |= INET_FRAG_COMPLETE;
inet_frag_put(qp_in, f);
return qp;
@@ -282,7 +285,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
atomic_inc(&qp->refcnt);
hlist_add_head(&qp->list, &hb->chain);
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
inet_frag_lru_add(nf, qp);
return qp;
}
@@ -342,12 +345,12 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
if (q->net == nf && f->match(q, key)) {
atomic_inc(&q->refcnt);
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
return q;
}
}
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
return inet_frag_create(nf, f, key);
}
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7b1cf51..b2cb05f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -289,7 +289,7 @@ static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
arg.iph = iph;
arg.user = user;
- read_lock(&ip4_frags.lock);
+ //read_lock(&ip4_frags.lock);
hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index c088831..5b57e03 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -175,7 +175,7 @@ static inline struct frag_queue *fq_find(struct net *net, __be32 id,
arg.src = src;
arg.dst = dst;
- read_lock_bh(&nf_frags.lock);
+ //read_lock_bh(&nf_frags.lock);
hash = inet6_hash_frag(id, src, dst, nf_frags.rnd);
q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 9cfe047..2c74394 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -193,7 +193,7 @@ fq_find(struct net *net, __be32 id, const struct in6_addr *src, const struct in6
arg.src = src;
arg.dst = dst;
- read_lock(&ip6_frags.lock);
+ //read_lock(&ip6_frags.lock);
hash = inet6_hash_frag(id, src, dst, ip6_frags.rnd);
q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash);
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack)
2012-11-23 13:08 ` [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack) Jesper Dangaard Brouer
@ 2012-11-26 6:03 ` Stephen Hemminger
2012-11-26 9:18 ` Florian Westphal
1 sibling, 0 replies; 19+ messages in thread
From: Stephen Hemminger @ 2012-11-26 6:03 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
Pablo Neira Ayuso, Thomas Graf, Cong Wang, Patrick McHardy,
Paul E. McKenney, Herbert Xu
On Fri, 23 Nov 2012 14:08:47 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> Do NOT APPLY this patch.
>
> After all the other patches, the rw_lock is now the contention point.
Reader-writer locks are significantly slower than a simple spinlock.
Unless the reader holds the lock for a "significant" number of cycles,
a spin lock will be faster.
Did you try just changing it to a spinlock?
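For illustration, the minimal variant of that suggestion would be a
hypothetical change along these lines (not a posted patch):

    struct inet_frags {
            struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
            spinlock_t              lock;    /* instead of rwlock_t */
            ...
    };

with both the lookup paths and the rebuild timer taking
spin_lock(&f->lock)/spin_unlock(&f->lock) around the hash access.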
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack)
2012-11-23 13:08 ` [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack) Jesper Dangaard Brouer
2012-11-26 6:03 ` Stephen Hemminger
@ 2012-11-26 9:18 ` Florian Westphal
1 sibling, 0 replies; 19+ messages in thread
From: Florian Westphal @ 2012-11-26 9:18 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
Pablo Neira Ayuso, Thomas Graf, Cong Wang, Patrick McHardy,
Paul E. McKenney, Herbert Xu
Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> After all the other patches, the rw_lock is now the contention point.
>
> This is a quick hack that removes the readers-writer lock by
> disabling/breaking hash rebuilding, just to see how big the
> performance gain would be.
>
> 2x10G size(4416) result: 6481+6764 = 13245 Mbit/s (gen: 7652+8077 Mbit/s)
>
> 4x10G size(4416) result:(5610+6283+5735+5238)=22866 Mbit/s
> (gen: 6530+7860+5967+5238 =25595 Mbit/s)
>
> And the results show that it's a big win. With 4x10G size(4416):
> before 17923 Mbit/s -> now 22866 Mbit/s, an increase of 4943 Mbit/s.
> With 2x10G size(4416): before 10689 Mbit/s -> now 13245 Mbit/s,
> an increase of 2556 Mbit/s.
>
> I'll work on a real solution for removing the rw_lock while still
> supporting hash rebuilding. Suggestions and ideas are welcome.
<devils advocate>
Why not kill it altogether, and just set a new secret_interval
without moving frag queues to their new location?
The only consequence is that all fragments queued at the time of
changing secret_interval will be lost, and freed by the evictor/timer.
The default secret rebuild interval is 10 minutes; should we care about
a small packet loss every 10 minutes?
</devils advocate>
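A minimal sketch of what that would reduce inet_frag_secret_rebuild() to
(hypothetical, following the suggestion above; stale queues simply stop
matching and are reaped by their timers or the evictor):

    static void inet_frag_secret_rebuild(unsigned long dummy)
    {
            struct inet_frags *f = (struct inet_frags *)dummy;

            /* Change the secret, but do not relink existing queues. */
            get_random_bytes(&f->rnd, sizeof(u32));
            mod_timer(&f->secret_timer, jiffies + f->secret_interval);
    }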
^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <20121123130806.18764.41854.stgit@dragon>]
* Re: [RFC net-next PATCH V1 1/9] net: frag evictor, avoid killing warm frag queues
[not found] ` <20121123130806.18764.41854.stgit@dragon>
@ 2012-11-23 19:58 ` Florian Westphal
2012-11-24 11:36 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 19+ messages in thread
From: Florian Westphal @ 2012-11-23 19:58 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
Pablo Neira Ayuso, Thomas Graf, Cong Wang, Patrick McHardy,
Paul E. McKenney, Herbert Xu
Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> +// TODO: Idea what about also looking at flag INET_FRAG_FIRST_IN
> +// just as safe-guard against frags with a dropped "head" packet
> + if (!force && q->creation_ts == (u32) jiffies) {
I think we should not rely on head fragment arriving first.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 1/9] net: frag evictor, avoid killing warm frag queues
2012-11-23 19:58 ` [RFC net-next PATCH V1 1/9] net: frag evictor, avoid killing warm frag queues Florian Westphal
@ 2012-11-24 11:36 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-24 11:36 UTC (permalink / raw)
To: Florian Westphal
Cc: Eric Dumazet, David S. Miller, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 20:58 +0100, Florian Westphal wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > +// TODO: Idea what about also looking at flag INET_FRAG_FIRST_IN
> > +// just as safe-guard against frags with a dropped "head" packet
> > + if (!force && q->creation_ts == (u32) jiffies) {
>
> I think we should not rely on head fragment arriving first.
Good point. Let's keep the current approach.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
` (5 preceding siblings ...)
[not found] ` <20121123130806.18764.41854.stgit@dragon>
@ 2012-11-25 2:31 ` Eric Dumazet
2012-11-25 8:53 ` Jesper Dangaard Brouer
[not found] ` <20121123130826.18764.66507.stgit@dragon>
7 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-11-25 2:31 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> This patchset implements significant performance improvements for
> fragmentation handling in the kernel, with a focus on NUMA and SMP
> based systems.
>
> Review:
>
> Please review these patches. I have on purpose added comments in the
> code with the "//" comments style. These comments are to be removed
> before applying. They serve as questions to you, the reviewer.
>
> The fragmentation code today:
>
> The fragmentation code "protects" kernel resources, by implementing
> some memory resource limitation code. This is centered around a
> global readers-writer lock, and (per network namespace) an atomic mem
> counter and a LRU (Least-Recently-Used) list. (Although separate
> global variables and namespace resources, are kept for IPv4, IPv6
> and Netfilter reassembly.)
>
> The code tries to keep the memory usage between a high and low
> threshold (see: /proc/sys/net/ipv4/ipfrag_{high,low}_thresh). The
> "evictor" code cleans up fragments, when the high threshold is
> exceeded, and stops only, when the low threshold is reached.
>
> The scalability problem:
>
> Having a global/central variable for a resource limit is obviously a
> scalability issue on SMP systems, and even amplified on a NUMA based
> system.
>
But ... what practical workload even uses fragments?
Sure, netperf -t UDP_STREAM uses frags, but it's a benchmark.
The only heavy user was NFS in the days it was using UDP, a very long
time ago.
A single lost fragment means the whole packet is lost.
Another problem with fragments is the lack of 4-tuple hashing, as only
the first frag contains the dst/src ports.
Also there is the sysctl_ipfrag_max_dist issue...
Hint: many NICs provide TSO (TCP offload), but none provide UFO,
probably because there is no demand for it.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-25 2:31 ` [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Eric Dumazet
@ 2012-11-25 8:53 ` Jesper Dangaard Brouer
2012-11-25 16:11 ` Eric Dumazet
0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-25 8:53 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Sat, 2012-11-24 at 18:31 -0800, Eric Dumazet wrote:
> On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> > This patchset implements significant performance improvements for
> > fragmentation handling in the kernel, with a focus on NUMA and SMP
> > based systems.
> >
> > Review:
> >
> > Please review these patches. I have on purpose added comments in the
> > code with the "//" comments style. These comments are to be removed
> > before applying. They serve as questions to you, the reviewer.
> >
> > The fragmentation code today:
> >
> > The fragmentation code "protects" kernel resources, by implementing
> > some memory resource limitation code. This is centered around a
> > global readers-writer lock, and (per network namespace) an atomic mem
> > counter and a LRU (Least-Recently-Used) list. (Although separate
> > global variables and namespace resources, are kept for IPv4, IPv6
> > and Netfilter reassembly.)
> >
> > The code tries to keep the memory usage between a high and low
> > threshold (see: /proc/sys/net/ipv4/ipfrag_{high,low}_thresh). The
> > "evictor" code cleans up fragments, when the high threshold is
> > exceeded, and stops only, when the low threshold is reached.
> >
> > The scalability problem:
> >
> > Having a global/central variable for a resource limit is obviously a
> > scalability issue on SMP systems, and even amplified on a NUMA based
> > system.
> >
>
>
> But ... , what practical workload even use fragments ?
(1) DNS (default for Bind) will use up to 3 UDP fragments before
switching to TCP. This is getting more and more relevant after the
introduction of DNSSEC. That's why I'm explicitly testing the 3xUDP
fragments case so heavily.
(2) For IPVS (load-balancing) we have recently allowed fragmentation in
tunnel mode, towards the realservers (to hide the MTU reduction for the
clients). Thus, we need better frag performance in this case.
(3) I also have a customer that has a usage scenario/application (at
4x10G) that needs this... but I'm trying to convince them to fix/change
their application...
Scenario (1) is the real reason I want to fix this scalability issue in
the code.
> Sure, netperf -t UDP_STREAM uses frags, but its a benchmark.
Yes, for the default large 64k packet size, it's just a "fake"
benchmark. And notice that with my fixes, we are even faster than the
none-frag/single-UDP packet case... but that's because we are getting a
GSO/GRO effect.
That's why I'm adjusting the UDP "frag" packet size to get a more
realistic use case... to simulate the DNS use-case (1).
> The only heavy user was NFS in the days it was using UDP, a very long
> time ago.
>
> A single lost fragment means the whole packet is lost.
That is correct, and that's why we need the fix in patch-01.
(It actually reminds me of the problem with ADSL/ATM, where (small) ATM
frames are used for carrying IP packets, and when some (more central) ATM
link gets overloaded and starts to drop ATM frames, it does so without
taking the AAL5 packets into account).
> Another problem with fragments is the lack of 4-tuple hashing, as only
> the first frag contains the dst/src ports.
>
> Also there is the sysctl_ipfrag_max_dist issue...
>
> Hint : many NIC provide TSO (TCP offload), but none provide UFO,
> probably because there is no demand for it.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-25 8:53 ` Jesper Dangaard Brouer
@ 2012-11-25 16:11 ` Eric Dumazet
2012-11-26 14:42 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-11-25 16:11 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> Yes, for the default large 64k packets size, its just a "fake"
> benchmark. And notice with my fixes, we are even faster than the
> none-frag/single-UDP packet case... but its because we are getting a
> GSO/GRO effect.
Could you elaborate on this GSO/GRO effect ?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-25 16:11 ` Eric Dumazet
@ 2012-11-26 14:42 ` Jesper Dangaard Brouer
2012-11-26 15:15 ` Eric Dumazet
0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-26 14:42 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
>
> > Yes, for the default large 64k packets size, its just a "fake"
> > benchmark. And notice with my fixes, we are even faster than the
> > none-frag/single-UDP packet case... but its because we are getting a
> > GSO/GRO effect.
>
> Could you elaborate on this GSO/GRO effect ?
On the big system, I saw none-frag UDP (1472 bytes) throughput of:
7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
While with UDP fragments size 65507 bytes I saw:
9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
Fragmented UDP is faster by:
37668.26 - 29308.21 = 8360.05 Mbit/s
The 65507 bytes UDP size is just a benchmark test and has no real-life
relevance, as performance starts to drop (below the none-frag/normal case)
when the frag size is decreased to more realistic sizes...
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-26 14:42 ` Jesper Dangaard Brouer
@ 2012-11-26 15:15 ` Eric Dumazet
2012-11-26 15:29 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-11-26 15:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> >
> > > Yes, for the default large 64k packets size, its just a "fake"
> > > benchmark. And notice with my fixes, we are even faster than the
> > > none-frag/single-UDP packet case... but its because we are getting a
> > > GSO/GRO effect.
> >
> > Could you elaborate on this GSO/GRO effect ?
>
> On the big system, I saw none-frag UDP (1472 bytes) throughput of:
> 7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
>
> While with UDP fragments size 65507 bytes I saw:
> 9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
>
> Fragmented UDP is faster by:
> 37668.26 - 29308.21 = 8360.05 Mbit/s
>
> The 65507 bytes UDP size is just a benchmark test, and have no real-life
> relevance. As performance starts to drop (below none-frag/normal case)
> when the frag size is decreased, to more realistic sizes...
Yes, but I doubt GRO / GSO are the reason you get better performance.
GRO doesn't aggregate UDP frames.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-26 15:15 ` Eric Dumazet
@ 2012-11-26 15:29 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-26 15:29 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Mon, 2012-11-26 at 07:15 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> > On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> > >
> > > > Yes, for the default large 64k packets size, its just a "fake"
> > > > benchmark. And notice with my fixes, we are even faster than the
> > > > none-frag/single-UDP packet case... but its because we are getting a
> > > > GSO/GRO effect.
> > >
> > > Could you elaborate on this GSO/GRO effect ?
> >
> > On the big system, I saw none-frag UDP (1472 bytes) throughput of:
> > 7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
> >
> > While with UDP fragments size 65507 bytes I saw:
> > 9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
> >
> > Fragmented UDP is faster by:
> > 37668.26 - 29308.21 = 8360.05 Mbit/s
> >
> > The 65507 bytes UDP size is just a benchmark test, and have no real-life
> > relevance. As performance starts to drop (below none-frag/normal case)
> > when the frag size is decreased, to more realistic sizes...
>
> Yes, but I doubt GRO / GSO are the reason you get better performance.
> GRO doesnt aggregate UDP frames.
Oh, now I think I understand your question.
I don't think GRO is helping me; it's the same "effect" as GRO. As I
understand it, the reassembled frag SKB will be a "bigger" SKB, which is
passed to the rest of the stack. Thus, fewer (but bigger) SKBs take the
overhead of the rest of the stack. It was actually Herbert who mentioned
it to me...
--Jesper
^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <20121123130826.18764.66507.stgit@dragon>]
* Re: [RFC net-next PATCH V1 5/9] net: frag per CPU mem limit and LRU list accounting
[not found] ` <20121123130826.18764.66507.stgit@dragon>
@ 2012-11-26 2:54 ` Cong Wang
0 siblings, 0 replies; 19+ messages in thread
From: Cong Wang @ 2012-11-26 2:54 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
Pablo Neira Ayuso, Thomas Graf, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> + // QUESTION: Please advice in a better alloc solution than NR_CPUS
> + struct cpu_resource percpu[NR_CPUS];
alloc_percpu(struct cpu_resource) ?
BTW, struct cpu_resource is not a good name here, it is too generic.
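A rough sketch of that suggestion (hypothetical; "frag_cpu_limit" and its
fields are invented names standing in for patch 5/9's struct cpu_resource):

    struct frag_cpu_limit {
            atomic_t         mem;
            struct list_head lru_list;
            spinlock_t       lru_lock;
    };

    struct frag_cpu_limit __percpu *percpu;

    percpu = alloc_percpu(struct frag_cpu_limit); /* one copy per possible CPU */
    if (!percpu)
            return -ENOMEM;
    ...
    struct frag_cpu_limit *lim = this_cpu_ptr(percpu); /* this CPU's copy */
    ...
    free_percpu(percpu);                          /* on netns teardown */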
^ permalink raw reply [flat|nested] 19+ messages in thread