* [RFC net-next PATCH V1 2/9] net: frag cache line adjust inet_frag_queue.net
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 4/9] net: frag helper functions for mem limit tracking Jesper Dangaard Brouer
` (6 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
In inet_frag_find() unfortunate cache-line bounces can occur with
struct inet_frag_queue, because the net pointer (struct netns_frags)
is placed on the same write-often cache-line as e.g. refcnt and
lock, and the hash match check always reads it (q->net == nf).
This (of course) only happens on hash bucket collisions, but as the
current hash size is only 64, collisions are fairly likely.
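For reference, a minimal sketch of the hot lookup loop (paraphrased from
inet_frag_find(); not a verbatim copy) showing where the read of q->net
lands next to the write-often fields:

    hlist_for_each_entry(q, n, &f->hash[hash], list) {
            /* The q->net read below touches the same cache-line that
             * other CPUs dirty through q->refcnt and q->lock. */
            if (q->net == nf && f->match(q, key)) {
                    atomic_inc(&q->refcnt);
                    return q;
            }
    }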
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/net/inet_frag.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 7b897b2..1f75316 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -14,7 +14,6 @@ struct netns_frags {
struct inet_frag_queue {
struct hlist_node list;
- struct netns_frags *net;
struct list_head lru_list; /* lru list member */
spinlock_t lock;
atomic_t refcnt;
@@ -24,6 +23,7 @@ struct inet_frag_queue {
ktime_t stamp;
int len; /* total length of orig datagram */
int meat;
+ struct netns_frags *net;
u32 creation_ts;/* jiffies when queue was created*/
__u8 last_in; /* first/last segment arrived? */
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC net-next PATCH V1 4/9] net: frag helper functions for mem limit tracking
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 2/9] net: frag cache line adjust inet_frag_queue.net Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
` (5 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
This change is primarily a preparation to ease the extension of memory
limit tracking.
The change also reduces the number of atomic operations during freeing of
a frag queue. This gives a fairly good performance improvement, as
these atomic operations are at the core of the performance problems
seen on NUMA systems.
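The essence of the change, in inet_frag_destroy(), is to sum up
skb->truesize locally while walking the fragment list and account the
total with a single atomic operation, instead of one atomic_sub() per
skb (a condensed sketch of the hunk below):

    unsigned int sum_truesize = 0;

    while (fp) {
            struct sk_buff *xp = fp->next;
            sum_truesize += fp->truesize;     /* accumulate locally */
            frag_kfree_skb(nf, f, fp);        /* no atomic_sub() per skb */
            fp = xp;
    }
    sub_frag_mem_limit(q, sum_truesize + f->qsize);  /* one atomic op */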
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/net/inet_frag.h | 28 ++++++++++++++++++++++++++++
include/net/ipv6.h | 2 +-
net/ipv4/inet_fragment.c | 27 +++++++++++++--------------
net/ipv4/ip_fragment.c | 24 +++++++++++-------------
net/ipv6/netfilter/nf_conntrack_reasm.c | 6 +++---
net/ipv6/reassembly.c | 6 +++---
6 files changed, 59 insertions(+), 34 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 312a3fa..9bbef17 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -95,4 +95,32 @@ static inline void inet_frag_lru_add(struct netns_frags *nf,
list_add_tail(&q->lru_list, &nf->lru_list);
spin_unlock(&nf->lru_lock);
}
+
+/* Memory Tracking Functions. */
+
+static inline int frag_mem_limit(struct netns_frags *nf)
+{
+ return atomic_read(&nf->mem);
+}
+
+static inline void sub_frag_mem_limit(struct inet_frag_queue *q, int i)
+{
+ atomic_sub(i, &q->net->mem);
+}
+
+static inline void add_frag_mem_limit(struct inet_frag_queue *q, int i)
+{
+ atomic_add(i, &q->net->mem);
+}
+
+static inline void init_frag_mem_limit(struct netns_frags *nf)
+{
+ atomic_set(&nf->mem, 0);
+}
+
+static inline int sum_frag_mem_limit(struct netns_frags *nf)
+{
+ return atomic_read(&nf->mem);
+}
+
#endif
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 979bf6c..a5c1cf1 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -279,7 +279,7 @@ static inline int ip6_frag_nqueues(struct net *net)
static inline int ip6_frag_mem(struct net *net)
{
- return atomic_read(&net->ipv6.frags.mem);
+ return sum_frag_mem_limit(&net->ipv6.frags);
}
#endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 47141ab..4f1ab8a 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -73,7 +73,7 @@ EXPORT_SYMBOL(inet_frags_init);
void inet_frags_init_net(struct netns_frags *nf)
{
nf->nqueues = 0;
- atomic_set(&nf->mem, 0);
+ init_frag_mem_limit(nf);
INIT_LIST_HEAD(&nf->lru_list);
spin_lock_init(&nf->lru_lock);
}
@@ -118,12 +118,8 @@ void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
EXPORT_SYMBOL(inet_frag_kill);
static inline void frag_kfree_skb(struct netns_frags *nf, struct inet_frags *f,
- struct sk_buff *skb, int *work)
+ struct sk_buff *skb)
{
- if (work)
- *work -= skb->truesize;
-
- atomic_sub(skb->truesize, &nf->mem);
if (f->skb_free)
f->skb_free(skb);
kfree_skb(skb);
@@ -134,6 +130,7 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
{
struct sk_buff *fp;
struct netns_frags *nf;
+ unsigned int sum, sum_truesize = 0;
WARN_ON(!(q->last_in & INET_FRAG_COMPLETE));
WARN_ON(del_timer(&q->timer) != 0);
@@ -144,13 +141,14 @@ void inet_frag_destroy(struct inet_frag_queue *q, struct inet_frags *f,
while (fp) {
struct sk_buff *xp = fp->next;
- frag_kfree_skb(nf, f, fp, work);
+ sum_truesize += fp->truesize;
+ frag_kfree_skb(nf, f, fp);
fp = xp;
}
-
+ sum = sum_truesize + f->qsize;
if (work)
- *work -= f->qsize;
- atomic_sub(f->qsize, &nf->mem);
+ *work -= sum;
+ sub_frag_mem_limit(q, sum);
if (f->destructor)
f->destructor(q);
@@ -165,11 +163,11 @@ int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force)
int work, evicted = 0;
if (!force) {
- if (atomic_read(&nf->mem) <= nf->high_thresh)
+ if (frag_mem_limit(nf) <= nf->high_thresh)
return 0;
}
- work = atomic_read(&nf->mem) - nf->low_thresh;
+ work = frag_mem_limit(nf) - nf->low_thresh;
while (work > 0) {
spin_lock(&nf->lru_lock);
@@ -269,7 +267,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
/* Guard creations of new frag queues if mem limit reached, as
* we allow warm/recent elements to survive in inet_frag_evictor()
*/
- if (atomic_read(&nf->mem) > nf->high_thresh)
+ if (frag_mem_limit(nf) > nf->high_thresh)
return NULL;
q = kzalloc(f->qsize, GFP_ATOMIC);
@@ -279,7 +277,8 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
q->creation_ts = (u32) jiffies;
q->net = nf;
f->constructor(q, arg);
- atomic_add(f->qsize, &nf->mem);
+ add_frag_mem_limit(q, f->qsize);
+
setup_timer(&q->timer, f->frag_expire, (unsigned long)q);
spin_lock_init(&q->lock);
atomic_set(&q->refcnt, 1);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 04c9e53..cc36a0b 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -122,7 +122,7 @@ int ip_frag_nqueues(struct net *net)
int ip_frag_mem(struct net *net)
{
- return atomic_read(&net->ipv4.frags.mem);
+ return sum_frag_mem_limit(&net->ipv4.frags);
}
static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
@@ -161,13 +161,6 @@ static bool ip4_frag_match(struct inet_frag_queue *q, void *a)
qp->user == arg->user;
}
-/* Memory Tracking Functions. */
-static void frag_kfree_skb(struct netns_frags *nf, struct sk_buff *skb)
-{
- atomic_sub(skb->truesize, &nf->mem);
- kfree_skb(skb);
-}
-
static void ip4_frag_init(struct inet_frag_queue *q, void *a)
{
struct ipq *qp = container_of(q, struct ipq, q);
@@ -340,6 +333,7 @@ static inline int ip_frag_too_far(struct ipq *qp)
static int ip_frag_reinit(struct ipq *qp)
{
struct sk_buff *fp;
+ unsigned int sum_truesize = 0;
if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
atomic_inc(&qp->q.refcnt);
@@ -349,9 +343,12 @@ static int ip_frag_reinit(struct ipq *qp)
fp = qp->q.fragments;
do {
struct sk_buff *xp = fp->next;
- frag_kfree_skb(qp->q.net, fp);
+
+ sum_truesize += fp->truesize;
+ kfree_skb(fp);
fp = xp;
} while (fp);
+ sub_frag_mem_limit(&qp->q, sum_truesize);
qp->q.last_in = 0;
qp->q.len = 0;
@@ -496,7 +493,8 @@ found:
qp->q.fragments = next;
qp->q.meat -= free_it->len;
- frag_kfree_skb(qp->q.net, free_it);
+ sub_frag_mem_limit(&qp->q, free_it->truesize);
+ kfree_skb(free_it);
}
}
@@ -519,7 +517,7 @@ found:
qp->q.stamp = skb->tstamp;
qp->q.meat += skb->len;
qp->ecn |= ecn;
- atomic_add(skb->truesize, &qp->q.net->mem);
+ add_frag_mem_limit(&qp->q, skb->truesize);
if (offset == 0)
qp->q.last_in |= INET_FRAG_FIRST_IN;
@@ -614,7 +612,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
head->len -= clone->len;
clone->csum = 0;
clone->ip_summed = head->ip_summed;
- atomic_add(clone->truesize, &qp->q.net->mem);
+ add_frag_mem_limit(&qp->q, clone->truesize);
}
skb_push(head, head->data - skb_network_header(head));
@@ -642,7 +640,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
}
fp = next;
}
- atomic_sub(sum_truesize, &qp->q.net->mem);
+ sub_frag_mem_limit(&qp->q, sum_truesize);
head->next = NULL;
head->dev = dev;
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index b0a1c96..c088831 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -316,7 +316,7 @@ found:
fq->q.meat += skb->len;
if (payload_len > fq->q.max_size)
fq->q.max_size = payload_len;
- atomic_add(skb->truesize, &fq->q.net->mem);
+ add_frag_mem_limit(&fq->q, skb->truesize);
/* The first fragment.
* nhoffset is obtained from the first fragment, of course.
@@ -394,7 +394,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
clone->ip_summed = head->ip_summed;
NFCT_FRAG6_CB(clone)->orig = NULL;
- atomic_add(clone->truesize, &fq->q.net->mem);
+ add_frag_mem_limit(&fq->q, clone->truesize);
}
/* We have to remove fragment header from datagram and to relocate
@@ -418,7 +418,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
head->csum = csum_add(head->csum, fp->csum);
head->truesize += fp->truesize;
}
- atomic_sub(head->truesize, &fq->q.net->mem);
+ sub_frag_mem_limit(&fq->q, head->truesize);
head->local_df = 1;
head->next = NULL;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index b38f290..9cfe047 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -327,7 +327,7 @@ found:
}
fq->q.stamp = skb->tstamp;
fq->q.meat += skb->len;
- atomic_add(skb->truesize, &fq->q.net->mem);
+ add_frag_mem_limit(&fq->q, skb->truesize);
/* The first fragment.
* nhoffset is obtained from the first fragment, of course.
@@ -426,7 +426,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
head->len -= clone->len;
clone->csum = 0;
clone->ip_summed = head->ip_summed;
- atomic_add(clone->truesize, &fq->q.net->mem);
+ add_frag_mem_limit(&fq->q, clone->truesize);
}
/* We have to remove fragment header from datagram and to relocate
@@ -464,7 +464,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
}
fp = next;
}
- atomic_sub(sum_truesize, &fq->q.net->mem);
+ sub_frag_mem_limit(&fq->q, sum_truesize);
head->next = NULL;
head->dev = dev;
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 2/9] net: frag cache line adjust inet_frag_queue.net Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 4/9] net: frag helper functions for mem limit tracking Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-27 9:07 ` Jesper Dangaard Brouer
2012-11-27 15:00 ` Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 8/9] net: increase frag queue hash size and cache-line Jesper Dangaard Brouer
` (4 subsequent siblings)
7 siblings, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
DO NOT apply - patch not finished, can cause an OOPS/PANIC during hash rebuild
This patch implements per hash bucket locking for the frag queue
hash. This removes two write locks, and the only remaining write
lock is for protecting hash rebuild. This essentially reduces the
readers-writer lock to a rebuild lock.
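The resulting lock nesting in the lookup/unlink paths is (condensed from
the diff below): take the rebuild lock shared, compute the hash, then
take only that bucket's chain lock exclusively:

    read_lock(&f->lock);             /* excludes hash rebuild (shared) */
    hash = f->hashfn(fq);
    hb = &f->hash[hash];

    spin_lock_bh(&hb->chain_lock);   /* per-bucket exclusion */
    hlist_del(&fq->list);            /* or add/lookup on hb->chain */
    spin_unlock_bh(&hb->chain_lock);
    read_unlock(&f->lock);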
NOT-Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/net/inet_frag.h | 10 +++++++-
net/ipv4/inet_fragment.c | 56 +++++++++++++++++++++++++++++++++++-----------
2 files changed, 51 insertions(+), 15 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 9938ea4..1efec6b 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -51,9 +51,15 @@ struct inet_frag_queue {
#define INETFRAGS_HASHSZ 64
+
+struct inet_frag_bucket {
+ struct hlist_head chain;
+ spinlock_t chain_lock;
+};
+
struct inet_frags {
- struct hlist_head hash[INETFRAGS_HASHSZ];
- rwlock_t lock;
+ struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
+ rwlock_t lock; /* Rebuild lock */
u32 rnd;
int qsize;
int secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 1620a21..447423f 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -35,20 +35,27 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
unsigned long now = jiffies;
int i;
+ /* Per bucket lock NOT needed here, due to write lock protection */
write_lock(&f->lock);
+
get_random_bytes(&f->rnd, sizeof(u32));
for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+ struct inet_frag_bucket *hb;
struct inet_frag_queue *q;
struct hlist_node *p, *n;
- hlist_for_each_entry_safe(q, p, n, &f->hash[i], list) {
+ hb = &f->hash[i];
+ hlist_for_each_entry_safe(q, p, n, &hb->chain, list) {
unsigned int hval = f->hashfn(q);
if (hval != i) {
+ struct inet_frag_bucket *hb_dest;
+
hlist_del(&q->list);
/* Relink to new hash chain. */
- hlist_add_head(&q->list, &f->hash[hval]);
+ hb_dest = &f->hash[hval];
+ hlist_add_head(&q->list, &hb->chain);
}
}
}
@@ -61,9 +68,12 @@ void inet_frags_init(struct inet_frags *f)
{
int i;
- for (i = 0; i < INETFRAGS_HASHSZ; i++)
- INIT_HLIST_HEAD(&f->hash[i]);
+ for (i = 0; i < INETFRAGS_HASHSZ; i++) {
+ struct inet_frag_bucket *hb = &f->hash[i];
+ spin_lock_init(&hb->chain_lock);
+ INIT_HLIST_HEAD(&hb->chain);
+ }
rwlock_init(&f->lock);
f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
@@ -102,9 +112,17 @@ EXPORT_SYMBOL(inet_frags_exit_net);
static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
{
- write_lock(&f->lock);
+ struct inet_frag_bucket *hb;
+ unsigned int hash;
+
+ read_lock(&f->lock);
+ hash = f->hashfn(fq);
+ hb = &f->hash[hash];
+
+ spin_lock_bh(&hb->chain_lock);
hlist_del(&fq->list);
- write_unlock(&f->lock);
+ spin_unlock_bh(&hb->chain_lock);
+ read_unlock(&f->lock);
inet_frag_lru_del(fq);
}
@@ -224,28 +242,33 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
struct inet_frag_queue *qp_in, struct inet_frags *f,
void *arg)
{
+ struct inet_frag_bucket *hb;
struct inet_frag_queue *qp;
#ifdef CONFIG_SMP
struct hlist_node *n;
#endif
unsigned int hash;
- write_lock(&f->lock);
+ read_lock(&f->lock); /* Protects against hash rebuild */
/*
* While we stayed w/o the lock other CPU could update
* the rnd seed, so we need to re-calculate the hash
* chain. Fortunatelly the qp_in can be used to get one.
*/
hash = f->hashfn(qp_in);
+ hb = &f->hash[hash];
+ spin_lock_bh(&hb->chain_lock);
+
#ifdef CONFIG_SMP
/* With SMP race we have to recheck hash table, because
* such entry could be created on other cpu, while we
- * promoted read lock to write lock.
+ * promoted read lock to write lock. ***Comment FIXME***
*/
- hlist_for_each_entry(qp, n, &f->hash[hash], list) {
+ hlist_for_each_entry(qp, n, &hb->chain, list) {
if (qp->net == nf && f->match(qp, arg)) {
atomic_inc(&qp->refcnt);
- write_unlock(&f->lock);
+ spin_unlock_bh(&hb->chain_lock);
+ read_unlock(&f->lock);
qp_in->last_in |= INET_FRAG_COMPLETE;
inet_frag_put(qp_in, f);
return qp;
@@ -257,8 +280,9 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
atomic_inc(&qp->refcnt);
atomic_inc(&qp->refcnt);
- hlist_add_head(&qp->list, &f->hash[hash]);
- write_unlock(&f->lock);
+ hlist_add_head(&qp->list, &hb->chain);
+ spin_unlock_bh(&hb->chain_lock);
+ read_unlock(&f->lock);
inet_frag_lru_add(nf, qp);
return qp;
}
@@ -307,16 +331,22 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
struct inet_frags *f, void *key, unsigned int hash)
__releases(&f->lock)
{
+ struct inet_frag_bucket *hb;
struct inet_frag_queue *q;
struct hlist_node *n;
- hlist_for_each_entry(q, n, &f->hash[hash], list) {
+ hb = &f->hash[hash];
+
+ spin_lock_bh(&hb->chain_lock);
+ hlist_for_each_entry(q, n, &hb->chain, list) {
if (q->net == nf && f->match(q, key)) {
atomic_inc(&q->refcnt);
+ spin_unlock_bh(&hb->chain_lock);
read_unlock(&f->lock);
return q;
}
}
+ spin_unlock_bh(&hb->chain_lock);
read_unlock(&f->lock);
return inet_frag_create(nf, f, key);
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket
2012-11-23 13:08 ` [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
@ 2012-11-27 9:07 ` Jesper Dangaard Brouer
2012-11-27 15:00 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-27 9:07 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> DO NOT apply - patch not finished, can cause an OOPS/PANIC during hash rebuild
I think I have fixed this issue. It is even fixed in this patch, as the
oops occurred during testing with a slightly older version of this
patch.
> This patch implements per hash bucket locking for the frag queue
> hash. This removes two write locks, and the only remaining write
> lock is for protecting hash rebuild. This essentially reduces the
> readers-writer lock to a rebuild lock.
[... cut ...]
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 1620a21..447423f 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
[...]
> @@ -102,9 +112,17 @@ EXPORT_SYMBOL(inet_frags_exit_net);
>
> static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
> {
> - write_lock(&f->lock);
> + struct inet_frag_bucket *hb;
> + unsigned int hash;
> +
> + read_lock(&f->lock);
> + hash = f->hashfn(fq);
> + hb = &f->hash[hash];
Before, the read_lock was taken _here_, below the hash calculation. That
was wrong and caused the oops, as the hash result can change when rnd is
changed during a hash rebuild.
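Roughly, the older (broken) version did the equivalent of (a hypothetical
reconstruction, not the code quoted above):

    hash = f->hashfn(fq);            /* uses f->rnd ...                */
    read_lock(&f->lock);             /* ... before rebuild is excluded */
    hb = &f->hash[hash];             /* may now index the wrong bucket */

whereas the quoted version takes the read_lock before computing the hash.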
> +
> + spin_lock_bh(&hb->chain_lock);
> hlist_del(&fq->list);
> - write_unlock(&f->lock);
> + spin_unlock_bh(&hb->chain_lock);
> + read_unlock(&f->lock);
> inet_frag_lru_del(fq);
> }
[... cut ...]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket
2012-11-23 13:08 ` [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
2012-11-27 9:07 ` Jesper Dangaard Brouer
@ 2012-11-27 15:00 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-27 15:00 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> DO NOT apply - patch not finished, can cause an OOPS/PANIC during hash rebuild
>
> This patch implements per hash bucket locking for the frag queue
> hash. This removes two write locks, and the only remaining write
> lock is for protecting hash rebuild. This essentially reduces the
> readers-writer lock to a rebuild lock.
>
> NOT-Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
The last bug mentioned was not the only one... hopefully this fixes the last bug in this patch.
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 1620a21..447423f 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -35,20 +35,27 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
> unsigned long now = jiffies;
> int i;
>
> + /* Per bucket lock NOT needed here, due to write lock protection */
> write_lock(&f->lock);
> +
> get_random_bytes(&f->rnd, sizeof(u32));
> for (i = 0; i < INETFRAGS_HASHSZ; i++) {
> + struct inet_frag_bucket *hb;
> struct inet_frag_queue *q;
> struct hlist_node *p, *n;
>
> - hlist_for_each_entry_safe(q, p, n, &f->hash[i], list) {
> + hb = &f->hash[i];
> + hlist_for_each_entry_safe(q, p, n, &hb->chain, list) {
> unsigned int hval = f->hashfn(q);
>
> if (hval != i) {
> + struct inet_frag_bucket *hb_dest;
> +
> hlist_del(&q->list);
>
> /* Relink to new hash chain. */
> - hlist_add_head(&q->list, &f->hash[hval]);
> + hb_dest = &f->hash[hval];
> + hlist_add_head(&q->list, &hb->chain);
The above line was wrong; it should have been:
hlist_add_head(&q->list, &hb_dest->chain);
> }
> }
> }
The patch seems quite stable now. My test is to adjust the rebuild
interval to 2 sec and then run 4x 10G with two fragments (packet size
1472*2) to create as many fragments as possible (approx 300
inet_frag_queue elements).
30 min test run:
3726+3896+3960+3608 = 15190 Mbit/s
(For reproducers, note that changing ipfrag_secret_interval (e.g.
sysctl -w net/ipv4/ipfrag_secret_interval=2) first takes effect after
the first interval/timer expires, which defaults to 10 min.)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC net-next PATCH V1 8/9] net: increase frag queue hash size and cache-line
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
` (2 preceding siblings ...)
2012-11-23 13:08 ` [RFC net-next PATCH V1 7/9] net: frag queue locking per hash bucket Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-23 13:08 ` [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack) Jesper Dangaard Brouer
` (3 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
Increase the frag queue hash size and ensure cache-line alignment to
avoid false sharing. The hash size is set to 256 because I have
observed 206 frag queues in use at 4x10G with packet size 4416 bytes.
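(Assuming 64-byte cache lines (the actual line size is arch dependent),
____cacheline_aligned_in_smp pads each inet_frag_bucket to a full line,
so the hash array grows to 256 * 64 bytes = 16 KiB per struct inet_frags
instance.)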
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
include/net/inet_frag.h | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 1efec6b..5054228 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -49,13 +49,12 @@ struct inet_frag_queue {
u16 max_size;
};
-#define INETFRAGS_HASHSZ 64
-
+#define INETFRAGS_HASHSZ 256
struct inet_frag_bucket {
struct hlist_head chain;
spinlock_t chain_lock;
-};
+} ____cacheline_aligned_in_smp;
struct inet_frags {
struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack)
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
` (3 preceding siblings ...)
2012-11-23 13:08 ` [RFC net-next PATCH V1 8/9] net: increase frag queue hash size and cache-line Jesper Dangaard Brouer
@ 2012-11-23 13:08 ` Jesper Dangaard Brouer
2012-11-26 6:03 ` Stephen Hemminger
2012-11-26 9:18 ` Florian Westphal
[not found] ` <20121123130806.18764.41854.stgit@dragon>
` (2 subsequent siblings)
7 siblings, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-23 13:08 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, Pablo Neira Ayuso, Thomas Graf,
Cong Wang, Patrick McHardy, Paul E. McKenney, Herbert Xu
Do NOT APPLY this patch.
After all the other patches, the rw_lock is now the contention point.
This is a quick hack that removes the readers-writer lock by
disabling/breaking hash rebuilding, just to see how big the
performance gain would be.
2x10G size(4416) result: 6481+6764 = 13245 Mbit/s (gen: 7652+8077 Mbit/s)
4x10G size(4416) result:(5610+6283+5735+5238)=22866 Mbit/s
(gen: 6530+7860+5967+5238 =25595 Mbit/s)
And the results show that it's a big win. With 4x10G size(4416):
before 17923 Mbit/s -> now 22866 Mbit/s, an increase of 4943 Mbit/s.
With 2x10G size(4416): before 10689 Mbit/s -> now 13245 Mbit/s,
an increase of 2556 Mbit/s.
I'll work on a real solution for removing the rw_lock while still
supporting hash rebuilding. Suggestions and ideas are welcome.
NOT-signed-off
---
include/net/inet_frag.h | 2 +-
net/ipv4/inet_fragment.c | 23 +++++++++++++----------
net/ipv4/ip_fragment.c | 2 +-
net/ipv6/netfilter/nf_conntrack_reasm.c | 2 +-
net/ipv6/reassembly.c | 2 +-
5 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 5054228..2fb8578 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -58,7 +58,7 @@ struct inet_frag_bucket {
struct inet_frags {
struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
- rwlock_t lock; /* Rebuild lock */
+// rwlock_t lock; /* Rebuild lock */
u32 rnd;
int qsize;
int secret_interval;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 447423f..63227d6 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -35,8 +35,11 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
unsigned long now = jiffies;
int i;
+ //HACK don't rebuild
+ return;
+
/* Per bucket lock NOT needed here, due to write lock protection */
- write_lock(&f->lock);
+// write_lock(&f->lock);
get_random_bytes(&f->rnd, sizeof(u32));
for (i = 0; i < INETFRAGS_HASHSZ; i++) {
@@ -59,7 +62,7 @@ static void inet_frag_secret_rebuild(unsigned long dummy)
}
}
}
- write_unlock(&f->lock);
+// write_unlock(&f->lock);
mod_timer(&f->secret_timer, now + f->secret_interval);
}
@@ -74,7 +77,7 @@ void inet_frags_init(struct inet_frags *f)
spin_lock_init(&hb->chain_lock);
INIT_HLIST_HEAD(&hb->chain);
}
- rwlock_init(&f->lock);
+// rwlock_init(&f->lock);
f->rnd = (u32) ((num_physpages ^ (num_physpages>>7)) ^
(jiffies ^ (jiffies >> 6)));
@@ -115,14 +118,14 @@ static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
struct inet_frag_bucket *hb;
unsigned int hash;
- read_lock(&f->lock);
+ //read_lock(&f->lock);
hash = f->hashfn(fq);
hb = &f->hash[hash];
spin_lock_bh(&hb->chain_lock);
hlist_del(&fq->list);
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
inet_frag_lru_del(fq);
}
@@ -249,7 +252,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
#endif
unsigned int hash;
- read_lock(&f->lock); /* Protects against hash rebuild */
+ //read_lock(&f->lock); /* Protects against hash rebuild */
/*
* While we stayed w/o the lock other CPU could update
* the rnd seed, so we need to re-calculate the hash
@@ -268,7 +271,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
if (qp->net == nf && f->match(qp, arg)) {
atomic_inc(&qp->refcnt);
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
qp_in->last_in |= INET_FRAG_COMPLETE;
inet_frag_put(qp_in, f);
return qp;
@@ -282,7 +285,7 @@ static struct inet_frag_queue *inet_frag_intern(struct netns_frags *nf,
atomic_inc(&qp->refcnt);
hlist_add_head(&qp->list, &hb->chain);
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
inet_frag_lru_add(nf, qp);
return qp;
}
@@ -342,12 +345,12 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
if (q->net == nf && f->match(q, key)) {
atomic_inc(&q->refcnt);
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
return q;
}
}
spin_unlock_bh(&hb->chain_lock);
- read_unlock(&f->lock);
+ //read_unlock(&f->lock);
return inet_frag_create(nf, f, key);
}
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 7b1cf51..b2cb05f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -289,7 +289,7 @@ static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
arg.iph = iph;
arg.user = user;
- read_lock(&ip4_frags.lock);
+ //read_lock(&ip4_frags.lock);
hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index c088831..5b57e03 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -175,7 +175,7 @@ static inline struct frag_queue *fq_find(struct net *net, __be32 id,
arg.src = src;
arg.dst = dst;
- read_lock_bh(&nf_frags.lock);
+ //read_lock_bh(&nf_frags.lock);
hash = inet6_hash_frag(id, src, dst, nf_frags.rnd);
q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 9cfe047..2c74394 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -193,7 +193,7 @@ fq_find(struct net *net, __be32 id, const struct in6_addr *src, const struct in6
arg.src = src;
arg.dst = dst;
- read_lock(&ip6_frags.lock);
+ //read_lock(&ip6_frags.lock);
hash = inet6_hash_frag(id, src, dst, ip6_frags.rnd);
q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash);
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack)
2012-11-23 13:08 ` [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack) Jesper Dangaard Brouer
@ 2012-11-26 6:03 ` Stephen Hemminger
2012-11-26 9:18 ` Florian Westphal
1 sibling, 0 replies; 19+ messages in thread
From: Stephen Hemminger @ 2012-11-26 6:03 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
Pablo Neira Ayuso, Thomas Graf, Cong Wang, Patrick McHardy,
Paul E. McKenney, Herbert Xu
On Fri, 23 Nov 2012 14:08:47 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> Do NOT APPLY this patch.
>
> After all the other patches, the rw_lock is now the contention point.
Reader-writer locks are significantly slower than a simple spinlock.
Unless the reader holds the lock for a "significant" number of cycles,
a spin lock will be faster.
Did you try just changing it to a spinlock?
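For illustration, the minimal variant of that suggestion would be a
hypothetical change along these lines (not a posted patch):

    struct inet_frags {
            struct inet_frag_bucket hash[INETFRAGS_HASHSZ];
            spinlock_t              lock;    /* instead of rwlock_t */
            ...
    };

with both the lookup paths and the rebuild timer taking
spin_lock(&f->lock)/spin_unlock(&f->lock) around the hash access.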
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack)
2012-11-23 13:08 ` [RFC net-next PATCH V1 9/9] net: frag remove readers-writer lock (hack) Jesper Dangaard Brouer
2012-11-26 6:03 ` Stephen Hemminger
@ 2012-11-26 9:18 ` Florian Westphal
1 sibling, 0 replies; 19+ messages in thread
From: Florian Westphal @ 2012-11-26 9:18 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
Pablo Neira Ayuso, Thomas Graf, Cong Wang, Patrick McHardy,
Paul E. McKenney, Herbert Xu
Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> After all the other patches, the rw_lock is now the contention point.
>
> This is a quick hack that removes the readers-writer lock by
> disabling/breaking hash rebuilding, just to see how big the
> performance gain would be.
>
> 2x10G size(4416) result: 6481+6764 = 13245 Mbit/s (gen: 7652+8077 Mbit/s)
>
> 4x10G size(4416) result:(5610+6283+5735+5238)=22866 Mbit/s
> (gen: 6530+7860+5967+5238 =25595 Mbit/s)
>
> And the results show that it's a big win. With 4x10G size(4416):
> before 17923 Mbit/s -> now 22866 Mbit/s, an increase of 4943 Mbit/s.
> With 2x10G size(4416): before 10689 Mbit/s -> now 13245 Mbit/s,
> an increase of 2556 Mbit/s.
>
> I'll work on a real solution for removing the rw_lock while still
> supporting hash rebuilding. Suggestions and ideas are welcome.
<devils advocate>
Why not kill it altogether, and just set a new secret_interval
without moving frag queues to their new location?
The only consequence is that all fragments queued at the time of
changing secret_interval will be lost, and freed by the evictor/timer.
The default secret rebuild interval is 10 minutes; should we care about
a small packet loss every 10 minutes?
</devils advocate>
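A minimal sketch of what that would reduce inet_frag_secret_rebuild() to
(hypothetical, following the suggestion above; stale queues simply stop
matching and are reaped by their timers or the evictor):

    static void inet_frag_secret_rebuild(unsigned long dummy)
    {
            struct inet_frags *f = (struct inet_frags *)dummy;

            /* Change the secret, but do not relink existing queues. */
            get_random_bytes(&f->rnd, sizeof(u32));
            mod_timer(&f->secret_timer, jiffies + f->secret_interval);
    }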
^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <20121123130806.18764.41854.stgit@dragon>]
* Re: [RFC net-next PATCH V1 1/9] net: frag evictor, avoid killing warm frag queues
[not found] ` <20121123130806.18764.41854.stgit@dragon>
@ 2012-11-23 19:58 ` Florian Westphal
2012-11-24 11:36 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 19+ messages in thread
From: Florian Westphal @ 2012-11-23 19:58 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
Pablo Neira Ayuso, Thomas Graf, Cong Wang, Patrick McHardy,
Paul E. McKenney, Herbert Xu
Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> +// TODO: Idea what about also looking at flag INET_FRAG_FIRST_IN
> +// just as safe-guard against frags with a dropped "head" packet
> + if (!force && q->creation_ts == (u32) jiffies) {
I think we should not rely on head fragment arriving first.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 1/9] net: frag evictor, avoid killing warm frag queues
2012-11-23 19:58 ` [RFC net-next PATCH V1 1/9] net: frag evictor, avoid killing warm frag queues Florian Westphal
@ 2012-11-24 11:36 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-24 11:36 UTC (permalink / raw)
To: Florian Westphal
Cc: Eric Dumazet, David S. Miller, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 20:58 +0100, Florian Westphal wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > +// TODO: Idea what about also looking at flag INET_FRAG_FIRST_IN
> > +// just as safe-guard against frags with a dropped "head" packet
> > + if (!force && q->creation_ts == (u32) jiffies) {
>
> I think we should not rely on head fragment arriving first.
Good point. Let's keep the current approach.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-23 13:08 [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Jesper Dangaard Brouer
` (5 preceding siblings ...)
[not found] ` <20121123130806.18764.41854.stgit@dragon>
@ 2012-11-25 2:31 ` Eric Dumazet
2012-11-25 8:53 ` Jesper Dangaard Brouer
[not found] ` <20121123130826.18764.66507.stgit@dragon>
7 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-11-25 2:31 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> This patchset implements significant performance improvements for
> fragmentation handling in the kernel, with a focus on NUMA and SMP
> based systems.
>
> Review:
>
> Please review these patches. I have on purpose added comments in the
> code with the "//" comments style. These comments are to be removed
> before applying. They serve as questions to you, the reviewer.
>
> The fragmentation code today:
>
> The fragmentation code "protects" kernel resources, by implementing
> some memory resource limitation code. This is centered around a
> global readers-writer lock, and (per network namespace) an atomic mem
> counter and a LRU (Least-Recently-Used) list. (Although separate
> global variables and namespace resources, are kept for IPv4, IPv6
> and Netfilter reassembly.)
>
> The code tries to keep the memory usage between a high and low
> threshold (see: /proc/sys/net/ipv4/ipfrag_{high,low}_thresh). The
> "evictor" code cleans up fragments, when the high threshold is
> exceeded, and stops only, when the low threshold is reached.
>
> The scalability problem:
>
> Having a global/central variable for a resource limit is obviously a
> scalability issue on SMP systems, and even amplified on a NUMA based
> system.
>
But ... what practical workload even uses fragments?
Sure, netperf -t UDP_STREAM uses frags, but it's a benchmark.
The only heavy user was NFS in the days it was using UDP, a very long
time ago.
A single lost fragment means the whole packet is lost.
Another problem with fragments is the lack of 4-tuple hashing, as only
the first frag contains the dst/src ports.
Also there is the sysctl_ipfrag_max_dist issue...
Hint: many NICs provide TSO (TCP offload), but none provide UFO,
probably because there is no demand for it.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-25 2:31 ` [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems Eric Dumazet
@ 2012-11-25 8:53 ` Jesper Dangaard Brouer
2012-11-25 16:11 ` Eric Dumazet
0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-25 8:53 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Sat, 2012-11-24 at 18:31 -0800, Eric Dumazet wrote:
> On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> > This patchset implements significant performance improvements for
> > fragmentation handling in the kernel, with a focus on NUMA and SMP
> > based systems.
> >
> > Review:
> >
> > Please review these patches. I have on purpose added comments in the
> > code with the "//" comments style. These comments are to be removed
> > before applying. They serve as questions to you, the reviewer.
> >
> > The fragmentation code today:
> >
> > The fragmentation code "protects" kernel resources, by implementing
> > some memory resource limitation code. This is centered around a
> > global readers-writer lock, and (per network namespace) an atomic mem
> > counter and a LRU (Least-Recently-Used) list. (Although separate
> > global variables and namespace resources, are kept for IPv4, IPv6
> > and Netfilter reassembly.)
> >
> > The code tries to keep the memory usage between a high and low
> > threshold (see: /proc/sys/net/ipv4/ipfrag_{high,low}_thresh). The
> > "evictor" code cleans up fragments, when the high threshold is
> > exceeded, and stops only, when the low threshold is reached.
> >
> > The scalability problem:
> >
> > Having a global/central variable for a resource limit is obviously a
> > scalability issue on SMP systems, and even amplified on a NUMA based
> > system.
> >
>
>
> But ... , what practical workload even use fragments ?
(1) DNS (default for Bind) will use up to 3 UDP fragments before
switching to TCP. This is getting more and more relevant after the
introduction of DNSSEC. That's why I'm explicitly testing the 3xUDP
fragments case so heavily.
(2) For IPVS (load-balancing) we have recently allowed fragmentation in
tunnel mode, towards the realservers (to hide the MTU reduction for the
clients). Thus, we need better frag performance in this case.
(3) I also have a customer that has a usage scenario/application (at
4x10G) that needs this... but I'm trying to convince them to fix/change
their application...
Scenario (1) is the real reason I want to fix this scalability issue in
the code.
> Sure, netperf -t UDP_STREAM uses frags, but its a benchmark.
Yes, for the default large 64k packet size, it's just a "fake"
benchmark. And notice that with my fixes, we are even faster than the
none-frag/single-UDP packet case... but that's because we are getting a
GSO/GRO effect.
That's why I'm adjusting the UDP "frag" packet size to get a more
realistic use case... to simulate the DNS use-case (1).
> The only heavy user was NFS in the days it was using UDP, a very long
> time ago.
>
> A single lost fragment means the whole packet is lost.
That is correct, and that's why we need the fix in patch-01.
(It actually reminds me of the problem with ADSL/ATM, where (small) ATM
frames are used for carrying IP packets, and when some (more central) ATM
link gets overloaded and starts to drop ATM frames, it does so without
taking the AAL5 packets into account).
> Another problem with fragments is the lack of 4-tuple hashing, as only
> the first frag contains the dst/src ports.
>
> Also there is the sysctl_ipfrag_max_dist issue...
>
> Hint : many NIC provide TSO (TCP offload), but none provide UFO,
> probably because there is no demand for it.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-25 8:53 ` Jesper Dangaard Brouer
@ 2012-11-25 16:11 ` Eric Dumazet
2012-11-26 14:42 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-11-25 16:11 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> Yes, for the default large 64k packets size, its just a "fake"
> benchmark. And notice with my fixes, we are even faster than the
> none-frag/single-UDP packet case... but its because we are getting a
> GSO/GRO effect.
Could you elaborate on this GSO/GRO effect ?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-25 16:11 ` Eric Dumazet
@ 2012-11-26 14:42 ` Jesper Dangaard Brouer
2012-11-26 15:15 ` Eric Dumazet
0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-26 14:42 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
>
> > Yes, for the default large 64k packets size, its just a "fake"
> > benchmark. And notice with my fixes, we are even faster than the
> > none-frag/single-UDP packet case... but its because we are getting a
> > GSO/GRO effect.
>
> Could you elaborate on this GSO/GRO effect ?
On the big system, I saw none-frag UDP (1472 bytes) throughput of:
7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
While with UDP fragments size 65507 bytes I saw:
9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
Fragmented UDP is faster by:
37668.26 - 29308.21 = 8360.05 Mbit/s
The 65507 bytes UDP size is just a benchmark test and has no real-life
relevance, as performance starts to drop (below the none-frag/normal case)
when the frag size is decreased to more realistic sizes...
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-26 14:42 ` Jesper Dangaard Brouer
@ 2012-11-26 15:15 ` Eric Dumazet
2012-11-26 15:29 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-11-26 15:15 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> >
> > > Yes, for the default large 64k packets size, its just a "fake"
> > > benchmark. And notice with my fixes, we are even faster than the
> > > none-frag/single-UDP packet case... but its because we are getting a
> > > GSO/GRO effect.
> >
> > Could you elaborate on this GSO/GRO effect ?
>
> On the big system, I saw none-frag UDP (1472 bytes) throughput of:
> 7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
>
> While with UDP fragments size 65507 bytes I saw:
> 9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
>
> Fragmented UDP is faster by:
> 37668.26 - 29308.21 = 8360.05 Mbit/s
>
> The 65507 bytes UDP size is just a benchmark test, and have no real-life
> relevance. As performance starts to drop (below none-frag/normal case)
> when the frag size is decreased, to more realistic sizes...
Yes, but I doubt GRO / GSO are the reason you get better performance.
GRO doesn't aggregate UDP frames.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems
2012-11-26 15:15 ` Eric Dumazet
@ 2012-11-26 15:29 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2012-11-26 15:29 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Florian Westphal, netdev, Pablo Neira Ayuso,
Thomas Graf, Cong Wang, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Mon, 2012-11-26 at 07:15 -0800, Eric Dumazet wrote:
> On Mon, 2012-11-26 at 15:42 +0100, Jesper Dangaard Brouer wrote:
> > On Sun, 2012-11-25 at 08:11 -0800, Eric Dumazet wrote:
> > > On Sun, 2012-11-25 at 09:53 +0100, Jesper Dangaard Brouer wrote:
> > >
> > > > Yes, for the default large 64k packets size, its just a "fake"
> > > > benchmark. And notice with my fixes, we are even faster than the
> > > > none-frag/single-UDP packet case... but its because we are getting a
> > > > GSO/GRO effect.
> > >
> > > Could you elaborate on this GSO/GRO effect ?
> >
> > On the big system, I saw none-frag UDP (1472 bytes) throughput of:
> > 7356.57 + 7351.78 + 7330.60 + 7269.26 = 29308.21 Mbit/s
> >
> > While with UDP fragments size 65507 bytes I saw:
> > 9228.75 + 9207.81 + 9615.83 + 9615.87 = 37668.26 Mbit/s
> >
> > Fragmented UDP is faster by:
> > 37668.26 - 29308.21 = 8360.05 Mbit/s
> >
> > The 65507 bytes UDP size is just a benchmark test, and have no real-life
> > relevance. As performance starts to drop (below none-frag/normal case)
> > when the frag size is decreased, to more realistic sizes...
>
> Yes, but I doubt GRO / GSO are the reason you get better performance.
> GRO doesnt aggregate UDP frames.
Oh, now I think I understand your question.
I don't think GRO is helping me; it's the same "effect" as GRO. As I
understand it, the reassembled frag SKB will be a "bigger" SKB, which is
passed to the rest of the stack. Thus, fewer (but bigger) SKBs take the
overhead of the rest of the stack. It was actually Herbert who mentioned
it to me...
--Jesper
^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <20121123130826.18764.66507.stgit@dragon>]
* Re: [RFC net-next PATCH V1 5/9] net: frag per CPU mem limit and LRU list accounting
[not found] ` <20121123130826.18764.66507.stgit@dragon>
@ 2012-11-26 2:54 ` Cong Wang
0 siblings, 0 replies; 19+ messages in thread
From: Cong Wang @ 2012-11-26 2:54 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S. Miller, Florian Westphal, netdev,
Pablo Neira Ayuso, Thomas Graf, Patrick McHardy, Paul E. McKenney,
Herbert Xu
On Fri, 2012-11-23 at 14:08 +0100, Jesper Dangaard Brouer wrote:
> + // QUESTION: Please advice in a better alloc solution than NR_CPUS
> + struct cpu_resource percpu[NR_CPUS];
alloc_percpu(struct cpu_resource) ?
BTW, struct cpu_resource is not a good name here, it is too generic.
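A rough sketch of that suggestion (hypothetical; "frag_cpu_limit" and its
fields are invented names standing in for patch 5/9's struct cpu_resource):

    struct frag_cpu_limit {
            atomic_t         mem;
            struct list_head lru_list;
            spinlock_t       lru_lock;
    };

    struct frag_cpu_limit __percpu *percpu;

    percpu = alloc_percpu(struct frag_cpu_limit); /* one copy per possible CPU */
    if (!percpu)
            return -ENOMEM;
    ...
    struct frag_cpu_limit *lim = this_cpu_ptr(percpu); /* this CPU's copy */
    ...
    free_percpu(percpu);                          /* on netns teardown */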
^ permalink raw reply [flat|nested] 19+ messages in thread