* inetpeer with create==0
@ 2011-03-03 4:45 David Miller
  2011-03-03 5:30 ` Changli Gao
  2011-03-03 6:51 ` Eric Dumazet
  0 siblings, 2 replies; 17+ messages in thread

From: David Miller @ 2011-03-03 4:45 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev

Eric, I was profiling the non-routing-cache case and something that stuck
out is the case of calling inet_getpeer() with create==0.

If an entry is not found, we have to redo the lookup under a spinlock
to make certain that a concurrent writer rebalancing the tree does
not "hide" an existing entry from us.

This makes the case of a create==0 lookup for a not-present entry
really expensive.  It is on the order of 600 cpu cycles on my
Niagara2.

I added a hack to not do the relookup under the lock when create==0
and it now costs less than 300 cycles.

This is now a pretty common operation with the way we handle COW'd
metrics, so I think it's definitely worth optimizing.

I looked at the generic radix tree implementation, and it supports
full RCU lookups in parallel with insert/delete.  It handles the race
case without the relookup under lock because it creates fixed paths
to "slots" where nodes live using shifts and masks.  So if a path
to a slot ever existed, it will always exist.

Take a look at lib/radix-tree.c and include/linux/radix-tree.h if
you are curious.

I think we should do something similar for inetpeer.  Currently we
cannot just use the existing generic radix-tree code because it only
supports indexes as large as "unsigned long" and we need to handle
128-bit ipv6 addresses.

^ permalink raw reply	[flat|nested] 17+ messages in thread
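For readers unfamiliar with the fixed-path property being described above, here is a
minimal illustration.  It is not lib/radix-tree.c (which is considerably more involved);
the names toy_node, toy_lookup and TOY_BITS are invented for this sketch, and only
rcu_dereference() is a real kernel primitive.

#include <linux/rcupdate.h>

/* Sketch only: a two-level table whose slot path is a pure function of
 * the key (shift + mask), so a slot that ever existed never moves.
 */
#define TOY_BITS	6
#define TOY_SIZE	(1UL << TOY_BITS)
#define TOY_MASK	(TOY_SIZE - 1)

struct toy_node {
	void __rcu *slots[TOY_SIZE];	/* child node or leaf */
};

static void *toy_lookup(struct toy_node *root, unsigned long key)
{
	struct toy_node *child;

	/* No rebalancing ever happens: the index bits alone pick the path,
	 * so an RCU reader racing an insert can only miss entries that were
	 * never reachable, not ones a writer temporarily "hid".
	 */
	child = rcu_dereference(root->slots[(key >> TOY_BITS) & TOY_MASK]);
	if (!child)
		return NULL;
	return rcu_dereference(child->slots[key & TOY_MASK]);
}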
* Re: inetpeer with create==0
  2011-03-03 4:45 inetpeer with create==0 David Miller
@ 2011-03-03 5:30 ` Changli Gao
  2011-03-03 5:36   ` David Miller
  2011-03-03 6:51 ` Eric Dumazet
  1 sibling, 1 reply; 17+ messages in thread

From: Changli Gao @ 2011-03-03 5:30 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev

On Thu, Mar 3, 2011 at 12:45 PM, David Miller <davem@davemloft.net> wrote:
>
>
> I looked at the generic radix tree implementation, and it supports
> full RCU lookups in parallel with insert/delete.  It handles the race
> case without the relookup under lock because it creates fixed paths
> to "slots" where nodes live using shifts and masks.  So if a path
> to a slot ever existed, it will always exist.
>
> Take a look at lib/radix-tree.c and include/linux/radix-tree.h if
> you are curious.
>
> I think we should do something similar for inetpeer.  Currently we
> cannot just use the existing generic radix-tree code because it only
> supports indexes as large as "unsigned long" and we need to handle
> 128-bit ipv6 addresses.

I am just wondering why we need a trie (radix tree) here; we don't have
to do LPM, since the lengths of the keys are always fixed (either 4 or
16). Maybe a hash table is enough and simpler, and it is RCU
compatible.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: inetpeer with create==0
  2011-03-03 5:30 ` Changli Gao
@ 2011-03-03 5:36   ` David Miller
  2011-03-03 6:27     ` Changli Gao
  0 siblings, 1 reply; 17+ messages in thread

From: David Miller @ 2011-03-03 5:36 UTC (permalink / raw)
To: xiaosuo; +Cc: eric.dumazet, netdev

From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 3 Mar 2011 13:30:03 +0800

> I am just wondering why we need a trie (radix tree) here; we don't have
> to do LPM, since the lengths of the keys are always fixed (either 4 or
> 16). Maybe a hash table is enough and simpler, and it is RCU
> compatible.

Because a trie eliminates all of the issues of having to size a hash
table, dynamically resize it, etc.

A trie gives well-bounded performance dependent solely upon the size of
the table, rather than on access patterns, distribution of keys, and
how perfect the hash function is.

We used to use a hash table for the page cache too, but now we use
a trie structure there as well, and that uses full long-word-sized
keys and the generic radix-tree code.

Using hash tables is really foolish for potentially large data sets.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: inetpeer with create==0
  2011-03-03 5:36 ` David Miller
@ 2011-03-03 6:27   ` Changli Gao
  2011-03-03 6:42     ` David Miller
  0 siblings, 1 reply; 17+ messages in thread

From: Changli Gao @ 2011-03-03 6:27 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev

On Thu, Mar 3, 2011 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
>
> Because a trie eliminates all of the issues of having to size a hash
> table, dynamically resize it, etc.
>
> A trie gives well-bounded performance dependent solely upon the size of
> the table, rather than on access patterns, distribution of keys, and
> how perfect the hash function is.

Thanks for your explanation. The routing cache has all of these issues. :)

>
> We used to use a hash table for the page cache too, but now we use
> a trie structure there as well, and that uses full long-word-sized
> keys and the generic radix-tree code.
>
> Using hash tables is really foolish for potentially large data sets.
>

However, I don't agree with you in every aspect. A radix tree may cost
a lot more memory than an rbtree, AVL tree or hash table. Here is a
case of switching from a radix tree to an rbtree:
http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=8549164143a5431f9d9ea846acaa35a862410d9c

Since the keys of inetpeer depend on the access pattern, an attacker
may mount an OOM DoS attack using sparse keys. We can't simply limit
the number of inetpeers to limit the memory cost of this subsystem; we
also have to count the memory used for the inner branches of the radix
tree. And even then, an attacker may still make the OS inefficient by
forcing the max number of inetpeers to be reduced.

Hash table + jhash have been proven a safe and efficient data
structure for large data sets (conntrack and ipvs), although the size
of the hash table may have to be adjusted by an administrator.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 17+ messages in thread
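For reference, the kind of lookup being argued for above is roughly the following.
This is a sketch, not the conntrack or ipvs code: toy_peer, toy_hash[] and
TOY_HASH_BITS are hypothetical names, while jhash_1word() and the four-argument
form of hlist_for_each_entry_rcu() used by kernels of this era are the real
primitives.

#include <linux/jhash.h>
#include <linux/rculist.h>

#define TOY_HASH_BITS	10

static struct hlist_head toy_hash[1 << TOY_HASH_BITS];
static u32 toy_hash_rnd;	/* random seed, makes bucket choice hard to predict */

struct toy_peer {
	struct hlist_node	hnode;
	__be32			v4daddr;
};

/* Caller must hold rcu_read_lock(); writers protect insert/delete with
 * their own lock and use hlist_add_head_rcu()/hlist_del_rcu().
 */
static struct toy_peer *toy_lookup(__be32 daddr)
{
	u32 h = jhash_1word((__force u32)daddr, toy_hash_rnd) &
		((1 << TOY_HASH_BITS) - 1);
	struct toy_peer *p;
	struct hlist_node *n;

	hlist_for_each_entry_rcu(p, n, &toy_hash[h], hnode) {
		if (p->v4daddr == daddr)
			return p;
	}
	return NULL;
}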
* Re: inetpeer with create==0
  2011-03-03 6:27 ` Changli Gao
@ 2011-03-03 6:42   ` David Miller
  2011-03-03 7:39     ` Eric Dumazet
  2011-03-03 8:07     ` inetpeer with create==0 Changli Gao
  0 siblings, 2 replies; 17+ messages in thread

From: David Miller @ 2011-03-03 6:42 UTC (permalink / raw)
To: xiaosuo; +Cc: eric.dumazet, netdev

From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 3 Mar 2011 14:27:40 +0800

> On Thu, Mar 3, 2011 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
>>
>> Because a trie eliminates all of the issues of having to size a hash
>> table, dynamically resize it, etc.
>>
>> A trie gives well-bounded performance dependent solely upon the size of
>> the table, rather than on access patterns, distribution of keys, and
>> how perfect the hash function is.
>
> Thanks for your explanation. The routing cache has all of these issues. :)

That's why I am working hard to remove it :-)

The 99th-percentile performance of our routing cache is absolutely
terrible.  The routing cache does nothing except get in the way.

It is the reason why Linux is still completely unsuitable for use as a
router on the core internet.

> A radix tree may cost a lot more memory than an rbtree, AVL tree or
> hash table. Here is a case of switching from a radix tree to an rbtree:
> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=8549164143a5431f9d9ea846acaa35a862410d9c

I would not be against the use of an rbtree or similar if it could be
done with fully RCU non-locked lookups even in the not-present case.

> Hash table + jhash have been proven a safe and efficient data
> structure for large data sets (conntrack and ipvs), although the size
> of the hash table may have to be adjusted by an administrator.

If anything is proven, it is that hashing-based collections of cached
information are nothing but trouble when the contents are controlled
in some way by external entities.

Actually, back to the original topic, I wonder how bad it is to simply
elide the recheck in the create==0 case anyways.  Except for the ipv4
fragmentation wraparound protection values, perfect inetpeer finding
is not necessary for correctness.  And IPv4 fragmentation always calls
inetpeer with create!=0.

Eric?

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: inetpeer with create==0
  2011-03-03 6:42 ` David Miller
@ 2011-03-03 7:39   ` Eric Dumazet
  2011-03-03 8:32     ` David Miller
  0 siblings, 1 reply; 17+ messages in thread

From: Eric Dumazet @ 2011-03-03 7:39 UTC (permalink / raw)
To: David Miller; +Cc: xiaosuo, netdev

Le mercredi 02 mars 2011 à 22:42 -0800, David Miller a écrit :

> Actually, back to the original topic, I wonder how bad it is to simply
> elide the recheck in the create==0 case anyways.  Except for the ipv4
> fragmentation wraparound protection values, perfect inetpeer finding
> is not necessary for correctness.  And IPv4 fragmentation always calls
> inetpeer with create!=0.

We could use a seqlock, to detect that a writer might have changed
things while we did our RCU lookup?

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: inetpeer with create==0
  2011-03-03 7:39 ` Eric Dumazet
@ 2011-03-03 8:32   ` David Miller
  2011-03-04 15:09     ` [PATCH net-next-2.6] inetpeer: seqlock optimization Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread

From: David Miller @ 2011-03-03 8:32 UTC (permalink / raw)
To: eric.dumazet; +Cc: xiaosuo, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 Mar 2011 08:39:37 +0100

> Le mercredi 02 mars 2011 à 22:42 -0800, David Miller a écrit :
>> Actually, back to the original topic, I wonder how bad it is to simply
>> elide the recheck in the create==0 case anyways.  Except for the ipv4
>> fragmentation wraparound protection values, perfect inetpeer finding
>> is not necessary for correctness.  And IPv4 fragmentation always calls
>> inetpeer with create!=0.
>
> We could use a seqlock, to detect that a writer might have changed
> things while we did our RCU lookup?

That would certainly work.

^ permalink raw reply	[flat|nested] 17+ messages in thread
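Condensed, the idea agreed on here looks like the fragment below.  It is a
simplified sketch of what the patch that follows actually does in
inet_getpeer(); refcounting, the unused-list handling and the insert path are
omitted.  read_seqbegin()/read_seqretry() and write_seqlock_bh() are the
standard seqlock primitives.

	/* Lockless lookup, bracketed by a seqlock read section so a
	 * concurrent tree rebalance can be detected afterwards.
	 */
	rcu_read_lock_bh();
	sequence = read_seqbegin(&base->lock);
	p = lookup_rcu_bh(daddr, base);
	invalidated = read_seqretry(&base->lock, sequence);
	rcu_read_unlock_bh();

	if (p)
		return p;		/* found: no relock needed */

	if (!create && !invalidated)
		return NULL;		/* miss with no concurrent writer: trust it */

	/* Otherwise fall back to the exact lookup under the lock; writers
	 * take write_seqlock_bh(&base->lock), which bumps the sequence that
	 * readers sample above.
	 */
	write_seqlock_bh(&base->lock);
	/* ... locked lookup / insert as before ... */
	write_sequnlock_bh(&base->lock);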
* [PATCH net-next-2.6] inetpeer: seqlock optimization 2011-03-03 8:32 ` David Miller @ 2011-03-04 15:09 ` Eric Dumazet 2011-03-04 19:17 ` David Miller 0 siblings, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2011-03-04 15:09 UTC (permalink / raw) To: David Miller; +Cc: xiaosuo, netdev Le jeudi 03 mars 2011 à 00:32 -0800, David Miller a écrit : > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Thu, 03 Mar 2011 08:39:37 +0100 > > > Le mercredi 02 mars 2011 à 22:42 -0800, David Miller a écrit : > >> Actually, back to the original topic, I wonder how bad it is to simply > >> elide the recheck in the create==0 case anyways. Except for the ipv4 > >> fragmentation wraparound protection values, perfect inetpeer finding > >> is not necessary for correctness. And IPv4 fragmentation always calls > >> inetpeer with create!=0. > > > > We could use a seqlock, to detect that a writer might have changed > > things while we did our RCU lookup ? > > That would certainly work. Here is a patch to implement this idea. Thanks ! [PATCH net-next-2.6] inetpeer: seqlock optimization David noticed : ------------------ Eric, I was profiling the non-routing-cache case and something that stuck out is the case of calling inet_getpeer() with create==0. If an entry is not found, we have to redo the lookup under a spinlock to make certain that a concurrent writer rebalancing the tree does not "hide" an existing entry from us. This makes the case of a create==0 lookup for a not-present entry really expensive. It is on the order of 600 cpu cycles on my Niagara2. I added a hack to not do the relookup under the lock when create==0 and it now costs less than 300 cycles. This is now a pretty common operation with the way we handle COW'd metrics, so I think it's definitely worth optimizing. ----------------- One solution is to use a seqlock instead of a spinlock to protect struct inet_peer_base. After a failed avl tree lookup, we can easily detect if a writer did some changes during our lookup. Taking the lock and redo the lookup is only necessary in this case. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/ipv4/inetpeer.c | 24 ++++++++++++++++-------- 1 files changed, 16 insertions(+), 8 deletions(-) diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c index 48f8d45..7fd9fab 100644 --- a/net/ipv4/inetpeer.c +++ b/net/ipv4/inetpeer.c @@ -81,19 +81,19 @@ static const struct inet_peer peer_fake_node = { struct inet_peer_base { struct inet_peer __rcu *root; - spinlock_t lock; + seqlock_t lock; int total; }; static struct inet_peer_base v4_peers = { .root = peer_avl_empty_rcu, - .lock = __SPIN_LOCK_UNLOCKED(v4_peers.lock), + .lock = __SEQLOCK_UNLOCKED(v4_peers.lock), .total = 0, }; static struct inet_peer_base v6_peers = { .root = peer_avl_empty_rcu, - .lock = __SPIN_LOCK_UNLOCKED(v6_peers.lock), + .lock = __SEQLOCK_UNLOCKED(v6_peers.lock), .total = 0, }; @@ -372,7 +372,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base) do_free = 0; - spin_lock_bh(&base->lock); + write_seqlock_bh(&base->lock); /* Check the reference counter. It was artificially incremented by 1 * in cleanup() function to prevent sudden disappearing. 
If we can * atomically (because of lockless readers) take this last reference, @@ -409,7 +409,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base) base->total--; do_free = 1; } - spin_unlock_bh(&base->lock); + write_sequnlock_bh(&base->lock); if (do_free) call_rcu_bh(&p->rcu, inetpeer_free_rcu); @@ -477,12 +477,16 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create) struct inet_peer __rcu **stack[PEER_MAXDEPTH], ***stackptr; struct inet_peer_base *base = family_to_base(daddr->family); struct inet_peer *p; + unsigned int sequence; + int invalidated; /* Look up for the address quickly, lockless. * Because of a concurrent writer, we might not find an existing entry. */ rcu_read_lock_bh(); + sequence = read_seqbegin(&base->lock); p = lookup_rcu_bh(daddr, base); + invalidated = read_seqretry(&base->lock, sequence); rcu_read_unlock_bh(); if (p) { @@ -493,14 +497,18 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create) return p; } + /* If no writer did a change during our lookup, we can return early. */ + if (!create && !invalidated) + return NULL; + /* retry an exact lookup, taking the lock before. * At least, nodes should be hot in our cache. */ - spin_lock_bh(&base->lock); + write_seqlock_bh(&base->lock); p = lookup(daddr, stack, base); if (p != peer_avl_empty) { atomic_inc(&p->refcnt); - spin_unlock_bh(&base->lock); + write_sequnlock_bh(&base->lock); /* Remove the entry from unused list if it was there. */ unlink_from_unused(p); return p; @@ -524,7 +532,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create) link_to_pool(p, base); base->total++; } - spin_unlock_bh(&base->lock); + write_sequnlock_bh(&base->lock); if (base->total >= inet_peer_threshold) /* Remove one less-recently-used entry. */ ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization 2011-03-04 15:09 ` [PATCH net-next-2.6] inetpeer: seqlock optimization Eric Dumazet @ 2011-03-04 19:17 ` David Miller 2011-03-04 20:45 ` David Miller 0 siblings, 1 reply; 17+ messages in thread From: David Miller @ 2011-03-04 19:17 UTC (permalink / raw) To: eric.dumazet; +Cc: xiaosuo, netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Fri, 04 Mar 2011 16:09:08 +0100 > Here is a patch to implement this idea. Applied, thanks Eric! ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization
  2011-03-04 19:17 ` David Miller
@ 2011-03-04 20:45   ` David Miller
  2011-03-04 22:13     ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread

From: David Miller @ 2011-03-04 20:45 UTC (permalink / raw)
To: eric.dumazet; +Cc: xiaosuo, netdev

From: David Miller <davem@davemloft.net>
Date: Fri, 04 Mar 2011 11:17:05 -0800 (PST)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 04 Mar 2011 16:09:08 +0100
>
>> Here is a patch to implement this idea.
>
> Applied, thanks Eric!

Unfortunately, I have to revert; the lockdep annotations need to
be updated:

net/ipv4/inetpeer.c: In function ‘peer_avl_rebalance’:
net/ipv4/inetpeer.c:274:10: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:276:7: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:278:7: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:285:9: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:287:9: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:299:11: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:301:11: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:317:9: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:319:9: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:331:11: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:333:11: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c: In function ‘unlink_from_pool’:
net/ipv4/inetpeer.c:385:7: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:385:7: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:394:8: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:394:8: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:395:4: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c: In function ‘inet_getpeer’:
net/ipv4/inetpeer.c:508:6: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:508:6: error: ‘seqlock_t’ has no member named ‘dep_map’

^ permalink raw reply	[flat|nested] 17+ messages in thread
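The underlying issue is that lockdep_is_held() wants a lock that carries a
dep_map, and seqlock_t only has one inside its embedded spinlock member.  The
v2 patch below deals with this by funnelling every annotation through a single
helper that points lockdep at that inner spinlock (quoted from the patch):

#define rcu_deref_locked(X, BASE)				\
	rcu_dereference_protected(X, lockdep_is_held(&(BASE)->lock.lock))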
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization 2011-03-04 20:45 ` David Miller @ 2011-03-04 22:13 ` Eric Dumazet 2011-03-04 22:44 ` David Miller 0 siblings, 1 reply; 17+ messages in thread From: Eric Dumazet @ 2011-03-04 22:13 UTC (permalink / raw) To: David Miller; +Cc: xiaosuo, netdev Le vendredi 04 mars 2011 à 12:45 -0800, David Miller a écrit : > From: David Miller <davem@davemloft.net> > Date: Fri, 04 Mar 2011 11:17:05 -0800 (PST) > > > From: Eric Dumazet <eric.dumazet@gmail.com> > > Date: Fri, 04 Mar 2011 16:09:08 +0100 > > > >> Here is a patch to implement this idea. > > > > Applied, thanks Eric! > > Unfortunately, I have to revert, the lockdep annotations needs to > be updated: > > net/ipv4/inetpeer.c: In function ‘peer_avl_rebalance’: > net/ipv4/inetpeer.c:274:10: error: ‘seqlock_t’ has no member named ‘dep_map’ Oops thats right, here is an updated version. Thanks [PATCH net-next-2.6] inetpeer: seqlock optimization David noticed : ------------------ Eric, I was profiling the non-routing-cache case and something that stuck out is the case of calling inet_getpeer() with create==0. If an entry is not found, we have to redo the lookup under a spinlock to make certain that a concurrent writer rebalancing the tree does not "hide" an existing entry from us. This makes the case of a create==0 lookup for a not-present entry really expensive. It is on the order of 600 cpu cycles on my Niagara2. I added a hack to not do the relookup under the lock when create==0 and it now costs less than 300 cycles. This is now a pretty common operation with the way we handle COW'd metrics, so I think it's definitely worth optimizing. ----------------- One solution is to use a seqlock instead of a spinlock to protect struct inet_peer_base. After a failed avl tree lookup, we can easily detect if a writer did some changes during our lookup. Taking the lock and redo the lookup is only necessary in this case. Note: Add one private rcu_deref_locked() macro to place in one spot the access to spinlock included in seqlock. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- V2: updated lockdep annotations net/ipv4/inetpeer.c | 75 +++++++++++++++++++----------------------- 1 files changed, 35 insertions(+), 40 deletions(-) diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c index 48f8d45..f604ffd 100644 --- a/net/ipv4/inetpeer.c +++ b/net/ipv4/inetpeer.c @@ -81,19 +81,19 @@ static const struct inet_peer peer_fake_node = { struct inet_peer_base { struct inet_peer __rcu *root; - spinlock_t lock; + seqlock_t lock; int total; }; static struct inet_peer_base v4_peers = { .root = peer_avl_empty_rcu, - .lock = __SPIN_LOCK_UNLOCKED(v4_peers.lock), + .lock = __SEQLOCK_UNLOCKED(v4_peers.lock), .total = 0, }; static struct inet_peer_base v6_peers = { .root = peer_avl_empty_rcu, - .lock = __SPIN_LOCK_UNLOCKED(v6_peers.lock), + .lock = __SEQLOCK_UNLOCKED(v6_peers.lock), .total = 0, }; @@ -177,6 +177,9 @@ static int addr_compare(const struct inetpeer_addr *a, return 0; } +#define rcu_deref_locked(X, BASE) \ + rcu_dereference_protected(X, lockdep_is_held(&(BASE)->lock.lock)) + /* * Called with local BH disabled and the pool lock held. 
*/ @@ -187,8 +190,7 @@ static int addr_compare(const struct inetpeer_addr *a, \ stackptr = _stack; \ *stackptr++ = &_base->root; \ - for (u = rcu_dereference_protected(_base->root, \ - lockdep_is_held(&_base->lock)); \ + for (u = rcu_deref_locked(_base->root, _base); \ u != peer_avl_empty; ) { \ int cmp = addr_compare(_daddr, &u->daddr); \ if (cmp == 0) \ @@ -198,8 +200,7 @@ static int addr_compare(const struct inetpeer_addr *a, else \ v = &u->avl_right; \ *stackptr++ = v; \ - u = rcu_dereference_protected(*v, \ - lockdep_is_held(&_base->lock)); \ + u = rcu_deref_locked(*v, _base); \ } \ u; \ }) @@ -246,13 +247,11 @@ static struct inet_peer *lookup_rcu_bh(const struct inetpeer_addr *daddr, struct inet_peer __rcu **v; \ *stackptr++ = &start->avl_left; \ v = &start->avl_left; \ - for (u = rcu_dereference_protected(*v, \ - lockdep_is_held(&base->lock)); \ + for (u = rcu_deref_locked(*v, base); \ u->avl_right != peer_avl_empty_rcu; ) { \ v = &u->avl_right; \ *stackptr++ = v; \ - u = rcu_dereference_protected(*v, \ - lockdep_is_held(&base->lock)); \ + u = rcu_deref_locked(*v, base); \ } \ u; \ }) @@ -271,21 +270,16 @@ static void peer_avl_rebalance(struct inet_peer __rcu **stack[], while (stackend > stack) { nodep = *--stackend; - node = rcu_dereference_protected(*nodep, - lockdep_is_held(&base->lock)); - l = rcu_dereference_protected(node->avl_left, - lockdep_is_held(&base->lock)); - r = rcu_dereference_protected(node->avl_right, - lockdep_is_held(&base->lock)); + node = rcu_deref_locked(*nodep, base); + l = rcu_deref_locked(node->avl_left, base); + r = rcu_deref_locked(node->avl_right, base); lh = node_height(l); rh = node_height(r); if (lh > rh + 1) { /* l: RH+2 */ struct inet_peer *ll, *lr, *lrl, *lrr; int lrh; - ll = rcu_dereference_protected(l->avl_left, - lockdep_is_held(&base->lock)); - lr = rcu_dereference_protected(l->avl_right, - lockdep_is_held(&base->lock)); + ll = rcu_deref_locked(l->avl_left, base); + lr = rcu_deref_locked(l->avl_right, base); lrh = node_height(lr); if (lrh <= node_height(ll)) { /* ll: RH+1 */ RCU_INIT_POINTER(node->avl_left, lr); /* lr: RH or RH+1 */ @@ -296,10 +290,8 @@ static void peer_avl_rebalance(struct inet_peer __rcu **stack[], l->avl_height = node->avl_height + 1; RCU_INIT_POINTER(*nodep, l); } else { /* ll: RH, lr: RH+1 */ - lrl = rcu_dereference_protected(lr->avl_left, - lockdep_is_held(&base->lock)); /* lrl: RH or RH-1 */ - lrr = rcu_dereference_protected(lr->avl_right, - lockdep_is_held(&base->lock)); /* lrr: RH or RH-1 */ + lrl = rcu_deref_locked(lr->avl_left, base);/* lrl: RH or RH-1 */ + lrr = rcu_deref_locked(lr->avl_right, base);/* lrr: RH or RH-1 */ RCU_INIT_POINTER(node->avl_left, lrr); /* lrr: RH or RH-1 */ RCU_INIT_POINTER(node->avl_right, r); /* r: RH */ node->avl_height = rh + 1; /* node: RH+1 */ @@ -314,10 +306,8 @@ static void peer_avl_rebalance(struct inet_peer __rcu **stack[], } else if (rh > lh + 1) { /* r: LH+2 */ struct inet_peer *rr, *rl, *rlr, *rll; int rlh; - rr = rcu_dereference_protected(r->avl_right, - lockdep_is_held(&base->lock)); - rl = rcu_dereference_protected(r->avl_left, - lockdep_is_held(&base->lock)); + rr = rcu_deref_locked(r->avl_right, base); + rl = rcu_deref_locked(r->avl_left, base); rlh = node_height(rl); if (rlh <= node_height(rr)) { /* rr: LH+1 */ RCU_INIT_POINTER(node->avl_right, rl); /* rl: LH or LH+1 */ @@ -328,10 +318,8 @@ static void peer_avl_rebalance(struct inet_peer __rcu **stack[], r->avl_height = node->avl_height + 1; RCU_INIT_POINTER(*nodep, r); } else { /* rr: RH, rl: RH+1 */ - rlr = 
rcu_dereference_protected(rl->avl_right, - lockdep_is_held(&base->lock)); /* rlr: LH or LH-1 */ - rll = rcu_dereference_protected(rl->avl_left, - lockdep_is_held(&base->lock)); /* rll: LH or LH-1 */ + rlr = rcu_deref_locked(rl->avl_right, base);/* rlr: LH or LH-1 */ + rll = rcu_deref_locked(rl->avl_left, base);/* rll: LH or LH-1 */ RCU_INIT_POINTER(node->avl_right, rll); /* rll: LH or LH-1 */ RCU_INIT_POINTER(node->avl_left, l); /* l: LH */ node->avl_height = lh + 1; /* node: LH+1 */ @@ -372,7 +360,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base) do_free = 0; - spin_lock_bh(&base->lock); + write_seqlock_bh(&base->lock); /* Check the reference counter. It was artificially incremented by 1 * in cleanup() function to prevent sudden disappearing. If we can * atomically (because of lockless readers) take this last reference, @@ -392,8 +380,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base) /* look for a node to insert instead of p */ struct inet_peer *t; t = lookup_rightempty(p, base); - BUG_ON(rcu_dereference_protected(*stackptr[-1], - lockdep_is_held(&base->lock)) != t); + BUG_ON(rcu_deref_locked(*stackptr[-1], base) != t); **--stackptr = t->avl_left; /* t is removed, t->daddr > x->daddr for any * x in p->avl_left subtree. @@ -409,7 +396,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base) base->total--; do_free = 1; } - spin_unlock_bh(&base->lock); + write_sequnlock_bh(&base->lock); if (do_free) call_rcu_bh(&p->rcu, inetpeer_free_rcu); @@ -477,12 +464,16 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create) struct inet_peer __rcu **stack[PEER_MAXDEPTH], ***stackptr; struct inet_peer_base *base = family_to_base(daddr->family); struct inet_peer *p; + unsigned int sequence; + int invalidated; /* Look up for the address quickly, lockless. * Because of a concurrent writer, we might not find an existing entry. */ rcu_read_lock_bh(); + sequence = read_seqbegin(&base->lock); p = lookup_rcu_bh(daddr, base); + invalidated = read_seqretry(&base->lock, sequence); rcu_read_unlock_bh(); if (p) { @@ -493,14 +484,18 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create) return p; } + /* If no writer did a change during our lookup, we can return early. */ + if (!create && !invalidated) + return NULL; + /* retry an exact lookup, taking the lock before. * At least, nodes should be hot in our cache. */ - spin_lock_bh(&base->lock); + write_seqlock_bh(&base->lock); p = lookup(daddr, stack, base); if (p != peer_avl_empty) { atomic_inc(&p->refcnt); - spin_unlock_bh(&base->lock); + write_sequnlock_bh(&base->lock); /* Remove the entry from unused list if it was there. */ unlink_from_unused(p); return p; @@ -524,7 +519,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create) link_to_pool(p, base); base->total++; } - spin_unlock_bh(&base->lock); + write_sequnlock_bh(&base->lock); if (base->total >= inet_peer_threshold) /* Remove one less-recently-used entry. */ ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization
  2011-03-04 22:13 ` Eric Dumazet
@ 2011-03-04 22:44   ` David Miller
  2011-03-04 23:02     ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread

From: David Miller @ 2011-03-04 22:44 UTC (permalink / raw)
To: eric.dumazet; +Cc: xiaosuo, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 04 Mar 2011 23:13:59 +0100

> Le vendredi 04 mars 2011 à 12:45 -0800, David Miller a écrit :
>> From: David Miller <davem@davemloft.net>
>> Date: Fri, 04 Mar 2011 11:17:05 -0800 (PST)
>>
>> > From: Eric Dumazet <eric.dumazet@gmail.com>
>> > Date: Fri, 04 Mar 2011 16:09:08 +0100
>> >
>> >> Here is a patch to implement this idea.
>> >
>> > Applied, thanks Eric!
>>
>> Unfortunately, I have to revert; the lockdep annotations need to
>> be updated:
>>
>> net/ipv4/inetpeer.c: In function ‘peer_avl_rebalance’:
>> net/ipv4/inetpeer.c:274:10: error: ‘seqlock_t’ has no member named ‘dep_map’
>
> Oops thats right, here is an updated version.

Applied, thanks Eric!

With this and the following patch applied to my no-routing-cache tree,
output route lookup on my Niagara2 is down to 2966 cycles!  For reference
with just the plain routing cache removal, it was as much as 3832 cycles.

udpflood is a lot faster too, with plain routing cache removal it ran as:

bash$ time ./bin/udpflood -l 10000000 10.2.2.11

real    3m9.921s
user    0m9.520s
sys     3m0.440s

But now it's:

bash$ time ./bin/udpflood -l 10000000 10.2.2.11

real    2m45.903s
user    0m8.640s
sys     2m37.280s

:-)

--------------------
ipv4: Optimize flow initialization in output route lookup.

We burn a lot of useless cycles, cpu store buffer traffic, and
memory operations memset()'ing the on-stack flow used to perform
output route lookups in __ip_route_output_key().

Only the first half of the flow object members even matter for
output route lookups in this context, specifically:

FIB rules matching cares about:

	dst, src, tos, iif, oif, mark

FIB trie lookup cares about:

	dst

FIB semantic match cares about:

	tos, scope, oif

Therefore only initialize these specific members and elide the
memset entirely.

On Niagara2 this kills about ~300 cycles from the output route
lookup path.

Likely, we can take things further, since all callers of output
route lookups essentially throw away the on-stack flow they use.
So they don't care if we use it as a scratch-pad to compute the
final flow key.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/route.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 04b8954..e3a5a89 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1670,14 +1670,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 struct rtable *__ip_route_output_key(struct net *net, const struct flowi *oldflp)
 {
 	u32 tos	= RT_FL_TOS(oldflp);
-	struct flowi fl = { .fl4_dst = oldflp->fl4_dst,
-			    .fl4_src = oldflp->fl4_src,
-			    .fl4_tos = tos & IPTOS_RT_MASK,
-			    .fl4_scope = ((tos & RTO_ONLINK) ?
-					  RT_SCOPE_LINK : RT_SCOPE_UNIVERSE),
-			    .mark = oldflp->mark,
-			    .iif = net->loopback_dev->ifindex,
-			    .oif = oldflp->oif };
+	struct flowi fl;
 	struct fib_result res;
 	unsigned int flags = 0;
 	struct net_device *dev_out = NULL;
@@ -1688,6 +1681,15 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi *oldflp
 	res.r = NULL;
 #endif
 
+	fl.oif = oldflp->oif;
+	fl.iif = net->loopback_dev->ifindex;
+	fl.mark = oldflp->mark;
+	fl.fl4_dst = oldflp->fl4_dst;
+	fl.fl4_src = oldflp->fl4_src;
+	fl.fl4_tos = tos & IPTOS_RT_MASK;
+	fl.fl4_scope = ((tos & RTO_ONLINK) ?
+			RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
+
 	rcu_read_lock();
 	if (oldflp->fl4_src) {
 		rth = ERR_PTR(-EINVAL);
-- 
1.7.4.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization 2011-03-04 22:44 ` David Miller @ 2011-03-04 23:02 ` Eric Dumazet 0 siblings, 0 replies; 17+ messages in thread From: Eric Dumazet @ 2011-03-04 23:02 UTC (permalink / raw) To: David Miller; +Cc: xiaosuo, netdev Le vendredi 04 mars 2011 à 14:44 -0800, David Miller a écrit : > Applied, thanks Eric! > > With this and the following patch applied to my no-routing-cache tree, > output route lookup on my Niagara2 is down to 2966 cycles! For reference > with just the plain routing cache removal, it was as much as 3832 cycles. > > udpflood is a lot faster too, with plain routing cache removal it ran as: > > bash$ time ./bin/udpflood -l 10000000 10.2.2.11 > real 3m9.921s > user 0m9.520s > sys 3m0.440s > > But now it's: > > bash$ time ./bin/udpflood -l 10000000 10.2.2.11 > real 2m45.903s > user 0m8.640s > sys 2m37.280s > > :-) > Nice indeed :) > -------------------- > ipv4: Optimize flow initialization in output route lookup. > > We burn a lot of useless cycles, cpu store buffer traffic, and > memory operations memset()'ing the on-stack flow used to perform > output route lookups in __ip_route_output_key(). > > Only the first half of the flow object members even matter for > output route lookups in this context, specifically: > > FIB rules matching cares about: > > dst, src, tos, iif, oif, mark > > FIB trie lookup cares about: > > dst > > FIB semantic match cares about: > > tos, scope, oif > > Therefore only initialize these specific members and elide the > memset entirely. > > On Niagara2 this kills about ~300 cycles from the output route > lookup path. > > Likely, we can take things further, since all callers of output > route lookups essentially throw away the on-stack flow they use. > So they don't care if we use it as a scratch-pad to compute the > final flow key. > > Signed-off-by: David S. Miller <davem@davemloft.net> > --- > net/ipv4/route.c | 18 ++++++++++-------- > 1 files changed, 10 insertions(+), 8 deletions(-) > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > index 04b8954..e3a5a89 100644 > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -1670,14 +1670,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res, > struct rtable *__ip_route_output_key(struct net *net, const struct flowi *oldflp) > { > u32 tos = RT_FL_TOS(oldflp); > - struct flowi fl = { .fl4_dst = oldflp->fl4_dst, > - .fl4_src = oldflp->fl4_src, > - .fl4_tos = tos & IPTOS_RT_MASK, > - .fl4_scope = ((tos & RTO_ONLINK) ? > - RT_SCOPE_LINK : RT_SCOPE_UNIVERSE), > - .mark = oldflp->mark, > - .iif = net->loopback_dev->ifindex, > - .oif = oldflp->oif }; > + struct flowi fl; > struct fib_result res; > unsigned int flags = 0; > struct net_device *dev_out = NULL; > @@ -1688,6 +1681,15 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi *oldflp > res.r = NULL; > #endif > > + fl.oif = oldflp->oif; > + fl.iif = net->loopback_dev->ifindex; > + fl.mark = oldflp->mark; > + fl.fl4_dst = oldflp->fl4_dst; > + fl.fl4_src = oldflp->fl4_src; > + fl.fl4_tos = tos & IPTOS_RT_MASK; > + fl.fl4_scope = ((tos & RTO_ONLINK) ? > + RT_SCOPE_LINK : RT_SCOPE_UNIVERSE); > + > rcu_read_lock(); > if (oldflp->fl4_src) { > rth = ERR_PTR(-EINVAL); Acked-by: Eric Dumazet <eric.dumazet@gmail.com> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: inetpeer with create==0
  2011-03-03 6:42 ` David Miller
  2011-03-03 7:39   ` Eric Dumazet
@ 2011-03-03 8:07   ` inetpeer with create==0 Changli Gao
  2011-03-03 8:34     ` David Miller
  1 sibling, 1 reply; 17+ messages in thread

From: Changli Gao @ 2011-03-03 8:07 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev

On Thu, Mar 3, 2011 at 2:42 PM, David Miller <davem@davemloft.net> wrote:
>
> That's why I am working hard to remove it :-)
>
> The 99th-percentile performance of our routing cache is absolutely
> terrible.  The routing cache does nothing except get in the way.
>
> It is the reason why Linux is still completely unsuitable for use as a
> router on the core internet.
>

Even after the routing cache is removed completely, Linux will still be
unsuitable for use as a router on the core internet, because it isn't a
realtime OS and the forwarding delay isn't bounded. With the routing
cache, the memory cost isn't bounded either. :)

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: inetpeer with create==0
  2011-03-03 8:07 ` inetpeer with create==0 Changli Gao
@ 2011-03-03 8:34   ` David Miller
  0 siblings, 0 replies; 17+ messages in thread

From: David Miller @ 2011-03-03 8:34 UTC (permalink / raw)
To: xiaosuo; +Cc: eric.dumazet, netdev

From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 3 Mar 2011 16:07:32 +0800

> Even after the routing cache is removed completely, Linux will still be
> unsuitable for use as a router on the core internet, because it isn't a
> realtime OS and the forwarding delay isn't bounded. With the routing
> cache, the memory cost isn't bounded either. :)

With many cores, the pipe can be filled no problem, because on a
multi-core machine the routing lookup proceeds much like a Cisco
router does it in hardware.

I wrote about this in my blog a few years ago.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: inetpeer with create==0
  2011-03-03 4:45 inetpeer with create==0 David Miller
  2011-03-03 5:30 ` Changli Gao
@ 2011-03-03 6:51 ` Eric Dumazet
  2011-03-03 8:30   ` David Miller
  1 sibling, 1 reply; 17+ messages in thread

From: Eric Dumazet @ 2011-03-03 6:51 UTC (permalink / raw)
To: David Miller; +Cc: netdev

Le mercredi 02 mars 2011 à 20:45 -0800, David Miller a écrit :

> Eric, I was profiling the non-routing-cache case and something that stuck
> out is the case of calling inet_getpeer() with create==0.
>
> If an entry is not found, we have to redo the lookup under a spinlock
> to make certain that a concurrent writer rebalancing the tree does
> not "hide" an existing entry from us.
>
> This makes the case of a create==0 lookup for a not-present entry
> really expensive.  It is on the order of 600 cpu cycles on my
> Niagara2.
>

Well, your test assumes all data is already in cpu caches?

I'll take a look, but my reasoning was that the real cost in a DDOS
situation is bringing the data into the caches. With a depth of 20,
cache-miss costs are the problem. The second lookup was basically free.

> I added a hack to not do the relookup under the lock when create==0
> and it now costs less than 300 cycles.
>

Hmm... I'm curious: send this hack to me ;)

> This is now a pretty common operation with the way we handle COW'd
> metrics, so I think it's definitely worth optimizing.
>
> I looked at the generic radix tree implementation, and it supports
> full RCU lookups in parallel with insert/delete.  It handles the race
> case without the relookup under lock because it creates fixed paths
> to "slots" where nodes live using shifts and masks.  So if a path
> to a slot ever existed, it will always exist.
>
> Take a look at lib/radix-tree.c and include/linux/radix-tree.h if
> you are curious.
>
> I think we should do something similar for inetpeer.  Currently we
> cannot just use the existing generic radix-tree code because it only
> supports indexes as large as "unsigned long" and we need to handle
> 128-bit ipv6 addresses.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: inetpeer with create==0
  2011-03-03 6:51 ` Eric Dumazet
@ 2011-03-03 8:30   ` David Miller
  0 siblings, 0 replies; 17+ messages in thread

From: David Miller @ 2011-03-03 8:30 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 Mar 2011 07:51:47 +0100

> Hmm... I'm curious: send this hack to me ;)

Ok, it's so simple :-)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 48f8d45..a194e91 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -492,6 +492,8 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
 		unlink_from_unused(p);
 		return p;
 	}
+	if (!create)
+		return NULL;
 
 	/* retry an exact lookup, taking the lock before.
 	 * At least, nodes should be hot in our cache.
@@ -505,7 +507,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
 		unlink_from_unused(p);
 		return p;
 	}
-	p = create ? kmem_cache_alloc(peer_cachep, GFP_ATOMIC) : NULL;
+	p = kmem_cache_alloc(peer_cachep, GFP_ATOMIC);
 	if (p) {
 		p->daddr = *daddr;
 		atomic_set(&p->refcnt, 1);

^ permalink raw reply related	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2011-03-04 23:02 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-03 4:45 inetpeer with create==0 David Miller
2011-03-03 5:30 ` Changli Gao
2011-03-03 5:36 ` David Miller
2011-03-03 6:27 ` Changli Gao
2011-03-03 6:42 ` David Miller
2011-03-03 7:39 ` Eric Dumazet
2011-03-03 8:32 ` David Miller
2011-03-04 15:09 ` [PATCH net-next-2.6] inetpeer: seqlock optimization Eric Dumazet
2011-03-04 19:17 ` David Miller
2011-03-04 20:45 ` David Miller
2011-03-04 22:13 ` Eric Dumazet
2011-03-04 22:44 ` David Miller
2011-03-04 23:02 ` Eric Dumazet
2011-03-03 8:07 ` inetpeer with create==0 Changli Gao
2011-03-03 8:34 ` David Miller
2011-03-03 6:51 ` Eric Dumazet
2011-03-03 8:30 ` David Miller