* inetpeer with create==0
From: David Miller @ 2011-03-03 4:45 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
Eric, I was profiling the non-routing-cache case and something that stuck
out is the case of calling inet_getpeer() with create==0.
If an entry is not found, we have to redo the lookup under a spinlock
to make certain that a concurrent writer rebalancing the tree does
not "hide" an existing entry from us.
This makes the case of a create==0 lookup for a not-present entry
really expensive. It is on the order of 600 cpu cycles on my
Niagara2.
I added a hack to not do the relookup under the lock when create==0
and it now costs less than 300 cycles.
This is now a pretty common operation with the way we handle COW'd
metrics, so I think it's definitely worth optimizing.
I looked at the generic radix tree implementation, and it supports
full RCU lookups in parallel with insert/delete. It handles the race
case without the relookup under lock because it creates fixed paths
to "slots" where nodes live using shifts and masks. So if a path
to a slot ever existed, it will always exist.
Take a look at lib/radix-tree.c and include/linux/radix-tree.h if
you are curious.
I think we should do something similar for inetpeer. Currently we
cannot just use the existing generic radix-tree code because it only
supports indexes as large as "unsigned long" and we need to handle
128-bit ipv6 addresses.
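For illustration, here is a minimal toy sketch of the fixed-path idea (this
is not lib/radix-tree.c, just the shape of it, with made-up names): the slot
for a key is derived purely from shifts and masks, so interior nodes on a
key's path are never relocated by later inserts or deletes, and a lockless
reader can never have an existing entry hidden from it.

/* Toy fixed-path radix tree: 4 bits per level, 16-bit keys, so a fixed
 * 4-level path per key. Interior nodes are only ever added, never moved,
 * which is what makes lockless lookups safe without a relookup under a
 * lock. Illustrative sketch only.
 */
#define TOY_BITS_PER_LEVEL	4
#define TOY_SLOTS		(1 << TOY_BITS_PER_LEVEL)
#define TOY_LEVELS		4

struct toy_node {
	void *slots[TOY_SLOTS];	/* child nodes, or items at the last level */
};

static void *toy_lookup(struct toy_node *root, unsigned int key)
{
	void *p = root;
	int shift = (TOY_LEVELS - 1) * TOY_BITS_PER_LEVEL;

	while (p && shift >= 0) {
		unsigned int idx = (key >> shift) & (TOY_SLOTS - 1);

		/* in real code this would be rcu_dereference() */
		p = ((struct toy_node *)p)->slots[idx];
		shift -= TOY_BITS_PER_LEVEL;
	}
	return p;	/* the stored item, or NULL if this path was never populated */
}

An insert only ever fills NULL slots along this fixed path (allocating any
missing interior nodes), so a concurrent reader either sees the entry or sees
NULL; there is no rebalancing step that could hide an existing entry, which
is why no relookup under the lock is needed.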
* Re: inetpeer with create==0
From: Changli Gao @ 2011-03-03 5:30 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Thu, Mar 3, 2011 at 12:45 PM, David Miller <davem@davemloft.net> wrote:
>
>
> I looked at the generic radix tree implementation, and it supports
> full RCU lookups in parallel with insert/delete. It handles the race
> case without the relookup under lock because it creates fixed paths
> to "slots" where nodes live using shifts and masks. So if a path
> to a slot ever existed, it will always exist.
>
> Take a look at lib/radix-tree.c and include/linux/radix-tree.h if
> you are curious.
>
> I think we should do something similar for inetpeer. Currently we
> cannot just use the existing generic radix-tree code because it only
> supports indexes as large as "unsigned long" and we need to handle
> 128-bit ipv6 addresses.
I am just wondering why we need a trie (radix tree) here; we don't have
to do LPM since the lengths of the keys are always fixed (either 4 or
16 bytes). Maybe a hash table is enough and simpler, and it is
RCU-compatible.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
* Re: inetpeer with create==0
From: David Miller @ 2011-03-03 5:36 UTC (permalink / raw)
To: xiaosuo; +Cc: eric.dumazet, netdev
From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 3 Mar 2011 13:30:03 +0800
> I am just wondering why we need a trie (radix tree) here; we don't have
> to do LPM since the lengths of the keys are always fixed (either 4 or
> 16 bytes). Maybe a hash table is enough and simpler, and it is
> RCU-compatible.
Because a trie eliminates all of the issues of having to size a hash
table, dynamically resize it, etc.
A trie gives well-bounded performance dependent solely upon the size of
the table, rather than on access patterns, the distribution of keys, and
how good the hash function is.
We used to use a hash table for the page cache too, but now we use
a trie structure there as well, and that uses full long-word-sized
keys and the generic radix-tree code.
Using hash tables is really foolish for potentially large data sets.
* Re: inetpeer with create==0
From: Changli Gao @ 2011-03-03 6:27 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Thu, Mar 3, 2011 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
>
> Because a trie eliminates all of the issues of having to size a hash
> table, dynamically resize it, etc.
>
> A trie gives well-bounded performance dependent solely upon the size of
> the table, rather than on access patterns, the distribution of keys, and
> how good the hash function is.
Thanks for your explanation. The routing cache has all of these issues. :)
>
> We used to use a hash table for the page cache too, but now we use
> a trie structure there as well, and that uses full long-word-sized
> keys and the generic radix-tree code.
>
> Using hash tables is really foolish for potentially large data sets.
>
However, I don't agree with you on every aspect.
A radix tree may cost much more memory than an rbtree, AVL tree or hash
table. Here is a case of switching from a radix tree to an rbtree:
http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=8549164143a5431f9d9ea846acaa35a862410d9c
Since the keys of inetpeer depend on the access pattern, an attacker may
mount an OOM DoS attack using sparse keys. We can't simply limit the
number of inetpeers to bound the memory cost of this subsystem; we also
have to count the memory used for the inner branches of the radix tree.
Nevertheless, an attacker could then also make the OS inefficient by
effectively reducing the maximum number of usable inetpeers.
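To put rough, purely illustrative numbers on that concern (the parameters
here are assumptions, not measurements of any real implementation): a radix
tree over 128-bit keys with a fanout of 16 slots per node needs 128 / 4 = 32
levels, and each interior node holds 16 pointers, i.e. at least 128 bytes on
a 64-bit machine. A maximally sparse key that shares no path with any other
entry can therefore pin on the order of 32 * 128 = 4 KB of interior nodes
for a single entry that is itself far smaller, and an attacker choosing
sparse keys gets close to this worst case for every entry he causes to be
inserted.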
A hash table + jhash has proven to be a safe and efficient data
structure for large data sets (conntrack and IPVS), although the size
of the hash table may have to be adjusted by an administrator.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
* Re: inetpeer with create==0
From: David Miller @ 2011-03-03 6:42 UTC (permalink / raw)
To: xiaosuo; +Cc: eric.dumazet, netdev
From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 3 Mar 2011 14:27:40 +0800
> On Thu, Mar 3, 2011 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
>>
>> Because a trie eliminates all of the issues of having to size a hash
>> table, dynamically resize it, etc.
>>
>> A trie gives well-bounded performance dependent solely upon the size of
>> the table, rather than on access patterns, the distribution of keys, and
>> how good the hash function is.
>
> Thanks for your explanation. The routing cache has all of these issues. :)
That's why I am working hard to remove it :-)
The 99th-percentile performance of our routing cache is absolutely
terrible. The routing cache does nothing except get in the way.
It is the reason why Linux is still completely unsuitable for use as a
router on the core internet.
> A radix tree may cost much more memory than an rbtree, AVL tree or hash
> table. Here is a case of switching from a radix tree to an rbtree:
> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=8549164143a5431f9d9ea846acaa35a862410d9c
I would not be against the use of an rbtree or similar if it could be
done with fully RCU non-locked lookups even in the not-present
case.
> A hash table + jhash has proven to be a safe and efficient data
> structure for large data sets (conntrack and IPVS), although the size
> of the hash table may have to be adjusted by an administrator.
If anything is proven, it is that hash-based collections of cached
information are nothing but trouble when the contents are controlled in
some way by external entities.
Actually, back to the original topic, I wonder how bad it is to simply
elide the recheck in the create==0 case anyways. Except for the ipv4
fragmentation wraparound protection values, perfect inetpeer finding
is not necessary for correctness. And IPv4 fragmentation always calls
inetpeer with create!=0.
Eric?
* Re: inetpeer with create==0
From: Eric Dumazet @ 2011-03-03 6:51 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Wednesday, 02 March 2011 at 20:45 -0800, David Miller wrote:
> Eric, I was profiling the non-routing-cache case and something that stuck
> out is the case of calling inet_getpeer() with create==0.
>
> If an entry is not found, we have to redo the lookup under a spinlock
> to make certain that a concurrent writer rebalancing the tree does
> not "hide" an existing entry from us.
>
> This makes the case of a create==0 lookup for a not-present entry
> really expensive. It is on the order of 600 cpu cycles on my
> Niagara2.
>
Well, your test assumes all data is already in CPU caches?
I'll take a look, but my reasoning was that the real cost in a DDoS
situation is bringing data into the caches. With a depth of 20, cache-miss
costs are the problem.
The second lookup was basically free.
> I added a hack to not do the relookup under the lock when create==0
> and it now costs less than 300 cycles.
>
Hmm... I'm curious, could you send this hack to me? ;)
> This is now a pretty common operation with the way we handle COW'd
> metrics, so I think it's definitely worth optimizing.
>
> I looked at the generic radix tree implementation, and it supports
> full RCU lookups in parallel with insert/delete. It handles the race
> case without the relookup under lock because it creates fixed paths
> to "slots" where nodes live using shifts and masks. So if a path
> to a slot ever existed, it will always exist.
>
> Take a look at lib/radix-tree.c and include/linux/radix-tree.h if
> you are curious.
>
> I think we should do something similar for inetpeer. Currently we
> cannot just use the existing generic radix-tree code because it only
> supports indexes as large as "unsigned long" and we need to handle
> 128-bit ipv6 addresses.
* Re: inetpeer with create==0
From: Eric Dumazet @ 2011-03-03 7:39 UTC (permalink / raw)
To: David Miller; +Cc: xiaosuo, netdev
On Wednesday, 02 March 2011 at 22:42 -0800, David Miller wrote:
> Actually, back to the original topic, I wonder how bad it is to simply
> elide the recheck in the create==0 case anyways. Except for the ipv4
> fragmentation wraparound protection values, perfect inetpeer finding
> is not necessary for correctness. And IPv4 fragmentation always calls
> inetpeer with create!=0.
>
We could use a seqlock to detect that a writer might have changed
things while we did our RCU lookup?
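A minimal sketch of what the read side of that could look like (the seqlock
calls are the real <linux/seqlock.h> API; the surrounding structures and
lookup helpers are only placeholders, not the actual inetpeer names):

/* Sketch: lockless lookup inside a seqlock read section; fall back to
 * the locked relookup only when the lookup missed *and* a writer was
 * active. struct peer_base / lookup_rcu() / lookup_locked() are
 * placeholders; refcounting is omitted.
 */
struct peer *peer_find(struct peer_base *base, const struct peer_key *key,
		       int create)
{
	struct peer *p;
	unsigned int seq;
	int invalidated;

	rcu_read_lock_bh();
	seq = read_seqbegin(&base->lock);		/* base->lock is a seqlock_t */
	p = lookup_rcu(base, key);
	invalidated = read_seqretry(&base->lock, seq);	/* did a writer run meanwhile? */
	rcu_read_unlock_bh();

	if (p)
		return p;
	if (!create && !invalidated)
		return NULL;	/* miss with no concurrent writer: trust the result */

	write_seqlock_bh(&base->lock);	/* slow path: locked relookup, maybe insert */
	p = lookup_locked(base, key);
	/* ... allocate and insert a new entry here when create != 0 ... */
	write_sequnlock_bh(&base->lock);
	return p;
}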
* Re: inetpeer with create==0
From: Changli Gao @ 2011-03-03 8:07 UTC (permalink / raw)
To: David Miller; +Cc: eric.dumazet, netdev
On Thu, Mar 3, 2011 at 2:42 PM, David Miller <davem@davemloft.net> wrote:
>
> That's why I am working hard to remove it :-)
>
> The 99th-percentile performance of our routing cache is absolutely
> terrible. The routing cache does nothing except get in the way.
>
> It is the reason why Linux is still completely unsuitable for use as a
> router on the core internet.
>
After the routing cache is removed entirely, Linux will still be unsuitable
for use as a router on the core internet, because it isn't a realtime OS
and the forwarding delay isn't bounded. With the routing cache, the memory
cost isn't bounded either. :)
--
Regards,
Changli Gao(xiaosuo@gmail.com)
* Re: inetpeer with create==0
From: David Miller @ 2011-03-03 8:30 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 Mar 2011 07:51:47 +0100
> Hmm... I'm curious, could you send this hack to me? ;)
Ok, it's so simple :-)
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 48f8d45..a194e91 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -492,6 +492,8 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
unlink_from_unused(p);
return p;
}
+ if (!create)
+ return NULL;
/* retry an exact lookup, taking the lock before.
* At least, nodes should be hot in our cache.
@@ -505,7 +507,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
unlink_from_unused(p);
return p;
}
- p = create ? kmem_cache_alloc(peer_cachep, GFP_ATOMIC) : NULL;
+ p = kmem_cache_alloc(peer_cachep, GFP_ATOMIC);
if (p) {
p->daddr = *daddr;
atomic_set(&p->refcnt, 1);
* Re: inetpeer with create==0
From: David Miller @ 2011-03-03 8:32 UTC (permalink / raw)
To: eric.dumazet; +Cc: xiaosuo, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 Mar 2011 08:39:37 +0100
> On Wednesday, 02 March 2011 at 22:42 -0800, David Miller wrote:
>> Actually, back to the original topic, I wonder how bad it is to simply
>> elide the recheck in the create==0 case anyways. Except for the ipv4
>> fragmentation wraparound protection values, perfect inetpeer finding
>> is not necessary for correctness. And IPv4 fragmentation always calls
>> inetpeer with create!=0.
>
> We could use a seqlock to detect that a writer might have changed
> things while we did our RCU lookup?
That would certainly work.
* Re: inetpeer with create==0
From: David Miller @ 2011-03-03 8:34 UTC (permalink / raw)
To: xiaosuo; +Cc: eric.dumazet, netdev
From: Changli Gao <xiaosuo@gmail.com>
Date: Thu, 3 Mar 2011 16:07:32 +0800
> After the routing cache is removed entirely, Linux will still be unsuitable
> for use as a router on the core internet, because it isn't a realtime OS
> and the forwarding delay isn't bounded. With the routing cache, the memory
> cost isn't bounded either. :)
With many cores, the pipe can be filled without problem, because on a
multi-core machine the routing lookup proceeds much like a Cisco router
does it in hardware. I wrote about this on my blog a few years ago.
* [PATCH net-next-2.6] inetpeer: seqlock optimization
From: Eric Dumazet @ 2011-03-04 15:09 UTC (permalink / raw)
To: David Miller; +Cc: xiaosuo, netdev
On Thursday, 03 March 2011 at 00:32 -0800, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 03 Mar 2011 08:39:37 +0100
>
> > On Wednesday, 02 March 2011 at 22:42 -0800, David Miller wrote:
> >> Actually, back to the original topic, I wonder how bad it is to simply
> >> elide the recheck in the create==0 case anyways. Except for the ipv4
> >> fragmentation wraparound protection values, perfect inetpeer finding
> >> is not necessary for correctness. And IPv4 fragmentation always calls
> >> inetpeer with create!=0.
> >
> > We could use a seqlock to detect that a writer might have changed
> > things while we did our RCU lookup?
>
> That would certainly work.
Here is a patch to implement this idea.
Thanks!
[PATCH net-next-2.6] inetpeer: seqlock optimization
David noticed :
------------------
Eric, I was profiling the non-routing-cache case and something that
stuck out is the case of calling inet_getpeer() with create==0.
If an entry is not found, we have to redo the lookup under a spinlock
to make certain that a concurrent writer rebalancing the tree does
not "hide" an existing entry from us.
This makes the case of a create==0 lookup for a not-present entry
really expensive. It is on the order of 600 cpu cycles on my
Niagara2.
I added a hack to not do the relookup under the lock when create==0
and it now costs less than 300 cycles.
This is now a pretty common operation with the way we handle COW'd
metrics, so I think it's definitely worth optimizing.
-----------------
One solution is to use a seqlock instead of a spinlock to protect struct
inet_peer_base.
After a failed avl tree lookup, we can easily detect if a writer did
some changes during our lookup. Taking the lock and redoing the lookup is
only necessary in this case.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/ipv4/inetpeer.c | 24 ++++++++++++++++--------
1 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 48f8d45..7fd9fab 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -81,19 +81,19 @@ static const struct inet_peer peer_fake_node = {
struct inet_peer_base {
struct inet_peer __rcu *root;
- spinlock_t lock;
+ seqlock_t lock;
int total;
};
static struct inet_peer_base v4_peers = {
.root = peer_avl_empty_rcu,
- .lock = __SPIN_LOCK_UNLOCKED(v4_peers.lock),
+ .lock = __SEQLOCK_UNLOCKED(v4_peers.lock),
.total = 0,
};
static struct inet_peer_base v6_peers = {
.root = peer_avl_empty_rcu,
- .lock = __SPIN_LOCK_UNLOCKED(v6_peers.lock),
+ .lock = __SEQLOCK_UNLOCKED(v6_peers.lock),
.total = 0,
};
@@ -372,7 +372,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base)
do_free = 0;
- spin_lock_bh(&base->lock);
+ write_seqlock_bh(&base->lock);
/* Check the reference counter. It was artificially incremented by 1
* in cleanup() function to prevent sudden disappearing. If we can
* atomically (because of lockless readers) take this last reference,
@@ -409,7 +409,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base)
base->total--;
do_free = 1;
}
- spin_unlock_bh(&base->lock);
+ write_sequnlock_bh(&base->lock);
if (do_free)
call_rcu_bh(&p->rcu, inetpeer_free_rcu);
@@ -477,12 +477,16 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
struct inet_peer __rcu **stack[PEER_MAXDEPTH], ***stackptr;
struct inet_peer_base *base = family_to_base(daddr->family);
struct inet_peer *p;
+ unsigned int sequence;
+ int invalidated;
/* Look up for the address quickly, lockless.
* Because of a concurrent writer, we might not find an existing entry.
*/
rcu_read_lock_bh();
+ sequence = read_seqbegin(&base->lock);
p = lookup_rcu_bh(daddr, base);
+ invalidated = read_seqretry(&base->lock, sequence);
rcu_read_unlock_bh();
if (p) {
@@ -493,14 +497,18 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
return p;
}
+ /* If no writer did a change during our lookup, we can return early. */
+ if (!create && !invalidated)
+ return NULL;
+
/* retry an exact lookup, taking the lock before.
* At least, nodes should be hot in our cache.
*/
- spin_lock_bh(&base->lock);
+ write_seqlock_bh(&base->lock);
p = lookup(daddr, stack, base);
if (p != peer_avl_empty) {
atomic_inc(&p->refcnt);
- spin_unlock_bh(&base->lock);
+ write_sequnlock_bh(&base->lock);
/* Remove the entry from unused list if it was there. */
unlink_from_unused(p);
return p;
@@ -524,7 +532,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
link_to_pool(p, base);
base->total++;
}
- spin_unlock_bh(&base->lock);
+ write_sequnlock_bh(&base->lock);
if (base->total >= inet_peer_threshold)
/* Remove one less-recently-used entry. */
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization
From: David Miller @ 2011-03-04 19:17 UTC (permalink / raw)
To: eric.dumazet; +Cc: xiaosuo, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 04 Mar 2011 16:09:08 +0100
> Here is a patch to implement this idea.
Applied, thanks Eric!
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization
From: David Miller @ 2011-03-04 20:45 UTC (permalink / raw)
To: eric.dumazet; +Cc: xiaosuo, netdev
From: David Miller <davem@davemloft.net>
Date: Fri, 04 Mar 2011 11:17:05 -0800 (PST)
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 04 Mar 2011 16:09:08 +0100
>
>> Here is a patch to implement this idea.
>
> Applied, thanks Eric!
Unfortunately, I have to revert; the lockdep annotations need to
be updated:
net/ipv4/inetpeer.c: In function ‘peer_avl_rebalance’:
net/ipv4/inetpeer.c:274:10: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:276:7: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:278:7: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:285:9: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:287:9: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:299:11: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:301:11: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:317:9: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:319:9: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:331:11: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:333:11: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c: In function ‘unlink_from_pool’:
net/ipv4/inetpeer.c:385:7: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:385:7: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:394:8: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:394:8: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:395:4: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c: In function ‘inet_getpeer’:
net/ipv4/inetpeer.c:508:6: error: ‘seqlock_t’ has no member named ‘dep_map’
net/ipv4/inetpeer.c:508:6: error: ‘seqlock_t’ has no member named ‘dep_map’
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization
From: Eric Dumazet @ 2011-03-04 22:13 UTC (permalink / raw)
To: David Miller; +Cc: xiaosuo, netdev
On Friday, 04 March 2011 at 12:45 -0800, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Fri, 04 Mar 2011 11:17:05 -0800 (PST)
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Fri, 04 Mar 2011 16:09:08 +0100
> >
> >> Here is a patch to implement this idea.
> >
> > Applied, thanks Eric!
>
> Unfortunately, I have to revert; the lockdep annotations need to
> be updated:
>
> net/ipv4/inetpeer.c: In function ‘peer_avl_rebalance’:
> net/ipv4/inetpeer.c:274:10: error: ‘seqlock_t’ has no member named ‘dep_map’
Oops, that's right; here is an updated version.
Thanks
[PATCH net-next-2.6] inetpeer: seqlock optimization
David noticed :
------------------
Eric, I was profiling the non-routing-cache case and something that
stuck out is the case of calling inet_getpeer() with create==0.
If an entry is not found, we have to redo the lookup under a spinlock
to make certain that a concurrent writer rebalancing the tree does
not "hide" an existing entry from us.
This makes the case of a create==0 lookup for a not-present entry
really expensive. It is on the order of 600 cpu cycles on my
Niagara2.
I added a hack to not do the relookup under the lock when create==0
and it now costs less than 300 cycles.
This is now a pretty common operation with the way we handle COW'd
metrics, so I think it's definitely worth optimizing.
-----------------
One solution is to use a seqlock instead of a spinlock to protect struct
inet_peer_base.
After a failed avl tree lookup, we can easily detect if a writer did
some changes during our lookup. Taking the lock and redoing the lookup is
only necessary in this case.
Note: add one private rcu_deref_locked() macro to centralize in one spot
the access to the spinlock embedded in the seqlock.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
V2: updated lockdep annotations
net/ipv4/inetpeer.c | 75 +++++++++++++++++++-----------------------
1 files changed, 35 insertions(+), 40 deletions(-)
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 48f8d45..f604ffd 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -81,19 +81,19 @@ static const struct inet_peer peer_fake_node = {
struct inet_peer_base {
struct inet_peer __rcu *root;
- spinlock_t lock;
+ seqlock_t lock;
int total;
};
static struct inet_peer_base v4_peers = {
.root = peer_avl_empty_rcu,
- .lock = __SPIN_LOCK_UNLOCKED(v4_peers.lock),
+ .lock = __SEQLOCK_UNLOCKED(v4_peers.lock),
.total = 0,
};
static struct inet_peer_base v6_peers = {
.root = peer_avl_empty_rcu,
- .lock = __SPIN_LOCK_UNLOCKED(v6_peers.lock),
+ .lock = __SEQLOCK_UNLOCKED(v6_peers.lock),
.total = 0,
};
@@ -177,6 +177,9 @@ static int addr_compare(const struct inetpeer_addr *a,
return 0;
}
+#define rcu_deref_locked(X, BASE) \
+ rcu_dereference_protected(X, lockdep_is_held(&(BASE)->lock.lock))
+
/*
* Called with local BH disabled and the pool lock held.
*/
@@ -187,8 +190,7 @@ static int addr_compare(const struct inetpeer_addr *a,
\
stackptr = _stack; \
*stackptr++ = &_base->root; \
- for (u = rcu_dereference_protected(_base->root, \
- lockdep_is_held(&_base->lock)); \
+ for (u = rcu_deref_locked(_base->root, _base); \
u != peer_avl_empty; ) { \
int cmp = addr_compare(_daddr, &u->daddr); \
if (cmp == 0) \
@@ -198,8 +200,7 @@ static int addr_compare(const struct inetpeer_addr *a,
else \
v = &u->avl_right; \
*stackptr++ = v; \
- u = rcu_dereference_protected(*v, \
- lockdep_is_held(&_base->lock)); \
+ u = rcu_deref_locked(*v, _base); \
} \
u; \
})
@@ -246,13 +247,11 @@ static struct inet_peer *lookup_rcu_bh(const struct inetpeer_addr *daddr,
struct inet_peer __rcu **v; \
*stackptr++ = &start->avl_left; \
v = &start->avl_left; \
- for (u = rcu_dereference_protected(*v, \
- lockdep_is_held(&base->lock)); \
+ for (u = rcu_deref_locked(*v, base); \
u->avl_right != peer_avl_empty_rcu; ) { \
v = &u->avl_right; \
*stackptr++ = v; \
- u = rcu_dereference_protected(*v, \
- lockdep_is_held(&base->lock)); \
+ u = rcu_deref_locked(*v, base); \
} \
u; \
})
@@ -271,21 +270,16 @@ static void peer_avl_rebalance(struct inet_peer __rcu **stack[],
while (stackend > stack) {
nodep = *--stackend;
- node = rcu_dereference_protected(*nodep,
- lockdep_is_held(&base->lock));
- l = rcu_dereference_protected(node->avl_left,
- lockdep_is_held(&base->lock));
- r = rcu_dereference_protected(node->avl_right,
- lockdep_is_held(&base->lock));
+ node = rcu_deref_locked(*nodep, base);
+ l = rcu_deref_locked(node->avl_left, base);
+ r = rcu_deref_locked(node->avl_right, base);
lh = node_height(l);
rh = node_height(r);
if (lh > rh + 1) { /* l: RH+2 */
struct inet_peer *ll, *lr, *lrl, *lrr;
int lrh;
- ll = rcu_dereference_protected(l->avl_left,
- lockdep_is_held(&base->lock));
- lr = rcu_dereference_protected(l->avl_right,
- lockdep_is_held(&base->lock));
+ ll = rcu_deref_locked(l->avl_left, base);
+ lr = rcu_deref_locked(l->avl_right, base);
lrh = node_height(lr);
if (lrh <= node_height(ll)) { /* ll: RH+1 */
RCU_INIT_POINTER(node->avl_left, lr); /* lr: RH or RH+1 */
@@ -296,10 +290,8 @@ static void peer_avl_rebalance(struct inet_peer __rcu **stack[],
l->avl_height = node->avl_height + 1;
RCU_INIT_POINTER(*nodep, l);
} else { /* ll: RH, lr: RH+1 */
- lrl = rcu_dereference_protected(lr->avl_left,
- lockdep_is_held(&base->lock)); /* lrl: RH or RH-1 */
- lrr = rcu_dereference_protected(lr->avl_right,
- lockdep_is_held(&base->lock)); /* lrr: RH or RH-1 */
+ lrl = rcu_deref_locked(lr->avl_left, base);/* lrl: RH or RH-1 */
+ lrr = rcu_deref_locked(lr->avl_right, base);/* lrr: RH or RH-1 */
RCU_INIT_POINTER(node->avl_left, lrr); /* lrr: RH or RH-1 */
RCU_INIT_POINTER(node->avl_right, r); /* r: RH */
node->avl_height = rh + 1; /* node: RH+1 */
@@ -314,10 +306,8 @@ static void peer_avl_rebalance(struct inet_peer __rcu **stack[],
} else if (rh > lh + 1) { /* r: LH+2 */
struct inet_peer *rr, *rl, *rlr, *rll;
int rlh;
- rr = rcu_dereference_protected(r->avl_right,
- lockdep_is_held(&base->lock));
- rl = rcu_dereference_protected(r->avl_left,
- lockdep_is_held(&base->lock));
+ rr = rcu_deref_locked(r->avl_right, base);
+ rl = rcu_deref_locked(r->avl_left, base);
rlh = node_height(rl);
if (rlh <= node_height(rr)) { /* rr: LH+1 */
RCU_INIT_POINTER(node->avl_right, rl); /* rl: LH or LH+1 */
@@ -328,10 +318,8 @@ static void peer_avl_rebalance(struct inet_peer __rcu **stack[],
r->avl_height = node->avl_height + 1;
RCU_INIT_POINTER(*nodep, r);
} else { /* rr: RH, rl: RH+1 */
- rlr = rcu_dereference_protected(rl->avl_right,
- lockdep_is_held(&base->lock)); /* rlr: LH or LH-1 */
- rll = rcu_dereference_protected(rl->avl_left,
- lockdep_is_held(&base->lock)); /* rll: LH or LH-1 */
+ rlr = rcu_deref_locked(rl->avl_right, base);/* rlr: LH or LH-1 */
+ rll = rcu_deref_locked(rl->avl_left, base);/* rll: LH or LH-1 */
RCU_INIT_POINTER(node->avl_right, rll); /* rll: LH or LH-1 */
RCU_INIT_POINTER(node->avl_left, l); /* l: LH */
node->avl_height = lh + 1; /* node: LH+1 */
@@ -372,7 +360,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base)
do_free = 0;
- spin_lock_bh(&base->lock);
+ write_seqlock_bh(&base->lock);
/* Check the reference counter. It was artificially incremented by 1
* in cleanup() function to prevent sudden disappearing. If we can
* atomically (because of lockless readers) take this last reference,
@@ -392,8 +380,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base)
/* look for a node to insert instead of p */
struct inet_peer *t;
t = lookup_rightempty(p, base);
- BUG_ON(rcu_dereference_protected(*stackptr[-1],
- lockdep_is_held(&base->lock)) != t);
+ BUG_ON(rcu_deref_locked(*stackptr[-1], base) != t);
**--stackptr = t->avl_left;
/* t is removed, t->daddr > x->daddr for any
* x in p->avl_left subtree.
@@ -409,7 +396,7 @@ static void unlink_from_pool(struct inet_peer *p, struct inet_peer_base *base)
base->total--;
do_free = 1;
}
- spin_unlock_bh(&base->lock);
+ write_sequnlock_bh(&base->lock);
if (do_free)
call_rcu_bh(&p->rcu, inetpeer_free_rcu);
@@ -477,12 +464,16 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
struct inet_peer __rcu **stack[PEER_MAXDEPTH], ***stackptr;
struct inet_peer_base *base = family_to_base(daddr->family);
struct inet_peer *p;
+ unsigned int sequence;
+ int invalidated;
/* Look up for the address quickly, lockless.
* Because of a concurrent writer, we might not find an existing entry.
*/
rcu_read_lock_bh();
+ sequence = read_seqbegin(&base->lock);
p = lookup_rcu_bh(daddr, base);
+ invalidated = read_seqretry(&base->lock, sequence);
rcu_read_unlock_bh();
if (p) {
@@ -493,14 +484,18 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
return p;
}
+ /* If no writer did a change during our lookup, we can return early. */
+ if (!create && !invalidated)
+ return NULL;
+
/* retry an exact lookup, taking the lock before.
* At least, nodes should be hot in our cache.
*/
- spin_lock_bh(&base->lock);
+ write_seqlock_bh(&base->lock);
p = lookup(daddr, stack, base);
if (p != peer_avl_empty) {
atomic_inc(&p->refcnt);
- spin_unlock_bh(&base->lock);
+ write_sequnlock_bh(&base->lock);
/* Remove the entry from unused list if it was there. */
unlink_from_unused(p);
return p;
@@ -524,7 +519,7 @@ struct inet_peer *inet_getpeer(struct inetpeer_addr *daddr, int create)
link_to_pool(p, base);
base->total++;
}
- spin_unlock_bh(&base->lock);
+ write_sequnlock_bh(&base->lock);
if (base->total >= inet_peer_threshold)
/* Remove one less-recently-used entry. */
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization
From: David Miller @ 2011-03-04 22:44 UTC (permalink / raw)
To: eric.dumazet; +Cc: xiaosuo, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 04 Mar 2011 23:13:59 +0100
> On Friday, 04 March 2011 at 12:45 -0800, David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Fri, 04 Mar 2011 11:17:05 -0800 (PST)
>>
>> > From: Eric Dumazet <eric.dumazet@gmail.com>
>> > Date: Fri, 04 Mar 2011 16:09:08 +0100
>> >
>> >> Here is a patch to implement this idea.
>> >
>> > Applied, thanks Eric!
>>
>> Unfortunately, I have to revert; the lockdep annotations need to
>> be updated:
>>
>> net/ipv4/inetpeer.c: In function ‘peer_avl_rebalance’:
>> net/ipv4/inetpeer.c:274:10: error: ‘seqlock_t’ has no member named ‘dep_map’
>
> Oops, that's right; here is an updated version.
Applied, thanks Eric!
With this and the following patch applied to my no-routing-cache tree,
output route lookup on my Niagara2 is down to 2966 cycles! For reference
with just the plain routing cache removal, it was as much as 3832 cycles.
udpflood is a lot faster too; with plain routing cache removal it ran as:
bash$ time ./bin/udpflood -l 10000000 10.2.2.11
real 3m9.921s
user 0m9.520s
sys 3m0.440s
But now it's:
bash$ time ./bin/udpflood -l 10000000 10.2.2.11
real 2m45.903s
user 0m8.640s
sys 2m37.280s
:-)
--------------------
ipv4: Optimize flow initialization in output route lookup.
We burn a lot of useless cycles, cpu store buffer traffic, and
memory operations memset()'ing the on-stack flow used to perform
output route lookups in __ip_route_output_key().
Only the first half of the flow object members even matter for
output route lookups in this context, specifically:
FIB rules matching cares about:
dst, src, tos, iif, oif, mark
FIB trie lookup cares about:
dst
FIB semantic match cares about:
tos, scope, oif
Therefore only initialize these specific members and elide the
memset entirely.
On Niagara2 this kills about ~300 cycles from the output route
lookup path.
Likely, we can take things further, since all callers of output
route lookups essentially throw away the on-stack flow they use.
So they don't care if we use it as a scratch-pad to compute the
final flow key.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/ipv4/route.c | 18 ++++++++++--------
1 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 04b8954..e3a5a89 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1670,14 +1670,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
struct rtable *__ip_route_output_key(struct net *net, const struct flowi *oldflp)
{
u32 tos = RT_FL_TOS(oldflp);
- struct flowi fl = { .fl4_dst = oldflp->fl4_dst,
- .fl4_src = oldflp->fl4_src,
- .fl4_tos = tos & IPTOS_RT_MASK,
- .fl4_scope = ((tos & RTO_ONLINK) ?
- RT_SCOPE_LINK : RT_SCOPE_UNIVERSE),
- .mark = oldflp->mark,
- .iif = net->loopback_dev->ifindex,
- .oif = oldflp->oif };
+ struct flowi fl;
struct fib_result res;
unsigned int flags = 0;
struct net_device *dev_out = NULL;
@@ -1688,6 +1681,15 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi *oldflp
res.r = NULL;
#endif
+ fl.oif = oldflp->oif;
+ fl.iif = net->loopback_dev->ifindex;
+ fl.mark = oldflp->mark;
+ fl.fl4_dst = oldflp->fl4_dst;
+ fl.fl4_src = oldflp->fl4_src;
+ fl.fl4_tos = tos & IPTOS_RT_MASK;
+ fl.fl4_scope = ((tos & RTO_ONLINK) ?
+ RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
+
rcu_read_lock();
if (oldflp->fl4_src) {
rth = ERR_PTR(-EINVAL);
--
1.7.4.1
* Re: [PATCH net-next-2.6] inetpeer: seqlock optimization
From: Eric Dumazet @ 2011-03-04 23:02 UTC (permalink / raw)
To: David Miller; +Cc: xiaosuo, netdev
On Friday, 04 March 2011 at 14:44 -0800, David Miller wrote:
> Applied, thanks Eric!
>
> With this and the following patch applied to my no-routing-cache tree,
> output route lookup on my Niagara2 is down to 2966 cycles! For reference
> with just the plain routing cache removal, it was as much as 3832 cycles.
>
> udpflood is a lot faster too, with plain routing cache removal it ran as:
>
> bash$ time ./bin/udpflood -l 10000000 10.2.2.11
> real 3m9.921s
> user 0m9.520s
> sys 3m0.440s
>
> But now it's:
>
> bash$ time ./bin/udpflood -l 10000000 10.2.2.11
> real 2m45.903s
> user 0m8.640s
> sys 2m37.280s
>
> :-)
>
Nice indeed :)
> --------------------
> ipv4: Optimize flow initialization in output route lookup.
>
> We burn a lot of useless cycles, cpu store buffer traffic, and
> memory operations memset()'ing the on-stack flow used to perform
> output route lookups in __ip_route_output_key().
>
> Only the first half of the flow object members even matter for
> output route lookups in this context, specifically:
>
> FIB rules matching cares about:
>
> dst, src, tos, iif, oif, mark
>
> FIB trie lookup cares about:
>
> dst
>
> FIB semantic match cares about:
>
> tos, scope, oif
>
> Therefore only initialize these specific members and elide the
> memset entirely.
>
> On Niagara2 this kills about ~300 cycles from the output route
> lookup path.
>
> Likely, we can take things further, since all callers of output
> route lookups essentially throw away the on-stack flow they use.
> So they don't care if we use it as a scratch-pad to compute the
> final flow key.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> net/ipv4/route.c | 18 ++++++++++--------
> 1 files changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 04b8954..e3a5a89 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1670,14 +1670,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
> struct rtable *__ip_route_output_key(struct net *net, const struct flowi *oldflp)
> {
> u32 tos = RT_FL_TOS(oldflp);
> - struct flowi fl = { .fl4_dst = oldflp->fl4_dst,
> - .fl4_src = oldflp->fl4_src,
> - .fl4_tos = tos & IPTOS_RT_MASK,
> - .fl4_scope = ((tos & RTO_ONLINK) ?
> - RT_SCOPE_LINK : RT_SCOPE_UNIVERSE),
> - .mark = oldflp->mark,
> - .iif = net->loopback_dev->ifindex,
> - .oif = oldflp->oif };
> + struct flowi fl;
> struct fib_result res;
> unsigned int flags = 0;
> struct net_device *dev_out = NULL;
> @@ -1688,6 +1681,15 @@ struct rtable *__ip_route_output_key(struct net *net, const struct flowi *oldflp
> res.r = NULL;
> #endif
>
> + fl.oif = oldflp->oif;
> + fl.iif = net->loopback_dev->ifindex;
> + fl.mark = oldflp->mark;
> + fl.fl4_dst = oldflp->fl4_dst;
> + fl.fl4_src = oldflp->fl4_src;
> + fl.fl4_tos = tos & IPTOS_RT_MASK;
> + fl.fl4_scope = ((tos & RTO_ONLINK) ?
> + RT_SCOPE_LINK : RT_SCOPE_UNIVERSE);
> +
> rcu_read_lock();
> if (oldflp->fl4_src) {
> rth = ERR_PTR(-EINVAL);
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>