* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Eric Dumazet @ 2011-02-19 9:22 UTC (permalink / raw)
To: Jiri Pirko
Cc: Jay Vosburgh, David Miller, kaber, netdev, shemminger,
nicolas.2p.debian, andy
In-Reply-To: <20110219085816.GC2782@psychotron.redhat.com>
Le samedi 19 février 2011 à 09:58 +0100, Jiri Pirko a écrit :
> Sat, Feb 19, 2011 at 09:37:55AM CET, eric.dumazet@gmail.com wrote:
> >Le samedi 19 février 2011 à 09:05 +0100, Jiri Pirko a écrit :
> >> This patch converts bonding to use rx_handler. Results in cleaner
> >> __netif_receive_skb() with much less exceptions needed. Also
> >> bond-specific work is moved into bond code.
> >>
> >> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> >>
> >> v1->v2:
> >> using skb_iif instead of new input_dev to remember original
> >> device
> >> v2->v3:
> >> set orig_dev = skb->dev if skb_iif is set
> >>
> >
> >Seems much better ;)
> >
> >Do you have some performance numbers ?
>
> I don't. I can surely obtain some. What's the best way to measure this?
>
Hmm, since its receive path :
Two machines, one sending (pktgen) a flood, one receiving it and
check/count how many frames hit destination, before/after patch.
^ permalink raw reply
* [PATCH v2] ipvs: unify the formula to estimate the overhead of processing connections
From: Changli Gao @ 2011-02-19 9:32 UTC (permalink / raw)
To: Simon Horman
Cc: David S. Miller, Patrick McHardy, Wensong Zhang, Julian Anastasov,
netdev, lvs-devel, netfilter-devel, Changli Gao
lc and wlc use the same formula, but lblc and lblcr use another one. There
is no reason for using two different formulas for the lc variants.
The formula used by lc is used by all the lc variants in this patch.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
v2: use ip_vs_dest_conn_overhead() instead.
include/net/ip_vs.h | 14 ++++++++++++++
net/netfilter/ipvs/ip_vs_lblc.c | 13 +++----------
net/netfilter/ipvs/ip_vs_lblcr.c | 25 +++++++------------------
net/netfilter/ipvs/ip_vs_lc.c | 18 +-----------------
net/netfilter/ipvs/ip_vs_wlc.c | 20 ++------------------
5 files changed, 27 insertions(+), 63 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 5d75fea..e80ffb7 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1241,6 +1241,20 @@ static inline void ip_vs_conn_drop_conntrack(struct ip_vs_conn *cp)
/* CONFIG_IP_VS_NFCT */
#endif
+static inline unsigned int
+ip_vs_dest_conn_overhead(struct ip_vs_dest *dest)
+{
+ /*
+ * We think the overhead of processing active connections is 256
+ * times higher than that of inactive connections in average. (This
+ * 256 times might not be accurate, we will change it later) We
+ * use the following formula to estimate the overhead now:
+ * dest->activeconns*256 + dest->inactconns
+ */
+ return (atomic_read(&dest->activeconns) << 8) +
+ atomic_read(&dest->inactconns);
+}
+
#endif /* __KERNEL__ */
#endif /* _NET_IP_VS_H */
diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index 00b5ffa..58ae403 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -389,12 +389,7 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc)
int loh, doh;
/*
- * We think the overhead of processing active connections is fifty
- * times higher than that of inactive connections in average. (This
- * fifty times might not be accurate, we will change it later.) We
- * use the following formula to estimate the overhead:
- * dest->activeconns*50 + dest->inactconns
- * and the load:
+ * We use the following formula to estimate the load:
* (dest overhead) / dest->weight
*
* Remember -- no floats in kernel mode!!!
@@ -410,8 +405,7 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc)
continue;
if (atomic_read(&dest->weight) > 0) {
least = dest;
- loh = atomic_read(&least->activeconns) * 50
- + atomic_read(&least->inactconns);
+ loh = ip_vs_dest_conn_overhead(least);
goto nextstage;
}
}
@@ -425,8 +419,7 @@ __ip_vs_lblc_schedule(struct ip_vs_service *svc)
if (dest->flags & IP_VS_DEST_F_OVERLOAD)
continue;
- doh = atomic_read(&dest->activeconns) * 50
- + atomic_read(&dest->inactconns);
+ doh = ip_vs_dest_conn_overhead(dest);
if (loh * atomic_read(&dest->weight) >
doh * atomic_read(&least->weight)) {
least = dest;
diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index bfa25f1..2ddefe8 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -178,8 +178,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_min(struct ip_vs_dest_set *set)
if ((atomic_read(&least->weight) > 0)
&& (least->flags & IP_VS_DEST_F_AVAILABLE)) {
- loh = atomic_read(&least->activeconns) * 50
- + atomic_read(&least->inactconns);
+ loh = ip_vs_dest_conn_overhead(least);
goto nextstage;
}
}
@@ -192,8 +191,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_min(struct ip_vs_dest_set *set)
if (dest->flags & IP_VS_DEST_F_OVERLOAD)
continue;
- doh = atomic_read(&dest->activeconns) * 50
- + atomic_read(&dest->inactconns);
+ doh = ip_vs_dest_conn_overhead(dest);
if ((loh * atomic_read(&dest->weight) >
doh * atomic_read(&least->weight))
&& (dest->flags & IP_VS_DEST_F_AVAILABLE)) {
@@ -228,8 +226,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_max(struct ip_vs_dest_set *set)
list_for_each_entry(e, &set->list, list) {
most = e->dest;
if (atomic_read(&most->weight) > 0) {
- moh = atomic_read(&most->activeconns) * 50
- + atomic_read(&most->inactconns);
+ moh = ip_vs_dest_conn_overhead(most);
goto nextstage;
}
}
@@ -239,8 +236,7 @@ static inline struct ip_vs_dest *ip_vs_dest_set_max(struct ip_vs_dest_set *set)
nextstage:
list_for_each_entry(e, &set->list, list) {
dest = e->dest;
- doh = atomic_read(&dest->activeconns) * 50
- + atomic_read(&dest->inactconns);
+ doh = ip_vs_dest_conn_overhead(dest);
/* moh/mw < doh/dw ==> moh*dw < doh*mw, where mw,dw>0 */
if ((moh * atomic_read(&dest->weight) <
doh * atomic_read(&most->weight))
@@ -563,12 +559,7 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc)
int loh, doh;
/*
- * We think the overhead of processing active connections is fifty
- * times higher than that of inactive connections in average. (This
- * fifty times might not be accurate, we will change it later.) We
- * use the following formula to estimate the overhead:
- * dest->activeconns*50 + dest->inactconns
- * and the load:
+ * We use the following formula to estimate the load:
* (dest overhead) / dest->weight
*
* Remember -- no floats in kernel mode!!!
@@ -585,8 +576,7 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc)
if (atomic_read(&dest->weight) > 0) {
least = dest;
- loh = atomic_read(&least->activeconns) * 50
- + atomic_read(&least->inactconns);
+ loh = ip_vs_dest_conn_overhead(least);
goto nextstage;
}
}
@@ -600,8 +590,7 @@ __ip_vs_lblcr_schedule(struct ip_vs_service *svc)
if (dest->flags & IP_VS_DEST_F_OVERLOAD)
continue;
- doh = atomic_read(&dest->activeconns) * 50
- + atomic_read(&dest->inactconns);
+ doh = ip_vs_dest_conn_overhead(dest);
if (loh * atomic_read(&dest->weight) >
doh * atomic_read(&least->weight)) {
least = dest;
diff --git a/net/netfilter/ipvs/ip_vs_lc.c b/net/netfilter/ipvs/ip_vs_lc.c
index 4f69db1..160cb80 100644
--- a/net/netfilter/ipvs/ip_vs_lc.c
+++ b/net/netfilter/ipvs/ip_vs_lc.c
@@ -22,22 +22,6 @@
#include <net/ip_vs.h>
-
-static inline unsigned int
-ip_vs_lc_dest_overhead(struct ip_vs_dest *dest)
-{
- /*
- * We think the overhead of processing active connections is 256
- * times higher than that of inactive connections in average. (This
- * 256 times might not be accurate, we will change it later) We
- * use the following formula to estimate the overhead now:
- * dest->activeconns*256 + dest->inactconns
- */
- return (atomic_read(&dest->activeconns) << 8) +
- atomic_read(&dest->inactconns);
-}
-
-
/*
* Least Connection scheduling
*/
@@ -62,7 +46,7 @@ ip_vs_lc_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
if ((dest->flags & IP_VS_DEST_F_OVERLOAD) ||
atomic_read(&dest->weight) == 0)
continue;
- doh = ip_vs_lc_dest_overhead(dest);
+ doh = ip_vs_dest_conn_overhead(dest);
if (!least || doh < loh) {
least = dest;
loh = doh;
diff --git a/net/netfilter/ipvs/ip_vs_wlc.c b/net/netfilter/ipvs/ip_vs_wlc.c
index bbddfdb..db751f5 100644
--- a/net/netfilter/ipvs/ip_vs_wlc.c
+++ b/net/netfilter/ipvs/ip_vs_wlc.c
@@ -27,22 +27,6 @@
#include <net/ip_vs.h>
-
-static inline unsigned int
-ip_vs_wlc_dest_overhead(struct ip_vs_dest *dest)
-{
- /*
- * We think the overhead of processing active connections is 256
- * times higher than that of inactive connections in average. (This
- * 256 times might not be accurate, we will change it later) We
- * use the following formula to estimate the overhead now:
- * dest->activeconns*256 + dest->inactconns
- */
- return (atomic_read(&dest->activeconns) << 8) +
- atomic_read(&dest->inactconns);
-}
-
^ permalink raw reply related
* [PATCH] ipvs: use hlist instead of list
From: Changli Gao @ 2011-02-19 10:05 UTC (permalink / raw)
To: Simon Horman
Cc: David S. Miller, Patrick McHardy, Wensong Zhang, Julian Anastasov,
netdev, lvs-devel, netfilter-devel, Changli Gao
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
include/net/ip_vs.h | 2 -
net/netfilter/ipvs/ip_vs_conn.c | 52 ++++++++++++++++++++++------------------
2 files changed, 30 insertions(+), 24 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index e80ffb7..2078a47 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -494,7 +494,7 @@ struct ip_vs_conn_param {
* IP_VS structure allocated for each dynamically scheduled connection
*/
struct ip_vs_conn {
- struct list_head c_list; /* hashed list heads */
+ struct hlist_node c_list; /* hashed list heads */
#ifdef CONFIG_NET_NS
struct net *net; /* Name space */
#endif
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 83233fe..9c2a517 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -59,7 +59,7 @@ static int ip_vs_conn_tab_mask __read_mostly;
/*
* Connection hash table: for input and output packets lookups of IPVS
*/
-static struct list_head *ip_vs_conn_tab __read_mostly;
+static struct hlist_head *ip_vs_conn_tab __read_mostly;
/* SLAB cache for IPVS connections */
static struct kmem_cache *ip_vs_conn_cachep __read_mostly;
@@ -201,7 +201,7 @@ static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
spin_lock(&cp->lock);
if (!(cp->flags & IP_VS_CONN_F_HASHED)) {
- list_add(&cp->c_list, &ip_vs_conn_tab[hash]);
+ hlist_add_head(&cp->c_list, &ip_vs_conn_tab[hash]);
cp->flags |= IP_VS_CONN_F_HASHED;
atomic_inc(&cp->refcnt);
ret = 1;
@@ -234,7 +234,7 @@ static inline int ip_vs_conn_unhash(struct ip_vs_conn *cp)
spin_lock(&cp->lock);
if (cp->flags & IP_VS_CONN_F_HASHED) {
- list_del(&cp->c_list);
+ hlist_del(&cp->c_list);
cp->flags &= ~IP_VS_CONN_F_HASHED;
atomic_dec(&cp->refcnt);
ret = 1;
@@ -259,12 +259,13 @@ __ip_vs_conn_in_get(const struct ip_vs_conn_param *p)
{
unsigned hash;
struct ip_vs_conn *cp;
+ struct hlist_node *n;
hash = ip_vs_conn_hashkey_param(p, false);
ct_read_lock(hash);
- list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+ hlist_for_each_entry(cp, n, &ip_vs_conn_tab[hash], c_list) {
if (cp->af == p->af &&
p->cport == cp->cport && p->vport == cp->vport &&
ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) &&
@@ -345,12 +346,13 @@ struct ip_vs_conn *ip_vs_ct_in_get(const struct ip_vs_conn_param *p)
{
unsigned hash;
struct ip_vs_conn *cp;
+ struct hlist_node *n;
hash = ip_vs_conn_hashkey_param(p, false);
ct_read_lock(hash);
- list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+ hlist_for_each_entry(cp, n, &ip_vs_conn_tab[hash], c_list) {
if (!ip_vs_conn_net_eq(cp, p->net))
continue;
if (p->pe_data && p->pe->ct_match) {
@@ -394,6 +396,7 @@ struct ip_vs_conn *ip_vs_conn_out_get(const struct ip_vs_conn_param *p)
{
unsigned hash;
struct ip_vs_conn *cp, *ret=NULL;
+ struct hlist_node *n;
/*
* Check for "full" addressed entries
@@ -402,7 +405,7 @@ struct ip_vs_conn *ip_vs_conn_out_get(const struct ip_vs_conn_param *p)
ct_read_lock(hash);
- list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+ hlist_for_each_entry(cp, n, &ip_vs_conn_tab[hash], c_list) {
if (cp->af == p->af &&
p->vport == cp->cport && p->cport == cp->dport &&
ip_vs_addr_equal(p->af, p->vaddr, &cp->caddr) &&
@@ -818,7 +821,7 @@ ip_vs_conn_new(const struct ip_vs_conn_param *p,
return NULL;
}
- INIT_LIST_HEAD(&cp->c_list);
+ INIT_HLIST_NODE(&cp->c_list);
setup_timer(&cp->timer, ip_vs_conn_expire, (unsigned long)cp);
ip_vs_conn_net_set(cp, p->net);
cp->af = p->af;
@@ -894,8 +897,8 @@ ip_vs_conn_new(const struct ip_vs_conn_param *p,
*/
#ifdef CONFIG_PROC_FS
struct ip_vs_iter_state {
- struct seq_net_private p;
- struct list_head *l;
+ struct seq_net_private p;
+ struct hlist_head *l;
};
static void *ip_vs_conn_array(struct seq_file *seq, loff_t pos)
@@ -903,13 +906,14 @@ static void *ip_vs_conn_array(struct seq_file *seq, loff_t pos)
int idx;
struct ip_vs_conn *cp;
struct ip_vs_iter_state *iter = seq->private;
+ struct hlist_node *n;
for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
ct_read_lock_bh(idx);
- list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
+ hlist_for_each_entry(cp, n, &ip_vs_conn_tab[idx], c_list) {
if (pos-- == 0) {
iter->l = &ip_vs_conn_tab[idx];
- return cp;
+ return cp;
}
}
ct_read_unlock_bh(idx);
@@ -930,7 +934,8 @@ static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
struct ip_vs_conn *cp = v;
struct ip_vs_iter_state *iter = seq->private;
- struct list_head *e, *l = iter->l;
+ struct hlist_node *e;
+ struct hlist_head *l = iter->l;
int idx;
++*pos;
@@ -938,15 +943,15 @@ static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
return ip_vs_conn_array(seq, 0);
/* more on same hash chain? */
- if ((e = cp->c_list.next) != l)
- return list_entry(e, struct ip_vs_conn, c_list);
+ if ((e = cp->c_list.next))
+ return hlist_entry(e, struct ip_vs_conn, c_list);
idx = l - ip_vs_conn_tab;
ct_read_unlock_bh(idx);
while (++idx < ip_vs_conn_tab_size) {
ct_read_lock_bh(idx);
- list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
+ hlist_for_each_entry(cp, e, &ip_vs_conn_tab[idx], c_list) {
iter->l = &ip_vs_conn_tab[idx];
return cp;
}
@@ -959,7 +964,7 @@ static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
static void ip_vs_conn_seq_stop(struct seq_file *seq, void *v)
{
struct ip_vs_iter_state *iter = seq->private;
- struct list_head *l = iter->l;
+ struct hlist_head *l = iter->l;
if (l)
ct_read_unlock_bh(l - ip_vs_conn_tab);
@@ -1148,13 +1153,14 @@ void ip_vs_random_dropentry(struct net *net)
*/
for (idx = 0; idx < (ip_vs_conn_tab_size>>5); idx++) {
unsigned hash = net_random() & ip_vs_conn_tab_mask;
+ struct hlist_node *n;
/*
* Lock is actually needed in this loop.
*/
ct_write_lock_bh(hash);
- list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+ hlist_for_each_entry(cp, n, &ip_vs_conn_tab[hash], c_list) {
if (cp->flags & IP_VS_CONN_F_TEMPLATE)
/* connection template */
continue;
@@ -1202,12 +1208,14 @@ static void ip_vs_conn_flush(struct net *net)
flush_again:
for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
+ struct hlist_node *n;
+
/*
* Lock is actually needed in this loop.
*/
ct_write_lock_bh(idx);
- list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
+ hlist_for_each_entry(cp, n, &ip_vs_conn_tab[idx], c_list) {
if (!ip_vs_conn_net_eq(cp, net))
continue;
IP_VS_DBG(4, "del connection\n");
@@ -1265,8 +1273,7 @@ int __init ip_vs_conn_init(void)
/*
* Allocate the connection hash table and initialize its list heads
*/
- ip_vs_conn_tab = vmalloc(ip_vs_conn_tab_size *
- sizeof(struct list_head));
+ ip_vs_conn_tab = vmalloc(ip_vs_conn_tab_size * sizeof(*ip_vs_conn_tab));
if (!ip_vs_conn_tab)
return -ENOMEM;
@@ -1286,9 +1293,8 @@ int __init ip_vs_conn_init(void)
IP_VS_DBG(0, "Each connection entry needs %Zd bytes at least\n",
sizeof(struct ip_vs_conn));
- for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
- INIT_LIST_HEAD(&ip_vs_conn_tab[idx]);
- }
+ for (idx = 0; idx < ip_vs_conn_tab_size; idx++)
+ INIT_HLIST_HEAD(&ip_vs_conn_tab[idx]);
for (idx = 0; idx < CT_LOCKARRAY_SIZE; idx++) {
rwlock_init(&__ip_vs_conntbl_lock_array[idx].l);
^ permalink raw reply related
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-19 10:56 UTC (permalink / raw)
To: Jiri Pirko
Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
shemminger, andy
In-Reply-To: <20110219080523.GB2782@psychotron.redhat.com>
Le 19/02/2011 09:05, Jiri Pirko a écrit :
> This patch converts bonding to use rx_handler. Results in cleaner
> __netif_receive_skb() with much less exceptions needed. Also
> bond-specific work is moved into bond code.
>
> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>
> v1->v2:
> using skb_iif instead of new input_dev to remember original
> device
> v2->v3:
> set orig_dev = skb->dev if skb_iif is set
>
Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
Bonding used to be handled with very few overhead, simply replacing skb->dev with skb->dev->master.
Time has passed and we eventually added many special processing for bonding into
__netif_receive_skb(), but the overhead remained very light.
Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
Can't we, instead, loop inside __netif_receive_skb(), and deliver whatever need to be delivered, to
whoever need, inside the loop ?
rx_handler = rcu_dereference(skb->dev->rx_handler);
while (rx_handler) {
/* ... */
orig_dev = skb->dev;
skb = rx_handler(skb);
/* ... */
rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
}
This would reduce the overhead, while still allowing nesting: vlan on top on bonding, bridge on top
on bonding, ...
That way, we can probably keep the list of crossed devices inside a local array, and call
deliver_skb() with the current "orig_dev" when appropriate. No need to overload sk_buff nor to use a
global variable.
Of course, this might be a very simplistic view.
Any comments?
Nicolas.
^ permalink raw reply
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-19 11:08 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
shemminger, andy
In-Reply-To: <4D5FA1D7.4050801@gmail.com>
Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>This patch converts bonding to use rx_handler. Results in cleaner
>>__netif_receive_skb() with much less exceptions needed. Also
>>bond-specific work is moved into bond code.
>>
>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>
>>v1->v2:
>> using skb_iif instead of new input_dev to remember original
>> device
>>v2->v3:
>> set orig_dev = skb->dev if skb_iif is set
>>
>
>Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>
>Bonding used to be handled with very few overhead, simply replacing
>skb->dev with skb->dev->master. Time has passed and we eventually
>added many special processing for bonding into __netif_receive_skb(),
>but the overhead remained very light.
>
>Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>
>Can't we, instead, loop inside __netif_receive_skb(), and deliver
>whatever need to be delivered, to whoever need, inside the loop ?
>
>rx_handler = rcu_dereference(skb->dev->rx_handler);
>while (rx_handler) {
> /* ... */
> orig_dev = skb->dev;
> skb = rx_handler(skb);
> /* ... */
> rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>}
>
>This would reduce the overhead, while still allowing nesting: vlan on
>top on bonding, bridge on top on bonding, ...
I see your point. Makes sense to me. But the loop would have to include
at least processing of ptype_all too. I'm going to cook a follow-up
patch.
>
>That way, we can probably keep the list of crossed devices inside a
>local array, and call deliver_skb() with the current "orig_dev" when
>appropriate. No need to overload sk_buff nor to use a global
>variable.
>
>Of course, this might be a very simplistic view.
>
>Any comments?
>
> Nicolas.
^ permalink raw reply
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-19 11:28 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
shemminger, andy
In-Reply-To: <20110219110830.GD2782@psychotron.redhat.com>
Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>This patch converts bonding to use rx_handler. Results in cleaner
>>>__netif_receive_skb() with much less exceptions needed. Also
>>>bond-specific work is moved into bond code.
>>>
>>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>
>>>v1->v2:
>>> using skb_iif instead of new input_dev to remember original
>>> device
>>>v2->v3:
>>> set orig_dev = skb->dev if skb_iif is set
>>>
>>
>>Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>
>>Bonding used to be handled with very few overhead, simply replacing
>>skb->dev with skb->dev->master. Time has passed and we eventually
>>added many special processing for bonding into __netif_receive_skb(),
>>but the overhead remained very light.
>>
>>Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>
>>Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>whatever need to be delivered, to whoever need, inside the loop ?
>>
>>rx_handler = rcu_dereference(skb->dev->rx_handler);
>>while (rx_handler) {
>> /* ... */
>> orig_dev = skb->dev;
>> skb = rx_handler(skb);
>> /* ... */
>> rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>}
>>
>>This would reduce the overhead, while still allowing nesting: vlan on
>>top on bonding, bridge on top on bonding, ...
>
>I see your point. Makes sense to me. But the loop would have to include
>at least processing of ptype_all too. I'm going to cook a follow-up
>patch.
>
DRAFT (doesn't modify rx_handlers):
diff --git a/net/core/dev.c b/net/core/dev.c
index 4ebf7fe..e5dba47 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
{
struct packet_type *ptype, *pt_prev;
rx_handler_func_t *rx_handler;
+ struct net_device *dev;
struct net_device *orig_dev;
struct net_device *null_or_dev;
int ret = NET_RX_DROP;
@@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
if (netpoll_receive_skb(skb))
return NET_RX_DROP;
- __this_cpu_inc(softnet_data.processed);
+ skb->skb_iif = skb->dev->ifindex;
+ orig_dev = skb->dev;
+
skb_reset_network_header(skb);
skb_reset_transport_header(skb);
skb->mac_len = skb->network_header - skb->mac_header;
@@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
rcu_read_lock();
- if (!skb->skb_iif) {
- skb->skb_iif = skb->dev->ifindex;
- orig_dev = skb->dev;
- } else {
- orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
- }
+another_round:
+ __this_cpu_inc(softnet_data.processed);
+ dev = skb->dev;
#ifdef CONFIG_NET_CLS_ACT
if (skb->tc_verd & TC_NCLS) {
@@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
#endif
list_for_each_entry_rcu(ptype, &ptype_all, list) {
- if (!ptype->dev || ptype->dev == skb->dev) {
+ if (!ptype->dev || ptype->dev == dev) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
@@ -3167,7 +3167,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
ncls:
#endif
- rx_handler = rcu_dereference(skb->dev->rx_handler);
+ rx_handler = rcu_dereference(dev->rx_handler);
if (rx_handler) {
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -3176,6 +3176,8 @@ ncls:
skb = rx_handler(skb);
if (!skb)
goto out;
+ if (dev != skb->dev)
+ goto another_round;
}
if (vlan_tx_tag_present(skb)) {
^ permalink raw reply related
* Re: [PATCH v6 0/9] net: Unified offload configuration
From: Michał Mirosław @ 2011-02-19 12:28 UTC (permalink / raw)
To: David Miller; +Cc: bhutchings, netdev
In-Reply-To: <20110218.120617.71104636.davem@davemloft.net>
On Fri, Feb 18, 2011 at 12:06:17PM -0800, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Fri, 18 Feb 2011 14:29:31 +0000
>
> > On Fri, 2011-02-18 at 15:22 +0100, Michał Mirosław wrote:
> >> On Thu, Feb 17, 2011 at 02:56:11PM -0800, David Miller wrote:
> > [...]
> >> > Please get rid of that annoying message spit out by netif_features_change(),
> >> > it's just spam. If we want notifications for stuff like this, use a
> >> > non-unicast netlink message so those who want to hear it can do so.
> >> You mean netdev_update_features() "Features changed" message? Is it ok
> >> to just demote it to DEBUG level or you want to remove it altogether?
> >> What about netdev_fix_features() messages?
> > I think you need to emit these messages at 'error' severity when fixing
> > up features for a newly-added device, but at 'debug' later on.
> I get one several minutes after every boot for a completely
> unregistered device for some reason:
>
> [119704.730965] (unregistered net_device): Features changed: 0x00011065 -> 0x00015065
Hmm. That's because netdev_update_features() get's called before changing
netdev->reg_state. I wonder if moving the feature update just after
"dev->reg_state = NETREG_REGISTERED;" will be correct fix.
Best Regards,
Michał Mirosław
^ permalink raw reply
* [PATCH 1/1] Fix "(unregistered net_device): Features changed" message
From: Michał Mirosław @ 2011-02-19 12:46 UTC (permalink / raw)
To: netdev; +Cc: Ben Hutchings, David Miller
In-Reply-To: <20110219122804.GA28326@rere.qmqm.pl>
Fix netdev_update_features() messages on register time by moving
the call further in register_netdevice(). When
netdev->reg_state != NETREG_REGISTERED, netdev_name() returns
"(unregistered netdevice)" even if the dev's name is already filled.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
net/core/dev.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 4f69439..5d8e13e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5478,8 +5478,6 @@ int register_netdevice(struct net_device *dev)
if (!(dev->wanted_features & NETIF_F_SG))
dev->wanted_features &= ~NETIF_F_GSO;
- netdev_update_features(dev);
-
/* Enable GRO and NETIF_F_HIGHDMA for vlans by default,
* vlan_dev_init() will do the dev->features check, so these features
* are enabled only if supported by underlying device.
@@ -5496,6 +5494,8 @@ int register_netdevice(struct net_device *dev)
goto err_uninit;
dev->reg_state = NETREG_REGISTERED;
+ netdev_update_features(dev);
+
/*
* Default initial state at registry is that the
* device is present.
--
1.7.2.3
^ permalink raw reply related
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-19 13:18 UTC (permalink / raw)
To: Jiri Pirko
Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
shemminger, andy
In-Reply-To: <20110219112842.GE2782@psychotron.redhat.com>
Le 19/02/2011 12:28, Jiri Pirko a écrit :
> Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>> Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>> Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>> This patch converts bonding to use rx_handler. Results in cleaner
>>>> __netif_receive_skb() with much less exceptions needed. Also
>>>> bond-specific work is moved into bond code.
>>>>
>>>> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>
>>>> v1->v2:
>>>> using skb_iif instead of new input_dev to remember original
>>>> device
>>>> v2->v3:
>>>> set orig_dev = skb->dev if skb_iif is set
>>>>
>>>
>>> Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>
>>> Bonding used to be handled with very few overhead, simply replacing
>>> skb->dev with skb->dev->master. Time has passed and we eventually
>>> added many special processing for bonding into __netif_receive_skb(),
>>> but the overhead remained very light.
>>>
>>> Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>
>>> Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>> whatever need to be delivered, to whoever need, inside the loop ?
>>>
>>> rx_handler = rcu_dereference(skb->dev->rx_handler);
>>> while (rx_handler) {
>>> /* ... */
>>> orig_dev = skb->dev;
>>> skb = rx_handler(skb);
>>> /* ... */
>>> rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>> }
>>>
>>> This would reduce the overhead, while still allowing nesting: vlan on
>>> top on bonding, bridge on top on bonding, ...
>>
>> I see your point. Makes sense to me. But the loop would have to include
>> at least processing of ptype_all too. I'm going to cook a follow-up
>> patch.
>>
>
> DRAFT (doesn't modify rx_handlers):
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4ebf7fe..e5dba47 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> {
> struct packet_type *ptype, *pt_prev;
> rx_handler_func_t *rx_handler;
> + struct net_device *dev;
> struct net_device *orig_dev;
> struct net_device *null_or_dev;
> int ret = NET_RX_DROP;
> @@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
> if (netpoll_receive_skb(skb))
> return NET_RX_DROP;
>
> - __this_cpu_inc(softnet_data.processed);
> + skb->skb_iif = skb->dev->ifindex;
> + orig_dev = skb->dev;
orig_dev should be set inside the loop, to reflect "previously crossed device", while following the
path:
eth0 -> bond0 -> br0.
First step inside loop:
orig_dev = eth0
skb->dev = bond0 (at the end of the loop).
Second step inside loop:
orig_dev = bond0
skb->dev = br0 (et the end of the loop).
This would allow for exact match delivery to bond0 if someone bind there.
> +
> skb_reset_network_header(skb);
> skb_reset_transport_header(skb);
> skb->mac_len = skb->network_header - skb->mac_header;
> @@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>
> rcu_read_lock();
>
> - if (!skb->skb_iif) {
> - skb->skb_iif = skb->dev->ifindex;
> - orig_dev = skb->dev;
> - } else {
> - orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
> - }
I like the fact that it removes the above part.
> +another_round:
> + __this_cpu_inc(softnet_data.processed);
> + dev = skb->dev;
>
> #ifdef CONFIG_NET_CLS_ACT
> if (skb->tc_verd& TC_NCLS) {
> @@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> #endif
>
> list_for_each_entry_rcu(ptype,&ptype_all, list) {
> - if (!ptype->dev || ptype->dev == skb->dev) {
> + if (!ptype->dev || ptype->dev == dev) {
> if (pt_prev)
> ret = deliver_skb(skb, pt_prev, orig_dev);
> pt_prev = ptype;
Inside the loop, we should only do exact match delivery, for &ptype_all and for
&ptype_base[ntohs(type) & PTYPE_HASH_MASK]:
list_for_each_entry_rcu(ptype, &ptype_all, list) {
- if (!ptype->dev || ptype->dev == dev) {
+ if (ptype->dev == dev) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
}
list_for_each_entry_rcu(ptype,
&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
if (ptype->type == type &&
- (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
+ (ptype->dev == skb->dev)) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
}
After leaving the loop, we can do wilcard delivery, if skb is not NULL.
list_for_each_entry_rcu(ptype, &ptype_all, list) {
- if (!ptype->dev || ptype->dev == dev) {
+ if (!ptype->dev) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
}
list_for_each_entry_rcu(ptype,
&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
- if (ptype->type == type &&
- (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
+ if (ptype->type == type && !ptype->dev) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
}
This would reduce the number of tests inside the list_for_each_entry_rcu() loops. And because we
match only ptype->dev == dev inside the loop and !ptype->dev outside the loop, this should avoid
duplicate delivery.
Also, for performance reason, exact match protocol handler lists might be moved from ptype_base or
ptype_all to a per net_device list. That way, the list_for_each_entry_rcu() inside the loop could be
empty if no protocol handler bind on the current dev.
inside loop:
list_for_each_entry_rcu(ptype, dev->ptype_all, list) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
list_for_each_entry_rcu(ptype,
dev->ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
if (ptype->type == type) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
}
Outside loop :
list_for_each_entry_rcu(ptype, &ptype_all, list) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
list_for_each_entry_rcu(ptype,
&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
if (ptype->type == type) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
}
}
This would require several changes into ptype_all and ptype_base handling, but should be faster.
> @@ -3167,7 +3167,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
> ncls:
> #endif
>
> - rx_handler = rcu_dereference(skb->dev->rx_handler);
> + rx_handler = rcu_dereference(dev->rx_handler);
> if (rx_handler) {
> if (pt_prev) {
> ret = deliver_skb(skb, pt_prev, orig_dev);
> @@ -3176,6 +3176,8 @@ ncls:
> skb = rx_handler(skb);
> if (!skb)
> goto out;
> + if (dev != skb->dev)
I would use "if (skb->dev != dev)" for clarity, because skb->dev is expected to have changed, not dev.
> + goto another_round;
> }
>
> if (vlan_tx_tag_present(skb)) {
>
Nicolas.
^ permalink raw reply
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-19 13:46 UTC (permalink / raw)
To: Nicolas de Pesloüan
Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
shemminger, andy
In-Reply-To: <4D5FC308.9020507@gmail.com>
Sat, Feb 19, 2011 at 02:18:00PM CET, nicolas.2p.debian@gmail.com wrote:
>Le 19/02/2011 12:28, Jiri Pirko a écrit :
>>Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>>>Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>>>Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>>>This patch converts bonding to use rx_handler. Results in cleaner
>>>>>__netif_receive_skb() with much less exceptions needed. Also
>>>>>bond-specific work is moved into bond code.
>>>>>
>>>>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>>
>>>>>v1->v2:
>>>>> using skb_iif instead of new input_dev to remember original
>>>>> device
>>>>>v2->v3:
>>>>> set orig_dev = skb->dev if skb_iif is set
>>>>>
>>>>
>>>>Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>>
>>>>Bonding used to be handled with very few overhead, simply replacing
>>>>skb->dev with skb->dev->master. Time has passed and we eventually
>>>>added many special processing for bonding into __netif_receive_skb(),
>>>>but the overhead remained very light.
>>>>
>>>>Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>>
>>>>Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>>>whatever need to be delivered, to whoever need, inside the loop ?
>>>>
>>>>rx_handler = rcu_dereference(skb->dev->rx_handler);
>>>>while (rx_handler) {
>>>> /* ... */
>>>> orig_dev = skb->dev;
>>>> skb = rx_handler(skb);
>>>> /* ... */
>>>> rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>>>}
>>>>
>>>>This would reduce the overhead, while still allowing nesting: vlan on
>>>>top on bonding, bridge on top on bonding, ...
>>>
>>>I see your point. Makes sense to me. But the loop would have to include
>>>at least processing of ptype_all too. I'm going to cook a follow-up
>>>patch.
>>>
>>
>>DRAFT (doesn't modify rx_handlers):
>>
>>diff --git a/net/core/dev.c b/net/core/dev.c
>>index 4ebf7fe..e5dba47 100644
>>--- a/net/core/dev.c
>>+++ b/net/core/dev.c
>>@@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> {
>> struct packet_type *ptype, *pt_prev;
>> rx_handler_func_t *rx_handler;
>>+ struct net_device *dev;
>> struct net_device *orig_dev;
>> struct net_device *null_or_dev;
>> int ret = NET_RX_DROP;
>>@@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> if (netpoll_receive_skb(skb))
>> return NET_RX_DROP;
>>
>>- __this_cpu_inc(softnet_data.processed);
>>+ skb->skb_iif = skb->dev->ifindex;
>>+ orig_dev = skb->dev;
>
>orig_dev should be set inside the loop, to reflect "previously
>crossed device", while following the path:
>
>eth0 -> bond0 -> br0.
>
>First step inside loop:
>
>orig_dev = eth0
>skb->dev = bond0 (at the end of the loop).
>
>Second step inside loop:
>
>orig_dev = bond0
>skb->dev = br0 (et the end of the loop).
>
>This would allow for exact match delivery to bond0 if someone bind there.
>
>>+
>> skb_reset_network_header(skb);
>> skb_reset_transport_header(skb);
>> skb->mac_len = skb->network_header - skb->mac_header;
>>@@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>
>> rcu_read_lock();
>>
>>- if (!skb->skb_iif) {
>>- skb->skb_iif = skb->dev->ifindex;
>>- orig_dev = skb->dev;
>>- } else {
>>- orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
>>- }
>
>I like the fact that it removes the above part.
>
>>+another_round:
>>+ __this_cpu_inc(softnet_data.processed);
>>+ dev = skb->dev;
>>
>> #ifdef CONFIG_NET_CLS_ACT
>> if (skb->tc_verd& TC_NCLS) {
>>@@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> #endif
>>
>> list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>- if (!ptype->dev || ptype->dev == skb->dev) {
>>+ if (!ptype->dev || ptype->dev == dev) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>
>Inside the loop, we should only do exact match delivery, for
>&ptype_all and for &ptype_base[ntohs(type) & PTYPE_HASH_MASK]:
>
> list_for_each_entry_rcu(ptype, &ptype_all, list) {
>- if (!ptype->dev || ptype->dev == dev) {
>+ if (ptype->dev == dev) {
> if (pt_prev)
> ret = deliver_skb(skb, pt_prev, orig_dev);
> pt_prev = ptype;
> }
> }
>
>
> list_for_each_entry_rcu(ptype,
> &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
> if (ptype->type == type &&
>- (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>+ (ptype->dev == skb->dev)) {
> if (pt_prev)
> ret = deliver_skb(skb, pt_prev, orig_dev);
> pt_prev = ptype;
> }
> }
>
>After leaving the loop, we can do wilcard delivery, if skb is not NULL.
>
> list_for_each_entry_rcu(ptype, &ptype_all, list) {
>- if (!ptype->dev || ptype->dev == dev) {
>+ if (!ptype->dev) {
> if (pt_prev)
> ret = deliver_skb(skb, pt_prev, orig_dev);
> pt_prev = ptype;
> }
> }
>
>
> list_for_each_entry_rcu(ptype,
> &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>- if (ptype->type == type &&
>- (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>+ if (ptype->type == type && !ptype->dev) {
> if (pt_prev)
> ret = deliver_skb(skb, pt_prev, orig_dev);
> pt_prev = ptype;
> }
> }
>
>This would reduce the number of tests inside the
>list_for_each_entry_rcu() loops. And because we match only ptype->dev
>== dev inside the loop and !ptype->dev outside the loop, this should
>avoid duplicate delivery.
Would you care to put this into patch so I can see the whole picture?
Thanks.
^ permalink raw reply
* Advice on network driver design
From: Felix Radensky @ 2011-02-19 13:37 UTC (permalink / raw)
To: netdev@vger.kernel.org
Hi,
I'm in the process of designing a network driver for a custom
hardware and would like to get some advice from linux network
gurus.
The host platform is Freescale P2020. The custom hardware is
FPGA with several TX FIFOs, single RX FIFO and a set of registers.
FPGA is connected to CPU via PCI-E. Host CPU DMA controller is used
to get packets to/from FIFOs. Each FIFO has its set of events,
generating interrupts, which can be enabled and disabled. Status
register reflects the current state of events, the bit in status
register is cleared by FPGA when event is handled. Reads or writes to
status register have no impact on its contents.
The device driver should support 80Mbit/sec of traffic in each direction.
So far I have TX side working. I'm using Linux dmaengine APIs to
transfer packets to FIFOs. The DMA completion interrupt is handled
by per-fifo work queue.
My question is about RX. Would such design benefit from NAPI ?
If my understanding of NAPI is correct, it runs in softirq context,
so I cannot do any DMA work in dev->poll(). If I were to use NAPI,
I should probably disable RX interrupts, do all DMA work in some
work queue, keep RX packets in a list and only then call dev->poll().
Is that correct ?
Any other advice and how to write an efficient driver for this
hardware is most welcome. I can influence FPGA design to some degree,
so if you think FPGA should be changed to improve things, please let
me know.
Thanks a lot in advance.
Felix.
^ permalink raw reply
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-19 14:32 UTC (permalink / raw)
To: Jiri Pirko
Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
shemminger, andy
In-Reply-To: <20110219134645.GF2782@psychotron.redhat.com>
Le 19/02/2011 14:46, Jiri Pirko a écrit :
> Sat, Feb 19, 2011 at 02:18:00PM CET, nicolas.2p.debian@gmail.com wrote:
[snip]
>> Inside the loop, we should only do exact match delivery, for
>> &ptype_all and for&ptype_base[ntohs(type)& PTYPE_HASH_MASK]:
>>
>> list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> - if (!ptype->dev || ptype->dev == dev) {
>> + if (ptype->dev == dev) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>> }
>> }
>>
>>
>> list_for_each_entry_rcu(ptype,
>> &ptype_base[ntohs(type)& PTYPE_HASH_MASK], list) {
>> if (ptype->type == type&&
>> - (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> + (ptype->dev == skb->dev)) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>> }
>> }
>>
>> After leaving the loop, we can do wilcard delivery, if skb is not NULL.
>>
>> list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> - if (!ptype->dev || ptype->dev == dev) {
>> + if (!ptype->dev) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>> }
>> }
>>
>>
>> list_for_each_entry_rcu(ptype,
>> &ptype_base[ntohs(type)& PTYPE_HASH_MASK], list) {
>> - if (ptype->type == type&&
>> - (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> + if (ptype->type == type&& !ptype->dev) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>> }
>> }
>>
>> This would reduce the number of tests inside the
>> list_for_each_entry_rcu() loops. And because we match only ptype->dev
>> == dev inside the loop and !ptype->dev outside the loop, this should
>> avoid duplicate delivery.
>
> Would you care to put this into patch so I can see the whole picture?
> Thanks.
I will try.
Nicolas.
^ permalink raw reply
* [PATCH 1/2] netfilter: tproxy: do not assign timewait sockets to skb->sk
From: kaber @ 2011-02-19 15:41 UTC (permalink / raw)
To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1298130065-14205-1-git-send-email-kaber@trash.net>
From: Florian Westphal <fwestphal@astaro.com>
Assigning a socket in timewait state to skb->sk can trigger
kernel oops, e.g. in nfnetlink_log, which does:
if (skb->sk) {
read_lock_bh(&skb->sk->sk_callback_lock);
if (skb->sk->sk_socket && skb->sk->sk_socket->file) ...
in the timewait case, accessing sk->sk_callback_lock and sk->sk_socket
is invalid.
Either all of these spots will need to add a test for sk->sk_state != TCP_TIME_WAIT,
or xt_TPROXY must not assign a timewait socket to skb->sk.
This does the latter.
If a TW socket is found, assign the tproxy nfmark, but skip the skb->sk assignment,
thus mimicking behaviour of a '-m socket .. -j MARK/ACCEPT' re-routing rule.
The 'SYN to TW socket' case is left unchanged -- we try to redirect to the
listener socket.
Cc: Balazs Scheidler <bazsi@balabit.hu>
Cc: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Florian Westphal <fwestphal@astaro.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
include/net/netfilter/nf_tproxy_core.h | 12 +-----------
net/netfilter/nf_tproxy_core.c | 27 ++++++++++++---------------
net/netfilter/xt_TPROXY.c | 22 ++++++++++++++++++++--
net/netfilter/xt_socket.c | 13 +++++++++++--
4 files changed, 44 insertions(+), 30 deletions(-)
diff --git a/include/net/netfilter/nf_tproxy_core.h b/include/net/netfilter/nf_tproxy_core.h
index cd85b3b..e505358 100644
--- a/include/net/netfilter/nf_tproxy_core.h
+++ b/include/net/netfilter/nf_tproxy_core.h
@@ -201,18 +201,8 @@ nf_tproxy_get_sock_v6(struct net *net, const u8 protocol,
}
#endif
-static inline void
-nf_tproxy_put_sock(struct sock *sk)
-{
- /* TIME_WAIT inet sockets have to be handled differently */
- if ((sk->sk_protocol == IPPROTO_TCP) && (sk->sk_state == TCP_TIME_WAIT))
- inet_twsk_put(inet_twsk(sk));
- else
- sock_put(sk);
-}
-
/* assign a socket to the skb -- consumes sk */
-int
+void
nf_tproxy_assign_sock(struct sk_buff *skb, struct sock *sk);
#endif
diff --git a/net/netfilter/nf_tproxy_core.c b/net/netfilter/nf_tproxy_core.c
index 4d87bef..474d621 100644
--- a/net/netfilter/nf_tproxy_core.c
+++ b/net/netfilter/nf_tproxy_core.c
@@ -28,26 +28,23 @@ nf_tproxy_destructor(struct sk_buff *skb)
skb->destructor = NULL;
if (sk)
- nf_tproxy_put_sock(sk);
+ sock_put(sk);
}
/* consumes sk */
-int
+void
nf_tproxy_assign_sock(struct sk_buff *skb, struct sock *sk)
{
- bool transparent = (sk->sk_state == TCP_TIME_WAIT) ?
- inet_twsk(sk)->tw_transparent :
- inet_sk(sk)->transparent;
-
- if (transparent) {
- skb_orphan(skb);
- skb->sk = sk;
- skb->destructor = nf_tproxy_destructor;
- return 1;
- } else
- nf_tproxy_put_sock(sk);
-
- return 0;
+ /* assigning tw sockets complicates things; most
+ * skb->sk->X checks would have to test sk->sk_state first */
+ if (sk->sk_state == TCP_TIME_WAIT) {
+ inet_twsk_put(inet_twsk(sk));
+ return;
+ }
+
+ skb_orphan(skb);
+ skb->sk = sk;
+ skb->destructor = nf_tproxy_destructor;
}
EXPORT_SYMBOL_GPL(nf_tproxy_assign_sock);
diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index 640678f..dcfd57e 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -33,6 +33,20 @@
#include <net/netfilter/nf_tproxy_core.h>
#include <linux/netfilter/xt_TPROXY.h>
+static bool tproxy_sk_is_transparent(struct sock *sk)
+{
+ if (sk->sk_state != TCP_TIME_WAIT) {
+ if (inet_sk(sk)->transparent)
+ return true;
+ sock_put(sk);
+ } else {
+ if (inet_twsk(sk)->tw_transparent)
+ return true;
+ inet_twsk_put(inet_twsk(sk));
+ }
+ return false;
+}
+
static inline __be32
tproxy_laddr4(struct sk_buff *skb, __be32 user_laddr, __be32 daddr)
{
@@ -141,7 +155,7 @@ tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport,
skb->dev, NFT_LOOKUP_LISTENER);
/* NOTE: assign_sock consumes our sk reference */
- if (sk && nf_tproxy_assign_sock(skb, sk)) {
+ if (sk && tproxy_sk_is_transparent(sk)) {
/* This should be in a separate target, but we don't do multiple
targets on the same rule yet */
skb->mark = (skb->mark & ~mark_mask) ^ mark_value;
@@ -149,6 +163,8 @@ tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport,
pr_debug("redirecting: proto %hhu %pI4:%hu -> %pI4:%hu, mark: %x\n",
iph->protocol, &iph->daddr, ntohs(hp->dest),
&laddr, ntohs(lport), skb->mark);
+
+ nf_tproxy_assign_sock(skb, sk);
return NF_ACCEPT;
}
@@ -306,7 +322,7 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
par->in, NFT_LOOKUP_LISTENER);
/* NOTE: assign_sock consumes our sk reference */
- if (sk && nf_tproxy_assign_sock(skb, sk)) {
+ if (sk && tproxy_sk_is_transparent(sk)) {
/* This should be in a separate target, but we don't do multiple
targets on the same rule yet */
skb->mark = (skb->mark & ~tgi->mark_mask) ^ tgi->mark_value;
@@ -314,6 +330,8 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
pr_debug("redirecting: proto %hhu %pI6:%hu -> %pI6:%hu, mark: %x\n",
tproto, &iph->saddr, ntohs(hp->source),
laddr, ntohs(lport), skb->mark);
+
+ nf_tproxy_assign_sock(skb, sk);
return NF_ACCEPT;
}
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 00d6ae8..9cc4635 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -35,6 +35,15 @@
#include <net/netfilter/nf_conntrack.h>
#endif
+static void
+xt_socket_put_sk(struct sock *sk)
+{
+ if (sk->sk_state == TCP_TIME_WAIT)
+ inet_twsk_put(inet_twsk(sk));
+ else
+ sock_put(sk);
+}
+
static int
extract_icmp4_fields(const struct sk_buff *skb,
u8 *protocol,
@@ -164,7 +173,7 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
(sk->sk_state == TCP_TIME_WAIT &&
inet_twsk(sk)->tw_transparent));
- nf_tproxy_put_sock(sk);
+ xt_socket_put_sk(sk);
if (wildcard || !transparent)
sk = NULL;
@@ -298,7 +307,7 @@ socket_mt6_v1(const struct sk_buff *skb, struct xt_action_param *par)
(sk->sk_state == TCP_TIME_WAIT &&
inet_twsk(sk)->tw_transparent));
- nf_tproxy_put_sock(sk);
+ xt_socket_put_sk(sk);
if (wildcard || !transparent)
sk = NULL;
--
1.7.2.3
^ permalink raw reply related
* [PATCH 2/2] netfilter: ip6t_LOG: fix a flaw in printing the MAC
From: kaber @ 2011-02-19 15:41 UTC (permalink / raw)
To: davem; +Cc: netfilter-devel, netdev
In-Reply-To: <1298130065-14205-1-git-send-email-kaber@trash.net>
From: Joerg Marx <joerg.marx@secunet.com>
The flaw was in skipping the second byte in MAC header due to increasing
the pointer AND indexed access starting at '1'.
Signed-off-by: Joerg Marx <joerg.marx@secunet.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
---
net/ipv6/netfilter/ip6t_LOG.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/net/ipv6/netfilter/ip6t_LOG.c b/net/ipv6/netfilter/ip6t_LOG.c
index 09c8889..de33803 100644
--- a/net/ipv6/netfilter/ip6t_LOG.c
+++ b/net/ipv6/netfilter/ip6t_LOG.c
@@ -410,7 +410,7 @@ fallback:
if (p != NULL) {
sb_add(m, "%02x", *p++);
for (i = 1; i < len; i++)
- sb_add(m, ":%02x", p[i]);
+ sb_add(m, ":%02x", *p++);
}
sb_add(m, " ");
--
1.7.2.3
^ permalink raw reply related
* [PATCH 0/2] netfilter: netfilter fixes for 2.6.38
From: kaber @ 2011-02-19 15:41 UTC (permalink / raw)
To: davem; +Cc: netfilter-devel, netdev
Hi Dave,
the following patches for two netfilter bugs:
- an oops in nfnetlink_log in combination with TPROXY when a socket
in TIME-WAIT state is assigned to skb->sk, patch from Florian Westphal
- incorrect printing of the MAC header in the ip6t_LOG target,
from Joerg Marx
Please apply or pull from:
git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6.git master
Thanks!
^ permalink raw reply
* [PATCH 1/3] tracepath: re-use printf return in print_host
From: Mike Frysinger @ 2011-02-19 17:51 UTC (permalink / raw)
To: YOSHIFUJI Hideaki; +Cc: netdev
The printf funcs take an int for field widths, not a size_t. Also, since
the printf funcs already return the length of chars displayed, use that
value instead of re-calculating the length with strlen.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
---
tracepath.c | 11 ++++-------
tracepath6.c | 11 ++++-------
2 files changed, 8 insertions(+), 14 deletions(-)
diff --git a/tracepath.c b/tracepath.c
index 81c22e9..ca84a69 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -68,13 +68,10 @@ void data_wait(int fd)
void print_host(const char *a, const char *b, int both)
{
- size_t plen = 0;
- printf("%s", a);
- plen = strlen(a);
- if (both) {
- printf(" (%s)", b);
- plen += strlen(b) + 3;
- }
+ int plen;
+ plen = printf("%s", a);
+ if (both)
+ plen += printf(" (%s)", b);
if (plen >= HOST_COLUMN_SIZE)
plen = HOST_COLUMN_SIZE - 1;
printf("%*s", HOST_COLUMN_SIZE - plen, "");
diff --git a/tracepath6.c b/tracepath6.c
index 5cc7424..5c2db8f 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -80,13 +80,10 @@ void data_wait(int fd)
void print_host(const char *a, const char *b, int both)
{
- size_t plen = 0;
- printf("%s", a);
- plen = strlen(a);
- if (both) {
- printf(" (%s)", b);
- plen += strlen(b) + 3;
- }
+ int plen;
+ plen = printf("%s", a);
+ if (both)
+ plen += printf(" (%s)", b);
if (plen >= HOST_COLUMN_SIZE)
plen = HOST_COLUMN_SIZE - 1;
printf("%*s", HOST_COLUMN_SIZE - plen, "");
--
1.7.4.1
^ permalink raw reply related
* [PATCH 2/3] fix up strict-aliasing warnings
From: Mike Frysinger @ 2011-02-19 17:51 UTC (permalink / raw)
To: YOSHIFUJI Hideaki; +Cc: netdev
In-Reply-To: <1298137889-23969-1-git-send-email-vapier@gentoo.org>
Current build of some tools results in gcc warning about strict-aliasing
violations. So change those freaky casts to memcpy's. When the pointer
types work out, gcc will optimize this away anyways.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
---
ping6.c | 13 +++++++++----
tracepath.c | 2 +-
tracepath6.c | 2 +-
3 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/ping6.c b/ping6.c
index c5ff881..86f9216 100644
--- a/ping6.c
+++ b/ping6.c
@@ -1104,18 +1104,21 @@ int build_niquery(__u8 *_nih)
{
struct ni_hdr *nih;
int cc;
+ __u16 this_nonce;
nih = (struct ni_hdr *)_nih;
nih->ni_cksum = 0;
- CLR(ntohs((*(__u16*)(nih->ni_nonce))) % mx_dup_ck);
+ memcpy(&this_nonce, &nih->ni_nonce, sizeof(this_nonce));
+ CLR(ntohs(this_nonce) % mx_dup_ck);
nih->ni_type = ICMPV6_NI_QUERY;
cc = sizeof(*nih);
datalen = 0;
memcpy(nih->ni_nonce, ni_nonce, sizeof(nih->ni_nonce));
- *(__u16*)(nih->ni_nonce) = htons(ntransmitted + 1);
+ this_nonce = htons(ntransmitted + 1);
+ memcpy(&nih->ni_nonce, &this_nonce, sizeof(this_nonce));
nih->ni_code = ni_subject_type;
nih->ni_qtype = htons(ni_query);
@@ -1331,7 +1334,7 @@ parse_reply(struct msghdr *msg, int cc, void *addr, struct timeval *tv)
#endif
if (c->cmsg_len < CMSG_LEN(sizeof(int)))
continue;
- hops = *(int*)CMSG_DATA(c);
+ memcpy(&hops, CMSG_DATA(c), sizeof(int));
}
}
@@ -1355,7 +1358,9 @@ parse_reply(struct msghdr *msg, int cc, void *addr, struct timeval *tv)
return 0;
} else if (icmph->icmp6_type == ICMPV6_NI_REPLY) {
struct ni_hdr *nih = (struct ni_hdr *)icmph;
- __u16 seq = ntohs(*(__u16 *)nih->ni_nonce);
+ __u16 seq;
+ memcpy(&seq, &nih->ni_nonce, sizeof(seq));
+ seq = ntohs(seq);
if (memcmp(&nih->ni_nonce[2], &ni_nonce[2], sizeof(ni_nonce) - sizeof(__u16)))
return 1;
if (gather_statistics((__u8*)icmph, sizeof(*icmph), cc,
diff --git a/tracepath.c b/tracepath.c
index ca84a69..0a14b1b 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -142,7 +142,7 @@ restart:
if (cmsg->cmsg_type == IP_RECVERR) {
e = (struct sock_extended_err *) CMSG_DATA(cmsg);
} else if (cmsg->cmsg_type == IP_TTL) {
- rethops = *(int*)CMSG_DATA(cmsg);
+ memcpy(&rethops, CMSG_DATA(cmsg), sizeof(int));
} else {
printf("cmsg:%d\n ", cmsg->cmsg_type);
}
diff --git a/tracepath6.c b/tracepath6.c
index 5c2db8f..77a3563 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -170,7 +170,7 @@ restart:
#ifdef IPV6_2292HOPLIMIT
case IPV6_2292HOPLIMIT:
#endif
- rethops = *(int*)CMSG_DATA(cmsg);
+ memcpy(&rethops, CMSG_DATA(cmsg), sizeof(int));
break;
default:
printf("cmsg6:%d\n ", cmsg->cmsg_type);
--
1.7.4.1
^ permalink raw reply related
* [PATCH 3/3] tracepath: fix typo in -l error message
From: Mike Frysinger @ 2011-02-19 17:51 UTC (permalink / raw)
To: YOSHIFUJI Hideaki; +Cc: netdev, Jeroen Roovers
In-Reply-To: <1298137889-23969-1-git-send-email-vapier@gentoo.org>
From: Jeroen Roovers <jer@gentoo.org>
Signed-off-by: Jeroen Roovers <jer@gentoo.org>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
---
tracepath.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/tracepath.c b/tracepath.c
index 0a14b1b..90e1b28 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -306,7 +306,7 @@ main(int argc, char **argv)
break;
case 'l':
if ((mtu = atoi(optarg)) <= overhead) {
- fprintf(stderr, "Error: length must be >= %d\n", overhead);
+ fprintf(stderr, "Error: length must be > %d\n", overhead);
exit(1);
}
break;
--
1.7.4.1
^ permalink raw reply related
* [PATCH] ping: fix building on older systems
From: Mike Frysinger @ 2011-02-19 17:56 UTC (permalink / raw)
To: YOSHIFUJI Hideaki; +Cc: netdev
The SO_MARK define is somewhat recent (linux-2.6.25), so check for its
existence before we try to use it.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
---
ping_common.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/ping_common.c b/ping_common.c
index 82320b1..b4b26b2 100644
--- a/ping_common.c
+++ b/ping_common.c
@@ -475,6 +475,7 @@ void setup(int icmp_sock)
fprintf(stderr, "Warning: no SO_TIMESTAMP support, falling back to SIOCGSTAMP\n");
}
#endif
+#ifdef SO_MARK
if (options & F_MARK) {
if (setsockopt(icmp_sock, SOL_SOCKET, SO_MARK,
&mark, sizeof(mark)) == -1) {
@@ -484,6 +485,7 @@ void setup(int icmp_sock)
fprintf(stderr, "Warning: Failed to set mark %d\n", mark);
}
}
+#endif
/* Set some SNDTIMEO to prevent blocking forever
* on sends, when device is too slow or stalls. Just put limit
--
1.7.4.1
^ permalink raw reply related
* Business Proposal for you!!!
From: Mr Peter Wong @ 2011-02-19 16:15 UTC (permalink / raw)
Dear Friend,
I am Mr.Peter Wong Non Executive Director of Hang Seng Bank Ltd, Hong
Kong.I have a deceased client funds in my bank of $44.5MUSD and i need you
to front as beneficiary,your benefit is 50% of the total funds.If you are
interested contact me for more information via peter.wong133@hotmail.com
Sincerely
Mr.Peter
^ permalink raw reply
* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-19 20:27 UTC (permalink / raw)
To: Jiri Pirko
Cc: Jay Vosburgh, David Miller, kaber, eric.dumazet, netdev,
shemminger, andy
In-Reply-To: <20110219134645.GF2782@psychotron.redhat.com>
Le 19/02/2011 14:46, Jiri Pirko a écrit :
> Sat, Feb 19, 2011 at 02:18:00PM CET, nicolas.2p.debian@gmail.com wrote:
>> Le 19/02/2011 12:28, Jiri Pirko a écrit :
>>> Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@redhat.com wrote:
>>>> Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@gmail.com wrote:
>>>>> Le 19/02/2011 09:05, Jiri Pirko a écrit :
>>>>>> This patch converts bonding to use rx_handler. Results in cleaner
>>>>>> __netif_receive_skb() with much less exceptions needed. Also
>>>>>> bond-specific work is moved into bond code.
>>>>>>
>>>>>> Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>>>
>>>>>> v1->v2:
>>>>>> using skb_iif instead of new input_dev to remember original
>>>>>> device
>>>>>> v2->v3:
>>>>>> set orig_dev = skb->dev if skb_iif is set
>>>>>>
>>>>>
>>>>> Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>>>
>>>>> Bonding used to be handled with very few overhead, simply replacing
>>>>> skb->dev with skb->dev->master. Time has passed and we eventually
>>>>> added many special processing for bonding into __netif_receive_skb(),
>>>>> but the overhead remained very light.
>>>>>
>>>>> Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>>>
>>>>> Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>>>> whatever need to be delivered, to whoever need, inside the loop ?
>>>>>
>>>>> rx_handler = rcu_dereference(skb->dev->rx_handler);
>>>>> while (rx_handler) {
>>>>> /* ... */
>>>>> orig_dev = skb->dev;
>>>>> skb = rx_handler(skb);
>>>>> /* ... */
>>>>> rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>>>> }
>>>>>
>>>>> This would reduce the overhead, while still allowing nesting: vlan on
>>>>> top on bonding, bridge on top on bonding, ...
>>>>
>>>> I see your point. Makes sense to me. But the loop would have to include
>>>> at least processing of ptype_all too. I'm going to cook a follow-up
>>>> patch.
>>>>
>>>
>>> DRAFT (doesn't modify rx_handlers):
>>>
>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>> index 4ebf7fe..e5dba47 100644
>>> --- a/net/core/dev.c
>>> +++ b/net/core/dev.c
>>> @@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>> {
>>> struct packet_type *ptype, *pt_prev;
>>> rx_handler_func_t *rx_handler;
>>> + struct net_device *dev;
>>> struct net_device *orig_dev;
>>> struct net_device *null_or_dev;
>>> int ret = NET_RX_DROP;
>>> @@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>> if (netpoll_receive_skb(skb))
>>> return NET_RX_DROP;
>>>
>>> - __this_cpu_inc(softnet_data.processed);
>>> + skb->skb_iif = skb->dev->ifindex;
>>> + orig_dev = skb->dev;
>>
>> orig_dev should be set inside the loop, to reflect "previously
>> crossed device", while following the path:
>>
>> eth0 -> bond0 -> br0.
>>
>> First step inside loop:
>>
>> orig_dev = eth0
>> skb->dev = bond0 (at the end of the loop).
>>
>> Second step inside loop:
>>
>> orig_dev = bond0
>> skb->dev = br0 (et the end of the loop).
>>
>> This would allow for exact match delivery to bond0 if someone bind there.
>>
>>> +
>>> skb_reset_network_header(skb);
>>> skb_reset_transport_header(skb);
>>> skb->mac_len = skb->network_header - skb->mac_header;
>>> @@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>
>>> rcu_read_lock();
>>>
>>> - if (!skb->skb_iif) {
>>> - skb->skb_iif = skb->dev->ifindex;
>>> - orig_dev = skb->dev;
>>> - } else {
>>> - orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
>>> - }
>>
>> I like the fact that it removes the above part.
>>
>>> +another_round:
>>> + __this_cpu_inc(softnet_data.processed);
>>> + dev = skb->dev;
>>>
>>> #ifdef CONFIG_NET_CLS_ACT
>>> if (skb->tc_verd& TC_NCLS) {
>>> @@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>> #endif
>>>
>>> list_for_each_entry_rcu(ptype,&ptype_all, list) {
>>> - if (!ptype->dev || ptype->dev == skb->dev) {
>>> + if (!ptype->dev || ptype->dev == dev) {
>>> if (pt_prev)
>>> ret = deliver_skb(skb, pt_prev, orig_dev);
>>> pt_prev = ptype;
>>
>> Inside the loop, we should only do exact match delivery, for
>> &ptype_all and for&ptype_base[ntohs(type)& PTYPE_HASH_MASK]:
>>
>> list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> - if (!ptype->dev || ptype->dev == dev) {
>> + if (ptype->dev == dev) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>> }
>> }
>>
>>
>> list_for_each_entry_rcu(ptype,
>> &ptype_base[ntohs(type)& PTYPE_HASH_MASK], list) {
>> if (ptype->type == type&&
>> - (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> + (ptype->dev == skb->dev)) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>> }
>> }
>>
>> After leaving the loop, we can do wilcard delivery, if skb is not NULL.
>>
>> list_for_each_entry_rcu(ptype,&ptype_all, list) {
>> - if (!ptype->dev || ptype->dev == dev) {
>> + if (!ptype->dev) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>> }
>> }
>>
>>
>> list_for_each_entry_rcu(ptype,
>> &ptype_base[ntohs(type)& PTYPE_HASH_MASK], list) {
>> - if (ptype->type == type&&
>> - (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> + if (ptype->type == type&& !ptype->dev) {
>> if (pt_prev)
>> ret = deliver_skb(skb, pt_prev, orig_dev);
>> pt_prev = ptype;
>> }
>> }
>>
>> This would reduce the number of tests inside the
>> list_for_each_entry_rcu() loops. And because we match only ptype->dev
>> == dev inside the loop and !ptype->dev outside the loop, this should
>> avoid duplicate delivery.
>
> Would you care to put this into patch so I can see the whole picture?
> Thanks.
Here is what I have in mind. It is based on your previous DRAFT patch, and don't modify rx_handlers yet.
Only compile tested !!
I don't know if every pieces are at the right place. I wonder what to do with CONFIG_NET_CLS_ACT
part, that currently is between ptype_all and ptype_base processing.
Anyway, the general idea is there.
Nicolas.
net/core/dev.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 60 insertions(+), 10 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index e5dba47..7e007a9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3117,7 +3117,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
rx_handler_func_t *rx_handler;
struct net_device *dev;
struct net_device *orig_dev;
- struct net_device *null_or_dev;
int ret = NET_RX_DROP;
__be16 type;
@@ -3130,9 +3129,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
if (netpoll_receive_skb(skb))
return NET_RX_DROP;
- skb->skb_iif = skb->dev->ifindex;
- orig_dev = skb->dev;
-
skb_reset_network_header(skb);
skb_reset_transport_header(skb);
skb->mac_len = skb->network_header - skb->mac_header;
@@ -3143,6 +3139,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
another_round:
__this_cpu_inc(softnet_data.processed);
+ skb->skb_iif = skb->dev->ifindex;
+ orig_dev = skb->dev;
dev = skb->dev;
#ifdef CONFIG_NET_CLS_ACT
@@ -3152,8 +3150,13 @@ another_round:
}
#endif
+ /*
+ * Deliver to ptype_all protocol handlers that match current dev.
+ * This happens before rx_handler is given a chance to change skb->dev.
+ */
+
list_for_each_entry_rcu(ptype, &ptype_all, list) {
- if (!ptype->dev || ptype->dev == dev) {
+ if (ptype->dev == dev) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
@@ -3167,6 +3170,31 @@ another_round:
ncls:
#endif
+ /*
+ * Deliver to ptype_base protocol handlers that match current dev.
+ * This happens before rx_handler is given a chance to change skb->dev.
+ */
+
+ type = skb->protocol;
+ list_for_each_entry_rcu(ptype,
+ &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
+ if (ptype->type == type && ptype->dev == skb->dev) {
+ if (pt_prev)
+ ret = deliver_skb(skb, pt_prev, orig_dev);
+ pt_prev = ptype;
+ }
+ }
+
+ /*
+ * Call rx_handler for current device.
+ * If rx_handler return NULL, skip wilcard protocol handler delivery.
+ * Else, if skb->dev changed, restart the whole delivery process, to
+ * allow for device nesting.
+ *
+ * Warning:
+ * rx_handlers must kfree_skb(skb) if they return NULL.
+ */
+
rx_handler = rcu_dereference(dev->rx_handler);
if (rx_handler) {
if (pt_prev) {
@@ -3176,10 +3204,15 @@ ncls:
skb = rx_handler(skb);
if (!skb)
goto out;
- if (dev != skb->dev)
+ if (skb->dev != dev)
goto another_round;
}
+ /*
+ * FIXME: The part below should use rx_handler instead of being hard
+ * coded here.
+ */
+
if (vlan_tx_tag_present(skb)) {
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
@@ -3192,16 +3225,33 @@ ncls:
goto out;
}
+ /*
+ * FIXME: Can't this be moved into the rx_handler for bonding,
+ * or into a futur rx_handler for vlan?
+ */
+
vlan_on_bond_hook(skb);
- /* deliver only exact match when indicated */
- null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
+ /*
+ * Deliver to wildcard ptype_all protocol handlers.
+ */
+
+ list_for_each_entry_rcu(ptype, &ptype_all, list) {
+ if (!ptype->dev) {
+ if (pt_prev)
+ ret = deliver_skb(skb, pt_prev, orig_dev);
+ pt_prev = ptype;
+ }
+ }
+
+ /*
+ * Deliver to wildcard ptype_all protocol handlers.
+ */
type = skb->protocol;
list_for_each_entry_rcu(ptype,
&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
- if (ptype->type == type &&
- (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
+ if (ptype->type == type && !ptype->dev) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
--
1.7.2.3
^ permalink raw reply related
* [PATCH] connector: Convert char *name to const char *name
From: Joe Perches @ 2011-02-19 23:45 UTC (permalink / raw)
To: Javier Martinez Canillas, Evgeniy Polyakov
Cc: Greg Kroah-Hartman, devel, K. Y. Srinivasan, netdev
In-Reply-To: <1298151834-13141-1-git-send-email-martinez.javier@gmail.com>
Allow more const declarations.
Signed-off-by: Joe Perches <joe@perches.com>
---
Better to change the declarations and uses as this argument
is not modified.
On Sat, 2011-02-19 at 22:43 +0100, Javier Martinez Canillas wrote:
> cn_add_callback() defines is second argument as a char * but a
> const char * was supplied instead. So the compiler generates a
> compile warning.
drivers/connector/cn_queue.c | 7 ++++---
drivers/connector/connector.c | 2 +-
include/linux/connector.h | 9 ++++++---
3 files changed, 11 insertions(+), 7 deletions(-)
diff --git a/drivers/connector/cn_queue.c b/drivers/connector/cn_queue.c
index 81270d2..55653ab 100644
--- a/drivers/connector/cn_queue.c
+++ b/drivers/connector/cn_queue.c
@@ -48,7 +48,7 @@ void cn_queue_wrapper(struct work_struct *work)
}
static struct cn_callback_entry *
-cn_queue_alloc_callback_entry(char *name, struct cb_id *id,
+cn_queue_alloc_callback_entry(const char *name, struct cb_id *id,
void (*callback)(struct cn_msg *, struct netlink_skb_parms *))
{
struct cn_callback_entry *cbq;
@@ -78,7 +78,8 @@ int cn_cb_equal(struct cb_id *i1, struct cb_id *i2)
return ((i1->idx == i2->idx) && (i1->val == i2->val));
}
-int cn_queue_add_callback(struct cn_queue_dev *dev, char *name, struct cb_id *id,
+int cn_queue_add_callback(struct cn_queue_dev *dev, const char *name,
+ struct cb_id *id,
void (*callback)(struct cn_msg *, struct netlink_skb_parms *))
{
struct cn_callback_entry *cbq, *__cbq;
@@ -135,7 +136,7 @@ void cn_queue_del_callback(struct cn_queue_dev *dev, struct cb_id *id)
}
}
-struct cn_queue_dev *cn_queue_alloc_dev(char *name, struct sock *nls)
+struct cn_queue_dev *cn_queue_alloc_dev(const char *name, struct sock *nls)
{
struct cn_queue_dev *dev;
diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 05117f1..f7554de 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -205,7 +205,7 @@ static void cn_rx_skb(struct sk_buff *__skb)
*
* May sleep.
*/
-int cn_add_callback(struct cb_id *id, char *name,
+int cn_add_callback(struct cb_id *id, const char *name,
void (*callback)(struct cn_msg *, struct netlink_skb_parms *))
{
int err;
diff --git a/include/linux/connector.h b/include/linux/connector.h
index 2e9759c..a9edf24 100644
--- a/include/linux/connector.h
+++ b/include/linux/connector.h
@@ -129,14 +129,17 @@ struct cn_dev {
struct cn_queue_dev *cbdev;
};
-int cn_add_callback(struct cb_id *, char *, void (*callback) (struct cn_msg *, struct netlink_skb_parms *));
+int cn_add_callback(struct cb_id *id, const char *name,
+ void (*callback)(struct cn_msg *, struct netlink_skb_parms *));
void cn_del_callback(struct cb_id *);
int cn_netlink_send(struct cn_msg *, u32, gfp_t);
-int cn_queue_add_callback(struct cn_queue_dev *dev, char *name, struct cb_id *id, void (*callback)(struct cn_msg *, struct netlink_skb_parms *));
+int cn_queue_add_callback(struct cn_queue_dev *dev, const char *name,
+ struct cb_id *id,
+ void (*callback)(struct cn_msg *, struct netlink_skb_parms *));
void cn_queue_del_callback(struct cn_queue_dev *dev, struct cb_id *id);
-struct cn_queue_dev *cn_queue_alloc_dev(char *name, struct sock *);
+struct cn_queue_dev *cn_queue_alloc_dev(const char *name, struct sock *);
void cn_queue_free_dev(struct cn_queue_dev *dev);
int cn_cb_equal(struct cb_id *, struct cb_id *);
^ permalink raw reply related
* Re: [PATCH] tcp: fix inet_twsk_deschedule()
From: David Miller @ 2011-02-20 2:59 UTC (permalink / raw)
To: eric.dumazet
Cc: ebiederm, acme, torvalds, mhocko, mingo, linux-mm, linux-kernel,
netdev, xemul, daniel.lezcano
In-Reply-To: <1298104556.8559.21.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 19 Feb 2011 09:35:56 +0100
> [PATCH] tcp: fix inet_twsk_deschedule()
>
> Eric W. Biederman reported a lockdep splat in inet_twsk_deschedule()
>
> This is caused by inet_twsk_purge(), run from process context,
> and commit 575f4cd5a5b6394577 (net: Use rcu lookups in inet_twsk_purge.)
> removed the BH disabling that was necessary.
>
> Add the BH disabling but fine grained, right before calling
> inet_twsk_deschedule(), instead of whole function.
>
> With help from Linus Torvalds and Eric W. Biederman
>
> Reported-by: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Daniel Lezcano <daniel.lezcano@free.fr>
> CC: Pavel Emelyanov <xemul@openvz.org>
> CC: Arnaldo Carvalho de Melo <acme@redhat.com>
> CC: stable <stable@kernel.org> (# 2.6.33+)
Applied, thanks Eric.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [net-2.6 PATCH v2] net: dcb: match dcb_app protocol field with 802.1Qaz spec
From: David Miller @ 2011-02-20 3:00 UTC (permalink / raw)
To: john.r.fastabend; +Cc: shmulikr, netdev
In-Reply-To: <20110218233017.7419.32447.stgit@jf-dev1-dcblab>
From: John Fastabend <john.r.fastabend@intel.com>
Date: Fri, 18 Feb 2011 15:30:17 -0800
> The dcb_app protocol field is a __u32 however the 802.1Qaz
> specification defines it as a 16 bit field. This patch brings
> the structure inline with the spec making it a __u16.
>
> CC: Shmulik Ravid <shmulikr@broadcom.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Applied.
^ permalink raw reply
* Re: [PATCH 0/2] netfilter: netfilter fixes for 2.6.38
From: David Miller @ 2011-02-20 3:01 UTC (permalink / raw)
To: kaber; +Cc: netfilter-devel, netdev
In-Reply-To: <1298130065-14205-1-git-send-email-kaber@trash.net>
From: kaber@trash.net
Date: Sat, 19 Feb 2011 16:41:03 +0100
> the following patches for two netfilter bugs:
>
> - an oops in nfnetlink_log in combination with TPROXY when a socket
> in TIME-WAIT state is assigned to skb->sk, patch from Florian Westphal
>
> - incorrect printing of the MAC header in the ip6t_LOG target,
> from Joerg Marx
>
> Please apply or pull from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6.git master
Pulled, thanks Patrick.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox