* Re: [PATCH] ipv6: udp: make short packet logging consistent with ipv4
From: Eric Dumazet @ 2010-05-06 14:49 UTC
To: Bjørn Mork; +Cc: netdev
In-Reply-To: <1273153475-32363-2-git-send-email-bjorn@mork.no>
On Thursday, 6 May 2010 at 15:44 +0200, Bjørn Mork wrote:
> Adding addresses and ports to the short packet log message,
> like ipv4/udp.c does it, makes these messages a lot more useful:
>
> [ 822.182450] UDPv6: short packet: From [2001:db8:ffb4:3::1]:47839 23715/178 to [2001:db8:ffb4:3:5054:ff:feff:200]:1234
>
> This requires us to drop logging in case pskb_may_pull() fails,
> which also is consistent with ipv4/udp.c
>
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> ---
> net/ipv6/udp.c | 11 ++++++++---
> 1 files changed, 8 insertions(+), 3 deletions(-)
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
* Re: 2.6.33.2: Turn tx power off/on for Atheros card
From: Yegor Yefremov @ 2010-05-06 14:52 UTC
To: linux-wireless; +Cc: netdev
In-Reply-To: <r2xf69abfc31005050326oe123cb60s131ab4969341ef57@mail.gmail.com>
On Wed, May 5, 2010 at 12:26 PM, Yegor Yefremov
<yegorslists@googlemail.com> wrote:
> I'm using kernel 2.6.33.2 with AR2413 WLAN card. Issuing
>
> iwconfig wlan0 txpower off
>
> turns txpower off. I can see this status by iwconfig wlan0 and the
> communication with AP terminates. But when I turn the txpower on
>
> iwconfig wlan0 txpower on
>
> nothing happens. Though iwconfig shows the previous tx power value.
> Only ifconfig wlan0 down and then up recovers the transmission.
>
> Is it a known bug or I'm doing something wrong?
I did some debugging and found that after iwconfig wlan0 txpower off,
dev_close() is invoked, so local->open_count drops to 0. The next time
txpower on is called, the check local->open_count > 0 fails, so no
hardware configuration is performed.
I've made a quick and dirty hack that opens the wireless device when
txpower is enabled, if it was closed before. Is there a proper
solution? Is it really necessary to close the device to turn txpower off?
Best regards,
Yegor
Index: b/net/wireless/wext-compat.c
===================================================================
--- a/net/wireless/wext-compat.c 2010-04-30 05:02:05.000000000 +0200
+++ b/net/wireless/wext-compat.c 2010-05-06 16:31:20.000000000 +0200
@@ -15,6 +15,7 @@
#include <linux/slab.h>
#include <net/iw_handler.h>
#include <net/cfg80211.h>
+#include "../mac80211/ieee80211_i.h"
#include "wext-compat.h"
#include "core.h"
@@ -824,6 +825,7 @@
{
struct wireless_dev *wdev = dev->ieee80211_ptr;
struct cfg80211_registered_device *rdev = wiphy_to_dev(wdev->wiphy);
+ struct ieee80211_local *local = wiphy_priv(wdev->wiphy);
enum tx_power_setting type;
int dbm = 0;
@@ -861,6 +863,8 @@
type = TX_POWER_LIMITED;
}
}
+ if(!local->open_count)
+ dev_open(wdev->netdev);
} else {
rfkill_set_sw_state(rdev->rfkill, true);
schedule_work(&rdev->rfkill_sync);
* Re: [PATCH net-next-2.6] net: Consistent skb timestamping
From: Tom Herbert @ 2010-05-06 15:12 UTC
To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1273147309.2357.59.camel@edumazet-laptop>
On Thu, May 6, 2010 at 5:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> With RPS inclusion, skb timestamping is not consistent in RX path.
>
> If netif_receive_skb() is used, it's deferred until after RPS dispatch.
>
> If netif_rx() is used, it's done before RPS dispatch.
>
> This can give strange tcpdump timestamp results.
>
> I think timestamping should be done as soon as possible in the receive
> path, to get meaningful values (ie timestamps taken at the time packet
> was delivered by NIC driver to our stack), even if NAPI already can
> defer timestamping a bit (RPS can help to reduce the gap)
>
The counter argument to this is that it moves another thing into the
serialized path for networking which slows everyone down. I'm not
concerned about when tcpdump is running since performance will suck
anyway, but what is bad is if any single socket in the system turns on
SO_TIMESTAMP, overhead is incurred on *every* packet. This happens
regardless of whether the application ever actually gets a timestamp,
or even whether timestamps are supported by the protocol (try setting
SO_TIMESTAMP on a TCP socket ;-) ). I'm contemplating changing
SO_TIMESTAMP to not enable global timestamps, but only take the
timestamp for a packet once the socket is identified and the timestamp
flag is set (this is the technique done in FreeBSD and Solaris, so I
believe the external semantics would still be valid).
> Remove timestamping from __netif_receive_skb, and add it to
> netif_receive_skb(), before RPS.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> net/core/dev.c | 46 ++++++++++++++++++++++++++--------------------
> 1 file changed, 26 insertions(+), 20 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 36d53be..3278003 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1454,7 +1454,7 @@ void net_disable_timestamp(void)
> }
> EXPORT_SYMBOL(net_disable_timestamp);
>
> -static inline void net_timestamp(struct sk_buff *skb)
> +static inline void net_timestamp_set(struct sk_buff *skb)
> {
> if (atomic_read(&netstamp_needed))
> __net_timestamp(skb);
> @@ -1462,6 +1462,12 @@ static inline void net_timestamp(struct sk_buff *skb)
> skb->tstamp.tv64 = 0;
> }
>
> +static inline void net_timestamp_check(struct sk_buff *skb)
> +{
> + if (!skb->tstamp.tv64 && atomic_read(&netstamp_needed))
> + __net_timestamp(skb);
> +}
> +
> /**
> * dev_forward_skb - loopback an skb to another netif
> *
> @@ -1509,9 +1515,9 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
>
> #ifdef CONFIG_NET_CLS_ACT
> if (!(skb->tstamp.tv64 && (G_TC_FROM(skb->tc_verd) & AT_INGRESS)))
> - net_timestamp(skb);
> + net_timestamp_set(skb);
> #else
> - net_timestamp(skb);
> + net_timestamp_set(skb);
> #endif
>
> rcu_read_lock();
> @@ -2458,8 +2464,7 @@ int netif_rx(struct sk_buff *skb)
> if (netpoll_rx(skb))
> return NET_RX_DROP;
>
> - if (!skb->tstamp.tv64)
> - net_timestamp(skb);
> + net_timestamp_check(skb);
>
> #ifdef CONFIG_RPS
> {
> @@ -2780,9 +2785,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
> int ret = NET_RX_DROP;
> __be16 type;
>
> - if (!skb->tstamp.tv64)
> - net_timestamp(skb);
> -
> if (vlan_tx_tag_present(skb) && vlan_hwaccel_do_receive(skb))
> return NET_RX_SUCCESS;
>
> @@ -2899,23 +2901,27 @@ out:
> */
> int netif_receive_skb(struct sk_buff *skb)
> {
> + net_timestamp_check(skb);
> +
> #ifdef CONFIG_RPS
> - struct rps_dev_flow voidflow, *rflow = &voidflow;
> - int cpu, ret;
> + {
> + struct rps_dev_flow voidflow, *rflow = &voidflow;
> + int cpu, ret;
>
> - rcu_read_lock();
> + rcu_read_lock();
>
> - cpu = get_rps_cpu(skb->dev, skb, &rflow);
> + cpu = get_rps_cpu(skb->dev, skb, &rflow);
>
> - if (cpu >= 0) {
> - ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
> - rcu_read_unlock();
> - } else {
> - rcu_read_unlock();
> - ret = __netif_receive_skb(skb);
> - }
> + if (cpu >= 0) {
> + ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
> + rcu_read_unlock();
> + } else {
> + rcu_read_unlock();
> + ret = __netif_receive_skb(skb);
> + }
>
> - return ret;
> + return ret;
> + }
> #else
> return __netif_receive_skb(skb);
> #endif
>
>
>
* Re: [PATCH net-next-2.6] net: Consistent skb timestamping
From: Eric Dumazet @ 2010-05-06 15:37 UTC
To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <AANLkTikLgHvtpCtBTKmJZBwixmZDHjRjGb1c59oAemli@mail.gmail.com>
On Thursday, 6 May 2010 at 08:12 -0700, Tom Herbert wrote:
> On Thu, May 6, 2010 at 5:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > With RPS inclusion, skb timestamping is not consistent in RX path.
> >
> > If netif_receive_skb() is used, it's deferred until after RPS dispatch.
> >
> > If netif_rx() is used, it's done before RPS dispatch.
> >
> > This can give strange tcpdump timestamp results.
> >
> > I think timestamping should be done as soon as possible in the receive
> > path, to get meaningful values (ie timestamps taken at the time packet
> > was delivered by NIC driver to our stack), even if NAPI already can
> > defer timestamping a bit (RPS can help to reduce the gap)
> >
> The counter argument to this is that it moves another thing into the
> serialized path for networking which slows everyone down. I'm not
> concerned about when tcpdump is running since performance will suck
> anyway, but what is bad is if any single socket in the system turns on
> SO_TIMESTAMP, overhead is incurred on *every* packet. This happens
> regardless of whether the application ever actually gets a timestamp,
> or even whether timestamps are supported by the protocol (try setting
> SO_TIMESTAMP on a TCP socket ;-) ). I'm contemplating changing
> SO_TIMESTAMP to not enable global timestamps, but only take the
> timestamp for a packet once the socket is identified and the timestamp
> flag is set (this is the technique done in FreeBSD and Solaris, so I
> believe the external semantics would still be valid).
I agree with you, thanks for this excellent argument.
Right now, timestamping is not meant for userland pleasure, but for
sniffers and network diagnostics. (I mean with the current API, not with a
new one we could add later.)
Once we settle on per-socket timestamping, not global, we can reconsider
the thing (or not reconsider it, since socket timestamping will be done
after RPS dispatch).
It's true that our global variable to enable/disable timestamping sucks, but
it's a separate issue ;)
We could probably have a sysctl to let the admin choose the moment the
timestamp is taken (before or after RPS).
If TSC is available, here is the "perf top" of the CPU handling
1,200,000 packets per second while timestamping is requested.
You can hardly see anything related to time services:
--------------------------------------------------------------------------------------------------------------------------
PerfTop: 983 irqs/sec kernel:99.5% [1000Hz cycles], (all, cpu: 10)
--------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________________________ _______
1568.00 14.9% bnx2x_rx_int vmlinux
1133.00 10.7% eth_type_trans vmlinux
798.00 7.6% kmem_cache_alloc_node vmlinux
720.00 6.8% _raw_spin_lock vmlinux
709.00 6.7% __kmalloc_node_track_caller vmlinux
547.00 5.2% __memset vmlinux
540.00 5.1% __slab_alloc vmlinux
453.00 4.3% get_rps_cpu vmlinux
402.00 3.8% _raw_spin_lock_irqsave vmlinux
295.00 2.8% enqueue_to_backlog vmlinux
271.00 2.6% default_send_IPI_mask_sequence_phys vmlinux
259.00 2.5% get_partial_node vmlinux
235.00 2.2% __alloc_skb vmlinux
227.00 2.2% vlan_gro_common vmlinux
206.00 2.0% swiotlb_dma_mapping_error vmlinux
201.00 1.9% skb_put vmlinux
118.00 1.1% getnstimeofday vmlinux
97.00 0.9% csd_lock vmlinux
96.00 0.9% swiotlb_map_page vmlinux
85.00 0.8% read_tsc vmlinux
76.00 0.7% dev_gro_receive vmlinux
75.00 0.7% __napi_complete vmlinux
74.00 0.7% bnx2x_poll vmlinux
73.00 0.7% unmap_single vmlinux
72.00 0.7% netif_receive_skb vmlinux
66.00 0.6% irq_entries_start vmlinux
65.00 0.6% net_rps_action_and_irq_enable vmlinux
62.00 0.6% __phys_addr vmlinux
If HPET or acpi_pm is used, then you can cry :)
(820,000 pps or 570,000 pps max)
--------------------------------------------------------------------------------------------------------------------------
PerfTop: 1001 irqs/sec kernel:100.0% [1000Hz cycles], (all, cpu: 10)
--------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________________________ _______
6488.00 48.4% read_hpet vmlinux
1214.00 9.1% bnx2x_rx_int vmlinux
820.00 6.1% eth_type_trans vmlinux
679.00 5.1% _raw_spin_lock vmlinux
678.00 5.1% kmem_cache_alloc_node vmlinux
607.00 4.5% __slab_alloc vmlinux
478.00 3.6% __kmalloc_node_track_caller vmlinux
404.00 3.0% __memset vmlinux
246.00 1.8% get_partial_node vmlinux
213.00 1.6% get_rps_cpu vmlinux
195.00 1.5% enqueue_to_backlog vmlinux
171.00 1.3% __alloc_skb vmlinux
163.00 1.2% vlan_gro_common vmlinux
135.00 1.0% swiotlb_dma_mapping_error vmlinux
118.00 0.9% skb_put vmlinux
88.00 0.7% getnstimeofday vmlinux
60.00 0.4% swiotlb_map_page vmlinux
59.00 0.4% dev_gro_receive vmlinux
--------------------------------------------------------------------------------------------------------------------------
PerfTop: 1001 irqs/sec kernel:100.0% [1000Hz cycles], (all, cpu: 10)
--------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________________________ _______
2573.00 68.3% acpi_pm_read vmlinux
237.00 6.3% bnx2x_rx_int vmlinux
153.00 4.1% eth_type_trans vmlinux
101.00 2.7% kmem_cache_alloc_node vmlinux
99.00 2.6% __kmalloc_node_track_caller vmlinux
79.00 2.1% get_rps_cpu vmlinux
75.00 2.0% __memset vmlinux
72.00 1.9% _raw_spin_lock vmlinux
68.00 1.8% __slab_alloc vmlinux
40.00 1.1% enqueue_to_backlog vmlinux
39.00 1.0% __alloc_skb vmlinux
27.00 0.7% get_partial_node vmlinux
23.00 0.6% swiotlb_dma_mapping_error vmlinux
22.00 0.6% vlan_gro_common vmlinux
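The three profiles can be turned into a rough per-packet clock cost. This is
a back-of-the-envelope sketch under two assumptions that the message does not
state outright: the sampled CPU is fully busy, and the quoted rates pair up
in order (820,000 pps with HPET, 570,000 pps with acpi_pm). The function name
is hypothetical.

```c
/*
 * Estimated nanoseconds of CPU time per packet spent in one symbol,
 * assuming the sampled CPU is busy for the whole second.
 */
static double ns_per_packet(double percent_of_cpu, double packets_per_sec)
{
	return percent_of_cpu / 100.0 * 1e9 / packets_per_sec;
}
```

With the figures above, ns_per_packet(1.1 + 0.8, 1200000) gives about 16 ns
for getnstimeofday plus read_tsc, ns_per_packet(48.4, 820000) gives about
590 ns for read_hpet, and ns_per_packet(68.3, 570000) gives about 1.2 us for
acpi_pm_read.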
* [PATCH 1/8] 3c507: Remove unnecessary memset of netdev private data
From: Tobias Klauser @ 2010-05-06 15:39 UTC
To: davem, netdev; +Cc: kernel-janitors, Tobias Klauser
The memory for the private data is allocated using kzalloc in
alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
set it to 0 again.
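A userspace analogy of the reasoning, as a sketch only: calloc() stands in
for kzalloc(), and struct toy_priv and both helpers are hypothetical names,
not driver code. Memory that comes back zeroed from the allocator does not
need a second memset before its fields are assigned.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical private data, standing in for a driver's netdev_priv() area. */
struct toy_priv {
	int slot;
	unsigned long flags;
	char name[16];
};

/* Stand-in for alloc_etherdev(): calloc(), like kzalloc(), returns zeroed memory. */
static struct toy_priv *toy_alloc_priv(void)
{
	return calloc(1, sizeof(struct toy_priv));
}

/* Returns 1 if every byte of the object is zero, 0 otherwise. */
static int toy_all_zero(const void *p, size_t len)
{
	const unsigned char *b = p;
	size_t i;

	for (i = 0; i < len; i++)
		if (b[i])
			return 0;
	return 1;
}
```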
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
drivers/net/3c507.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/net/3c507.c b/drivers/net/3c507.c
index b32b7a1..9e95afa 100644
--- a/drivers/net/3c507.c
+++ b/drivers/net/3c507.c
@@ -449,7 +449,6 @@ static int __init el16_probe1(struct net_device *dev, int ioaddr)
pr_debug("%s", version);
lp = netdev_priv(dev);
- memset(lp, 0, sizeof(*lp));
spin_lock_init(&lp->lock);
lp->base = ioremap(dev->mem_start, RX_BUF_END);
if (!lp->base) {
--
1.6.3.3
* [PATCH 2/8] 3c523: Remove unnecessary memset of netdev private data
From: Tobias Klauser @ 2010-05-06 15:39 UTC
To: davem, netdev; +Cc: kernel-janitors, Tobias Klauser
The memory for the private data is allocated using kzalloc in
alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
set it to 0 again.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
drivers/net/3c523.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/net/3c523.c b/drivers/net/3c523.c
index 8c70686..55d219e 100644
--- a/drivers/net/3c523.c
+++ b/drivers/net/3c523.c
@@ -503,7 +503,6 @@ static int __init do_elmc_probe(struct net_device *dev)
break;
}
- memset(pr, 0, sizeof(struct priv));
pr->slot = slot;
pr_info("%s: 3Com 3c523 Rev 0x%x at %#lx\n", dev->name, (int) revision,
--
1.6.3.3
* [PATCH 3/8] KS8695: Remove unnecessary memset of netdev private data
From: Tobias Klauser @ 2010-05-06 15:40 UTC
To: davem, netdev; +Cc: kernel-janitors, Tobias Klauser
The memory for the private data is allocated using kzalloc in
alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
set it to 0 again.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
drivers/net/arm/ks8695net.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/net/arm/ks8695net.c b/drivers/net/arm/ks8695net.c
index 7413a87..6404704 100644
--- a/drivers/net/arm/ks8695net.c
+++ b/drivers/net/arm/ks8695net.c
@@ -1472,7 +1472,6 @@ ks8695_probe(struct platform_device *pdev)
/* Configure our private structure a little */
ksp = netdev_priv(ndev);
- memset(ksp, 0, sizeof(struct ks8695_priv));
ksp->dev = &pdev->dev;
ksp->ndev = ndev;
--
1.6.3.3
* [PATCH 4/8] bcm63xx_enet: Remove unnecessary memset of netdev private data
From: Tobias Klauser @ 2010-05-06 15:40 UTC
To: davem, netdev; +Cc: kernel-janitors, Tobias Klauser
The memory for the private data is allocated using kzalloc in
alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
set it to 0 again.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
drivers/net/bcm63xx_enet.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/net/bcm63xx_enet.c b/drivers/net/bcm63xx_enet.c
index 9a8bdea..f48ba80 100644
--- a/drivers/net/bcm63xx_enet.c
+++ b/drivers/net/bcm63xx_enet.c
@@ -1647,7 +1647,6 @@ static int __devinit bcm_enet_probe(struct platform_device *pdev)
if (!dev)
return -ENOMEM;
priv = netdev_priv(dev);
- memset(priv, 0, sizeof(*priv));
ret = compute_hw_mtu(priv, dev->mtu);
if (ret)
--
1.6.3.3
* [PATCH 5/8] ethoc: Remove unnecessary memset of napi member in netdev private data
From: Tobias Klauser @ 2010-05-06 15:41 UTC
To: davem, netdev; +Cc: kernel-janitors, Tobias Klauser
The memory for the private data is allocated using kzalloc in
alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
set the napi member to 0 explicitly.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
drivers/net/ethoc.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/net/ethoc.c b/drivers/net/ethoc.c
index 6bd03c8..ad1bc73 100644
--- a/drivers/net/ethoc.c
+++ b/drivers/net/ethoc.c
@@ -1040,7 +1040,6 @@ static int ethoc_probe(struct platform_device *pdev)
netdev->features |= 0;
/* setup NAPI */
- memset(&priv->napi, 0, sizeof(priv->napi));
netif_napi_add(netdev, &priv->napi, ethoc_poll, 64);
spin_lock_init(&priv->rx_lock);
--
1.6.3.3
* [PATCH 6/8] smc9194: Remove unnecessary memset of netdev private data
From: Tobias Klauser @ 2010-05-06 15:41 UTC
To: davem, netdev; +Cc: kernel-janitors, Tobias Klauser
The memory for the private data is allocated using kzalloc in
alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
set it to 0 again.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
drivers/net/smc9194.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)
diff --git a/drivers/net/smc9194.c b/drivers/net/smc9194.c
index e94521c..d76c815 100644
--- a/drivers/net/smc9194.c
+++ b/drivers/net/smc9194.c
@@ -1042,9 +1042,6 @@ static int __init smc_probe(struct net_device *dev, int ioaddr)
*/
printk("ADDR: %pM\n", dev->dev_addr);
- /* set the private data to zero by default */
- memset(netdev_priv(dev), 0, sizeof(struct smc_local));
-
/* Grab the IRQ */
retval = request_irq(dev->irq, smc_interrupt, 0, DRV_NAME, dev);
if (retval) {
--
1.6.3.3
* [PATCH 7/8] sunhme: Remove unnecessary memset of netdev private data
From: Tobias Klauser @ 2010-05-06 15:41 UTC
To: davem, netdev; +Cc: kernel-janitors, Tobias Klauser
The memory for the private data is allocated using kzalloc in
alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
set it to 0 again.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
drivers/net/sunhme.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/net/sunhme.c b/drivers/net/sunhme.c
index 20deb14..982ff12 100644
--- a/drivers/net/sunhme.c
+++ b/drivers/net/sunhme.c
@@ -3004,7 +3004,6 @@ static int __devinit happy_meal_pci_probe(struct pci_dev *pdev,
dev->base_addr = (long) pdev;
hp = netdev_priv(dev);
- memset(hp, 0, sizeof(*hp));
hp->happy_dev = pdev;
hp->dma_dev = &pdev->dev;
--
1.6.3.3
* [PATCH 8/8] tehuti: Remove unnecessary memset of netdev private data
From: Tobias Klauser @ 2010-05-06 15:43 UTC
To: baum, davem, netdev; +Cc: kernel-janitors, andy, Tobias Klauser
The memory for the private data is allocated using kzalloc in
alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
set it to 0 again.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
drivers/net/tehuti.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/net/tehuti.c b/drivers/net/tehuti.c
index e29f495..20ab161 100644
--- a/drivers/net/tehuti.c
+++ b/drivers/net/tehuti.c
@@ -2033,7 +2033,6 @@ bdx_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
/************** priv ****************/
priv = nic->priv[port] = netdev_priv(ndev);
- memset(priv, 0, sizeof(struct bdx_priv));
priv->pBdxRegs = nic->regs + port * 0x8000;
priv->port = port;
priv->pdev = pdev;
--
1.6.3.3
* Re: [PATCH/RFC] cxgb4: Add MAINTAINERS info
From: Roland Dreier @ 2010-05-06 16:07 UTC
To: Or Gerlitz
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW, dm-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <4BE25A3D.20800-smomgflXvOZWk0Htik3J/w@public.gmane.org>
> not sure who's the butterfly that caused this, but this was somehow
> committed as "CXGB4 ETHERNET DRIVER (CXGB3)" and same goes for the
> IW_ piece
Thanks, I think I committed, saw the problem, fixed it up, sent the RFC,
and then pushed my tree. I fixed it up now. Pretty impressive eagle
eyes to notice that...
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
* Re: [PATCH net-next-2.6] net: Consistent skb timestamping
From: Eric Dumazet @ 2010-05-06 16:14 UTC
To: Tom Herbert; +Cc: David Miller, netdev
In-Reply-To: <1273160276.2853.27.camel@edumazet-laptop>
On Thursday, 6 May 2010 at 17:37 +0200, Eric Dumazet wrote:
> Right now, timestamping is not meant for userland pleasure, but for
> sniffers and network diagnostics. (I mean with the current API, not with a
> new one we could add later.)
>
> Once we settle on per-socket timestamping, not global, we can reconsider
> the thing (or not reconsider it, since socket timestamping will be done
> after RPS dispatch).
>
> It's true that our global variable to enable/disable timestamping sucks, but
> it's a separate issue ;)
>
> We could probably have a sysctl to let the admin choose the moment the
> timestamp is taken (before or after RPS).
Here is v2 of patch,
introducing /proc/sys/net/core/netdev_tstamp_prequeue
Thanks
[PATCH v2 net-next-2.6] net: Consistent skb timestamping
With RPS inclusion, skb timestamping is not consistent in the RX path.
If netif_receive_skb() is used, it's deferred until after RPS dispatch.
If netif_rx() is used, it's done before RPS dispatch.
This can give strange tcpdump timestamp results.
I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (i.e. timestamps taken at the time the packet
was delivered by the NIC driver to our stack), even if NAPI can already
defer timestamping a bit (RPS can help to reduce the gap).
Tom Herbert prefers to sample timestamps after RPS dispatch. In case
sampling is expensive (HPET/acpi_pm on x86), this makes sense.
Let admins switch from one mode to the other, using a new
sysctl, /proc/sys/net/core/netdev_tstamp_prequeue.
Its default value (1) means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.
Setting it to 0 allows timestamps to be sampled when processing the backlog,
after RPS dispatch, to lower the load on the pre-RPS CPU.
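The two modes can be modelled in userspace as follows. This is a toy sketch,
not kernel code: every toy_* name is hypothetical, a counter stands in for
the clock, and only the shape of the two call sites (before dispatch, and in
backlog processing) follows the patch below.

```c
#include <stdbool.h>

enum stamp_site { STAMP_NONE, STAMP_PREQUEUE, STAMP_BACKLOG };

struct toy_skb {
	long tstamp;			/* 0 means "not stamped yet" */
	enum stamp_site stamped_at;
};

static int toy_tstamp_prequeue = 1;	/* stands in for netdev_tstamp_prequeue */
static bool toy_netstamp_needed = true;	/* stands in for netstamp_needed != 0 */
static long toy_clock;			/* fake clock, incremented per stamp */

/* Same shape as net_timestamp_check(): stamp only if needed and not yet set. */
static void toy_timestamp_check(struct toy_skb *skb, enum stamp_site site)
{
	if (!skb->tstamp && toy_netstamp_needed) {
		skb->tstamp = ++toy_clock;
		skb->stamped_at = site;
	}
}

/* Models __netif_receive_skb(): stamps here only when prequeue is off. */
static void toy_backlog_receive(struct toy_skb *skb)
{
	if (!toy_tstamp_prequeue)
		toy_timestamp_check(skb, STAMP_BACKLOG);
}

/* Models netif_receive_skb() / netif_rx(): stamps before dispatch if prequeue is on. */
static void toy_receive(struct toy_skb *skb)
{
	if (toy_tstamp_prequeue)
		toy_timestamp_check(skb, STAMP_PREQUEUE);
	/* RPS dispatch would happen here, possibly to another CPU. */
	toy_backlog_receive(skb);
}

/* Push one packet through the model and report where it was stamped. */
static enum stamp_site toy_stamp_site(int prequeue)
{
	struct toy_skb skb = { 0, STAMP_NONE };

	toy_tstamp_prequeue = prequeue;
	toy_receive(&skb);
	return skb.stamped_at;
}
```

In this model a packet is stamped at exactly one of the two sites, chosen by
the sysctl value, which is the behaviour the description above asks for.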
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
Documentation/sysctl/net.txt | 10 ++++++
include/linux/netdevice.h | 1
net/core/dev.c | 50 ++++++++++++++++++++-------------
net/core/sysctl_net_core.c | 7 ++++
4 files changed, 49 insertions(+), 19 deletions(-)
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index df38ef0..cbd05ff 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -84,6 +84,16 @@ netdev_max_backlog
Maximum number of packets, queued on the INPUT side, when the interface
receives packets faster than kernel can process them.
+netdev_tstamp_prequeue
+----------------------
+
+If set to 0, RX packet timestamps can be sampled after RPS processing, when
+the target CPU processes packets. It might give some delay on timestamps, but
+permit to distribute the load on several cpus.
+
+If set to 1 (default), timestamps are sampled as soon as possible, before
+queueing.
+
optmem_max
----------
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 69022d4..c1b2341 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2100,6 +2100,7 @@ extern const struct net_device_stats *dev_get_stats(struct net_device *dev);
extern void dev_txq_stats_fold(const struct net_device *dev, struct net_device_stats *stats);
extern int netdev_max_backlog;
+extern int netdev_tstamp_prequeue;
extern int weight_p;
extern int netdev_set_master(struct net_device *dev, struct net_device *master);
extern int skb_checksum_help(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index 36d53be..1ca4de8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1454,7 +1454,7 @@ void net_disable_timestamp(void)
}
EXPORT_SYMBOL(net_disable_timestamp);
-static inline void net_timestamp(struct sk_buff *skb)
+static inline void net_timestamp_set(struct sk_buff *skb)
{
if (atomic_read(&netstamp_needed))
__net_timestamp(skb);
@@ -1462,6 +1462,12 @@ static inline void net_timestamp(struct sk_buff *skb)
skb->tstamp.tv64 = 0;
}
+static inline void net_timestamp_check(struct sk_buff *skb)
+{
+ if (!skb->tstamp.tv64 && atomic_read(&netstamp_needed))
+ __net_timestamp(skb);
+}
+
/**
* dev_forward_skb - loopback an skb to another netif
*
@@ -1509,9 +1515,9 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
#ifdef CONFIG_NET_CLS_ACT
if (!(skb->tstamp.tv64 && (G_TC_FROM(skb->tc_verd) & AT_INGRESS)))
- net_timestamp(skb);
+ net_timestamp_set(skb);
#else
- net_timestamp(skb);
+ net_timestamp_set(skb);
#endif
rcu_read_lock();
@@ -2202,6 +2208,7 @@ EXPORT_SYMBOL(dev_queue_xmit);
=======================================================================*/
int netdev_max_backlog __read_mostly = 1000;
+int netdev_tstamp_prequeue __read_mostly = 1;
int netdev_budget __read_mostly = 300;
int weight_p __read_mostly = 64; /* old backlog weight */
@@ -2458,8 +2465,8 @@ int netif_rx(struct sk_buff *skb)
if (netpoll_rx(skb))
return NET_RX_DROP;
- if (!skb->tstamp.tv64)
- net_timestamp(skb);
+ if (netdev_tstamp_prequeue)
+ net_timestamp_check(skb);
#ifdef CONFIG_RPS
{
@@ -2780,8 +2787,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
int ret = NET_RX_DROP;
__be16 type;
- if (!skb->tstamp.tv64)
- net_timestamp(skb);
+ if (!netdev_tstamp_prequeue)
+ net_timestamp_check(skb);
if (vlan_tx_tag_present(skb) && vlan_hwaccel_do_receive(skb))
return NET_RX_SUCCESS;
@@ -2899,23 +2906,28 @@ out:
*/
int netif_receive_skb(struct sk_buff *skb)
{
+ if (netdev_tstamp_prequeue)
+ net_timestamp_check(skb);
+
#ifdef CONFIG_RPS
- struct rps_dev_flow voidflow, *rflow = &voidflow;
- int cpu, ret;
+ {
+ struct rps_dev_flow voidflow, *rflow = &voidflow;
+ int cpu, ret;
- rcu_read_lock();
+ rcu_read_lock();
+
+ cpu = get_rps_cpu(skb->dev, skb, &rflow);
- cpu = get_rps_cpu(skb->dev, skb, &rflow);
+ if (cpu >= 0) {
+ ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+ rcu_read_unlock();
+ } else {
+ rcu_read_unlock();
+ ret = __netif_receive_skb(skb);
+ }
- if (cpu >= 0) {
- ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
- rcu_read_unlock();
- } else {
- rcu_read_unlock();
- ret = __netif_receive_skb(skb);
+ return ret;
}
-
- return ret;
#else
return __netif_receive_skb(skb);
#endif
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index dcc7d25..01eee5d 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -122,6 +122,13 @@ static struct ctl_table net_core_table[] = {
.proc_handler = proc_dointvec
},
{
+ .procname = "netdev_tstamp_prequeue",
+ .data = &netdev_tstamp_prequeue,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+ {
.procname = "message_cost",
.data = &net_ratelimit_state.interval,
.maxlen = sizeof(int),
* Re: [net-next-2.6 V5 PATCH 0/3] Add port-profile netlink support
From: Scott Feldman @ 2010-05-06 16:19 UTC
To: Arnd Bergmann; +Cc: davem, netdev, chrisw
In-Reply-To: <201005061551.35254.arnd@arndb.de>
On 5/6/10 6:51 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> On Thursday 06 May 2010, Scott Feldman wrote:
>> The intent of this patch set is to cover both definitions of port-profile
>> as defined by Cisco's enic use and as defined by VSI discovery protocol (VDP),
>> used in VEPA implementations. While both definitions are based on
>> pre-standards, the concept of a port-profile to be applied to an external
>> switch port on behalf of a virtual machine interface is common, as well as
>> many of the fields defining the protocols.
>
> The description either no longer matches the patches, or you did not make
> the changes that were needed based on our last discussion.
>
> What happened to the base-device argument that you were planning to pass?
Using the IFLA_VF_* model works better for us where the recipient of the
netlink msg is the PF but the msg is to be applied to the VF. The third
patch illustrates how this fits nicely with SR-IOV devices. The PF is the
base device.
> The fields that I mentioned as needed for VDP
> (associate/pre-associate/disassociate-flag, VLAN ID, etc.) are not there.
> I assume that means we should use a different data structure for VDP, but
> then your description above should be updated to state that this is no
> longer common for the two.
>
> I'll follow up with a draft for VDP based on your definitions.
I tried to accommodate space for VDP, but was hoping you could add the
definitions on top of what I had since you're more familiar with VDP and can
do the testing.
Also, I wasn't sure if you could use the existing IFLA_VF_VLAN msg to apply
the VLAN ID or if you wanted VLAN ID also added to IFLA_VF_PORT_PROFILE.
-scott
* Re: ixgbe and mac-vlans problem
From: Ben Greear @ 2010-05-06 16:23 UTC
To: Tantilov, Emil S; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <EA929A9653AAE14F841771FB1DE5A1365FE5560B00@rrsmsx501.amr.corp.intel.com>
On 04/30/2010 03:26 PM, Tantilov, Emil S wrote:
> Ben Greear wrote:
>> On 04/30/2010 02:13 PM, Tantilov, Emil S wrote:
>>> I ran a quick test in my setup with 82599 and was able to pass
>>> traffic on all 50 mac-vlans without issues. This is on net-next.
>>
>> For an 82599 system, I can get 127 mac-vlans working out of 500
>> created.
>>
>> That NIC also does not go PROMISC with lots (500) of mac-vlans.
>>
>> Once I put it in promisc mode manually, it works fine.
>>
>> So, I think whatever logic is supposed to put the NIC into promisc
>> mode when it overflows its lookup tables isn't working for ixgbe
>> in 2.6.31.12.
>
> Yeah, you're right. I was able to repro it.
>
> We'll look into it.
I'd be happy to test out a patch if you have one available.
If you don't expect to have one soon, please let me know and
I'll add work-arounds to my code to throw ixgbe NICs into PROMISC
mode manually.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: [net-next-2.6 V5 PATCH 2/3] Add ndo_{set|get}_vf_port_profile op support for enic dynamic vnics
From: Scott Feldman @ 2010-05-06 16:25 UTC
To: Arnd Bergmann; +Cc: davem, netdev, chrisw
In-Reply-To: <201005061547.22400.arnd@arndb.de>
On 5/6/10 6:47 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> On Thursday 06 May 2010, Scott Feldman wrote:
>> @@ -810,14 +819,24 @@ static void enic_reset_mcaddrs(struct enic *enic)
>>
>> static int enic_set_mac_addr(struct net_device *netdev, char *addr)
>> {
>> - if (!is_valid_ether_addr(addr))
>> - return -EADDRNOTAVAIL;
>> + struct enic *enic = netdev_priv(netdev);
>>
>> - memcpy(netdev->dev_addr, addr, netdev->addr_len);
>> + if (enic_is_dynamic(enic)) {
>> + random_ether_addr(netdev->dev_addr);
>> + } else {
>> + if (!is_valid_ether_addr(addr))
>> + return -EADDRNOTAVAIL;
>> + memcpy(netdev->dev_addr, addr, netdev->addr_len);
>> + }
>>
>> return 0;
>> }
>>
>> +static int enic_set_mac_address(struct net_device *netdev, void *p)
>> +{
>> + return -EOPNOTSUPP;
>> +}
>> +
>> /* netif_tx_lock held, BHs disabled */
>> static void enic_set_multicast_list(struct net_device *netdev)
>> {
>
> This looks funny. So you just ignore the address that gets passed to
> enic_set_mac_addr for dynamic interfaces and instead set a random
> address?
Dynamic enics have all-zero mac address on init, so we assign a random mac
addr to the interface. This would seem less funny:
if (enic_is_dynamic(enic) && is_zero_ether_addr(addr))
random_ether_addr(netdev->dev_addr);
else
...
I'll make that change and resubmit with your VDP additions if you like.
-scott
^ permalink raw reply
* Re: [net-next-2.6 V5 PATCH 0/3] Add port-profile netlink support
From: Arnd Bergmann @ 2010-05-06 16:42 UTC (permalink / raw)
To: Scott Feldman; +Cc: davem, netdev, chrisw
In-Reply-To: <C8083A0F.2EC10%scofeldm@cisco.com>
On Thursday 06 May 2010, Scott Feldman wrote:
> On 5/6/10 6:51 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
>
> > On Thursday 06 May 2010, Scott Feldman wrote:
> >> The intent of this patch set is to cover both definitions of port-profile
> >> as defined by Cisco's enic use and as defined by VSI Discovery Protocol (VDP),
> >> used in VEPA implementations. While both definitions are based on pre-
> >> standards, the concept of a port-profile to be applied to an external switch
> >> port on behalf of a virtual machine interface is common, as well as many
> >> of the fields defining the protocols.
> >
> > The description either no longer matches the patches, or you did not make
> > the changes that were needed based on our last discussion.
> >
> > What happened to the base-device argument that you were planning to pass?
>
> Using the IFLA_VF_* model works better for us where the recipient of the
> netlink msg is the PF but the msg is to be applied to the VF. The third
> patch illustrates how this fits nicely with SR-IOV devices. The PF is the
> base device.
Ah, got it. I did not notice that you had put a vf field in there.
It now makes a lot more sense to me, and is more in line with what
we need for VDP.
It does however make me wonder how this could be implemented for
a software-only implementation of your protocol that does not refer
to vf numbers. One way would be to define the 'vf' field as implementation
specific and just use the ifindex in this case, which would also work
in case of network namespaces. Alternatively, it could use whatever
tag you use in your wire protocol (e.g. an S-VID)
Both are a bit of a stretch, but I see no technical problems with them.
> > The fields that I mentioned are needed for VDP
> > (associate/pre-associate/disassociate-flag,
> > VLAN ID, etc) are not there. I assume that means we should use a different
> > data structure for VDP, but then your description above should be updated
> > to state that this is no longer common for the two.
> >
> > I'll follow up with a draft for VDP based on your definitions.
>
> I tried to accommodate space for VDP, but was hoping you could add the
> definitions on top of what I had since you're more familiar with VDP and can
> do the testing.
>
> Also, I wasn't sure if you could use the existing IFLA_VF_VLAN msg to apply
> the VLAN ID or if you wanted VLAN ID also added to IFLA_VF_PORT_PROFILE.
The IFLA_VF_VLAN would not work well here because of the issue we discussed
before that I think we need to keep device setup separate from the protocol
exchange. IFLA_VF_VLAN configures the VLAN, while we need to tell the switch
about the configuration.
One (new) point that came up today is that your protocol is actually much
more closely related to the 'CDCP' protocol in 802.1Qbg than to 'VDP'.
I'll also try to make sure that we cover this case as well. It should
also be possible to do VDP over a dynamic enic VF and have multiple guests
using macvtap on that function, and there will probably be adapters that
need to use IFLA_VF_PORT_PROFILE (or another set) as the interface between
libvirt and the adapter firmware for doing CDCP.
To give some background, CDCP is an LLDP extension that is used to create
virtual channels between a physical NIC and the phys bridge on the other side,
using S-VLAN tagging. You can either assign one of these channels to a
guest directly (similar to what enic does), or use VDP on the channel
to connect multiple guests using a bridge device or macvtap in the same
way that we also do VDP on the physical device in the absence of CDCP.
Arnd
^ permalink raw reply
* Re: [net-next-2.6 V5 PATCH 2/3] Add ndo_{set|get}_vf_port_profile op support for enic dynamic vnics
From: Arnd Bergmann @ 2010-05-06 16:45 UTC (permalink / raw)
To: Scott Feldman; +Cc: davem, netdev, chrisw
In-Reply-To: <C8083B9D.2EC18%scofeldm@cisco.com>
On Thursday 06 May 2010, Scott Feldman wrote:
> Dynamic enics have all-zero mac address on init, so we assign a random mac
> addr to the interface. This would seem less funny:
>
> if (enic_is_dynamic(enic) && is_zero_ether_addr(addr))
> random_ether_addr(netdev->dev_addr);
> else
> ...
>
> I'll make that change and resubmit with your VDP additions if you like.
The change is ok, but what I think would be more helpful is a code comment
with your above sentence.
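
A userspace sketch of how the revised check could read with that comment in place. This is a model, not the enic driver: the mac_* helpers mimic the kernel's is_zero_ether_addr(), is_multicast_ether_addr(), is_valid_ether_addr() and random_ether_addr(), and the else branch (elided as "..." above) is assumed to keep the validity check and memcpy() from the posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ETH_ALEN 6

bool mac_is_zero(const uint8_t *a)
{
    static const uint8_t zero[ETH_ALEN];
    return memcmp(a, zero, ETH_ALEN) == 0;
}

bool mac_is_multicast(const uint8_t *a)
{
    return a[0] & 0x01;
}

bool mac_is_valid(const uint8_t *a)
{
    return !mac_is_multicast(a) && !mac_is_zero(a);
}

/* Random unicast, locally administered address. */
void mac_random(uint8_t *a)
{
    for (int i = 0; i < ETH_ALEN; i++)
        a[i] = (uint8_t)(rand() & 0xff);
    a[0] &= 0xfe;   /* clear the multicast bit */
    a[0] |= 0x02;   /* set the locally administered bit */
}

/* Hypothetical model of enic_set_mac_addr(); "dynamic" stands in for enic_is_dynamic(). */
int model_set_mac_addr(uint8_t *dev_addr, const uint8_t *addr, bool dynamic)
{
    assert(dev_addr != NULL && addr != NULL);

    /*
     * Dynamic enics have an all-zero mac address on init, so we
     * assign a random mac addr to the interface.
     */
    if (dynamic && mac_is_zero(addr)) {
        mac_random(dev_addr);
        return 0;
    }

    /* Assumed else branch: same validation as for static enics. */
    if (!mac_is_valid(addr))
        return -EADDRNOTAVAIL;
    memcpy(dev_addr, addr, ETH_ALEN);
    return 0;
}
```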
Arnd
^ permalink raw reply
* [PATCH] ipv4: remove ip_rt_secret timer
From: Neil Horman @ 2010-05-06 17:16 UTC (permalink / raw)
To: netdev; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, nhorman
A while back there was a discussion regarding the rt_secret_interval timer.
Given that we've had the ability to do emergency route cache rebuilds for a while
now, based on a statistical analysis of the various hash chain lengths in the
cache, the use of the flush timer is somewhat redundant. This patch removes the
rt_secret_interval sysctl, allowing us to rely solely on the statistical
analysis mechanism to determine the need for route cache flushes.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/net/netns/ipv4.h | 1
net/ipv4/route.c | 108 -----------------------------------------------
2 files changed, 2 insertions(+), 107 deletions(-)
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ae07fee..d68c3f1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,7 +55,6 @@ struct netns_ipv4 {
int sysctl_rt_cache_rebuild_count;
int current_rt_cache_rebuild_count;
- struct timer_list rt_secret_timer;
atomic_t rt_genid;
#ifdef CONFIG_IP_MROUTE
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a947428..ffd3da1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly = 8;
static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
-static int ip_rt_secret_interval __read_mostly = 10 * 60 * HZ;
static int rt_chain_length_max __read_mostly = 20;
static struct delayed_work expires_work;
@@ -918,32 +917,11 @@ void rt_cache_flush_batch(void)
rt_do_flush(!in_softirq());
}
-/*
- * We change rt_genid and let gc do the cleanup
- */
-static void rt_secret_rebuild(unsigned long __net)
-{
- struct net *net = (struct net *)__net;
- rt_cache_invalidate(net);
- mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
-static void rt_secret_rebuild_oneshot(struct net *net)
-{
- del_timer_sync(&net->ipv4.rt_secret_timer);
- rt_cache_invalidate(net);
- if (ip_rt_secret_interval)
- mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
static void rt_emergency_hash_rebuild(struct net *net)
{
- if (net_ratelimit()) {
+ if (net_ratelimit())
printk(KERN_WARNING "Route hash chain too long!\n");
- printk(KERN_WARNING "Adjust your secret_interval!\n");
- }
-
- rt_secret_rebuild_oneshot(net);
+ rt_cache_invalidate(net);
}
/*
@@ -3101,48 +3079,6 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
return -EINVAL;
}
-static void rt_secret_reschedule(int old)
-{
- struct net *net;
- int new = ip_rt_secret_interval;
- int diff = new - old;
-
- if (!diff)
- return;
-
- rtnl_lock();
- for_each_net(net) {
- int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
- long time;
-
- if (!new)
- continue;
-
- if (deleted) {
- time = net->ipv4.rt_secret_timer.expires - jiffies;
-
- if (time <= 0 || (time += diff) <= 0)
- time = 0;
- } else
- time = new;
-
- mod_timer(&net->ipv4.rt_secret_timer, jiffies + time);
- }
- rtnl_unlock();
-}
-
-static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
- void __user *buffer, size_t *lenp,
- loff_t *ppos)
-{
- int old = ip_rt_secret_interval;
- int ret = proc_dointvec_jiffies(ctl, write, buffer, lenp, ppos);
-
- rt_secret_reschedule(old);
-
- return ret;
-}
-
static ctl_table ipv4_route_table[] = {
{
.procname = "gc_thresh",
@@ -3251,13 +3187,6 @@ static ctl_table ipv4_route_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
- {
- .procname = "secret_interval",
- .data = &ip_rt_secret_interval,
- .maxlen = sizeof(int),
- .mode = 0644,
- .proc_handler = ipv4_sysctl_rt_secret_interval,
- },
{ }
};
@@ -3337,36 +3266,6 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
#endif
-static __net_init int rt_secret_timer_init(struct net *net)
-{
- atomic_set(&net->ipv4.rt_genid,
- (int) ((num_physpages ^ (num_physpages>>8)) ^
- (jiffies ^ (jiffies >> 7))));
-
- net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
- net->ipv4.rt_secret_timer.data = (unsigned long)net;
- init_timer_deferrable(&net->ipv4.rt_secret_timer);
-
- if (ip_rt_secret_interval) {
- net->ipv4.rt_secret_timer.expires =
- jiffies + net_random() % ip_rt_secret_interval +
- ip_rt_secret_interval;
- add_timer(&net->ipv4.rt_secret_timer);
- }
- return 0;
-}
-
-static __net_exit void rt_secret_timer_exit(struct net *net)
-{
- del_timer_sync(&net->ipv4.rt_secret_timer);
-}
-
-static __net_initdata struct pernet_operations rt_secret_timer_ops = {
- .init = rt_secret_timer_init,
- .exit = rt_secret_timer_exit,
-};
-
-
#ifdef CONFIG_NET_CLS_ROUTE
struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
#endif /* CONFIG_NET_CLS_ROUTE */
@@ -3424,9 +3323,6 @@ int __init ip_rt_init(void)
schedule_delayed_work(&expires_work,
net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
- if (register_pernet_subsys(&rt_secret_timer_ops))
- printk(KERN_ERR "Unable to setup rt_secret_timer\n");
-
if (ip_rt_proc_init())
printk(KERN_ERR "Unable to create route proc files\n");
#ifdef CONFIG_XFRM
^ permalink raw reply related
* [PATCH] net: deliver skbs on inactive slaves to exact matches
From: John Fastabend @ 2010-05-06 17:24 UTC (permalink / raw)
To: bonding-devel, netdev
Cc: john.r.fastabend, christopher.leech, andy, kaber, fubar
Currently, the accelerated receive path for VLANs will
drop packets if the real device is an inactive slave and
the packet is not one of the special pkts tested for in
skb_bond_should_drop(). This behavior is different than
the non-accelerated path and for pkts over a bonded vlan.
For example,
vlanx -> bond0 -> ethx
will be dropped in the vlan path and not delivered to any
packet handlers. However,
bond0 -> vlanx -> ethx
will be delivered to handlers that match the exact dev,
because the VLAN path checks the real_dev which is not a
slave and netif_receive_skb() doesn't drop frames but only
delivers them to exact matches.
This patch adds a pkt_type PACKET_DROP which is now used
to identify skbs that would previously have been dropped and
allows the skb to continue to netif_receive_skb(). Here we
add logic to check for PACKET_DROP and if so only deliver
to handlers that match exactly. IMHO this is more
consistent and gives pkt handlers a way to identify skbs
that come from inactive slaves.
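
As a rough illustration of the delivery rule described above, here is a userspace model of the device comparison in the ptype_base loop. It is only a sketch: struct dev and should_deliver() are hypothetical stand-ins, the protocol type comparison is left out, and the inactive_slave argument stands for "pkt_type is PACKET_DROP".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct net_device; only pointer identity matters here. */
struct dev {
    const char *name;
};

/*
 * handler_dev == NULL models a wildcard handler (ptype->dev == NULL).
 * For an skb from an inactive slave, null_or_orig is the original device,
 * so wildcard handlers no longer match and only exact matches remain.
 */
bool should_deliver(const struct dev *handler_dev,
                    const struct dev *skb_dev,
                    const struct dev *orig_dev,
                    const struct dev *dev_or_bond,
                    bool inactive_slave)
{
    const struct dev *null_or_orig;

    assert(orig_dev != NULL);
    null_or_orig = inactive_slave ? orig_dev : NULL;

    return handler_dev == null_or_orig ||
           handler_dev == skb_dev ||
           handler_dev == orig_dev ||
           handler_dev == dev_or_bond;
}
```

With skb->dev left at ethx for an inactive slave, a handler bound to ethx still matches, while a wildcard handler or one bound to bond0 does not.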
This allows a third case to function which is important for
doing multipath with FCoE traffic while LAN traffic is bonded,
bond0 -> ethx
|
vlanx -> --
Here the vlan is not in bond0 but the FCoE handler can still
receive the skb. Previously these skbs were dropped.
I have tested the following 4 configurations in failover modes
and load balancing modes and have not seen any duplicate packets
or unexpected behavior.
# bond0 -> ethx
# vlanx -> bond0 -> ethx
# bond0 -> vlanx -> ethx
# bond0 -> ethx
|
vlanx -> --
Also this removes the PACKET_FASTROUTE define which was not being
used.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
include/linux/if_packet.h | 2 +-
net/8021q/vlan_core.c | 4 ++--
net/core/dev.c | 25 ++++++++++++++++++-------
3 files changed, 21 insertions(+), 10 deletions(-)
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
index 6ac23ef..9d079fa 100644
--- a/include/linux/if_packet.h
+++ b/include/linux/if_packet.h
@@ -28,7 +28,7 @@ struct sockaddr_ll {
#define PACKET_OUTGOING 4 /* Outgoing of any type */
/* These ones are invisible by user level */
#define PACKET_LOOPBACK 5 /* MC/BRD frame looped back */
-#define PACKET_FASTROUTE 6 /* Fastrouted frame */
+#define PACKET_DROP 6 /* Drop packet */
/* Packet socket options */
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index c584a0a..4510e08 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -12,7 +12,7 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
return NET_RX_DROP;
if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
- goto drop;
+ skb->pkt_type = PACKET_DROP;
skb->skb_iif = skb->dev->ifindex;
__vlan_hwaccel_put_tag(skb, vlan_tci);
@@ -84,7 +84,7 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
struct sk_buff *p;
if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
- goto drop;
+ skb->pkt_type = PACKET_DROP;
skb->skb_iif = skb->dev->ifindex;
__vlan_hwaccel_put_tag(skb, vlan_tci);
diff --git a/net/core/dev.c b/net/core/dev.c
index 36d53be..cefac4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2776,7 +2776,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
struct net_device *orig_dev;
struct net_device *master;
struct net_device *null_or_orig;
- struct net_device *null_or_bond;
+ struct net_device *dev_or_bond;
int ret = NET_RX_DROP;
__be16 type;
@@ -2793,13 +2793,24 @@ static int __netif_receive_skb(struct sk_buff *skb)
if (!skb->skb_iif)
skb->skb_iif = skb->dev->ifindex;
+ /*
+ * bonding note: skbs received on inactive slaves should only
+ * be delivered to pkt handlers that are exact matches. Also
+ * the pkt_type field will be marked PACKET_DROP. If packet
+ * handlers are sensitive to duplicate packets these skbs will
+ * need to be dropped at the handler. The vlan accel path may
+ * have already set PACKET_DROP.
+ */
null_or_orig = NULL;
orig_dev = skb->dev;
master = ACCESS_ONCE(orig_dev->master);
- if (master) {
- if (skb_bond_should_drop(skb, master))
+ if (skb->pkt_type == PACKET_DROP)
+ null_or_orig = orig_dev;
+ else if (master) {
+ if (skb_bond_should_drop(skb, master)) {
+ skb->pkt_type = PACKET_DROP;
null_or_orig = orig_dev; /* deliver only exact match */
- else
+ } else
skb->dev = master;
}
@@ -2849,10 +2860,10 @@ ncls:
* device that may have registered for a specific ptype. The
* handler may have to adjust skb->dev and orig_dev.
*/
- null_or_bond = NULL;
+ dev_or_bond = skb->dev;
if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
(vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
- null_or_bond = vlan_dev_real_dev(skb->dev);
+ dev_or_bond = vlan_dev_real_dev(skb->dev);
}
type = skb->protocol;
@@ -2860,7 +2871,7 @@ ncls:
&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
if (ptype->type == type && (ptype->dev == null_or_orig ||
ptype->dev == skb->dev || ptype->dev == orig_dev ||
- ptype->dev == null_or_bond)) {
+ ptype->dev == dev_or_bond)) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
^ permalink raw reply related
* Re: [PATCH] ipv4: remove ip_rt_secret timer
From: Eric Dumazet @ 2010-05-06 17:32 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506171639.GA5063@hmsreliant.think-freely.org>
On Thursday, 06 May 2010 at 13:16 -0400, Neil Horman wrote:
> A while back there was a discussion regarding the rt_secret_interval timer.
> Given that we've had the ability to do emergency route cache rebuilds for a while
> now, based on a statistical analysis of the various hash chain lengths in the
> cache, the use of the flush timer is somewhat redundant. This patch removes the
> rt_secret_interval sysctl, allowing us to rely solely on the statistical
> analysis mechanism to determine the need for route cache flushes.
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
>
Nice cleanup attempt, Neil, but this gives attackers more time to hit the
cache (infinite time should be enough as a matter of fact ;) )
Hints:
- What is the initial value of rt_genid?
- How and when is it changed (are the full 32 bits changed, or only small
perturbations? Check rt_cache_invalidate())
Thanks
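
On the first hint: in the quoted diff, the removed rt_secret_timer_init() is the function that seeds rt_genid via atomic_set(), and the diff shows no replacement for it. A userspace sketch of that seed expression, with a hypothetical function name and plain parameters standing in for the kernel's num_physpages and jiffies:

```c
/*
 * Model of the rt_genid seed computed in the removed rt_secret_timer_init().
 * The arguments stand in for the kernel globals num_physpages and jiffies.
 */
int rt_genid_seed(unsigned long num_physpages, unsigned long jiffies)
{
    return (int)((num_physpages ^ (num_physpages >> 8)) ^
                 (jiffies ^ (jiffies >> 7)));
}
```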
> include/net/netns/ipv4.h | 1
> net/ipv4/route.c | 108 -----------------------------------------------
> 2 files changed, 2 insertions(+), 107 deletions(-)
>
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index ae07fee..d68c3f1 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -55,7 +55,6 @@ struct netns_ipv4 {
> int sysctl_rt_cache_rebuild_count;
> int current_rt_cache_rebuild_count;
>
> - struct timer_list rt_secret_timer;
> atomic_t rt_genid;
>
> #ifdef CONFIG_IP_MROUTE
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index a947428..ffd3da1 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly = 8;
> static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
> static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
> static int ip_rt_min_advmss __read_mostly = 256;
> -static int ip_rt_secret_interval __read_mostly = 10 * 60 * HZ;
> static int rt_chain_length_max __read_mostly = 20;
>
> static struct delayed_work expires_work;
> @@ -918,32 +917,11 @@ void rt_cache_flush_batch(void)
> rt_do_flush(!in_softirq());
> }
>
> -/*
> - * We change rt_genid and let gc do the cleanup
> - */
> -static void rt_secret_rebuild(unsigned long __net)
> -{
> - struct net *net = (struct net *)__net;
> - rt_cache_invalidate(net);
> - mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
> -}
> -
> -static void rt_secret_rebuild_oneshot(struct net *net)
> -{
> - del_timer_sync(&net->ipv4.rt_secret_timer);
> - rt_cache_invalidate(net);
> - if (ip_rt_secret_interval)
> - mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
> -}
> -
> static void rt_emergency_hash_rebuild(struct net *net)
> {
> - if (net_ratelimit()) {
> + if (net_ratelimit())
> printk(KERN_WARNING "Route hash chain too long!\n");
> - printk(KERN_WARNING "Adjust your secret_interval!\n");
> - }
> -
> - rt_secret_rebuild_oneshot(net);
> + rt_cache_invalidate(net);
> }
>
> /*
> @@ -3101,48 +3079,6 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
> return -EINVAL;
> }
>
> -static void rt_secret_reschedule(int old)
> -{
> - struct net *net;
> - int new = ip_rt_secret_interval;
> - int diff = new - old;
> -
> - if (!diff)
> - return;
> -
> - rtnl_lock();
> - for_each_net(net) {
> - int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
> - long time;
> -
> - if (!new)
> - continue;
> -
> - if (deleted) {
> - time = net->ipv4.rt_secret_timer.expires - jiffies;
> -
> - if (time <= 0 || (time += diff) <= 0)
> - time = 0;
> - } else
> - time = new;
> -
> - mod_timer(&net->ipv4.rt_secret_timer, jiffies + time);
> - }
> - rtnl_unlock();
> -}
> -
> -static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
> - void __user *buffer, size_t *lenp,
> - loff_t *ppos)
> -{
> - int old = ip_rt_secret_interval;
> - int ret = proc_dointvec_jiffies(ctl, write, buffer, lenp, ppos);
> -
> - rt_secret_reschedule(old);
> -
> - return ret;
> -}
> -
> static ctl_table ipv4_route_table[] = {
> {
> .procname = "gc_thresh",
> @@ -3251,13 +3187,6 @@ static ctl_table ipv4_route_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec,
> },
> - {
> - .procname = "secret_interval",
> - .data = &ip_rt_secret_interval,
> - .maxlen = sizeof(int),
> - .mode = 0644,
> - .proc_handler = ipv4_sysctl_rt_secret_interval,
> - },
> { }
> };
>
> @@ -3337,36 +3266,6 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
> #endif
>
>
> -static __net_init int rt_secret_timer_init(struct net *net)
> -{
> - atomic_set(&net->ipv4.rt_genid,
> - (int) ((num_physpages ^ (num_physpages>>8)) ^
> - (jiffies ^ (jiffies >> 7))));
> -
> - net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
> - net->ipv4.rt_secret_timer.data = (unsigned long)net;
> - init_timer_deferrable(&net->ipv4.rt_secret_timer);
> -
> - if (ip_rt_secret_interval) {
> - net->ipv4.rt_secret_timer.expires =
> - jiffies + net_random() % ip_rt_secret_interval +
> - ip_rt_secret_interval;
> - add_timer(&net->ipv4.rt_secret_timer);
> - }
> - return 0;
> -}
> -
> -static __net_exit void rt_secret_timer_exit(struct net *net)
> -{
> - del_timer_sync(&net->ipv4.rt_secret_timer);
> -}
> -
> -static __net_initdata struct pernet_operations rt_secret_timer_ops = {
> - .init = rt_secret_timer_init,
> - .exit = rt_secret_timer_exit,
> -};
> -
> -
> #ifdef CONFIG_NET_CLS_ROUTE
> struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
> #endif /* CONFIG_NET_CLS_ROUTE */
> @@ -3424,9 +3323,6 @@ int __init ip_rt_init(void)
> schedule_delayed_work(&expires_work,
> net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
>
> - if (register_pernet_subsys(&rt_secret_timer_ops))
> - printk(KERN_ERR "Unable to setup rt_secret_timer\n");
> -
> if (ip_rt_proc_init())
> printk(KERN_ERR "Unable to create route proc files\n");
> #ifdef CONFIG_XFRM
^ permalink raw reply
* 2.6.34-rc5-mmotm0428 - 2 RCU whinges in mac80211
From: Valdis.Kletnieks-PjAqaU27lzQ @ 2010-05-06 17:35 UTC (permalink / raw)
To: Andrew Morton, Johannes Berg, David S. Miller, John W. Linville
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA
[-- Attachment #1: Type: text/plain, Size: 5105 bytes --]
Spotted these in my dmesg, hopefully somebody is interested...
[ 54.076863] wlan0: deauthenticating from 00:11:20:a4:4c:11 by local choice (reason=3)
[ 54.080580]
[ 54.080581] ===================================================
[ 54.080584] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 54.080586] ---------------------------------------------------
[ 54.080589] net/mac80211/sta_info.c:858 invoked rcu_dereference_check() without protection!
[ 54.080591]
[ 54.080591] other info that might help us debug this:
[ 54.080592]
[ 54.080594]
[ 54.080595] rcu_scheduler_active = 1, debug_locks = 1
[ 54.080597] no locks held by hald/3362.
[ 54.080599]
[ 54.080599] stack backtrace:
[ 54.080602] Pid: 3362, comm: hald Not tainted 2.6.34-rc5-mmotm0428 #1
[ 54.080604] Call Trace:
[ 54.080607] <IRQ> [<ffffffff81064eb9>] lockdep_rcu_dereference+0x9d/0xa5
[ 54.080619] [<ffffffff815649f9>] ieee80211_find_sta_by_hw+0x46/0x10f
[ 54.080623] [<ffffffff81564ad9>] ieee80211_find_sta+0x17/0x19
[ 54.080628] [<ffffffff8136e1ee>] iwlagn_tx_queue_reclaim+0xe7/0x1bd
[ 54.080632] [<ffffffff81371da3>] iwlagn_rx_reply_tx+0x4e9/0x5a6
[ 54.080638] [<ffffffff8136501f>] iwl_rx_handle+0x268/0x3fe
[ 54.080642] [<ffffffff813664ea>] iwl_irq_tasklet+0x2d3/0x3e4
[ 54.080647] [<ffffffff8103e337>] tasklet_action+0x79/0xd7
[ 54.080651] [<ffffffff8103fd54>] __do_softirq+0x15a/0x2a2
[ 54.080656] [<ffffffff8100358c>] call_softirq+0x1c/0x34
[ 54.080659] [<ffffffff81004ad8>] do_softirq+0x44/0xf0
[ 54.080663] [<ffffffff8103f3ba>] irq_exit+0x4a/0xb3
[ 54.080667] [<ffffffff81004211>] do_IRQ+0xa7/0xbe
[ 54.080671] [<ffffffff8159cd93>] ret_from_intr+0x0/0xf
[ 54.080673] <EOI>
[ 54.114898] wlan0: authenticate with 00:11:20:a4:4c:11 (try 1)
[ 54.122103] wlan0: authenticated
[ 54.122472] wlan0: associate with 00:11:20:a4:4c:11 (try 1)
[ 54.128267] wlan0: RX AssocResp from 00:11:20:a4:4c:11 (capab=0x31 status=0 aid=10)
[ 54.128271] wlan0: associated
[ 54.132990] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[ 54.133333] cfg80211: Calling CRDA for country: US
[ 54.135752]
[ 54.135753] ===================================================
[ 54.135756] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 54.135758] ---------------------------------------------------
[ 54.135760] net/mac80211/sta_info.c:858 invoked rcu_dereference_check() without protection!
[ 54.135763]
[ 54.135763] other info that might help us debug this:
[ 54.135764]
[ 54.135766]
[ 54.135767] rcu_scheduler_active = 1, debug_locks = 1
[ 54.135769] 1 lock held by udevd/2933:
[ 54.135771] #0: (policy_rwlock){.+.+..}, at: [<ffffffff811d4f60>] security_compute_av+0x29/0x259
[ 54.135781]
[ 54.135782] stack backtrace:
[ 54.135785] Pid: 2933, comm: udevd Not tainted 2.6.34-rc5-mmotm0428 #1
[ 54.135787] Call Trace:
[ 54.135790] <IRQ> [<ffffffff81064eb9>] lockdep_rcu_dereference+0x9d/0xa5
[ 54.135800] [<ffffffff81564a49>] ieee80211_find_sta_by_hw+0x96/0x10f
[ 54.135804] [<ffffffff81564ad9>] ieee80211_find_sta+0x17/0x19
[ 54.135809] [<ffffffff8136e1ee>] iwlagn_tx_queue_reclaim+0xe7/0x1bd
[ 54.135813] [<ffffffff81371da3>] iwlagn_rx_reply_tx+0x4e9/0x5a6
[ 54.135819] [<ffffffff8136501f>] iwl_rx_handle+0x268/0x3fe
[ 54.135823] [<ffffffff813664ea>] iwl_irq_tasklet+0x2d3/0x3e4
[ 54.135828] [<ffffffff8103e337>] tasklet_action+0x79/0xd7
[ 54.135832] [<ffffffff8103fd54>] __do_softirq+0x15a/0x2a2
[ 54.135837] [<ffffffff8100358c>] call_softirq+0x1c/0x34
[ 54.135841] [<ffffffff81004ad8>] do_softirq+0x44/0xf0
[ 54.135844] [<ffffffff8103f3ba>] irq_exit+0x4a/0xb3
[ 54.135847] [<ffffffff81004211>] do_IRQ+0xa7/0xbe
[ 54.135852] [<ffffffff8159cd93>] ret_from_intr+0x0/0xf
[ 54.135854] <EOI> [<ffffffff811ce48b>] ? avtab_search_node+0x6c/0x7b
[ 54.135862] [<ffffffff811d4cdb>] context_struct_compute_av+0x136/0x271
[ 54.135867] [<ffffffff811d506b>] security_compute_av+0x134/0x259
[ 54.135872] [<ffffffff811c2560>] avc_has_perm_noaudit+0x22e/0x537
[ 54.135876] [<ffffffff811c289f>] avc_has_perm+0x36/0x69
[ 54.135880] [<ffffffff810c52cd>] ? __do_fault+0x254/0x3f1
[ 54.135885] [<ffffffff811c4bca>] current_has_perm+0x3a/0x3f
[ 54.135888] [<ffffffff811c4c7d>] selinux_task_create+0x17/0x19
[ 54.135893] [<ffffffff811beced>] security_task_create+0x11/0x13
[ 54.135898] [<ffffffff81036dae>] copy_process+0x8d/0x11ed
[ 54.135901] [<ffffffff810c5434>] ? __do_fault+0x3bb/0x3f1
[ 54.135905] [<ffffffff8107c26b>] ? rcu_read_lock+0x0/0x35
[ 54.135909] [<ffffffff810380b2>] do_fork+0x1a4/0x3b5
[ 54.135913] [<ffffffff8107c2c1>] ? rcu_read_unlock+0x21/0x23
[ 54.135917] [<ffffffff8107dc56>] ? audit_filter_syscall+0xb4/0xc8
[ 54.135921] [<ffffffff81059044>] ? up_read+0x1e/0x36
[ 54.135924] [<ffffffff8105d760>] ? current_kernel_time+0x28/0x50
[ 54.135929] [<ffffffff81009705>] sys_clone+0x23/0x25
[ 54.135933] [<ffffffff810029d3>] stub_clone+0x13/0x20
[ 54.135936] [<ffffffff8100266b>] ? system_call_fastpath+0x16/0x1b
[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]
^ permalink raw reply
* Re: Receive issues with bonding and vlans
From: Jay Vosburgh @ 2010-05-06 17:42 UTC (permalink / raw)
To: John Fastabend
Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek,
Patrick McHardy, bonding-devel@lists.sourceforge.net
In-Reply-To: <4BE103BE.3040805@intel.com>
John Fastabend <john.r.fastabend@intel.com> wrote:
>Jay Vosburgh wrote:
>> John Fastabend <john.r.fastabend@intel.com> wrote:
[...]
>>> #3 bond0 --> ethx
>>> vlanx --> -|
>>>
>>> Here is the case where adding the IFF_SLAVE bit doesn't work as I
>>> hoped. We don't want to run skb_bond_should_drop here.
>>
>> Yes, this is tricky because the VLAN device will copy the
>> dev->flags from the device it's placed atop, so the VLAN will inherit
>> the ethx's IFF_SLAVE flag. This happens regardless of the setup order
>> (enslave ethX, then add VLAN, or vice versa).
>>
>
>This doesn't appear to be true, adding a VLAN on ethx then enslave ethx
>doesn't set the IFF_SLAVE flag on the VLAN. Unless I am missing
>something.
I tried this again, and yes, the vlan device inherits the flags
of the device at the time the vlan is added.
I think I was confused because the vlan device doesn't lose
IFF_SLAVE if the underlying ethX is taken out of the bond. I suspect
both of these behaviors are because netdev_set_master doesn't do a
notifier call (just an rtmsg_ifinfo) when it changes dev->flags outside
of dev_set_flags.
I don't think the vlan device should pick up IFF_SLAVE, though,
when the vlan device itself is not a slave, so that part seems correct.
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
^ permalink raw reply
* RE: ixgbe and mac-vlans problem
From: Tantilov, Emil S @ 2010-05-06 17:51 UTC (permalink / raw)
To: Ben Greear; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <4BE2ECEE.7070200@candelatech.com>
[-- Attachment #1: Type: text/plain, Size: 1210 bytes --]
Ben Greear wrote:
> On 04/30/2010 03:26 PM, Tantilov, Emil S wrote:
>> Ben Greear wrote:
>>> On 04/30/2010 02:13 PM, Tantilov, Emil S wrote:
>
>>>> I ran a quick test in my setup with 82599 and was able to pass
>>>> traffic on all 50 mac-vlans without issues. This is on net-next.
>>>
>>> For an 82599 system, I can get 127 mac-vlans working out of 500
>>> created.
>>>
>>> That NIC also does not go PROMISC with lots (500) of mac-vlans.
>>>
>>> Once I put it in promisc mode manually, it works fine.
>>>
>>> So, I think whatever logic is supposed to put the NIC into promisc
>>> mode when it overflows its lookup tables isn't working for ixgbe
>>> in 2.6.31.12.
>>
>> Yeah, you're right. I was able to repro it.
>>
>> We'll look into it.
>
> I'd be happy to test out a patch if you have one available.
>
> If you don't expect to have one soon, please let me know and
> I'll add work-arounds to my code to throw ixgbe NICs into PROMISC
> mode manually.
>
> Thanks,
> Ben
Hi Ben,
We do have a patch in testing (see attached). It may not apply cleanly as it is on top of some other patches currently in validation. Let me know if it works for you.
Thanks,
Emil
[-- Attachment #2: ixgbe_macvlan_v5.patch --]
[-- Type: application/octet-stream, Size: 2680 bytes --]
Introduce uc_set_promisc flag to better handle the enabling of promisc when
the number of RARs is exceeded.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
drivers/net/ixgbe/ixgbe_common.c | 5 ++++-
drivers/net/ixgbe/ixgbe_main.c | 8 ++++----
drivers/net/ixgbe/ixgbe_type.h | 1 +
3 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ixgbe/ixgbe_common.c b/drivers/net/ixgbe/ixgbe_common.c
index ee42fd6..49775b6 100644
--- a/drivers/net/ixgbe/ixgbe_common.c
+++ b/drivers/net/ixgbe/ixgbe_common.c
@@ -1397,14 +1397,17 @@ s32 ixgbe_update_uc_addr_list_generic(struct ixgbe_hw *hw,
fctrl = IXGBE_READ_REG(hw, IXGBE_FCTRL);
fctrl |= IXGBE_FCTRL_UPE;
IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
+ hw->addr_ctrl.uc_set_promisc = true;
}
} else {
/* only disable if set by overflow, not by user */
- if (old_promisc_setting && !hw->addr_ctrl.user_set_promisc) {
+ if ((old_promisc_setting && hw->addr_ctrl.uc_set_promisc) &&
+ !(hw->addr_ctrl.user_set_promisc)){
hw_dbg(hw, " Leaving address overflow promisc mode\n");
fctrl = IXGBE_READ_REG(hw, IXGBE_FCTRL);
fctrl &= ~IXGBE_FCTRL_UPE;
IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
+ hw->addr_ctrl.uc_set_promisc = false;
}
}
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 577ac72..7bf3b40 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -2970,8 +2970,8 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
fctrl = IXGBE_READ_REG(hw, IXGBE_FCTRL);
- if (netdev->flags & IFF_PROMISC) {
- hw->addr_ctrl.user_set_promisc = 1;
+ if (netdev->flags & IFF_PROMISC){
+ hw->addr_ctrl.user_set_promisc = true;
fctrl |= (IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
/* don't hardware filter vlans in promisc mode */
ixgbe_vlan_filter_disable(adapter);
@@ -2979,11 +2979,11 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
if (netdev->flags & IFF_ALLMULTI) {
fctrl |= IXGBE_FCTRL_MPE;
fctrl &= ~IXGBE_FCTRL_UPE;
- } else {
+ } else if (!hw->addr_ctrl.uc_set_promisc) {
fctrl &= ~(IXGBE_FCTRL_UPE | IXGBE_FCTRL_MPE);
}
ixgbe_vlan_filter_enable(adapter);
- hw->addr_ctrl.user_set_promisc = 0;
+ hw->addr_ctrl.user_set_promisc = false;
}
IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
diff --git a/drivers/net/ixgbe/ixgbe_type.h b/drivers/net/ixgbe/ixgbe_type.h
index 1c89cbb..38f26bb 100644
--- a/drivers/net/ixgbe/ixgbe_type.h
+++ b/drivers/net/ixgbe/ixgbe_type.h
@@ -2285,6 +2285,7 @@ struct ixgbe_addr_filter_info {
u32 mc_addr_in_rar_count;
u32 mta_in_use;
u32 overflow_promisc;
+ bool uc_set_promisc;
bool user_set_promisc;
};
^ permalink raw reply related