* Re: [net-next 4/4] ixgbevf: scheduling while atomic in reset hw path
From: Eric Dumazet @ 2012-09-19 5:05 UTC (permalink / raw)
To: Jeff Kirsher; +Cc: davem, John Fastabend, netdev, gospo, sassmann
In-Reply-To: <1348029108-26659-5-git-send-email-jeffrey.t.kirsher@intel.com>
On Tue, 2012-09-18 at 21:31 -0700, Jeff Kirsher wrote:
> From: John Fastabend <john.r.fastabend@intel.com>
>
> In ixgbevf_reset_hw_vf() msleep is called while holding rtnl_lock
> and mbx_lock resulting in a schedule while atomic bug with trace
> below.
>
This sentence is misleading, as rtnl is a mutex.
It's legal to sleep while holding it.
So the atomic context is because of lock #1 (mbx_lock), not 'lock' #0 (rtnl_mutex).
> This patch uses mdelay instead.
>
> BUG: scheduling while atomic: ip/6539/0x00000002
> 2 locks held by ip/6539:
> #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81419cc3>] rtnl_lock+0x17/0x19
> #1: (&(&adapter->mbx_lock)->rlock){+.+...}, at: [<ffffffffa0030855>] ixgbevf_reset+0x30/0xc1 [ixgbevf]
> Modules linked in: ixgbevf ixgbe mdio libfc scsi_transport_fc 8021q scsi_tgt garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 uinput igb coretemp hwmon crc32c_intel ioatdma i2c_i801 shpchp microcode lpc_ich mfd_core i2c_core joydev dca pcspkr serio_raw pata_acpi ata_generic usb_storage pata_jmicron
> Pid: 6539, comm: ip Not tainted 3.6.0-rc3jk-net-next+ #104
> Call Trace:
> [<ffffffff81072202>] __schedule_bug+0x6a/0x79
> [<ffffffff814bc7e0>] __schedule+0xa2/0x684
> [<ffffffff8108f85f>] ? trace_hardirqs_off+0xd/0xf
> [<ffffffff814bd0c0>] schedule+0x64/0x66
> [<ffffffff814bb5e2>] schedule_timeout+0xa6/0xca
> [<ffffffff810536b9>] ? lock_timer_base+0x52/0x52
> [<ffffffff812629e0>] ? __udelay+0x15/0x17
> [<ffffffff814bb624>] schedule_timeout_uninterruptible+0x1e/0x20
> [<ffffffff810541c0>] msleep+0x1b/0x22
> [<ffffffffa002e723>] ixgbevf_reset_hw_vf+0x90/0xe5 [ixgbevf]
> [<ffffffffa0030860>] ixgbevf_reset+0x3b/0xc1 [ixgbevf]
> [<ffffffffa0032fba>] ixgbevf_open+0x43/0x43e [ixgbevf]
> [<ffffffff81409610>] ? dev_set_rx_mode+0x2e/0x33
> [<ffffffff8140b0f1>] __dev_open+0xa0/0xe5
> [<ffffffff814097ed>] __dev_change_flags+0xbe/0x142
> [<ffffffff8140b01c>] dev_change_flags+0x21/0x56
> [<ffffffff8141a843>] do_setlink+0x2e2/0x7f4
> [<ffffffff81016e36>] ? native_sched_clock+0x37/0x39
> [<ffffffff8141b0ac>] rtnl_newlink+0x277/0x4bb
> [<ffffffff8141aee9>] ? rtnl_newlink+0xb4/0x4bb
> [<ffffffff812217d1>] ? selinux_capable+0x32/0x3a
> [<ffffffff8104fb17>] ? ns_capable+0x4f/0x67
> [<ffffffff81419cc3>] ? rtnl_lock+0x17/0x19
> [<ffffffff81419f28>] rtnetlink_rcv_msg+0x236/0x253
> [<ffffffff81419cf2>] ? rtnetlink_rcv+0x2d/0x2d
> [<ffffffff8142fd42>] netlink_rcv_skb+0x43/0x94
> [<ffffffff81419ceb>] rtnetlink_rcv+0x26/0x2d
> [<ffffffff8142faf1>] netlink_unicast+0xee/0x174
> [<ffffffff81430327>] netlink_sendmsg+0x26a/0x288
> [<ffffffff813fb04f>] ? rcu_read_unlock+0x56/0x67
> [<ffffffff813f5e6d>] __sock_sendmsg_nosec+0x58/0x61
> [<ffffffff813f81b7>] __sock_sendmsg+0x3d/0x48
> [<ffffffff813f8339>] sock_sendmsg+0x6e/0x87
> [<ffffffff81107c9f>] ? might_fault+0xa5/0xac
> [<ffffffff81402a72>] ? copy_from_user+0x2a/0x2c
> [<ffffffff81402e62>] ? verify_iovec+0x54/0xaa
> [<ffffffff813f9834>] __sys_sendmsg+0x206/0x288
> [<ffffffff810694fa>] ? up_read+0x23/0x3d
> [<ffffffff811307e5>] ? fcheck_files+0xac/0xea
> [<ffffffff8113095e>] ? fget_light+0x3a/0xb9
> [<ffffffff813f9a2e>] sys_sendmsg+0x42/0x60
> [<ffffffff814c5ba9>] system_call_fastpath+0x16/0x1b
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> Tested-by: Robert Garrett <robertx.e.garrett@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
> drivers/net/ethernet/intel/ixgbevf/vf.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c b/drivers/net/ethernet/intel/ixgbevf/vf.c
> index 690801b..87b3f3b 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/vf.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/vf.c
> @@ -100,7 +100,7 @@ static s32 ixgbevf_reset_hw_vf(struct ixgbe_hw *hw)
> msgbuf[0] = IXGBE_VF_RESET;
> mbx->ops.write_posted(hw, msgbuf, 1);
>
> - msleep(10);
> + mdelay(10);
>
> /* set our "perm_addr" based on info provided by PF */
> /* also set up the mc_filter_type which is piggy backed
* Re: [PATCH] net/core: fix comment in skb_try_coalesce
From: Eric Dumazet @ 2012-09-19 5:08 UTC (permalink / raw)
To: roy.qing.li; +Cc: netdev
In-Reply-To: <1348023201-7727-1-git-send-email-roy.qing.li@gmail.com>
On Wed, 2012-09-19 at 10:53 +0800, roy.qing.li@gmail.com wrote:
> From: Li RongQing <roy.qing.li@gmail.com>
>
> It should be the skb which is not cloned
>
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
> ---
> net/core/skbuff.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index fe00d12..354a4e4 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3502,7 +3502,9 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> if (!skb_cloned(from))
> skb_shinfo(from)->nr_frags = 0;
>
> - /* if the skb is cloned this does nothing since we set nr_frags to 0 */
> + /* if the skb is not cloned this does nothing
> + * since we set nr_frags to 0.
> + */
> for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
> skb_frag_ref(from, i);
>
Yes, I saw that yesterday and was about to submit the same change (more
or less).
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH] tcp: Fixed a TFO server bug that crashed kernel by raw sockets
From: Eric Dumazet @ 2012-09-19 5:12 UTC (permalink / raw)
To: Christoph Paasch; +Cc: H.K. Jerry Chu, davem, netdev, ncardwell, edumazet
In-Reply-To: <4380003.jOHRfqhomY@cpaasch-mac>
On Wed, 2012-09-19 at 02:19 +0200, Christoph Paasch wrote:
> Why not move the TCP code out of inet_sock_destruct by modifying the sk_destruct
> callback when TFO is in use? Like the below (only compile-tested) patch. That
> way inet_sock_destruct stays TFO-free.
>
>
> Cheers,
> Christoph
>
> ---------
>
> From: Christoph Paasch <christoph.paasch@uclouvain.be>
> Date: Wed, 19 Sep 2012 02:06:53 +0200
> Subject: [PATCH] Don't add TCP-code in inet_sock_destruct
>
> Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
> ---
> include/linux/tcp.h | 4 ++++
> net/ipv4/af_inet.c | 2 --
> net/ipv4/tcp.c | 7 +++++++
> 3 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index ae46df5..67c789a 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -574,6 +574,8 @@ static inline bool fastopen_cookie_present(struct tcp_fastopen_cookie *foc)
> return foc->len != -1;
> }
>
> +extern void tcp_sock_destruct(struct sock *sk);
> +
> static inline int fastopen_init_queue(struct sock *sk, int backlog)
> {
> struct request_sock_queue *queue =
> @@ -585,6 +587,8 @@ static inline int fastopen_init_queue(struct sock *sk, int backlog)
> sk->sk_allocation);
> if (queue->fastopenq == NULL)
> return -ENOMEM;
> +
> + sk->sk_destruct = tcp_sock_destruct;
> spin_lock_init(&queue->fastopenq->lock);
Yes, it seems much better, thanks!
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: BUG: TCPDUMP invalid cksum persists after disabling TCP cksum offload
From: Eric Dumazet @ 2012-09-19 5:54 UTC (permalink / raw)
To: Jamie Gloudon; +Cc: netdev
In-Reply-To: <20120918211423.GA19115@darkstar>
On Tue, 2012-09-18 at 17:14 -0400, Jamie Gloudon wrote:
> Hello,
> I am seeing that tx checksum offload appears to be still running after disabling the feature with ethtool. I'm using kernel 3.6.0-rc6 and the latest ethtool from the git repo.
>
> The default settings on my e1000e NIC:
> # ethtool -k eth1 | grep ': on'
> rx-checksumming: on
> tx-checksumming: on
> tx-checksum-ip-generic: on
> scatter-gather: on
> tx-scatter-gather: on
> tcp-segmentation-offload: on
> tx-tcp-segmentation: on
> tx-tcp6-segmentation: on
> generic-segmentation-offload: on
> generic-receive-offload: on
> rx-vlan-offload: on
> tx-vlan-offload: on
> receive-hashing: on
> highdma: on [fixed]
> rx-vlan-filter: on [fixed]
> tx-nocache-copy: on
>
> The results after disabling tcp cksum offload feature:
> # ethtool -K eth1 tx off
> Actual changes:
> tx-checksumming: off
> tx-checksum-ip-generic: off
> scatter-gather: off
> tx-scatter-gather: off [requested on]
> tcp-segmentation-offload: off
> tx-tcp-segmentation: off [requested on]
> tx-tcp6-segmentation: off [requested on]
> generic-segmentation-offload: off [requested on]
>
> However, in tcpdump, I'm still observing incorrect tcp checksum:
> 14:44:38.838711 IP (tos 0x10, ttl 64, id 45798, offset 0, flags [DF], proto TCP
> (6), length 60)
> 1.1.1.2.59748 > 1.1.1.1.23: Flags [S], cksum 0x0433 (incorrect -> 0x4137), seq 318222122, win 14600, options [mss 1460,sackOK,TS val 5447116 ecr 0,nop,wscale 7], length 0
>
> Is this behaviour valid? I'm quite baffled.
That's because dev_hard_start_xmit() calls dev_queue_xmit_nit() before
doing the feature tests:
tcpdump gets a copy of the packet before all the mangling is done
(skb_checksum_help() in your case).
	if (!list_empty(&ptype_all))
		dev_queue_xmit_nit(skb, dev);

	features = netif_skb_features(skb);

	if (vlan_tx_tag_present(skb) &&
	    !(features & NETIF_F_HW_VLAN_TX)) {
		skb = __vlan_put_tag(skb, vlan_tx_tag_get(skb));
		if (unlikely(!skb))
			goto out;

		skb->vlan_tci = 0;
	}

	if (netif_needs_gso(skb, features)) {
		if (unlikely(dev_gso_segment(skb, features)))
			goto out_kfree_skb;
		if (skb->next)
			goto gso;
	} else {
		if (skb_needs_linearize(skb, features) &&
		    __skb_linearize(skb))
			goto out_kfree_skb;

		/* If packet is not checksummed and device does not
		 * support checksumming for this protocol, complete
		 * checksumming here.
		 */
		if (skb->ip_summed == CHECKSUM_PARTIAL) {
			skb_set_transport_header(skb,
				skb_checksum_start_offset(skb));
			if (!(features & NETIF_F_ALL_CSUM) &&
			    skb_checksum_help(skb))
				goto out_kfree_skb;
		}
	}

	skb_len = skb->len;
	rc = ops->ndo_start_xmit(skb, dev);
I guess we could move dev_queue_xmit_nit(skb, dev) calls right before the
ndo_start_xmit() calls...
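To make the "incorrect" checksum in the capture concrete: as far as I
understand it, with CHECKSUM_PARTIAL the TCP checksum field only holds a
partial sum at the point where the tap sees the packet, and the final value
is filled in later, either by the NIC or by skb_checksum_help(). Below is a
minimal userspace sketch of the RFC 1071 ones'-complement sum that this
final step has to produce. It is not the kernel code, and the name
inet_csum() is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a kernel API: RFC 1071 Internet checksum
 * over a byte buffer, returned in host byte order.
 */
uint16_t inet_csum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	assert(data != NULL || len == 0);

	/* Add up the buffer as big-endian 16-bit words. */
	while (len > 1) {
		sum += (uint32_t)((data[0] << 8) | data[1]);
		data += 2;
		len -= 2;
	}

	/* An odd trailing byte is padded with a zero low byte. */
	if (len)
		sum += (uint32_t)(data[0] << 8);

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Running the same function over a buffer that already contains a correct
checksum returns 0, which is essentially the check tcpdump performs when
it prints "incorrect".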
* Re: [net-next 0/4][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2012-09-19 6:03 UTC (permalink / raw)
To: davem; +Cc: netdev, gospo, sassmann
In-Reply-To: <1348029108-26659-1-git-send-email-jeffrey.t.kirsher@intel.com>
On Tue, 2012-09-18 at 21:31 -0700, Jeff Kirsher wrote:
> This series contains updates to igb and ixgbevf.
>
> The following are changes since commit adccff34de1ef81564b7e6c436f762e7a1caf807:
> net/tipc/name_table.c: Remove unecessary semicolon
> and are available in the git repository at:
> git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master
>
> Akeem G. Abodunrin (1):
> igb: Support to enable EEE on all eee_supported devices
>
> Alexander Duyck (2):
> igb: Remove artificial restriction on RQDPC stat reading
> ixgbevf: Add support for VF API negotiation
>
> John Fastabend (1):
> ixgbevf: scheduling while atomic in reset hw path
>
> drivers/net/ethernet/intel/igb/e1000_82575.c | 17 +++++++---
> drivers/net/ethernet/intel/igb/e1000_defines.h | 3 +-
> drivers/net/ethernet/intel/igb/e1000_regs.h | 1 +
> drivers/net/ethernet/intel/igb/igb_main.c | 8 +++--
> drivers/net/ethernet/intel/ixgbevf/defines.h | 1 +
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 23 +++++++++++++
> drivers/net/ethernet/intel/ixgbevf/mbx.h | 21 ++++++++++--
> drivers/net/ethernet/intel/ixgbevf/vf.c | 39 ++++++++++++++++++++++-
> drivers/net/ethernet/intel/ixgbevf/vf.h | 3 ++
> 9 files changed, 105 insertions(+), 11 deletions(-)
>
Dave,
Do not pull, it appears there will be changes to patch 04 of the series,
I will be sending a v2 of the series once John gets patch 04 fixed up.
Cheers,
Jeff
* Re: [PATCHv4] virtio-spec: virtio network device multiqueue support
From: Michael S. Tsirkin @ 2012-09-19 6:12 UTC (permalink / raw)
To: Rusty Russell
Cc: kvm, netdev, rick.jones2, virtualization, levinsasha928, pbonzini,
Tom Herbert
In-Reply-To: <87wqzqpz7p.fsf@rustcorp.com.au>
On Wed, Sep 19, 2012 at 11:10:10AM +0930, Rusty Russell wrote:
> Tom Herbert <therbert@google.com> writes:
> > On Tue, Sep 11, 2012 at 10:49 PM, Rusty Russell <rusty@rustcorp.com.au>wrote:
> >> Perhaps Tom can explain how we avoid out-of-order receive for the
> >> accelerated RFS case? It's not clear to me, but we need to be able to
> >> do that for virtio-net if it implements accelerated RFS.
> >
> > AFAIK ooo RX is possible with accelerated RFS. We have an algorithm that
> > prevents this for RFS case by deferring a migration to a new queue as long
> > as it's possible that a flow might have outstanding packets on the old
> > queue. I suppose this could be implemented in the device for the HW
> > queues, but I don't think it would be easy to cover all cases where packets
> > were already in transit to the host or other cases where host and device
> > queues are out of sync.
>
> Having gone to such great lengths to avoid ooo for RFS, I don't think
> DaveM would be happy if we allow it for virtio_net.
>
> So, how *would* we implement such a thing for a "hardware" device? What
> if the device will only change the receive queue if the old receive
> queue is empty?
>
> Cheers,
> Rusty.
>
I think that would do it in most cases. Or, if we want to be more
exact, we could delay switching a specific flow until there are no
outstanding rx packets for this flow. Not sure it's worth the
hassle.
--
MST
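The "delay switching a specific flow" idea could look roughly like the
sketch below. This is only an illustration of the bookkeeping, under the
assumption that the device can count per-flow packets that were queued but
not yet consumed. The struct and function names are hypothetical and do not
come from virtio-net or the RFS code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-flow steering state. */
struct flow_steer {
	unsigned int cur_queue;   /* queue the flow is delivered on now */
	unsigned int want_queue;  /* queue most recently requested */
	unsigned int outstanding; /* packets queued but not yet consumed */
};

/* Record a requested move; apply it at once only if nothing is in flight. */
void flow_request_queue(struct flow_steer *f, unsigned int queue)
{
	assert(f != NULL);
	f->want_queue = queue;
	if (f->outstanding == 0)
		f->cur_queue = queue;
}

/* Pick the receive queue for a newly arrived packet of this flow. */
unsigned int flow_enqueue(struct flow_steer *f)
{
	assert(f != NULL);
	/* Switch only when the old queue holds no packet of this flow. */
	if (f->outstanding == 0)
		f->cur_queue = f->want_queue;
	f->outstanding++;
	return f->cur_queue;
}

/* The receiver consumed one packet of this flow. */
void flow_complete(struct flow_steer *f)
{
	assert(f != NULL);
	if (f->outstanding > 0)
		f->outstanding--;
}
```

A flow that never goes idle would never migrate under this scheme, which
may be part of why it is not obviously worth the hassle.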
* Re: [net-next 0/4][pull request] Intel Wired LAN Driver Updates
From: David Miller @ 2012-09-19 6:19 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, sassmann
In-Reply-To: <1348034635.2006.52.camel@jtkirshe-mobl>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 18 Sep 2012 23:03:55 -0700
> Do not pull, it appears there will be changes to patch 04 of the series,
> I will be sending a v2 of the series once John gets patch 04 fixed up.
Ok.
* [PATCH net-next] net: more accurate network taps in transmit path
From: Eric Dumazet @ 2012-09-19 6:44 UTC (permalink / raw)
To: Jamie Gloudon; +Cc: netdev
In-Reply-To: <1348034050.26523.325.camel@edumazet-glaptop>
From: Eric Dumazet <edumazet@google.com>
dev_queue_xmit_nit() should be called right before ndo_start_xmit()
calls or we might give wrong packet contents to tap users:
Packet checksum can be changed, or packet can be linearized or
segmented, and segments partially sent for the latter case.
Also a memory allocation can fail and the packet never really hits the
driver entry point.
Reported-by: Jamie Gloudon <jamie.gloudon@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/core/dev.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index dcc673d..52cd1d7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2213,9 +2213,6 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(skb);
- if (!list_empty(&ptype_all))
- dev_queue_xmit_nit(skb, dev);
-
features = netif_skb_features(skb);
if (vlan_tx_tag_present(skb) &&
@@ -2250,6 +2247,9 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
}
}
+ if (!list_empty(&ptype_all))
+ dev_queue_xmit_nit(skb, dev);
+
skb_len = skb->len;
rc = ops->ndo_start_xmit(skb, dev);
trace_net_dev_xmit(skb, rc, dev, skb_len);
@@ -2272,6 +2272,9 @@ gso:
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(nskb);
+ if (!list_empty(&ptype_all))
+ dev_queue_xmit_nit(nskb, dev);
+
skb_len = nskb->len;
rc = ops->ndo_start_xmit(nskb, dev);
trace_net_dev_xmit(nskb, rc, dev, skb_len);
* RE: [PATCH] netxen: check for root bus in netxen_mask_aer_correctable
From: Rajesh Borundia @ 2012-09-19 7:24 UTC (permalink / raw)
To: David Miller; +Cc: nikolay@redhat.com, Sony Chacko, netdev
In-Reply-To: <20120918.162321.1283398796161136088.davem@davemloft.net>
________________________________________
From: David Miller [davem@davemloft.net]
Sent: Wednesday, September 19, 2012 1:53 AM
To: Rajesh Borundia
Cc: nikolay@redhat.com; Sony Chacko; netdev
Subject: Re: [PATCH] netxen: check for root bus in netxen_mask_aer_correctable
No, this is not the correct way to submit patches written by other
people.
Look at how people like Jeff Kirsher submit Intel driver patches
written by people other than themselves.
Apologies, will follow the guideline.
* Re: [PATCH 1/5] ucc_geth: Reduce IRQ off in xmit path
From: Joakim Tjernlund @ 2012-09-19 7:36 UTC (permalink / raw)
To: Francois Romieu; +Cc: netdev
In-Reply-To: <20120918223938.GA22868@electric-eye.fr.zoreil.com>
Francois Romieu <romieu@fr.zoreil.com> wrote on 2012/09/19 00:39:38:
>
> Joakim Tjernlund <Joakim.Tjernlund@transmode.se> :
> > Currently ucc_geth_start_xmit wraps IRQ off for the
> > whole body just to be safe.
> > Reduce the IRQ off period to a minimum.
>
> The driver does not do much work in its irq handler. You may as well
> convert it to the usual tg3-ish locking style (i.e. almost no locking).
You mean broadcom/tg3.c? It is a bit much for me to look at ATM, and
there is almost no locking with my patch either. It could possibly
be improved further, but I am happy for now.
Jocke
* RE: [PATCH net-next] mlx4: use dev_kfree_skb() instead of dev_kfree_skb_any()
From: Yevgeny Petrilin @ 2012-09-19 7:58 UTC (permalink / raw)
To: Eric Dumazet, David Miller; +Cc: netdev, Or Gerlitz, Ying Cai
In-Reply-To: <1347866974.26523.53.camel@edumazet-glaptop>
>
> Since commit e22979d96a5 (mlx4_en: Moving to Interrupts for TX
> completions), we no longer can free TX skb from hard IRQ, but only from
> normal softirq or process context.
>
> Therefore, we can directly call dev_kfree_skb() from
> mlx4_en_free_tx_desc() like other conventional NAPI drivers.
>
Hi Eric,
At the moment the TX completion processing is done from IRQ context.
So I think we need to change the driver to work with NAPI for TX completions
before making this change.
I'll send the patch in a few days.
Yevgeny
* Re: [PATCH net-next] ipv6: recursive check rt->dst.from when call rt6_check_expired
From: Gao feng @ 2012-09-19 8:22 UTC (permalink / raw)
To: roy.qing.li; +Cc: netdev
In-Reply-To: <1347602097-18034-1-git-send-email-roy.qing.li@gmail.com>
On 2012-09-14 13:54, roy.qing.li@gmail.com wrote:
> From: Li RongQing <roy.qing.li@gmail.com>
>
> If dst cache dst_a copies from dst_b, and dst_b copies from dst_c, check
> if dst_a is expired or not, we should not end with dst_a->dst.from, dst_b,
> we should check dst_c.
>
> CC: Gao feng <gaofeng@cn.fujitsu.com>
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
> ---
Looks good to me, thanks Rongqing :)
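For readers without the patch body at hand, the chain walk described in the
commit message can be sketched in userspace as below. This is only one
reading of the description, not the actual rt6_check_expired() code:
struct route is a simplified, hypothetical stand-in for struct rt6_info,
and a plain comparison replaces the kernel's jiffies helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cached route entry. */
struct route {
	struct route *from;    /* entry this one was copied from, or NULL */
	bool has_expiry;       /* models the RTF_EXPIRES flag */
	unsigned long expires; /* absolute expiry time, valid if has_expiry */
};

/* An entry with its own expiry is judged by it; otherwise keep following
 * the copy chain (dst_a -> dst_b -> dst_c) instead of stopping after one
 * hop.
 */
bool route_expired(const struct route *rt, unsigned long now)
{
	assert(rt != NULL);

	while (rt) {
		if (rt->has_expiry)
			return now > rt->expires;
		rt = rt->from;
	}
	/* No entry in the chain carries an expiry time. */
	return false;
}
```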
* Re: [RFC PATCH v1 1/3] usbnet: introduce usbnet_link_change API
From: Oliver Neukum @ 2012-09-19 8:38 UTC (permalink / raw)
To: Ming Lei
Cc: David S. Miller, Greg Kroah-Hartman, Fink Dmitry, Rafael Wysocki,
Alan Stern, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1347978201-6219-2-git-send-email-ming.lei-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
On Tuesday 18 September 2012 22:23:19 Ming Lei wrote:
> This patch introduces the API of usbnet_link_change, so that
> usbnet can trace the link change, which may help to implement
> the later runtime PM triggered by usb ethernet link change.
>
> Signed-off-by: Ming Lei <ming.lei-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
Acked-by: Oliver Neukum <oneukum-l3A5Bk7waGM@public.gmane.org>
Regards
Oliver
* Re: [RFC PATCH v1 2/3] usbnet: apply usbnet_link_change
From: Oliver Neukum @ 2012-09-19 8:39 UTC (permalink / raw)
To: Ming Lei
Cc: David S. Miller, Greg Kroah-Hartman, Fink Dmitry, Rafael Wysocki,
Alan Stern, netdev, linux-usb
In-Reply-To: <1347978201-6219-3-git-send-email-ming.lei@canonical.com>
On Tuesday 18 September 2012 22:23:20 Ming Lei wrote:
> This patch applies the introduced usbnet_link_change API.
>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Oliver Neukum <oneukum@suse.de>
Regards
Oliver
* [PATCH net-next v3 0/1] Add support of ECMPv6
From: Nicolas Dichtel @ 2012-09-19 9:18 UTC (permalink / raw)
To: netdev, davem; +Cc: bernat, yoshfuji
In-Reply-To: <1347609548-14494-1-git-send-email-nicolas.dichtel@6wind.com>
Here is a proposal to add support for ECMPv6. The previous patch
from Vincent against iproute2 can be used, but another small patch is
also needed, see http://patchwork.ozlabs.org/patch/183277/
If the kernel patch is approved, I can formally submit the patch for
iproute2.
Here is an example of a command to add an ECMP route:
$ ip -6 route add 3ffe:304:124:2306::/64 \
nexthop via fe80::230:1bff:feb4:e05c dev eth0 \
nexthop via fe80::230:1bff:feb4:dd4f dev eth0
But note that this command is a shortcut and previous patches are not
mandatory to set ECMP routes. The following commands can be used too:
$ ip -6 route add 3ffe:304:124:2306::/64 via fe80::230:1bff:feb4:dd4f dev eth0
$ ip -6 route append 3ffe:304:124:2306::/64 via fe80::230:1bff:feb4:e05c dev eth0
Here is an example of a dump:
$ ip -6 route | grep 3ffe:304:124:2306::/64
3ffe:304:124:2306::/64 via fe80::230:1bff:feb4:dd4f dev eth0 metric 1024
3ffe:304:124:2306::/64 via fe80::230:1bff:feb4:e05c dev eth0 metric 1024
v2: rename CONFIG_IPV6_MULTIPATH_ROUTE to CONFIG_IPV6_MULTIPATH_HASH
use flowlabel in the hash function
add reference to RFC
fix a small indentation issue
remove "If unsure, say N." from the help of CONFIG_IPV6_MULTIPATH
v3: rebase after updating net-next
Comments are welcome.
Regards,
Nicolas
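As a rough illustration of what hash-based selection means for the two
next hops in the example above, here is a small userspace sketch. It
mirrors the XOR-and-fold step of the hash function in patch 1/1 but is
simplified: struct flow_key and ecmp_pick() are hypothetical names, the
kernel uses struct flowi6, and the ports are mixed in unconditionally
rather than per protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key for illustration only. */
struct flow_key {
	uint32_t saddr[4];
	uint32_t daddr[4];
	uint16_t sport;
	uint16_t dport;
	uint32_t flowlabel;
	uint8_t proto;
};

/* Map a flow to one of nhops equal-cost next hops. */
unsigned int ecmp_pick(const struct flow_key *k, unsigned int nhops)
{
	uint32_t val;
	int i;

	assert(k != NULL && nhops > 0);

	val = k->proto;
	for (i = 0; i < 4; i++)
		val ^= k->saddr[i] ^ k->daddr[i];
	val ^= k->sport;
	val ^= k->dport;
	val ^= k->flowlabel;

	/* Same folding step as in the posted patch. */
	val = val ^ (val >> 7) ^ (val >> 12);
	return val % nhops;
}
```

Because the result depends only on the flow key, every packet of a given
flow keeps the same next hop, which is the property the HASH method is
chosen for.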
* [PATCH net-next v3 1/1] ipv6: add support of ECMP
From: Nicolas Dichtel @ 2012-09-19 9:18 UTC (permalink / raw)
To: netdev, davem; +Cc: bernat, yoshfuji, Nicolas Dichtel
In-Reply-To: <1348046304-4156-1-git-send-email-nicolas.dichtel@6wind.com>
This patch adds the support of equal cost multipath for IPv6.
The patch is based on a previous work from
Luc Saillard <luc.saillard@6wind.com>.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
include/net/ip6_fib.h | 13 ++++
net/ipv6/Kconfig | 33 ++++++++
net/ipv6/ip6_fib.c | 73 ++++++++++++++++++
net/ipv6/route.c | 209 +++++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 325 insertions(+), 3 deletions(-)
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index cd64cf3..37e502a 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -47,6 +47,10 @@ struct fib6_config {
unsigned long fc_expires;
struct nlattr *fc_mx;
int fc_mx_len;
+#ifdef CONFIG_IPV6_MULTIPATH
+ struct nlattr *fc_mp;
+ int fc_mp_len;
+#endif
struct nl_info fc_nlinfo;
};
@@ -98,6 +102,15 @@ struct rt6_info {
struct fib6_node *rt6i_node;
struct in6_addr rt6i_gateway;
+#ifdef CONFIG_IPV6_MULTIPATH
+ /*
+ * siblings is a list of rt6_info that have the same metric/weight,
+ * destination, but not the same gateway. nsiblings is just a cache
+ * to speed up lookup.
+ */
+ unsigned int rt6i_nsiblings;
+ struct list_head rt6i_siblings;
+#endif
atomic_t rt6i_ref;
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 4f7fe72..e0c92dc 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -266,4 +266,37 @@ config IPV6_PIMSM_V2
Support for IPv6 PIM multicast routing protocol PIM-SMv2.
If unsure, say N.
+config IPV6_MULTIPATH
+ bool "IPv6: equal cost multipath for IPv6 routing"
+ depends on IPV6
+ default y
+ ---help---
+ Enable this option to support ECMP for IPv6.
+
+choice
+ prompt "IPv6: choose Multipath algorithm"
+ depends on IPV6_MULTIPATH
+ default IPV6_MULTIPATH_HASH
+ ---help---
+ Define the method to select route between each possible path.
+ The recommended algorithm (by RFC4311) is the HASH method.
+
+ config IPV6_MULTIPATH_HASH
+ bool "IPv6: MULTIPATH hash/flow algorithm"
+ ---help---
+ Multipath routes are chosen according to hash of packet header to
+ ensure a flow keeps the same route.
+ This algorithm is recommended by RFC4311.
+
+ config IPV6_MULTIPATH_RR
+ bool "IPv6: MULTIPATH round robin algorithm"
+ ---help---
+ Multipath routes are chosen according to Round Robin.
+
+ config IPV6_MULTIPATH_RANDOM
+ bool "IPv6: MULTIPATH random algorithm"
+ ---help---
+ Multipath routes are chosen in a random fashion.
+endchoice
+
endif # IPV6
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 13690d6..3541e44 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -672,6 +672,10 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
iter->rt6i_idev == rt->rt6i_idev &&
ipv6_addr_equal(&iter->rt6i_gateway,
&rt->rt6i_gateway)) {
+#ifdef CONFIG_IPV6_MULTIPATH
+ if (rt->rt6i_nsiblings)
+ rt->rt6i_nsiblings = 0;
+#endif
if (!(iter->rt6i_flags & RTF_EXPIRES))
return -EEXIST;
if (!(rt->rt6i_flags & RTF_EXPIRES))
@@ -680,6 +684,23 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
rt6_set_expires(iter, rt->dst.expires);
return -EEXIST;
}
+#ifdef CONFIG_IPV6_MULTIPATH
+ /* If we have the same destination and the same metric,
+ * but not the same gateway, then the route we try to
+ * add is sibling to this route, increment our counter
+ * of siblings, and later we will add our route to the
+ * list.
+ * Only static routes (which don't have flag
+ * RTF_EXPIRES) are used for ECMPv6.
+ *
+ * To avoid a long list, we only add siblings if the
+ * route has a gateway.
+ */
+ if (rt->rt6i_flags & RTF_GATEWAY &&
+ !(rt->rt6i_flags & RTF_EXPIRES) &&
+ !(iter->rt6i_flags & RTF_EXPIRES))
+ rt->rt6i_nsiblings++;
+#endif
}
if (iter->rt6i_metric > rt->rt6i_metric)
@@ -692,6 +713,43 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
if (ins == &fn->leaf)
fn->rr_ptr = NULL;
+#ifdef CONFIG_IPV6_MULTIPATH
+ /* Link this route to others same route. */
+ if (rt->rt6i_nsiblings) {
+ unsigned int rt6i_nsiblings;
+ struct rt6_info *sibling, *temp_sibling;
+
+ /* Find the first route that have the same metric */
+ sibling = fn->leaf;
+ while (sibling) {
+ if (sibling->rt6i_metric == rt->rt6i_metric) {
+ list_add_tail(&rt->rt6i_siblings,
+ &sibling->rt6i_siblings);
+ break;
+ }
+ sibling = sibling->dst.rt6_next;
+ }
+ /* For each sibling in the list, increment the counter of
+ * siblings. We can check if all the counter are equal.
+ */
+ rt6i_nsiblings = 0;
+ list_for_each_entry_safe(sibling, temp_sibling,
+ &rt->rt6i_siblings,
+ rt6i_siblings) {
+ sibling->rt6i_nsiblings++;
+ if (unlikely(sibling->rt6i_nsiblings !=
+ rt->rt6i_nsiblings)) {
+ pr_err("Wrong number of siblings for route %p (%d)\n",
+ sibling, sibling->rt6i_nsiblings);
+ }
+ rt6i_nsiblings++;
+ }
+ if (unlikely(rt6i_nsiblings != rt->rt6i_nsiblings)) {
+ pr_err("Wrong number of siblings for route %p. I have %d routes, but count %d siblings\n",
+ rt, rt6i_nsiblings, rt->rt6i_nsiblings);
+ }
+ }
+#endif
/*
* insert node
*/
@@ -1197,6 +1255,21 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp,
if (fn->rr_ptr == rt)
fn->rr_ptr = NULL;
+#ifdef CONFIG_IPV6_MULTIPATH
+ /* Remove this entry from other siblings */
+ if (rt->rt6i_nsiblings) {
+ struct rt6_info *sibling, *next_sibling;
+
+ /* For each siblings, decrement the counter of siblings */
+ list_for_each_entry_safe(sibling, next_sibling,
+ &rt->rt6i_siblings, rt6i_siblings) {
+ sibling->rt6i_nsiblings--;
+ }
+ rt->rt6i_nsiblings = 0;
+ list_del_init(&rt->rt6i_siblings);
+ }
+#endif
+
/* Adjust walkers */
read_lock(&fib6_walker_lock);
FOR_WALKERS(w) {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 83dafa5..ac8b3a2 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -57,6 +57,9 @@
#include <net/xfrm.h>
#include <net/netevent.h>
#include <net/netlink.h>
+#ifdef CONFIG_IPV6_MULTIPATH
+#include <net/nexthop.h>
+#endif
#include <asm/uaccess.h>
@@ -288,6 +291,10 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
+#ifdef CONFIG_IPV6_MULTIPATH
+ INIT_LIST_HEAD(&rt->rt6i_siblings);
+ rt->rt6i_nsiblings = 0;
+#endif
}
return rt;
}
@@ -388,6 +395,124 @@ static bool rt6_need_strict(const struct in6_addr *daddr)
(IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL | IPV6_ADDR_LOOPBACK);
}
+#ifdef CONFIG_IPV6_MULTIPATH
+/*
+ * Multipath route selection.
+ */
+
+#ifdef CONFIG_IPV6_MULTIPATH_RANDOM
+/*
+ * Pseudo random candidate function
+ */
+static int rt6_info_hash_randomfn(unsigned int candidate_count)
+{
+ return random32() % candidate_count;
+}
+#endif
+
+#ifdef CONFIG_IPV6_MULTIPATH_RR
+/*
+ * Fake Round Robin candidate function
+ * If we want real RR, we need to add a counter in each route
+ */
+static int rt6_info_hash_falserr(unsigned int candidate_count)
+{
+ static unsigned int seed;
+ seed++;
+ return seed % candidate_count;
+}
+#endif
+
+#ifdef CONFIG_IPV6_MULTIPATH_HASH
+/*
+ * Pseudo random candidate using the src port, and other information
+ * Adapted from fib_info_hashfn()
+ */
+static int rt6_info_hash_nhsfn(unsigned int candidate_count,
+ const struct flowi6 *fl6)
+{
+ unsigned int val = fl6->flowi6_proto;
+
+ val ^= fl6->daddr.s6_addr32[0];
+ val ^= fl6->daddr.s6_addr32[1];
+ val ^= fl6->daddr.s6_addr32[2];
+ val ^= fl6->daddr.s6_addr32[3];
+
+ val ^= fl6->saddr.s6_addr32[0];
+ val ^= fl6->saddr.s6_addr32[1];
+ val ^= fl6->saddr.s6_addr32[2];
+ val ^= fl6->saddr.s6_addr32[3];
+
+ /* Work only if this not encapsulated */
+ switch (fl6->flowi6_proto) {
+ case IPPROTO_UDP:
+ case IPPROTO_TCP:
+ case IPPROTO_SCTP:
+ val ^= fl6->fl6_sport;
+ val ^= fl6->fl6_dport;
+ break;
+
+ case IPPROTO_ICMPV6:
+ val ^= fl6->fl6_icmp_type;
+ val ^= fl6->fl6_icmp_code;
+ break;
+ }
+ /* RFC6438 recommends using the flow label */
+ val ^= fl6->flowlabel;
+
+ /* Perhaps, we need to tune, this function? */
+ val = val ^ (val >> 7) ^ (val >> 12);
+ return val % candidate_count;
+}
+#endif
+
+/*
+ * This function return an index used to select (at random, round robin, ...)
+ * a route between any siblings.
+ *
+ * Note: fl6 can be NULL
+ */
+static unsigned int rt6_info_hashfn(const struct rt6_info *rt,
+ const struct flowi6 *fl6)
+{
+ int candidate_count = rt->rt6i_nsiblings + 1;
+
+#if defined(CONFIG_IPV6_MULTIPATH_RR)
+ return rt6_info_hash_falserr(candidate_count);
+#elif defined(CONFIG_IPV6_MULTIPATH_RANDOM)
+ return rt6_info_hash_randomfn(candidate_count);
+#elif defined(CONFIG_IPV6_MULTIPATH_HASH)
+ if (fl6 == NULL)
+ return 0;
+ return rt6_info_hash_nhsfn(candidate_count, fl6);
+#else
+ return 0;
+#endif
+}
+
+static struct rt6_info *rt6_multipath_select(struct rt6_info *match,
+ struct flowi6 *fl6)
+{
+ struct rt6_info *sibling, *next_sibling;
+ int route_choosen;
+
+ route_choosen = rt6_info_hashfn(match, fl6);
+ /* Don't change the route, if route_choosen == 0
+ * (siblings does not include ourself)
+ */
+ if (route_choosen)
+ list_for_each_entry_safe(sibling, next_sibling,
+ &match->rt6i_siblings, rt6i_siblings) {
+ route_choosen--;
+ if (route_choosen == 0) {
+ match = sibling;
+ break;
+ }
+ }
+ return match;
+}
+#endif /* CONFIG_IPV6_MULTIPATH */
+
/*
* Route lookup. Any table->tb6_lock is implied.
*/
@@ -705,6 +830,10 @@ static struct rt6_info *ip6_pol_route_lookup(struct net *net,
restart:
rt = fn->leaf;
rt = rt6_device_match(net, rt, &fl6->saddr, fl6->flowi6_oif, flags);
+#ifdef CONFIG_IPV6_MULTIPATH
+ if (rt->rt6i_nsiblings && fl6->flowi6_oif == 0)
+ rt = rt6_multipath_select(rt, fl6);
+#endif
BACKTRACK(net, &fl6->saddr);
out:
dst_use(&rt->dst, jiffies);
@@ -866,7 +995,10 @@ restart_2:
restart:
rt = rt6_select(fn, oif, strict | reachable);
-
+#ifdef CONFIG_IPV6_MULTIPATH
+ if (rt->rt6i_nsiblings && oif == 0)
+ rt = rt6_multipath_select(rt, fl6);
+#endif
BACKTRACK(net, &fl6->saddr);
if (rt == net->ipv6.ip6_null_entry ||
rt->rt6i_flags & RTF_CACHE)
@@ -2247,6 +2379,9 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] = {
[RTA_IIF] = { .type = NLA_U32 },
[RTA_PRIORITY] = { .type = NLA_U32 },
[RTA_METRICS] = { .type = NLA_NESTED },
+#ifdef CONFIG_IPV6_MULTIPATH
+ [RTA_MULTIPATH] = { .len = sizeof(struct rtnexthop) },
+#endif
};
static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -2324,11 +2459,69 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
if (tb[RTA_TABLE])
cfg->fc_table = nla_get_u32(tb[RTA_TABLE]);
+#ifdef CONFIG_IPV6_MULTIPATH
+ if (tb[RTA_MULTIPATH]) {
+ cfg->fc_mp = nla_data(tb[RTA_MULTIPATH]);
+ cfg->fc_mp_len = nla_len(tb[RTA_MULTIPATH]);
+ }
+#endif
+
err = 0;
errout:
return err;
}
+#ifdef CONFIG_IPV6_MULTIPATH
+static int ip6_route_multipath(struct fib6_config *cfg, int add)
+{
+ struct fib6_config r_cfg;
+ struct rtnexthop *rtnh;
+ int remaining;
+ int attrlen;
+ int err = 0, last_err = 0;
+
+beginning:
+ rtnh = (struct rtnexthop *)cfg->fc_mp;
+ remaining = cfg->fc_mp_len;
+
+ /* Parse a Multipath Entry */
+ while (rtnh_ok(rtnh, remaining)) {
+ memcpy(&r_cfg, cfg, sizeof(*cfg));
+ if (rtnh->rtnh_ifindex)
+ r_cfg.fc_ifindex = rtnh->rtnh_ifindex;
+
+ attrlen = rtnh_attrlen(rtnh);
+ if (attrlen > 0) {
+ struct nlattr *nla, *attrs = rtnh_attrs(rtnh);
+
+ nla = nla_find(attrs, attrlen, RTA_GATEWAY);
+ if (nla) {
+ nla_memcpy(&r_cfg.fc_gateway, nla, 16);
+ r_cfg.fc_flags |= RTF_GATEWAY;
+ }
+ }
+ err = add ? ip6_route_add(&r_cfg) : ip6_route_del(&r_cfg);
+ if (err) {
+ last_err = err;
+ /* If we are trying to remove a route, do not stop the
+ * loop when ip6_route_del() fails (because next hop is
+ * already gone), we should try to remove all next hops.
+ */
+ if (add) {
+ /* If add fails, we should try to delete all
+ * next hops that have been already added.
+ */
+ add = 0;
+ goto beginning;
+ }
+ }
+ rtnh = rtnh_next(rtnh, &remaining);
+ }
+
+ return last_err;
+}
+#endif /* CONFIG_IPV6_MULTIPATH */
+
static int inet6_rtm_delroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
{
struct fib6_config cfg;
@@ -2338,7 +2531,12 @@ static int inet6_rtm_delroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *a
if (err < 0)
return err;
- return ip6_route_del(&cfg);
+#ifdef CONFIG_IPV6_MULTIPATH
+ if (cfg.fc_mp)
+ return ip6_route_multipath(&cfg, 0);
+ else
+#endif
+ return ip6_route_del(&cfg);
}
static int inet6_rtm_newroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
@@ -2350,7 +2548,12 @@ static int inet6_rtm_newroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *a
if (err < 0)
return err;
- return ip6_route_add(&cfg);
+#ifdef CONFIG_IPV6_MULTIPATH
+ if (cfg.fc_mp)
+ return ip6_route_multipath(&cfg, 1);
+ else
+#endif
+ return ip6_route_add(&cfg);
}
static inline size_t rt6_nlmsg_size(void)
--
1.7.12
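A note on the selection logic in this patch: the final steps of the hash (mix in the flow label, fold the high bits down, reduce modulo the number of candidates) and the sibling walk can be tried out in userspace. The following is only a minimal sketch under stated assumptions. The names pick_candidate(), select_route() and struct route are hypothetical, `seed` stands in for whatever the earlier part of the hash function (not shown in this excerpt) computed, and a simple singly linked list replaces the kernel's sibling list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold a 32-bit value and reduce it to an index in
 * [0, candidate_count). It mirrors the last three statements of the hash
 * in the patch; `seed` is an assumption standing in for the part of the
 * function that is not shown. */
static unsigned int pick_candidate(uint32_t seed, uint32_t flowlabel,
                                   unsigned int candidate_count)
{
    uint32_t val = seed;

    val ^= flowlabel;                      /* mix in the flow label */
    val = val ^ (val >> 7) ^ (val >> 12);  /* fold high bits into low bits */
    return val % candidate_count;
}

/* Hypothetical stand-in for a route with siblings. */
struct route {
    int id;
    struct route *next_sibling;
};

/* Index 0 keeps the matched route itself; index n walks to the n-th
 * sibling, stopping early if the list is shorter than expected. */
static struct route *select_route(struct route *match, unsigned int choice)
{
    struct route *r = match;

    while (choice > 0 && r->next_sibling != NULL) {
        r = r->next_sibling;
        choice--;
    }
    return r;
}
```

With these definitions, pick_candidate(0, 0x80, 4) evaluates to 1, so select_route() would return the first sibling rather than the matched route. As in the patch, the same flow label always maps to the same candidate.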
* [PATCH net-next] net: only run neigh_forced_gc() from one cpu
From: Eric Dumazet @ 2012-09-19 9:27 UTC (permalink / raw)
To: David Miller
Cc: netdev, Maciej Żenczykowski, Tom Herbert, Lorenzo Colitti
From: Eric Dumazet <edumazet@google.com>
With a multiqueue NIC or RPS, we can end up with all CPUs spending a
huge number of cycles in neigh_forced_gc(), and the machine can
crash.
Since we are probably under attack, it's better to let only one CPU
do the scan and have the other CPUs return immediately from neigh_forced_gc().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Tom Herbert <therbert@google.com>
---
Google-Bug-Id: 7121897
include/net/neighbour.h | 1 +
net/core/neighbour.c | 9 +++++++--
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 0dab173..ba21e93 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -178,6 +178,7 @@ struct neigh_table {
struct neigh_statistics __percpu *stats;
struct neigh_hash_table __rcu *nht;
struct pneigh_entry **phash_buckets;
+ spinlock_t forced_gc_lock;
};
#define NEIGH_PRIV_ALIGN sizeof(long long)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index c160adb..1f7d8fa 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -134,9 +134,12 @@ static int neigh_forced_gc(struct neigh_table *tbl)
int i;
struct neigh_hash_table *nht;
+ if (!spin_trylock_bh(&tbl->forced_gc_lock))
+ return 0;
+
NEIGH_CACHE_STAT_INC(tbl, forced_gc_runs);
- write_lock_bh(&tbl->lock);
+ write_lock(&tbl->lock);
nht = rcu_dereference_protected(tbl->nht,
lockdep_is_held(&tbl->lock));
for (i = 0; i < (1 << nht->hash_shift); i++) {
@@ -169,7 +172,8 @@ static int neigh_forced_gc(struct neigh_table *tbl)
tbl->last_flush = jiffies;
- write_unlock_bh(&tbl->lock);
+ write_unlock(&tbl->lock);
+ spin_unlock_bh(&tbl->forced_gc_lock);
return shrunk;
}
@@ -1545,6 +1549,7 @@ static void neigh_table_init_no_netlink(struct neigh_table *tbl)
panic("cannot allocate neighbour cache hashes");
rwlock_init(&tbl->lock);
+ spin_lock_init(&tbl->forced_gc_lock);
INIT_DELAYED_WORK_DEFERRABLE(&tbl->gc_work, neigh_periodic_work);
schedule_delayed_work(&tbl->gc_work, tbl->parms.reachable_time);
setup_timer(&tbl->proxy_timer, neigh_proxy_process, (unsigned long)tbl);
* Re: [PATCH net-next] net: only run neigh_forced_gc() from one cpu
From: Neil Horman @ 2012-09-19 10:50 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, netdev, Maciej Żenczykowski, Tom Herbert,
Lorenzo Colitti
In-Reply-To: <1348046827.26523.571.camel@edumazet-glaptop>
On Wed, Sep 19, 2012 at 11:27:07AM +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> With a multiqueue NIC or RPS, we can end up with all CPUs spending a
> huge number of cycles in neigh_forced_gc(), and the machine can
> crash.
>
> Since we are probably under attack, it's better to let only one CPU
> do the scan and have the other CPUs return immediately from neigh_forced_gc().
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Lorenzo Colitti <lorenzo@google.com>
> Cc: Maciej Żenczykowski <maze@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
> Google-Bug-Id: 7121897
>
> include/net/neighbour.h | 1 +
> net/core/neighbour.c | 9 +++++++--
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/neighbour.h b/include/net/neighbour.h
> index 0dab173..ba21e93 100644
> --- a/include/net/neighbour.h
> +++ b/include/net/neighbour.h
> @@ -178,6 +178,7 @@ struct neigh_table {
> struct neigh_statistics __percpu *stats;
> struct neigh_hash_table __rcu *nht;
> struct pneigh_entry **phash_buckets;
> + spinlock_t forced_gc_lock;
> };
>
> #define NEIGH_PRIV_ALIGN sizeof(long long)
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index c160adb..1f7d8fa 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -134,9 +134,12 @@ static int neigh_forced_gc(struct neigh_table *tbl)
> int i;
> struct neigh_hash_table *nht;
>
> + if (!spin_trylock_bh(&tbl->forced_gc_lock))
> + return 0;
> +
This is going to cause callers in neigh_alloc to immediately fail their
allocation attempts. Would it be a good idea to modify that call site so that,
instead of returning NULL, it rereads tbl->entries before comparing to
gc_thresh3, in the hope that the CPU in the garbage collecting routine has freed
some entries?
Neil
* Re: [PATCH net-next] net: only run neigh_forced_gc() from one cpu
From: Eric Dumazet @ 2012-09-19 11:07 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, netdev, Maciej Żenczykowski, Tom Herbert,
Lorenzo Colitti
In-Reply-To: <20120919105038.GA12352@hmsreliant.think-freely.org>
On Wed, 2012-09-19 at 06:50 -0400, Neil Horman wrote:
> This is going to cause callers in neigh_alloc to immediately fail their
> allocation attempts. Would it be a good idea to modify that call site so that,
> instead of returning NULL, it rereads tbl->entries before comparing to
> gc_thresh3, in the hope that the CPU in the garbage collecting routine has freed
> some entries?
neigh_alloc() fails only if gc_thresh3 is hit, and if it is hit, we are
under attack by definition.
(the gc runs every 5 seconds if we are above gc_thresh2 and below
gc_thresh3)
No matter what you try, the attacker is going to be the winner.
The best thing here is to drop packets, not to spend several
milliseconds serving one packet, as queues are going to tail drop anyway.
* Re: [PATCH net-next] net: only run neigh_forced_gc() from one cpu
From: Eric Dumazet @ 2012-09-19 11:09 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, netdev, Maciej Żenczykowski, Tom Herbert,
Lorenzo Colitti
In-Reply-To: <1348052825.26523.676.camel@edumazet-glaptop>
On Wed, 2012-09-19 at 13:07 +0200, Eric Dumazet wrote:
> On Wed, 2012-09-19 at 06:50 -0400, Neil Horman wrote:
>
> > This is going to cause callers in neigh_alloc to immediately fail their
> > allocation attempts. Would it be a good idea to modify that call site so that,
> > instead of returning NULL, it rereads tbl->entries before comparing to
> > gc_thresh3, in the hope that the CPU in the garbage collecting routine has freed
> > some entries?
>
> neigh_alloc() fails only if gc_thresh3 is hit, and if it is hit, we are
> under attack by definition.
>
> (the gc runs every 5 seconds if we are above gc_thresh2 and below
> gc_thresh3)
>
> No matter what you try, the attacker is going to be the winner.
>
> The best thing here is to drop packets, not to spend several
> milliseconds serving one packet, as queues are going to tail drop anyway.
>
I meant several hundred milliseconds per packet.
In our tests we even trigger a softlockup, so that's more than 10 seconds
waiting for the rwlock, for a single packet.
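For readers following the thread, the "only one CPU scans, everyone else returns at once" idea can be sketched in userspace C. This is a minimal sketch under stated assumptions, not the kernel code: struct table and forced_gc() are hypothetical names, a C11 atomic_flag plays the role of forced_gc_lock with spin_trylock_bh(), and halving the entry count is a stand-in for the real hash-table scan. The real patch still takes tbl->lock for the scan itself, which this sketch leaves out.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the neighbour table. */
struct table {
    atomic_flag gc_busy;  /* plays the role of forced_gc_lock */
    int entries;          /* only touched while gc_busy is held */
};

/* Returns the number of entries freed, or 0 if another caller is
 * already scanning (the trylock failed). */
static int forced_gc(struct table *tbl)
{
    int shrunk;

    /* test_and_set returns the previous value: true means busy */
    if (atomic_flag_test_and_set_explicit(&tbl->gc_busy,
                                          memory_order_acquire))
        return 0;

    shrunk = tbl->entries / 2;  /* placeholder for the real scan */
    tbl->entries -= shrunk;

    atomic_flag_clear_explicit(&tbl->gc_busy, memory_order_release);
    return shrunk;
}
```

A caller that loses the race gets 0 back immediately, which is what makes the allocation in the caller fail fast instead of queueing up behind the scan, as discussed above.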
* [PATCH 1/2] Added information about which firmware file is being requested.
From: Jarl Friis @ 2012-09-19 11:18 UTC (permalink / raw)
To: Stefano Brivio, Gábor Stefanik
Cc: linux-wireless, b43-dev, netdev, John W. Linville, Jarl Friis
Log which firmware file is actually being requested, so that users can see
which file is being used.
Signed-off-by: Jarl Friis <jarl@softace.dk>
---
drivers/net/wireless/b43/main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/wireless/b43/main.c b/drivers/net/wireless/b43/main.c
index a140165..202a0eb 100644
--- a/drivers/net/wireless/b43/main.c
+++ b/drivers/net/wireless/b43/main.c
@@ -2131,6 +2131,7 @@ int b43_do_request_fw(struct b43_request_fw_context *ctx,
B43_WARN_ON(1);
return -ENOSYS;
}
+ b43info(ctx->dev->wl, "Requesting firmware file '%s'\n", ctx->fwname);
err = request_firmware(&blob, ctx->fwname, ctx->dev->dev->dev);
if (err == -ENOENT) {
snprintf(ctx->errors[ctx->req_type],
--
1.7.9.5
* [PATCH 2/2] Using LP firmware for taking advantage of the low-power capabilities.
From: Jarl Friis @ 2012-09-19 11:18 UTC (permalink / raw)
To: Stefano Brivio, Gábor Stefanik
Cc: linux-wireless, b43-dev, netdev, John W. Linville, Jarl Friis
In-Reply-To: <1348053493-22955-1-git-send-email-jarl@softace.dk>
Use the LP-specific firmware to take better advantage of the
low-power capabilities.
Signed-off-by: Jarl Friis <jarl@softace.dk>
---
drivers/net/wireless/b43/main.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/b43/main.c b/drivers/net/wireless/b43/main.c
index 202a0eb..9ee6030 100644
--- a/drivers/net/wireless/b43/main.c
+++ b/drivers/net/wireless/b43/main.c
@@ -8,6 +8,7 @@
Copyright (c) 2005 Danny van Dyk <kugelfang@gentoo.org>
Copyright (c) 2005 Andreas Jaggi <andreas.jaggi@waterwave.ch>
Copyright (c) 2010-2011 Rafał Miłecki <zajec5@gmail.com>
+ Copyright (c) 2012 Jarl Friis <jarl@softace.dk>
SDIO support
Copyright (c) 2009 Albert Herranz <albert_herranz@yahoo.es>
@@ -72,6 +73,7 @@ MODULE_FIRMWARE("b43/ucode11.fw");
MODULE_FIRMWARE("b43/ucode13.fw");
MODULE_FIRMWARE("b43/ucode14.fw");
MODULE_FIRMWARE("b43/ucode15.fw");
+MODULE_FIRMWARE("b43/ucode16_lp.fw");
MODULE_FIRMWARE("b43/ucode16_mimo.fw");
MODULE_FIRMWARE("b43/ucode5.fw");
MODULE_FIRMWARE("b43/ucode9.fw");
@@ -2208,6 +2210,12 @@ static int b43_try_request_fw(struct b43_request_fw_context *ctx)
else
goto err_no_ucode;
break;
+ case B43_PHYTYPE_LP:
+ if (rev >= 16)
+ filename = "ucode16_lp";
+ else
+ goto err_no_ucode;
+ break;
case B43_PHYTYPE_HT:
if (rev == 29)
filename = "ucode29_mimo";
@@ -2277,8 +2285,10 @@ static int b43_try_request_fw(struct b43_request_fw_context *ctx)
filename = "lp0initvals13";
else if (rev == 14)
filename = "lp0initvals14";
- else if (rev >= 15)
+ else if (rev == 15)
filename = "lp0initvals15";
+ else if (rev >= 16)
+ filename = "lp0initvals16";
else
goto err_no_initvals;
break;
@@ -2336,8 +2346,10 @@ static int b43_try_request_fw(struct b43_request_fw_context *ctx)
filename = "lp0bsinitvals13";
else if (rev == 14)
filename = "lp0bsinitvals14";
- else if (rev >= 15)
+ else if (rev == 15)
filename = "lp0bsinitvals15";
+ else if (rev >= 16)
+ filename = "lp0bsinitvals16";
else
goto err_no_initvals;
break;
--
1.7.9.5
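The revision-to-filename change in the two initvals hunks can be summarized as a small lookup. The following is only a sketch under stated assumptions: lp_initvals_name() is a hypothetical helper, not a function in the driver, and revisions below 13 are handled by branches that this excerpt does not show, so the sketch simply returns NULL for them.

```c
#include <stddef.h>

/* Hypothetical helper: map a core revision to the LP-PHY initvals
 * firmware name, following the hunks above. Before the patch, every
 * revision >= 15 got "lp0initvals15"; after it, 16 and up get their
 * own file. Revisions below 13 are not covered by this sketch. */
static const char *lp_initvals_name(unsigned int rev)
{
    if (rev == 13)
        return "lp0initvals13";
    if (rev == 14)
        return "lp0initvals14";
    if (rev == 15)
        return "lp0initvals15";
    if (rev >= 16)
        return "lp0initvals16";
    return NULL;  /* assumption: handled elsewhere in the real driver */
}
```

The band-specific "lp0bsinitvals" names follow the same pattern in the patch.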
* Re: New commands to configure IOV features
From: Yuval Mintz @ 2012-09-19 11:07 UTC (permalink / raw)
To: davem@davemloft.net, netdev@vger.kernel.org; +Cc: Ariel Elior, Eilon Greenstein
In-Reply-To: <C5551D9AAB213A418B7FD5E4A6F30A070C7D9DCD@ORSMSX106.amr.corp.intel.com>
>>> Back to the original discussion though--has anyone got any ideas about
>>> the best way to trigger runtime creation of VFs? I don't know what
>>> the binary APIs looks like, but via sysfs I could see something like
>>>
>>> echo number_of_new_vfs_to_create >
>>> /sys/bus/pci/devices/<address>/create_vfs
>>>
>>> Something else that occurred to me--is there buy-in from driver
>>> maintainers? I know the Intel ethernet drivers (what I'm most
>>> familiar
>>> with) would need to be substantially modified to support on-the-fly
>>> addition of new vfs. Currently they assume that the number of vfs is
>>> known at module init time.
>>
>> Why couldn't rtnl_link_ops be used for this. It is already the preferred
>> interface to create vlan's, bond devices, and other virtual devices?
>> The one issue is that do the created VF's exist in kernel as devices or
>> only visible to guest?
>
> I would say that rtnl_link_ops are network oriented and not appropriate for something like a storage controller or graphics device, which are two other common SR-IOV capable devices.
Hi Dave,
We're currently fine-tuning our SRIOV support, which we will shortly
send upstream.
We've encountered a problem, though: all drivers currently supporting
SRIOV do so through a module parameter, e.g., 'max_vfs' for ixgbe,
'num_vfs' for benet, etc.
The SRIOV feature is disabled by default in all the drivers; it can only
be enabled through the module parameter.
We don't want the lack of an SRIOV module parameter in the bnx2x driver to
be the bottleneck when we submit the SRIOV feature upstream, and we
also don't want to enable SRIOV by default (following the same logic as
other drivers; most users don't use SRIOV and it would strain their
resources).
As we see it, there are several possible ways of solving the issue:
1. Use some network-tool (e.g., ethtool).
2. Implement a standard sysfs interface for PCIe devices, as SRIOV is
not solely network-related (this should be done via the PCI linux
tree).
3. Implement a module param in our bnx2x code.
We would like to know your preferred method for solving this issue,
and to hear whether you have another (better?) method by which we can add this
kind of support.
Thanks,
Yuval Mintz
* [PATCH v2 0/9] net/macb: driver enhancement concerning GEM support, ring logic and cleanup
From: Nicolas Ferre @ 2012-09-19 11:55 UTC (permalink / raw)
To: netdev, davem, havard
Cc: bhutchings, linux-arm-kernel, plagnioj, patrice.vilchez,
linux-kernel, Nicolas Ferre
This is enhancement work that began several years ago. I am trying to catch up
with some performance improvements that were implemented back then by Havard.
The ring index logic and the TX error path modification are the biggest changes
but some cleanup/debugging have been added along the way.
The GEM revision will benefit from the Gigabit support.
The series has been tested on several Atmel AT91 SoCs with the two MACB/GEM
flavors.
v2: - modify the tx error handling: now uses a workqueue
- information provided by ethtool -i was not accurate: removed
Havard Skinnemoen (4):
net/macb: memory barriers cleanup
net/macb: change debugging messages
net/macb: clean up ring buffer logic
net/macb: Offset first RX buffer by two bytes
Nicolas Ferre (4):
net/macb: remove macb_get_drvinfo()
net/macb: tx status is more than 8 bits now
net/macb: ethtool interface: add register dump feature
net/macb: better manage tx errors
Patrice Vilchez (1):
net/macb: Add support for Gigabit Ethernet mode
drivers/net/ethernet/cadence/macb.c | 433 +++++++++++++++++++++++++-----------
drivers/net/ethernet/cadence/macb.h | 30 ++-
2 files changed, 321 insertions(+), 142 deletions(-)
--
1.7.11.3
* [PATCH v2 1/9] net/macb: Add support for Gigabit Ethernet mode
From: Nicolas Ferre @ 2012-09-19 11:55 UTC (permalink / raw)
To: netdev, davem, havard
Cc: bhutchings, linux-arm-kernel, plagnioj, patrice.vilchez,
linux-kernel, Nicolas Ferre
In-Reply-To: <cover.1348055112.git.nicolas.ferre@atmel.com>
From: Patrice Vilchez <patrice.vilchez@atmel.com>
Add Gigabit Ethernet mode to GEM cadence IP and enable RGMII connection.
Signed-off-by: Patrice Vilchez <patrice.vilchez@atmel.com>
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
---
drivers/net/ethernet/cadence/macb.c | 15 ++++++++++++---
drivers/net/ethernet/cadence/macb.h | 4 ++++
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index c4834c2..56375e2 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -152,13 +152,17 @@ static void macb_handle_link_change(struct net_device *dev)
reg = macb_readl(bp, NCFGR);
reg &= ~(MACB_BIT(SPD) | MACB_BIT(FD));
+ if (macb_is_gem(bp))
+ reg &= ~GEM_BIT(GBE);
if (phydev->duplex)
reg |= MACB_BIT(FD);
if (phydev->speed == SPEED_100)
reg |= MACB_BIT(SPD);
+ if (phydev->speed == SPEED_1000)
+ reg |= GEM_BIT(GBE);
- macb_writel(bp, NCFGR, reg);
+ macb_or_gem_writel(bp, NCFGR, reg);
bp->speed = phydev->speed;
bp->duplex = phydev->duplex;
@@ -213,7 +217,10 @@ static int macb_mii_probe(struct net_device *dev)
}
/* mask with MAC supported features */
- phydev->supported &= PHY_BASIC_FEATURES;
+ if (macb_is_gem(bp))
+ phydev->supported &= PHY_GBIT_FEATURES;
+ else
+ phydev->supported &= PHY_BASIC_FEATURES;
phydev->advertising = phydev->supported;
@@ -1377,7 +1384,9 @@ static int __init macb_probe(struct platform_device *pdev)
bp->phy_interface = err;
}
- if (bp->phy_interface == PHY_INTERFACE_MODE_RMII)
+ if (bp->phy_interface == PHY_INTERFACE_MODE_RGMII)
+ macb_or_gem_writel(bp, USRIO, GEM_BIT(RGMII));
+ else if (bp->phy_interface == PHY_INTERFACE_MODE_RMII)
#if defined(CONFIG_ARCH_AT91)
macb_or_gem_writel(bp, USRIO, (MACB_BIT(RMII) |
MACB_BIT(CLKEN)));
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 335e288..f69ceef 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -145,6 +145,8 @@
#define MACB_IRXFCS_SIZE 1
/* GEM specific NCFGR bitfields. */
+#define GEM_GBE_OFFSET 10
+#define GEM_GBE_SIZE 1
#define GEM_CLK_OFFSET 18
#define GEM_CLK_SIZE 3
#define GEM_DBW_OFFSET 21
@@ -246,6 +248,8 @@
/* Bitfields in USRIO (AT91) */
#define MACB_RMII_OFFSET 0
#define MACB_RMII_SIZE 1
+#define GEM_RGMII_OFFSET 0 /* GEM gigabit mode */
+#define GEM_RGMII_SIZE 1
#define MACB_CLKEN_OFFSET 1
#define MACB_CLKEN_SIZE 1
--
1.7.11.3
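The NCFGR update in macb_handle_link_change() can be checked in isolation as a pure function. This is a minimal sketch under stated assumptions: ncfgr_for_link() is a hypothetical helper, the GBE bit position comes from GEM_GBE_OFFSET in this patch, and the SPD and FD bit positions are assumptions because their definitions do not appear in this excerpt.

```c
#include <stdint.h>

#define NCFGR_SPD (1u << 0)   /* assumed position of MACB_BIT(SPD) */
#define NCFGR_FD  (1u << 1)   /* assumed position of MACB_BIT(FD) */
#define NCFGR_GBE (1u << 10)  /* GEM_GBE_OFFSET from the patch */

/* Hypothetical helper: compute the new NCFGR value for a link state.
 * speed is in Mbit/s (10, 100 or 1000). As in the patch, the GBE bit
 * is cleared only on GEM hardware and set whenever speed is 1000. */
static uint32_t ncfgr_for_link(uint32_t reg, int is_gem, int speed,
                               int full_duplex)
{
    reg &= ~(NCFGR_SPD | NCFGR_FD);
    if (is_gem)
        reg &= ~NCFGR_GBE;

    if (full_duplex)
        reg |= NCFGR_FD;
    if (speed == 100)
        reg |= NCFGR_SPD;
    if (speed == 1000)
        reg |= NCFGR_GBE;

    return reg;
}
```

Note that the sketch relies on the same guard as the patch: on non-GEM hardware the PHY's supported mask is limited to PHY_BASIC_FEATURES, so a speed of 1000 should never be reported there.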