* [PATCH net-next-2.6] net: sk_add_backlog() take rmem_alloc into account
From: Eric Dumazet @ 2010-04-27 20:21 UTC (permalink / raw)
To: David Miller; +Cc: bmb, therbert, netdev, rick.jones2
In-Reply-To: <1272389872.2295.405.camel@edumazet-laptop>
On Tuesday, 27 April 2010 at 19:37 +0200, Eric Dumazet wrote:
> We might use the ticket spinlock paradigm to let writers go in parallel
> and let the user take the socket lock.
>
> Instead of having the bh_lock_sock() to protect receive_queue *and*
> backlog, writers get a unique slot in a table, that 'user' can handle
> later.
>
> Or serialize writers (before they try to bh_lock_sock()) with a
> dedicated lock, so that user has 50% chances to get the sock lock,
> contending with at most one writer.
The following patch fixes the issue for me, with little performance hit
on the fast path.
Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
receiver can now process ~200.000 pps (instead of ~100 pps before the
patch) on my dev machine.
Thanks !
[PATCH net-next-2.6] net: sk_add_backlog() take rmem_alloc into account
The current socket backlog limit is not enough to really stop DDoS attacks,
because the user thread spends a lot of time processing a full backlog each
round, and the user might spin wildly on the socket lock.
We should add backlog size and receive_queue size (aka rmem_alloc) to
pace writers, and let the user run without being slowed down too much.
Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
stress situations.
Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
receiver can now process ~200.000 pps (instead of ~100 pps before the
patch) on an 8 core machine.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/sock.h | 13 +++++++++++--
net/core/sock.c | 5 ++++-
net/ipv4/udp.c | 4 ++++
net/ipv6/udp.c | 8 ++++++++
4 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 86a8ca1..4b0097d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -255,7 +255,6 @@ struct sock {
struct sk_buff *head;
struct sk_buff *tail;
int len;
- int limit;
} sk_backlog;
wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
@@ -604,10 +603,20 @@ static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
skb->next = NULL;
}
+/*
+ * Take into account size of receive queue and backlog queue
+ */
+static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb)
+{
+ unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);
+
+ return qsize + skb->truesize > sk->sk_rcvbuf;
+}
+
/* The per-socket spinlock must be held here. */
static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
- if (sk->sk_backlog.len >= max(sk->sk_backlog.limit, sk->sk_rcvbuf << 1))
+ if (sk_rcvqueues_full(sk, skb))
return -ENOBUFS;
__sk_add_backlog(sk, skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index 58ebd14..5104175 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -327,6 +327,10 @@ int sk_receive_skb(struct sock *sk, struct sk_buff *skb, const int nested)
skb->dev = NULL;
+ if (sk_rcvqueues_full(sk, skb)) {
+ atomic_inc(&sk->sk_drops);
+ goto discard_and_relse;
+ }
if (nested)
bh_lock_sock_nested(sk);
else
@@ -1885,7 +1889,6 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_allocation = GFP_KERNEL;
sk->sk_rcvbuf = sysctl_rmem_default;
sk->sk_sndbuf = sysctl_wmem_default;
- sk->sk_backlog.limit = sk->sk_rcvbuf << 1;
sk->sk_state = TCP_CLOSE;
sk_set_socket(sk, sock);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1e18f9c..776c844 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1372,6 +1372,10 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
goto drop;
}
+
+ if (sk_rcvqueues_full(sk, skb))
+ goto drop;
+
rc = 0;
bh_lock_sock(sk);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 2850e35..3ead20a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -584,6 +584,10 @@ static void flush_stack(struct sock **stack, unsigned int count,
sk = stack[i];
if (skb1) {
+ if (sk_rcvqueues_full(sk, skb)) {
+ kfree_skb(skb1);
+ goto drop;
+ }
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
udpv6_queue_rcv_skb(sk, skb1);
@@ -759,6 +763,10 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
/* deliver */
+ if (sk_rcvqueues_full(sk, skb)) {
+ sock_put(sk);
+ goto discard;
+ }
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
udpv6_queue_rcv_skb(sk, skb);
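The accounting that sk_rcvqueues_full() performs can be tried out in user space. The following is only a sketch of the arithmetic, not kernel code: `struct fake_sock` and its field names are hypothetical stand-ins for `sk->sk_backlog.len`, `sk->sk_rmem_alloc` and `sk->sk_rcvbuf`, and the old check is reproduced only for comparison.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the three struct sock fields the helper reads. */
struct fake_sock {
	unsigned int backlog_len;	/* bytes queued on the backlog */
	unsigned int rmem_alloc;	/* bytes queued on the receive queue */
	unsigned int rcvbuf;		/* receive buffer limit */
};

/* New rule: both queues plus the incoming buffer must fit in rcvbuf. */
static inline bool rcvqueues_full(const struct fake_sock *sk,
				  unsigned int truesize)
{
	unsigned int qsize = sk->backlog_len + sk->rmem_alloc;

	return qsize + truesize > sk->rcvbuf;
}

/* Old rule: only the backlog counted, against max(limit, 2 * rcvbuf). */
static inline bool old_backlog_full(const struct fake_sock *sk,
				    unsigned int limit)
{
	unsigned int cap = sk->rcvbuf << 1;

	if (limit > cap)
		cap = limit;
	return sk->backlog_len >= cap;
}
```

With 1000 bytes on the backlog, 3000 bytes on the receive queue and a 4096 byte rcvbuf, a 97 byte buffer is refused by the new rule, while the old rule would still have accepted it because it ignored the receive queue.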
^ permalink raw reply related
* Re: [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: David Miller @ 2010-04-27 20:07 UTC (permalink / raw)
To: richardcochran; +Cc: netdev
In-Reply-To: <20100427091405.GA5098@riccoc20.at.omicron.at>
From: Richard Cochran <richardcochran@gmail.com>
Date: Tue, 27 Apr 2010 11:14:05 +0200
> +struct ptp_clock {
> + struct cdev cdev;
> + struct device *dev;
> + struct ptp_clock_info *info;
> + dev_t devid;
> + int index; /* index into clocks[], also the minor number */
> + struct semaphore mux; /* one process at a time on a device */
> +};
A mutex works just as well and is preferable to a semaphore.
> +/* private globals */
> +
> +static const struct file_operations ptp_fops;
> +static dev_t ptp_devt;
> +static struct class *ptp_class;
> +struct ptp_clock *clocks[PTP_MAX_CLOCKS];
> +DEFINE_SPINLOCK(clocks_lock);
The clocks[] table is not protected by any mutual exclusion in the
unregister method, it needs at least a spinlock or similar. Probably
clocks_lock was meant to be used for this purpose.
Also, having arbitrary limits like PTP_MAX_CLOCKS and a linear scan
when registering or unregistering is suboptimal.
Even if we're not expecting to have many of these things, use a
linux/list.h list to manage them.
In fact, if you keep them in a list you don't need to look them up at
all during at least unregister, you can return the real "struct
ptp_clock *" as an opaque ERR_PTR() back to the caller on register and
on unregister you can just list_del() on it.
Don't expose the layout of struct ptp_clock to the users, you don't have
to. Just:
struct ptp_clock;
in the exported header file, and then you can return "struct ptp_clock *"
from ptp_clock_register() just fine.
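A user-space sketch of that shape, for illustration only: the function names `clock_register()`, `clock_unregister()` and `clock_count()`, the `caps` field and the hand-rolled list are all hypothetical, standing in for linux/list.h and the real driver. Locking is left out; a real implementation would hold a lock around the list updates.

```c
#include <stdlib.h>

/* What the exported header would contain: the type stays opaque. */
struct ptp_clock;
struct ptp_clock *clock_register(int caps);
void clock_unregister(struct ptp_clock *clk);

/* Everything below is private to the implementation file. */
struct list_node {
	struct list_node *prev, *next;
};

struct ptp_clock {
	struct list_node node;	/* list linkage */
	int caps;		/* hypothetical payload */
};

/* Circular list head; the list is empty when it points at itself. */
static struct list_node clocks = { &clocks, &clocks };

struct ptp_clock *clock_register(int caps)
{
	struct ptp_clock *clk = malloc(sizeof(*clk));

	if (!clk)
		return NULL;
	clk->caps = caps;
	/* Append at the tail: no table scan and no fixed maximum. */
	clk->node.prev = clocks.prev;
	clk->node.next = &clocks;
	clocks.prev->next = &clk->node;
	clocks.prev = &clk->node;
	return clk;
}

void clock_unregister(struct ptp_clock *clk)
{
	/* The handle is the object itself, so unlinking needs no lookup. */
	clk->node.prev->next = clk->node.next;
	clk->node.next->prev = clk->node.prev;
	free(clk);
}

/* Helper for the example only: count the registered clocks. */
static unsigned int clock_count(void)
{
	unsigned int n = 0;

	for (struct list_node *p = clocks.next; p != &clocks; p = p->next)
		n++;
	return n;
}
```

Callers only ever see the forward declaration, so the structure layout can change without touching them.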
^ permalink raw reply
* Re: [net-2.6 PATCH] ixgbe: cleanup ethtool autoneg input
From: Jeff Kirsher @ 2010-04-27 20:05 UTC (permalink / raw)
To: David Miller; +Cc: netdev, gospo, donald.c.skidmore
In-Reply-To: <20100427.095629.191403481.davem@davemloft.net>
On Tue, Apr 27, 2010 at 09:56, David Miller <davem@davemloft.net> wrote:
>
> This is also not appropriate for net-2.6, it doesn't fix a
> regression in the regression list and it doesn't fix a catastrophic
> crash or failure.
>
> You really have to be kidding me if you think a patch like this
> is fine this late in the -RC series.
>
> You also aren't even numbering your patches, which is quite a
> transgression when submitting a set of patches all to the same
> driver and/or files.
> --
My apologies. My reasoning for not numbering them was that the
patches were not related and not dependent upon each other.
As far as sending the three unacceptable patches this late in the
-rc series, that was poor judgement on my part, sorry.
--
Cheers,
Jeff
^ permalink raw reply
* Re: [PATCH 0/4] net: ipmr netlink interface for route dumping
From: David Miller @ 2010-04-27 19:59 UTC (permalink / raw)
To: kaber; +Cc: netdev
In-Reply-To: <1272374785-3858-1-git-send-email-kaber@trash.net>
From: Patrick@trash.net, McHardy@trash.net, kaber@trash.net
Date: Tue, 27 Apr 2010 15:26:22 +0200
> Please apply or pull from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kaber/ipmr-2.6.git master
>
Pulled, thanks Patrick.
And good luck finding those missing quotes in your git send-email
scripting :-)
^ permalink raw reply
* Re: [PATCH net-next-2.6] rps: inet_rps_save_rxhash() argument is not const
From: David Miller @ 2010-04-27 19:56 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1272372171.2295.68.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 27 Apr 2010 14:42:51 +0200
> const qualifier on sock argument is misleading, since we can modify rxhash.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
It's a shame that because of the cast the compiler can't see this.
Instead it would be nice if the compiler only allowed casts from
a const pointer to another const pointer.
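The point about the cast can be shown with a small stand-alone sketch. `struct flow` and both function names are hypothetical; this is not the code from the patch, only the general pattern of a const-qualified argument being written through a cast.

```c
#include <stdint.h>

/* Hypothetical object carrying a cached receive hash. */
struct flow {
	uint32_t rxhash;
};

/*
 * Misleading signature: it promises not to modify *f, but the cast
 * drops the qualifier and the store is accepted by the compiler
 * without a diagnostic under default warning settings.
 */
static inline void save_rxhash_misleading(const struct flow *f, uint32_t hash)
{
	((struct flow *)f)->rxhash = hash;
}

/* Honest signature: the pointer is not const because the object is written. */
static inline void save_rxhash(struct flow *f, uint32_t hash)
{
	f->rxhash = hash;
}
```

GCC and Clang can flag casts like the first one when built with -Wcast-qual, but that warning is not enabled by default.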
^ permalink raw reply
* Re: [net-next-2.6 PATCH] ixgbe: ixgbe_down needs to stop dev_watchdog
From: David Miller @ 2010-04-27 19:56 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, john.r.fastabend, peter.p.waskiewicz.jr
In-Reply-To: <20100427121300.25038.2341.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 05:13:39 -0700
> From: John Fastabend <john.r.fastabend@intel.com>
>
> There is a small race between when the tx queues are stopped
> and when netif_carrier_off() is called in ixgbe_down. If the
> dev_watchdog() timer fires during this time it is possible for
> a false tx timeout to occur.
>
> This patch moves the netif_carrier_off() so that it is called before
> the tx queues are stopped preventing the dev_watchdog timer from
> detecting false tx timeouts. The race is seen occasionally when
> FCoE or DCB settings are being configured or changed.
>
> Testing note, running ifconfig up/down will not reproduce this
> issue because dev_open/dev_close call dev_deactivate() and then
> dev_activate().
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
^ permalink raw reply
* Re: [net-next-2.6 PATCH 2/2] ixgbe: fix bug when EITR=0 causing no writebacks
From: David Miller @ 2010-04-27 19:56 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, jesse.brandeburg
In-Reply-To: <20100427113739.24431.46358.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 04:37:41 -0700
> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
>
> writebacks can be held indefinitely by hardware if EITR=0, when
> combined with TXDCTL.WTHRESH=8. When EITR=0, WTHRESH should be
> set back to zero.
>
> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
^ permalink raw reply
* Re: [net-next-2.6 PATCH 1/2] ixgbe: enable extremely low latency
From: David Miller @ 2010-04-27 19:56 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, jesse.brandeburg
In-Reply-To: <20100427113651.24431.9221.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 04:37:20 -0700
> From: Jesse Brandeburg <jesse.brandeburg@intel.com>
>
> 82598/82599 can support EITR == 0, which allows for the
> absolutely lowest latency setting in the hardware. This disables
> writeback batching and anything else that relies upon a delayed
> interrupt. This patch enables the feature of "override" when a
> user sets rx-usecs to zero, the driver will respect that setting
> over using RSC, and automatically disable RSC. If rx-usecs is
> used to set the EITR value to 0, then the driver should disable
> LRO (aka RSC) internally until EITR is set to non-zero again.
>
> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
^ permalink raw reply
* Re: [net-next-2.6 PATCH] igb: add support for reporting 5GT/s during probe on PCIe Gen2
From: David Miller @ 2010-04-27 19:55 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, alexander.h.duyck
In-Reply-To: <20100427110238.23921.24825.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 04:02:40 -0700
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> This change corrects the fact that we were not reporting Gen2 link speeds
> when we were in fact connected at Gen2 rates.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
^ permalink raw reply
* Re: [net-next-2.6 PATCH 2/2] igbvf: double increment nr_frags
From: David Miller @ 2010-04-27 19:55 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, sanagi.koki
In-Reply-To: <20100427110137.23872.8779.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 04:01:39 -0700
> From: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
>
> There is no need to increment nr_frags because skb_fill_page_desc increments
> it.
>
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
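A stand-alone sketch of why the caller has nothing left to do. `struct frag_info`, `fill_page_desc()` and `add_frag()` are hypothetical names; the helper is modeled on the behavior the patch description relies on, namely that filling slot i also records that i + 1 fragments are in use.

```c
#define MAX_FRAGS 4

/* Hypothetical stand-in for the fragment bookkeeping of a buffer. */
struct frag_info {
	unsigned int nr_frags;
	unsigned int size[MAX_FRAGS];
};

/* Filling slot i also updates the fragment count to i + 1. */
static inline void fill_page_desc(struct frag_info *fi, unsigned int i,
				  unsigned int size)
{
	fi->size[i] = size;
	fi->nr_frags = i + 1;
}

/* Caller: pass the current count as the slot index and do not touch it. */
static inline void add_frag(struct frag_info *fi, unsigned int size)
{
	fill_page_desc(fi, fi->nr_frags, size);
}
```

Each call to add_frag() advances the count by exactly one, so a separate increment in the caller is redundant.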
^ permalink raw reply
* Re: [net-next-2.6 PATCH 1/2] igb: double increment nr_frags
From: David Miller @ 2010-04-27 19:55 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, sanagi.koki
In-Reply-To: <20100427110107.23872.86247.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 04:01:19 -0700
> From: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
>
> There is no need to increment nr_frags because skb_fill_page_desc increments
> it.
>
> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
^ permalink raw reply
* Re: [net-next-2.6 PATCH] ixgb: Use pr_<level> and netdev_<level>
From: David Miller @ 2010-04-27 19:55 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: netdev, gospo, joe
In-Reply-To: <20100427104952.23637.6317.stgit@localhost.localdomain>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Apr 2010 03:50:58 -0700
> From: Joe Perches <joe@perches.com>
>
> Convert DEBUGOUTx to pr_debug
> Convert DEBUGFUNC to more commonly used ENTER
> Convert mac address output to %pM
> Use #define pr_fmt
> Convert a few printks to pr_<level>
> Improve ixgb_mc_addr_list_update: use a temporary for current mc address
> Use etherdevice.h functions for mac address testing
>
> Signed-off-by: Joe Perches <joe@perches.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: fix a lockdep rcu warning in __sk_dst_set()
From: David Miller @ 2010-04-27 19:55 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1272350443.4861.9.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 27 Apr 2010 08:40:43 +0200
> __sk_dst_set() might be called while no state can be integrated in a
> rcu_dereference_check() condition.
>
> So use rcu_dereference_raw() to shut up lockdep warnings (if
> CONFIG_PROVE_RCU is set)
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks Eric.
^ permalink raw reply
* Re: [PATCH v4] TCP: avoid to send keepalive probes if receiving data
From: David Miller @ 2010-04-27 19:55 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: fleitner, netdev, eric.dumazet
In-Reply-To: <alpine.DEB.2.00.1004270906320.13989@melkinpaasi.cs.helsinki.fi>
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 27 Apr 2010 09:08:08 +0300 (EEST)
> On Tue, 27 Apr 2010, Flavio Leitner wrote:
>
>> RFC 1122 says the following:
>> ...
>> Keep-alive packets MUST only be sent when no data or
>> acknowledgement packets have been received for the
>> connection within an interval.
>> ...
>>
>> The acknowledgement packet is resetting the keepalive
>> timer but the data packet isn't. This patch fixes it by
>> checking the timestamp of the last received data packet
>> too when the keepalive timer expires.
>>
>> Signed-off-by: Flavio Leitner <fleitner@redhat.com>
...
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
...
> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Applied, thanks everyone.
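The rule being fixed can be sketched outside the kernel as follows. This is an illustration under assumptions, not the applied patch: `struct conn_times`, its fields and both function names are hypothetical, and time is counted in arbitrary ticks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection receive timestamps, in clock ticks. */
struct conn_times {
	uint32_t last_ack_rx;	/* when the last ACK was received */
	uint32_t last_data_rx;	/* when the last data segment was received */
};

/* Idle time is measured from the most recent of the two events. */
static inline uint32_t keepalive_idle(const struct conn_times *c, uint32_t now)
{
	uint32_t since_ack = now - c->last_ack_rx;
	uint32_t since_data = now - c->last_data_rx;

	return since_ack < since_data ? since_ack : since_data;
}

/* Probe only once nothing at all has arrived for a full interval. */
static inline bool should_send_probe(const struct conn_times *c, uint32_t now,
				     uint32_t interval)
{
	return keepalive_idle(c, now) >= interval;
}
```

With the last ACK at tick 100, the last data segment at tick 900 and an interval of 500, no probe is due at tick 1000. A check that looked only at the ACK timestamp would have sent one.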
^ permalink raw reply
* Re: [PATCH net-next] bridge: multicast router list manipulation
From: David Miller @ 2010-04-27 19:54 UTC (permalink / raw)
To: shemminger; +Cc: herbert, netdev
In-Reply-To: <20100427101311.2f445227@nehalam>
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 27 Apr 2010 10:13:11 -0700
> I prefer that the hlist be only accessed through the hlist macro
> objects. Explicit twiddling of links (especially with RCU) exposes
> the code to future bugs.
>
> Compile tested only.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Yes this by-hand stuff was beyond awful, I'm sorry I didn't catch it
when pulling in these changes initially :-)
Applied, thanks Stephen.
^ permalink raw reply
* Re: [PATCH net-next] bridge: use is_multicast_ether_addr
From: David Miller @ 2010-04-27 19:53 UTC (permalink / raw)
To: shemminger; +Cc: herbert, netdev
In-Reply-To: <20100427101306.3a49104f@nehalam>
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 27 Apr 2010 10:13:06 -0700
> Use existing inline function.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: fix a lockdep rcu warning in __sk_dst_set()
From: David Miller @ 2010-04-27 19:42 UTC (permalink / raw)
To: paulmck; +Cc: eric.dumazet, netdev
In-Reply-To: <20100427161716.GB2424@linux.vnet.ibm.com>
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Tue, 27 Apr 2010 09:17:16 -0700
> On Tue, Apr 27, 2010 at 08:40:43AM +0200, Eric Dumazet wrote:
>> __sk_dst_set() might be called while no state can be integrated in a
>> rcu_dereference_check() condition.
>>
>> So use rcu_dereference_raw() to shut up lockdep warnings (if
>> CONFIG_PROVE_RCU is set)
>
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
I've applied this to net-next-2.6, thanks!
^ permalink raw reply
* Re: [net-next-2.6 PATCH 1/2] Add ndo_set_vf_port_profile
From: Arnd Bergmann @ 2010-04-27 19:38 UTC (permalink / raw)
To: Anirban Chakraborty
Cc: Scott Feldman, Rose, Gregory V, David Miller,
netdev@vger.kernel.org, chrisw@redhat.com, Williams, Mitch A
In-Reply-To: <8966E338-1C9C-43D9-B6A3-A44349E7EE18@qlogic.com>
On Tuesday 27 April 2010 19:33:04 Anirban Chakraborty wrote:
> On Apr 27, 2010, at 5:35 AM, Arnd Bergmann wrote:
> > Anything that ties port profiles to VFs seems fundamentally flawed AFAICT,
> > at least when we want to extend this to adapters that don't do it in firmware.
>
> Correct me if I am wrong. Shouldn't the port profile be tied to the physical NICs which are essentially
> PCI functions (be it PF or VF)? I'd think that a port profile would have configuration settings for all the
> physical NICs (PF/VF) of a specific physical port of the adapter. I liked the idea of querying the device
> for number of VFs as it will cover both SR-IOV and non SR-IOV PCI functions.
Yes, the port profile association is tied to whoever owns the link to the switch.
That can be a regular NIC, an SR-IOV PF, an ethernet bonding device or an S-component
implementing provider S-VLANs on top of any of these.
Usually it will be the same as a physical link, but in case of bonding it is two
physical links while in case of S-VLAN, you have multiple instances that each
have their own set of port profile association. If S-VLAN is implemented by
the NIC, that may be a VF.
Querying a PF for the number of VFs attached to it is a useful thing, but this
is independent of port profiles. Consider this (artificially complex) setup:
- eth0 is the PF of an SR-IOV NIC
- eth1 is a regular single-channel NIC
- vf0 is a VF of eth0, used by a guest using PCI passthrough mode on S-VLAN 2
- vf1 is a VF of eth0 owned by the host on S-VLAN 3
- vf1.23 is a VLAN port for VLAN 23 in S-VLAN 3
- br0 is a bridge connected to vf1
- br23 is a bridge VLAN device for br0
- vf2 is a VF of eth0 owned by the host on S-VLAN 4
- eth1.5 is a software vlan device for S-VLAN 4
- bond0 combines eth1.5 and vf2
- bond0.24 is a VLAN port for VLAN24 on bond0
- tap0 is a guest connected to br0 in trunk mode
- tap1 is a guest connected to br23 in access mode
- macvtap0 is a VEPA mode guest on bond0
- macvtap1 is a private mode guest on bond0.24
This means you have a total of five guests running, on vf0, tap0, tap1,
macvtap0 and macvtap1. Querying the number of VFs on eth0 will return '2',
for vf0 and vf1. What you are interested in however is which guests are
associated. Querying every single interface in the system will tell you
eth0: one guest (vf0)
vf1: two guests (tap0 and tap1)
bond0: two guests (macvtap0 and macvtap1)
Arnd
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 19:30 UTC (permalink / raw)
To: eilong
Cc: Rick Jones, David Miller, therbert@google.com,
netdev@vger.kernel.org
In-Reply-To: <1272393060.30392.2.camel@lb-tlvb-eilong.il.broadcom.com>
On Tuesday, 27 April 2010 at 21:31 +0300, Eilon Greenstein wrote:
> Though the thread is going in a different direction now, I just wanted
> to clarify two things:
> - yes, the 57710 and 57711 only handle the IP (src+dst) for UDP toeplitz
> hash. We all agree that it is much better to address the UDP ports as
> well, but I think Rick Jones explained the process very well - thank you
> Rick. Just to add one more (lame) excuse: the HW was designed before new
> NAPI was introduced and it complies with the requirements from Redmond
> - the next generation (57712) which we already sample does (finally)
> support it. We are working on a patch series to enhance the bnx2x to
> support this device now.
>
Thanks Eilon !
^ permalink raw reply
* Re: [PATCH 0/4] net: ipmr netlink interface for route dumping
From: Patrick McHardy @ 2010-04-27 18:41 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20100427.100345.241441437.davem@davemloft.net>
David Miller wrote:
> Whoa, there are three of you now?!?!?!
>
> :-)
>
That would be nice, I'd have my two clones do all the work :)
Not sure what happened, some mishandling of git send-email
apparently :)
^ permalink raw reply
* Re: vlan performance issue on outgoing traffic
From: Brandeburg, Jesse @ 2010-04-27 18:32 UTC (permalink / raw)
To: R. Weinedel; +Cc: netdev@vger.kernel.org
In-Reply-To: <4BD4C037.2070003@yahoo.de>
On Sun, 25 Apr 2010, R. Weinedel wrote:
> Hello,
>
> I have a performance issue with VLAN interfaces on a Debian Lenny
> server. The problem occurs only on outgoing traffic from the VLAN
> interfaces. They use only half of the available bandwidth (490 Mbit/s
> measured with iperf). Incoming traffic is handled at 950 Mbit/s and is
> fine. The issue remains even with no switch and a direct connection
> between PC and server on the same NIC. Removing the VLANs from eth0
> (on the server) and configuring one net on eth0 results in full speed
> (950 Mbit/s) in both directions. Even another NIC (onboard nvidia3,
> forcedeth module) couldn't solve it. I tested only in the same network
> segment (VLAN) without the need for IP forwarding or NAT, but the issue
> occurs on all my VLANs.
>
> All values were taken with iperf between the server and an Ubuntu 9.04
> workstation (and vice versa). I have verified (with ethtool and stats
> from the switch) that all connections were 1000-BaseT/full duplex. It
> looks like some kind of traffic shaping to me, but I don't use tc, QoS,
> ToS or other priority handling.
> The network is quite simple: one server, one switch and then the
> workstations. No need for cascading or using (R)STP.
>
> Here some information about my network:
>
> Switch: Netgear GSM7224 Layer 2 managed switch, FW 6.2.0.14
> (independent, issue remains on direct connection).
>
> Server: Debian Lenny, kernel 2.6.26-2,
This version of the kernel doesn't support offloads for vlan adapters,
which is probably causing most of your decrease in throughput due to
either exhausting socket buffer size, or because of the round trip time
being so much more relevant when not sending large bursts using TSO.
Sometimes the flood of ACK packets causes higher cpu which could reduce
your throughput also.
The newer kernels will have a major impact on your setup due to a patch
that enabled pass through of hardware offloads to the vlan device's
offload advertisement.
The commit id of the patch is 5fb13570543f4ae022996c9d7c0c099c8abf22dd,
you can view it at:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5fb13570543f4ae022996c9d7c0c099c8abf22dd
> NIC: Intel Corporation 82541PI Gigabit Ethernet Con. (e1000 module).
This PCI adapter is bandwidth limited on the PCI bus, and so will be even
more sensitive to offload on (TSO) vs offload off.
> # ethtool eth0
> Settings for eth0:
> Supported ports: [ TP ]
> Supported link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supports auto-negotiation: Yes
> Advertised link modes: 1000baseT/Full
> Advertised auto-negotiation: Yes
> Speed: 1000Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 0
> Transceiver: internal
> Auto-negotiation: on
> Supports Wake-on: umbg
> Wake-on: g
> Current message level: 0x00000007 (7)
> Link detected: yes
>
> 8021q:
> filename: /lib/modules/2.6.26-2-686/kernel/net/8021q/8021q.ko
> version: 1.8
> license: GPL
> alias: rtnl-link-vlan
> srcversion: A61E1168F65EE335A91D4E1
> depends:
> vermagic: 2.6.26-2-686 SMP mod_unload modversions 686
>
> VLAN: #/proc/net/vlan/config
> VLAN Dev name | VLAN ID
> Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
> eth0.5 | 5 | eth0
> eth0.101 | 101 | eth0
> eth0.90 | 90 | eth0
>
> IFCONFIG:
> eth0 Link encap:Ethernet HWaddr 00:0e:0c:bc:43:43
> inet6 addr: fe80::20e:cff:febc:4343/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:28140829 errors:0 dropped:218 overruns:0 frame:0
> TX packets:44994420 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000
> RX bytes:3472864138 (3.2 GiB) TX bytes:3908682627 (3.6 GiB)
>
> eth0.5 Link encap:Ethernet HWaddr 00:0e:0c:bc:43:43
> inet addr:XXX.YYY.5.1 Bcast:XXX.YYY.5.255 Mask:255.255.255.0
> inet6 addr: fe80::20e:cff:febc:4343/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:77807 errors:0 dropped:0 overruns:0 frame:0
> TX packets:69699 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:57578233 (54.9 MiB) TX bytes:7782844 (7.4 MiB)
>
> eth0.90 Link encap:Ethernet HWaddr 00:0e:0c:bc:43:43
> inet addr:XXX.YYY.90.1 Bcast:XXX.YYY.90.255 Mask:255.255.255.0
> inet6 addr: fe80::20e:cff:febc:4343/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:457850 errors:0 dropped:0 overruns:0 frame:0
> TX packets:913988 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:23824841 (22.7 MiB) TX bytes:1311485281 (1.2 GiB)
>
> eth0.101 Link encap:Ethernet HWaddr 00:0e:0c:bc:43:43
> inet addr:XXX.YYY.101.1 Bcast:XXX.YYY.101.255 Mask:255.255.255.0
> inet6 addr: fe80::20e:cff:febc:4343/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:24856818 errors:0 dropped:0 overruns:0 frame:0
> TX packets:41608593 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:423116676 (403.5 MiB) TX bytes:3855703636 (3.5 GiB)
>
> ROUTE: #route -n
> Destination     Gateway         Genmask         Flags Metric Ref Use Iface
> XXX.YYY.101.0   0.0.0.0         255.255.255.0   U     0      0   0   eth0.101
> XXX.YYY.5.0     0.0.0.0         255.255.255.0   U     0      0   0   eth0.5
> XXX.YYY.90.0    0.0.0.0         255.255.255.0   U     0      0   0   eth0.90
> 0.0.0.0         192.168.5.4     0.0.0.0         UG    0      0   0   eth0.5
>
> Can someone give me a hint as to where I should continue my search
> for a solution?
>
> Many thanks !
> Regards
> Ralf Weinedel
> Falkensee/Germany
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Eilon Greenstein @ 2010-04-27 18:31 UTC (permalink / raw)
To: Rick Jones, David Miller, therbert@google.com,
eric.dumazet@gmail.com
Cc: netdev@vger.kernel.org
In-Reply-To: <4BD601C3.5030108@hp.com>
On Mon, 2010-04-26 at 14:12 -0700, Rick Jones wrote:
> David Miller wrote:
> > From: Rick Jones <rick.jones2@hp.com>
> > Date: Mon, 26 Apr 2010 13:48:22 -0700
> >
> >>Do not confuse explanation with endorsement.
> >
> > Ok, fair enough.
> >
> > But I don't see even the "other perspective" argument being even
> > valid. Big shops still use UDP and it has to scale.
>
> Preface - I too think it is massively stupid to ignore anything but TCP/IPv4,
> and unwise to ignore IPv6 and so on, but there is a very real reason why one of
> my email signatures reads:
>
> "The road to hell is paved with business decisions"
>
> > Or have they made multicast magically start working with TCP so
> > they can use it to do trades on the NASDAQ?
>
> No. How many NIC chips can NASDAQ be expected to move? 0.1%? or even 1% of the
> NIC chip market?
>
> How many more NIC chips are in places where someone says "You sold me on
> iSCSI/FCoE/whatnot, why can't I get 'link-rate' to/from iSCSI storage/whatnot?!"
>
> The NIC designer is there with his finance guys breathing down his neck shouting
> "ROI Uber Alles!" and "Your budget is only this many monetary units!" The
> system designers at the system vendors are hearing the same things from their
> own finance guys, have certain schedules, which then has them going to the NIC
> firms, who want to sell chips to the system guys "You have to be ready to ship
> by this date and your chip has to sell for no more than this."
>
> Lather, rinse, repeat a few times and you get compromises on top of compromises.
>
> Sometimes I think it is a wonder any of it actually works at all...
>
> rick jones
Though the thread is going in a different direction now, I just wanted
to clarify two things:
- yes, the 57710 and 57711 only handle the IP (src+dst) for UDP toeplitz
hash. We all agree that it is much better to address the UDP ports as
well, but I think Rick Jones explained the process very well - thank you
Rick. Just to add one more (lame) excuse: the HW was designed before new
NAPI was introduced and it complies with the requirements from Redmond
- the next generation (57712) which we already sample does (finally)
support it. We are working on a patch series to enhance the bnx2x to
support this device now.
Eilon
^ permalink raw reply
* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Miles Lane @ 2010-04-27 17:58 UTC (permalink / raw)
To: paulmck
Cc: Eric W. Biederman, Vivek Goyal, Eric Paris, Lai Jiangshan,
Ingo Molnar, Peter Zijlstra, LKML, nauman, eric.dumazet, netdev,
Jens Axboe, Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <20100427162201.GA5826@linux.vnet.ibm.com>
On Tue, Apr 27, 2010 at 12:22 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Apr 26, 2010 at 09:27:44PM -0700, Paul E. McKenney wrote:
>> On Mon, Apr 26, 2010 at 11:35:10AM -0700, Eric W. Biederman wrote:
>> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>> >
>> > > Eric Dumazet traced these down to a commit from Eric Biederman.
>> > >
>> > > If I don't hear from Eric Biederman in a few days, I will attempt a
>> > > patch, but it would be more likely to be correct coming from someone
>> > > with a better understanding of the code. ;-)
>> >
>> > I already replied.
>> >
>> > http://lkml.org/lkml/2010/4/21/420
>>
>> You did indeed!!! This experience is giving me an even better appreciation
>> of the maintainers' ability to keep all their patches straight!
>>
>> I will put together something based on your suggestion.
>
> How about the following?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 85fa42bd568ab99c375f018761ae6345249942cd
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date: Mon Apr 26 21:40:05 2010 -0700
>
> net: suppress RCU lockdep false positive in twsk_net()
>
> Calls to twsk_net() are in some cases protected by reference counting
> as an alternative to RCU protection. Cases covered by reference counts
> include __inet_twsk_kill(), inet_twsk_free(), inet_twdr_do_twkill_work(),
> inet_twdr_twcal_tick(), and tcp_timewait_state_process(). RCU is used
> by inet_twsk_purge(). Locking is used by established_get_first()
> and established_get_next(). Finally, __inet_twsk_hashdance() is an
> initialization case.
>
> It appears to be non-trivial to locate the appropriate locks and
> reference counts from within twsk_net(), so used rcu_dereference_raw().
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index 79f67ea..a066fdd 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -224,7 +224,9 @@ static inline
> struct net *twsk_net(const struct inet_timewait_sock *twsk)
> {
> #ifdef CONFIG_NET_NS
> - return rcu_dereference(twsk->tw_net);
> + return rcu_dereference_raw(twsk->tw_net); /* protected by locking, */
> + /* reference counting, */
> + /* initialization, or RCU. */
> #else
> return &init_net;
> #endif
>
Worked for me. Thanks!
Miles
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 17:37 UTC (permalink / raw)
To: David Miller; +Cc: bmb, therbert, netdev, rick.jones2
In-Reply-To: <20100427.102038.57469310.davem@davemloft.net>
On Tuesday, 27 April 2010 at 10:20 -0700, David Miller wrote:
>
> Indeed, a huge issue, in that we haven't converted the UDP hash over
> to RCU yet :-)
>
I am not sure what you mean, UDP hash _is_ RCU converted ;)
> But because of the transient bind nature of UDP there are still a bunch
> of cases that won't even cure.
> --
We might use the ticket spinlock paradigm to let writers go in parallel
and let the user take the socket lock.
Instead of having the bh_lock_sock() to protect receive_queue *and*
backlog, writers get a unique slot in a table, that 'user' can handle
later.
Or serialize writers (before they try to bh_lock_sock()) with a
dedicated lock, so that user has 50% chances to get the sock lock,
contending with at most one writer.
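For readers unfamiliar with the ticket paradigm mentioned above, here is a minimal user-space sketch of a classic ticket lock using C11 atomics. It only illustrates the general idea of handing each arrival a unique, ordered number; it is not the socket locking change being proposed, and `struct ticket_lock` and its function names are hypothetical.

```c
#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint serving;	/* ticket currently allowed in */
};

/* Returns the ticket number that was drawn, once it is being served. */
static inline unsigned int ticket_lock_acquire(struct ticket_lock *l)
{
	/* Drawing a ticket is one atomic add, so each arrival gets a unique number. */
	unsigned int me = atomic_fetch_add(&l->next, 1);

	/* Spin until it is our turn; tickets are served strictly in order. */
	while (atomic_load(&l->serving) != me)
		;
	return me;
}

static inline void ticket_lock_release(struct ticket_lock *l)
{
	/* Admit the holder of the next ticket. */
	atomic_fetch_add(&l->serving, 1);
}
```

The ticket number is the part that could double as a unique slot index in a table, which is the property the first suggestion relies on.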
^ permalink raw reply
* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-27 17:36 UTC (permalink / raw)
To: Tom Herbert; +Cc: David Miller, bmb, netdev, rick.jones2
In-Reply-To: <g2k65634d661004271031r2eb2000bxc30013009509c410@mail.gmail.com>
On Tuesday, 27 April 2010 at 10:31 -0700, Tom Herbert wrote:
> This is the problem that we are addressing with so_reuseport!
How are standard applications protected against a DDoS?
^ permalink raw reply