* [net-next 0/3] Improve UDP multicast receive latency
From: Shawn Bohrer @ 2013-10-01 19:33 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, tomk, netdev, Shawn Bohrer
The removal of the routing cache in 3.6 had impacted the latency of our
UDP multicast workload. This patch series brings down the latency to
what we were seeing with 3.4.
Patch 1 "udp: Only allow busy read/poll on connected sockets" is mostly
done for correctness and because it allows unifying the unicast and
multicast paths when a socket is found in early demux. It can also
improve latency for a connected multicast socket if busy read/poll is
used.
Patches 2&3 remove the fib lookups and restore latency for our workload
to the pre 3.6 levels.
Benchmark results from a netperf UDP_RR test:
3.11 kernel 90596.44 transactions/s
3.11 + series 91792.70 transactions/s
Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
3.11 kernel 12.647us RTT
3.11 + series 12.233us RTT
Shawn Bohrer (3):
udp: Only allow busy read/poll on connected sockets
udp: Add udp early demux
net: ipv4 only populate IP_PKTINFO when needed
include/net/ip.h | 2 +-
include/net/sock.h | 2 +-
include/net/udp.h | 1 +
net/ipv4/af_inet.c | 1 +
net/ipv4/ip_sockglue.c | 5 +-
net/ipv4/raw.c | 2 +-
net/ipv4/udp.c | 160 +++++++++++++++++++++++++++++++++++++++++------
net/ipv6/udp.c | 5 +-
8 files changed, 150 insertions(+), 28 deletions(-)
--
1.7.7.6
--
---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.
^ permalink raw reply
* [net-next 1/3] udp: Only allow busy read/poll on connected sockets
From: Shawn Bohrer @ 2013-10-01 19:33 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, tomk, netdev, Shawn Bohrer
In-Reply-To: <1380656025-8847-1-git-send-email-sbohrer@rgmadvisors.com>
UDP sockets can receive packets from multiple endpoints and thus may be
received on multiple receive queues. Since packets packets can arrive
on multiple receive queues we should not mark the napi_id for all
packets. This makes busy read/poll only work for connected UDP sockets.
This additionally enables busy read/poll for UDP multicast packets as
long as the socket is connected by moving the check into
__udp_queue_rcv_skb().
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
---
net/ipv4/udp.c | 5 +++--
net/ipv6/udp.c | 5 +++--
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 728ce95..1982a03 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1405,8 +1405,10 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
int rc;
- if (inet_sk(sk)->inet_daddr)
+ if (inet_sk(sk)->inet_daddr) {
sock_rps_save_rxhash(sk, skb);
+ sk_mark_napi_id(sk, skb);
+ }
rc = sock_queue_rcv_skb(sk, skb);
if (rc < 0) {
@@ -1716,7 +1718,6 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
if (sk != NULL) {
int ret;
- sk_mark_napi_id(sk, skb);
ret = udp_queue_rcv_skb(sk, skb);
sock_put(sk);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index f405815..84e18ab 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -549,8 +549,10 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
int rc;
- if (!ipv6_addr_any(&inet6_sk(sk)->daddr))
+ if (!ipv6_addr_any(&inet6_sk(sk)->daddr)) {
sock_rps_save_rxhash(sk, skb);
+ sk_mark_napi_id(sk, skb);
+ }
rc = sock_queue_rcv_skb(sk, skb);
if (rc < 0) {
@@ -844,7 +846,6 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
if (sk != NULL) {
int ret;
- sk_mark_napi_id(sk, skb);
ret = udpv6_queue_rcv_skb(sk, skb);
sock_put(sk);
--
1.7.7.6
--
---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.
^ permalink raw reply related
* [net-next 3/3] net: ipv4 only populate IP_PKTINFO when needed
From: Shawn Bohrer @ 2013-10-01 19:33 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, tomk, netdev, Shawn Bohrer
In-Reply-To: <1380656025-8847-1-git-send-email-sbohrer@rgmadvisors.com>
The since the removal of the routing cache computing
fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
packet received. This has introduced a performance regression for some
UDP workloads.
This change skips populating the packet info for sockets that do not have
IP_PKTINFO set.
Benchmark results from a netperf UDP_RR test:
Before 91296.97 transactions/s
After 91792.70 transactions/s
Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.647us RTT
After 12.233us RTT
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
---
include/net/ip.h | 2 +-
net/ipv4/ip_sockglue.c | 5 +++--
net/ipv4/raw.c | 2 +-
net/ipv4/udp.c | 2 +-
4 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/net/ip.h b/include/net/ip.h
index 16078f4..bc98241 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -459,7 +459,7 @@ int ip_options_rcv_srr(struct sk_buff *skb);
* Functions provided by ip_sockglue.c
*/
-void ipv4_pktinfo_prepare(struct sk_buff *skb);
+void ipv4_pktinfo_prepare(struct sock *sk, struct sk_buff *skb);
void ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb);
int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc);
int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval,
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 56e3445..dda9866 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -1052,11 +1052,12 @@ e_inval:
* destination in skb->cb[] before dst drop.
* This way, receiver doesnt make cache line misses to read rtable.
*/
-void ipv4_pktinfo_prepare(struct sk_buff *skb)
+void ipv4_pktinfo_prepare(struct sock *sk, struct sk_buff *skb)
{
struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
- if (skb_rtable(skb)) {
+ if ((inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO) &&
+ skb_rtable(skb)) {
pktinfo->ipi_ifindex = inet_iif(skb);
pktinfo->ipi_spec_dst.s_addr = fib_compute_spec_dst(skb);
} else {
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index a3fe534..28694f8 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -297,7 +297,7 @@ static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
/* Charge it to the socket. */
- ipv4_pktinfo_prepare(skb);
+ ipv4_pktinfo_prepare(sk, skb);
if (sock_queue_rcv_skb(sk, skb) < 0) {
kfree_skb(skb);
return NET_RX_DROP;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca54886..02185a5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1543,7 +1543,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
rc = 0;
- ipv4_pktinfo_prepare(skb);
+ ipv4_pktinfo_prepare(sk, skb);
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
rc = __udp_queue_rcv_skb(sk, skb);
--
1.7.7.6
--
---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.
^ permalink raw reply related
* [net-next 2/3] udp: Add udp early demux
From: Shawn Bohrer @ 2013-10-01 19:33 UTC (permalink / raw)
To: David Miller; +Cc: Eric Dumazet, tomk, netdev, Shawn Bohrer
In-Reply-To: <1380656025-8847-1-git-send-email-sbohrer@rgmadvisors.com>
The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.
For UDP multicast we can only cache the dst if there is only one
receiving socket on the host. Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.
Benchmark results from a netperf UDP_RR test:
Before 90596.44 transactions/s
After 91296.97 transactions/s
Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.647us RTT
After 12.497us RTT
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
---
include/net/sock.h | 2 +-
include/net/udp.h | 1 +
net/ipv4/af_inet.c | 1 +
net/ipv4/udp.c | 153 +++++++++++++++++++++++++++++++++++++++++++++------
4 files changed, 138 insertions(+), 19 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index cf91c8e..46661dd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -218,7 +218,7 @@ struct cg_proto;
* @sk_lock: synchronizer
* @sk_rcvbuf: size of receive buffer in bytes
* @sk_wq: sock wait queue and async head
- * @sk_rx_dst: receive input route used by early tcp demux
+ * @sk_rx_dst: receive input route used by early demux
* @sk_dst_cache: destination cache
* @sk_dst_lock: destination cache lock
* @sk_policy: flow policy
diff --git a/include/net/udp.h b/include/net/udp.h
index 510b8cb..fe4ba9f 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -175,6 +175,7 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
unsigned int hash2_nulladdr);
/* net/ipv4/udp.c */
+void udp_v4_early_demux(struct sk_buff *skb);
int udp_get_port(struct sock *sk, unsigned short snum,
int (*saddr_cmp)(const struct sock *,
const struct sock *));
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 7a1874b..3539ddf 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1548,6 +1548,7 @@ static const struct net_protocol tcp_protocol = {
};
static const struct net_protocol udp_protocol = {
+ .early_demux = udp_v4_early_demux,
.handler = udp_rcv,
.err_handler = udp_err,
.no_policy = 1,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1982a03..ca54886 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -565,6 +565,26 @@ struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
}
EXPORT_SYMBOL_GPL(udp4_lib_lookup);
+static inline bool __udp_is_mcast_sock(struct net *net, struct sock *sk,
+ __be16 loc_port, __be32 loc_addr,
+ __be16 rmt_port, __be32 rmt_addr,
+ int dif, unsigned short hnum)
+{
+ struct inet_sock *inet = inet_sk(sk);
+
+ if (!net_eq(sock_net(sk), net) ||
+ udp_sk(sk)->udp_port_hash != hnum ||
+ (inet->inet_daddr && inet->inet_daddr != rmt_addr) ||
+ (inet->inet_dport != rmt_port && inet->inet_dport) ||
+ (inet->inet_rcv_saddr && inet->inet_rcv_saddr != loc_addr) ||
+ ipv6_only_sock(sk) ||
+ (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif))
+ return false;
+ if (!ip_mc_sf_allow(sk, loc_addr, rmt_addr, dif))
+ return false;
+ return true;
+}
+
static inline struct sock *udp_v4_mcast_next(struct net *net, struct sock *sk,
__be16 loc_port, __be32 loc_addr,
__be16 rmt_port, __be32 rmt_addr,
@@ -575,20 +595,11 @@ static inline struct sock *udp_v4_mcast_next(struct net *net, struct sock *sk,
unsigned short hnum = ntohs(loc_port);
sk_nulls_for_each_from(s, node) {
- struct inet_sock *inet = inet_sk(s);
-
- if (!net_eq(sock_net(s), net) ||
- udp_sk(s)->udp_port_hash != hnum ||
- (inet->inet_daddr && inet->inet_daddr != rmt_addr) ||
- (inet->inet_dport != rmt_port && inet->inet_dport) ||
- (inet->inet_rcv_saddr &&
- inet->inet_rcv_saddr != loc_addr) ||
- ipv6_only_sock(s) ||
- (s->sk_bound_dev_if && s->sk_bound_dev_if != dif))
- continue;
- if (!ip_mc_sf_allow(s, loc_addr, rmt_addr, dif))
- continue;
- goto found;
+ if (__udp_is_mcast_sock(net, s,
+ loc_port, loc_addr,
+ rmt_port, rmt_addr,
+ dif, hnum))
+ goto found;
}
s = NULL;
found:
@@ -1581,6 +1592,14 @@ static void flush_stack(struct sock **stack, unsigned int count,
kfree_skb(skb1);
}
+static void udp_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+{
+ struct dst_entry *dst = skb_dst(skb);
+
+ dst_hold(dst);
+ sk->sk_rx_dst = dst;
+}
+
/*
* Multicasts and broadcasts go to each listener.
*
@@ -1709,11 +1728,28 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
if (udp4_csum_init(skb, uh, proto))
goto csum_error;
- if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
- return __udp4_lib_mcast_deliver(net, skb, uh,
- saddr, daddr, udptable);
+ if (skb->sk) {
+ int ret;
+ sk = skb->sk;
+
+ if (unlikely(sk->sk_rx_dst == NULL))
+ udp_sk_rx_dst_set(sk, skb);
+
+ ret = udp_queue_rcv_skb(sk, skb);
+
+ /* a return value > 0 means to resubmit the input, but
+ * it wants the return to be -protocol, or 0
+ */
+ if (ret > 0)
+ return -ret;
+ return 0;
+ } else {
+ if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
+ return __udp4_lib_mcast_deliver(net, skb, uh,
+ saddr, daddr, udptable);
- sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
+ sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
+ }
if (sk != NULL) {
int ret;
@@ -1771,6 +1807,87 @@ drop:
return 0;
}
+/* We can only early demux multicast if there is a single matching socket.
+ * If more than one socket found returns NULL
+ */
+static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net,
+ __be16 loc_port, __be32 loc_addr,
+ __be16 rmt_port, __be32 rmt_addr,
+ int dif)
+{
+ struct sock *sk, *result;
+ struct hlist_nulls_node *node;
+ unsigned short hnum = ntohs(loc_port);
+ unsigned int count, slot = udp_hashfn(net, hnum, udp_table.mask);
+ struct udp_hslot *hslot = &udp_table.hash[slot];
+
+ rcu_read_lock();
+begin:
+ count = 0;
+ result = NULL;
+ sk_nulls_for_each_rcu(sk, node, &hslot->head) {
+ if (__udp_is_mcast_sock(net, sk,
+ loc_port, loc_addr,
+ rmt_port, rmt_addr,
+ dif, hnum)) {
+ result = sk;
+ ++count;
+ }
+ }
+ /*
+ * if the nulls value we got at the end of this lookup is
+ * not the expected one, we must restart lookup.
+ * We probably met an item that was moved to another chain.
+ */
+ if (get_nulls_value(node) != slot)
+ goto begin;
+
+ if (result) {
+ if (count != 1 ||
+ unlikely(!atomic_inc_not_zero_hint(&result->sk_refcnt, 2)))
+ result = NULL;
+ }
+ rcu_read_unlock();
+ return result;
+
+}
+
+void udp_v4_early_demux(struct sk_buff *skb)
+{
+ const struct iphdr *iph = ip_hdr(skb);
+ const struct udphdr *uh = udp_hdr(skb);
+ struct sock *sk;
+ struct dst_entry *dst;
+ struct net *net = dev_net(skb->dev);
+ int dif = skb->dev->ifindex;
+
+ /* validate the packet */
+ if (!pskb_may_pull(skb, skb_transport_offset(skb) + sizeof(struct udphdr)))
+ return;
+
+ if (skb->pkt_type == PACKET_BROADCAST ||
+ skb->pkt_type == PACKET_MULTICAST)
+ sk = __udp4_lib_mcast_demux_lookup(net, uh->dest, iph->daddr,
+ uh->source, iph->saddr, dif);
+ else if (skb->pkt_type == PACKET_HOST)
+ sk = __udp4_lib_lookup(net, iph->saddr, uh->source,
+ iph->daddr, uh->dest, dif, &udp_table);
+ else
+ return;
+
+ if (!sk)
+ return;
+
+ skb->sk = sk;
+ skb->destructor = sock_edemux;
+ dst = sk->sk_rx_dst;
+
+ if (dst)
+ dst = dst_check(dst, 0);
+ if (dst)
+ skb_dst_set_noref(skb, dst);
+}
+
int udp_rcv(struct sk_buff *skb)
{
return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
--
1.7.7.6
--
---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.
^ permalink raw reply related
* Re: Big performance loss from 3.4.63 to 3.10.13 when routing ipv4
From: Wolfgang Walter @ 2013-10-01 19:44 UTC (permalink / raw)
To: Brian Haley; +Cc: netdev
In-Reply-To: <524B1B1D.3050304@hp.com>
Am Dienstag, 1. Oktober 2013, 14:57:33 schrieb Brian Haley:
> On 10/01/2013 12:39 PM, Wolfgang Walter wrote:
> > Hello,
> >
> > I tried to upgrade one of our routers to 3.10.13 from 3.4.63 and I see a
> > dramatic performance loss. I tried 3.11.2 and it is still there.
> >
> > *** Symptoms:
> >
> > All network traffic over the router become slow and sluggish. If one pings
> > the router there is a packet loss. After about 2 minutes the traffic
> > completely stalls for about 1 minute. Then it works again as in the
> > beginning to then stall again. And so on.
> >
> > This happens even with rather moderate traffic. While still routing the
> > CPU
> > utilization is higher than it is with 3.4.63 but only moderately.
> >
> > When it stalls no network traffic seems possible (but to loopback). If one
> > tries to ping from the router any target (even if it is on a interface
> > with no>
> > traffic at all) one gets:
> > ping: sendmsg: No buffer space available
>
> dmesg show anything? I've seen this happen when the neighbour table is
> full, but that's not a typical occurence.
>
Nothing is logged.
No, the neighbour table is not full. At least is does not contain more entries
then usually and not more then under 3.4 (maybe 100 or so).
I also checked the number of entries in the conntrack table and things like
that: all normal.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply
* Re: [PATCH net RFC 2/2] ixgbe: fix sleep bug caused by napi_disable inside local_bh_disable()d context
From: Keller, Jacob E @ 2013-10-01 20:05 UTC (permalink / raw)
To: Yuval Mintz
Cc: Francois Romieu, netdev@vger.kernel.org, Duyck, Alexander H,
Hyong-Youb Kim, Dmitry Kravkov, Amir Vadai, Eliezer Tamir
In-Reply-To: <979A8436335E3744ADCD3A9F2A2B68A52ADFF9EE@SJEXCHMB10.corp.ad.broadcom.com>
On Tue, 2013-10-01 at 12:11 +0000, Yuval Mintz wrote:
> > > > I have to move the local_bh_disable in order to put napi_disable
> > > outside
> > > > of the call since napi_disable could sleep, causing a scheduling while
> > > > atomic BUG.
> > >
> > > I am in violent agreement with this part.
> > > --
> > > Ueimor
> >
> > Regards,
> > Jake
> > --
>
> It seem like we've hit the same issue with the bnx2x driver.
> Is there anything new about the RFC?
>
> Thanks,
> Yuval
>
>
>
>
>
The napi_disable call for might_sleep() is the same. The solution in the
ixgbe driver is different. I completely re-wrote the segment about how
to disable the q_vector by adding a new state, rather than abusing the
QV_LOCKED_NAPI state. In addition I refactored it so that the
qv_lock_napi used spin_lock_bh() instead of plain spin_lock(), and
changed it so that we didn't need the local_bh_disable() call in
ixgbe_napi_disable_all.
This is a much cleaner solution than what I originally proposed.
Regards,
Jake
^ permalink raw reply
* Re: [PATCH net RFC 2/2] ixgbe: fix sleep bug caused by napi_disable inside local_bh_disable()d context
From: Keller, Jacob E @ 2013-10-01 20:08 UTC (permalink / raw)
To: Yuval Mintz
Cc: Francois Romieu, netdev@vger.kernel.org, Duyck, Alexander H,
Hyong-Youb Kim, Dmitry Kravkov, Amir Vadai, Eliezer Tamir
In-Reply-To: <979A8436335E3744ADCD3A9F2A2B68A52ADFF9EE@SJEXCHMB10.corp.ad.broadcom.com>
On Tue, 2013-10-01 at 12:11 +0000, Yuval Mintz wrote:
> > > > I have to move the local_bh_disable in order to put napi_disable
> > > outside
> > > > of the call since napi_disable could sleep, causing a scheduling while
> > > > atomic BUG.
> > >
> > > I am in violent agreement with this part.
> > > --
> > > Ueimor
> >
> > Regards,
> > Jake
> > --
>
> It seem like we've hit the same issue with the bnx2x driver.
> Is there anything new about the RFC?
>
> Thanks,
> Yuval
>
>
>
>
>
FYI, the new patch is currently going through internal validation here,
and once Jeff Kirsher mails it, I will ensure that you are Cc'ed on it.
Regards,
Jake
^ permalink raw reply
* Re: [net-next 2/3] udp: Add udp early demux
From: Rick Jones @ 2013-10-01 20:12 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: David Miller, Eric Dumazet, tomk, netdev
In-Reply-To: <1380656025-8847-3-git-send-email-sbohrer@rgmadvisors.com>
On 10/01/2013 12:33 PM, Shawn Bohrer wrote:
> The removal of the routing cache introduced a performance regression for
> some UDP workloads since a dst lookup must be done for each packet.
> This change caches the dst per socket in a similar manner to what we do
> for TCP by implementing early_demux.
>
> For UDP multicast we can only cache the dst if there is only one
> receiving socket on the host. Since caching only works when there is
> one receiving socket we do the multicast socket lookup using RCU.
>
> Benchmark results from a netperf UDP_RR test:
> Before 90596.44 transactions/s
> After 91296.97 transactions/s
Were those measured with confidence intervals enabled? It would be a
Good Idea (tm) to either use that - I would suggest -I 99,1 -i 30,3
added to the global portion of the netperf command line - or take
several runs. (If you've not already done so since those look more like
"raw" netperf numbers rather than theaverage of several runs)
happy benchmarking,
rick jones
> Benchmark results from a fio 1 byte UDP multicast pingpong test
> (Multicast one way unicast response):
> Before 12.647us RTT
> After 12.497us RTT
^ permalink raw reply
* Re: [PATCH 1/3] tg3: add support a phy at an address different than 01
From: David Miller @ 2013-10-01 20:12 UTC (permalink / raw)
To: hauke; +Cc: nsujir, mchan, netdev
In-Reply-To: <1380402928-11480-1-git-send-email-hauke@hauke-m.de>
Can I get a review from the Broadcom folks for this series please?
Thank you.
^ permalink raw reply
* Re: [net-next 0/3] Improve UDP multicast receive latency
From: Veaceslav Falico @ 2013-10-01 20:21 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: David Miller, Eric Dumazet, tomk, netdev
In-Reply-To: <1380656025-8847-1-git-send-email-sbohrer@rgmadvisors.com>
On Tue, Oct 01, 2013 at 02:33:42PM -0500, Shawn Bohrer wrote:
>The removal of the routing cache in 3.6 had impacted the latency of our
>UDP multicast workload. This patch series brings down the latency to
>what we were seeing with 3.4.
>
>Patch 1 "udp: Only allow busy read/poll on connected sockets" is mostly
>done for correctness and because it allows unifying the unicast and
>multicast paths when a socket is found in early demux. It can also
>improve latency for a connected multicast socket if busy read/poll is
>used.
>
>Patches 2&3 remove the fib lookups and restore latency for our workload
>to the pre 3.6 levels.
>
>Benchmark results from a netperf UDP_RR test:
>3.11 kernel 90596.44 transactions/s
>3.11 + series 91792.70 transactions/s
>
>Benchmark results from a fio 1 byte UDP multicast pingpong test
>(Multicast one way unicast response):
>3.11 kernel 12.647us RTT
>3.11 + series 12.233us RTT
>
>Shawn Bohrer (3):
> udp: Only allow busy read/poll on connected sockets
> udp: Add udp early demux
> net: ipv4 only populate IP_PKTINFO when needed
>
> include/net/ip.h | 2 +-
> include/net/sock.h | 2 +-
> include/net/udp.h | 1 +
> net/ipv4/af_inet.c | 1 +
> net/ipv4/ip_sockglue.c | 5 +-
> net/ipv4/raw.c | 2 +-
> net/ipv4/udp.c | 160 +++++++++++++++++++++++++++++++++++++++++------
> net/ipv6/udp.c | 5 +-
> 8 files changed, 150 insertions(+), 28 deletions(-)
>
>--
>1.7.7.6
>
>
>--
>
>---------------------------------------------------------------
>This email, along with any attachments, is confidential. If you
>believe you received this message in error, please contact the
>sender immediately and delete all copies of the message.
>Thank you.
It's not a good idea to send patches with that kind of footer, afaik.
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH net-next] xen-netfront: convert to GRO API
From: Anirban Chakraborty @ 2013-10-01 20:32 UTC (permalink / raw)
To: Ian Campbell
Cc: Wei Liu, <netdev@vger.kernel.org>,
<xen-devel@lists.xen.org>, <konrad.wilk@oracle.com>
In-Reply-To: <1380547472.1302.45.camel@kazak.uk.xensource.com>
On Sep 30, 2013, at 6:24 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Mon, 2013-09-30 at 13:46 +0100, Wei Liu wrote:
>> Anirban was seeing netfront received MTU size packets, which downgraded
>> throughput. The following patch makes netfront use GRO API which
>> improves throughput for that case.
>>
>> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
>> Cc: Anirban Chakraborty <abchak@juniper.net>
>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
>
>> Acked-by: Konrad Wilk <konrad.wilk@oracle.com>
>> ---
>> drivers/net/xen-netfront.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> index 36808bf..dd1011e 100644
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -952,7 +952,7 @@ static int handle_incoming_queue(struct net_device *dev,
>> u64_stats_update_end(&stats->syncp);
>>
>> /* Pass it up. */
>> - netif_receive_skb(skb);
>> + napi_gro_receive(&np->napi, skb);
>> }
>>
>> return packets_dropped;
>> @@ -1051,6 +1051,8 @@ err:
>> if (work_done < budget) {
>> int more_to_do = 0;
>>
>> + napi_gro_flush(napi, false);
>> +
>> local_irq_save(flags);
>>
>> RING_FINAL_CHECK_FOR_RESPONSES(&np->rx, more_to_do);
>
We should apply it to stable as well. Thanks.
Anirban
^ permalink raw reply
* Re: [net-next 3/3] net: ipv4 only populate IP_PKTINFO when needed
From: Eric Dumazet @ 2013-10-01 20:42 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: David Miller, tomk, netdev
In-Reply-To: <1380656025-8847-4-git-send-email-sbohrer@rgmadvisors.com>
On Tue, 2013-10-01 at 14:33 -0500, Shawn Bohrer wrote:
> -void ipv4_pktinfo_prepare(struct sk_buff *skb)
> +void ipv4_pktinfo_prepare(struct sock *sk, struct sk_buff *skb)
Seems good to me, could you use :
void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb)
^ permalink raw reply
* Re: [net-next 1/3] udp: Only allow busy read/poll on connected sockets
From: Eric Dumazet @ 2013-10-01 20:44 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: David Miller, tomk, netdev
In-Reply-To: <1380656025-8847-2-git-send-email-sbohrer@rgmadvisors.com>
On Tue, 2013-10-01 at 14:33 -0500, Shawn Bohrer wrote:
> UDP sockets can receive packets from multiple endpoints and thus may be
> received on multiple receive queues. Since packets packets can arrive
> on multiple receive queues we should not mark the napi_id for all
> packets. This makes busy read/poll only work for connected UDP sockets.
>
> This additionally enables busy read/poll for UDP multicast packets as
> long as the socket is connected by moving the check into
> __udp_queue_rcv_skb().
>
> Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
> ---
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Thanks for following up, it seems I forgot to submit this patch ;)
^ permalink raw reply
* Re: [net-next 2/3] udp: Add udp early demux
From: Eric Dumazet @ 2013-10-01 20:52 UTC (permalink / raw)
To: Shawn Bohrer; +Cc: David Miller, tomk, netdev
In-Reply-To: <1380656025-8847-3-git-send-email-sbohrer@rgmadvisors.com>
On Tue, 2013-10-01 at 14:33 -0500, Shawn Bohrer wrote:
> The removal of the routing cache introduced a performance regression for
> some UDP workloads since a dst lookup must be done for each packet.
> This change caches the dst per socket in a similar manner to what we do
> for TCP by implementing early_demux.
>
> For UDP multicast we can only cache the dst if there is only one
> receiving socket on the host. Since caching only works when there is
> one receiving socket we do the multicast socket lookup using RCU.
For unicast, we should find a matching socket for early demux only if
this is a connected socket.
Otherwise, forwarding setups will break.
You probably need to add a minimum score to __udp4_lib_lookup()
^ permalink raw reply
* Re: [PATCH 1/3] tg3: add support a phy at an address different than 01
From: Nithin Nayak Sujir @ 2013-10-01 21:11 UTC (permalink / raw)
To: Hauke Mehrtens; +Cc: davem, mchan, netdev
In-Reply-To: <1380402928-11480-1-git-send-email-hauke@hauke-m.de>
On 09/28/2013 02:15 PM, Hauke Mehrtens wrote:
> When phylib was in use tg3 only searched at address 01 on the mdio
> bus and did not work with any other address. On the BCM4705 SoCs the
> switch is connected as a PHY behind the MAC driven by tg3 and it is at
> PHY address 30 in most cases. This is a preparation patch to allow
> support for such switches.
>
> phy_addr is set to TG3_PHY_MII_ADDR for all devices, which are using
> phylib, so this should not change any behavior.
>
> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
> ---
Acked-by: Nithin Nayak Sujir <nsujir@broadcom.com>
> drivers/net/ethernet/broadcom/tg3.c | 38 +++++++++++++++++------------------
> 1 file changed, 19 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 221a181..853a05e 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -1375,7 +1375,7 @@ static int tg3_mdio_read(struct mii_bus *bp, int mii_id, int reg)
>
> spin_lock_bh(&tp->lock);
>
> - if (tg3_readphy(tp, reg, &val))
> + if (__tg3_readphy(tp, mii_id, reg, &val))
> val = -EIO;
>
> spin_unlock_bh(&tp->lock);
> @@ -1390,7 +1390,7 @@ static int tg3_mdio_write(struct mii_bus *bp, int mii_id, int reg, u16 val)
>
> spin_lock_bh(&tp->lock);
>
> - if (tg3_writephy(tp, reg, val))
> + if (__tg3_writephy(tp, mii_id, reg, val))
> ret = -EIO;
>
> spin_unlock_bh(&tp->lock);
> @@ -1408,7 +1408,7 @@ static void tg3_mdio_config_5785(struct tg3 *tp)
> u32 val;
> struct phy_device *phydev;
>
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
> switch (phydev->drv->phy_id & phydev->drv->phy_id_mask) {
> case PHY_ID_BCM50610:
> case PHY_ID_BCM50610M:
> @@ -1533,7 +1533,7 @@ static int tg3_mdio_init(struct tg3 *tp)
> tp->mdio_bus->read = &tg3_mdio_read;
> tp->mdio_bus->write = &tg3_mdio_write;
> tp->mdio_bus->reset = &tg3_mdio_reset;
> - tp->mdio_bus->phy_mask = ~(1 << TG3_PHY_MII_ADDR);
> + tp->mdio_bus->phy_mask = ~(1 << tp->phy_addr);
> tp->mdio_bus->irq = &tp->mdio_irq[0];
>
> for (i = 0; i < PHY_MAX_ADDR; i++)
> @@ -1554,7 +1554,7 @@ static int tg3_mdio_init(struct tg3 *tp)
> return i;
> }
>
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
>
> if (!phydev || !phydev->drv) {
> dev_warn(&tp->pdev->dev, "No PHY devices\n");
> @@ -1964,7 +1964,7 @@ static void tg3_setup_flow_control(struct tg3 *tp, u32 lcladv, u32 rmtadv)
> u32 old_tx_mode = tp->tx_mode;
>
> if (tg3_flag(tp, USE_PHYLIB))
> - autoneg = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR]->autoneg;
> + autoneg = tp->mdio_bus->phy_map[tp->phy_addr]->autoneg;
> else
> autoneg = tp->link_config.autoneg;
>
> @@ -2000,7 +2000,7 @@ static void tg3_adjust_link(struct net_device *dev)
> u8 oldflowctrl, linkmesg = 0;
> u32 mac_mode, lcl_adv, rmt_adv;
> struct tg3 *tp = netdev_priv(dev);
> - struct phy_device *phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + struct phy_device *phydev = tp->mdio_bus->phy_map[tp->phy_addr];
>
> spin_lock_bh(&tp->lock);
>
> @@ -2089,7 +2089,7 @@ static int tg3_phy_init(struct tg3 *tp)
> /* Bring the PHY back to a known state. */
> tg3_bmcr_reset(tp);
>
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
>
> /* Attach the MAC to the PHY. */
> phydev = phy_connect(tp->dev, dev_name(&phydev->dev),
> @@ -2116,7 +2116,7 @@ static int tg3_phy_init(struct tg3 *tp)
> SUPPORTED_Asym_Pause);
> break;
> default:
> - phy_disconnect(tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR]);
> + phy_disconnect(tp->mdio_bus->phy_map[tp->phy_addr]);
> return -EINVAL;
> }
>
> @@ -2134,7 +2134,7 @@ static void tg3_phy_start(struct tg3 *tp)
> if (!(tp->phy_flags & TG3_PHYFLG_IS_CONNECTED))
> return;
>
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
>
> if (tp->phy_flags & TG3_PHYFLG_IS_LOW_POWER) {
> tp->phy_flags &= ~TG3_PHYFLG_IS_LOW_POWER;
> @@ -2154,13 +2154,13 @@ static void tg3_phy_stop(struct tg3 *tp)
> if (!(tp->phy_flags & TG3_PHYFLG_IS_CONNECTED))
> return;
>
> - phy_stop(tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR]);
> + phy_stop(tp->mdio_bus->phy_map[tp->phy_addr]);
> }
>
> static void tg3_phy_fini(struct tg3 *tp)
> {
> if (tp->phy_flags & TG3_PHYFLG_IS_CONNECTED) {
> - phy_disconnect(tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR]);
> + phy_disconnect(tp->mdio_bus->phy_map[tp->phy_addr]);
> tp->phy_flags &= ~TG3_PHYFLG_IS_CONNECTED;
> }
> }
> @@ -4034,7 +4034,7 @@ static int tg3_power_down_prepare(struct tg3 *tp)
> struct phy_device *phydev;
> u32 phyid, advertising;
>
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
>
> tp->phy_flags |= TG3_PHYFLG_IS_LOW_POWER;
>
> @@ -11922,7 +11922,7 @@ static int tg3_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
> struct phy_device *phydev;
> if (!(tp->phy_flags & TG3_PHYFLG_IS_CONNECTED))
> return -EAGAIN;
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
> return phy_ethtool_gset(phydev, cmd);
> }
>
> @@ -11989,7 +11989,7 @@ static int tg3_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
> struct phy_device *phydev;
> if (!(tp->phy_flags & TG3_PHYFLG_IS_CONNECTED))
> return -EAGAIN;
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
> return phy_ethtool_sset(phydev, cmd);
> }
>
> @@ -12144,7 +12144,7 @@ static int tg3_nway_reset(struct net_device *dev)
> if (tg3_flag(tp, USE_PHYLIB)) {
> if (!(tp->phy_flags & TG3_PHYFLG_IS_CONNECTED))
> return -EAGAIN;
> - r = phy_start_aneg(tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR]);
> + r = phy_start_aneg(tp->mdio_bus->phy_map[tp->phy_addr]);
> } else {
> u32 bmcr;
>
> @@ -12260,7 +12260,7 @@ static int tg3_set_pauseparam(struct net_device *dev, struct ethtool_pauseparam
> u32 newadv;
> struct phy_device *phydev;
>
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
>
> if (!(phydev->supported & SUPPORTED_Pause) ||
> (!(phydev->supported & SUPPORTED_Asym_Pause) &&
> @@ -13696,7 +13696,7 @@ static int tg3_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
> struct phy_device *phydev;
> if (!(tp->phy_flags & TG3_PHYFLG_IS_CONNECTED))
> return -EAGAIN;
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
> return phy_mii_ioctl(phydev, ifr, cmd);
> }
>
> @@ -17635,7 +17635,7 @@ static int tg3_init_one(struct pci_dev *pdev,
>
> if (tp->phy_flags & TG3_PHYFLG_IS_CONNECTED) {
> struct phy_device *phydev;
> - phydev = tp->mdio_bus->phy_map[TG3_PHY_MII_ADDR];
> + phydev = tp->mdio_bus->phy_map[tp->phy_addr];
> netdev_info(dev,
> "attached PHY driver [%s] (mii_bus:phy_addr=%s)\n",
> phydev->drv->name, dev_name(&phydev->dev));
>
^ permalink raw reply
* Re: [PATCH 3/3] tg3: use phylib when robo switch is in use
From: Nithin Nayak Sujir @ 2013-10-01 21:12 UTC (permalink / raw)
To: Hauke Mehrtens; +Cc: davem, mchan, netdev
In-Reply-To: <1380402928-11480-3-git-send-email-hauke@hauke-m.de>
On 09/28/2013 02:15 PM, Hauke Mehrtens wrote:
> When a switch is connected as a PHY to the MAC driven by tg3, use
> phylib and provide the phy address to tg3 from the sprom.
>
> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
> ---
Acked-by: Nithin Nayak Sujir <nsujir@broadcom.com>
> drivers/net/ethernet/broadcom/tg3.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 853a05e..a17a3c9 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -1513,6 +1513,13 @@ static int tg3_mdio_init(struct tg3 *tp)
> TG3_CPMU_PHY_STRAP_IS_SERDES;
> if (is_serdes)
> tp->phy_addr += 7;
> + } else if (tg3_flag(tp, IS_SSB_CORE) && tg3_flag(tp, ROBOSWITCH)) {
> + int addr;
> +
> + addr = ssb_gige_get_phyaddr(tp->pdev);
> + if (addr < 0)
> + return addr;
> + tp->phy_addr = addr;
> } else
> tp->phy_addr = TG3_PHY_MII_ADDR;
>
> @@ -17366,8 +17373,10 @@ static int tg3_init_one(struct pci_dev *pdev,
> tg3_flag_set(tp, FLUSH_POSTED_WRITES);
> if (ssb_gige_one_dma_at_once(pdev))
> tg3_flag_set(tp, ONE_DMA_AT_ONCE);
> - if (ssb_gige_have_roboswitch(pdev))
> + if (ssb_gige_have_roboswitch(pdev)) {
> + tg3_flag_set(tp, USE_PHYLIB);
> tg3_flag_set(tp, ROBOSWITCH);
> + }
> if (ssb_gige_is_rgmii(pdev))
> tg3_flag_set(tp, RGMII_MODE);
> }
>
^ permalink raw reply
* Re: [PATCH 2/3] ssb: provide phy address for Gigabit Ethernet driver
From: Nithin Nayak Sujir @ 2013-10-01 21:13 UTC (permalink / raw)
To: Hauke Mehrtens; +Cc: davem, mchan, netdev
In-Reply-To: <1380402928-11480-2-git-send-email-hauke@hauke-m.de>
On 09/28/2013 02:15 PM, Hauke Mehrtens wrote:
> Add a function to provide the phy address which should be used to the
> Gigabit Ethernet driver connected to ssb.
>
> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
> ---
Reviewed-by: Nithin Nayak Sujir <nsujir@broadcom.com>
> include/linux/ssb/ssb_driver_gige.h | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/include/linux/ssb/ssb_driver_gige.h b/include/linux/ssb/ssb_driver_gige.h
> index 86a12b0..0688472 100644
> --- a/include/linux/ssb/ssb_driver_gige.h
> +++ b/include/linux/ssb/ssb_driver_gige.h
> @@ -108,6 +108,16 @@ static inline int ssb_gige_get_macaddr(struct pci_dev *pdev, u8 *macaddr)
> return 0;
> }
>
> +/* Get the device phy address */
> +static inline int ssb_gige_get_phyaddr(struct pci_dev *pdev)
> +{
> + struct ssb_gige *dev = pdev_to_ssb_gige(pdev);
> + if (!dev)
> + return -ENODEV;
> +
> + return dev->dev->bus->sprom.et0phyaddr;
> +}
> +
> extern int ssb_gige_pcibios_plat_dev_init(struct ssb_device *sdev,
> struct pci_dev *pdev);
> extern int ssb_gige_map_irq(struct ssb_device *sdev,
> @@ -174,6 +184,10 @@ static inline int ssb_gige_get_macaddr(struct pci_dev *pdev, u8 *macaddr)
> {
> return -ENODEV;
> }
> +static inline int ssb_gige_get_phyaddr(struct pci_dev *pdev)
> +{
> + return -ENODEV;
> +}
>
> #endif /* CONFIG_SSB_DRIVER_GIGE */
> #endif /* LINUX_SSB_DRIVER_GIGE_H_ */
>
^ permalink raw reply
* Re: [PATCH] ipv6: udp packets following an UFO enqueued packet need also be handled by UFO
From: Hannes Frederic Sowa @ 2013-10-01 21:47 UTC (permalink / raw)
To: Jiri Pirko, netdev, yoshfuji, davem, kuznet, jmorris, kaber,
herbert, eric.dumazet
In-Reply-To: <20131001123214.GI10771@order.stressinduktion.org>
Hi Jiri,
On Tue, Oct 01, 2013 at 02:32:14PM +0200, Hannes Frederic Sowa wrote:
> On Tue, Oct 01, 2013 at 02:09:07PM +0200, Hannes Frederic Sowa wrote:
> > On Tue, Oct 01, 2013 at 12:58:37PM +0200, Jiri Pirko wrote:
> > > >>
> > > >> What if non-ufo-path-created skb is peeked?
> > > >
> > > >You mean like:
> > > >first append a frame < mtu
> > > >second frame > mtu so it gets handles by ip6_ufo_append?
> > > >
> > > >Currently I don't see a problem with it but I may be wrong. What is
> > > >your suspicion?
> > >
> > > Well the skb will have gso_size==0 and ip_summed==CHECKSUM_NONE
> > > Because of that it will be not threated as ufo skb.
> > >
> > > Following patch fixes it:
> > >
> > > Subject: [patch net] ip6_output: do skb ufo init for peeked non ufo skb as well
> > >
> > > Now, if user application does:
> > > sendto len<mtu flag MSG_MORE
> > > sendto len>mtu flag 0
> > > The skb is not treated as fragmented one because it is not initialized
> > > that way. So move the initialization to fix this.
> > >
> > > Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> >
> > My understanding is that the frame is only a valid gso frame iff the first skb
> > queued up is setup as an UFO frame. In other cases we only need to make sure
> > we append to the fragment list without updating the gso field and the skb will
> > get linearized (if needed) later in the output path.
> >
> > This seems not to matter for virtio_net and loopback because we seem to pass
> > the skb as is. But the remaining ethernet card which sets NETIF_F_UFO is
> > neterion and we have to verify if this assumption is correct, there.
>
> Hm, ip_append_page seems to violate my assumption. I need to read up on the
> code later today.
I still guess my assumption about gso is correct and I will have to review how
the ip_append_page codepath works in detail.
I currently have a tree where I have applied my UFO patch and your dontfrag
patch. When I have an application which does:
socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET6, sin6_port=htons(53), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0 // UFO is enabled currently
setsockopt(3, SOL_UDP, 1, [1], 4) = 0 // this sets UDP_CORK active
setsockopt(3, SOL_IPV6, IPV6_MTU, [1280], 4) = 0 // minimal IPV6 mtu
write(3, " ", 1) = 1 // generates non-gso head
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 3701) = 3701 // generates gso segment in frags but skb_is_gso(skb) == false
write(3, " ", 1 // and boom: null pointer deref and slab corruption
Without your patch this would not been possible and I am not sure your
patch breaks assumptions for hardware UFO. We should wait with that
patch until we sorted out the details.
My current guess is that we need to replace the skb_is_gso() with
something that checks for already appended frags and continue to use
sbk_append_datato_frags in that case.
The strange thing is that if I don't do the IPV6_MTU setsockopt I don't
get an oops.
IPv4 seems to work without problems, too.
Maybe I miss something? :|
Greetings,
Hannes
^ permalink raw reply
* Re: [PATCH net-next v4 1/3] flow_dissector: factor out the ports extraction in skb_flow_get_ports
From: David Miller @ 2013-10-01 21:47 UTC (permalink / raw)
To: nikolay; +Cc: eric.dumazet, netdev, andy, fubar, vfalico
In-Reply-To: <52446255.6040108@redhat.com>
From: Nikolay Aleksandrov <nikolay@redhat.com>
Date: Thu, 26 Sep 2013 18:35:33 +0200
> Dave would you apply Eric's fix in the net-next tree ?
> I'll then re-base on it.
I just merged net into net-next, please rebase.
Thanks.
^ permalink raw reply
* Re: [patch net] udp6: respect IPV6_DONTFRAG sockopt in case there are pending frames
From: Hannes Frederic Sowa @ 2013-10-01 22:03 UTC (permalink / raw)
To: Jiri Pirko, netdev, davem, kuznet, jmorris, kaber, yoshfuji
In-Reply-To: <20130930212155.GG10771@order.stressinduktion.org>
Hi David!
On Mon, Sep 30, 2013 at 11:21:55PM +0200, Hannes Frederic Sowa wrote:
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
(for patchwork correspondence)
I still think this patch is perfectly fine but it opens up a hole in the
ip6_append_data logic where it is possible to crash the kernel. Jiri and me
will look after it in the thread " ipv6: udp packets following an UFO enqueued
packet need also be handled by UFO".
So, for the time being, I would like to withdraw my Acked-by until this is
sorted out.
Thanks,
Hannes
^ permalink raw reply
* Re: [net-next 0/9][pull request] Intel Wired LAN Driver Updates
From: Jeff Kirsher @ 2013-10-01 22:15 UTC (permalink / raw)
To: David Miller; +Cc: netdev, gospo, sassmann
In-Reply-To: <20131001.125116.1237920409152830224.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 1621 bytes --]
On Tue, 2013-10-01 at 12:51 -0400, David Miller wrote:
> From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Date: Tue, 1 Oct 2013 04:33:47 -0700
>
> > This series contains updates to ixgbevf, ixgbe and igb.
> >
> > Don provides 3 patches for ixgbevf where he cleans up a redundant
> > read mailbox failure check, adds a new function to wait for receive
> > queues to be disabled before disabling NAPI, and move the API
> > negotiation so that it occurs in the reset path. This will allow
> > the PF to be informed of the API version earlier.
> >
> > Jacob provides a ixgbevf and ixgbe patch. His ixgbevf patch removes
> > the use of hw_dbg when the ixgbe_get_regs function is called in ethtool.
> > The ixgbe patch renames the LL_EXTENDED_STATS and some of the functions
> > required to implement busy polling in order to remove the marketing
> > "low latency" blurb which hides what the code actually does.
> >
> > Leonardo provides a ixgbe patch to add support for DCB registers dump
> > using ethtool for 82599 and x540 ethernet controllers.
> >
> > I (Jeff) provide a ixgbe patch to cleanup whitespace issues seen in a
> > code review.
> >
> > Todd provides for igb to add support for i354 in the ethtool offline
> > tests.
> >
> > Laura provides an igb patch to add the ethtool callbacks necessary to
> > configure the number of RSS queues.
>
> Pulled, thanks Jeff.
>
> Please address Ben Hutching's concerns about the state of the device
> after a number of channels configuration failure with follow-on
> changes, if necessary.
>
> Thanks.
Understood, will do.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply
* Re: Big performance loss from 3.4.63 to 3.10.13 when routing ipv4
From: Hannes Frederic Sowa @ 2013-10-01 22:20 UTC (permalink / raw)
To: Wolfgang Walter; +Cc: netdev
In-Reply-To: <3244031.uQGDddGTLF@h2o.as.studentenwerk.mhn.de>
On Tue, Oct 01, 2013 at 06:39:32PM +0200, Wolfgang Walter wrote:
> All network traffic over the router become slow and sluggish. If one pings the
> router there is a packet loss. After about 2 minutes the traffic completely
> stalls for about 1 minute. Then it works again as in the beginning to then
> stall again. And so on.
Maybe dropwatch can give a first hint?
^ permalink raw reply
* Re: [net-next 2/3] udp: Add udp early demux
From: Shawn Bohrer @ 2013-10-01 22:26 UTC (permalink / raw)
To: Rick Jones; +Cc: David Miller, Eric Dumazet, tomk, netdev
In-Reply-To: <524B2CA2.4090807@hp.com>
On Tue, Oct 01, 2013 at 01:12:18PM -0700, Rick Jones wrote:
> On 10/01/2013 12:33 PM, Shawn Bohrer wrote:
> >The removal of the routing cache introduced a performance regression for
> >some UDP workloads since a dst lookup must be done for each packet.
> >This change caches the dst per socket in a similar manner to what we do
> >for TCP by implementing early_demux.
> >
> >For UDP multicast we can only cache the dst if there is only one
> >receiving socket on the host. Since caching only works when there is
> >one receiving socket we do the multicast socket lookup using RCU.
> >
> >Benchmark results from a netperf UDP_RR test:
> >Before 90596.44 transactions/s
> >After 91296.97 transactions/s
>
> Were those measured with confidence intervals enabled? It would be
> a Good Idea (tm) to either use that - I would suggest -I 99,1 -i
> 30,3 added to the global portion of the netperf command line - or
> take several runs. (If you've not already done so since those look
> more like "raw" netperf numbers rather than theaverage of several
> runs)
Those are the averages from six one minute UDP_RR tests. If I
re-rerun the netperf numbers I'll use the confidence intervals I did
not know about that feature. In general I just added these quick
numbers as a point of reference and to double check that some open
source benchmarks could even see a difference. For me the real
benchmark is my workload which does show a much more significant
difference with these changes.
--
Shawn
--
---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.
^ permalink raw reply
* Re: [net-next 3/3] net: ipv4 only populate IP_PKTINFO when needed
From: Shawn Bohrer @ 2013-10-01 22:29 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, tomk, netdev
In-Reply-To: <1380660150.19002.34.camel@edumazet-glaptop.roam.corp.google.com>
On Tue, Oct 01, 2013 at 01:42:30PM -0700, Eric Dumazet wrote:
> On Tue, 2013-10-01 at 14:33 -0500, Shawn Bohrer wrote:
>
> > -void ipv4_pktinfo_prepare(struct sk_buff *skb)
> > +void ipv4_pktinfo_prepare(struct sock *sk, struct sk_buff *skb)
>
>
> Seems good to me, could you use :
>
> void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb)
Yep, I'll make that const and resend.
--
Shawn
--
---------------------------------------------------------------
This email, along with any attachments, is confidential. If you
believe you received this message in error, please contact the
sender immediately and delete all copies of the message.
Thank you.
^ permalink raw reply
* Re: [PATCH v2.41 5/5] datapath: Add basic MPLS support to kernel
From: Jesse Gross @ 2013-10-01 23:02 UTC (permalink / raw)
To: Simon Horman
Cc: dev@openvswitch.org, netdev, Ben Pfaff, Pravin B Shelar, Ravi K,
Isaku Yamahata, Joe Stringer
In-Reply-To: <1380610064-14856-6-git-send-email-horms@verge.net.au>
On Mon, Sep 30, 2013 at 11:47 PM, Simon Horman <horms@verge.net.au> wrote:
> diff --git a/datapath/actions.c b/datapath/actions.c
> index d961e5d..bfab9ec 100644
> --- a/datapath/actions.c
> +++ b/datapath/actions.c
> +/* Push MPLS after the ethernet header. */
> +static int push_mpls(struct sk_buff *skb,
> + const struct ovs_action_push_mpls *mpls)
[...]
> + hdr = eth_hdr(skb);
> + hdr->h_proto = mpls->mpls_ethertype;
> + if (!eth_p_mpls(skb->protocol) && !ovs_skb_get_inner_protocol(skb))
> + ovs_skb_set_inner_protocol(skb, skb->protocol);
Do we actually need the check for !eth_p_mpls(skb->protocol)? It's not
clear to me what condition it is trying to prevent.
> +static int pop_mpls(struct sk_buff *skb, const __be16 ethertype)
> +{
> + struct ethhdr *hdr;
> + int err;
> +
> + err = make_writable(skb, skb->mac_len + MPLS_HLEN);
> + if (unlikely(err))
> + return err;
> +
> + if (unlikely(skb->len < skb->mac_len + MPLS_HLEN))
> + return -ENOMEM;
I'm not sure that this is the right error code in this situation -
maybe -EINVAL would be better.
> @@ -545,6 +662,14 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
> output_userspace(dp, skb, a);
> break;
>
> + case OVS_ACTION_ATTR_PUSH_MPLS:
> + err = push_mpls(skb, nla_data(a));
> + break;
> +
> + case OVS_ACTION_ATTR_POP_MPLS:
> + err = pop_mpls(skb, nla_get_be16(a));
> + break;
I think we need something similar to POP_VLAN here - in the event of
an error the skb will have already been freed.
> diff --git a/datapath/datapath.h b/datapath/datapath.h
> index 4a49a7d..31fe10a 100644
> --- a/datapath/datapath.h
> +++ b/datapath/datapath.h
> @@ -95,6 +95,8 @@ struct datapath {
> * @pkt_key: The flow information extracted from the packet. Must be nonnull.
> * @tun_key: Key for the tunnel that encapsulated this packet. NULL if the
> * packet is not being tunneled.
> + * @inner_protocol: Provides a substitute for the skb->inner_protocol field on
> + * kernels before 3.11.
> */
> struct ovs_skb_cb {
> struct sw_flow *flow;
I think this comment no longer applies now that inner_protocol has
been moved into the GSO struct.
> diff --git a/datapath/linux/compat/gso.c b/datapath/linux/compat/gso.c
> index 32f906c..f917356 100644
> --- a/datapath/linux/compat/gso.c
> +++ b/datapath/linux/compat/gso.c
> int rpl_dev_queue_xmit(struct sk_buff *skb)
> {
> #undef dev_queue_xmit
> int err = -ENOMEM;
> + __be16 inner_protocol;
> + bool vlan, mpls;
>
> - if (vlan_tx_tag_present(skb) && !dev_supports_vlan_tx(skb->dev)) {
> + vlan = mpls = false;
> +
> + inner_protocol = ovs_skb_get_inner_protocol(skb);
I don't think this is actually used in this function.
Pravin, do you have any further comments?
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox