* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Arnaldo Carvalho de Melo @ 2014-12-17 15:07 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
Josef Bacik, Kernel Team
In-Reply-To: <CAADnVQJ+8mtB8LD=U7XbxOC2hxhDChxOELhZ3NEYeoTk1G3LYg@mail.gmail.com>
Em Sun, Dec 14, 2014 at 10:55:55PM -0800, Alexei Starovoitov escreveu:
> On Sun, Dec 14, 2014 at 5:56 PM, Martin KaFai Lau <kafai@fb.com> wrote:
> > Hi,
> >
> > We have been using the kernel ftrace infra to collect TCP per-flow statistics.
> > The following patch set is a first slim-down version of our
> > existing implementation. We would like to get some early feedback
> > and make it useful for others.
> >
> > [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs:
> > Defines some basic tracepoints (by TRACE_EVENT).
> >
> > [RFC PATCH net-next 2/5] tcp: A perf script for TCP tracepoints:
> > A sample perf script with simple ip/port filtering and summary output.
> >
> > [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer:
> > Declares a few more tracepoints (by DECLARE_TRACE) which are
> > used by the tcp_tracer. The tcp_tracer is in the patch 5/5.
> >
> > [RFC PATCH net-next 4/5] tcp: Introduce tcp_sk_trace and related structs:
> > Defines a few tcp_trace structs which are used to collect statistics
> > on each tcp_sock.
> >
> > [RFC PATCH net-next 5/5] tcp: Add TCP tracer:
> > It introduces a tcp_tracer which hooks onto the tracepoints defined in the
> > patch 1/5 and 3/5. It collects data defined in patch 4/5. We currently
> > use this tracer to collect per-flow statistics. The commit log has
> > some more details.
>
> I think patches 1 and 3 are good additions, since they establish
> few permanent points of instrumentation in tcp stack.
> Patches 4-5 look more like use cases of tracepoints established
> before. They may feel like simple additions and, no doubt,
> they are useful, but since they expose things via tracing
> infra they become part of api and cannot be changed later,
> when more stats would be needed.
> I think systemtap like scripting on top of patches 1 and 3
> should solve your use case ?
I guess even just using 'perf probe' to set those wannabe tracepoints
should be enough, no? Then he can refer to those in his perf record
call, etc and process it just like with the real tracepoints.
> Also, have you looked at recent eBPF work?
> Though it's not completely ready yet, soon it should
> be able to do the same stats collection as you have
> in 4/5 without adding permanent pieces to the kernel.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RFC PATCH net-next 1/5] tcp: Add TCP TRACE_EVENTs
From: David Ahern @ 2014-12-17 15:08 UTC (permalink / raw)
To: Martin KaFai Lau, netdev
Cc: David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-2-git-send-email-kafai@fb.com>
On 12/14/14 6:56 PM, Martin KaFai Lau wrote:
> +DECLARE_EVENT_CLASS(tcp,
> + TP_PROTO(struct sock *sk),
> + TP_ARGS(sk),
> + TP_STRUCT__entry(
> + __field(u8, ipv6)
> + __array(u8, laddr, 16)
> + __array(u8, raddr, 16)
You could store the addresses as
union {
struct sockaddr_in v4;
struct sockaddr_in6 v6;
} sa;
and then use:
> + TP_printk("local=%s:%d remote=%s:%d snd_cwnd=%u mss_cache=%u "
> + "ssthresh=%u srtt_us=%llu rto_ms=%u",
%pIS to print the addresses in a more readable format than what
__print_hex will show.
I have a patch to perf (and by extension it applies to trace-cmd) to
handle pI4, pI6 and pI6c. It readily extends to pIS.
David
^ permalink raw reply
* Re: [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer
From: Arnaldo Carvalho de Melo @ 2014-12-17 15:33 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: netdev, David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <1418608606-1569264-4-git-send-email-kafai@fb.com>
Em Sun, Dec 14, 2014 at 05:56:44PM -0800, Martin KaFai Lau escreveu:
> The tcp tracer, which will be added in the later patch, depends
> on them to collect statistics.
> --- a/include/trace/events/tcp.h
<SNIP>
> +DECLARE_TRACE(tcp_sacks_rcv,
> + TP_PROTO(struct sock *sk, int num_sacks),
> + TP_ARGS(sk, num_sacks)
> +);
<SNIP>
> +++ b/net/ipv4/tcp_input.c
> @@ -1650,6 +1650,8 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
> int i, j;
> int first_sack_index;
>
> + trace_tcp_sacks_rcv(sk, num_sacks);
> +
In another message someone pointed out that we want some tracepoints, but
others would imply ABI, a drag on upstream to keep tons of set in stone
tracepoints, so what I was saying was like below, where one of the above
proposed tracepoints is implemented as a "wannabe tracepoint", i.e. a
dynamic probe, that will be as optimized as the kprobes_tracer can make it,
sometimes even using, IIRC, the ftrace mechanizms, if put on some suitable
place (function entry, etc, IIRC, Steven?).
On a random RHEL7 kernel I had laying around on a test machine:
[root@ssdandy ~]# perf probe -L tcp_sacktag_write_queue | head -20
<tcp_sacktag_write_queue@/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp_input.c:0>
0 tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
u32 prior_snd_una)
2 {
struct tcp_sock *tp = tcp_sk(sk);
4 const unsigned char *ptr = (skb_transport_header(ack_skb) +
TCP_SKB_CB(ack_skb)->sacked);
struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2);
struct tcp_sack_block sp[TCP_NUM_SACKS];
struct tcp_sack_block *cache;
struct tcp_sacktag_state state;
struct sk_buff *skb;
11 int num_sacks = min(TCP_NUM_SACKS, (ptr[1] - TCPOLEN_SACK_BASE) >> 3);
int used_sacks;
bool found_dup_sack = false;
int i, j;
int first_sack_index;
17 state.flag = 0;
18 state.reord = tp->packets_out;
[root@ssdandy ~]#
Available variables at tcp_sacktag_write_queue:17
@<tcp_sacktag_write_queue+77>
int num_sacks
struct sk_buff* ack_skb
struct sock* sk
struct tcp_sack_block_wire* sp_wire
u32 prior_snd_una
unsigned char* ptr
[root@ssdandy ~]#
Ok, so we can insert a probe at that point and also we can collect the values of the
sk and num_sacks variables, so:
[root@ssdandy ~]# perf probe 'tcp_sacks_rcv=tcp_sacktag_write_queue:17 sk num_sacks'
Added new event:
probe:tcp_sacks_rcv (on tcp_sacktag_write_queue:17 with sk num_sacks)
You can now use it in all perf tools, such as:
perf record -e probe:tcp_sacks_rcv -aR sleep 1
[root@ssdandy ~]
There you go, you have your wannabe tracepoint, dynamic:
[root@ssdandy ~]# perf record -a -g -e probe:tcp_sacks_rcv
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.362 MB perf.data (~15799 samples) ]
[root@ssdandy ~]# perf script
swapper 0 [000] 184175.932790: probe:tcp_sacks_rcv: (ffffffff8151e59d) sk=ffff880425761e00 num_sacks=0
71e59e tcp_sacktag_write_queue (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
72216e tcp_ack (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
723cf8 tcp_rcv_state_process (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
72d158 tcp_v4_do_rcv (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
72f5c7 tcp_v4_rcv (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
709584 ip_local_deliver_finish (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
709858 ip_local_deliver (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
7091fd ip_rcv_finish (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
709ac4 ip_rcv (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
6cfdb6 __netif_receive_skb_core (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
6cffc8 __netif_receive_skb (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
6d0050 netif_receive_skb (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
6d0aa8 napi_gro_receive (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
1b5bf e1000_receive_skb (/lib/modules/3.10.0-123.el7.x86_64/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko)
1cbda e1000_clean_rx_irq (/lib/modules/3.10.0-123.el7.x86_64/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko)
247dc e1000e_poll (/lib/modules/3.10.0-123.el7.x86_64/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko)
6d041a net_rx_action (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
267047 __do_softirq (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
7f3a5c call_softirq (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
214d25 do_softirq (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
2673e5 irq_exit (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
7f4358 do_IRQ (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
7e94ad ret_from_intr (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
7c3927 rest_init (/usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux)
e06fa7 start_kernel ([kernel.vmlinux].init.text)
e065ee x86_64_start_reservations ([kernel.vmlinux].init.text)
e06742 x86_64_start_kernel ([kernel.vmlinux].init.text)
<SNIP>
[root@ssdandy ~]# perf script -g python
generated Python script: perf-script.py
[root@ssdandy ~]# vim perf-script.py # Edit it to remove callchain printing, simplify some stuff
[root@ssdandy ~]# mv perf-script.py tcp_sack_rcv.py
[root@ssdandy ~]# cat tcp_sack_rcv.py
import os, sys
sys.path.append(os.environ['PERF_EXEC_PATH'] + \
'/scripts/python/Perf-Trace-Util/lib/Perf/Trace')
from perf_trace_context import *
from Core import *
def probe__tcp_sacks_rcv(event_name, context, common_cpu,
common_secs, common_nsecs, common_pid, common_comm,
common_callchain, __probe_ip, sk, num_sacks):
print_header(event_name, common_cpu, common_secs, common_nsecs,
common_pid, common_comm)
print "__probe_ip=%#x, sk=%#x, num_sacks=%d" % \
(__probe_ip, sk, num_sacks)
def print_header(event_name, cpu, secs, nsecs, pid, comm):
print "%-18s %3u %05u.%09u %1u %-8s " % \
(event_name, cpu, secs, nsecs, pid, comm),
[root@ssdandy ~]#
[root@ssdandy ~]# perf script -s tcp_sack_rcv.py | head -10
Failed to open 64/libfreebl3.so, continuing without symbols
probe__tcp_sacks_rcv 0 184175.932790461 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff880425761e00, num_sacks=0
probe__tcp_sacks_rcv 0 184177.487455369 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff8804047e0780, num_sacks=0
probe__tcp_sacks_rcv 0 184177.588593040 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff8804256af800, num_sacks=0
probe__tcp_sacks_rcv 0 184178.741298627 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff8804256acb00, num_sacks=0
probe__tcp_sacks_rcv 0 184179.902089365 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff8804256ad280, num_sacks=0
probe__tcp_sacks_rcv 0 184180.802761942 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff8804256acb00, num_sacks=0
probe__tcp_sacks_rcv 0 184180.961373503 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff8804256af800, num_sacks=0
probe__tcp_sacks_rcv 0 184182.123660739 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff8804256ad280, num_sacks=0
probe__tcp_sacks_rcv 0 184182.387640636 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff8804256acb00, num_sacks=0
probe__tcp_sacks_rcv 0 184182.859420892 0 swapper __probe_ip=0xffffffff8151e59d, sk=0xffff8804256af800, num_sacks=0
[root@ssdandy ~]#
> state.flag = 0;
> state.reord = tp->packets_out;
> state.rtt_us = -1L;
> @@ -2932,6 +2934,9 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
>
> /* RFC6298: only reset backoff on valid RTT measurement. */
> inet_csk(sk)->icsk_backoff = 0;
> +
> + trace_tcp_rtt_sample(sk, seq_rtt_us);
> +
> return true;
> }
>
> @@ -4232,6 +4237,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
> NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);
> SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
> tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
> + trace_tcp_ooo_rcv(sk);
>
> skb1 = skb_peek_tail(&tp->out_of_order_queue);
> if (!skb1) {
> --
> 1.8.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Vlad Yasevich @ 2014-12-17 15:39 UTC (permalink / raw)
To: John Fastabend, Jamal Hadi Salim
Cc: Hubert Sokolowski, Roopa Prabhu, netdev@vger.kernel.org
In-Reply-To: <548F80B2.80408@gmail.com>
On 12/15/2014 07:45 PM, John Fastabend wrote:
> On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
>> On 12/12/14 15:05, John Fastabend wrote:
>>> On 12/12/2014 06:35 AM, Jamal Hadi Salim wrote:
>>
>>
>>> I'll wake up ;)
>>
>>
>> Vlad made me go over those patches in a few iterations to make
>> sure that the use cases covered in the test case work. It is
>> holiday season, so he may be offline.
>>
>
> Yep.
Sorry, had HW/network issues as well as holiday season... Been trying to
catch up.
>
>>> First quick grep of code finds some strange uses of ndo_fdb_dump like
>>> this in macvlan,
>>>
>>> ./drivers/net/macvlan.c
>>> .ndo_fdb_dump = ndo_dflt_fdb_dump,
>>>
>>> I'll be sending a patch once net-next opens up again to resolve it. Its
>>> harmless though so not really a fix for net.
>>>
>>> There seem to be a few places that have the potential to return
>>> different values then the uc/mc lists.
>>>
>>> ./drivers/net/vxlan.c
>>> ./drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
>>> ./drivers/net/ethernet/rocker/rocker.c
>>>
>>> ./net/bridge/br_device.c
>>>
>>
>> Yes, thats my observation as well.
>> The question is: Are multi/unicast address unconditionally dumped?
>
> hmm good question. When I implemented this on the host nics with SR-IOV,
> VMDQ, etc. The multi/unicast addresses were propagated into the FDB by
> the driver. My logic was if some netdev ethx has a set of MAC addresses
> above it well then any virtual function or virtual device also behind
> the hardware shouldn't be sending those addresses out the egress switch
> facing port. Otherwise the switch will see packets it knows are behind
> that port and drop them. Or flood them if it hasn't learned the address
> yet. Either way they will never get to the right netdev.
>
> Admittedly I wasn't thinking about switches with many ports at the time.
Looking at the old code, we've always asked HW to dump it's state and if HW
didn't support the dumper, the default dumper of MC/UC lists was used.
This makes sense for most devices.
>
>> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
>> the other driver did).
>
> My original thinking here was... if it didn't implement fdb_add, fdb_del
> and fdb_dump then if you wanted to think of it as having forwarding
> database that was fine but it was really just a two port mac relay. In
> which case just dump all the mac addresses it knows about. In this case
> if it was something more fancy it could do its own dump like vxlan or
> macvlan.
>
>> If you go over the original thread exchange with Vlad, you'll notice
>> i was kind of unsure why dumping of unicast/multicast had anything to
>> do with fdb dumping.
>> It is still my view that we shouldnt be treating these addresses as if
>> they were fdb entries. But: The problem is once you allow an API to
>> user space you cant take it back even if people are depending on bugs.
>>
>
> For a host nic ucast/multicast and fdb are the same, I think? The
> code we had was just short-hand to allow the common case a host nic
> to work. Notice vxlan and bridge drivers didn't dump there addr lists from fdb_dump until
> your patch.
Right. That patch added additional filtering code, but I guess we missed
the change that force it dump MC/UC lists from the master devices. It did dump
those list from the slave devices.
>
> Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
> I shouldn't overload the addr lists.
>
>>
>>> So I guess we can walk through the list and analyse them a bit.
>>>
>>> vxlan:
>>>
>>> Try stacking devices on top of the vxlan device this will call a uc_add
>>> routine if you then change the mac addr on the vlan. This would get
>>> reported by the dflt fdb dump handlers but not the drivers fdb dump
>>> handlers. So removing the dflt dump handler from this patch at least
>>> changes things. We should either explain why this is OK or accept that
>>> the driver needs to be fixed. Or I guess that the patch is just wrong.
>>> My guess is one of the latter options.
>>>
>>> Also Jamal, your original patch seems like it might of changed this
>>> and Hubert's patch is reverting back to its original case. Was this
>>> specific part of your patch intentional?
>>>
>>
>> Yes.
>> This is based on the view that unicast/multicast must be dumped
>> *unconditionally*. If the view is that uni/mcast addresses are
>> dumped conditionally based on what the driver thinks, then Hubert's
>> one liner is good. But i really would like Vlad to comment. 80%
>> of the effort on my part if you look at the thread was the refactoring
>> of the code to meet the use case.
I don't think we have to dump uc/mc lists unconditionally. What we
want is the lower diver's view of any fdb entries it things are appropriate.
For simple cards, this becomes equivalent to uc/mc lists. For smarter cards
that override the default dumper, it makes sense for them to provide the info.
>
> I'm interested to see what Vlad says as well. But the current situation
> is previously some drivers dumped their addr lists others didn't.
> Specifically, the more switch like devices (bridge, vxlan) didn't. Now
> every device will dump the addr lists. I'm not entirely convinced that
> is correct.
>
Well, the bridge would have dumped any fdbs that pointer at the bridge
device (port is NULL) as it's egress port. Not sure about the vxlan.
For stacked situations of complex devices this make sense. For
instance if you stack vxlan on top of bridge, then from the vxlan
perspective, we want to see what macs the bridge will forward to the
vxlan. Here, the bridge is actually slightly broken as it wouldn't
actually dump all the pertinent info, but that's more of a bridge problem.
-vlad
>>
>> I thought the abstraction which requires that your own MAC addresses
>> are treated as fdb entries was broken - but it is too late to change
>> that.
>>
>
> It works OK for host nics (NICS that can't forward between ports) and
> seems at best confusing for real switch asics. On a related question do
> you expect the switch asic to trap any packets with MAC addresses in
> the multi/unicast address lists and send them to the correct netdev? Or
> will the switch forward them using normal FDB tables?
>
> Also I don't think its too late to fix it though. Maybe we had some
> buggy drivers is all.
>
>> cheers,
>> jamal
>
>
^ permalink raw reply
* Re: [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer
From: David Ahern @ 2014-12-17 15:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Martin KaFai Lau
Cc: netdev, David S. Miller, Hannes Frederic Sowa, Steven Rostedt,
Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <20141217153349.GG11607@kernel.org>
On 12/17/14 8:33 AM, Arnaldo Carvalho de Melo wrote:
>
> On a random RHEL7 kernel I had laying around on a test machine:
>
> [root@ssdandy ~]# perf probe -L tcp_sacktag_write_queue | head -20
> <tcp_sacktag_write_queue@/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp_input.c:0>
> 0 tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
> u32 prior_snd_una)
> 2 {
> struct tcp_sock *tp = tcp_sk(sk);
> 4 const unsigned char *ptr = (skb_transport_header(ack_skb) +
> TCP_SKB_CB(ack_skb)->sacked);
> struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2);
> struct tcp_sack_block sp[TCP_NUM_SACKS];
> struct tcp_sack_block *cache;
> struct tcp_sacktag_state state;
> struct sk_buff *skb;
> 11 int num_sacks = min(TCP_NUM_SACKS, (ptr[1] - TCPOLEN_SACK_BASE) >> 3);
> int used_sacks;
> bool found_dup_sack = false;
> int i, j;
> int first_sack_index;
>
> 17 state.flag = 0;
> 18 state.reord = tp->packets_out;
But there are limitations/hassles with this approach. For starters I
believe it requires vmlinux on box. The products I work on do not have
vmlinux available in the runtime environment. I recall someone (Masami?)
suggesting the ability to write the probe data to a file (ie., create
the probe definition off box) and load the file to create the probe, so
yes a solvable problem.
But with this approach it could very be that the function name and
variable names differ with kernel version and that makes it hard to
impossible to create a set of analysis commands.
David
^ permalink raw reply
* Question about phys_port_id
From: Joshua Watt @ 2014-12-17 16:09 UTC (permalink / raw)
To: netdev
Hello,
I had a question regarding the phys_port_id attribute of net_device.
Is that identifier supposed to be globally unique or just unique among
devices that share a common device? For example, we have a single
device that create two net_device s (one for each of it's macs). Would
it be sufficient for this device to return a phys_port_id of 0 for the
first net_device and 1 for the second? I noticed that the other
implementations that use phys_port_id copy their mac address into the
phys_port_id, but I'm not sure if that is just because that is an easy
way to get a unique number or if it is because the ID needs to be
globally unique.
If you're wondering the driver in question is the TI cpsw driver
(drivers/net/ethernet/ti/cpsw.c). We are running the device in
dual-emac mode and need to uniquely identify which emac is which in
userspace (specifically, udev rules). The physical port identifier
seems to be the logical choice to me, but I'm not sure if I'm missing
something.
Thanks,
Joshua Watt
^ permalink raw reply
* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Hubert Sokolowski @ 2014-12-17 16:18 UTC (permalink / raw)
To: vyasevic
Cc: John Fastabend, Jamal Hadi Salim, Roopa Prabhu,
netdev@vger.kernel.org
In-Reply-To: <5491A3B5.9070601@redhat.com>
>
> I don't think we have to dump uc/mc lists unconditionally. What we
> want is the lower diver's view of any fdb entries it things are appropriate.
> For simple cards, this becomes equivalent to uc/mc lists. For smarter cards
> that override the default dumper, it makes sense for them to provide the info.
>
I am very glad to hear that :).
>
> Well, the bridge would have dumped any fdbs that pointer at the bridge
> device (port is NULL) as it's egress port. Not sure about the vxlan.
> For stacked situations of complex devices this make sense. For
> instance if you stack vxlan on top of bridge, then from the vxlan
> perspective, we want to see what macs the bridge will forward to the
> vxlan. Here, the bridge is actually slightly broken as it wouldn't
> actually dump all the pertinent info, but that's more of a bridge problem.
I have just prepared a patch where I dump uc/mc for bridge devices
by looking at (dev->priv_flags & IFF_EBRIDGE), so I have same results
as without my changes. This should satisfy Jamal and Roopa.
I could send it as v3 of my patch along with the results if you are
interested.
>
> -vlad
>
thanks,
Hubert
--
Hubert Sokolowski Intel Corporation
^ permalink raw reply
* Re: Netlink mmap tx security?
From: Thomas Graf @ 2014-12-17 16:26 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, dborkman, luto, torvalds, kaber, netdev
In-Reply-To: <1418774579.9773.69.camel@edumazet-glaptop2.roam.corp.google.com>
On 12/16/14 at 04:02pm, Eric Dumazet wrote:
> On Tue, 2014-12-16 at 17:58 -0500, David Miller wrote:
>
> > + __skb_put(skb, nm_len);
> > + memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, nm_len);
> > + netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
> >
>
> Not related to this patch, but it looks like netlink_set_status()
> barrier is wrong ?
>
> It seems we need smp_wmb() after the memcpy() and before the
> hdr->nm_status = status;
Yes, definitely wrong as-is. I'll send a patch. For the particular
case we'd need a smp_rmb() after the memcpy() to complete the loads.
The skb destructor needs a smp_wmb() after setting nm_len. We could
get away with a smp_wmb() first thing in netlink_set_status() with
the code as-is but smp_mb() might be the less fragile thing to do.
Objections?
^ permalink raw reply
* Re: Netlink mmap tx security?
From: Thomas Graf @ 2014-12-17 16:27 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: David Miller, luto, torvalds, kaber, netdev
In-Reply-To: <5490C73C.8010405@redhat.com>
On 12/17/14 at 12:58am, Daniel Borkmann wrote:
> Fixes: 5fd96123ee19 ("netlink: implement memory mapped sendmsg()")
> Acked-by: Daniel Borkmann <dborkman@redhat.com>
Nothing to add to Daniel's excellent feedback.
Acked-by: Thomas Graf <tgraf@suug.ch>
^ permalink raw reply
* Re: [PATCH net 1/2] geneve: Remove socket and offload handlers at destruction.
From: Thomas Graf @ 2014-12-17 16:41 UTC (permalink / raw)
To: Jesse Gross; +Cc: David Miller, netdev, Andy Zhou
In-Reply-To: <1418783132-99230-1-git-send-email-jesse@nicira.com>
On 12/16/14 at 06:25pm, Jesse Gross wrote:
> Sockets aren't currently removed from the the global list when
> they are destroyed. In addition, offload handlers need to be cleaned
> up as well.
>
> Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver")
> CC: Andy Zhou <azhou@nicira.com>
> Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
^ permalink raw reply
* Re: Question about phys_port_id
From: Dan Williams @ 2014-12-17 16:47 UTC (permalink / raw)
To: Joshua Watt; +Cc: netdev
In-Reply-To: <CAEPrYjTZ8SU1y4TwKHYtZju+4_-O5mazAjaFKaKmR2cYuAKHnA@mail.gmail.com>
On Wed, 2014-12-17 at 10:09 -0600, Joshua Watt wrote:
> Hello,
>
> I had a question regarding the phys_port_id attribute of net_device.
> Is that identifier supposed to be globally unique or just unique among
> devices that share a common device? For example, we have a single
> device that create two net_device s (one for each of it's macs). Would
Do the two net_device's share hardware or firmware resources? Can they
be used independently at maximum capability, or when both are in use do
they have degraded capability?
> it be sufficient for this device to return a phys_port_id of 0 for the
> first net_device and 1 for the second? I noticed that the other
If the two netdevs share resources, then they should have the *same*
phys_port_id. If they do not have the same physical hardware or shared
resources and are completely independent from each other at all levels,
then you can either skip phy_port_id altogether.
One good use for this (and why it was originally added) was to indicate
to userspace that it was pointless to bond two interfaces with the same
underlying hardware or resources, because that totally defeats the
purpose of both failover and aggregation.
> implementations that use phys_port_id copy their mac address into the
> phys_port_id, but I'm not sure if that is just because that is an easy
> way to get a unique number or if it is because the ID needs to be
> globally unique.
Say you have two netdevs that share the same hardware or resources. You
assign them both a phys_port_id of "1" to indicate this. What if
there's a second cpsw device on the system, do both of its netdevs also
get "1", or "2", or? Or how about a card from another vendor, how do
you ensure that your device's phys_port_id won't conflict with that
vendor's device/driver?
That's why most drivers currently use the MAC address or a GUID.
Dan
> If you're wondering the driver in question is the TI cpsw driver
> (drivers/net/ethernet/ti/cpsw.c). We are running the device in
> dual-emac mode and need to uniquely identify which emac is which in
> userspace (specifically, udev rules). The physical port identifier
> seems to be the logical choice to me, but I'm not sure if I'm missing
> something.
>
> Thanks,
> Joshua Watt
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH net 2/2] geneve: Fix races between socket add and release.
From: Thomas Graf @ 2014-12-17 16:54 UTC (permalink / raw)
To: Jesse Gross; +Cc: David Miller, netdev, Andy Zhou
In-Reply-To: <1418783132-99230-2-git-send-email-jesse@nicira.com>
On 12/16/14 at 06:25pm, Jesse Gross wrote:
> Currently, searching for a socket to add a reference to is not
> synchronized with deletion of sockets. This can result in use
> after free if there is another operation that is removing a
> socket at the same time. Solving this requires both holding the
> appropriate lock and checking the refcount to ensure that it
> has not already hit zero.
>
> Inspired by a related (but not exactly the same) issue in the
> VXLAN driver.
>
> Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver")
> CC: Andy Zhou <azhou@nicira.com>
> Signed-off-by: Jesse Gross <jesse@nicira.com>
> ---
> net/ipv4/geneve.c | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/geneve.c b/net/ipv4/geneve.c
> index 5a47188..95e47c9 100644
> --- a/net/ipv4/geneve.c
> +++ b/net/ipv4/geneve.c
> @@ -296,6 +296,7 @@ struct geneve_sock *geneve_sock_add(struct net *net, __be16 port,
> geneve_rcv_t *rcv, void *data,
> bool no_share, bool ipv6)
> {
> + struct geneve_net *gn = net_generic(net, geneve_net_id);
> struct geneve_sock *gs;
>
> gs = geneve_socket_create(net, port, rcv, data, ipv6);
> @@ -305,15 +306,15 @@ struct geneve_sock *geneve_sock_add(struct net *net, __be16 port,
> if (no_share) /* Return error if sharing is not allowed. */
> return ERR_PTR(-EINVAL);
>
> + spin_lock(&gn->sock_lock);
> gs = geneve_find_sock(net, port);
Perhaps remove the _rcu of the iterator in the geneve_find_sock?
Also, the kfree_rcu() seems no longer needed as all read accesses
are protected by the spinlock.
> - if (gs) {
> - if (gs->rcv == rcv)
> - atomic_inc(&gs->refcnt);
> - else
> + if (gs && ((gs->rcv != rcv) ||
> + !atomic_add_unless(&gs->refcnt, 1, 0)))
> gs = ERR_PTR(-EBUSY);
Since you are taking gn->sock_lock in geneve_sock_release()
anyway, all accesses to refcnt could eventually be converted
to non-atomic ops.
^ permalink raw reply
* [PATCH] Documentation: clarify phys_port_id
From: Dan Williams @ 2014-12-17 16:59 UTC (permalink / raw)
To: netdev; +Cc: Joshua Watt, jpirko, Florian Fainelli
In-Reply-To: <1418834826.1160.35.camel@dcbw.local>
Signed-off-by: Dan Williams <dcbw@redhat.com>
---
Documentation/ABI/testing/sysfs-class-net | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index e1b2e78..7fe823a 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -186,7 +186,12 @@ KernelVersion: 3.12
Contact: netdev@vger.kernel.org
Description:
Indicates the interface unique physical port identifier within
- the NIC, as a string.
+ the NIC, as a string. If two net_device objects share physical
+ hardware or other resources, and/or do not operate independently
+ both net_device objects should be assigned the
+ same phys_port_id. phys_port_id should be as globally unique
+ as possible to prevent conflicts between different drivers and
+ vendors, eg with MAC addresses or hardware GUIDs.
What: /sys/class/net/<iface>/speed
Date: October 2009
--
1.9.3
^ permalink raw reply related
* Re: [PATCH net-next 1/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-17 17:00 UTC (permalink / raw)
To: Bjørn Mork; +Cc: netdev
In-Reply-To: <87d27i35gb.fsf@nemi.mork.no>
On 12/17/14 6:01 AM, Bjørn Mork wrote:
> rapier <rapier@psc.edu> writes:
>
>> + * The Web10Gig project. See http://www.web10gig.org
>
> URL is already outdated?
My apologies. The correct site is http://www.web10g.org. My fingers
moved faster than my brain on that one.
^ permalink raw reply
* Re: [RFC PATCH net-next 3/5] tcp: Add a few more tracepoints for tcp tracer
From: Arnaldo Carvalho de Melo @ 2014-12-17 17:02 UTC (permalink / raw)
To: David Ahern
Cc: Martin KaFai Lau, netdev, David S. Miller, Hannes Frederic Sowa,
Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <5491A86C.5030207@gmail.com>
Em Wed, Dec 17, 2014 at 08:59:40AM -0700, David Ahern escreveu:
> On 12/17/14 8:33 AM, Arnaldo Carvalho de Melo wrote:
> >On a random RHEL7 kernel I had laying around on a test machine:
> >[root@ssdandy ~]# perf probe -L tcp_sacktag_write_queue | head -20
> ><tcp_sacktag_write_queue@/usr/src/debug/kernel-3.10.0-123.el7/linux-3.10.0-123.el7.x86_64/net/ipv4/tcp_input.c:0>
> > 0 tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
> > u32 prior_snd_una)
> > 2 {
> > struct tcp_sock *tp = tcp_sk(sk);
> > 4 const unsigned char *ptr = (skb_transport_header(ack_skb) +
> > TCP_SKB_CB(ack_skb)->sacked);
> > struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2);
> > struct tcp_sack_block sp[TCP_NUM_SACKS];
> > struct tcp_sack_block *cache;
> > struct tcp_sacktag_state state;
> > struct sk_buff *skb;
> > 11 int num_sacks = min(TCP_NUM_SACKS, (ptr[1] - TCPOLEN_SACK_BASE) >> 3);
> > int used_sacks;
> > bool found_dup_sack = false;
> > int i, j;
> > int first_sack_index;
> >
> > 17 state.flag = 0;
> > 18 state.reord = tp->packets_out;
> But there are limitations/hassles with this approach. For starters I believe
> it requires vmlinux on box. The products I work on do not have vmlinux
> available in the runtime environment. I recall someone (Masami?) suggesting
Not necessarily, one can do all this on a development machine that has
that info and then end up with kprobe_trace expressions as described on:
Documentation/trace/kprobetrace.txt
> the ability to write the probe data to a file (ie., create the probe
> definition off box) and load the file to create the probe, so yes a solvable
> problem.
Exactly.
> But with this approach it could very be that the function name and variable
> names differ with kernel version and that makes it hard to impossible to
> create a set of analysis commands.
Well, if this is in a very much in flux code, then, there is no place
for a tracepoint there :-)
For instance: I've been using this definition:
commit c522739d72a341a3e74a369ce6298b9412813d3f
Author: Arnaldo Carvalho de Melo <acme@redhat.com>
Date: Fri Sep 27 18:06:19 2013 -0300
perf trace: Use vfs_getname hook if available
Initially it tries to find a probe:vfs_getname that should be setup
with:
perf probe 'vfs_getname=getname_flags:65 pathname=result->name:string'
-----------------
For quite a while in tools/perf/builtin-trace.c
I.e. if this thing is in place, then I get mappings from a pointer to a
pathname that I use in a tool, 'perf trace', if not, it tries reading
/proc/, etc, which is suboptimal but then the only way to map a fd to a
pathname.
On RHEL7:
[root@ssdandy ~]# perf probe 'vfs_getname=getname_flags:65 pathname=result->name:string'
Added new event:
probe:vfs_getname (on getname_flags:65 with pathname=result->name:string)
You can now use it in all perf tools, such as:
perf record -e probe:vfs_getname -aR sleep 1
[root@ssdandy ~]#
[root@ssdandy ~]# perf record -e probe:vfs_getname cat /etc/passwd > /dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (~812 samples) ]
[root@ssdandy ~]# perf script
cat 24805 [005] 188716.046396: probe:vfs_getname: (ffffffff811bb7e3) pathname="/etc/ld.so.preload"
cat 24805 [005] 188716.046408: probe:vfs_getname: (ffffffff811bb7e3) pathname="/etc/ld.so.cache"
cat 24805 [005] 188716.046428: probe:vfs_getname: (ffffffff811bb7e3) pathname="/lib64/libc.so.6"
cat 24805 [005] 188716.046675: probe:vfs_getname: (ffffffff811bb7e3) pathname="/usr/lib/locale/locale-archive"
cat 24805 [005] 188716.046718: probe:vfs_getname: (ffffffff811bb7e3) pathname="/etc/passwd"
[root@ssdandy ~]# uname -r
3.10.0-123.el7.x86_64
And in fedora there was a change in how we must set up the probe:
[root@zoo ~]# perf probe 'vfs_getname=getname_flags:65 pathname=filename:string'
Added new event:
probe:vfs_getname (on getname_flags:65 with pathname=filename:string)
You can now use it in all perf tools, such as:
perf record -e probe:vfs_getname -aR sleep 1
[root@zoo ~]# perf record -e probe:vfs_getname cat /etc/passwd > /dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (~687 samples) ]
[root@zoo ~]# perf script
cat 27485 [000] 931.973139: probe:vfs_getname: (ffffffff8120ad13) pathname="/etc/ld.so.preload"
cat 27485 [000] 931.973148: probe:vfs_getname: (ffffffff8120ad13) pathname="/etc/ld.so.cache"
cat 27485 [000] 931.973169: probe:vfs_getname: (ffffffff8120ad13) pathname="/lib64/libc.so.6"
cat 27485 [000] 931.973417: probe:vfs_getname: (ffffffff8120ad13) pathname="/usr/lib/locale/locale-archive"
cat 27485 [000] 931.973488: probe:vfs_getname: (ffffffff8120ad13) pathname="/etc/passwd"
[root@zoo ~]# uname -r
3.17.4-200.fc20.x86_64
[root@zoo ~]#
What I could find in 'result->name' changed to 'filename', but since I gave it
a standard name ("pathname"), its just when setting up the probes, something we
do at system start, i.e. just once, before the first 'perf record', we can deal
with that with some scripting that will do probe setup fallbacks.
All this while we're prototyping a tool, when we're 100% sure that a tracepoint
would provide any performance gains and that that area is so set in stone that
we can guarantee an ABI, go ahead and add the tracepoint.
- Arnaldo
^ permalink raw reply
* [PATCH][v3.13.y] e1000e: Fix no connectivity when driver loaded with cable out
From: Joseph Salisbury @ 2014-12-17 17:08 UTC (permalink / raw)
To: Kamal Mostafa
Cc: stable@vger.kernel.org, davidx.m.ertman, jeffrey.e.pieper,
jeffrey.t.kirsher, LKML, linux.nics, e1000-devel,
netdev@vger.kernel.org
Hello,
Please consider including mainline commit b20a774 in the next v3.13.y
stable release. It was included in the mainline tree as of v3.15-rc1.
It has been tested and confirmed to resolve
http://bugs.launchpad.net/bugs/1400365 .
commit b20a774495671f037e7160ea2ce8789af6b61533
Author: David Ertman <davidx.m.ertman@intel.com>
Date: Tue Mar 25 04:27:55 2014 +0000
e1000e: Fix no connectivity when driver loaded with cable out
Sincerely,
Joseph Salisbury
^ permalink raw reply
* Re: [PATCH net-next 1/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: Bryan @ 2014-12-17 16:19 UTC (permalink / raw)
To: Bjørn Mork; +Cc: netdev
In-Reply-To: <87d27i35gb.fsf@nemi.mork.no>
On 12/17/2014 6:01 AM, Bjørn Mork wrote:
>> + * The Web10Gig project. See http://www.web10gig.org
> URL is already outdated?
That's a typo. Please see http://www.web10g.org
^ permalink raw reply
* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Alexei Starovoitov @ 2014-12-17 17:14 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
Josef Bacik, Kernel Team
On Wed, Dec 17, 2014 at 7:07 AM, Arnaldo Carvalho de Melo
<arnaldo.melo@gmail.com> wrote:
>
> I guess even just using 'perf probe' to set those wannabe tracepoints
> should be enough, no? Then he can refer to those in his perf record
> call, etc and process it just like with the real tracepoints.
it's far from ideal for two reasons.
- they have different kernels and dragging along vmlinux
with debug info or multiple 'perf list' data is too cumbersome
operationally. Permanent tracepoints solve this problem.
- the action upon hitting tracepoint is non-trivial.
perf probe style of unconditionally walking pointer chains
will be tripping over wrong pointers.
Plus they already need to do aggregation for high
frequency events.
As part of acting on trace_transmit_skb() event:
if (before(tcb->seq, tcp_sk(sk)->snd_nxt)) {
tcp_trace_stats_add(...)
}
if (jiffies_to_msecs(jiffies - sktr->last_ts) ..) {
tcp_trace_stats_add(...)
}
^ permalink raw reply
* [PATCH] MAINTAINERS: changes for wireless
From: John W. Linville @ 2014-12-17 17:07 UTC (permalink / raw)
To: netdev; +Cc: linux-wireless, davem, John W. Linville
http://marc.info/?l=linux-wireless&m=141883202530292&w=2
This makes it official... :-)
Signed-off-by: John W. Linville <linville@tuxdriver.com>
---
MAINTAINERS | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index fdffe962a16a..e82d31aeb936 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6603,19 +6603,8 @@ L: netdev@vger.kernel.org
S: Maintained
NETWORKING [WIRELESS]
-M: "John W. Linville" <linville@tuxdriver.com>
L: linux-wireless@vger.kernel.org
Q: http://patchwork.kernel.org/project/linux-wireless/list/
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless.git
-S: Maintained
-F: net/mac80211/
-F: net/rfkill/
-F: net/wireless/
-F: include/net/ieee80211*
-F: include/linux/wireless.h
-F: include/uapi/linux/wireless.h
-F: include/net/iw_handler.h
-F: drivers/net/wireless/
NETWORKING DRIVERS
L: netdev@vger.kernel.org
@@ -6636,6 +6625,14 @@ F: include/linux/inetdevice.h
F: include/uapi/linux/if_*
F: include/uapi/linux/netdevice.h
+NETWORKING DRIVERS (WIRELESS)
+M: Kalle Valo <kvalo@codeaurora.org>
+L: linux-wireless@vger.kernel.org
+Q: http://patchwork.kernel.org/project/linux-wireless/list/
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers.git/
+S: Maintained
+F: drivers/net/wireless/
+
NETXEN (1/10) GbE SUPPORT
M: Manish Chopra <manish.chopra@qlogic.com>
M: Sony Chacko <sony.chacko@qlogic.com>
--
1.9.3
^ permalink raw reply related
* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-17 17:32 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, alexei.starovoitov, netdev
In-Reply-To: <1418769212.9773.65.camel@edumazet-glaptop2.roam.corp.google.com>
On 12/16/14 5:33 PM, Eric Dumazet wrote:
> There is very little chance web10g ~3000 lines of code are added into
> linux TCP stack, by people who did not submit netdev changes in last
> years.
I do understand this. I went though something similar when submitting
the hpn-ssh patch set to OpenSSH many years ago. In retrospect we should
have been submitting subsets of the instruments on a periodic basis over
the past couple of years.
I also understand the need to be conservative in the approach to
inclusion of new functionality. No one, least of all my team, wants to
introduce instability or useless complexity into the stack.
> At Google, we tried the web10g route, but reverted it (today !) in favor
> of tcp_info extensions (ss command from iproute2 can also grab/display
> these), after too many bugs being filled.
I was informed that there was parallel development at Google and the
decision to move in favor of tcp_info happened not that long ago. I do
wish Google had shared more of these bug reports with the development
team so we could have addressed them at the time. That's how things go
though.
> Researchers love/want to have hundred of metrics. This does not mean
> linux has to provide them natively, unless we can prove it is really
> damn useful.
We agree with that. What we've found is that determining the most
valuable metrics often depends on the context. Those looking at
instrumenting connections within a data center are going to be looking
at different metrics than someone managing flows across a federated set
of widely distributed data transfer nodes. I'd be more than happy to
discuss what instruments provide the most value and which are
superfluous. In fact, I believe this is a critical conversation *if* the
community feels that more stack instrumentation would be useful.
> Sorry, but someone had to raise some reality concerns.
No need to apologize. Reality, like the Moon, is a harsh mistress.
> tcp_info _is_ extensible, granted you do not try to push 127 new metrics
> in it.
We are more than willing to look in to extending tcp_info. We do think
our methodology has some value though. One of the things we feel is an
advantage is that tcp_estats has a method to query the MIB and
dynamically determine what set of instruments are available. This allows
for a bit more flexibility in terms of forward/backward compatibility.
Chris
^ permalink raw reply
* [PATCH 00/10] Split UFO into v4 and v6 versions.
From: Vladislav Yasevich @ 2014-12-17 18:20 UTC (permalink / raw)
To: netdev; +Cc: mst, ben, stefanha, virtualization
UFO support in the kernel applies to both IPv4 and IPv6 protocols
with the same device feature. However some devices may not be able
to support one of the offloads. For this we split the UFO offload
feature into 2 pieces. NETIF_F_UFO now controlls the IPv4 part and
this series introduces NETIF_F_UFO6.
As a result of this work, we can now re-enable NETIF_F_UFO on
virtio_net devices and restore UDP over IPv4 performance for guests.
We also continue to support legacy guests that assume that UFO6
support included into UFO(4).
Without this work, migrating a guest to a 3.18 kernel fails.
Vladislav Yasevich (10):
core: Split out UFO6 support
net: Correctly mark IPv6 UFO offload type.
ovs: Enable handling of UFO6 packets.
loopback: Turn on UFO6 support.
veth: Enable UFO6 support.
macvlan: Enable UFO6 support.
s2io: Enable UFO6 support.
tun: Re-uanble UFO support.
macvtap: Re-enable UFO support
Revert "drivers/net: Disable UFO through virtio"
drivers/net/ethernet/neterion/s2io.c | 6 +++---
drivers/net/loopback.c | 4 ++--
drivers/net/macvlan.c | 2 +-
drivers/net/macvtap.c | 20 ++++++++++++++------
drivers/net/tun.c | 26 ++++++++++++++------------
drivers/net/veth.c | 2 +-
drivers/net/virtio_net.c | 24 ++++++++++--------------
include/linux/netdev_features.h | 7 +++++--
include/linux/netdevice.h | 1 +
include/linux/skbuff.h | 1 +
net/core/dev.c | 35 +++++++++++++++++++----------------
net/core/ethtool.c | 2 +-
net/ipv6/ip6_offload.c | 1 +
net/ipv6/ip6_output.c | 4 ++--
net/ipv6/udp_offload.c | 3 ++-
net/mpls/mpls_gso.c | 1 +
net/openvswitch/datapath.c | 3 ++-
net/openvswitch/flow.c | 2 +-
18 files changed, 81 insertions(+), 63 deletions(-)
--
1.9.3
^ permalink raw reply
* [PATCH 03/10] ovs: Enable handling of UFO6 packets.
From: Vladislav Yasevich @ 2014-12-17 18:20 UTC (permalink / raw)
To: netdev; +Cc: mst, ben, stefanha, virtualization
In-Reply-To: <1418840455-22598-1-git-send-email-vyasevic@redhat.com>
Since UFO6 packets can now be identified by SKB_GSO_UDP6, add proper checks
to handel UFO6 flows.
Legacy applications may still have UFO6 packets identified by SKB_GSO_UDP,
so we need to continue to handle them correclty.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
net/openvswitch/datapath.c | 3 ++-
net/openvswitch/flow.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index f9e556b..b43fc60 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -334,7 +334,8 @@ static int queue_gso_packets(struct datapath *dp, struct sk_buff *skb,
if (err)
break;
- if (skb == segs && gso_type & SKB_GSO_UDP) {
+ if (skb == segs &&
+ ((gso_type & SKB_GSO_UDP) || (gso_type & SKB_GSO_UDP6))) {
/* The initial flow key extracted by ovs_flow_extract()
* in this case is for a first fragment, so we need to
* properly mark later fragments.
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 2b78789..d03adf4 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -602,7 +602,7 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
if (key->ip.frag == OVS_FRAG_TYPE_LATER)
return 0;
- if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP)
+ if (skb_shinfo(skb)->gso_type & (SKB_GSO_UDP | SKB_GSO_UDP6))
key->ip.frag = OVS_FRAG_TYPE_FIRST;
/* Transport layer. */
--
1.9.3
^ permalink raw reply related
* [PATCH 01/10] core: Split out UFO6 support
From: Vladislav Yasevich @ 2014-12-17 18:20 UTC (permalink / raw)
To: netdev; +Cc: virtualization, mst, ben, stefanha, Vladislav Yasevich
In-Reply-To: <1418840455-22598-1-git-send-email-vyasevic@redhat.com>
Split IPv6 support for UFO into its own feature similiar to TSO.
This will later allow us to re-enable UFO support for virtio-net
devices.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
include/linux/netdev_features.h | 7 +++++--
include/linux/netdevice.h | 1 +
include/linux/skbuff.h | 1 +
net/core/dev.c | 35 +++++++++++++++++++----------------
net/core/ethtool.c | 2 +-
5 files changed, 27 insertions(+), 19 deletions(-)
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index dcfdecb..a078945 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -48,8 +48,9 @@ enum {
NETIF_F_GSO_UDP_TUNNEL_BIT, /* ... UDP TUNNEL with TSO */
NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
NETIF_F_GSO_MPLS_BIT, /* ... MPLS segmentation */
+ NETIF_F_UFO6_BIT, /* ... UDPv6 fragmentation */
/**/NETIF_F_GSO_LAST = /* last bit, see GSO_MASK */
- NETIF_F_GSO_MPLS_BIT,
+ NETIF_F_UFO6_BIT,
NETIF_F_FCOE_CRC_BIT, /* FCoE CRC32 */
NETIF_F_SCTP_CSUM_BIT, /* SCTP checksum offload */
@@ -109,6 +110,7 @@ enum {
#define NETIF_F_TSO_ECN __NETIF_F(TSO_ECN)
#define NETIF_F_TSO __NETIF_F(TSO)
#define NETIF_F_UFO __NETIF_F(UFO)
+#define NETIF_F_UFO6 __NETIF_F(UFO6)
#define NETIF_F_VLAN_CHALLENGED __NETIF_F(VLAN_CHALLENGED)
#define NETIF_F_RXFCS __NETIF_F(RXFCS)
#define NETIF_F_RXALL __NETIF_F(RXALL)
@@ -141,7 +143,7 @@ enum {
/* List of features with software fallbacks. */
#define NETIF_F_GSO_SOFTWARE (NETIF_F_TSO | NETIF_F_TSO_ECN | \
- NETIF_F_TSO6 | NETIF_F_UFO)
+ NETIF_F_TSO6 | NETIF_F_UFO | NETIF_F_UFO6)
#define NETIF_F_GEN_CSUM NETIF_F_HW_CSUM
#define NETIF_F_V4_CSUM (NETIF_F_GEN_CSUM | NETIF_F_IP_CSUM)
@@ -149,6 +151,7 @@ enum {
#define NETIF_F_ALL_CSUM (NETIF_F_V4_CSUM | NETIF_F_V6_CSUM)
#define NETIF_F_ALL_TSO (NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_TSO_ECN)
+#define NETIF_F_ALL_UFO (NETIF_F_UFO | NETIF_F_UFO6)
#define NETIF_F_ALL_FCOE (NETIF_F_FCOE_CRC | NETIF_F_FCOE_MTU | \
NETIF_F_FSO)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 74fd5d3..86af10a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3559,6 +3559,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
/* check flags correspondence */
BUILD_BUG_ON(SKB_GSO_TCPV4 != (NETIF_F_TSO >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_UDP != (NETIF_F_UFO >> NETIF_F_GSO_SHIFT));
+ BUILD_BUG_ON(SKB_GSO_UDP6 != (NETIF_F_UFO6 >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_DODGY != (NETIF_F_GSO_ROBUST >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_TCP_ECN != (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_TCPV6 != (NETIF_F_TSO6 >> NETIF_F_GSO_SHIFT));
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6c8b6f6..8538b67 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -372,6 +372,7 @@ enum {
SKB_GSO_MPLS = 1 << 12,
+ SKB_GSO_UDP6 = 1 << 13
};
#if BITS_PER_LONG > 32
diff --git a/net/core/dev.c b/net/core/dev.c
index 945bbd0..fa4d2ee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5929,6 +5929,12 @@ static netdev_features_t netdev_fix_features(struct net_device *dev,
features &= ~NETIF_F_ALL_TSO;
}
+ /* UFO requires that SG is present as well */
+ if ((features & NETIF_F_ALL_UFO) && !(features & NETIF_F_SG)) {
+ netdev_dbg(dev, "Dropping UFO features since no SG feature.\n");
+ features &= ~NETIF_F_ALL_UFO;
+ }
+
if ((features & NETIF_F_TSO) && !(features & NETIF_F_HW_CSUM) &&
!(features & NETIF_F_IP_CSUM)) {
netdev_dbg(dev, "Dropping TSO features since no CSUM feature.\n");
@@ -5952,24 +5958,21 @@ static netdev_features_t netdev_fix_features(struct net_device *dev,
features &= ~NETIF_F_GSO;
}
- /* UFO needs SG and checksumming */
- if (features & NETIF_F_UFO) {
- /* maybe split UFO into V4 and V6? */
- if (!((features & NETIF_F_GEN_CSUM) ||
- (features & (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))
- == (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {
- netdev_dbg(dev,
- "Dropping NETIF_F_UFO since no checksum offload features.\n");
- features &= ~NETIF_F_UFO;
- }
-
- if (!(features & NETIF_F_SG)) {
- netdev_dbg(dev,
- "Dropping NETIF_F_UFO since no NETIF_F_SG feature.\n");
- features &= ~NETIF_F_UFO;
- }
+ /* UFO also needs checksumming */
+ if ((features & NETIF_F_UFO) && !(features & NETIF_F_GEN_CSUM) &&
+ !(features & NETIF_F_IP_CSUM)) {
+ netdev_dbg(dev,
+ "Dropping NETIF_F_UFO since no checksum offload features.\n");
+ features &= ~NETIF_F_UFO;
+ }
+ if ((features & NETIF_F_UFO6) && !(features & NETIF_F_GEN_CSUM) &&
+ !(features & NETIF_F_IPV6_CSUM)) {
+ netdev_dbg(dev,
+ "Dropping NETIF_F_UFO6 since no checksum offload features.\n");
+ features &= ~NETIF_F_UFO6;
}
+
#ifdef CONFIG_NET_RX_BUSY_POLL
if (dev->netdev_ops->ndo_busy_poll)
features |= NETIF_F_BUSY_POLL;
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 06dfb29..93eff41 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -223,7 +223,7 @@ static netdev_features_t ethtool_get_feature_mask(u32 eth_cmd)
return NETIF_F_ALL_TSO;
case ETHTOOL_GUFO:
case ETHTOOL_SUFO:
- return NETIF_F_UFO;
+ return NETIF_F_ALL_UFO;
case ETHTOOL_GGSO:
case ETHTOOL_SGSO:
return NETIF_F_GSO;
--
1.9.3
^ permalink raw reply related
* [PATCH 02/10] net: Correctly mark IPv6 UFO offload type.
From: Vladislav Yasevich @ 2014-12-17 18:20 UTC (permalink / raw)
To: netdev; +Cc: virtualization, mst, ben, stefanha, Vladislav Yasevich
In-Reply-To: <1418840455-22598-1-git-send-email-vyasevic@redhat.com>
If the device supports UFO6 features, mark the offloaded ipv6 udp
traffic appropriately.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
net/ipv6/ip6_offload.c | 1 +
net/ipv6/ip6_output.c | 4 ++--
net/ipv6/udp_offload.c | 3 ++-
net/mpls/mpls_gso.c | 1 +
4 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 01e12d0..bd985d5 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -81,6 +81,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
SKB_GSO_UDP_TUNNEL_CSUM |
SKB_GSO_MPLS |
SKB_GSO_TCPV6 |
+ SKB_GSO_UDP6 |
0)))
goto out;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 8e950c2..83f5c04 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1089,7 +1089,7 @@ static inline int ip6_ufo_append_data(struct sock *sk,
*/
skb_shinfo(skb)->gso_size = (mtu - fragheaderlen -
sizeof(struct frag_hdr)) & ~7;
- skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
+ skb_shinfo(skb)->gso_type = SKB_GSO_UDP6;
ipv6_select_ident(&fhdr, rt);
skb_shinfo(skb)->ip6_frag_id = fhdr.identification;
@@ -1296,7 +1296,7 @@ emsgsize:
if (((length > mtu) ||
(skb && skb_is_gso(skb))) &&
(sk->sk_protocol == IPPROTO_UDP) &&
- (rt->dst.dev->features & NETIF_F_UFO)) {
+ (rt->dst.dev->features & NETIF_F_UFO6)) {
err = ip6_ufo_append_data(sk, getfrag, from, length,
hh_len, fragheaderlen,
transhdrlen, mtu, flags, rt);
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 6b8f543..00d723e 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -39,6 +39,7 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
int type = skb_shinfo(skb)->gso_type;
if (unlikely(type & ~(SKB_GSO_UDP |
+ SKB_GSO_UDP6 |
SKB_GSO_DODGY |
SKB_GSO_UDP_TUNNEL |
SKB_GSO_UDP_TUNNEL_CSUM |
@@ -47,7 +48,7 @@ static struct sk_buff *udp6_ufo_fragment(struct sk_buff *skb,
SKB_GSO_IPIP |
SKB_GSO_SIT |
SKB_GSO_MPLS) ||
- !(type & (SKB_GSO_UDP))))
+ !(type & (SKB_GSO_UDP | SKB_GSO_UDP6))))
goto out;
skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
index e3545f2..27343f0 100644
--- a/net/mpls/mpls_gso.c
+++ b/net/mpls/mpls_gso.c
@@ -30,6 +30,7 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
~(SKB_GSO_TCPV4 |
SKB_GSO_TCPV6 |
SKB_GSO_UDP |
+ SKB_GSO_UDP6 |
SKB_GSO_DODGY |
SKB_GSO_TCP_ECN |
SKB_GSO_GRE |
--
1.9.3
^ permalink raw reply related
* [PATCH 04/10] loopback: Turn on UFO6 support.
From: Vladislav Yasevich @ 2014-12-17 18:20 UTC (permalink / raw)
To: netdev; +Cc: virtualization, mst, ben, stefanha, Vladislav Yasevich
In-Reply-To: <1418840455-22598-1-git-send-email-vyasevic@redhat.com>
Turn on UFO6 support.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
---
drivers/net/loopback.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index c76283c..762c28a 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -170,10 +170,10 @@ static void loopback_setup(struct net_device *dev)
dev->flags = IFF_LOOPBACK;
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
netif_keep_dst(dev);
- dev->hw_features = NETIF_F_ALL_TSO | NETIF_F_UFO;
+ dev->hw_features = NETIF_F_ALL_TSO | NETIF_F_ALL_UFO;
dev->features = NETIF_F_SG | NETIF_F_FRAGLIST
| NETIF_F_ALL_TSO
- | NETIF_F_UFO
+ | NETIF_F_ALL_UFO
| NETIF_F_HW_CSUM
| NETIF_F_RXCSUM
| NETIF_F_SCTP_CSUM
--
1.9.3
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox