Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints
From: Alexei Starovoitov @ 2016-04-18 21:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <20160418162905.220df2f4@gandalf.local.home>

On 4/18/16 1:29 PM, Steven Rostedt wrote:
> On Mon, 4 Apr 2016 21:52:48 -0700
> Alexei Starovoitov <ast@fb.com> wrote:
>
>> introduce BPF_PROG_TYPE_TRACEPOINT program type and allow it to be
>> attached to tracepoints.
>> The tracepoint will copy the arguments in the per-cpu buffer and pass
>> it to the bpf program as its first argument.
>> The layout of the fields can be discovered by doing
>> 'cat /sys/kernel/debug/tracing/events/sched/sched_switch/format'
>> prior to the compilation of the program with exception that first 8 bytes
>> are reserved and not accessible to the program. This area is used to store
>> the pointer to 'struct pt_regs' which some of the bpf helpers will use:
>> +---------+
>> | 8 bytes | hidden 'struct pt_regs *' (inaccessible to bpf program)
>> +---------+
>> | N bytes | static tracepoint fields defined in tracepoint/format (bpf readonly)
>> +---------+
>> | dynamic | __dynamic_array bytes of tracepoint (inaccessible to bpf yet)
>> +---------+
>>
>> Not that all of the fields are already dumped to user space via perf ring buffer
>> and some application access it directly without consulting tracepoint/format.
>> Same rule applies here: static tracepoint fields should only be accessed
>> in a format defined in tracepoint/format. The order of fields and
>> field sizes are not an ABI.
>>
>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> ---
...
>> -	entry = perf_trace_buf_prepare(__entry_size,			\
>> -			event_call->event.type, &__regs, &rctx);	\
>> +	event_type = prog ? TRACE_EVENT_TYPE_MAX : event_call->event.type; \
>
> Can you move this into perf_trace_entry_prepare?

that's the old version.
The last one are commits 1e1dcd93b46 and 98b5c2c65c295 in net-next.

>> +	if (prog) {							\
>> +		*(struct pt_regs **)entry = __regs;			\
>> +		if (!trace_call_bpf(prog, entry) || hlist_empty(head)) { \
>> +			perf_swevent_put_recursion_context(rctx);	\
>> +			return;						\
>> +		}							\
>> +		memset(&entry->ent, 0, sizeof(entry->ent));		\
>> +	}								\
>
> And perhaps this into perf_trace_buf_submit()?
>
> Tracepoints are a major cause of bloat, and the reasons for these
> prepare and submit functions is to move code out of the macros. Every
> tracepoint in the kernel (1000 and counting) will include this code.
> I've already had complaints that each tracepoint can add up to 5k to
> the core.

I was worried about this too, but single 'if' and two calls
(as in commit 98b5c2c65c295) is a better way, since it's faster, cleaner
and doesn't need to refactor the whole perf_trace_buf_submit() to pass
extra event_call argument to it.
perf_trace_buf_submit() is already ugly with 8 arguments!
Passing more args or creating a struct to pass args only going to
hurt performance without much reduction in .text size.
tinyfication folks will disable tracepoints anyway.
Note that the most common case is bpf returning 0 and not even
calling perf_trace_buf_submit() which is already slow due
to so many args passed on stack.
This stuff is called million times a second, so every instruction
counts.

^ permalink raw reply

* Re: [PATCH net-next] macvlan: fix failure during registration
From: Eric W. Biederman @ 2016-04-18 21:33 UTC (permalink / raw)
  To: Francesco Ruggeri; +Cc: netdev, David S. Miller, Mahesh Bandewar
In-Reply-To: <CA+HUmGhUqXn35x6Nnhcrk+WH8PuiaxwsrZNy-NsJ94wybCukGA@mail.gmail.com>

Francesco Ruggeri <fruggeri@arista.com> writes:

> On Mon, Apr 18, 2016 at 11:48 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> These interactions all seem a little bit funny.  At a quick skim it
>> would make more sense to increment the port count in macvlan_init,
>> and completely remove the need to mess with port counts anywhere except
>> macvlan_init and macvlan_uninit.
>
> Thanks Eric, let me try that.
>
>>
>> If for some reason that can't be done the code can easily look at
>> dev->reg_state.  If dev->reg_state == NETREG_UNITIALIZED it should
>> be exactly the same as your new flag being set.
>>
>
> I am not sure that in macvlan_uninit one can tell whether it is being
> invoked in the context of a failed register_netdevice (if that is what
> you meant).
>
> In case of register_netdevice failing in
> call_netdevice_notifiers(NETDEV_REGISTER) the call sequence is:
>
> macvlan_common_newlink
>   register_netdevice
>     call_netdevice_notifiers(NETDEV_REGISTER, dev) (<== fail here)
>       rollback_registered(dev);
>         rollback_registered_many
>           dev->reg_state = NETREG_UNREGISTERING
>           dev->netdev_ops->ndo_uninit(dev)
>       dev->reg_state = NETREG_UNREGISTERED;
>

The code I have looks a little different.  But that is a good point.

But please see if you can get macvlan_init to do the necessary work.
That should simplify everything, and make clever games unnecessary.

Eric

^ permalink raw reply

* Re: [PATCH] macsec: fix crypto Kconfig dependency
From: Stephen Rothwell @ 2016-04-18 21:44 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, arnd, sd, hannes, johannes, netdev, linux-kernel
In-Reply-To: <20160418.124316.1862068166928198574.davem@davemloft.net>

Hi Dave,

On Mon, 18 Apr 2016 12:43:16 -0400 (EDT) David Miller <davem@davemloft.net> wrote:
>
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 18 Apr 2016 18:43:36 +0800
> 
> > Right, the problem is that nothing within crypto ever selects
> > CRYPTO since it's also used as a way of hiding the crypto menu
> > options.  
> 
> As far as I understand it, this won't help.  Because selects do not
> trigger other selects and dependencies.

My understanding is that selects trigger other selects, but do not
honour dependencies.  So, in general, you should not select a symbol
that has dependencies (unless you also have the same dependencies or
select those dependencies).

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply

* Re: [PATCH nf] netfilter: ipv6: Orphan skbs in nf_ct_frag6_gather()
From: Joe Stringer @ 2016-04-18 21:49 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Florian Westphal, David Laight, netfilter-devel@vger.kernel.org,
	diproiettod@vmware.com, netdev@vger.kernel.org
In-Reply-To: <20160418183546.GA4289@salvia>

On 18 April 2016 at 11:35, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Thu, Apr 14, 2016 at 05:35:39PM -0700, Joe Stringer wrote:
>> On 14 April 2016 at 03:35, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > On Thu, Apr 14, 2016 at 10:40:15AM +0200, Florian Westphal wrote:
>> >> David Laight <David.Laight@ACULAB.COM> wrote:
>> >> > From: Joe Stringer
>> >> > > Sent: 13 April 2016 19:10
>> >> > > This is the IPv6 equivalent of commit 8282f27449bf ("inet: frag: Always
>> >> > > orphan skbs inside ip_defrag()").
>> >> > >
>> >> > > Prior to commit 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free
>> >> > > clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be
>> >> > > cloned (implicitly orphaning) prior to queueing for reassembly. As such,
>> >> > > when the IPv6 message is eventually reassembled, the skb->sk for all
>> >> > > fragments would be NULL. After that commit was introduced, rather than
>> >> > > cloning, the original skbs were queued directly without orphaning. The
>> >> > > end result is that all frags except for the first and last may have a
>> >> > > socket attached.
>> >> >
>> >> > I'd have thought that the queued fragments would still want to be
>> >> > resource-counted against the socket (I think that is what skb->sk is for).
>> >>
>> >> No, ipv4/ipv6 reasm has its own accouting.
>> >>
>> >> > Although I can't imagine why IPv6 reassembly is happening on skb
>> >> > associated with a socket.
>> >>
>> >> Right, thats a much more interesting question -- both ipv4 and
>> >> ipv6 orphan skbs before NF_HOOK prerouting trip.
>> >>
>> >> (That being said, I don't mind the patch, I'm just be curious how this
>> >>  can happen).
>> >
>> > If this change is specific to get this working in ovs and its
>> > conntrack support, then I don't think this belong to core
>> > infrastructure. This should be fixed in ovs instead.
>>
>> I admit I've only been able to reproduce it with OVS. My main reason
>> for proposing the fix this way was just because this is what the IPv4
>> code does, so I figured IPv6 should be consistent with that.
>
> You mean that this is what you did in 029f7f3b8701 to fix this, right?
>
> But we shouldn't add code to the core that is OVS specific for no
> reason. We don't need this orphan from ipv4 and ipv6 as Florian
> indicated.

That makes complete sense to me. I was wondering whether the original
IPv4 fix was the correct one from the nf core perspective - but it
seems like perhaps it is, given that ip_defrag() has a lot more users
from a lot more different paths which are likely relying on the skb to
be orphaned. In comparison, in the IPv6 path, nf_ct_frag6_gather() is
only called from OVS and the netfilter hooks; the hooks already do the
orphaning, so it's more consistent for OVS to do it as well.

> Is there any chance you can fix this from OVS and its conntrack glue
> code? Thanks.

Sure, I'll resend the patch making the fix in OVS.

^ permalink raw reply

* [PATCHv2 net] openvswitch: Orphan skbs before IPv6 defrag
From: Joe Stringer @ 2016-04-18 21:51 UTC (permalink / raw)
  To: netdev; +Cc: Joe Stringer, fw, netfilter-devel, diproiettod, pablo

This is the IPv6 counterpart to commit 8282f27449bf ("inet: frag: Always
orphan skbs inside ip_defrag()").

Prior to commit 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free
clone operations"), ipv6 fragments sent to nf_ct_frag6_gather() would be
cloned (implicitly orphaning) prior to queueing for reassembly. As such,
when the IPv6 message is eventually reassembled, the skb->sk for all
fragments would be NULL. After that commit was introduced, rather than
cloning, the original skbs were queued directly without orphaning. The
end result is that all frags except for the first and last may have a
socket attached.

This commit explicitly orphans such skbs during nf_ct_frag6_gather() to
prevent BUG_ON(skb->sk) during a later call to ip6_fragment().

kernel BUG at net/ipv6/ip6_output.c:631!
[...]
Call Trace:
 <IRQ>
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffffa042c7c0>] ? do_output.isra.28+0x1b0/0x1b0 [openvswitch]
 [<ffffffff810bb8a2>] ? __lock_is_held+0x52/0x70
 [<ffffffffa042c587>] ovs_fragment+0x1f7/0x280 [openvswitch]
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50
 [<ffffffff81697ea0>] ? dst_discard_out+0x20/0x20
 [<ffffffff81697e80>] ? dst_ifdown+0x80/0x80
 [<ffffffffa042c703>] do_output.isra.28+0xf3/0x1b0 [openvswitch]
 [<ffffffffa042d279>] do_execute_actions+0x709/0x12c0 [openvswitch]
 [<ffffffffa04340a4>] ? ovs_flow_stats_update+0x74/0x1e0 [openvswitch]
 [<ffffffffa04340d1>] ? ovs_flow_stats_update+0xa1/0x1e0 [openvswitch]
 [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffffa042de75>] ovs_execute_actions+0x45/0x120 [openvswitch]
 [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch]
 [<ffffffff817be387>] ? _raw_spin_unlock+0x27/0x40
 [<ffffffffa042def4>] ovs_execute_actions+0xc4/0x120 [openvswitch]
 [<ffffffffa0432d65>] ovs_dp_process_packet+0x85/0x150 [openvswitch]
 [<ffffffffa04337f2>] ? key_extract+0x442/0xc10 [openvswitch]
 [<ffffffffa043b26d>] ovs_vport_receive+0x5d/0xb0 [openvswitch]
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffff810be8f7>] ? __lock_acquire+0x927/0x20a0
 [<ffffffff817be416>] ? _raw_spin_unlock_irqrestore+0x36/0x50
 [<ffffffffa043c11d>] internal_dev_xmit+0x6d/0x150 [openvswitch]
 [<ffffffffa043c0b5>] ? internal_dev_xmit+0x5/0x150 [openvswitch]
 [<ffffffff8168fb5f>] dev_hard_start_xmit+0x2df/0x660
 [<ffffffff8168f5ea>] ? validate_xmit_skb.isra.105.part.106+0x1a/0x2b0
 [<ffffffff81690925>] __dev_queue_xmit+0x8f5/0x950
 [<ffffffff81690080>] ? __dev_queue_xmit+0x50/0x950
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff81690990>] dev_queue_xmit+0x10/0x20
 [<ffffffff8169a418>] neigh_resolve_output+0x178/0x220
 [<ffffffff81752759>] ? ip6_finish_output2+0x219/0x7b0
 [<ffffffff81752759>] ip6_finish_output2+0x219/0x7b0
 [<ffffffff817525a5>] ? ip6_finish_output2+0x65/0x7b0
 [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80
 [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50
 [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50
 [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40
 [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0
 [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0
 [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50
 [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80
 [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0
 [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50
 [<ffffffff817796cc>] icmpv6_push_pending_frames+0xac/0xe0
 [<ffffffff8177a4be>] icmpv6_echo_reply+0x42e/0x500
 [<ffffffff8177acbf>] icmpv6_rcv+0x4cf/0x580
 [<ffffffff81755ac7>] ip6_input_finish+0x1a7/0x690
 [<ffffffff81755925>] ? ip6_input_finish+0x5/0x690
 [<ffffffff817567a0>] ip6_input+0x30/0xa0
 [<ffffffff81755920>] ? ip6_rcv_finish+0x1a0/0x1a0
 [<ffffffff817557ce>] ip6_rcv_finish+0x4e/0x1a0
 [<ffffffff8175640f>] ipv6_rcv+0x45f/0x7c0
 [<ffffffff81755fe6>] ? ipv6_rcv+0x36/0x7c0
 [<ffffffff81755780>] ? ip6_make_skb+0x1c0/0x1c0
 [<ffffffff8168b649>] __netif_receive_skb_core+0x229/0xb80
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff8168c07f>] ? process_backlog+0x6f/0x230
 [<ffffffff8168bfb6>] __netif_receive_skb+0x16/0x70
 [<ffffffff8168c088>] process_backlog+0x78/0x230
 [<ffffffff8168c0ed>] ? process_backlog+0xdd/0x230
 [<ffffffff8168db43>] net_rx_action+0x203/0x480
 [<ffffffff810bdab5>] ? mark_held_locks+0x75/0xa0
 [<ffffffff817c156e>] __do_softirq+0xde/0x49f
 [<ffffffff81752768>] ? ip6_finish_output2+0x228/0x7b0
 [<ffffffff817c070c>] do_softirq_own_stack+0x1c/0x30
 <EOI>
 [<ffffffff8106f88b>] do_softirq.part.18+0x3b/0x40
 [<ffffffff8106f946>] __local_bh_enable_ip+0xb6/0xc0
 [<ffffffff81752791>] ip6_finish_output2+0x251/0x7b0
 [<ffffffff81754af1>] ? ip6_fragment+0xba1/0xc50
 [<ffffffff816cde2b>] ? ip_idents_reserve+0x6b/0x80
 [<ffffffff8175488f>] ? ip6_fragment+0x93f/0xc50
 [<ffffffff81754af1>] ip6_fragment+0xba1/0xc50
 [<ffffffff81752540>] ? ip6_flush_pending_frames+0x40/0x40
 [<ffffffff81754c6b>] ip6_finish_output+0xcb/0x1d0
 [<ffffffff81754dcf>] ip6_output+0x5f/0x1a0
 [<ffffffff81754ba0>] ? ip6_fragment+0xc50/0xc50
 [<ffffffff81797fbd>] ip6_local_out+0x3d/0x80
 [<ffffffff817554df>] ip6_send_skb+0x2f/0xc0
 [<ffffffff817555bd>] ip6_push_pending_frames+0x4d/0x50
 [<ffffffff81778558>] rawv6_sendmsg+0xa28/0xe30
 [<ffffffff81719097>] ? inet_sendmsg+0xc7/0x1d0
 [<ffffffff817190d6>] inet_sendmsg+0x106/0x1d0
 [<ffffffff81718fd5>] ? inet_sendmsg+0x5/0x1d0
 [<ffffffff8166d078>] sock_sendmsg+0x38/0x50
 [<ffffffff8166d4d6>] SYSC_sendto+0xf6/0x170
 [<ffffffff8100201b>] ? trace_hardirqs_on_thunk+0x1b/0x1d
 [<ffffffff8166e38e>] SyS_sendto+0xe/0x10
 [<ffffffff817bebe5>] entry_SYSCALL_64_fastpath+0x18/0xa8
Code: 06 48 83 3f 00 75 26 48 8b 87 d8 00 00 00 2b 87 d0 00 00 00 48 39 d0 72 14 8b 87 e4 00 00 00 83 f8 01 75 09 48 83 7f 18 00 74 9a <0f> 0b 41 8b 86 cc 00 00 00 49 8#
RIP  [<ffffffff8175468a>] ip6_fragment+0x73a/0xc50
 RSP <ffff880072803120>

Fixes: 029f7f3b8701 ("netfilter: ipv6: nf_defrag: avoid/free clone
operations")
Reported-by: Daniele Di Proietto <diproiettod@vmware.com>
Signed-off-by: Joe Stringer <joe@ovn.org>
---
 net/openvswitch/conntrack.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 1b9d286756be..b5fea1101faa 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -367,6 +367,7 @@ static int handle_fragments(struct net *net, struct sw_flow_key *key,
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		enum ip6_defrag_users user = IP6_DEFRAG_CONNTRACK_IN + zone;
 
+		skb_orphan(skb);
 		memset(IP6CB(skb), 0, sizeof(struct inet6_skb_parm));
 		err = nf_ct_frag6_gather(net, skb, user);
 		if (err)
-- 
2.1.4

^ permalink raw reply related

* [PATCH] net: w5100: don't build spi driver without w5100
From: Arnd Bergmann @ 2016-04-18 21:58 UTC (permalink / raw)
  To: David S. Miller, Akinobu Mita, Paul Gortmaker, Arnd Bergmann
  Cc: netdev, linux-kernel

The w5100-spi driver front-end only makes sense when the w5100
core driver is enabled, not for a configuration that only has w5300:

drivers/net/built-in.o: In function `w5100_spi_remove':
drivers/net/ethernet/wiznet/w5100-spi.c:277: undefined reference to `w5100_remove'
drivers/net/built-in.o: In function `w5100_spi_probe':
drivers/net/ethernet/wiznet/w5100-spi.c:272: undefined reference to `w5100_probe'
drivers/net/built-in.o: In function `w5200_spi_init':
drivers/net/ethernet/wiznet/w5100-spi.c:125: undefined reference to `w5100_ops_priv'
drivers/net/built-in.o: In function `w5200_spi_readbulk':
drivers/net/ethernet/wiznet/w5100-spi.c:125: undefined reference to `w5100_ops_priv'
drivers/net/built-in.o: In function `w5200_spi_writebulk':
drivers/net/ethernet/wiznet/w5100-spi.c:125: undefined reference to `w5100_ops_priv'
drivers/net/built-in.o:(.data+0x3ed1c): undefined reference to `w5100_pm_ops'

This adds an appropriate Kconfig dependency.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 630cf09751fe ("net: w5100: support SPI interface mode")
---
 drivers/net/ethernet/wiznet/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/wiznet/Kconfig b/drivers/net/ethernet/wiznet/Kconfig
index 1f15376e9856..f3385a1999a2 100644
--- a/drivers/net/ethernet/wiznet/Kconfig
+++ b/drivers/net/ethernet/wiznet/Kconfig
@@ -71,7 +71,7 @@ endchoice
 
 config WIZNET_W5100_SPI
 	tristate "WIZnet W5100/W5200 Ethernet support for SPI mode"
-	depends on WIZNET_BUS_ANY
+	depends on WIZNET_BUS_ANY && WIZNET_W5100
 	depends on SPI
 	---help---
 	  In SPI mode host system accesses registers using SPI protocol
-- 
2.7.0

^ permalink raw reply related

* [PATCH] rtl8xxxu: hide unused tables
From: Arnd Bergmann @ 2016-04-18 21:59 UTC (permalink / raw)
  To: Jes Sorensen, Kalle Valo
  Cc: Arnd Bergmann, Jakub Sitnicki, linux-wireless, netdev,
	linux-kernel

The references to some arrays in the rtl8xxxu driver were moved inside
of an #ifdef, but the symbols remain outside, resulting in build warnings:

rtl8xxxu/rtl8xxxu.c:1506:33: error: 'rtl8188ru_radioa_1t_highpa_table' defined but not used
rtl8xxxu/rtl8xxxu.c:1431:33: error: 'rtl8192cu_radioa_1t_init_table' defined but not used
rtl8xxxu/rtl8xxxu.c:1407:33: error: 'rtl8192cu_radiob_2t_init_table' defined but not used
rtl8xxxu/rtl8xxxu.c:1332:33: error: 'rtl8192cu_radioa_2t_init_table' defined but not used
rtl8xxxu/rtl8xxxu.c:239:35: error: 'rtl8192c_power_base' defined but not used
rtl8xxxu/rtl8xxxu.c:217:35: error: 'rtl8188r_power_base' defined but not used

This adds an extra #ifdef around them to shut up the warnings.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 2fc0b8e5a17d ("rtl8xxxu: Add TX power base values for gen1 parts")
Fixes: 4062b8ffec36 ("rtl8xxxu: Move PHY RF init into device specific functions")
---
 drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c
index 928ca56f751c..0ba84b5fe0d6 100644
--- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c
+++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.c
@@ -214,6 +214,7 @@ static struct rtl8xxxu_reg8val rtl8192e_mac_init_table[] = {
 	{0xffff, 0xff},
 };
 
+#ifdef CONFIG_RTL8XXXU_UNTESTED
 static struct rtl8xxxu_power_base rtl8188r_power_base = {
 	.reg_0e00 = 0x06080808,
 	.reg_0e04 = 0x00040406,
@@ -257,6 +258,7 @@ static struct rtl8xxxu_power_base rtl8192c_power_base = {
 	.reg_084c = 0x0b0c0d0e,
 	.reg_0868 = 0x01030509,
 };
+#endif
 
 static struct rtl8xxxu_power_base rtl8723a_power_base = {
 	.reg_0e00 = 0x0a0c0c0c,
@@ -1329,6 +1331,7 @@ static struct rtl8xxxu_rfregval rtl8723bu_radioa_1t_init_table[] = {
 	{0xff, 0xffffffff}
 };
 
+#ifdef CONFIG_RTL8XXXU_UNTESTED
 static struct rtl8xxxu_rfregval rtl8192cu_radioa_2t_init_table[] = {
 	{0x00, 0x00030159}, {0x01, 0x00031284},
 	{0x02, 0x00098000}, {0x03, 0x00018c63},
@@ -1577,6 +1580,7 @@ static struct rtl8xxxu_rfregval rtl8188ru_radioa_1t_highpa_table[] = {
 	{0x00, 0x00030159},
 	{0xff, 0xffffffff}
 };
+#endif
 
 static struct rtl8xxxu_rfregval rtl8192eu_radioa_init_table[] = {
 	{0x7f, 0x00000082}, {0x81, 0x0003fc00},
-- 
2.7.0

^ permalink raw reply related

* Re: [PATCH] net: phy: Ensure the state machine is called when phy is UP
From: Alexandre Belloni @ 2016-04-18 22:14 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, David S . Miller, Nicolas Ferre, netdev,
	linux-kernel
In-Reply-To: <571169EB.4090300@gmail.com>

On 15/04/2016 at 15:23:39 -0700, Florian Fainelli wrote :
> On 15/04/16 15:17, Alexandre Belloni wrote:
> > On 16/04/2016 at 00:05:08 +0200, Andrew Lunn wrote :
> >>> Trace without my patch:
> >>> libphy: MACB_mii_bus: probed
> >>> macb f8020000.ethernet eth0: Cadence GEM rev 0x00020120 at 0xf8020000 irq 27 (fc:c2:3d:0c:6e:05)
> >>> Micrel KSZ8081 or KSZ8091 f8020000.etherne:01: attached PHY driver [Micrel KSZ8081 or KSZ8091] (mii_bus:phy_addr=f8020000.etherne:01, irq=171)
> >>> Micrel KSZ8081 or KSZ8091 f8020000.etherne:01: PHY state change READY -> READY
> >>> [...]
> >>> Micrel KSZ8081 or KSZ8091 f8020000.etherne:01: PHY state change READY -> READY
> >>
> >> Are there some state changes before this? How is it getting to state
> >> READY? It would expect it to start in DOWN, from when the phy device
> >> was created in phy_device_create().
> >>
> > 
> > No other changes. I forgot to mention that this is when booting with a
> > cable plugged in. Unplugging and replugging the cable makes the link
> > detection work fine even without the patch.
> 
> OK, so the last hunk of the change in d5c3d84657db ("net: phy: Avoid
> polling PHY with PHY_IGNORE_INTERRUPTS"):
> 
> -       queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
> -                          PHY_STATE_TIME * HZ);
> +       /* Only re-schedule a PHY state machine change if we are polling the
> +        * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
> +        * between states from phy_mac_interrupt()
> +        */
> +       if (phydev->irq == PHY_POLL)
> +               queue_delayed_work(system_power_efficient_wq,
> &phydev->state_queue,
> +                                  PHY_STATE_TIME * HZ);
> 
> 
> is presumably what broke for you, right?
> 
> Could you also give this patch a spin and see if it works better with
> it? The macb driver does something racy with how the MDIO and PHY are
> probe wrt. registering the netdev, that needs fixing too.
> 
> diff --git a/drivers/net/ethernet/cadence/macb.c
> b/drivers/net/ethernet/cadence/macb.c
> index eec3200ade4a..98b99149ce0b 100644
> --- a/drivers/net/ethernet/cadence/macb.c
> +++ b/drivers/net/ethernet/cadence/macb.c
> @@ -3005,28 +3005,36 @@ static int macb_probe(struct platform_device *pdev)
>         if (err)
>                 goto err_out_free_netdev;
> 
> +       err = macb_mii_init(bp);
> +       if (err)
> +               goto err_out_free_netdev;
> +
> +       phydev = bp->phy_dev;
> +       phy_attached_info(phydev);
> +
> +       netif_carrier_off(dev);
> +
>         err = register_netdev(dev);
>         if (err) {
>                 dev_err(&pdev->dev, "Cannot register net device,
> aborting.\n");
>                 goto err_out_unregister_netdev;
>         }
> 
> -       err = macb_mii_init(bp);
> -       if (err)
> -               goto err_out_unregister_netdev;
> -
> -       netif_carrier_off(dev);
> -
>         netdev_info(dev, "Cadence %s rev 0x%08x at 0x%08lx irq %d (%pM)\n",
>                     macb_is_gem(bp) ? "GEM" : "MACB", macb_readl(bp, MID),
>                     dev->base_addr, dev->irq, dev->dev_addr);
> 
> -       phydev = bp->phy_dev;
> -       phy_attached_info(phydev);
> -
>         return 0;
> 
>  err_out_unregister_netdev:
> +       phy_disconnect(bp->phy_dev);
> +       mdiobus_unregister(bp->mii_bus);
> +       mdiobus_free(bp->mii_bus);
> +
> +       /* Shutdown the PHY if there is a GPIO reset */
> +       if (bp->reset_gpio)
> +               gpiod_set_value(bp->reset_gpio, 0);
> +
>         unregister_netdev(dev);
> 
>  err_out_free_netdev:
> 

Well, this fails with:

[    2.780000] ------------[ cut here ]------------
[    2.780000] WARNING: CPU: 0 PID: 1 at lib/kobject.c:597 kobject_get+0x6c/0xa0
[    2.790000] kobject: '(null)' (df532280): is not initialized, yet kobject_get() is being called.
[    2.800000] Modules linked in:
[    2.800000] CPU: 0 PID: 1 Comm: swapper Not tainted 4.6.0-rc1+ #46
[    2.810000] Hardware name: Atmel SAMA5
[    2.810000] [<c010cc44>] (unwind_backtrace) from [<c010a76c>] (show_stack+0x10/0x14)
[    2.820000] [<c010a76c>] (show_stack) from [<c0115a70>] (__warn+0xe4/0xfc)
[    2.830000] [<c0115a70>] (__warn) from [<c0115ac0>] (warn_slowpath_fmt+0x38/0x48)
[    2.840000] [<c0115ac0>] (warn_slowpath_fmt) from [<c02d5524>] (kobject_get+0x6c/0xa0)
[    2.840000] [<c02d5524>] (kobject_get) from [<c0343b24>] (device_add+0xac/0x56c)
[    2.850000] [<c0343b24>] (device_add) from [<c03aa900>] (__mdiobus_register+0x8c/0x198)
[    2.860000] [<c03aa900>] (__mdiobus_register) from [<c045ed0c>] (of_mdiobus_register+0x20/0x184)
[    2.870000] [<c045ed0c>] (of_mdiobus_register) from [<c03b18f0>] (macb_probe+0x488/0x898)
[    2.880000] [<c03b18f0>] (macb_probe) from [<c0347ff0>] (platform_drv_probe+0x4c/0xb0)
[    2.880000] [<c0347ff0>] (platform_drv_probe) from [<c0346838>] (driver_probe_device+0x214/0x2c0)
[    2.890000] [<c0346838>] (driver_probe_device) from [<c034699c>] (__driver_attach+0xb8/0xbc)
[    2.900000] [<c034699c>] (__driver_attach) from [<c0344b8c>] (bus_for_each_dev+0x68/0x9c)
[    2.910000] [<c0344b8c>] (bus_for_each_dev) from [<c0345cd0>] (bus_add_driver+0x1a0/0x218)
[    2.920000] [<c0345cd0>] (bus_add_driver) from [<c0347084>] (driver_register+0x78/0xf8)
[    2.930000] [<c0347084>] (driver_register) from [<c010166c>] (do_one_initcall+0x90/0x1dc)
[    2.930000] [<c010166c>] (do_one_initcall) from [<c0800d78>] (kernel_init_freeable+0x134/0x1d4)
[    2.940000] [<c0800d78>] (kernel_init_freeable) from [<c05e7bd0>] (kernel_init+0x8/0x110)
[    2.950000] [<c05e7bd0>] (kernel_init) from [<c0107598>] (ret_from_fork+0x14/0x3c)
[    2.960000] ---[ end trace 81bf87ef8c18d052 ]---


I'm not completely clear why but I think one of the parent is not initialized
until register_netdev() is called.


-- 
Alexandre Belloni, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH net-next 2/8] perf, bpf: allow bpf programs attach to tracepoints
From: Steven Rostedt @ 2016-04-18 22:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
	Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
	netdev, linux-kernel, kernel-team
In-Reply-To: <571554EB.9010702@fb.com>

On Mon, 18 Apr 2016 14:43:07 -0700
Alexei Starovoitov <ast@fb.com> wrote:

> I was worried about this too, but single 'if' and two calls
> (as in commit 98b5c2c65c295) is a better way, since it's faster, cleaner
> and doesn't need to refactor the whole perf_trace_buf_submit() to pass
> extra event_call argument to it.
> perf_trace_buf_submit() is already ugly with 8 arguments!

Right, but I solved that in ftrace by creating an on-stack descriptor
that can be passed by a single parameter. See the "fbuffer" in the
trace_event_raw_event* code.

> Passing more args or creating a struct to pass args only going to
> hurt performance without much reduction in .text size.
> tinyfication folks will disable tracepoints anyway.
> Note that the most common case is bpf returning 0 and not even
> calling perf_trace_buf_submit() which is already slow due
> to so many args passed on stack.
> This stuff is called million times a second, so every instruction
> counts.

Note, that doesn't matter if you are bloating the kernel for the 99.9%
of those that don't use bpf.

Please remember this! Us tracing folks are second class citizens! If
there's a way to speed up tracing by 10%, but in doing so we cause
mainline to be hurt by over 1%, we shouldn't be doing it. Tracing and
hooks on tracepoints are really not used by many people. Don't fall
into Linus's category of "my code is the most important". That's
especially true for tracing.

These macros causes bloat. There's been many complaints about it.
There's a way around it that isn't that bad (with the descriptor), we
should be using it.

-- Steve

^ permalink raw reply

* Re: [PATCH] net: phy: Ensure the state machine is called when phy is UP
From: Florian Fainelli @ 2016-04-18 22:17 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: Andrew Lunn, David S . Miller, Nicolas Ferre, netdev,
	linux-kernel
In-Reply-To: <20160418221433.GX25196@piout.net>

On 18/04/16 15:14, Alexandre Belloni wrote:
> On 15/04/2016 at 15:23:39 -0700, Florian Fainelli wrote :
>> On 15/04/16 15:17, Alexandre Belloni wrote:
>>> On 16/04/2016 at 00:05:08 +0200, Andrew Lunn wrote :
>>>>> Trace without my patch:
>>>>> libphy: MACB_mii_bus: probed
>>>>> macb f8020000.ethernet eth0: Cadence GEM rev 0x00020120 at 0xf8020000 irq 27 (fc:c2:3d:0c:6e:05)
>>>>> Micrel KSZ8081 or KSZ8091 f8020000.etherne:01: attached PHY driver [Micrel KSZ8081 or KSZ8091] (mii_bus:phy_addr=f8020000.etherne:01, irq=171)
>>>>> Micrel KSZ8081 or KSZ8091 f8020000.etherne:01: PHY state change READY -> READY
>>>>> [...]
>>>>> Micrel KSZ8081 or KSZ8091 f8020000.etherne:01: PHY state change READY -> READY
>>>>
>>>> Are there some state changes before this? How is it getting to state
>>>> READY? It would expect it to start in DOWN, from when the phy device
>>>> was created in phy_device_create().
>>>>
>>>
>>> No other changes. I forgot to mention that this is when booting with a
>>> cable plugged in. Unplugging and replugging the cable makes the link
>>> detection work fine even without the patch.
>>
>> OK, so the last hunk of the change in d5c3d84657db ("net: phy: Avoid
>> polling PHY with PHY_IGNORE_INTERRUPTS"):
>>
>> -       queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
>> -                          PHY_STATE_TIME * HZ);
>> +       /* Only re-schedule a PHY state machine change if we are polling the
>> +        * PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
>> +        * between states from phy_mac_interrupt()
>> +        */
>> +       if (phydev->irq == PHY_POLL)
>> +               queue_delayed_work(system_power_efficient_wq,
>> &phydev->state_queue,
>> +                                  PHY_STATE_TIME * HZ);
>>
>>
>> is presumably what broke for you, right?
>>
>> Could you also give this patch a spin and see if it works better with
>> it? The macb driver does something racy with how the MDIO and PHY are
>> probe wrt. registering the netdev, that needs fixing too.
>>
>> diff --git a/drivers/net/ethernet/cadence/macb.c
>> b/drivers/net/ethernet/cadence/macb.c
>> index eec3200ade4a..98b99149ce0b 100644
>> --- a/drivers/net/ethernet/cadence/macb.c
>> +++ b/drivers/net/ethernet/cadence/macb.c
>> @@ -3005,28 +3005,36 @@ static int macb_probe(struct platform_device *pdev)
>>         if (err)
>>                 goto err_out_free_netdev;
>>
>> +       err = macb_mii_init(bp);
>> +       if (err)
>> +               goto err_out_free_netdev;
>> +
>> +       phydev = bp->phy_dev;
>> +       phy_attached_info(phydev);
>> +
>> +       netif_carrier_off(dev);
>> +
>>         err = register_netdev(dev);
>>         if (err) {
>>                 dev_err(&pdev->dev, "Cannot register net device,
>> aborting.\n");
>>                 goto err_out_unregister_netdev;
>>         }
>>
>> -       err = macb_mii_init(bp);
>> -       if (err)
>> -               goto err_out_unregister_netdev;
>> -
>> -       netif_carrier_off(dev);
>> -
>>         netdev_info(dev, "Cadence %s rev 0x%08x at 0x%08lx irq %d (%pM)\n",
>>                     macb_is_gem(bp) ? "GEM" : "MACB", macb_readl(bp, MID),
>>                     dev->base_addr, dev->irq, dev->dev_addr);
>>
>> -       phydev = bp->phy_dev;
>> -       phy_attached_info(phydev);
>> -
>>         return 0;
>>
>>  err_out_unregister_netdev:
>> +       phy_disconnect(bp->phy_dev);
>> +       mdiobus_unregister(bp->mii_bus);
>> +       mdiobus_free(bp->mii_bus);
>> +
>> +       /* Shutdown the PHY if there is a GPIO reset */
>> +       if (bp->reset_gpio)
>> +               gpiod_set_value(bp->reset_gpio, 0);
>> +
>>         unregister_netdev(dev);
>>
>>  err_out_free_netdev:
>>
> 
> Well, this fails with:
> 
> [    2.780000] ------------[ cut here ]------------
> [    2.780000] WARNING: CPU: 0 PID: 1 at lib/kobject.c:597 kobject_get+0x6c/0xa0
> [    2.790000] kobject: '(null)' (df532280): is not initialized, yet kobject_get() is being called.
> [    2.800000] Modules linked in:
> [    2.800000] CPU: 0 PID: 1 Comm: swapper Not tainted 4.6.0-rc1+ #46
> [    2.810000] Hardware name: Atmel SAMA5
> [    2.810000] [<c010cc44>] (unwind_backtrace) from [<c010a76c>] (show_stack+0x10/0x14)
> [    2.820000] [<c010a76c>] (show_stack) from [<c0115a70>] (__warn+0xe4/0xfc)
> [    2.830000] [<c0115a70>] (__warn) from [<c0115ac0>] (warn_slowpath_fmt+0x38/0x48)
> [    2.840000] [<c0115ac0>] (warn_slowpath_fmt) from [<c02d5524>] (kobject_get+0x6c/0xa0)
> [    2.840000] [<c02d5524>] (kobject_get) from [<c0343b24>] (device_add+0xac/0x56c)
> [    2.850000] [<c0343b24>] (device_add) from [<c03aa900>] (__mdiobus_register+0x8c/0x198)
> [    2.860000] [<c03aa900>] (__mdiobus_register) from [<c045ed0c>] (of_mdiobus_register+0x20/0x184)
> [    2.870000] [<c045ed0c>] (of_mdiobus_register) from [<c03b18f0>] (macb_probe+0x488/0x898)
> [    2.880000] [<c03b18f0>] (macb_probe) from [<c0347ff0>] (platform_drv_probe+0x4c/0xb0)
> [    2.880000] [<c0347ff0>] (platform_drv_probe) from [<c0346838>] (driver_probe_device+0x214/0x2c0)
> [    2.890000] [<c0346838>] (driver_probe_device) from [<c034699c>] (__driver_attach+0xb8/0xbc)
> [    2.900000] [<c034699c>] (__driver_attach) from [<c0344b8c>] (bus_for_each_dev+0x68/0x9c)
> [    2.910000] [<c0344b8c>] (bus_for_each_dev) from [<c0345cd0>] (bus_add_driver+0x1a0/0x218)
> [    2.920000] [<c0345cd0>] (bus_add_driver) from [<c0347084>] (driver_register+0x78/0xf8)
> [    2.930000] [<c0347084>] (driver_register) from [<c010166c>] (do_one_initcall+0x90/0x1dc)
> [    2.930000] [<c010166c>] (do_one_initcall) from [<c0800d78>] (kernel_init_freeable+0x134/0x1d4)
> [    2.940000] [<c0800d78>] (kernel_init_freeable) from [<c05e7bd0>] (kernel_init+0x8/0x110)
> [    2.950000] [<c05e7bd0>] (kernel_init) from [<c0107598>] (ret_from_fork+0x14/0x3c)
> [    2.960000] ---[ end trace 81bf87ef8c18d052 ]---
> 
> 
> I'm not completely clear why but I think one of the parent is not initialized
> until register_netdev() is called.

Yes, seems like it, how about adding this:

diff --git a/drivers/net/ethernet/cadence/macb.c
b/drivers/net/ethernet/cadence/macb.c
index 98b99149ce0b..21096dfb0e83 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -440,7 +440,7 @@ static int macb_mii_init(struct macb *bp)
        snprintf(bp->mii_bus->id, MII_BUS_ID_SIZE, "%s-%x",
                 bp->pdev->name, bp->pdev->id);
        bp->mii_bus->priv = bp;
-       bp->mii_bus->parent = &bp->dev->dev;
+       bp->mii_bus->parent = &bp->pdev->dev;
        pdata = dev_get_platdata(&bp->pdev->dev);

        dev_set_drvdata(&bp->dev->dev, bp->mii_bus);
-- 
Florian

^ permalink raw reply related

* [PATCH net-next] net: dsa: remove tag_protocol from dsa_switch
From: Vivien Didelot @ 2016-04-18 22:24 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, kernel, David S. Miller, Florian Fainelli,
	Andrew Lunn, Vivien Didelot

Having the tag protocol in dsa_switch_driver for setup time and in
dsa_switch_tree for runtime is enough. Remove dsa_switch's one.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 include/net/dsa.h | 5 -----
 net/dsa/dsa.c     | 5 ++---
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index c4bc42b..2d280ab 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -136,11 +136,6 @@ struct dsa_switch {
 	void *priv;
 
 	/*
-	 * Tagging protocol understood by this switch
-	 */
-	enum dsa_tag_protocol	tag_protocol;
-
-	/*
 	 * Configuration data for this switch.
 	 */
 	struct dsa_chip_data	*pd;
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index efa612f..d61ceed 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -267,7 +267,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent)
 	 * switch.
 	 */
 	if (dst->cpu_switch == index) {
-		switch (ds->tag_protocol) {
+		switch (drv->tag_protocol) {
 #ifdef CONFIG_NET_DSA_TAG_DSA
 		case DSA_TAG_PROTO_DSA:
 			dst->rcv = dsa_netdev_ops.rcv;
@@ -295,7 +295,7 @@ static int dsa_switch_setup_one(struct dsa_switch *ds, struct device *parent)
 			goto out;
 		}
 
-		dst->tag_protocol = ds->tag_protocol;
+		dst->tag_protocol = drv->tag_protocol;
 	}
 
 	/*
@@ -411,7 +411,6 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
 	ds->pd = pd;
 	ds->drv = drv;
 	ds->priv = priv;
-	ds->tag_protocol = drv->tag_protocol;
 	ds->master_dev = host_dev;
 
 	ret = dsa_switch_setup_one(ds, parent);
-- 
2.8.0

^ permalink raw reply related

* [PATCH net] tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks
From: Martin KaFai Lau @ 2016-04-18 22:39 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng

Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received,
it could incorrectly think that a skb has already
been acked and queue a SCM_TSTAMP_ACK cmsg to the
sk->sk_error_queue.

In tcp_ack_tstamp(), it checks
'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'.
If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill
script, between() returns true but the tskey is actually not acked.
e.g. try between(3, 2, 1).

The fix is to replace between() with one before() and one !before().
By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be
removed.

A packetdrill script is used to reproduce the dup ack scenario.
Due to the lacking cmsg support in packetdrill (may be I
cannot find it),  a BPF prog is used to kprobe to
sock_queue_err_skb() and print out the value of
serr->ee.ee_data.

Both the packetdrill and the bcc BPF script is attached at the end of
this commit message.

BPF Output Before Fix:
~~~~~~
      <...>-2056  [001] d.s.   433.927987: : ee_data:1459  #incorrect
packetdrill-2056  [001] d.s.   433.929563: : ee_data:1459  #incorrect
packetdrill-2056  [001] d.s.   433.930765: : ee_data:1459  #incorrect
packetdrill-2056  [001] d.s.   434.028177: : ee_data:1459
packetdrill-2056  [001] d.s.   434.029686: : ee_data:14599

BPF Output After Fix:
~~~~~~
      <...>-2049  [000] d.s.   113.517039: : ee_data:1459
      <...>-2049  [000] d.s.   113.517253: : ee_data:14599

BCC BPF Script:
~~~~~~
#!/usr/bin/env python

from __future__ import print_function
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#include <linux/errqueue.h>

#ifdef memset
#undef memset
#endif

int trace_err_skb(struct pt_regs *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->si;
	struct sock *sk = (struct sock *)ctx->di;
	struct sock_exterr_skb *serr;
	u32 ee_data = 0;

	if (!sk || !skb)
		return 0;

	serr = SKB_EXT_ERR(skb);
	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
	bpf_trace_printk("ee_data:%u\\n", ee_data);

	return 0;
};
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
print("Attached to kprobe")
b.trace_print()

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 1460) = 1460
0.200 write(4, ..., 13140) = 13140

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f7..0edb071 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3098,7 +3098,8 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
 
 	shinfo = skb_shinfo(skb);
 	if ((shinfo->tx_flags & SKBTX_ACK_TSTAMP) &&
-	    between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1))
+	    !before(shinfo->tskey, prior_snd_una) &&
+	    before(shinfo->tskey, tcp_sk(sk)->snd_una))
 		__skb_tstamp_tx(skb, NULL, sk, SCM_TSTAMP_ACK);
 }
 
-- 
2.5.1

^ permalink raw reply related

* Re: [PATCH] net: phy: Ensure the state machine is called when phy is UP
From: Alexandre Belloni @ 2016-04-18 22:42 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, David S . Miller, Nicolas Ferre, netdev,
	linux-kernel
In-Reply-To: <57155D16.2080306@gmail.com>

On 18/04/2016 at 15:17:58 -0700, Florian Fainelli wrote :
> Yes, seems like it, how about adding this:
> 
> diff --git a/drivers/net/ethernet/cadence/macb.c
> b/drivers/net/ethernet/cadence/macb.c
> index 98b99149ce0b..21096dfb0e83 100644
> --- a/drivers/net/ethernet/cadence/macb.c
> +++ b/drivers/net/ethernet/cadence/macb.c
> @@ -440,7 +440,7 @@ static int macb_mii_init(struct macb *bp)
>         snprintf(bp->mii_bus->id, MII_BUS_ID_SIZE, "%s-%x",
>                  bp->pdev->name, bp->pdev->id);
>         bp->mii_bus->priv = bp;
> -       bp->mii_bus->parent = &bp->dev->dev;
> +       bp->mii_bus->parent = &bp->pdev->dev;
>         pdata = dev_get_platdata(&bp->pdev->dev);
> 
>         dev_set_drvdata(&bp->dev->dev, bp->mii_bus);

Works fine.

But still, this doesn't solve the phy issue ;)


-- 
Alexandre Belloni, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

^ permalink raw reply

* [RFC PATCH v2 net-next 5/7] tcp: Make use of MSG_EOR in tcp_sendpage
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461019569-3037369-1-git-send-email-kafai@fb.com>

It reuses the tcp_sendmsg_noappend() to decide if a new_segment
is needed before entering the loop.  More checks could be added
later for the tcp_sendpage case to decide if a new_segment is
needed immediately.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2918f42..6bb33b8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -913,6 +913,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 		goto out_err;
 
+	if (tcp_sendmsg_noappend(sk, sk->sk_tsflags))
+		goto new_segment;
+
 	while (size > 0) {
 		struct sk_buff *skb = tcp_write_queue_tail(sk);
 		int copy, i;
@@ -969,7 +972,7 @@ new_segment:
 		offset += copy;
 		size -= copy;
 		if (!size) {
-			tcp_tx_timestamp(sk, sk->sk_tsflags, skb, 0);
+			tcp_tx_timestamp(sk, sk->sk_tsflags, skb, flags);
 			goto out;
 		}
 
-- 
2.5.1

^ permalink raw reply related

* [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461019569-3037369-1-git-send-email-kafai@fb.com>

This patch allows the user process to use MSG_EOR during
tcp_sendmsg to tell the kernel that it is the last byte
of an application response message.

It is currently useful when the end-user has turned on any bit of the
SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
The kernel will then mark the newly added tcb->eor_info bit so
that the shinfo->tskey will not be overwritten (i.e. lost) in
the later skb append/collapse operation.

With selective SOF_TIMESTAMPING_TX_ACK (by cmsg) and MSG_EOR (this
patch), the user application can specially tell which outgoing byte
it wants to track its ACK and ask the kernel not to lose this
tracking info in the later skb append/collapse action.

This patch handles the append case in tcp_sendmsg.  The later
patches will handle the collapse during retransmission and
skb slicing in tcp_fragment()/tso_fragment().

One of our use case is at the webserver.  The webserver tracks
the HTTP2 response latency by measuring when the webserver sends
the first byte to the socket till the TCP ACK of the last byte
is received.  In the cases where we don't have client side
measurement, measuring from the server side is the only option.
In the cases we have the client side measurement, the server side
data can also be used to justify/cross-check-with the client
side data.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h |  5 ++++-
 net/ipv4/tcp.c    | 21 +++++++++++++++++----
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c0ef054..f3c5dcb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -762,7 +762,10 @@ struct tcp_skb_cb {
 
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
 	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
-			unused:7;
+			eor_info:1,	/* Any EOR marked info that prevents
+					 * skbs from merging.
+					 */
+			unused:6;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct inet_skb_parm	h4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4d73858..2918f42 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -428,15 +428,18 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
-static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb)
+static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb,
+			     int flags)
 {
 	if (sk->sk_tsflags || tsflags) {
 		struct skb_shared_info *shinfo = skb_shinfo(skb);
 		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
 		sock_tx_timestamp(sk, tsflags, &shinfo->tx_flags);
-		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
+		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP) {
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+			tcb->eor_info = !!(flags & MSG_EOR);
+		}
 		tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
 	}
 }
@@ -874,6 +877,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
+static bool tcp_sendmsg_noappend(struct sock *sk, u16 tx_tsflags)
+{
+	return unlikely((tx_tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK) &&
+			tcp_send_head(sk) &&
+			TCP_SKB_CB(tcp_write_queue_tail(sk))->eor_info);
+}
+
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 				size_t size, int flags)
 {
@@ -959,7 +969,7 @@ new_segment:
 		offset += copy;
 		size -= copy;
 		if (!size) {
-			tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
+			tcp_tx_timestamp(sk, sk->sk_tsflags, skb, 0);
 			goto out;
 		}
 
@@ -1145,6 +1155,9 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 
 	sg = !!(sk->sk_route_caps & NETIF_F_SG);
 
+	if (tcp_sendmsg_noappend(sk, sockc.tsflags))
+		goto new_segment;
+
 	while (msg_data_left(msg)) {
 		int copy = 0;
 		int max = size_goal;
@@ -1249,7 +1262,7 @@ new_segment:
 
 		copied += copy;
 		if (!msg_data_left(msg)) {
-			tcp_tx_timestamp(sk, sockc.tsflags, skb);
+			tcp_tx_timestamp(sk, sockc.tsflags, skb, flags);
 			goto out;
 		}
 
-- 
2.5.1

^ permalink raw reply related

* [RFC PATCH v2 net-next 6/7] tcp: Carry eor_info in tcp_fragment_tstamp() and tcp_skb_collapse_tstamp()
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461019569-3037369-1-git-send-email-kafai@fb.com>

Like txstamp_ack bit, if needed, the eor_info bit should also be carried
to the new skb2 when splitting a skb
or
to the prev skb from the next_skb when collapsing skbs.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_output.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d21a78f..e71336c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1125,6 +1125,8 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
 		swap(shinfo->tskey, shinfo2->tskey);
 		TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
 		TCP_SKB_CB(skb)->txstamp_ack = 0;
+		TCP_SKB_CB(skb2)->eor_info = TCP_SKB_CB(skb)->eor_info;
+		TCP_SKB_CB(skb)->eor_info = 0;
 	}
 }
 
@@ -2456,6 +2458,8 @@ void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 		shinfo->tskey = next_shinfo->tskey;
 		TCP_SKB_CB(skb)->txstamp_ack =
 			!!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
+		if (TCP_SKB_CB(next_skb)->eor_info)
+			TCP_SKB_CB(skb)->eor_info = 1;
 	}
 }
 
-- 
2.5.1

^ permalink raw reply related

* [RFC PATCH v2 net-next 3/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_shifted_skb
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461019569-3037369-1-git-send-email-kafai@fb.com>

After receiving sacks, tcp_shifted_skb() will collapse
skbs if possible.  tx_flags/tskey/txstamp_ack also has
to be merged in this case.

This patch resues the tcp_skb_collapse_tstamp() to handle
them.

BPF Output Before:
~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~
<...>-2024  [007] d.s.    88.644374: : ee_data:14599

Packetdrill Script:
~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 1460) = 1460
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 13140) = 13140
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h     | 2 ++
 net/ipv4/tcp_input.c  | 1 +
 net/ipv4/tcp_output.c | 4 ++--
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fd40f8c..c0ef054 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -557,6 +557,8 @@ void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
 bool tcp_schedule_loss_probe(struct sock *sk);
+void tcp_skb_collapse_tstamp(struct sk_buff *skb,
+			     const struct sk_buff *next_skb);
 
 /* tcp_input.c */
 void tcp_resume_early_retransmit(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5e45a9c..75e8336 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1309,6 +1309,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
 	if (skb == tcp_highest_sack(sk))
 		tcp_advance_highest_sack(sk, skb);
 
+	tcp_skb_collapse_tstamp(prev, skb);
 	tcp_unlink_write_queue(skb, sk);
 	sk_wmem_free_skb(sk, skb);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 889ed96..d21a78f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2443,8 +2443,8 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
-static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
-				    const struct sk_buff *next_skb)
+void tcp_skb_collapse_tstamp(struct sk_buff *skb,
+			     const struct sk_buff *next_skb)
 {
 	const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
 
-- 
2.5.1

^ permalink raw reply related

* [RFC PATCH v2 net-next 2/7] tcp: Merge tx_flags/tskey/txstamp_ack in tcp_collapse_retrans
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461019569-3037369-1-git-send-email-kafai@fb.com>

If two skbs are merged/collapsed during retransmission, the current
logic does not merge the tx_flags, tskey and txstamp_ack.  The end
result is the SCM_TSTAMP_ACK timestamp could be missing for a
packet that the end-user has specifically turned on
SOF_TIMESTAMPING_TX_ACK (e.g. by cmsg).

The patch:
1. Merge the tx_flags and txstamp_ack
2. Overwrite the tskey with the later skb (next_skb)

BPF Output Before:
~~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~~
packetdrill-2092  [001] d.s.   453.998486: : ee_data:1459

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_output.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0527ce9..889ed96 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2443,6 +2443,22 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
+static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
+				    const struct sk_buff *next_skb)
+{
+	const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
+
+	if (unlikely(next_shinfo->tx_flags & SKBTX_ANY_TSTAMP)) {
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		u8 tsflags = next_shinfo->tx_flags & SKBTX_ANY_TSTAMP;
+
+		shinfo->tx_flags |= tsflags;
+		shinfo->tskey = next_shinfo->tskey;
+		TCP_SKB_CB(skb)->txstamp_ack =
+			!!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
+	}
+}
+
 /* Collapses two adjacent SKB's during retransmission. */
 static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 {
@@ -2486,6 +2502,8 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	tcp_adjust_pcount(sk, next_skb, tcp_skb_pcount(next_skb));
 
+	tcp_skb_collapse_tstamp(skb, next_skb);
+
 	sk_wmem_free_skb(sk, next_skb);
 }
 
-- 
2.5.1

^ permalink raw reply related

* [RFC PATCH v2 net-next 1/7] tcp: Carry txstamp_ack in tcp_fragment_tstamp
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461019569-3037369-1-git-send-email-kafai@fb.com>

When a tcp skb is sliced into two smaller skbs (e.g. in
tcp_fragment() and tso_fragment()),  it does not carry
the txstamp_ack bit to the newly created skb if it is needed.
The end result is a timestamping event (SCM_TSTAMP_ACK) will
be missing from the sk->sk_error_queue.

This patch carries this bit to the new skb2 (if needed)
in tcp_fragment_tstamp().

BPF Output Before:
~~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~~
<...>-2050  [000] d.s.   100.928763: : ee_data:14599

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 14600) = 14600
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257

0.300 close(4) = 0
0.300 > F. 14601:14601(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6451b83..0527ce9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1123,6 +1123,8 @@ static void tcp_fragment_tstamp(struct sk_buff *skb, struct sk_buff *skb2)
 		shinfo->tx_flags &= ~tsflags;
 		shinfo2->tx_flags |= tsflags;
 		swap(shinfo->tskey, shinfo2->tskey);
+		TCP_SKB_CB(skb2)->txstamp_ack = TCP_SKB_CB(skb)->txstamp_ack;
+		TCP_SKB_CB(skb)->txstamp_ack = 0;
 	}
 }
 
-- 
2.5.1

^ permalink raw reply related

* [RFC PATCH v2 net-next 7/7] tcp: Avoid losing eor_info when collapsing skbs
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461019569-3037369-1-git-send-email-kafai@fb.com>

When collapsing skbs during tcp_collapse_retrans and tcp_shift_skb_data,
this patch is to avoid collapsing to a prev skb that has eor_info mark
and the next_skb also has the tskey set (i.e. the prev skb will lose
a eor marked tskey info).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h     | 6 ++++++
 net/ipv4/tcp_input.c  | 3 +++
 net/ipv4/tcp_output.c | 7 +++++--
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f3c5dcb..56a287a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -812,6 +812,12 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)
 	return TCP_SKB_CB(skb)->tcp_gso_size;
 }
 
+static inline bool  tcp_skb_can_collapse_eor(const struct sk_buff *skb,
+					     const struct sk_buff *next_skb)
+{
+	return !(TCP_SKB_CB(skb)->eor_info && skb_shinfo(next_skb)->tskey);
+}
+
 /* Events passed to congestion control interface */
 enum tcp_ca_event {
 	CA_EVENT_TX_START,	/* first transmit when no packets in flight */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 75e8336..e60922d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1368,6 +1368,9 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
 	if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED)
 		goto fallback;
 
+	if (!tcp_skb_can_collapse_eor(prev, skb))
+		goto fallback;
+
 	in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
 		  !before(end_seq, TCP_SKB_CB(skb)->end_seq);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e71336c..1ae21f1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2512,7 +2512,8 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 }
 
 /* Check if coalescing SKBs is legal. */
-static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
+static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *to,
+			     const struct sk_buff *skb)
 {
 	if (tcp_skb_pcount(skb) > 1)
 		return false;
@@ -2526,6 +2527,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 	/* Some heurestics for collapsing over SACK'd could be invented */
 	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
 		return false;
+	if (!tcp_skb_can_collapse_eor(to, skb))
+		return false;
 
 	return true;
 }
@@ -2546,7 +2549,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
 		return;
 
 	tcp_for_write_queue_from_safe(skb, tmp, sk) {
-		if (!tcp_can_collapse(sk, skb))
+		if (!tcp_can_collapse(sk, to, skb))
 			break;
 
 		space -= skb->len;
-- 
2.5.1

^ permalink raw reply related

* [RFC PATCH v2 net-next 0/7] tcp: Make use of MSG_EOR in tcp_sendmsg
From: Martin KaFai Lau @ 2016-04-18 22:46 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

v2:
~ Rework based on the recent work
  "add TX timestamping via cmsg" by
  Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
~ This version takes the MSG_EOR bit as a signal of
  end-of-response-message and leave the selective
  timestamping job to the cmsg
~ Changes based on the v1 feedback (like avoid
  unlikely check in a loop and adding tcp_sendpage
  support)
~ The first 3 patches are bug fixes.  The fixes in this
  series depend on the newly introduced txstamp_ack in
  net-next.  I will make relevant patches against net after
  getting some feedback.
~ The test results are based on the recently posted net fix:
  "tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks"
~ Due to the lacking cmsg support in packetdrill (or I just
  could not find it), a BPF prog is used to kprobe
  to sock_queue_err_skb() and print out the value of
  serr->ee.ee_data.  The BPF prog (run-able from bcc) is
  attached at the end.

Some of the following is stolen from a commit message of
the following patch to serve as a high level description of
the objective in this series:

This patchset allows the user process to use MSG_EOR during
tcp_sendmsg to tell the kernel that it is the last byte
of an application response message.

It is currently useful when the end-user has turned on any bit of the
SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
The kernel will then mark the newly added tcb->eor_info bit so
that the shinfo->tskey will not be overwritten (i.e. lost) in
the later skb append/collapse operation.

Each skb can only track one tskey (which is the seq number of the
last byte of the message).   To allow tracking the last byte of
multiple response messages (e.g. HTTP2), this patch takes an
approach by not appending to the previous skb during tcp_sendmsg
if this previous skb's eor information (only shinfo->tskey for now)
will be overwritten.  A similar case also happens during
retransmission.

This approach avoids introducing another list to track the tskey.
The downside is that it will have less tso benefit and/or more
outgoing packets.  Practically, due to the amount of measurement
data generated, sampling is usually used in production. (i.e. not
every connection is tracked).

One of our use case is at the webserver.  The webserver tracks
the HTTP2 response latency by measuring when the webserver sends
the first byte to the socket till the TCP ACK of the last byte
is received.  In the cases where we don't have client side
measurement, measuring from the server side is the only option.
In the cases we have the client side measurement, the server side
data can also be used to justify/cross-check-with the client
side data.

The TCP PRR paper [1] also measures a similar metrics:
"The TCP latency of a HTTP response when the server sends the first
byte until it receives the acknowledgment (ACK) for the last byte."

[1] Proportional Rate Reduction for TCP:
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37486.pdf

BPF prog used for testing:
~~~~~
#!/usr/bin/env python

from __future__ import print_function
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#include <linux/errqueue.h>

#ifdef memset
#undef memset
#endif

int trace_err_skb(struct pt_regs *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->si;
	struct sock *sk = (struct sock *)ctx->di;
	struct sock_exterr_skb *serr;
	u32 ee_data = 0;

	if (!sk || !skb)
		return 0;

	serr = SKB_EXT_ERR(skb);
	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
	bpf_trace_printk("ee_data:%u\\n", ee_data);

	return 0;
};
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
print("Attached to kprobe")
b.trace_print()

^ permalink raw reply

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
From: Eric Dumazet @ 2016-04-18 23:18 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461019569-3037369-5-git-send-email-kafai@fb.com>

On Mon, 2016-04-18 at 15:46 -0700, Martin KaFai Lau wrote:
> This patch allows the user process to use MSG_EOR during
> tcp_sendmsg to tell the kernel that it is the last byte
> of an application response message.
> 
> It is currently useful when the end-user has turned on any bit of the
> SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
> The kernel will then mark the newly added tcb->eor_info bit so
> that the shinfo->tskey will not be overwritten (i.e. lost) in
> the later skb append/collapse operation.
> 
> With selective SOF_TIMESTAMPING_TX_ACK (by cmsg) and MSG_EOR (this
> patch), the user application can specially tell which outgoing byte
> it wants to track its ACK and ask the kernel not to lose this
> tracking info in the later skb append/collapse action.
> 
> This patch handles the append case in tcp_sendmsg.  The later
> patches will handle the collapse during retransmission and
> skb slicing in tcp_fragment()/tso_fragment().
> 
> One of our use case is at the webserver.  The webserver tracks
> the HTTP2 response latency by measuring when the webserver sends
> the first byte to the socket till the TCP ACK of the last byte
> is received.  In the cases where we don't have client side
> measurement, measuring from the server side is the only option.
> In the cases we have the client side measurement, the server side
> data can also be used to justify/cross-check-with the client
> side data.
> 
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> ---

MSG_EOR should not depend on SKBTX_ANY_TSTAMP

Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
mark the cooked skb as a non candidate for future coalescing.

netperf could then get an option to set this MSG_EOR ;)

I believe Soheil was working on such simple alternative ?

^ permalink raw reply

* EMC Users
From: Phoebe Calkins @ 2016-04-18 22:40 UTC (permalink / raw)
  To: netdev

Hi,

Would you be interested in acquiring an Email list of EMC Users List?

List of Job Titles:- C-level Executives, V-level Executives, IT Executives, Software Executives, Application Executives, Operations, HR, Finance Executives, Directors, Managers, Sales Executives, Marketing, Application Executives, Operations and many more.

Few Industry Specific Lists:- CRM/Marketing Automation, Data Management, Data Storage, ITSM, ITO Agreements/IT Intel, Networking, Servers users, Security, Business Intelligence, Collaboration, Enterprise Application, ERP/HR/HCM/FI, Hardware/OS/Systems Environment, Programming Tools, Languages, Telecommunication,
Virtualization and many more.

If your target audience is not mentioned above, please do not hesitate to fill it below.

Kindly, let me know your exact target criteria as shown in below format.In turn I will get back with you more details for your consideration.

Target Industry:          ,Target Job Title:              ,Target Geography:

Looking forward to your response.

Best Regards,
Phoebe Calkins | Inside Sales

If you do not wish to receive further emails kindly reply with "Leave Out"

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

^ permalink raw reply

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
From: kafai @ 2016-04-18 23:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461021493.10638.131.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Apr 18, 2016 at 04:18:13PM -0700, Eric Dumazet wrote:
> On Mon, 2016-04-18 at 15:46 -0700, Martin KaFai Lau wrote:
> > This patch allows the user process to use MSG_EOR during
> > tcp_sendmsg to tell the kernel that it is the last byte
> > of an application response message.
> >
> > It is currently useful when the end-user has turned on any bit of the
> > SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
> > The kernel will then mark the newly added tcb->eor_info bit so
> > that the shinfo->tskey will not be overwritten (i.e. lost) in
> > the later skb append/collapse operation.
> >
> > With selective SOF_TIMESTAMPING_TX_ACK (by cmsg) and MSG_EOR (this
> > patch), the user application can specially tell which outgoing byte
> > it wants to track its ACK and ask the kernel not to lose this
> > tracking info in the later skb append/collapse action.
> >
> > This patch handles the append case in tcp_sendmsg.  The later
> > patches will handle the collapse during retransmission and
> > skb slicing in tcp_fragment()/tso_fragment().
> >
> > One of our use case is at the webserver.  The webserver tracks
> > the HTTP2 response latency by measuring when the webserver sends
> > the first byte to the socket till the TCP ACK of the last byte
> > is received.  In the cases where we don't have client side
> > measurement, measuring from the server side is the only option.
> > In the cases we have the client side measurement, the server side
> > data can also be used to justify/cross-check-with the client
> > side data.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: Neal Cardwell <ncardwell@google.com>
> > Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Yuchung Cheng <ycheng@google.com>
> > ---
>
> MSG_EOR should not depend on SKBTX_ANY_TSTAMP
>
> Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
> mark the cooked skb as a non candidate for future coalescing.
It was one of my earlier local attempt.  There are cases that coalescing
will not lose the tskey, so I trashed it.

If we mark eor only based on MSG_EOR, we can still do checks on
prev_skb's tskey and next_skb's tskey before coalescing two skbs
or
you meant simply don't coalesce if the prev_skb has eor marked?

>
> netperf could then get an option to set this MSG_EOR ;)
Not sure how it is related.  Can you share how netperf can
benefit from MSG_EOR in TCP tests without any of the
SOF_TIMESTAMPING_TX_RECORD_MASK.

^ permalink raw reply

* Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg
From: Eric Dumazet @ 2016-04-19  0:06 UTC (permalink / raw)
  To: kafai
  Cc: netdev, Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <20160418234202.GA27948@kafai-mba.local>

On Mon, 2016-04-18 at 16:43 -0700, kafai@fb.com wrote:

> >
> > netperf could then get an option to set this MSG_EOR ;)
> Not sure how it is related.  Can you share how netperf can
> benefit from MSG_EOR in TCP tests without any of the
> SOF_TIMESTAMPING_TX_RECORD_MASK.

Simply setting MSG_EOR would be orthogonal to other timestamping stuff.

Maybe the application does not _want_ to be notified when skb is sent or
acknowledged, but would like some kind of "tcpdump awareness" or
something about burst control, who knows...

It should only be a request from user space to ask TCP to not aggregate
stuff on future sendpage()/sendmsg() on the skb carrying this new flag.

We already have other flags to ask for timestamping stuff, and they
could be used at the same time.

If the stack needs to be changed to properly fragment skb (or
aggregating them at retransmit), this is a separate concern.

Note that you do not need to automatically assert MSG_BOR (Begin of
Request) : MSG_EOR should really control the fact that last byte sent
marks the skb as being a non candidate for aggregation.

This would keep tcp_sendmsg() reasonnably fast.
Your tcp_sendmsg_noappend() is quite expensive :(

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox