Netdev List

Netdev List
 help / color / mirror / Atom feed

* 1oan @ low interest rate
From: Mr. Wolf @ 2016-03-31  1:06 UTC (permalink / raw)
  To: Recipients

Are you interested in starting up a business or do you have financial
difficulties that requires capital....what about you over there looking to
solve that personal need? look no further as we are here to offer you a loan of 3% monthly with no hidden charges. All you need to do is send us an email to pierrewoolff@gmail.com for more information

^ permalink raw reply

* Re: [PATCH net-next 2/8] tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO
From: Willem de Bruijn @ 2016-03-31  3:35 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: David Miller, Network Development, Willem de Bruijn, Eric Dumazet,
	Yuchung Cheng, Neal Cardwell, Martin KaFai Lau,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-3-git-send-email-soheil.kdev@gmail.com>

On Wed, Mar 30, 2016 at 6:37 PM, Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
> to associate timestamps with send calls. For TCP connections,
> tp->snd_una is used as the starting point to calculate
> relative IDs.
>
> This socket option will fail if set before the handshake on a
> passive TCP fast open connection with data in SYN or SYN/ACK,
> since setsockopt requires the connection to be in the
> ESTABLISHED state.
>
> To address these, instead of limiting the option to the
> ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
> long as the connection is not in LISTEN or CLOSE states.
>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next 3/8] tcp: use one bit in TCP_SKB_CB to mark ACK timestamps
From: Willem de Bruijn @ 2016-03-31  3:36 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: David Miller, Network Development, Willem de Bruijn, Eric Dumazet,
	Yuchung Cheng, Neal Cardwell, Martin KaFai Lau,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-4-git-send-email-soheil.kdev@gmail.com>

On Wed, Mar 30, 2016 at 6:37 PM, Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> Currently, to avoid a cache line miss for accessing skb_shinfo,
> tcp_ack_tstamp skips socket that do not have
> SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
> implemented based on an implicit assumption that the
> SOF_TIMESTAMPING_TX_ACK is set via socket options for the
> duration that ACK timestamps are needed.
>
> To implement per-write timestamps, this check should be
> removed and replaced with a per-packet alternative that
> quickly skips packets missing ACK timestamps marks without
> a cache-line miss.
>
> To enable per-packet marking without a cache line miss, use
> one bit in TCP_SKB_CB to mark a whether a SKB might need a
> ack tx timestamp or not. Further checks in tcp_ack_tstamp are not
> modified and work as before.
>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next 4/8] sock: accept SO_TIMESTAMPING flags in socket cmsg
From: Willem de Bruijn @ 2016-03-31  3:36 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: David Miller, Network Development, Willem de Bruijn, Eric Dumazet,
	Yuchung Cheng, Neal Cardwell, Martin KaFai Lau,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-5-git-send-email-soheil.kdev@gmail.com>

On Wed, Mar 30, 2016 at 6:37 PM, Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
> as a basis to accept timestamping requests per write.
>
> This implementation only accepts TX recording flags (i.e.,
> SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
> SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
> control messages. Users need to set reporting flags (e.g.,
> SOF_TIMESTAMPING_OPT_ID) per socket via socket options.
>
> This commit adds a tsflags field in sockcm_cookie which is
> set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_*
> bits in sockcm_cookie.tsflags allowing the control message
> to override the recording behavior per write, yet maintaining
> the value of other flags.
>
> This patch implements validating the control message and setting
> tsflags in struct sockcm_cookie. Next commits in this series will
> actually implement timestamping per write for different protocols.
>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next 5/8] ipv4: process socket-level control messages in IPv4
From: Willem de Bruijn @ 2016-03-31  3:36 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: David Miller, Network Development, Willem de Bruijn, Eric Dumazet,
	Yuchung Cheng, Neal Cardwell, Martin KaFai Lau,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-6-git-send-email-soheil.kdev@gmail.com>

On Wed, Mar 30, 2016 at 6:37 PM, Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> Process socket-level control messages by invoking
> __sock_cmsg_send in ip_cmsg_send for control messages on
> the SOL_SOCKET layer.
>
> This makes sure whenever ip_cmsg_send is called in udp, icmp,
> and raw, we also process socket-level control messages.
>
> Note that this commit interprets new control messages that
> were ignored before. As such, this commit does not change
> the behavior of IPv4 control messages.
>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next 6/8] ipv6: process socket-level control messages in IPv6
From: Willem de Bruijn @ 2016-03-31  3:37 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: David Miller, Network Development, Willem de Bruijn, Eric Dumazet,
	Yuchung Cheng, Neal Cardwell, Martin KaFai Lau,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-7-git-send-email-soheil.kdev@gmail.com>

On Wed, Mar 30, 2016 at 6:37 PM, Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> Process socket-level control messages by invoking
> __sock_cmsg_send in ip6_datagram_send_ctl for control messages on
> the SOL_SOCKET layer.
>
> This makes sure whenever ip6_datagram_send_ctl is called for
> udp and raw, we also process socket-level control messages.
>
> This is a bit uglier than IPv4, since IPv6 does not have
> something like ipcm_cookie. Perhaps we can later create
> a control message cookie for IPv6?
>
> Note that this commit interprets new control messages that
> were ignored before. As such, this commit does not change
> the behavior of IPv6 control messages.
>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next 7/8] sock: enable timestamping using control messages
From: Willem de Bruijn @ 2016-03-31  3:39 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: David Miller, Network Development, Willem de Bruijn, Eric Dumazet,
	Yuchung Cheng, Neal Cardwell, Martin KaFai Lau,
	Soheil Hassas Yeganeh
In-Reply-To: <1459377448-2239-8-git-send-email-soheil.kdev@gmail.com>

On Wed, Mar 30, 2016 at 6:37 PM, Soheil Hassas Yeganeh
<soheil.kdev@gmail.com> wrote:
> From: Soheil Hassas Yeganeh <soheil@google.com>
>
> Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
> This is very costly when users want to sample writes to gather
> tx timestamps.
>
> Add support for enabling SO_TIMESTAMPING via control messages by
> using tsflags added in `struct sockcm_cookie` (added in the previous
> patches in this series) to set the tx_flags of the last skb created in
> a sendmsg. With this patch, the timestamp recording bits in tx_flags
> of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
>
> Please note that this is only effective for overriding the recording
> timestamps flags. Users should enable timestamp reporting (e.g.,
> SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
> socket options and then should ask for SOF_TIMESTAMPING_TX_*
> using control messages per sendmsg to sample timestamps for each
> write.
>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next 0/8] add TX timestamping via cmsg
From: Willem de Bruijn @ 2016-03-31  3:50 UTC (permalink / raw)
  To: Martin KaFai Lau, Soheil Hassas Yeganeh; +Cc: Network Development
In-Reply-To: <CADAM+B1osNNwVpg1R_kAFW8m75mr15_usDa=gmqpBA+xwyA_fg@mail.gmail.com>

>> Nice patches!

This does not yet solve the append issue that your MSG_EOR patch
addresses, of course.

The straightforward jump to new_segment that I proposed as
simplification is more properly a "start-of-record" than
"end-of-record" signal. It is probably preferable to indeed be able to
pass EOR as signal that the last skb must not be appended to in
subsequent calls.

I think that the record bounds issue is best solved independently from
the interface for intermittent timestamps because (a) changing the tcp
bytestream packetization for timestamping introduces subtle
differences between tracked and untracked data that are not always
acceptable and (b) EOR can also be useful outside timestamps. A
zerocopy sendmsg patchset that I sent for RFC last year encountered a
similar requirement, to give one example: each skb with user data must
point to a completion notification structure (ubuf_info), and can only
point to one at a time. Appends that cause a conflict in skb->uarg
pointers had to be blocked, at the cost of possibly different
packetization compared to regular sends.

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Michal Kubecek @ 2016-03-31  5:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, davem, sasha.levin, jslaby, eric.dumazet, mst,
	netdev
In-Reply-To: <20160331011840.GA50846@ast-mbp.thefacebook.com>

On Wed, Mar 30, 2016 at 06:18:42PM -0700, Alexei Starovoitov wrote:
> On Thu, Mar 31, 2016 at 02:13:18AM +0200, Daniel Borkmann wrote:
> > Sasha Levin reported a suspicious rcu_dereference_protected() warning
> > found while fuzzing with trinity that is similar to this one:
> > 
> >   [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
> >   [   52.765688] other info that might help us debug this:
> >   [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
> >   [   52.765701] 1 lock held by a.out/1525:
> >   [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
> >   [   52.765721] stack backtrace:
> >   [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
> >   [...]
> >   [   52.765768] Call Trace:
> >   [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
> >   [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
> >   [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
> >   [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
> >   [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
> >   [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
> >   [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
> >   [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
> >   [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
> >   [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
> >   [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
> >   [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
> > from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
> > DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.
> > 
> > Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
> > fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
> > filter with rcu_dereference_protected(), checking whether socket lock
> > is held in control path.
> > 
> > Since its introduction in 994051625981 ("tun: socket filter support"),
> > tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
> > sock_owned_by_user(sk) doesn't apply in this specific case and therefore
> > triggers the false positive.
> > 
> > Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
> > that is used by tap filters and pass in lockdep_rtnl_is_held() for the
> > rcu_dereference_protected() checks instead.
> > 
> > Reported-by: Sasha Levin <sasha.levin@oracle.com>
> > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > ---
> >  drivers/net/tun.c      |  8 +++++---
> >  include/linux/filter.h |  4 ++++
> >  net/core/filter.c      | 33 +++++++++++++++++++++------------
> >  3 files changed, 30 insertions(+), 15 deletions(-)
> 
> kinda heavy patch to shut up lockdep.
> Can we do
> old_fp = rcu_dereference_protected(sk->sk_filter,
>                                 sock_owned_by_user(sk) || lockdep_rtnl_is_held());
> and it always be correct?
> I think right now tun is the only such user, but if it's correct for tun,
> it's correct for future users too. If not correct then not correct for tun either.
> Or I'm missing something?

Already discussed here:

  http://thread.gmane.org/gmane.linux.kernel/2158069/focus=405853

Michal Kubecek

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Alexei Starovoitov @ 2016-03-31  5:08 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Daniel Borkmann, davem, sasha.levin, jslaby, eric.dumazet, mst,
	netdev
In-Reply-To: <20160331050115.GA21665@unicorn.suse.cz>

On Thu, Mar 31, 2016 at 07:01:15AM +0200, Michal Kubecek wrote:
> On Wed, Mar 30, 2016 at 06:18:42PM -0700, Alexei Starovoitov wrote:
> > On Thu, Mar 31, 2016 at 02:13:18AM +0200, Daniel Borkmann wrote:
> > > Sasha Levin reported a suspicious rcu_dereference_protected() warning
> > > found while fuzzing with trinity that is similar to this one:
> > > 
> > >   [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
> > >   [   52.765688] other info that might help us debug this:
> > >   [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
> > >   [   52.765701] 1 lock held by a.out/1525:
> > >   [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
> > >   [   52.765721] stack backtrace:
> > >   [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
> > >   [...]
> > >   [   52.765768] Call Trace:
> > >   [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
> > >   [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
> > >   [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
> > >   [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
> > >   [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
> > >   [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
> > >   [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
> > >   [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
> > >   [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
> > >   [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
> > >   [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
> > >   [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25
> > > 
> > > Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
> > > from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
> > > DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.
> > > 
> > > Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
> > > fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
> > > filter with rcu_dereference_protected(), checking whether socket lock
> > > is held in control path.
> > > 
> > > Since its introduction in 994051625981 ("tun: socket filter support"),
> > > tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
> > > sock_owned_by_user(sk) doesn't apply in this specific case and therefore
> > > triggers the false positive.
> > > 
> > > Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
> > > that is used by tap filters and pass in lockdep_rtnl_is_held() for the
> > > rcu_dereference_protected() checks instead.
> > > 
> > > Reported-by: Sasha Levin <sasha.levin@oracle.com>
> > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> > > ---
> > >  drivers/net/tun.c      |  8 +++++---
> > >  include/linux/filter.h |  4 ++++
> > >  net/core/filter.c      | 33 +++++++++++++++++++++------------
> > >  3 files changed, 30 insertions(+), 15 deletions(-)
> > 
> > kinda heavy patch to shut up lockdep.
> > Can we do
> > old_fp = rcu_dereference_protected(sk->sk_filter,
> >                                 sock_owned_by_user(sk) || lockdep_rtnl_is_held());
> > and it always be correct?
> > I think right now tun is the only such user, but if it's correct for tun,
> > it's correct for future users too. If not correct then not correct for tun either.
> > Or I'm missing something?
> 
> Already discussed here:
> 
>   http://thread.gmane.org/gmane.linux.kernel/2158069/focus=405853

I saw that. My point above was challenging 'less accurate' part.

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Michal Kubecek @ 2016-03-31  5:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, davem, sasha.levin, jslaby, eric.dumazet, mst,
	netdev
In-Reply-To: <20160331050808.GA56499@ast-mbp.thefacebook.com>

On Wed, Mar 30, 2016 at 10:08:10PM -0700, Alexei Starovoitov wrote:
> On Thu, Mar 31, 2016 at 07:01:15AM +0200, Michal Kubecek wrote:
> > On Wed, Mar 30, 2016 at 06:18:42PM -0700, Alexei Starovoitov wrote:
> > > 
> > > kinda heavy patch to shut up lockdep.
> > > Can we do
> > > old_fp = rcu_dereference_protected(sk->sk_filter,
> > >                                 sock_owned_by_user(sk) || lockdep_rtnl_is_held());
> > > and it always be correct?
> > > I think right now tun is the only such user, but if it's correct
> > > for tun, it's correct for future users too. If not correct then
> > > not correct for tun either.
> > > Or I'm missing something?
> > 
> > Already discussed here:
> > 
> >   http://thread.gmane.org/gmane.linux.kernel/2158069/focus=405853
> 
> I saw that. My point above was challenging 'less accurate' part.
> 
Daniel's point was that lockdep_rtnl_is_held() does not mean "we hold
RTNL" but "someone holds RTNL" so that some other task holding RTNL at
the moment could make the check happy even when called by someone
supposed to own the socket.

So I guess the key question is if avoiding this type of false negative
is important enough to justify the extra complexity (taking into account
this race would have to happen every time the check is performed to
really hide a locking issue).

                                                          Michal Kubecek

^ permalink raw reply

* Re: [PATCH net] tun, bpf: fix suspicious RCU usage in tun_{attach,detach}_filter
From: Alexei Starovoitov @ 2016-03-31  5:43 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: Daniel Borkmann, davem, sasha.levin, jslaby, eric.dumazet, mst,
	netdev
In-Reply-To: <20160331052232.GB21665@unicorn.suse.cz>

On Thu, Mar 31, 2016 at 07:22:32AM +0200, Michal Kubecek wrote:
> On Wed, Mar 30, 2016 at 10:08:10PM -0700, Alexei Starovoitov wrote:
> > On Thu, Mar 31, 2016 at 07:01:15AM +0200, Michal Kubecek wrote:
> > > On Wed, Mar 30, 2016 at 06:18:42PM -0700, Alexei Starovoitov wrote:
> > > > 
> > > > kinda heavy patch to shut up lockdep.
> > > > Can we do
> > > > old_fp = rcu_dereference_protected(sk->sk_filter,
> > > >                                 sock_owned_by_user(sk) || lockdep_rtnl_is_held());
> > > > and it always be correct?
> > > > I think right now tun is the only such user, but if it's correct
> > > > for tun, it's correct for future users too. If not correct then
> > > > not correct for tun either.
> > > > Or I'm missing something?
> > > 
> > > Already discussed here:
> > > 
> > >   http://thread.gmane.org/gmane.linux.kernel/2158069/focus=405853
> > 
> > I saw that. My point above was challenging 'less accurate' part.
> > 
> Daniel's point was that lockdep_rtnl_is_held() does not mean "we hold
> RTNL" but "someone holds RTNL" so that some other task holding RTNL at
> the moment could make the check happy even when called by someone
> supposed to own the socket.

Of course... and that is the case for all rtnl_dereference() calls...
yet we're not paranoid about it.

^ permalink raw reply

* [PATCH net-next 0/6] net device rx busy polling support in vhost_net
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang

Hi all:

This series try to add net device rx busy polling support in
vhost_net. This is done through:

- adding socket rx busy polling support for tun/macvtap by marking
  napi_id.
- vhost_net will try to find the net device through napi_id and do
  busy polling if possible.

TCP_RR tests on two mlx4s show some improvements:

smp=1 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +4%/   +3%/   +3%/   +3%/  +22%
    1/    50/   +2%/   +2%/   +2%/   +2%/    0%
    1/   100/   +1%/    0%/   +1%/   +1%/   -1%
    1/   200/   +2%/   +1%/   +2%/   +2%/    0%
   64/     1/   +1%/   +3%/   +1%/   +1%/   +1%
   64/    50/    0%/    0%/    0%/    0%/   -1%
   64/   100/   +1%/    0%/   +1%/   +1%/    0%
   64/   200/    0%/    0%/   +2%/   +2%/    0%
  256/     1/   +2%/   +2%/   +2%/   +2%/   +2%
  256/    50/   +3%/   +3%/   +3%/   +3%/    0%
  256/   100/   +1%/   +1%/   +2%/   +2%/    0%
  256/   200/    0%/    0%/   +1%/   +1%/   +1%
 1024/     1/   +2%/   +2%/   +2%/   +2%/   +2%
 1024/    50/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   100/   +1%/   +1%/    0%/    0%/   -1%
 1024/   200/   +2%/   +1%/   +2%/   +2%/    0%

smp=8 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +1%/   -5%/   +1%/   +1%/    0%
    1/    50/   +1%/    0%/   +1%/   +1%/   -1%
    1/   100/   -1%/   -1%/   -2%/   -2%/   -4%
    1/   200/    0%/    0%/    0%/    0%/   -1%
   64/     1/   -2%/  -10%/   -2%/   -2%/   -2%
   64/    50/   -1%/   -1%/   -1%/   -1%/   -2%
   64/   100/   -1%/    0%/    0%/    0%/   -1%
   64/   200/   -1%/   -1%/    0%/    0%/    0%
  256/     1/   +7%/  +25%/   +7%/   +7%/   +7%
  256/    50/   +2%/   +2%/   +2%/   +2%/   -1%
  256/   100/   -1%/   -1%/   -1%/   -1%/   -3%
  256/   200/   +1%/    0%/    0%/    0%/    0%
 1024/     1/   +5%/  +15%/   +5%/   +5%/   +4%
 1024/    50/    0%/    0%/   -1%/   -1%/   -1%
 1024/   100/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   200/   -1%/    0%/   -1%/   -1%/   -1%

smp=8 queue=8
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +5%/   +2%/   +5%/   +5%/    0%
    1/    50/   +2%/   +2%/   +3%/   +3%/  -20%
    1/   100/   +5%/   +5%/   +5%/   +5%/  -13%
    1/   200/   +8%/   +8%/   +6%/   +6%/  -12%
   64/     1/    0%/   +4%/    0%/    0%/  +18%
   64/    50/   +6%/   +5%/   +5%/   +5%/   -7%
   64/   100/   +4%/   +4%/   +5%/   +5%/  -12%
   64/   200/   +5%/   +5%/   +5%/   +5%/  -12%
  256/     1/    0%/   -3%/    0%/    0%/   +1%
  256/    50/   +3%/   +3%/   +3%/   +3%/   -2%
  256/   100/   +6%/   +5%/   +5%/   +5%/  -11%
  256/   200/   +4%/   +4%/   +4%/   +4%/  -13%
 1024/     1/    0%/   -3%/    0%/    0%/   -6%
 1024/    50/   +1%/   +1%/   +1%/   +1%/  -10%
 1024/   100/   +4%/   +4%/   +5%/   +5%/  -11%
 1024/   200/   +4%/   +5%/   +4%/   +4%/  -12%

Thanks

Jason Wang (6):
  net: skbuff: don't use union for napi_id and sender_cpu
  tuntap: socket rx busy polling support
  macvtap: socket rx busy polling support
  net: core: factor out core busy polling logic to sk_busy_loop_once()
  net: export napi_by_id()
  vhost_net: net device rx busy polling support

 drivers/net/macvtap.c   |  4 ++++
 drivers/net/tun.c       |  3 ++-
 drivers/vhost/net.c     | 22 ++++++++++++++++--
 include/linux/skbuff.h  | 10 ++++----
 include/net/busy_poll.h |  8 +++++++
 net/core/dev.c          | 62 ++++++++++++++++++++++++++++---------------------
 6 files changed, 75 insertions(+), 34 deletions(-)

-- 
2.5.0

^ permalink raw reply

* [PATCH net-next 1/6] net: skbuff: don't use union for napi_id and sender_cpu
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

We use a union for napi_id and send_cpu, this is ok for most of the
cases except when we want to support busy polling for tun which needs
napi_id to be stored and passed to socket during tun_net_xmit(). In
this case, napi_id was overridden with sender_cpu before tun_net_xmit()
was called if XPS was enabled. Fixing by not using union for napi_id
and sender_cpu.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/skbuff.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 15d0df9..8aee891 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -743,11 +743,11 @@ struct sk_buff {
 	__u32			hash;
 	__be16			vlan_proto;
 	__u16			vlan_tci;
-#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
-	union {
-		unsigned int	napi_id;
-		unsigned int	sender_cpu;
-	};
+#if defined(CONFIG_NET_RX_BUSY_POLL)
+	unsigned int		napi_id;
+#endif
+#if defined(CONFIG_XPS)
+	unsigned int		sender_cpu;
 #endif
 	union {
 #ifdef CONFIG_NETWORK_SECMARK
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 2/6] tuntap: socket rx busy polling support
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950..950faf5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -69,6 +69,7 @@
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
 #include <net/sock.h>
+#include <net/busy_poll.h>
 #include <linux/seq_file.h>
 #include <linux/uio.h>
 
@@ -871,6 +872,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	nf_reset(skb);
 
+	sk_mark_napi_id(tfile->socket.sk, skb);
 	/* Enqueue packet */
 	skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
 
@@ -878,7 +880,6 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (tfile->flags & TUN_FASYNC)
 		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
 	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
-
 	rcu_read_unlock();
 	return NETDEV_TX_OK;
 
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 3/6] macvtap: socket rx busy polling support
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvtap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 95394ed..1891aff 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -20,6 +20,7 @@
 #include <net/net_namespace.h>
 #include <net/rtnetlink.h>
 #include <net/sock.h>
+#include <net/busy_poll.h>
 #include <linux/virtio_net.h>
 
 /*
@@ -369,6 +370,7 @@ static rx_handler_result_t macvtap_handle_frame(struct sk_buff **pskb)
 			goto drop;
 
 		if (!segs) {
+			sk_mark_napi_id(&q->sk, skb);
 			skb_queue_tail(&q->sk.sk_receive_queue, skb);
 			goto wake_up;
 		}
@@ -378,6 +380,7 @@ static rx_handler_result_t macvtap_handle_frame(struct sk_buff **pskb)
 			struct sk_buff *nskb = segs->next;
 
 			segs->next = NULL;
+			sk_mark_napi_id(&q->sk, segs);
 			skb_queue_tail(&q->sk.sk_receive_queue, segs);
 			segs = nskb;
 		}
@@ -391,6 +394,7 @@ static rx_handler_result_t macvtap_handle_frame(struct sk_buff **pskb)
 		    !(features & NETIF_F_CSUM_MASK) &&
 		    skb_checksum_help(skb))
 			goto drop;
+		sk_mark_napi_id(&q->sk, skb);
 		skb_queue_tail(&q->sk.sk_receive_queue, skb);
 	}
 
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 4/6] net: core: factor out core busy polling logic to sk_busy_loop_once()
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

This patch factors out core logic of busy polling to
sk_busy_loop_once() in order to be reused by other modules.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/net/busy_poll.h |  7 ++++++
 net/core/dev.c          | 59 ++++++++++++++++++++++++++++---------------------
 2 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 2fbeb13..e765e23 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -73,6 +73,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
 }
 
 bool sk_busy_loop(struct sock *sk, int nonblock);
+int sk_busy_loop_once(struct sock *sk, struct napi_struct *napi);
 
 /* used in the NIC receive handler to mark the skb */
 static inline void skb_mark_napi_id(struct sk_buff *skb,
@@ -117,6 +118,12 @@ static inline bool busy_loop_timeout(unsigned long end_time)
 	return true;
 }
 
+static inline int sk_busy_loop_once(struct napi_struct *napi,
+				    int (*busy_poll)(struct napi_struct *dev))
+{
+	return 0;
+}
+
 static inline bool sk_busy_loop(struct sock *sk, int nonblock)
 {
 	return false;
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..a2f0c46 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4902,10 +4902,42 @@ static struct napi_struct *napi_by_id(unsigned int napi_id)
 
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 #define BUSY_POLL_BUDGET 8
+int sk_busy_loop_once(struct sock *sk, struct napi_struct *napi)
+{
+	int (*busy_poll)(struct napi_struct *dev);
+	int rc = 0;
+
+	/* Note: ndo_busy_poll method is optional in linux-4.5 */
+	busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+
+	local_bh_disable();
+	if (busy_poll) {
+		rc = busy_poll(napi);
+	} else if (napi_schedule_prep(napi)) {
+		void *have = netpoll_poll_lock(napi);
+
+		if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+			rc = napi->poll(napi, BUSY_POLL_BUDGET);
+			trace_napi_poll(napi);
+			if (rc == BUSY_POLL_BUDGET) {
+				napi_complete_done(napi, rc);
+				napi_schedule(napi);
+			}
+		}
+		netpoll_poll_unlock(have);
+	}
+	if (rc > 0)
+		NET_ADD_STATS_BH(sock_net(sk),
+				 LINUX_MIB_BUSYPOLLRXPACKETS, rc);
+	local_bh_enable();
+
+	return rc;
+}
+EXPORT_SYMBOL(sk_busy_loop_once);
+
 bool sk_busy_loop(struct sock *sk, int nonblock)
 {
 	unsigned long end_time = !nonblock ? sk_busy_loop_end_time(sk) : 0;
-	int (*busy_poll)(struct napi_struct *dev);
 	struct napi_struct *napi;
 	int rc = false;
 
@@ -4915,31 +4947,8 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
 	if (!napi)
 		goto out;
 
-	/* Note: ndo_busy_poll method is optional in linux-4.5 */
-	busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
-
 	do {
-		rc = 0;
-		local_bh_disable();
-		if (busy_poll) {
-			rc = busy_poll(napi);
-		} else if (napi_schedule_prep(napi)) {
-			void *have = netpoll_poll_lock(napi);
-
-			if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
-				rc = napi->poll(napi, BUSY_POLL_BUDGET);
-				trace_napi_poll(napi);
-				if (rc == BUSY_POLL_BUDGET) {
-					napi_complete_done(napi, rc);
-					napi_schedule(napi);
-				}
-			}
-			netpoll_poll_unlock(have);
-		}
-		if (rc > 0)
-			NET_ADD_STATS_BH(sock_net(sk),
-					 LINUX_MIB_BUSYPOLLRXPACKETS, rc);
-		local_bh_enable();
+		rc = sk_busy_loop_once(sk, napi);
 
 		if (rc == LL_FLUSH_FAILED)
 			break; /* permanent failure */
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 6/6] vhost_net: net device rx busy polling support
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

This patch let vhost_net try rx busy polling of underlying net device
when busy polling is enabled. Test shows some improvement on TCP_RR:

smp=1 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +4%/   +3%/   +3%/   +3%/  +22%
    1/    50/   +2%/   +2%/   +2%/   +2%/    0%
    1/   100/   +1%/    0%/   +1%/   +1%/   -1%
    1/   200/   +2%/   +1%/   +2%/   +2%/    0%
   64/     1/   +1%/   +3%/   +1%/   +1%/   +1%
   64/    50/    0%/    0%/    0%/    0%/   -1%
   64/   100/   +1%/    0%/   +1%/   +1%/    0%
   64/   200/    0%/    0%/   +2%/   +2%/    0%
  256/     1/   +2%/   +2%/   +2%/   +2%/   +2%
  256/    50/   +3%/   +3%/   +3%/   +3%/    0%
  256/   100/   +1%/   +1%/   +2%/   +2%/    0%
  256/   200/    0%/    0%/   +1%/   +1%/   +1%
 1024/     1/   +2%/   +2%/   +2%/   +2%/   +2%
 1024/    50/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   100/   +1%/   +1%/    0%/    0%/   -1%
 1024/   200/   +2%/   +1%/   +2%/   +2%/    0%

smp=8 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +1%/   -5%/   +1%/   +1%/    0%
    1/    50/   +1%/    0%/   +1%/   +1%/   -1%
    1/   100/   -1%/   -1%/   -2%/   -2%/   -4%
    1/   200/    0%/    0%/    0%/    0%/   -1%
   64/     1/   -2%/  -10%/   -2%/   -2%/   -2%
   64/    50/   -1%/   -1%/   -1%/   -1%/   -2%
   64/   100/   -1%/    0%/    0%/    0%/   -1%
   64/   200/   -1%/   -1%/    0%/    0%/    0%
  256/     1/   +7%/  +25%/   +7%/   +7%/   +7%
  256/    50/   +2%/   +2%/   +2%/   +2%/   -1%
  256/   100/   -1%/   -1%/   -1%/   -1%/   -3%
  256/   200/   +1%/    0%/    0%/    0%/    0%
 1024/     1/   +5%/  +15%/   +5%/   +5%/   +4%
 1024/    50/    0%/    0%/   -1%/   -1%/   -1%
 1024/   100/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   200/   -1%/    0%/   -1%/   -1%/   -1%

smp=8 queue=8
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
    1/     1/   +5%/   +2%/   +5%/   +5%/    0%
    1/    50/   +2%/   +2%/   +3%/   +3%/  -20%
    1/   100/   +5%/   +5%/   +5%/   +5%/  -13%
    1/   200/   +8%/   +8%/   +6%/   +6%/  -12%
   64/     1/    0%/   +4%/    0%/    0%/  +18%
   64/    50/   +6%/   +5%/   +5%/   +5%/   -7%
   64/   100/   +4%/   +4%/   +5%/   +5%/  -12%
   64/   200/   +5%/   +5%/   +5%/   +5%/  -12%
  256/     1/    0%/   -3%/    0%/    0%/   +1%
  256/    50/   +3%/   +3%/   +3%/   +3%/   -2%
  256/   100/   +6%/   +5%/   +5%/   +5%/  -11%
  256/   200/   +4%/   +4%/   +4%/   +4%/  -13%
 1024/     1/    0%/   -3%/    0%/    0%/   -6%
 1024/    50/   +1%/   +1%/   +1%/   +1%/  -10%
 1024/   100/   +4%/   +4%/   +5%/   +5%/  -11%
 1024/   200/   +4%/   +5%/   +4%/   +4%/  -12%

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f744eeb..7350f6c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -27,6 +27,7 @@
 #include <linux/if_vlan.h>
 
 #include <net/sock.h>
+#include <net/busy_poll.h>
 
 #include "vhost.h"
 
@@ -307,15 +308,24 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    unsigned int *out_num, unsigned int *in_num)
 {
 	unsigned long uninitialized_var(endtime);
+	struct socket *sock = vq->private_data;
+	struct sock *sk = sock->sk;
+	struct napi_struct *napi;
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				    out_num, in_num, NULL, NULL);
 
 	if (r == vq->num && vq->busyloop_timeout) {
 		preempt_disable();
+		rcu_read_lock();
+		napi = napi_by_id(sk->sk_napi_id);
 		endtime = busy_clock() + vq->busyloop_timeout;
 		while (vhost_can_busy_poll(vq->dev, endtime) &&
-		       vhost_vq_avail_empty(vq->dev, vq))
+		       vhost_vq_avail_empty(vq->dev, vq)) {
+			if (napi)
+				sk_busy_loop_once(sk, napi);
 			cpu_relax_lowlatency();
+		}
+		rcu_read_unlock();
 		preempt_enable();
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 					out_num, in_num, NULL, NULL);
@@ -476,6 +486,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
 	unsigned long uninitialized_var(endtime);
+	struct napi_struct *napi;
 	int len = peek_head_len(sk);
 
 	if (!len && vq->busyloop_timeout) {
@@ -484,13 +495,20 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 		vhost_disable_notify(&net->dev, vq);
 
 		preempt_disable();
+		rcu_read_lock();
+
+		napi = napi_by_id(sk->sk_napi_id);
 		endtime = busy_clock() + vq->busyloop_timeout;
 
 		while (vhost_can_busy_poll(&net->dev, endtime) &&
 		       skb_queue_empty(&sk->sk_receive_queue) &&
-		       vhost_vq_avail_empty(&net->dev, vq))
+		       vhost_vq_avail_empty(&net->dev, vq)) {
+			if (napi)
+				sk_busy_loop_once(sk, napi);
 			cpu_relax_lowlatency();
+		}
 
+		rcu_read_unlock();
 		preempt_enable();
 
 		if (vhost_enable_notify(&net->dev, vq))
-- 
2.5.0

^ permalink raw reply related

* [PATCH net-next 5/6] net: export napi_by_id()
From: Jason Wang @ 2016-03-31  5:50 UTC (permalink / raw)
  To: davem, mst, netdev, linux-kernel; +Cc: Jason Wang
In-Reply-To: <1459403439-6011-1-git-send-email-jasowang@redhat.com>

This patch exports napi_by_id() which will be used by vhost_net socket
busy polling.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/net/busy_poll.h | 1 +
 net/core/dev.c          | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index e765e23..dc9c76d 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -74,6 +74,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
 
 bool sk_busy_loop(struct sock *sk, int nonblock);
 int sk_busy_loop_once(struct sock *sk, struct napi_struct *napi);
+struct napi_struct *napi_by_id(unsigned int napi_id);
 
 /* used in the NIC receive handler to mark the skb */
 static inline void skb_mark_napi_id(struct sk_buff *skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index a2f0c46..b98d210 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4888,7 +4888,7 @@ void napi_complete_done(struct napi_struct *n, int work_done)
 EXPORT_SYMBOL(napi_complete_done);
 
 /* must be called under rcu_read_lock(), as we dont take a reference */
-static struct napi_struct *napi_by_id(unsigned int napi_id)
+struct napi_struct *napi_by_id(unsigned int napi_id)
 {
 	unsigned int hash = napi_id % HASH_SIZE(napi_hash);
 	struct napi_struct *napi;
@@ -4899,6 +4899,7 @@ static struct napi_struct *napi_by_id(unsigned int napi_id)
 
 	return NULL;
 }
+EXPORT_SYMBOL(napi_by_id);
 
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 #define BUSY_POLL_BUDGET 8
-- 
2.5.0

^ permalink raw reply related

* Re: [PATCH] net: mvneta: explicitly disable BM on 64bit platform
From: Jisheng Zhang @ 2016-03-31  5:53 UTC (permalink / raw)
  To: Gregory CLEMENT, davem, mw, thomas.petazzoni
  Cc: netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <8737r8uhjm.fsf@free-electrons.com>

Hi Gregory,

On Wed, 30 Mar 2016 17:11:41 +0200 Gregory CLEMENT wrote:

> Hi Jisheng,
>  
>  On mer., mars 30 2016, Jisheng Zhang <jszhang@marvell.com> wrote:
> 
> > The mvneta BM can't work on 64bit platform, as the BM hardware expects
> > buf virtual address to be placed in the first four bytes of mapped
> > buffer, but obviously the virtual address on 64bit platform can't be
> > stored in 4 bytes. So we have to explicitly disable BM on 64bit
> > platform.  
> 
> Actually mvneta is used on Armada 3700 which is a 64bits platform.
> Is it true that the driver needs some change to use BM in 64 bits, but
> we don't have to disable it.
> 
> Here is the 64 bits part of the patch we have currently on the hardware
> prototype. We have more things which are really related to the way the
> mvneta is connected to the Armada 3700 SoC. This code was not ready for

Thanks for the sharing.

I think we could commit easy parts firstly, for example: the cacheline size
hardcoding, either piece of your diff or my version:

http://lists.infradead.org/pipermail/linux-arm-kernel/2016-March/418513.html

> mainline but I prefer share it now instead of having the HWBM blindly

I have looked through the diff, it is for the driver itself on 64bit platforms,
and it doesn't touch BM. The BM itself need to be disabled for 64bit, I'm not
sure the BM could work on 64bit even with your diff. Per my understanding, the BM
can't work on 64 bit, let's have a look at some piece of the mvneta_bm_construct()

*(u32 *)buf = (u32)buf;

Am I misunderstanding?

Thanks,
Jisheng

> disable for 64 bits platform:
> 
> --- a/drivers/net/ethernet/marvell/Kconfig
> +++ b/drivers/net/ethernet/marvell/Kconfig
> @@ -55,7 +55,7 @@ config MVNETA_BM_ENABLE
>  
>  config MVNETA
>  	tristate "Marvell Armada 370/38x/XP network interface support"
> -	depends on PLAT_ORION
> +	depends on ARCH_MVEBU || COMPILE_TEST
>  	select MVMDIO
>  	select FIXED_PHY
>  	---help---
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 577f7ca7deba..6929ad112b64 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -260,7 +260,7 @@
>  
>  #define MVNETA_VLAN_TAG_LEN             4
>  
> -#define MVNETA_CPU_D_CACHE_LINE_SIZE    32
> +#define MVNETA_CPU_D_CACHE_LINE_SIZE    cache_line_size()
>  #define MVNETA_TX_CSUM_DEF_SIZE		1600
>  #define MVNETA_TX_CSUM_MAX_SIZE		9800
>  #define MVNETA_ACC_MODE_EXT1		1
> @@ -297,6 +297,12 @@
>  /* descriptor aligned size */
>  #define MVNETA_DESC_ALIGNED_SIZE	32
>  
> +/* Number of bytes to be taken into account by HW when putting incoming data
> + * to the buffers. It is needed in case NET_SKB_PAD exceeds maximum packet
> + * offset supported in MVNETA_RXQ_CONFIG_REG(q) registers.
> + */
> +#define MVNETA_RX_PKT_OFFSET_CORRECTION		64
> +
>  #define MVNETA_RX_PKT_SIZE(mtu) \
>  	ALIGN((mtu) + MVNETA_MH_SIZE + MVNETA_VLAN_TAG_LEN + \
>  	      ETH_HLEN + ETH_FCS_LEN,			     \
> @@ -417,6 +423,10 @@ struct mvneta_port {
>  	u64 ethtool_stats[ARRAY_SIZE(mvneta_statistics)];
>  
>  	u32 indir[MVNETA_RSS_LU_TABLE_SIZE];
> +#ifdef CONFIG_64BIT
> +	u64 data_high;
> +#endif
> +	u16 rx_offset_correction;
>  };
>  
>  /* The mvneta_tx_desc and mvneta_rx_desc structures describe the
> @@ -961,7 +971,9 @@ static int mvneta_bm_port_init(struct platform_device *pdev,
>  			       struct mvneta_port *pp)
>  {
>  	struct device_node *dn = pdev->dev.of_node;
> -	u32 long_pool_id, short_pool_id, wsize;
> +	u32 long_pool_id, short_pool_id;
> +#ifndef CONFIG_64BIT
> +	u32 wsize;
>  	u8 target, attr;
>  	int err;
>  
> @@ -985,7 +997,7 @@ static int mvneta_bm_port_init(struct platform_device *pdev,
>  		netdev_info(pp->dev, "missing long pool id\n");
>  		return -EINVAL;
>  	}
> -
> +#endif
>  	/* Create port's long pool depending on mtu */
>  	pp->pool_long = mvneta_bm_pool_use(pp->bm_priv, long_pool_id,
>  					   MVNETA_BM_LONG, pp->id,
> @@ -1790,6 +1802,10 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
>  	if (!data)
>  		return -ENOMEM;
>  
> +#ifdef CONFIG_64BIT
> +	if (unlikely(pp->data_high != ((u64)data & 0xffffffff00000000)))
> +		return -ENOMEM;
> +#endif
>  	phys_addr = dma_map_single(pp->dev->dev.parent, data,
>  				   MVNETA_RX_BUF_SIZE(pp->pkt_size),
>  				   DMA_FROM_DEVICE);
> @@ -1798,7 +1814,8 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
>  		return -ENOMEM;
>  	}
>  
> -	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
> +	phys_addr += pp->rx_offset_correction;
> +	mvneta_rx_desc_fill(rx_desc, phys_addr, (uintptr_t)data);
>  	return 0;
>  }
>  
> @@ -1860,8 +1877,16 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
>  
>  	for (i = 0; i < rxq->size; i++) {
>  		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
> -		void *data = (void *)rx_desc->buf_cookie;
> -
> +		void *data = (u8 *)(uintptr_t)rx_desc->buf_cookie;
> +#ifdef CONFIG_64BIT
> +		/* In Neta HW only 32 bits data is supported, so in
> +		 * order to obtain whole 64 bits address from RX
> +		 * descriptor, we store the upper 32 bits when
> +		 * allocating buffer, and put it back when using
> +		 * buffer cookie for accessing packet in memory.
> +		 */
> +		data = (u8 *)(pp->data_high | (u64)data);
> +#endif
>  		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
>  				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
>  		mvneta_frag_free(pp->frag_size, data);
> @@ -1898,7 +1923,17 @@ static int mvneta_rx_swbm(struct mvneta_port *pp, int rx_todo,
>  		rx_done++;
>  		rx_status = rx_desc->status;
>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
> +#ifdef CONFIG_64BIT
> +		/* In Neta HW only 32 bits data is supported, so in
> +		 * order to obtain whole 64 bits address from RX
> +		 * descriptor, we store the upper 32 bits when
> +		 * allocating buffer, and put it back when using
> +		 * buffer cookie for accessing packet in memory.
> +		 */
> +		data = (u8 *)(pp->data_high | (u64)rx_desc->buf_cookie);
> +#else
>  		data = (unsigned char *)rx_desc->buf_cookie;
> +#endif
>  		phys_addr = rx_desc->buf_phys_addr;
>  
>  		if (!mvneta_rxq_desc_is_first_last(rx_status) ||
> @@ -2019,7 +2054,17 @@ static int mvneta_rx_hwbm(struct mvneta_port *pp, int rx_todo,
>  		rx_done++;
>  		rx_status = rx_desc->status;
>  		rx_bytes = rx_desc->data_size - (ETH_FCS_LEN + MVNETA_MH_SIZE);
> -		data = (unsigned char *)rx_desc->buf_cookie;
> +#ifdef CONFIG_64BIT
> +		/* In Neta HW only 32 bits data is supported, so in
> +		 * order to obtain whole 64 bits address from RX
> +		 * descriptor, we store the upper 32 bits when
> +		 * allocating buffer, and put it back when using
> +		 * buffer cookie for accessing packet in memory.
> +		 */
> +		data = (u8 *)(pp->data_high | (u64)rx_desc->buf_cookie);
> +#else
> +		data = (u8 *)rx_desc->buf_cookie;
> +#endif
>  		phys_addr = rx_desc->buf_phys_addr;
>  		pool_id = MVNETA_RX_GET_BM_POOL_ID(rx_desc);
>  		bm_pool = &pp->bm_priv->bm_pools[pool_id];
> @@ -2774,7 +2819,7 @@ static int mvneta_rxq_init(struct mvneta_port *pp,
>  	mvreg_write(pp, MVNETA_RXQ_SIZE_REG(rxq->id), rxq->size);
>  
>  	/* Set Offset */
> -	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD);
> +	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD - pp->rx_offset_correction);
>  
>  	/* Set coalescing pkts and time */
>  	mvneta_rx_pkts_coal_set(pp, rxq, rxq->pkts_coal);
> @@ -2935,6 +2980,22 @@ static void mvneta_cleanup_rxqs(struct mvneta_port *pp)
>  static int mvneta_setup_rxqs(struct mvneta_port *pp)
>  {
>  	int queue;
> +#ifdef CONFIG_64BIT
> +	void *data_tmp;
> +
> +	/* In Neta HW only 32 bits data is supported, so in order to
> +	 * obtain whole 64 bits address from RX descriptor, we store
> +	 * the upper 32 bits when allocating buffer, and put it back
> +	 * when using buffer cookie for accessing packet in memory.
> +	 * Frags should be allocated from single 'memory' region,
> +	 * hence common upper address half should be sufficient.
> +	 */
> +	data_tmp = mvneta_frag_alloc(pp->frag_size);
> +	if (data_tmp) {
> +		pp->data_high = (u64)data_tmp & 0xffffffff00000000;
> +		mvneta_frag_free(pp->frag_size, data_tmp);
> +	}
> +#endif
>  
>  	for (queue = 0; queue < rxq_number; queue++) {
>  		int err = mvneta_rxq_init(pp, &pp->rxqs[queue]);
> @@ -3672,25 +3733,24 @@ static void mvneta_ethtool_update_stats(struct mvneta_port *pp)
>  	const struct mvneta_statistic *s;
>  	void __iomem *base = pp->base;
>  	u32 high, low, val;
> -	u64 val64;
>  	int i;
>  
>  	for (i = 0, s = mvneta_statistics;
>  	     s < mvneta_statistics + ARRAY_SIZE(mvneta_statistics);
>  	     s++, i++) {
> +		val = 0;
>  		switch (s->type) {
>  		case T_REG_32:
>  			val = readl_relaxed(base + s->offset);
> -			pp->ethtool_stats[i] += val;
>  			break;
>  		case T_REG_64:
>  			/* Docs say to read low 32-bit then high */
>  			low = readl_relaxed(base + s->offset);
>  			high = readl_relaxed(base + s->offset + 4);
> -			val64 = (u64)high << 32 | low;
> -			pp->ethtool_stats[i] += val64;
> +			val = (u64)high << 32 | low;
>  			break;
>  		}
> +		pp->ethtool_stats[i] += val;
>  	}
>  }
>  
> @@ -4034,6 +4094,13 @@ static int mvneta_probe(struct platform_device *pdev)
>  
>  	pp->rxq_def = rxq_def;
>  
> +	/* Set RX packet offset correction for platforms, whose
> +	 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
> +	 * platforms and 0B for 32-bit ones.
> +	 */
> +	pp->rx_offset_correction =
> +		max(0, NET_SKB_PAD - MVNETA_RX_PKT_OFFSET_CORRECTION);
> +
>  	pp->indir[0] = rxq_def;
>  
>  	pp->clk = devm_clk_get(&pdev->dev, "core");
> --
> 
> >
> > Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> > ---
> >  drivers/net/ethernet/marvell/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
> > index b5c6d42..53d6572 100644
> > --- a/drivers/net/ethernet/marvell/Kconfig
> > +++ b/drivers/net/ethernet/marvell/Kconfig
> > @@ -42,7 +42,7 @@ config MVMDIO
> >  
> >  config MVNETA_BM_ENABLE
> >  	tristate "Marvell Armada 38x/XP network interface BM support"
> > -	depends on MVNETA
> > +	depends on MVNETA && !64BIT
> >  	---help---
> >  	  This driver supports auxiliary block of the network
> >  	  interface units in the Marvell ARMADA XP and ARMADA 38x SoC
> > -- 
> > 2.8.0.rc3
> >  
> 

^ permalink raw reply

* Re: am335x: no multicast reception over VLAN
From: Mugunthan V N @ 2016-03-31  6:37 UTC (permalink / raw)
  To: Peter Korsgaard
  Cc: Yegor Yefremov, Grygorii Strashko, netdev,
	linux-omap@vger.kernel.org, drivshin, ml
In-Reply-To: <878u0z9297.fsf@dell.be.48ers.dk>

On Thursday 31 March 2016 01:17 AM, Peter Korsgaard wrote:
>>>>>> "Mugunthan" == Mugunthan V N <mugunthanvnm@ti.com> writes:
> 
> Hi,
> 
>  > You had received these packets as tcpdump will enable promiscuous mode
>  > so that you receive all the packets from the wire.
> 
> FYI, you can use the -p option to tcpdump to not put the interface into
> promiscuous mode.
> 

Thanks for the information Peter Korsgaard.

Yegor, can you provide tcpdump using -p as well in Grygorii commands.

Regards
Mugunthan V N

^ permalink raw reply

* [PATCH] net: mvneta: remove useless RX descriptor prefetch
From: Jisheng Zhang @ 2016-03-31  6:36 UTC (permalink / raw)
  To: thomas.petazzoni, gregory.clement, davem
  Cc: netdev, linux-kernel, Jisheng Zhang

The rx descriptors are allocated using dma_alloc_coherent, so prefetch
doesn't really happen at all.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 5880871..6c09a27 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -757,7 +757,6 @@ mvneta_rxq_next_desc_get(struct mvneta_rx_queue *rxq)
 	int rx_desc = rxq->next_desc_to_proc;
 
 	rxq->next_desc_to_proc = MVNETA_QUEUE_NEXT_DESC(rxq, rx_desc);
-	prefetch(rxq->descs + rxq->next_desc_to_proc);
 	return rxq->descs + rx_desc;
 }
 
-- 
2.8.0.rc3

^ permalink raw reply related

* Re: [PATCH] net: mvneta: explicitly disable BM on 64bit platform
From: Marcin Wojtas @ 2016-03-31  6:49 UTC (permalink / raw)
  To: Jisheng Zhang
  Cc: Gregory CLEMENT, David S. Miller, Thomas Petazzoni, netdev,
	linux-kernel, linux-arm-kernel@lists.infradead.org
In-Reply-To: <20160331135322.523a7ceb@xhacker>

Hi Jisheng,

2016-03-31 7:53 GMT+02:00 Jisheng Zhang <jszhang@marvell.com>:
> Hi Gregory,
>
> On Wed, 30 Mar 2016 17:11:41 +0200 Gregory CLEMENT wrote:
>
>> Hi Jisheng,
>>
>>  On mer., mars 30 2016, Jisheng Zhang <jszhang@marvell.com> wrote:
>>
>> > The mvneta BM can't work on 64bit platform, as the BM hardware expects
>> > buf virtual address to be placed in the first four bytes of mapped
>> > buffer, but obviously the virtual address on 64bit platform can't be
>> > stored in 4 bytes. So we have to explicitly disable BM on 64bit
>> > platform.
>>
>> Actually mvneta is used on Armada 3700 which is a 64bits platform.
>> Is it true that the driver needs some change to use BM in 64 bits, but
>> we don't have to disable it.
>>
>> Here is the 64 bits part of the patch we have currently on the hardware
>> prototype. We have more things which are really related to the way the
>> mvneta is connected to the Armada 3700 SoC. This code was not ready for
>
> Thanks for the sharing.
>
> I think we could commit easy parts firstly, for example: the cacheline size
> hardcoding, either piece of your diff or my version:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2016-March/418513.html

Since the commit:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/include/asm/cache.h?id=97303480753e48fb313dc0e15daaf11b0451cdb8
detached L1_CACHE_BYTES from real cache size, I suggest, the macro should be:
#define MVNETA_CPU_D_CACHE_LINE_SIZE     cache_line_size()

Regarding check after dma_alloc_coherent, I agree it's not necessary.

>
>> mainline but I prefer share it now instead of having the HWBM blindly
>
> I have looked through the diff, it is for the driver itself on 64bit platforms,
> and it doesn't touch BM. The BM itself need to be disabled for 64bit, I'm not
> sure the BM could work on 64bit even with your diff. Per my understanding, the BM
> can't work on 64 bit, let's have a look at some piece of the mvneta_bm_construct()
>
> *(u32 *)buf = (u32)buf;

Indeed this particular part is different and unclear, I tried
different options - with no success. I'm checking with design team
now. Anyway, I managed to enable operation for HWBM on A3700 with one
work-around in mvneta_hwbm_rx():
data = phys_to_virt(rx_desc->buf_phys_addr);

Of course mvneta_bm, due to some silicone differences needed also a rework.

Actually I'd wait with updating 64-bit parts of mvneta, until real
support for such machine's controller is introduced. Basing on my
experience with enabling neta on A3700, it turns out to be more
changes.

Best regards,
Marcin

^ permalink raw reply

* Re: [PATCH] net: mvneta: remove useless RX descriptor prefetch
From: Jisheng Zhang @ 2016-03-31  6:45 UTC (permalink / raw)
  To: thomas.petazzoni, gregory.clement, davem
  Cc: netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <1459406190-2334-1-git-send-email-jszhang@marvell.com>

Hi,

+ linux arm kernel

On Thu, 31 Mar 2016 14:36:30 +0800 Jisheng Zhang wrote:

> The rx descriptors are allocated using dma_alloc_coherent, so prefetch
> doesn't really happen at all.

This is for RFC, I'm sorry to send it without changing its title -- s/PATCH/RFC.

I'm not sure whether there's any benefit to prefetch on space allocated from
dma_alloc_coherent.

Thanks

> 
> Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 5880871..6c09a27 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -757,7 +757,6 @@ mvneta_rxq_next_desc_get(struct mvneta_rx_queue *rxq)
>  	int rx_desc = rxq->next_desc_to_proc;
>  
>  	rxq->next_desc_to_proc = MVNETA_QUEUE_NEXT_DESC(rxq, rx_desc);
> -	prefetch(rxq->descs + rxq->next_desc_to_proc);
>  	return rxq->descs + rx_desc;
>  }
>  

^ permalink raw reply

* Reboot Failed was occurred after v4.4-rc1 during IPv6 Ready Logo Conformance Test
From: Yuki Machida @ 2016-03-31  7:09 UTC (permalink / raw)
  To: netdev

Hi all,

Reboot Failed was occurred at Linux Kernel after v4.4-rc1 during IPv6 Ready Logo Conformance Test.
Not Fix a bug in v4.5-rc7 yet.
I will conform that it fix a bug in v4.6-rc1.

Currently, it is under investigation.

IPv6 Ready Logo
https://www.ipv6ready.org/
TAHI Project
http://www.tahi.org/

I ran the IPv6 Ready Logo Core Conformance Test on Intel D510MO (Atom D510).
It is using userland build with yocto project.

Test Environment
Test Specification          : 4.0.6
Tool Version                : REL_3_3_2
Test Program Version        : V6LC_5_0_0
Target Device               : Intel D510MO (Atom D510)

I conform that random testcases are failed.
(e.g. No.5, No.130, No.131, No.134, No.167 and No.168)

Regards,
Yuki Machida

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox