Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] MAINTAINERS: TCP gets its first maintainer
From: David Miller @ 2018-06-04 14:32 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet
In-Reply-To: <20180604135029.241753-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Mon,  4 Jun 2018 06:50:29 -0700

> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thanks a lot Eric, applied to net-next. :-)

^ permalink raw reply

* [PATCH 1/2 v2 net-next] net_failover: Use netdev_features_t instead of u32
From: Dan Carpenter @ 2018-06-04 14:43 UTC (permalink / raw)
  To: David S. Miller, Sridhar Samudrala; +Cc: netdev, kernel-janitors
In-Reply-To: <20180604.093147.1707102168081704551.davem@davemloft.net>

The features mask needs to be a netdev_features_t (u64) because a u32
is not big enough.

Fixes: cfc80d9a1163 ("net: Introduce net_failover driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
v2: In the original patch, I thought that the & should be | and I
    introduced a bug.

diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index 8b508e2cf29b..83f7420ddea5 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -380,7 +380,8 @@ static rx_handler_result_t net_failover_handle_frame(struct sk_buff **pskb)
 
 static void net_failover_compute_features(struct net_device *dev)
 {
-	u32 vlan_features = FAILOVER_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;
+	netdev_features_t vlan_features = FAILOVER_VLAN_FEATURES &
+					  NETIF_F_ALL_FOR_ALL;
 	netdev_features_t enc_features  = FAILOVER_ENC_FEATURES;
 	unsigned short max_hard_header_len = ETH_HLEN;
 	unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |

^ permalink raw reply related

* [PATCH 2/2 net] team: use netdev_features_t instead of u32
From: Dan Carpenter @ 2018-06-04 14:46 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: David S. Miller, netdev, kernel-janitors
In-Reply-To: <20180604.093147.1707102168081704551.davem@davemloft.net>

This code was introduced in 2011 around the same time that we made
netdev_features_t a u64 type.  These days a u32 is not big enough to
hold all the potential features.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 267dcc929f6c..8863fa023500 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1004,7 +1004,8 @@ static void team_port_disable(struct team *team,
 static void __team_compute_features(struct team *team)
 {
 	struct team_port *port;
-	u32 vlan_features = TEAM_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;
+	netdev_features_t vlan_features = TEAM_VLAN_FEATURES &
+					  NETIF_F_ALL_FOR_ALL;
 	netdev_features_t enc_features  = TEAM_ENC_FEATURES;
 	unsigned short max_hard_header_len = ETH_HLEN;
 	unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |

^ permalink raw reply related

* Re: [PATCH 2/2 net] team: use netdev_features_t instead of u32
From: Jiri Pirko @ 2018-06-04 14:49 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: David S. Miller, netdev, kernel-janitors
In-Reply-To: <20180604144601.wo5iqb5irfes4frv@kili.mountain>

Mon, Jun 04, 2018 at 04:46:01PM CEST, dan.carpenter@oracle.com wrote:
>This code was introduced in 2011 around the same time that we made
>netdev_features_t a u64 type.  These days a u32 is not big enough to
>hold all the potential features.
>
>Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* RE: [PATCH net-next] qed: use dma_zalloc_coherent instead of allocator/memset
From: Tayar, Tomer @ 2018-06-04 14:53 UTC (permalink / raw)
  To: YueHaibing, davem@davemloft.net, Elior, Ariel
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Dept-Eng Everest Linux L2
In-Reply-To: <20180604131031.24476-1-yuehaibing@huawei.com>

From: YueHaibing [mailto:yuehaibing@huawei.com]
Sent: Monday, June 04, 2018 4:11 PM

> Use dma_zalloc_coherent instead of dma_alloc_coherent
> followed by memset 0.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Thanks
Acked-by: Tomer Tayar <Tomer.Tayar@cavium.com>

^ permalink raw reply

* Re: [PATCH rdma-next v3 00/14] Verbs flow counters support
From: Jason Gunthorpe @ 2018-06-04 14:53 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, RDMA mailing list, Boris Pismenny, Matan Barak,
	Michael J . Ruhl, Or Gerlitz, Raed Salem, Yishai Hadas,
	Saeed Mahameed, linux-netdev
In-Reply-To: <20180602050419.GH2843@mtr-leonro.mtl.com>

On Sat, Jun 02, 2018 at 08:04:19AM +0300, Leon Romanovsky wrote:
> On Fri, Jun 01, 2018 at 03:11:49PM -0600, Jason Gunthorpe wrote:
> > On Thu, May 31, 2018 at 04:43:27PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@mellanox.com>
> > >
> > > Changelog:
> > > v2->v3:
> > >  * Change function mlx5_fc_query signature to hide the details of
> > >    internal core driver struct mlx5_fc
> > >  * Add commen to data[] field at struct mlx5_ib_flow_counters_data (mlx5-abi.h)
> > >  * Use array of struct mlx5_ib_flow_counters_desc to clarify the output
> > > v1->v2:
> > >  * Removed conversion from struct mlx5_fc* to void*
> > >  * Fixed one place with double space in it
> > >  * Balanced release of hardware handler in case of counters allocation failure
> > >  * Added Tested-by
> > >  * Minimize time spent holding mutex lock
> > >  * Fixed deadlock caused by nested lock in error path
> > >  * Protect from handler pointer derefence in the error paths
> >
> > Okay,
> >
> > Acked-by: Jason Gunthorpe <jgg@mellanox.com>
> >
> > I've revised some of the commit messages, fixed the two bad
> > check-patch warnings, and fixed the patch ordering..
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-counters
> >
> > Please send a PR with the mlx-core bits and above commits.
> 
> Hi,
> 
> I applied two mlx5-next commits to the relevant tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/commit/?h=mlx5-next&id=930821e39d0a5f91ed58fea1692afe04f0fe0e1f
> https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/commit/?h=mlx5-next&id=5f9bf63ae80c4d0e5e986b6c1280bf8174978545
> 
> In first commit, I dropped the words "as used to be", per-Saeed's request.
> 
> The proper signed tag for whole the series is: verbs_flow_counters
> git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tags/verbs_flow_counters

Okay pulled, thanks

Jason

^ permalink raw reply

* Re: [bpf PATCH v2] bpf: sockmap, fix crash when ipv6 sock is added
From: Daniel Borkmann @ 2018-06-04 14:55 UTC (permalink / raw)
  To: John Fastabend, Eric Dumazet, edumazet, ast, Dave Watson; +Cc: netdev
In-Reply-To: <aeff8901-2ca4-99b6-6238-93703946b233@gmail.com>

On 06/04/2018 03:57 PM, John Fastabend wrote:
> On 06/04/2018 06:39 AM, Daniel Borkmann wrote:
>> On 06/02/2018 11:39 PM, John Fastabend wrote:
>>> On 06/01/2018 12:58 PM, Eric Dumazet wrote:
>>>> On 06/01/2018 03:46 PM, John Fastabend wrote:
>>>>> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
>>>>> of tcpv6_prot.
>>>>
>>>> ...
>>>>
>>>>> +	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
>>>>> +	 * state. Supporting sockets in LISTEN state will require us to
>>>>> +	 * modify the accept implementation to clone rather then share the
>>>>> +	 * ulp context.
>>>>> +	 */
>>>>> +	if (sock->sk_state != TCP_ESTABLISHED)
>>>>> +		return -ENOTSUPP;
>>>>> +
>>>>>  	/* 1. If sock map has BPF programs those will be inherited by the
>>>>>  	 * sock being added. If the sock is already attached to BPF programs
>>>>>  	 * this results in an error.
>>>>
>>>> Next question will be then : What happens if syzbot uses tcp_disconnect() and then listen() ?
>>>
>>> Yep we need to fix that as well :( Looks like we can plumb the
>>> unhash callback and remove it from the sockmap when the socket
>>> goes through tcp_disconnect().
>>>
>>> This patch should go in as-is though and we can fix the disconnect
>>> issue with a new patch.
>>>
>>> Adding Dave Watson to the thread as well because I'm guessing
>>> the disconnect() case is also applicable to TLS. At least I see
>>> a hw handler for unhash but there does not appear to be a handler
>>> in the SW case, at least from a quick glance.
>>>
>>> Thanks again!
>>
>> Given the discussion and fixes weren't resolved resp. ready in time for 4.17,
>> and last bpf pr for it went out last week, we need to route this via -stable
>> once all is hashed out.
> 
> OK.
> 
>> This fix here therefore needs to be rebased against bpf-next tree, and as far
>> as I can see another fix for hash map is also needed to address the same issue.
>>
> 
> This fix works for both sockmap and sockhash because they use the same
> ulp register and init paths. But, will rebase for net-next and send out
> this morning.

Ok, right, because in bpf-next this eventually goes into __sock_map_ctx_update_elem()
instead of sock_map_ctx_update_elem() call site.

>> After that, likely also fixes for the disconnect + listen case are needed.
> 
> Yep will have a fix today for this.
> 
>> (I can use the one here later on for -stable backport, but given merge window
>> is open this needs a rebase and a resolution for hash map.)
> 
> hash map is also resolved with the same patch but please do queue this
> up for -stable.

Will do, thanks!

^ permalink raw reply

* Re: 答复: ANNOUNCE: Enhanced IP v1.4
From: Tom Herbert @ 2018-06-04 14:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: PKU.孙斌, Willy Tarreau,
	Linux Kernel Network Developers
In-Reply-To: <4d9e164d-58e3-caa0-a378-b9681eefa9d7@gmail.com>

On Mon, Jun 4, 2018 at 6:02 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 06/03/2018 10:58 PM, PKU.孙斌 wrote:
>> On Sun, Jun 03, 2018 at 03:41:08PM -0700, Eric Dumazet wrote:
>>>
>>>
>>> On 06/03/2018 01:37 PM, Tom Herbert wrote:
>>>
>>>> This is not an inconsequential mechanism that is being proposed. It's
>>>> a modification to IP protocol that is intended to work on the
>>>> Internet, but it looks like the draft hasn't been updated for two
>>>> years and it is not adopted by any IETF working group. I don't see how
>>>> this can go anywhere without IETF support. Also, I suggest that you
>>>> look at the IPv10 proposal since that was very similar in intent. One
>>>> of the reasons that IPv10 shot down was because protocol transition
>>>> mechanisms were more interesting ten years ago than today. IPv6 has
>>>> good traction now. In fact, it's probably the case that it's now
>>>> easier to bring up IPv6 than to try to make IPv4 options work over the
>>>> Internet.
>>>
>>> +1
>>>
>>> Many hosts do not use IPv4 anymore.
>>>
>>> We even have the project making IPv4 support in linux optional.
>>
>> I guess then Linux kernel wouldn't be able to boot itself without IPv4 built in, e.g., when we only have old L2 links (without the IPv6 frame type)...
>
>
>
> *Optional* means that a CONFIG_IPV4 would be there, and some people could build a kernel with CONFIG_IPV4=n,
>
> Like IPv6 is optional today.
>
> Of course, most distros will select CONFIG_IPV4=y  (as they probably select CONFIG_IPV6=y today)
>
> Do not worry, IPv4 is not dead, but I doubt Enhanced IP v1.4 has any chance,
> it is at least 10 years too late.

There's also https://www.theregister.co.uk/2018/05/30/internet_engineers_united_nations_ipv6/.
We're reaching the point where it's the transition mechnanisms that
are hampering IPv6 adoption.

Tom

^ permalink raw reply

* Re: [PATCH net-next v2 1/3] bpf: implement bpf_get_current_cgroup_id() helper
From: Alexei Starovoitov @ 2018-06-04 14:58 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Yonghong Song, ast, netdev, kernel-team
In-Reply-To: <0032678a-67ee-78ab-47d7-4ea3ecb1edd7@iogearbox.net>

On Mon, Jun 04, 2018 at 11:08:35AM +0200, Daniel Borkmann wrote:
> On 06/04/2018 12:59 AM, Yonghong Song wrote:
> > bpf has been used extensively for tracing. For example, bcc
> > contains an almost full set of bpf-based tools to trace kernel
> > and user functions/events. Most tracing tools are currently
> > either filtered based on pid or system-wide.
> > 
> > Containers have been used quite extensively in industry and
> > cgroup is often used together to provide resource isolation
> > and protection. Several processes may run inside the same
> > container. It is often desirable to get container-level tracing
> > results as well, e.g. syscall count, function count, I/O
> > activity, etc.
> > 
> > This patch implements a new helper, bpf_get_current_cgroup_id(),
> > which will return cgroup id based on the cgroup within which
> > the current task is running.
> > 
> > The later patch will provide an example to show that
> > userspace can get the same cgroup id so it could
> > configure a filter or policy in the bpf program based on
> > task cgroup id.
> > 
> > The helper is currently implemented for tracing. It can
> > be added to other program types as well when needed.
> > 
> > Acked-by: Alexei Starovoitov <ast@kernel.org>
> > Signed-off-by: Yonghong Song <yhs@fb.com>
> > ---
> >  include/linux/bpf.h      |  1 +
> >  include/uapi/linux/bpf.h |  8 +++++++-
> >  kernel/bpf/core.c        |  1 +
> >  kernel/bpf/helpers.c     | 15 +++++++++++++++
> >  kernel/trace/bpf_trace.c |  2 ++
> >  5 files changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index bbe2974..995c3b1 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -746,6 +746,7 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
> >  extern const struct bpf_func_proto bpf_get_stack_proto;
> >  extern const struct bpf_func_proto bpf_sock_map_update_proto;
> >  extern const struct bpf_func_proto bpf_sock_hash_update_proto;
> > +extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
> >  
> >  /* Shared helpers among cBPF and eBPF. */
> >  void bpf_user_rnd_init_once(void);
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index f0b6608..18712b0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2070,6 +2070,11 @@ union bpf_attr {
> >   * 		**CONFIG_SOCK_CGROUP_DATA** configuration option.
> >   * 	Return
> >   * 		The id is returned or 0 in case the id could not be retrieved.
> > + *
> > + * u64 bpf_get_current_cgroup_id(void)
> > + * 	Return
> > + * 		A 64-bit integer containing the current cgroup id based
> > + * 		on the cgroup within which the current task is running.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)		\
> >  	FN(unspec),			\
> > @@ -2151,7 +2156,8 @@ union bpf_attr {
> >  	FN(lwt_seg6_action),		\
> >  	FN(rc_repeat),			\
> >  	FN(rc_keydown),			\
> > -	FN(skb_cgroup_id),
> > +	FN(skb_cgroup_id),		\
> > +	FN(get_current_cgroup_id),
> >  
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 527587d..9f14937 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -1765,6 +1765,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
> >  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
> >  const struct bpf_func_proto bpf_sock_map_update_proto __weak;
> >  const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
> > +const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
> >  
> >  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
> >  {
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 3d24e23..73065e2 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -179,3 +179,18 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
> >  	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
> >  	.arg2_type	= ARG_CONST_SIZE,
> >  };
> > +
> > +#ifdef CONFIG_CGROUPS
> > +BPF_CALL_0(bpf_get_current_cgroup_id)
> > +{
> > +	struct cgroup *cgrp = task_dfl_cgroup(current);
> > +
> > +	return cgrp->kn->id.id;
> > +}
> > +
> > +const struct bpf_func_proto bpf_get_current_cgroup_id_proto = {
> > +	.func		= bpf_get_current_cgroup_id,
> > +	.gpl_only	= false,
> > +	.ret_type	= RET_INTEGER,
> > +};
> > +#endif
> 
> Nit: why not moving this function directly to bpf_trace.c?

my preference would be to keep it in helpers.c as-is.
imo bpf_trace.c is only for things that depend on kernel/trace/

> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 752992c..e2ab5b7 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -564,6 +564,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >  		return &bpf_get_prandom_u32_proto;
> >  	case BPF_FUNC_probe_read_str:
> >  		return &bpf_probe_read_str_proto;
> > +	case BPF_FUNC_get_current_cgroup_id:
> > +		return &bpf_get_current_cgroup_id_proto;
> 
> When you have !CONFIG_CGROUPS, then it relies on the weak definition of the
> bpf_get_current_cgroup_id_proto, which I would think at latest in fixup_bpf_calls()
> bails out with 'kernel subsystem misconfigured func' due to func being NULL.
> 
> Can't we just do the #ifdef CONFIG_CGROUPS around BPF_FUNC_get_current_cgroup_id
> case instead? Then we bail out normally with 'unknown func' when cgroups are
> not configured?

good idea.

^ permalink raw reply

* Re: [net-next][PATCH] tcp: probe timer MUST not less than 5 minuter for tcp PMTU
From: David Miller @ 2018-06-04 14:59 UTC (permalink / raw)
  To: lirongqing; +Cc: netdev
In-Reply-To: <1527851039-6626-1-git-send-email-lirongqing@baidu.com>

From: Li RongQing <lirongqing@baidu.com>
Date: Fri,  1 Jun 2018 19:03:59 +0800

> RFC4821 say: The value for this timer MUST NOT be less than
> 5 minutes and is recommended to be 10 minutes, per RFC 1981.
> 
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

As Eric stated, admins should be allowed to set this however they
wish (f.e. for specialized tests).

^ permalink raw reply

* Re: [PATCH net] ipv4: igmp: hold wakelock to prevent delayed reports
From: Tejaswi Tanikella @ 2018-06-04 15:03 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, Andrew Lunn
In-Reply-To: <8661f50b-876f-50d8-f1a5-61b1dabc5398@gmail.com>

On Fri, Jun 01, 2018 at 07:45:16AM -0700, Florian Fainelli wrote:

Thank you Florian for reviewing my patch.
I had a few questions.
> 
> 
> On 06/01/2018 07:05 AM, Tejaswi Tanikella wrote:
> > On receiving a IGMPv2/v3 query, based on max_delay set in the header a
> > timer is started to send out a response after a random time within
> > max_delay. If the system then moves into suspend state, Report is
> > delayed until system wakes up.
> > 
> > In one reported scenario, on arm64 devices, max_delay was set to 10s,
> > Reports were consistantly delayed if the timer is scheduled after 5 plus
> > seconds.
> > 
> > Hold wakelock while starting the timer to prevent moving into suspend
> > state.
> 
> I suppose this looks fine, but are not you going to be playing
> whack-a-mole through the network stack wherever there are similar
> patterns? Is not a better solution to move this to
> in_dev_get()/in_dev_put() where you could create a proper wakelock per
> network device?
I see that in_dev_get()/in_dev_put() are being used by some drivers.
won't addition of a wakelock be unnecessary for them?
Will adding two independent functions per in_dev to hold and release
wakelock be sufficient ?


> For instance, I could imagine ARP suffering from the same short comings...
I am not sure if ARP will be effected.
>From my understanding only systems with timers which trigger sending out
responses should be affected. For example IGMP and MLD must send reports
for the router to fwd mcast packets.
In ARP for instance, only moving between states will be delayed. Is there
something that I am missing ?


> 
> There is one thing that needs fixing though, see below.
Thanks, I'll fix them in the next patch.
> 
> > 
> > Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>
> > ---
> >  include/linux/igmp.h |  1 +b
> >  net/ipv4/igmp.c      | 20 ++++++++++++++++++--
> >  2 files changed, 19 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/igmp.h b/include/linux/igmp.h
> > index f823185..9be1c58 100644
> > --- a/include/linux/igmp.h
> > +++ b/include/linux/igmp.h
> > @@ -84,6 +84,7 @@ struct ip_mc_list {
> >  	};
> >  	struct ip_mc_list __rcu *next_hash;
> >  	struct timer_list	timer;
> > +	struct wakeup_source	*wakeup_src;
> 
> Since you are using this only when CONFIG_IP_MULTICAST is defined, you
> might was well save a few bytes by making this enclosed within an ifdef
> CONFIG_IP_MULTICAST here as well?
> 
> [snip]
> 
> > @@ -1415,6 +1429,8 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
> >  #ifdef CONFIG_IP_MULTICAST
> >  	timer_setup(&im->timer, igmp_timer_expire, 0);
> >  	im->unsolicit_count = net->ipv4.sysctl_igmp_qrv;
> > +	im->wakeup_src = wakeup_source_create("igmp_wakeup_source");
> 
> Missing error checking, wakeup_source_create() can return NULL here.
> -- 
> Florian

^ permalink raw reply

* INFO: task hung in ip6gre_exit_batch_net
From: syzbot @ 2018-06-04 15:03 UTC (permalink / raw)
  To: christian.brauner, davem, dsahern, fw, jbenc, ktkhai,
	linux-kernel, lucien.xin, mschiffer, netdev, syzkaller-bugs,
	vyasevich

Hello,

syzbot found the following crash on:

HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com

INFO: task kworker/u4:1:22 blocked for more than 120 seconds.
       Not tainted 4.17.0-rc6+ #68
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u4:1    D13944    22      2 0x80000000
Workqueue: netns cleanup_net
Call Trace:
  context_switch kernel/sched/core.c:2859 [inline]
  __schedule+0x801/0x1e30 kernel/sched/core.c:3501
  schedule+0xef/0x430 kernel/sched/core.c:3545
  schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3603
  __mutex_lock_common kernel/locking/mutex.c:833 [inline]
  __mutex_lock+0xe38/0x17f0 kernel/locking/mutex.c:893
  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
  ip6gre_exit_batch_net+0xd5/0x7d0 net/ipv6/ip6_gre.c:1585
  ops_exit_list.isra.7+0x105/0x160 net/core/net_namespace.c:155
  cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Showing all locks held in the system:
4 locks held by kworker/u4:1/22:
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: atomic64_set  
include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: set_work_data  
kernel/workqueue.c:617 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 0000000030a00b6b (net_cleanup_work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 000000007eb35e65 (pernet_ops_rwsem){++++}, at: cleanup_net+0x11a/0xb20  
net/core/net_namespace.c:490
  #3: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
3 locks held by kworker/1:1/24:
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_data kernel/workqueue.c:617 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 000000009edcfbe7 ((addr_chk_work).work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
2 locks held by khungtaskd/893:
  #0: 000000007eeb621a (rcu_read_lock){....}, at:  
check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
  #0: 000000007eeb621a (rcu_read_lock){....}, at: watchdog+0x1ff/0xf60  
kernel/hung_task.c:249
  #1: 00000000239f1b5e (tasklist_lock){.+.+}, at:  
debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470
2 locks held by getty/4481:
  #0: 00000000cc114025 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 000000006ad1f3fc (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4482:
  #0: 00000000226a16cc (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 000000008cee8cdc (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4483:
  #0: 0000000067bd3c39 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 000000005d8bc81d (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4484:
  #0: 00000000f0f8d839 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 00000000a9d5f091 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4485:
  #0: 000000002c96ee9a (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 0000000033338ac7 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4486:
  #0: 00000000f6db39b5 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 00000000bb7c6099 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
2 locks held by getty/4487:
  #0: 000000006be9659d (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
  #1: 00000000e2edd3d0 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
3 locks held by kworker/1:3/27147:
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic64_set  
include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic_long_set  
include/asm-generic/atomic-long.h:57 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: set_work_data  
kernel/workqueue.c:617 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 00000000fa87e61f (deferred_process_work){+.+.}, at:  
process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
1 lock held by syz-executor7/29665:
  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at: lock_sock  
include/net/sock.h:1469 [inline]
  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at:  
tls_sw_sendmsg+0x1b9/0x12e0 net/tls/tls_sw.c:384
1 lock held by syz-executor7/29689:
  #0: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74

=============================================

NMI backtrace for cpu 0
CPU: 0 PID: 893 Comm: khungtaskd Not tainted 4.17.0-rc6+ #68
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
  check_hung_task kernel/hung_task.c:132 [inline]
  check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
  watchdog+0xc10/0xf60 kernel/hung_task.c:249
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x6/0x10  
arch/x86/include/asm/irqflags.h:54


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply

* Re: [PATCH net] ipv4: igmp: hold wakelock to prevent delayed reports
From: David Miller @ 2018-06-04 15:06 UTC (permalink / raw)
  To: tejaswit; +Cc: netdev
In-Reply-To: <20180601140532.GA13253@tejaswit-linux.qualcomm.com>

From: Tejaswi Tanikella <tejaswit@codeaurora.org>
Date: Fri, 1 Jun 2018 19:35:41 +0530

> On receiving a IGMPv2/v3 query, based on max_delay set in the header a
> timer is started to send out a response after a random time within
> max_delay. If the system then moves into suspend state, Report is
> delayed until system wakes up.
> 
> In one reported scenario, on arm64 devices, max_delay was set to 10s,
> Reports were consistantly delayed if the timer is scheduled after 5 plus
> seconds.
> 
> Hold wakelock while starting the timer to prevent moving into suspend
> state.
> 
> Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>

As Florian stated, this won't be the only networking facility to hit
a problem like this.  So, if we go down this route, we probably want
to generically solve this.

But I have a deeper concern.

Do we really want every timer based querying mechanism to prevent a
system from being able to suspend?

We get to the point where external entities can generate traffic which
can prevent a remote system from entering suspend state.

^ permalink raw reply

* Re: [RFC PATCH 0/2] net: macb: Disable TX checksum offloading on all Zynq
From: Nicolas Ferre @ 2018-06-04 15:05 UTC (permalink / raw)
  To: Jennifer Dahm, netdev, David S . Miller
  Cc: Nathan Sullivan, Rafal Ozieblo, Claudiu Beznea, Harini Katakam
In-Reply-To: <1527284654-24835-1-git-send-email-jennifer.dahm@ni.com>

Jennifer,

On 25/05/2018 at 23:44, Jennifer Dahm wrote:
> During testing, I discovered that the Zynq GEM hardware overwrites all
> outgoing UDP packet checksums, which is illegal in packet forwarding
> cases. This happens both with and without the checksum-zeroing
> behavior  introduced  in  007e4ba3ee137f4700f39aa6dbaf01a71047c5f6
> ("net: macb: initialize checksum when using checksum offloading"). The
> only solution to both the small packet bug and the packet forwarding
> bug that I can find is to disable TX checksum offloading entirely.

Are the bugs listed above present in all revisions of the GEM IP, only 
for some revisions?
Is there an errata that describe this issue for the Zynq GEM?


> There's still the possibility that these bugs are actually with the
> driver software and not with the hardware. I've found several places
> where the checksum is set to 0xFFFF (the incorrect checksum found in
> small packets) when something goes wrong, and I can imagine a buggy
> driver writing over the checksum blindly when TX checksum offloading
> is enabled.
> 
> I would like feedback on two things:
> 1. Is it possible that the two bugs described above are caused by the
>     driver and not by the hardware? If so, where should I look to
>     implicate the driver?

Rafal,
Do you want to comment on this part?

> 2. Is this a problem we care enough about to completely disable TX
>     checksum offloading?

I would prefer to keep it enabled if possible. I mean I'm not against 
such a workaround if the HW is not able to handle such cases but 
offloading a CPU is always good when we have "low end" processors like 
ARM9 at 200MHz like the older AT91 that use this driver...

> Here is the testing procedure I used to reproduce these bugs on my
> machine. Specifically, without this patchset, step 9 fails. Without
> 007e4ba3ee, step 8 also fails.
> 
> 1. Set up the test environment:
>    a. Acquire a Zynq device with two ethernet ports. This is the DUT.
>    b. Acquire a USB-Ethernet adapter.
>    c. Acquire two ethernet cables.
>    d. Connect one Ethernet port on the DUT to your computer's network
>       switch.
>    e. Connect the other Ethernet port to the USB-Ethernet adapter and
>       plug that adapter into your computer.
>    f. Set up a Linux VM to send packets through the DUT. I recommend
>       using a VM here so that you can easily detach it from the primary
>       network to force outgoing traffic through the DUT.
>    g. Set up a computer with a packet inspecting program to receive and
>       inspect packets. This doesn't need to be a VM. For the purposes
>       of this test, I'll be using a Windows instance with WireShark.
> 2. Load the kernel you want to test onto the DUT, making sure to
>     include the `bridge` module.
> 3. Set up a bridge on the DUT. The following commands on the DUT
>     should work, replacing `eth0` and `eth1` with the two ethernet
>     interfaces on the DUT:
>     ```
>     brctl addbr test
>     brctl addif test eth0 eth1
>     ifconfig eth0 0.0.0.0
>     ifconfig eth1 0.0.0.0
>     dhclient test -v
>     ```
> 4. Disconnect the Linux VM from your host computer's network and
>     connect it to the USB-Ethernet adapter in order to force outgoing
>     network traffic through the DUT. If necessary, run dhclient on the
>     Linux VM to acquire an IP address.
> 5. Ensure that you can reach your Windows instance from your Linux VM
>     through the DUT (e.g. ping).
> 6. Start WireShark on your Windows instance and start monitoring
>     traffic on a specific, unused port (e.g. 61557).
> 7. Using netcat, send a few not-tiny UDP packets from your Linux VM to
>     your Windows instance to ensure that valid UDP packets are properly
>     forwarded. Ex:
>     ```
>     echo "hello world" | netcat -u <WindowsIP> 61557
>     ```
>     Inspect these packets to ensure that the data arrived intact and
>     that the checksum looks reasonable (i.e. not 0x0000 or 0xFFFF).
> 8. Using netcat, send a few tiny UDP packets (2 bytes or fewer) from
>     Linux VM to your Windows instance to ensure that the checksum is
>     reasonable. Ex:
>     ```
>     echo "h" | netcat -u <WindowsIP> 61557
>     ```
> 9. Using a custom program, send UDP packets with broken checksums
>     (e.g. 0xABCD) from your Linux VM to your Windows instance. Inspect
>     these packets with WireShark and make sure that the packet arrived
>     with the same checksum you sent it with.
> 
> For step 9, I wrote a C program using the Linux socket API that will
> send a properly formatted UDP packet with the payload "Hello!" and a
> (broken) checksum of 0xABCD to port 61557 on the host provided at the
> command line. I can send the full program if you would like, but here
> is the important part of it:
> ```
> struct custom_udp {
> 	int16_t s_port;
> 	int16_t d_port;
> 	int16_t length;
> 	int16_t check;
> 	char data[];
> };
> 
> int send_message(int sockfd, in_port_t port, const char *message) {
> 	struct custom_udp *frame;
> 	int16_t message_len;
> 	int16_t frame_len;
> 	int ret;
> 
> 	message_len = strlen(message) * sizeof(char);
> 	frame_len = sizeof(struct custom_udp) + message_len;
> 	frame = malloc(frame_len);
> 	frame->s_port = htons(0);
> 	frame->d_port = htons(port);
> 	frame->length = htons(frame_len);
> 	frame->check = htons(0xABCD);
> 	memmove(frame->data, message, message_len);
> 
> 	ret = write(sockfd, frame, frame_len);
> 	free(frame);
> 
> 	return ret;
> }

Thanks a lot for the detailed testing procedure. It will definitively 
help us reproduce the bug.

As a conclusion, I would say that we can use this patch as a last resort 
if we are sure that:
1/ the driver doesn't do weird things with TX CSUM in the code
2/ the hw is unable to handle these particular cases.


Best regards,
   Nicolas


> ```
> 
> Jennifer Dahm (1):
>    net/macb: Disable TX checksum offloading on all Zynq-7000
> 
>   drivers/net/ethernet/cadence/macb.h      |  1 +
>   drivers/net/ethernet/cadence/macb_main.c | 11 ++++++++---
>   2 files changed, 9 insertions(+), 3 deletions(-)
> 


-- 
Nicolas Ferre

^ permalink raw reply

* Re: [RFC PATCH 2/2] net: macb: Disable TX checksum offloading on all Zynq
From: Nicolas Ferre @ 2018-06-04 15:06 UTC (permalink / raw)
  To: Jennifer Dahm, netdev, David S . Miller, Michal Simek,
	Harini Katakam
  Cc: Nathan Sullivan
In-Reply-To: <1527284654-24835-3-git-send-email-jennifer.dahm@ni.com>

On 25/05/2018 at 23:44, Jennifer Dahm wrote:
> The Zynq ethernet hardware has checksum offloading bugs that cause
> small UDP packets (<= 2 bytes) to be sent with an incorrect checksum
> (0xffff) and forwarded UDP packets to be re-checksummed, which is
> illegal behavior. The best solution we have right now is to disable
> hardware TX checksum offloading entirely.
> 
> Signed-off-by: Jennifer Dahm <jennifer.dahm@ni.com>

Adding some xilinx people I know...

> ---
>   drivers/net/ethernet/cadence/macb_main.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index a5d564b..e8cc68a 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3807,7 +3807,8 @@ static const struct macb_config zynqmp_config = {
>   };
>   
>   static const struct macb_config zynq_config = {
> -	.caps = MACB_CAPS_GIGABIT_MODE_AVAILABLE | MACB_CAPS_NO_GIGABIT_HALF,
> +	.caps = MACB_CAPS_GIGABIT_MODE_AVAILABLE | MACB_CAPS_NO_GIGABIT_HALF
> +	      | MACB_CAPS_DISABLE_TX_HW_CSUM,
>   	.dma_burst_length = 16,
>   	.clk_init = macb_clk_init,
>   	.init = macb_init,
> 


-- 
Nicolas Ferre

^ permalink raw reply

* Re: [PATCH net-next v2 1/3] bpf: implement bpf_get_current_cgroup_id() helper
From: Yonghong Song @ 2018-06-04 15:11 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: ast, netdev, kernel-team
In-Reply-To: <20180604145801.7z25aqm4piw2fft7@ast-mbp>



On 6/4/18 7:58 AM, Alexei Starovoitov wrote:
> On Mon, Jun 04, 2018 at 11:08:35AM +0200, Daniel Borkmann wrote:
>> On 06/04/2018 12:59 AM, Yonghong Song wrote:
>>> bpf has been used extensively for tracing. For example, bcc
>>> contains an almost full set of bpf-based tools to trace kernel
>>> and user functions/events. Most tracing tools are currently
>>> either filtered based on pid or system-wide.
>>>
>>> Containers have been used quite extensively in industry and
>>> cgroup is often used together to provide resource isolation
>>> and protection. Several processes may run inside the same
>>> container. It is often desirable to get container-level tracing
>>> results as well, e.g. syscall count, function count, I/O
>>> activity, etc.
>>>
>>> This patch implements a new helper, bpf_get_current_cgroup_id(),
>>> which will return cgroup id based on the cgroup within which
>>> the current task is running.
>>>
>>> The later patch will provide an example to show that
>>> userspace can get the same cgroup id so it could
>>> configure a filter or policy in the bpf program based on
>>> task cgroup id.
>>>
>>> The helper is currently implemented for tracing. It can
>>> be added to other program types as well when needed.
>>>
>>> Acked-by: Alexei Starovoitov <ast@kernel.org>
>>> Signed-off-by: Yonghong Song <yhs@fb.com>
>>> ---
>>>   include/linux/bpf.h      |  1 +
>>>   include/uapi/linux/bpf.h |  8 +++++++-
>>>   kernel/bpf/core.c        |  1 +
>>>   kernel/bpf/helpers.c     | 15 +++++++++++++++
>>>   kernel/trace/bpf_trace.c |  2 ++
>>>   5 files changed, 26 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index bbe2974..995c3b1 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -746,6 +746,7 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
>>>   extern const struct bpf_func_proto bpf_get_stack_proto;
>>>   extern const struct bpf_func_proto bpf_sock_map_update_proto;
>>>   extern const struct bpf_func_proto bpf_sock_hash_update_proto;
>>> +extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
>>>   
>>>   /* Shared helpers among cBPF and eBPF. */
>>>   void bpf_user_rnd_init_once(void);
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index f0b6608..18712b0 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2070,6 +2070,11 @@ union bpf_attr {
>>>    * 		**CONFIG_SOCK_CGROUP_DATA** configuration option.
>>>    * 	Return
>>>    * 		The id is returned or 0 in case the id could not be retrieved.
>>> + *
>>> + * u64 bpf_get_current_cgroup_id(void)
>>> + * 	Return
>>> + * 		A 64-bit integer containing the current cgroup id based
>>> + * 		on the cgroup within which the current task is running.
>>>    */
>>>   #define __BPF_FUNC_MAPPER(FN)		\
>>>   	FN(unspec),			\
>>> @@ -2151,7 +2156,8 @@ union bpf_attr {
>>>   	FN(lwt_seg6_action),		\
>>>   	FN(rc_repeat),			\
>>>   	FN(rc_keydown),			\
>>> -	FN(skb_cgroup_id),
>>> +	FN(skb_cgroup_id),		\
>>> +	FN(get_current_cgroup_id),
>>>   
>>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>    * function eBPF program intends to call
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index 527587d..9f14937 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -1765,6 +1765,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>>>   const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>>>   const struct bpf_func_proto bpf_sock_map_update_proto __weak;
>>>   const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
>>> +const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>>>   
>>>   const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>>>   {
>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>> index 3d24e23..73065e2 100644
>>> --- a/kernel/bpf/helpers.c
>>> +++ b/kernel/bpf/helpers.c
>>> @@ -179,3 +179,18 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
>>>   	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
>>>   	.arg2_type	= ARG_CONST_SIZE,
>>>   };
>>> +
>>> +#ifdef CONFIG_CGROUPS
>>> +BPF_CALL_0(bpf_get_current_cgroup_id)
>>> +{
>>> +	struct cgroup *cgrp = task_dfl_cgroup(current);
>>> +
>>> +	return cgrp->kn->id.id;
>>> +}
>>> +
>>> +const struct bpf_func_proto bpf_get_current_cgroup_id_proto = {
>>> +	.func		= bpf_get_current_cgroup_id,
>>> +	.gpl_only	= false,
>>> +	.ret_type	= RET_INTEGER,
>>> +};
>>> +#endif
>>
>> Nit: why not moving this function directly to bpf_trace.c?
> 
> my preference would be to keep it in helpers.c as-is.
> imo bpf_trace.c is only for things that depend on kernel/trace/
> 
>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>> index 752992c..e2ab5b7 100644
>>> --- a/kernel/trace/bpf_trace.c
>>> +++ b/kernel/trace/bpf_trace.c
>>> @@ -564,6 +564,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>>   		return &bpf_get_prandom_u32_proto;
>>>   	case BPF_FUNC_probe_read_str:
>>>   		return &bpf_probe_read_str_proto;
>>> +	case BPF_FUNC_get_current_cgroup_id:
>>> +		return &bpf_get_current_cgroup_id_proto;
>>
>> When you have !CONFIG_CGROUPS, then it relies on the weak definition of the
>> bpf_get_current_cgroup_id_proto, which I would think at latest in fixup_bpf_calls()
>> bails out with 'kernel subsystem misconfigured func' due to func being NULL.
>>
>> Can't we just do the #ifdef CONFIG_CGROUPS around BPF_FUNC_get_current_cgroup_id
>> case instead? Then we bail out normally with 'unknown func' when cgroups are
>> not configured?

Thanks. Will send a followup patch soon.

> 
> good idea.
> 

^ permalink raw reply

* Re: [RFC PATCH 1/2] net: macb: Add CAP to disable hardware TX checksum offloading
From: Nicolas Ferre @ 2018-06-04 15:13 UTC (permalink / raw)
  To: Jennifer Dahm, netdev, David S . Miller; +Cc: Nathan Sullivan
In-Reply-To: <1527284654-24835-2-git-send-email-jennifer.dahm@ni.com>

On 25/05/2018 at 23:44, Jennifer Dahm wrote:
> Certain PHYs have significant bugs in their TX checksum offloading
> that cannot be solved in software. In order to accommodate these PHYS,
> add a CAP to disable this hardware.
> 
> Signed-off-by: Jennifer Dahm <jennifer.dahm@ni.com>
> ---
>   drivers/net/ethernet/cadence/macb.h      | 1 +
>   drivers/net/ethernet/cadence/macb_main.c | 8 ++++++--
>   2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
> index 8665982..6b85e97 100644
> --- a/drivers/net/ethernet/cadence/macb.h
> +++ b/drivers/net/ethernet/cadence/macb.h
> @@ -635,6 +635,7 @@
>   #define MACB_CAPS_USRIO_DISABLED		0x00000010
>   #define MACB_CAPS_JUMBO				0x00000020
>   #define MACB_CAPS_GEM_HAS_PTP			0x00000040
> +#define MACB_CAPS_DISABLE_TX_HW_CSUM		0x00000080

Nitpicking: "DISABLED" tends to be at the end of the name...

>   #define MACB_CAPS_FIFO_MODE			0x10000000
>   #define MACB_CAPS_GIGABIT_MODE_AVAILABLE	0x20000000
>   #define MACB_CAPS_SG_DISABLED			0x40000000
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 3e93df5..a5d564b 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -3360,8 +3360,12 @@ static int macb_init(struct platform_device *pdev)
>   		dev->hw_features |= MACB_NETIF_LSO;
>   
>   	/* Checksum offload is only available on gem with packet buffer */
> -	if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE))
> -		dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
> +	if (macb_is_gem(bp) && !(bp->caps & MACB_CAPS_FIFO_MODE)) {
> +		if (!(bp->caps & MACB_CAPS_DISABLE_TX_HW_CSUM))

Why not the other way around? negating a "disabled" feature is always 
challenge ;-)

> +			dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_RXCSUM;
> +		else
> +			dev->hw_features |= NETIF_F_RXCSUM;
> +	}
>   	if (bp->caps & MACB_CAPS_SG_DISABLED)
>   		dev->hw_features &= ~NETIF_F_SG;
>   	dev->features = dev->hw_features;
> 


-- 
Nicolas Ferre

^ permalink raw reply

* [bug] cxgb4: vrf stopped working with cxgb4 card
From: AMG Zollner Robert @ 2018-06-04 15:03 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, dsa

I have noticed that vrf is not working with kernel v4.15.0 but was 
working with v4.13.0 when using cxgb4 Chelsio driver (T520-cr)

Setup:
Two metal servers with a T520-cr card each, directly connected without a 
switch in between.

        SVR1  only ipfwd                 SVR2     with vrf
.----------------------------. .----------------------------------.
|                            |         |             |
|    192.168.8.1 [  ens2f4]--|---------|--[ens1f4] 192.168.8.2   |
|    192.168.9.1 [ens2f4d1]--|---------|--<ens1f4d1> 192.168.9.2 VRF=10   |
`----------------------------' `----------------------------------'

When vrf is not working there are no error messages (dmesg or iproute 
commands), tcpdump on the interface (SVR2.ens1f4d1) enslaved in vrf 10 
shows packets(arp req/reply) coming in and going out, but outgoing 
packets(arp reply) do not reach the other server SVR1.ens2f4d1


Bisect:
Found this commit to be the problem after doing a git bisect between 
v4.13..v4.15:

commit ba581f77df23c8ee70b372966e69cf10bc5453d8
Author: Ganesh Goudar <ganeshgr@chelsio.com>
Date:   Sat Sep 23 16:07:28 2017 +0530

     cxgb4: do DCB state reset in couple of places

     reset the driver's DCB state in couple of places
     where it was missing.


A bisect step was considered good when:
- successful ping from SVR1 to SVR2.ens1f4d1 vrf interface
- successful ping from SVR2 global to SVR2 vrf interface trough SVR1(l3 
forwarding) (this check was redundant,both tests fail or pass simultaneous)

The problem is still present on recent kernels also, checked v4.16.0 and 
v4.17.rc7

Disabling DCB for the card support fixes the problem ( Compiling kernel 
with "CONFIG_CHELSIO_T4_DCB=n")



This is my first time reporting a bug to the linux kernel and hope I 
have included the right amount of information. Please let me know if I 
have missed something.



Thank you,
Zollner Robert


--------
Logs:

VRF configured using folowing commands:

#!/bin/sh

CHDEV=ens1f4
VRF=vrf-recv

sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.udp_l3mdev_accept=1
sysctl -w net.ipv4.conf.all.accept_local=1

ifconfig ${CHDEV}   192.168.8.2/24
ifconfig ${CHDEV}d1 192.168.9.2/24

ip link add ${VRF} type vrf table 10
ip link set dev ${VRF} up

ip rule add pref 32765 table local
ip rule del pref 0

ip route add table 10 unreachable default metric 4278198272

ip link set dev ${CHDEV}d1 master ${VRF}

ip route add table 10 default via 192.168.9.1
ip route add 192.168.9.0/24 via 192.168.8.1

^ permalink raw reply

* [bpf-next PATCH] bpf: sockmap, fix crash when ipv6 sock is added
From: John Fastabend @ 2018-06-04 15:21 UTC (permalink / raw)
  To: edumazet, ast, daniel; +Cc: netdev

This fixes a crash where we assign tcp_prot to IPv6 sockets instead
of tcpv6_prot.

Previously we overwrote the sk->prot field with tcp_prot even in the
AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
are used. Further, only allow ESTABLISHED connections to join the
map per note in TLS ULP,

   /* The TLS ulp is currently supported only for TCP sockets
    * in ESTABLISHED state.
    * Supporting sockets in LISTEN state will require us
    * to modify the accept implementation to clone rather then
    * share the ulp context.
    */

Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
crashing case here.

Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
Reported-by: syzbot+5c063698bdbfac19f363@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
---
 kernel/bpf/sockmap.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 52a91d8..9e80118 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -41,6 +41,7 @@
 #include <linux/mm.h>
 #include <net/strparser.h>
 #include <net/tcp.h>
+#include <net/transp_v6.h>
 #include <linux/ptr_ring.h>
 #include <net/inet_common.h>
 #include <linux/sched/signal.h>
@@ -162,6 +163,8 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
 }
 
 static struct proto tcp_bpf_proto;
+static struct proto tcpv6_bpf_proto;
+
 static int bpf_tcp_init(struct sock *sk)
 {
 	struct smap_psock *psock;
@@ -182,13 +185,21 @@ static int bpf_tcp_init(struct sock *sk)
 	psock->sk_proto = sk->sk_prot;
 
 	if (psock->bpf_tx_msg) {
+		tcpv6_bpf_proto.sendmsg = bpf_tcp_sendmsg;
+		tcpv6_bpf_proto.sendpage = bpf_tcp_sendpage;
+		tcpv6_bpf_proto.recvmsg = bpf_tcp_recvmsg;
+		tcpv6_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
 		tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
 		tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
 		tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
 		tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
 	}
 
-	sk->sk_prot = &tcp_bpf_proto;
+	if (sk->sk_family == AF_INET6)
+		sk->sk_prot = &tcpv6_bpf_proto;
+	else
+		sk->sk_prot = &tcp_bpf_proto;
+
 	rcu_read_unlock();
 	return 0;
 }
@@ -1113,6 +1124,8 @@ static int bpf_tcp_ulp_register(void)
 {
 	tcp_bpf_proto = tcp_prot;
 	tcp_bpf_proto.close = bpf_tcp_close;
+	tcpv6_bpf_proto = tcpv6_prot;
+	tcpv6_bpf_proto.close = bpf_tcp_close;
 	/* Once BPF TX ULP is registered it is never unregistered. It
 	 * will be in the ULP list for the lifetime of the system. Doing
 	 * duplicate registers is not a problem.
@@ -1716,6 +1729,14 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map,
 	bool new = false;
 	int err = 0;
 
+	/* ULPs are currently supported only for TCP sockets in ESTABLISHED
+	 * state. Supporting sockets in LISTEN state will require us to
+	 * modify the accept implementation to clone rather then share the
+	 * ulp context.
+	 */
+	if (sock->sk_state != TCP_ESTABLISHED)
+		return -ENOTSUPP;
+
 	/* 1. If sock map has BPF programs those will be inherited by the
 	 * sock being added. If the sock is already attached to BPF programs
 	 * this results in an error.

^ permalink raw reply related

* Re: INFO: task hung in ip6gre_exit_batch_net
From: Dmitry Vyukov @ 2018-06-04 15:22 UTC (permalink / raw)
  To: syzbot
  Cc: Christian Brauner, David Miller, David Ahern, Florian Westphal,
	Jiri Benc, Kirill Tkhai, LKML, Xin Long, mschiffer, netdev,
	syzkaller-bugs, Vladislav Yasevich
In-Reply-To: <0000000000006e4595056dd23c16@google.com>

On Mon, Jun 4, 2018 at 5:03 PM, syzbot
<syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+bf78a74f82c1cf19069e@syzkaller.appspotmail.com

Another hang on rtnl lock:

#syz dup: INFO: task hung in netdev_run_todo

May be related to "unregister_netdevice: waiting for DEV to become free":
https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9

Any other explanations for massive hangs on rtnl lock for minutes?


> INFO: task kworker/u4:1:22 blocked for more than 120 seconds.
>       Not tainted 4.17.0-rc6+ #68
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u4:1    D13944    22      2 0x80000000
> Workqueue: netns cleanup_net
> Call Trace:
>  context_switch kernel/sched/core.c:2859 [inline]
>  __schedule+0x801/0x1e30 kernel/sched/core.c:3501
>  schedule+0xef/0x430 kernel/sched/core.c:3545
>  schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3603
>  __mutex_lock_common kernel/locking/mutex.c:833 [inline]
>  __mutex_lock+0xe38/0x17f0 kernel/locking/mutex.c:893
>  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
>  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>  ip6gre_exit_batch_net+0xd5/0x7d0 net/ipv6/ip6_gre.c:1585
>  ops_exit_list.isra.7+0x105/0x160 net/core/net_namespace.c:155
>  cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
>  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
>  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
>  kthread+0x345/0x410 kernel/kthread.c:240
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
>
> Showing all locks held in the system:
> 4 locks held by kworker/u4:1/22:
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> __write_once_size include/linux/compiler.h:215 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: atomic64_set
> include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at: set_work_data
> kernel/workqueue.c:617 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 0000000049a7b590 ((wq_completion)"%s""netns"){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 0000000030a00b6b (net_cleanup_work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb35e65 (pernet_ops_rwsem){++++}, at: cleanup_net+0x11a/0xb20
> net/core/net_namespace.c:490
>  #3: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 3 locks held by kworker/1:1/24:
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> __write_once_size include/linux/compiler.h:215 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> set_work_data kernel/workqueue.c:617 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 000000001c9e6580 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 000000009edcfbe7 ((addr_chk_work).work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 2 locks held by khungtaskd/893:
>  #0: 000000007eeb621a (rcu_read_lock){....}, at:
> check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
>  #0: 000000007eeb621a (rcu_read_lock){....}, at: watchdog+0x1ff/0xf60
> kernel/hung_task.c:249
>  #1: 00000000239f1b5e (tasklist_lock){.+.+}, at:
> debug_show_all_locks+0xde/0x34a kernel/locking/lockdep.c:4470
> 2 locks held by getty/4481:
>  #0: 00000000cc114025 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000006ad1f3fc (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4482:
>  #0: 00000000226a16cc (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000008cee8cdc (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4483:
>  #0: 0000000067bd3c39 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 000000005d8bc81d (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4484:
>  #0: 00000000f0f8d839 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000a9d5f091 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4485:
>  #0: 000000002c96ee9a (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 0000000033338ac7 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4486:
>  #0: 00000000f6db39b5 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000bb7c6099 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4487:
>  #0: 000000006be9659d (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x37/0x40
> drivers/tty/tty_ldsem.c:365
>  #1: 00000000e2edd3d0 (&ldata->atomic_read_lock){+.+.}, at:
> n_tty_read+0x321/0x1cc0 drivers/tty/n_tty.c:2131
> 3 locks held by kworker/1:3/27147:
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: __write_once_size
> include/linux/compiler.h:215 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: arch_atomic64_set
> arch/x86/include/asm/atomic64_64.h:34 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic64_set
> include/asm-generic/atomic-instrumented.h:40 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: atomic_long_set
> include/asm-generic/atomic-long.h:57 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at: set_work_data
> kernel/workqueue.c:617 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:
> set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
>  #0: 00000000c208aa7f ((wq_completion)"events"){+.+.}, at:
> process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
>  #1: 00000000fa87e61f (deferred_process_work){+.+.}, at:
> process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
>  #2: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 1 lock held by syz-executor7/29665:
>  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at: lock_sock
> include/net/sock.h:1469 [inline]
>  #0: 000000006e20d618 (sk_lock-AF_INET6){+.+.}, at:
> tls_sw_sendmsg+0x1b9/0x12e0 net/tls/tls_sw.c:384
> 1 lock held by syz-executor7/29689:
>  #0: 000000007eb32c75 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
>
> =============================================
>
> NMI backtrace for cpu 0
> CPU: 0 PID: 893 Comm: khungtaskd Not tainted 4.17.0-rc6+ #68
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  nmi_cpu_backtrace.cold.4+0x19/0xce lib/nmi_backtrace.c:103
>  nmi_trigger_cpumask_backtrace+0x151/0x192 lib/nmi_backtrace.c:62
>  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
>  trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
>  check_hung_task kernel/hung_task.c:132 [inline]
>  check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
>  watchdog+0xc10/0xf60 kernel/hung_task.c:249
>  kthread+0x345/0x410 kernel/kthread.c:240
>  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
> Sending NMI from CPU 0 to CPUs 1:
> NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x6/0x10
> arch/x86/include/asm/irqflags.h:54
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/0000000000006e4595056dd23c16%40google.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* [RFC PATCH 0/4] Add support for DCB and XPS with MACVLAN offload
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan

This patch set is meant to address a number of shortcomings in the current
implementation of MACVLAN offload. The main issue being the fact that the
current Tx queue implementation for ixgbe required the use of
ndo_select_queue which doesn't necessarily play well with DCB or XPS.

I started this patch set to address that and thought I would submit the
start of the series as an RFC just to make sure I am headed in the right
direction and to see if there is anything I am overlooking. With these
patches applied we now start seeing what I am calling "subordinate
channels" which show themselves via a "-x" suffix where the "-x" can be
anything from "-1" to "-32767". So for example traffic class 0 with a
subordinate class of 5 will be called out as "0-5" if you dump the traffic
class via the sysfs value for the queue.

The main pieces that still need to be handled are the updating of the
ndo_select_queue function and the fallback call to allow support for
passing the accel_priv which I am changing to a netdev. I figure those
patches will be mostly noise so I didn't see the need to include them in
this RFC.

---

Alexander Duyck (4):
      net-sysfs: Drop support for XPS and traffic_class on single queue device
      net: Add support for subordinate device traffic classes
      ixgbe: Add code to populate and use macvlan tc to Tx queue map
      net: Add support for subordinate traffic classes to netdev_pick_tx

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   55 +++++++---
 drivers/net/macvlan.c                         |   10 --
 include/linux/netdevice.h                     |   20 +++
 net/core/dev.c                                |  144 +++++++++++++++++++++----
 net/core/net-sysfs.c                          |   36 ++++++
 5 files changed, 215 insertions(+), 50 deletions(-)

^ permalink raw reply

* [RFC PATCH 1/4] net-sysfs: Drop support for XPS and traffic_class on single queue device
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 net/core/net-sysfs.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..335c6a4 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1047,9 +1047,14 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 				  char *buf)
 {
 	struct net_device *dev = queue->dev;
-	int index = get_netdev_queue_index(queue);
-	int tc = netdev_txq_to_tc(dev, index);
+	int index;
+	int tc;
 
+	if (!netif_is_multiqueue(dev))
+		return -ENOENT;
+
+	index = get_netdev_queue_index(queue);
+	tc = netdev_txq_to_tc(dev, index);
 	if (tc < 0)
 		return -EINVAL;
 
@@ -1214,6 +1219,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 	cpumask_var_t mask;
 	unsigned long index;
 
+	if (!netif_is_multiqueue(dev))
+		return -ENOENT;
+
 	index = get_netdev_queue_index(queue);
 
 	if (dev->num_tc) {
@@ -1260,6 +1268,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 	cpumask_var_t mask;
 	int err;
 
+	if (!netif_is_multiqueue(dev))
+		return -ENOENT;
+
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 

^ permalink raw reply related

* [RFC PATCH 2/4] net: Add support for subordinate device traffic classes
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.

The idea here is to enforce the idea that an upper device has to be a
single queue device, ideally with IFF_NO_QUQUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.

In order to distinguish between a regular set of traffic classes and if a
device is carrying subordinate traffic classes I changed num_tc from a u8
to a s16 value and use the negative values to represent the suboordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 include/linux/netdevice.h |   16 ++++++++
 net/core/dev.c            |   89 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/net-sysfs.c      |   21 ++++++++++-
 3 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492..bbc8045 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -569,6 +569,9 @@ struct netdev_queue {
 	 * (/sys/class/net/DEV/Q/trans_timeout)
 	 */
 	unsigned long		trans_timeout;
+
+	/* Suboordinate device that the queue has been assigned to */
+	struct net_device	*sb_dev;
 /*
  * write-mostly part
  */
@@ -1960,7 +1963,7 @@ struct net_device {
 #ifdef CONFIG_DCB
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
-	u8			num_tc;
+	s16			num_tc;
 	struct netdev_tc_txq	tc_to_txq[TC_MAX_QUEUE];
 	u8			prio_tc_map[TC_BITMASK + 1];
 
@@ -2014,6 +2017,17 @@ int netdev_get_num_tc(struct net_device *dev)
 	return dev->num_tc;
 }
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+			      struct net_device *sb_dev);
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+				 struct net_device *sb_dev,
+				 u8 tc, u16 count, u16 offset);
+int netdev_set_sb_channel(struct net_device *dev, u16 channel);
+static inline int netdev_get_sb_channel(struct net_device *dev)
+{
+	return max_t(int, -dev->num_tc, 0);
+}
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
diff --git a/net/core/dev.c b/net/core/dev.c
index 034c2a9..eba255d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2068,11 +2068,13 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 		struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
 		int i;
 
+		/* walk through the TCs and see if it falls into any of them */
 		for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
 			if ((txq - tc->offset) < tc->count)
 				return i;
 		}
 
+		/* didn't find it, just return -1 to indicate no match */
 		return -1;
 	}
 
@@ -2215,7 +2217,14 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	bool active = false;
 
 	if (dev->num_tc) {
+		/* Do not allow XPS on subordinate device directly */
 		num_tc = dev->num_tc;
+		if (num_tc < 0)
+			return -EINVAL;
+
+		/* If queue belongs to subordinate dev use its map */
+		dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
 		tc = netdev_txq_to_tc(dev, index);
 		if (tc < 0)
 			return -EINVAL;
@@ -2366,11 +2375,25 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
+static void netdev_unbind_all_sb_channels(struct net_device *dev)
+{
+	struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+	/* Unbind any subordinate channels */
+	while (txq-- != &dev->_tx[0]) {
+		if (txq->sb_dev)
+			netdev_unbind_sb_channel(dev, txq->sb_dev);
+	}
+}
+
 void netdev_reset_tc(struct net_device *dev)
 {
 #ifdef CONFIG_XPS
 	netif_reset_xps_queues_gt(dev, 0);
 #endif
+	netdev_unbind_all_sb_channels(dev);
+
+	/* Reset TC configuration of device */
 	dev->num_tc = 0;
 	memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
 	memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2399,11 +2422,77 @@ int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
 #ifdef CONFIG_XPS
 	netif_reset_xps_queues_gt(dev, 0);
 #endif
+	netdev_unbind_all_sb_channels(dev);
+
 	dev->num_tc = num_tc;
 	return 0;
 }
 EXPORT_SYMBOL(netdev_set_num_tc);
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+			      struct net_device *sb_dev)
+{
+	struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
+
+#ifdef CONFIG_XPS
+	netif_reset_xps_queues_gt(sb_dev, 0);
+#endif
+	memset(sb_dev->tc_to_txq, 0, sizeof(sb_dev->tc_to_txq));
+	memset(sb_dev->prio_tc_map, 0, sizeof(sb_dev->prio_tc_map));
+
+	while (txq-- != &dev->_tx[0]) {
+		if (txq->sb_dev == sb_dev)
+			txq->sb_dev = NULL;
+	}
+}
+EXPORT_SYMBOL(netdev_unbind_sb_channel);
+
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+				 struct net_device *sb_dev,
+				 u8 tc, u16 count, u16 offset)
+{
+	/* Make certain the sb_dev and dev are already configured */
+	if (sb_dev->num_tc >= 0 || tc >= dev->num_tc)
+		return -EINVAL;
+
+	/* We cannot hand out queues we don't have */
+	if ((offset + count) > dev->real_num_tx_queues)
+		return -EINVAL;
+
+	/* Record the mapping */
+	sb_dev->tc_to_txq[tc].count = count;
+	sb_dev->tc_to_txq[tc].offset = offset;
+
+	/* Provide a way for Tx queue to find the tc_to_txq map or
+	 * XPS map for itself.
+	 */
+	while (count--)
+		netdev_get_tx_queue(dev, count + offset)->sb_dev = sb_dev;
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_bind_sb_channel_queue);
+
+int netdev_set_sb_channel(struct net_device *dev, u16 channel)
+{
+	/* Do not use a multiqueue device to represent a subordinate channel */
+	if (netif_is_multiqueue(dev))
+		return -ENODEV;
+
+	/* We allow channels 1 - 32767 to be used for subordinate channels.
+	 * Channel 0 is meant to be "native" mode and used only to represent
+	 * the main root device. We allow writing 0 to reset the device back
+	 * to normal mode after being used as a subordinate channel.
+	 */
+	if (channel > S16_MAX)
+		return -EINVAL;
+
+	dev->num_tc = -channel;
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_set_sb_channel);
+
 /*
  * Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
  * greater than real_num_tx_queues stale skbs on the qdisc must be flushed.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 335c6a4..bd067b1 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1054,11 +1054,23 @@ static ssize_t traffic_class_show(struct netdev_queue *queue,
 		return -ENOENT;
 
 	index = get_netdev_queue_index(queue);
+
+	/* If queue belongs to subordinate dev use its tc mapping */
+	dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
 	tc = netdev_txq_to_tc(dev, index);
 	if (tc < 0)
 		return -EINVAL;
 
-	return sprintf(buf, "%u\n", tc);
+	/* We can report the traffic class one of two ways:
+	 * Subordinate device traffic classes are reported with the traffic
+	 * class first, and then the subordinate class so for example TC0 on
+	 * subordinate device 2 will be reported as "0-2". If the queue
+	 * belongs to the root device it will be reported with just the
+	 * traffic class, so just "0" for TC 0 for example.
+	 */
+	return dev->num_tc < 0 ? sprintf(buf, "%u%d\n", tc, dev->num_tc) :
+				 sprintf(buf, "%u\n", tc);
 }
 
 #ifdef CONFIG_XPS
@@ -1225,7 +1237,14 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 	index = get_netdev_queue_index(queue);
 
 	if (dev->num_tc) {
+		/* Do not allow XPS on subordinate device directly */
 		num_tc = dev->num_tc;
+		if (num_tc < 0)
+			return -EINVAL;
+
+		/* If queue belongs to subordinate dev use its map */
+		dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
 		tc = netdev_txq_to_tc(dev, index);
 		if (tc < 0)
 			return -EINVAL;

^ permalink raw reply related

* [RFC PATCH 3/4] ixgbe: Add code to populate and use macvlan tc to Tx queue map
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This patch makes it so that we use the tc_to_txq mapping in the macvlan
device in order to select the Tx queue for outgoing packets.

The idea here is to try and move away from using ixgbe_select_queue and to
come up with a generic way to make this work for devices going forward. By
encoding this information in the netdev this can become something that can
be used generically as a solution for similar setups going forward.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   44 ++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f460c16..05d6d48 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5271,6 +5271,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
 static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 			     struct ixgbe_fwd_adapter *accel)
 {
+	u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
+	int num_tc = netdev_get_num_tc(adapter->netdev);
 	struct net_device *vdev = accel->netdev;
 	int i, baseq, err;
 
@@ -5282,6 +5284,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 	accel->rx_base_queue = baseq;
 	accel->tx_base_queue = baseq;
 
+	/* record configuration for macvlan interface in vdev */
+	for (i = 0; i < num_tc; i++)
+		netdev_bind_sb_channel_queue(adapter->netdev, vdev,
+					     i, rss_i, baseq + (rss_i * i));
+
 	for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
 		adapter->rx_ring[baseq + i]->netdev = vdev;
 
@@ -5306,6 +5313,10 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 
 	netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
 
+	/* unbind the queues and drop the subordinate channel config */
+	netdev_unbind_sb_channel(adapter->netdev, vdev);
+	netdev_set_sb_channel(vdev, 0);
+
 	clear_bit(accel->pool, adapter->fwd_bitmask);
 	kfree(accel);
 
@@ -8211,18 +8222,22 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 			      void *accel_priv, select_queue_fallback_t fallback)
 {
 	struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-	struct ixgbe_adapter *adapter;
-	int txq;
 #ifdef IXGBE_FCOE
+	struct ixgbe_adapter *adapter;
 	struct ixgbe_ring_feature *f;
 #endif
+	int txq;
 
 	if (fwd_adapter) {
-		adapter = netdev_priv(dev);
-		txq = reciprocal_scale(skb_get_hash(skb),
-				       adapter->num_rx_queues_per_pool);
+		u8 tc = netdev_get_num_tc(dev) ?
+			netdev_get_prio_tc_map(dev, skb->priority) : 0;
+		struct net_device *vdev = fwd_adapter->netdev;
+
+		txq = vdev->tc_to_txq[tc].offset;
+		txq += reciprocal_scale(skb_get_hash(skb),
+					vdev->tc_to_txq[tc].count);
 
-		return txq + fwd_adapter->tx_base_queue;
+		return txq;
 	}
 
 #ifdef IXGBE_FCOE
@@ -8776,6 +8791,11 @@ static int ixgbe_reassign_macvlan_pool(struct net_device *vdev, void *data)
 	/* if we cannot find a free pool then disable the offload */
 	netdev_err(vdev, "L2FW offload disabled due to lack of queue resources\n");
 	macvlan_release_l2fw_offload(vdev);
+
+	/* unbind the queues and drop the subordinate channel config */
+	netdev_unbind_sb_channel(adapter->netdev, vdev);
+	netdev_set_sb_channel(vdev, 0);
+
 	kfree(accel);
 
 	return 0;
@@ -9780,6 +9800,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 	if (!macvlan_supports_dest_filter(vdev))
 		return ERR_PTR(-EMEDIUMTYPE);
 
+	/* We need to lock down the macvlan to be a single queue device so that
+	 * we can reuse the tc_to_txq field in the macvlan netdev to represent
+	 * the queue mapping to our netdev.
+	 */
+	if (netif_is_multiqueue(vdev))
+		return ERR_PTR(-ERANGE);
+
 	pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
 	if (pool == adapter->num_rx_pools) {
 		u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
@@ -9836,6 +9863,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 		return ERR_PTR(-ENOMEM);
 
 	set_bit(pool, adapter->fwd_bitmask);
+	netdev_set_sb_channel(vdev, pool);
 	accel->pool = pool;
 	accel->netdev = vdev;
 
@@ -9877,6 +9905,10 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 		ring->netdev = NULL;
 	}
 
+	/* unbind the queues and drop the subordinate channel config */
+	netdev_unbind_sb_channel(pdev, accel->netdev);
+	netdev_set_sb_channel(accel->netdev, 0);
+
 	clear_bit(accel->pool, adapter->fwd_bitmask);
 	kfree(accel);
 }

^ permalink raw reply related

* [RFC PATCH 4/4] net: Add support for subordinate traffic classes to netdev_pick_tx
From: Alexander Duyck @ 2018-06-04 15:17 UTC (permalink / raw)
  To: netdev, intel-wired-lan
In-Reply-To: <20180604150847.74754.8585.stgit@ahduyck-green-test.jf.intel.com>

This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.

The solution at is currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map as just a
way to skip the lookup of the lower device XPS map for now as that would
result in the wrong queue being picked.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   15 ++-----
 drivers/net/macvlan.c                         |   10 +----
 include/linux/netdevice.h                     |    4 +-
 net/core/dev.c                                |   55 +++++++++++++++----------
 4 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 05d6d48..c42498d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8218,20 +8218,18 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
 					      input, common, ring->queue_index);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 			      void *accel_priv, select_queue_fallback_t fallback)
 {
-	struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
 	struct ixgbe_adapter *adapter;
 	struct ixgbe_ring_feature *f;
-#endif
 	int txq;
 
-	if (fwd_adapter) {
+	if (accel_priv) {
 		u8 tc = netdev_get_num_tc(dev) ?
 			netdev_get_prio_tc_map(dev, skb->priority) : 0;
-		struct net_device *vdev = fwd_adapter->netdev;
+		struct net_device *vdev = accel_priv;
 
 		txq = vdev->tc_to_txq[tc].offset;
 		txq += reciprocal_scale(skb_get_hash(skb),
@@ -8240,7 +8238,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 		return txq;
 	}
 
-#ifdef IXGBE_FCOE
 
 	/*
 	 * only execute the code below if protocol is FCoE
@@ -8267,11 +8264,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 		txq -= f->indices;
 
 	return txq + f->offset;
-#else
-	return fallback(dev, skb);
-#endif
 }
 
+#endif
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 			       struct xdp_frame *xdpf)
 {
@@ -10071,7 +10066,6 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
 	.ndo_start_xmit		= ixgbe_xmit_frame,
-	.ndo_select_queue	= ixgbe_select_queue,
 	.ndo_set_rx_mode	= ixgbe_set_rx_mode,
 	.ndo_validate_addr	= eth_validate_addr,
 	.ndo_set_mac_address	= ixgbe_set_mac,
@@ -10094,6 +10088,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	.ndo_poll_controller	= ixgbe_netpoll,
 #endif
 #ifdef IXGBE_FCOE
+	.ndo_select_queue	= ixgbe_select_queue,
 	.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
 	.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
 	.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	const struct macvlan_dev *vlan = netdev_priv(dev);
 	const struct macvlan_port *port = vlan->port;
 	const struct macvlan_dev *dest;
-	void *accel_priv = NULL;
 
 	if (vlan->mode == MACVLAN_MODE_BRIDGE) {
 		const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 			return NET_XMIT_SUCCESS;
 		}
 	}
-
-	/* For packets that are non-multicast and not bridged we will pass
-	 * the necessary information so that the lowerdev can distinguish
-	 * the source of the packets via the accel_priv value.
-	 */
-	accel_priv = vlan->accel_priv;
 xmit_world:
 	skb->dev = vlan->lowerdev;
-	return dev_queue_xmit_accel(skb, accel_priv);
+	return dev_queue_xmit_accel(skb,
+				    netdev_get_sb_channel(dev) ? dev : NULL);
 }
 
 static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev *vlan, struct sk_buff *skb)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bbc8045..df5a582 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2072,7 +2072,7 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
 
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 				    struct sk_buff *skb,
-				    void *accel_priv);
+				    struct net_device *sb_dev);
 
 /* returns the headroom that the master device needs to take in account
  * when forwarding to this dev
@@ -2523,7 +2523,7 @@ struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
 int dev_queue_xmit(struct sk_buff *skb);
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
 int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index eba255d..507e606 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2704,24 +2704,26 @@ void netif_device_attach(struct net_device *dev)
  * Returns a Tx hash based on the given packet descriptor a Tx queues' number
  * to be used as a distribution range.
  */
-static u16 skb_tx_hash(const struct net_device *dev, struct sk_buff *skb)
+static u16 skb_tx_hash(const struct net_device *dev,
+		       const struct net_device *sb_dev,
+		       struct sk_buff *skb)
 {
 	u32 hash;
 	u16 qoffset = 0;
 	u16 qcount = dev->real_num_tx_queues;
 
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+
+		qoffset = sb_dev->tc_to_txq[tc].offset;
+		qcount = sb_dev->tc_to_txq[tc].count;
+	}
+
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
 		while (unlikely(hash >= qcount))
 			hash -= qcount;
-		return hash;
-	}
-
-	if (dev->num_tc) {
-		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
-
-		qoffset = dev->tc_to_txq[tc].offset;
-		qcount = dev->tc_to_txq[tc].count;
+		return hash + qoffset;
 	}
 
 	return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
@@ -3483,7 +3485,9 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+static inline int get_xps_queue(struct net_device *dev,
+				struct net_device *sb_dev,
+				struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps *dev_maps;
@@ -3491,7 +3495,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	int queue_index = -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(sb_dev->xps_maps);
 	if (dev_maps) {
 		unsigned int tci = skb->sender_cpu - 1;
 
@@ -3519,17 +3523,18 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 #endif
 }
 
-static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
+static u16 ___netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
+			     struct net_device *sb_dev)
 {
 	struct sock *sk = skb->sk;
 	int queue_index = sk_tx_queue_get(sk);
 
 	if (queue_index < 0 || skb->ooo_okay ||
 	    queue_index >= dev->real_num_tx_queues) {
-		int new_index = get_xps_queue(dev, skb);
+		int new_index = get_xps_queue(dev, sb_dev, skb);
 
 		if (new_index < 0)
-			new_index = skb_tx_hash(dev, skb);
+			new_index = skb_tx_hash(dev, sb_dev, skb);
 
 		if (queue_index != new_index && sk &&
 		    sk_fullsock(sk) &&
@@ -3542,9 +3547,14 @@ static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
 	return queue_index;
 }
 
+static u16 __netdev_pick_tx(struct net_device *dev,
+			    struct sk_buff *skb)
+{
+	return ___netdev_pick_tx(dev, skb, NULL);
+}
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 				    struct sk_buff *skb,
-				    void *accel_priv)
+				    struct net_device *sb_dev)
 {
 	int queue_index = 0;
 
@@ -3559,10 +3569,11 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 		const struct net_device_ops *ops = dev->netdev_ops;
 
 		if (ops->ndo_select_queue)
-			queue_index = ops->ndo_select_queue(dev, skb, accel_priv,
+			queue_index = ops->ndo_select_queue(dev, skb, sb_dev,
 							    __netdev_pick_tx);
 		else
-			queue_index = __netdev_pick_tx(dev, skb);
+			queue_index = ___netdev_pick_tx(dev, skb,
+							sb_dev ? : dev);
 
 		queue_index = netdev_cap_txqueue(dev, queue_index);
 	}
@@ -3574,7 +3585,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 /**
  *	__dev_queue_xmit - transmit a buffer
  *	@skb: buffer to transmit
- *	@accel_priv: private data used for L2 forwarding offload
+ *	@sb_dev: suboordinate device used for L2 forwarding offload
  *
  *	Queue a buffer for transmission to a network device. The caller must
  *	have set the device and priority and built the buffer before calling
@@ -3597,7 +3608,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
  *      the BH enable code must have IRQs enabled so that it will not deadlock.
  *          --BLG
  */
-static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
+static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
 	struct net_device *dev = skb->dev;
 	struct netdev_queue *txq;
@@ -3636,7 +3647,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
 	else
 		skb_dst_force(skb);
 
-	txq = netdev_pick_tx(dev, skb, accel_priv);
+	txq = netdev_pick_tx(dev, skb, sb_dev);
 	q = rcu_dereference_bh(txq->qdisc);
 
 	trace_net_dev_queue(skb);
@@ -3710,9 +3721,9 @@ int dev_queue_xmit(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(dev_queue_xmit);
 
-int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
+int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev)
 {
-	return __dev_queue_xmit(skb, accel_priv);
+	return __dev_queue_xmit(skb, sb_dev);
 }
 EXPORT_SYMBOL(dev_queue_xmit_accel);
 

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox