Netdev List

Netdev List
 help / color / mirror / Atom feed

* RE: [PATCH v3 0/5] Introduce variable length mdev alias
From: Parav Pandit @ 2019-09-09 20:42 UTC (permalink / raw)
  To: Parav Pandit, alex.williamson@redhat.com, Jiri Pirko,
	kwankhede@nvidia.com, cohuck@redhat.com, davem@davemloft.net
  Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20190902042436.23294-1-parav@mellanox.com>

Hi Alex,

> -----Original Message-----
> From: Parav Pandit <parav@mellanox.com>
> Sent: Sunday, September 1, 2019 11:25 PM
> To: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> kwankhede@nvidia.com; cohuck@redhat.com; davem@davemloft.net
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> netdev@vger.kernel.org; Parav Pandit <parav@mellanox.com>
> Subject: [PATCH v3 0/5] Introduce variable length mdev alias
> 
> To have consistent naming for the netdevice of a mdev and to have consistent
> naming of the devlink port [1] of a mdev, which is formed using
> phys_port_name of the devlink port, current UUID is not usable because UUID
> is too long.
> 
> UUID in string format is 36-characters long and in binary 128-bit.
> Both formats are not able to fit within 15 characters limit of netdev name.
> 
> It is desired to have mdev device naming consistent using UUID.
> So that widely used user space framework such as ovs [2] can make use of
> mdev representor in similar way as PCIe SR-IOV VF and PF representors.
> 
> Hence,
> (a) mdev alias is created which is derived using sha1 from the mdev name.
> (b) Vendor driver describes how long an alias should be for the child mdev
> created for a given parent.
> (c) Mdev aliases are unique at system level.
> (d) alias is created optionally whenever parent requested.
> This ensures that non networking mdev parents can function without alias
> creation overhead.
> 
> This design is discussed at [3].
> 
> An example systemd/udev extension will have,
> 
> 1. netdev name created using mdev alias available in sysfs.
> 
> mdev UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
> mdev 12 character alias=cd5b146a80a5
> 
> netdev name of this mdev = enmcd5b146a80a5 Here en = Ethernet link m =
> mediated device
> 
> 2. devlink port phys_port_name created using mdev alias.
> devlink phys_port_name=pcd5b146a80a5
> 
> This patchset enables mdev core to maintain unique alias for a mdev.
> 
> Patch-1 Introduces mdev alias using sha1.
> Patch-2 Ensures that mdev alias is unique in a system.
> Patch-3 Exposes mdev alias in a sysfs hirerchy, update Documentation
> Patch-4 Introduces mdev_alias() API.
> Patch-5 Extends mtty driver to optionally provide alias generation.
> This also enables to test UUID based sha1 collision and trigger error handling
> for duplicate sha1 results.
> 
> [1] http://man7.org/linux/man-pages/man8/devlink-port.8.html
> [2] https://docs.openstack.org/os-vif/latest/user/plugins/ovs.html
> [3] https://patchwork.kernel.org/cover/11084231/
> 
> ---
> Changelog:
> v2->v3:
>  - Addressed comment from Yunsheng Lin
>  - Changed strcmp() ==0 to !strcmp()
>  - Addressed comment from Cornelia Hunk
>  - Merged sysfs Documentation patch with syfs patch
>  - Added more description for alias return value

Did you get a chance review this updated series?
I addressed Cornelia's and yours comment.
I do not think allocating alias memory twice, once for comparison and once for storing is good idea or moving alias generation logic inside the mdev_list_lock(). So I didn't address that suggestion of Cornelia.
 
> v1->v2:
>  - Corrected a typo from 'and' to 'an'
>  - Addressed comments from Alex Williamson
>  - Kept mdev_device naturally aligned
>  - Added error checking for crypt_*() calls
>  - Moved alias NULL check at beginning
>  - Added mdev_alias() API
>  - Updated mtty driver to show example mdev_alias() usage
>  - Changed return type of generate_alias() from int to char*
> v0->v1:
>  - Addressed comments from Alex Williamson, Cornelia Hunk and Mark Bloch
>  - Moved alias length check outside of the parent lock
>  - Moved alias and digest allocation from kvzalloc to kzalloc
>  - &alias[0] changed to alias
>  - alias_length check is nested under get_alias_length callback check
>  - Changed comments to start with an empty line
>  - Added comment where alias memory ownership is handed over to mdev
> device
>  - Fixed cleaunup of hash if mdev_bus_register() fails
>  - Updated documentation for new sysfs alias file
>  - Improved commit logs to make description more clear
>  - Fixed inclusiong of alias for NULL check
>  - Added ratelimited debug print for sha1 hash collision error
> 
> Parav Pandit (5):
>   mdev: Introduce sha1 based mdev alias
>   mdev: Make mdev alias unique among all mdevs
>   mdev: Expose mdev alias in sysfs tree
>   mdev: Introduce an API mdev_alias
>   mtty: Optionally support mtty alias
> 
>  .../driver-api/vfio-mediated-device.rst       |   9 ++
>  drivers/vfio/mdev/mdev_core.c                 | 142 +++++++++++++++++-
>  drivers/vfio/mdev/mdev_private.h              |   5 +-
>  drivers/vfio/mdev/mdev_sysfs.c                |  26 +++-
>  include/linux/mdev.h                          |   5 +
>  samples/vfio-mdev/mtty.c                      |  13 ++
>  6 files changed, 190 insertions(+), 10 deletions(-)
> 
> --
> 2.19.2


^ permalink raw reply

* [PATCH net-next] ipv6: Don't use dst gateway directly in ip6_confirm_neigh()
From: Stefano Brivio @ 2019-09-09 20:44 UTC (permalink / raw)
  To: David Miller
  Cc: Guillaume Nault, Julian Anastasov, Nicolas Dichtel, David Ahern,
	netdev

This is the equivalent of commit 2c6b55f45d53 ("ipv6: fix neighbour
resolution with raw socket") for ip6_confirm_neigh(): we can send a
packet with MSG_CONFIRM on a raw socket for a connected route, so the
gateway would be :: here, and we should pick the next hop using
rt6_nexthop() instead.

This was found by code review and, to the best of my knowledge, doesn't
actually fix a practical issue: the destination address from the packet
is not considered while confirming a neighbour, as ip6_confirm_neigh()
calls choose_neigh_daddr() without passing the packet, so there are no
similar issues as the one fixed by said commit.

A possible source of issues with the existing implementation might come
from the fact that, if we have a cached dst, we won't consider it,
while rt6_nexthop() takes care of that. I might just not be creative
enough to find a practical problem here: the only way to affect this
with cached routes is to have one coming from an ICMPv6 redirect, but
if the next hop is a directly connected host, there should be no
topology for which a redirect applies here, and tests with redirected
routes show no differences for MSG_CONFIRM (and MSG_PROBE) packets on
raw sockets destined to a directly connected host.

However, directly using the dst gateway here is not consistent anymore
with neighbour resolution, and, in general, as we want the next hop,
using rt6_nexthop() looks like the only sane way to fetch it.

Reported-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 net/ipv6/route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7a5d331cdefa..874641d4d2a1 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -227,7 +227,7 @@ static void ip6_confirm_neigh(const struct dst_entry *dst, const void *daddr)
 	struct net_device *dev = dst->dev;
 	struct rt6_info *rt = (struct rt6_info *)dst;

-	daddr = choose_neigh_daddr(&rt->rt6i_gateway, NULL, daddr);
+	daddr = choose_neigh_daddr(rt6_nexthop(rt, &in6addr_any), NULL, daddr);
 	if (!daddr)
 		return;
 	if (dev->flags & (IFF_NOARP | IFF_LOOPBACK))
-- 
2.20.1

^ permalink raw reply related

* [PATCH] net/ibmvnic: Fix missing { in __ibmvnic_reset
From: Michal Suchanek @ 2019-09-09 20:44 UTC (permalink / raw)
  To: netdev, David S. Miller, Juliet Kim
  Cc: Michal Suchanek, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Thomas Falcon, John Allen, linuxppc-dev,
	linux-kernel
In-Reply-To: <20190907.173714.1400426487600521308.davem@davemloft.net>

Commit 1c2977c09499 ("net/ibmvnic: free reset work of removed device from queue")
adds a } without corresponding { causing build break.

Fixes: 1c2977c09499 ("net/ibmvnic: free reset work of removed device from queue")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 6644cabc8e75..5cb55ea671e3 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1984,7 +1984,7 @@ static void __ibmvnic_reset(struct work_struct *work)
 	rwi = get_next_rwi(adapter);
 	while (rwi) {
 		if (adapter->state == VNIC_REMOVING ||
-		    adapter->state == VNIC_REMOVED)
+		    adapter->state == VNIC_REMOVED) {
 			kfree(rwi);
 			rc = EBUSY;
 			break;
-- 
2.22.0


^ permalink raw reply related

* [PATCH net] tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR
From: Neal Cardwell @ 2019-09-09 20:56 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Soheil Hassas Yeganeh,
	Eric Dumazet

Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
TCP_ECN_QUEUE_CWR.

Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
the behavior of data receivers, and deciding whether to reflect
incoming IP ECN CE marks as outgoing TCP th->ece marks. The
TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
is only called from tcp_undo_cwnd_reduction() by data senders during
an undo, so it should zero the sender-side state,
TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
incoming CE bits on incoming data packets just because outgoing
packets were spuriously retransmitted.

The bug has been reproduced with packetdrill to manifest in a scenario
with RFC3168 ECN, with an incoming data packet with CE bit set and
carrying a TCP timestamp value that causes cwnd undo. Before this fix,
the IP CE bit was ignored and not reflected in the TCP ECE header bit,
and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
even though the cwnd reduction had been undone.  After this fix, the
sender properly reflects the CE bit and does not set the W bit.

Note: the bug actually predates 2005 git history; this Fixes footer is
chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
the patch applies cleanly (since before this commit the code was in a
.h file).

Fixes: bdf1ee5d3bd3 ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c21e8a22fb3b..8a1cd93dbb09 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -266,7 +266,7 @@ static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb)

 static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
 {
-	tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+	tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR;
 }

 static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
-- 
2.23.0.162.g0b9fbb3734-goog

^ permalink raw reply related

* RE: [Intel-wired-lan] [PATCH] i40e: clear __I40E_VIRTCHNL_OP_PENDING on invalid min tx rate
From: Bowers, AndrewX @ 2019-09-09 21:00 UTC (permalink / raw)
  To: intel-wired-lan@lists.osuosl.org; +Cc: netdev@vger.kernel.org
In-Reply-To: <20190903060810.30775-1-sassmann@kpanic.de>

> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Stefan Assmann
> Sent: Monday, September 2, 2019 11:08 PM
> To: intel-wired-lan@lists.osuosl.org
> Cc: netdev@vger.kernel.org; davem@davemloft.net; sassmann@kpanic.de
> Subject: [Intel-wired-lan] [PATCH] i40e: clear
> __I40E_VIRTCHNL_OP_PENDING on invalid min tx rate
> 
> In the case of an invalid min tx rate being requested
> i40e_ndo_set_vf_bw() immediately returns -EINVAL instead of releasing
> __I40E_VIRTCHNL_OP_PENDING first.
> 
> Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>



^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH] net/ixgbevf: make array api static const, makes object smaller
From: Bowers, AndrewX @ 2019-09-09 21:03 UTC (permalink / raw)
  To: intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org
  Cc: kernel-janitors@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20190906113356.9985-1-colin.king@canonical.com>

> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Colin King
> Sent: Friday, September 6, 2019 4:34 AM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; David S . Miller
> <davem@davemloft.net>; intel-wired-lan@lists.osuosl.org;
> netdev@vger.kernel.org
> Cc: kernel-janitors@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH] net/ixgbevf: make array api static const,
> makes object smaller
> 
> From: Colin Ian King <colin.king@canonical.com>
> 
> Don't populate the array api on the stack but instead make it static const.
> Makes the object code smaller by 58 bytes.
> 
> Before:
>    text	   data	    bss	    dec	    hex	filename
>   82969	   9763	    256	  92988	  16b3c	ixgbevf/ixgbevf_main.o
> 
> After:
>    text	   data	    bss	    dec	    hex	filename
>   82815	   9859	    256	  92930	  16b02	ixgbevf/ixgbevf_main.o
> 
> (gcc version 9.2.1, amd64)
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>



^ permalink raw reply

* Re: [PATCH net-next 1/2] mlx5: steering: use correct enum type
From: Nick Desaulniers @ 2019-09-09 21:14 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Arnd Bergmann, Saeed Mahameed, Leon Romanovsky, David S. Miller,
	Alex Vesker, Erez Shitrit, Mark Bloch, Network Development,
	linux-rdma, LKML, clang-built-linux
In-Reply-To: <20190909195513.GA94662@archlinux-threadripper>

On Mon, Sep 9, 2019 at 12:55 PM Nathan Chancellor
<natechancellor@gmail.com> wrote:
>
> On Mon, Sep 09, 2019 at 09:50:08PM +0200, Arnd Bergmann wrote:
> > The newly added code triggers a harmless warning with
> > clang:
> >
> > drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c:1080:9: error: implicit conversion from enumeration type 'enum mlx5_reformat_ctx_type' to different enumeration type 'enum mlx5dr_action_type' [-Werror,-Wenum-conversion]
> >                         rt = MLX5_REFORMAT_TYPE_L2_TO_L2_TUNNEL;
> >                            ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > drivers/net/ethernet/mellanox/mlx5/core/steering/dr_action.c:1084:51: error: implicit conversion from enumeration type 'enum mlx5dr_action_type' to different enumeration type 'enum mlx5_reformat_ctx_type' [-Werror,-Wenum-conversion]
> >                 ret = mlx5dr_cmd_create_reformat_ctx(dmn->mdev, rt, data_sz, data,
> >                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~            ^~
> >
> > Change it to use mlx5_reformat_ctx_type instead of mlx5dr_action_type.
> >
> > Fixes: 9db810ed2d37 ("net/mlx5: DR, Expose steering action functionality")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>
> I sent the same fix a couple of days ago:
>
> https://lore.kernel.org/netdev/20190905014733.17564-1-natechancellor@gmail.com/
>
> I don't care which patch goes in since they are the same thing so:
>
> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>

GCC recently gained support (via me scanning the commit logs for an
unrelated feature) for -Wenum-warnings (though I think it's off by
default) so hopefully these kinds of issues will taper off over time.
-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply

* Re: [PATCH] net/ibmvnic: Fix missing { in __ibmvnic_reset
From: Tyrel Datwyler @ 2019-09-09 21:21 UTC (permalink / raw)
  To: Michal Suchanek, netdev, David S. Miller, Juliet Kim
  Cc: linux-kernel, Thomas Falcon, Paul Mackerras, John Allen,
	linuxppc-dev
In-Reply-To: <20190909204451.7929-1-msuchanek@suse.de>

On 9/9/19 1:44 PM, Michal Suchanek wrote:
> Commit 1c2977c09499 ("net/ibmvnic: free reset work of removed device from queue")
> adds a } without corresponding { causing build break.
> 
> Fixes: 1c2977c09499 ("net/ibmvnic: free reset work of removed device from queue")
> Signed-off-by: Michal Suchanek <msuchanek@suse.de>

Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>

> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> index 6644cabc8e75..5cb55ea671e3 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -1984,7 +1984,7 @@ static void __ibmvnic_reset(struct work_struct *work)
>  	rwi = get_next_rwi(adapter);
>  	while (rwi) {
>  		if (adapter->state == VNIC_REMOVING ||
> -		    adapter->state == VNIC_REMOVED)
> +		    adapter->state == VNIC_REMOVED) {
>  			kfree(rwi);
>  			rc = EBUSY;
>  			break;
> 


^ permalink raw reply

* Re: [PATCH net] tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR
From: Eric Dumazet @ 2019-09-09 21:26 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: David Miller, netdev, Yuchung Cheng, Soheil Hassas Yeganeh
In-Reply-To: <20190909205602.248472-1-ncardwell@google.com>

On Mon, Sep 9, 2019 at 10:56 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
> TCP_ECN_QUEUE_CWR.
>
> Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
> the behavior of data receivers, and deciding whether to reflect
> incoming IP ECN CE marks as outgoing TCP th->ece marks. The
> TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
> and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
> is only called from tcp_undo_cwnd_reduction() by data senders during
> an undo, so it should zero the sender-side state,
> TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
> incoming CE bits on incoming data packets just because outgoing
> packets were spuriously retransmitted.
>
> The bug has been reproduced with packetdrill to manifest in a scenario
> with RFC3168 ECN, with an incoming data packet with CE bit set and
> carrying a TCP timestamp value that causes cwnd undo. Before this fix,
> the IP CE bit was ignored and not reflected in the TCP ECE header bit,
> and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
> even though the cwnd reduction had been undone.  After this fix, the
> sender properly reflects the CE bit and does not set the W bit.
>
> Note: the bug actually predates 2005 git history; this Fixes footer is
> chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
> the patch applies cleanly (since before this commit the code was in a
> .h file).
>
> Fixes: bdf1ee5d3bd3 ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: Eric Dumazet <edumazet@google.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>

Thanks Neal

^ permalink raw reply

* Re: [PATCH] net/mlx5: reduce stack usage in FW tracer
From: Saeed Mahameed @ 2019-09-09 21:53 UTC (permalink / raw)
  To: arnd@arndb.de
  Cc: cai@lca.pw, linux-rdma@vger.kernel.org, davem@davemloft.net,
	Moshe Shemesh, Feras Daoud, linux-kernel@vger.kernel.org,
	Eran Ben Elisha, netdev@vger.kernel.org, leon@kernel.org,
	Erez Shitrit
In-Reply-To: <CAK8P3a0_VhZ9hYmc6P3Qx+Z6WSHh3PVZ7JZh7Tr=R1CAKvqWmA@mail.gmail.com>

On Mon, 2019-09-09 at 22:18 +0200, Arnd Bergmann wrote:
> On Mon, Sep 9, 2019 at 9:39 PM Saeed Mahameed <saeedm@mellanox.com>
> wrote:
> > On Fri, 2019-09-06 at 17:11 +0200, Arnd Bergmann wrote:
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
> > > @@ -557,16 +557,16 @@ static void mlx5_tracer_print_trace(struct
> > > tracer_string_format *str_frmt,
> > >                                   struct mlx5_core_dev *dev,
> > >                                   u64 trace_timestamp)
> > >  {
> > > -     char    tmp[512];
> > > -
> > 
> > Hi Arnd, thanks for the patch,
> > this function is very perfomance critical when fw traces are
> > activated
> > to pull some fw content on error situations, using kmalloc here
> > might
> > become a problem and stall the system further more if the problem
> > was
> > initially due to lack of memory.
> > 
> > since this function only needs 512 bytes maybe we should mark it as
> > noinline to avoid any extra stack usages on the caller function
> > mlx5_fw_tracer_handle_traces ?
> 
> That would shut up the warning, but doesn't sound right either.
> 
> If it's performance critical indeed, maybe the best solution would
> be to also avoid the snprintf(), as that is also a rather heavyweight
> function?
> 
> I could not find an easy solution for this, but I did notice the
> unusual way
> this deals with a variable format string passed into
> mlx5_tracer_print_trace
> along with a set of parameters, which opens up a set of possible
> format string vulnerabilities as well as making
> mlx5_tracer_print_trace()
> a bit expensive. You also take a mutex and free memory in there,
> which obviously then also got allocated in the fast path.
> 
> To do this right, a better approach may be to just rely on ftrace,
> storing
> the (pointer to the) format string and the arguments in the buffer
> without
> creating a string. Would that be an option here?

I am not sure how this would work, since the format parameters can
changes depending on the FW string and the specific traces.

> 
> A more minimal approach might be to move what is now the on-stack
> buffer into the mlx5_fw_tracer function. I see that you already store
> a copy of the string in there from mlx5_fw_tracer_save_trace(),
> which conveniently also holds a mutex already that protects
> it from concurrent access.
> 

This sounds plausible.

So for now let's do this or the noinline approach, Please let me know
which one do you prefer, if it is the mutex protected buffer, i can do
it myself.

I will open an internal task and discussion then address your valuable
points in a future submission, since we already in rc8 I don't want to
take the risk now.

Thanks for your feedback !
Saeed.

>        Arnd

^ permalink raw reply

* [PATCH] lib/Kconfig: fix OBJAGG in lib/ menu structure
From: Randy Dunlap @ 2019-09-09 21:54 UTC (permalink / raw)
  To: netdev@vger.kernel.org, LKML, David Miller; +Cc: Jiri Pirko, Ido Schimmel

From: Randy Dunlap <rdunlap@infradead.org>

Keep the "Library routines" menu intact by moving OBJAGG into it.
Otherwise OBJAGG is displayed/presented as an orphan in the
various config menus.

Fixes: 0a020d416d0a ("lib: introduce initial implementation of object aggregation manager")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Ido Schimmel <idosch@mellanox.com>
Cc: David S. Miller <davem@davemloft.net>
---
 lib/Kconfig |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- linux-next-20190904.orig/lib/Kconfig
+++ linux-next-20190904/lib/Kconfig
@@ -631,6 +631,9 @@ config SBITMAP
 config PARMAN
 	tristate "parman" if COMPILE_TEST
 
+config OBJAGG
+	tristate "objagg" if COMPILE_TEST
+
 config STRING_SELFTEST
 	tristate "Test string functions"
 
@@ -653,6 +656,3 @@ config GENERIC_LIB_CMPDI2
 
 config GENERIC_LIB_UCMPDI2
 	bool
-
-config OBJAGG
-	tristate "objagg" if COMPILE_TEST


^ permalink raw reply

* Re: [PATCH net-next v8 2/3] net: phy: add support for clause 37 auto-negotiation
From: Tao Ren @ 2019-09-09 21:56 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Heiner Kallweit, David S . Miller,
	Vladimir Oltean, Arun Parameswaran, Justin Chen,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	openbmc@lists.ozlabs.org
In-Reply-To: <20190909204906.2191290-1-taoren@fb.com>

Hi Andrew / Florian,

On 9/9/19 1:49 PM, Tao Ren wrote:
> From: Heiner Kallweit <hkallweit1@gmail.com>
> 
> This patch adds support for clause 37 1000Base-X auto-negotiation.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> Signed-off-by: Tao Ren <taoren@fb.com>
> Tested-by: René van Dorst <opensource@vdorst.com>

Just want to check if I missed your comments (I noticed some wired problems on my email client)?

If yes (sorry I missed something), could you please re-send your comments? And I will address them as soon as possible.


Thanks,

Tao

> ---
>  Changes in v8:
>   - Rebased the patch on top of net-next HEAD.
>  Changes in v7:
>   - Update "if (AUTONEG_ENABLE != phydev->autoneg)" to
>     "if (phydev->autoneg != AUTONEG_ENABLE)" so checkpatch.pl is happy.
>  Changes in v6:
>   - add "Signed-off-by: Tao Ren <taoren@fb.com>"
>  Changes in v1-v5:
>   - nothing changed. It's given v5 just to align with the version of
>     patch series.
> 
>  drivers/net/phy/phy_device.c | 139 +++++++++++++++++++++++++++++++++++
>  include/linux/phy.h          |   4 +
>  2 files changed, 143 insertions(+)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 8933f07d39e9..dd05f750bb3f 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1607,6 +1607,40 @@ static int genphy_config_advert(struct phy_device *phydev)
>  	return changed;
>  }
>  
> +/**
> + * genphy_c37_config_advert - sanitize and advertise auto-negotiation parameters
> + * @phydev: target phy_device struct
> + *
> + * Description: Writes MII_ADVERTISE with the appropriate values,
> + *   after sanitizing the values to make sure we only advertise
> + *   what is supported.  Returns < 0 on error, 0 if the PHY's advertisement
> + *   hasn't changed, and > 0 if it has changed. This function is intended
> + *   for Clause 37 1000Base-X mode.
> + */
> +static int genphy_c37_config_advert(struct phy_device *phydev)
> +{
> +	u16 adv = 0;
> +
> +	/* Only allow advertising what this PHY supports */
> +	linkmode_and(phydev->advertising, phydev->advertising,
> +		     phydev->supported);
> +
> +	if (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseX_Full_BIT,
> +			      phydev->advertising))
> +		adv |= ADVERTISE_1000XFULL;
> +	if (linkmode_test_bit(ETHTOOL_LINK_MODE_Pause_BIT,
> +			      phydev->advertising))
> +		adv |= ADVERTISE_1000XPAUSE;
> +	if (linkmode_test_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT,
> +			      phydev->advertising))
> +		adv |= ADVERTISE_1000XPSE_ASYM;
> +
> +	return phy_modify_changed(phydev, MII_ADVERTISE,
> +				  ADVERTISE_1000XFULL | ADVERTISE_1000XPAUSE |
> +				  ADVERTISE_1000XHALF | ADVERTISE_1000XPSE_ASYM,
> +				  adv);
> +}
> +
>  /**
>   * genphy_config_eee_advert - disable unwanted eee mode advertisement
>   * @phydev: target phy_device struct
> @@ -1715,6 +1749,54 @@ int __genphy_config_aneg(struct phy_device *phydev, bool changed)
>  }
>  EXPORT_SYMBOL(__genphy_config_aneg);
>  
> +/**
> + * genphy_c37_config_aneg - restart auto-negotiation or write BMCR
> + * @phydev: target phy_device struct
> + *
> + * Description: If auto-negotiation is enabled, we configure the
> + *   advertising, and then restart auto-negotiation.  If it is not
> + *   enabled, then we write the BMCR. This function is intended
> + *   for use with Clause 37 1000Base-X mode.
> + */
> +int genphy_c37_config_aneg(struct phy_device *phydev)
> +{
> +	int err, changed;
> +
> +	if (phydev->autoneg != AUTONEG_ENABLE)
> +		return genphy_setup_forced(phydev);
> +
> +	err = phy_modify(phydev, MII_BMCR, BMCR_SPEED1000 | BMCR_SPEED100,
> +			 BMCR_SPEED1000);
> +	if (err)
> +		return err;
> +
> +	changed = genphy_c37_config_advert(phydev);
> +	if (changed < 0) /* error */
> +		return changed;
> +
> +	if (!changed) {
> +		/* Advertisement hasn't changed, but maybe aneg was never on to
> +		 * begin with?  Or maybe phy was isolated?
> +		 */
> +		int ctl = phy_read(phydev, MII_BMCR);
> +
> +		if (ctl < 0)
> +			return ctl;
> +
> +		if (!(ctl & BMCR_ANENABLE) || (ctl & BMCR_ISOLATE))
> +			changed = 1; /* do restart aneg */
> +	}
> +
> +	/* Only restart aneg if we are advertising something different
> +	 * than we were before.
> +	 */
> +	if (changed > 0)
> +		return genphy_restart_aneg(phydev);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(genphy_c37_config_aneg);
> +
>  /**
>   * genphy_aneg_done - return auto-negotiation status
>   * @phydev: target phy_device struct
> @@ -1862,6 +1944,63 @@ int genphy_read_status(struct phy_device *phydev)
>  }
>  EXPORT_SYMBOL(genphy_read_status);
>  
> +/**
> + * genphy_c37_read_status - check the link status and update current link state
> + * @phydev: target phy_device struct
> + *
> + * Description: Check the link, then figure out the current state
> + *   by comparing what we advertise with what the link partner
> + *   advertises. This function is for Clause 37 1000Base-X mode.
> + */
> +int genphy_c37_read_status(struct phy_device *phydev)
> +{
> +	int lpa, err, old_link = phydev->link;
> +
> +	/* Update the link, but return if there was an error */
> +	err = genphy_update_link(phydev);
> +	if (err)
> +		return err;
> +
> +	/* why bother the PHY if nothing can have changed */
> +	if (phydev->autoneg == AUTONEG_ENABLE && old_link && phydev->link)
> +		return 0;
> +
> +	phydev->duplex = DUPLEX_UNKNOWN;
> +	phydev->pause = 0;
> +	phydev->asym_pause = 0;
> +
> +	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
> +		lpa = phy_read(phydev, MII_LPA);
> +		if (lpa < 0)
> +			return lpa;
> +
> +		linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
> +				 phydev->lp_advertising, lpa & LPA_LPACK);
> +		linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseX_Full_BIT,
> +				 phydev->lp_advertising, lpa & LPA_1000XFULL);
> +		linkmode_mod_bit(ETHTOOL_LINK_MODE_Pause_BIT,
> +				 phydev->lp_advertising, lpa & LPA_1000XPAUSE);
> +		linkmode_mod_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT,
> +				 phydev->lp_advertising,
> +				 lpa & LPA_1000XPAUSE_ASYM);
> +
> +		phy_resolve_aneg_linkmode(phydev);
> +	} else if (phydev->autoneg == AUTONEG_DISABLE) {
> +		int bmcr = phy_read(phydev, MII_BMCR);
> +
> +		if (bmcr < 0)
> +			return bmcr;
> +
> +		if (bmcr & BMCR_FULLDPLX)
> +			phydev->duplex = DUPLEX_FULL;
> +		else
> +			phydev->duplex = DUPLEX_HALF;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(genphy_c37_read_status);
> +
>  /**
>   * genphy_soft_reset - software reset the PHY via BMCR_RESET bit
>   * @phydev: target phy_device struct
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index a7ecbe0e55aa..cd9786ccf630 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -1104,6 +1104,10 @@ int genphy_read_mmd_unsupported(struct phy_device *phdev, int devad,
>  int genphy_write_mmd_unsupported(struct phy_device *phdev, int devnum,
>  				 u16 regnum, u16 val);
>  
> +/* Clause 37 */
> +int genphy_c37_config_aneg(struct phy_device *phydev);
> +int genphy_c37_read_status(struct phy_device *phydev);
> +
>  /* Clause 45 PHY */
>  int genphy_c45_restart_aneg(struct phy_device *phydev);
>  int genphy_c45_check_and_restart_aneg(struct phy_device *phydev, bool restart);
> 

^ permalink raw reply

* Re: general protection fault in cbs_destroy
From: Cong Wang @ 2019-09-09 22:23 UTC (permalink / raw)
  To: Hillf Danton
  Cc: syzbot, David Miller, Jamal Hadi Salim, Jiri Pirko, LKML,
	Linux Kernel Network Developers, syzkaller-bugs,
	Vinicius Costa Gomes
In-Reply-To: <20190908072907.15220-1-hdanton@sina.com>

On Sun, Sep 8, 2019 at 12:29 AM Hillf Danton <hdanton@sina.com> wrote:
>
>
> > syzbot found the following crash on Sat, 07 Sep 2019 23:08:08 -0700
> >
> > HEAD commit:    3b47fd5c Merge tag 'nfs-for-5.3-4' of git://git.linux-nfs...
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=14854e71600000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=144488c6c6c6d2b6
> > dashboard link: https://syzkaller.appspot.com/bug?extid=3a8d6a998cbb73bcf337
> > compiler:       clang version 9.0.0 (/home/glider/llvm/clang
> > 80fee25776c2fb61e74c1ecb1a523375c2500b69)
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17998f9e600000
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10421efa600000
> >
> > general protection fault: 0000 [#1] PREEMPT SMP KASAN
> > CPU: 0 PID: 9249 Comm: syz-executor457 Not tainted 5.3.0-rc7+ #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > RIP: 0010:__list_del_entry_valid+0x6b/0x100 lib/list_debug.c:51
> > Code: 4c 89 f7 e8 97 d0 58 fe 48 ba 00 01 00 00 00 00 ad de 49 8b 1e 48 39
> > d3 74 54 48 83 c2 22 49 39 d7 74 5e 4c 89 f8 48 c1 e8 03 <42> 80 3c 20 00
> > 74 08 4c 89 ff e8 66 d0 58 fe 49 8b 17 4c 39 f2 75
> > RSP: 0018:ffff88809898f568 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
> > RDX: dead000000000122 RSI: 0000000000000004 RDI: ffff88809fb5a7e8
> > RBP: ffff88809898f588 R08: dffffc0000000000 R09: ffffed1013131ea8
> > R10: ffffed1013131ea8 R11: 0000000000000000 R12: dffffc0000000000
> > R13: ffff88809fb5a480 R14: ffff88809fb5a7e0 R15: 0000000000000000
> > FS:  00005555568cb880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000020000610 CR3: 00000000a3968000 CR4: 00000000001406f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >   __list_del_entry include/linux/list.h:131 [inline]
> >   list_del include/linux/list.h:139 [inline]
> >   cbs_destroy+0x85/0x3e0 net/sched/sch_cbs.c:435
> >   qdisc_create+0xff8/0x13e0 net/sched/sch_api.c:1302
> >   tc_modify_qdisc+0x989/0x1ea0 net/sched/sch_api.c:1652
> >   rtnetlink_rcv_msg+0x889/0xd40 net/core/rtnetlink.c:5223
> >   netlink_rcv_skb+0x19e/0x3d0 net/netlink/af_netlink.c:2477
> >   rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:5241
> >   netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
> >   netlink_unicast+0x787/0x900 net/netlink/af_netlink.c:1328
> >   netlink_sendmsg+0x993/0xc50 net/netlink/af_netlink.c:1917
> >   sock_sendmsg_nosec net/socket.c:637 [inline]
> >   sock_sendmsg net/socket.c:657 [inline]
> >   ___sys_sendmsg+0x60d/0x910 net/socket.c:2311
> >   __sys_sendmsg net/socket.c:2356 [inline]
> >   __do_sys_sendmsg net/socket.c:2365 [inline]
> >   __se_sys_sendmsg net/socket.c:2363 [inline]
> >   __x64_sys_sendmsg+0x17c/0x200 net/socket.c:2363
> >   do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:296
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> --- a/net/sched/sch_cbs.c
> +++ b/net/sched/sch_cbs.c
> @@ -401,6 +401,7 @@ static int cbs_init(struct Qdisc *sch, s
>         if (!q->qdisc)
>                 return -ENOMEM;
>
> +       INIT_LIST_HEAD(&q->cbs_list);
>         qdisc_hash_add(q->qdisc, false);
>
>         q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0);

There is already a patch for the same bug:
http://patchwork.ozlabs.org/patch/1154697/

Vinicius, can you send V2 as David requested?

Thanks.

^ permalink raw reply

* Re: [RFC bpf-next 2/7] bpf: extend bpf_pcap support to tracing programs
From: Alan Maguire @ 2019-09-09 22:25 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alan Maguire, ast@kernel.org, daniel@iogearbox.net, Martin Lau,
	Song Liu, davem@davemloft.net, jakub.kicinski@netronome.com,
	hawk@kernel.org, john.fastabend@gmail.com, rostedt@goodmis.org,
	mingo@redhat.com, quentin.monnet@netronome.com, Andrey Ignatov,
	joe@wand.net.nz, acme@redhat.com, jolsa@kernel.org,
	alexey.budankov@linux.intel.com, gregkh@linuxfoundation.org,
	namhyung@kernel.org, sdf@google.com, f.fainelli@gmail.com,
	shuah@kernel.org, peter@lekensteyn.nl, ivan@cloudflare.com,
	Andrii Nakryiko, bhole_prashant_q7@lab.ntt.co.jp,
	david.calavera@gmail.com, danieltimlee@gmail.com,
	Takshak Chahande, netdev@vger.kernel.org, bpf@vger.kernel.org,
	linux-kselftest@vger.kernel.org, toke, jbenc, acme
In-Reply-To: <89305ec8-7e03-3cd0-4e39-c3760dd3477b@fb.com>

On Sun, 8 Sep 2019, Yonghong Song wrote:
 
> For net side bpf_perf_event_output, we have
> static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
>                                    unsigned long off, unsigned long len)
> {
>          void *ptr = skb_header_pointer(skb, off, len, dst_buff);
> 
>          if (unlikely(!ptr))
>                  return len;
>          if (ptr != dst_buff)
>                  memcpy(dst_buff, ptr, len);
> 
>          return 0;
> }
> 
> BPF_CALL_5(bpf_skb_event_output, struct sk_buff *, skb, struct bpf_map 
> *, map,
>             u64, flags, void *, meta, u64, meta_size)
> {
>          u64 skb_size = (flags & BPF_F_CTXLEN_MASK) >> 32;
> 
>          if (unlikely(flags & ~(BPF_F_CTXLEN_MASK | BPF_F_INDEX_MASK)))
>                  return -EINVAL;
>          if (unlikely(skb_size > skb->len))
>                  return -EFAULT;
> 
>          return bpf_event_output(map, flags, meta, meta_size, skb, skb_size,
>                                  bpf_skb_copy);
> }
> 
> It does not really consider output all the frags.
> I understand that to get truly all packet data, frags should be
> considered, but seems we did not do it before? I am wondering
> whether we need to do here.

Thanks for the feedback! In experimenting with packet capture,
my original hope was to keep things simple and avoid fragment parsing
if possible. However if scatter-gather is enabled for the networking
device, or indeed if it's running in a VM it turns out a lot of the
interesting packet data ends up in the fragments on transmit (ssh
headers, http headers etc).  So I think it would be worth considering
adding support for fragment traversal.  It's not needed as much
in the skb program case - we can always pullup the skb - but in
the tracing situation we probably wouldn't want to do something
that invasive in tracing context.

Fragment traversal might be worth breaking out as a separate patchset, 
perhaps triggered by a specific flag to bpf_skb_event_output?

Feedback from folks at Linux Plumbers (I hope I'm summarizing correctly) 
seemed to agree with what you mentioned WRT the first patch in this 
series.  The gist was we probably don't want to force the metadata to be a 
specific packet capture type; we'd rather use the existing perf event 
mechanisms and if we are indeed doing packet capture, simply specify that 
data in the program as metadata. 

I'd be happy with that approach myself if I could capture skb 
fragments in tracing programs - being able to do that would give 
equivalent functionality to what I proposed but without having a packet 
capture-specific helper.
> 
> If we indeed do not need to handle frags here, I think maybe
> bpf_probe_read() in existing bpf kprobe function should be
> enough, we do not need this helper?
> 

Certainly for many use cases, that will get you most of what you need - 
particularly if you're just looking at L2 to L4 data. For full packet 
capture however I think we may need to think about fragment traversal.

> > +
> > +/* Derive protocol for some of the easier cases.  For tracing, a probe point
> > + * may be dealing with packets in various states. Common cases are IP
> > + * packets prior to adding MAC header (_PCAP_TYPE_IP) and a full packet
> > + * (_PCAP_TYPE_ETH).  For other cases the caller must specify the
> > + * protocol they expect.  Other heuristics for packet identification
> > + * should be added here as needed, since determining the packet type
> > + * ensures we do not capture packets that fail to match the desired
> > + * pcap type in BPF_F_PCAP_STRICT_TYPE mode.
> > + */
> > +static inline int bpf_skb_protocol_get(struct sk_buff *skb)
> > +{
> > +	switch (htons(skb->protocol)) {
> > +	case ETH_P_IP:
> > +	case ETH_P_IPV6:
> > +		if (skb_network_header(skb) == skb->data)
> > +			return BPF_PCAP_TYPE_IP;
> > +		else
> > +			return BPF_PCAP_TYPE_ETH;
> > +	default:
> > +		return BPF_PCAP_TYPE_UNSET;
> > +	}
> > +}
> > +
> > +BPF_CALL_5(bpf_trace_pcap, void *, data, u32, size, struct bpf_map *, map,
> > +	   int, protocol_wanted, u64, flags)
> 
> Up to now, for helpers, verifier has a way to verifier it is used 
> properly regarding to the context. For example, for xdp version
> perf_event_output, the help prototype,
>    BPF_CALL_5(bpf_xdp_event_output, struct xdp_buff *, xdp, struct 
> bpf_map *, map,
>             u64, flags, void *, meta, u64, meta_size)
> the verifier is able to guarantee that the first parameter
> has correct type xdp_buff, not something from type cast.
>    .arg1_type      = ARG_PTR_TO_CTX,
> 
> This helper, in the below we have
>    .arg1_type	= ARG_ANYTHING,
> 
> So it is not really enforced. Bringing BTF can help, but type
> name matching typically bad.
> 
> 
One thing we were discussing - and I think this is similar to what
you're suggesting - is to investigate if there might be a way to
leverage BTF to provide additional guarantees that the tracing
data we are handling is indeed an skb.  Specifically if we
trace a kprobe function argument or a tracepoint function, and
if we had that guarantee, we could perhaps invoke the skb-style
perf event output function (trace both the skb data and the metadata).
The challenge would be how to do that type-based matching; we'd
need the function argument information from BTF _and_ need to
somehow associate it at probe attach time. 

Thanks again for looking at the code!

Alan

^ permalink raw reply

* Re: WARNING in cbs_dequeue_soft
From: Cong Wang @ 2019-09-09 22:28 UTC (permalink / raw)
  To: syzbot
  Cc: David Miller, Jamal Hadi Salim, Jiri Pirko, Leandro Dorileo, LKML,
	Linux Kernel Network Developers, syzkaller-bugs, vedang.patel
In-Reply-To: <000000000000e3a76c0592047ea7@google.com>

#syz fix: net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate

^ permalink raw reply

* [PATCH] bpf: validate bpf_func when BPF_JIT is enabled
From: Sami Tolvanen @ 2019-09-09 22:32 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Kees Cook, Martin KaFai Lau, Song Liu, Yonghong Song, netdev, bpf,
	linux-kernel, Sami Tolvanen

With CONFIG_BPF_JIT, the kernel makes indirect calls to dynamically
generated code. This change adds basic sanity checking to ensure
we are jumping to a valid location, which narrows down the attack
surface on the stored pointer. This also prepares the code for future
Control-Flow Integrity (CFI) checking, which adds indirect call
validation to call targets that can be determined at compile-time, but
cannot validate calls to jited functions.

In addition, this change adds a weak arch_bpf_jit_check_func function,
which architectures that implement BPF JIT can override to perform
additional validation, such as verifying that the pointer points to
the correct memory region.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
---
 include/linux/filter.h | 26 ++++++++++++++++++++++++--
 kernel/bpf/core.c      | 25 +++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 92c6e31fb008..abfb0e1b21a8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -511,7 +511,10 @@ struct sock_fprog_kern {
 	struct sock_filter	*filter;
 };
 
+#define BPF_BINARY_HEADER_MAGIC	0x05de0e82
+
 struct bpf_binary_header {
+	u32 magic;
 	u32 pages;
 	/* Some arches need word alignment for their instructions */
 	u8 image[] __aligned(4);
@@ -553,20 +556,39 @@ struct sk_filter {
 
 DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
 
+#ifdef CONFIG_BPF_JIT
+/*
+ * With JIT, the kernel makes an indirect call to dynamically generated
+ * code. Use bpf_call_func to perform additional validation of the call
+ * target to narrow down attack surface. Architectures implementing BPF
+ * JIT can override arch_bpf_jit_check_func for arch-specific checking.
+ */
+extern unsigned int bpf_call_func(const struct bpf_prog *prog,
+				  const void *ctx);
+
+extern bool arch_bpf_jit_check_func(const struct bpf_prog *prog);
+#else
+static inline unsigned int bpf_call_func(const struct bpf_prog *prog,
+					 const void *ctx)
+{
+	return prog->bpf_func(ctx, prog->insnsi);
+}
+#endif
+
 #define BPF_PROG_RUN(prog, ctx)	({				\
 	u32 ret;						\
 	cant_sleep();						\
 	if (static_branch_unlikely(&bpf_stats_enabled_key)) {	\
 		struct bpf_prog_stats *stats;			\
 		u64 start = sched_clock();			\
-		ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi);	\
+		ret = bpf_call_func(prog, ctx);			\
 		stats = this_cpu_ptr(prog->aux->stats);		\
 		u64_stats_update_begin(&stats->syncp);		\
 		stats->cnt++;					\
 		stats->nsecs += sched_clock() - start;		\
 		u64_stats_update_end(&stats->syncp);		\
 	} else {						\
-		ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi);	\
+		ret = bpf_call_func(prog, ctx);			\
 	}							\
 	ret; })
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 66088a9e9b9e..7aad58f67105 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -792,6 +792,30 @@ void __weak bpf_jit_free_exec(void *addr)
 	module_memfree(addr);
 }
 
+#ifdef CONFIG_BPF_JIT
+bool __weak arch_bpf_jit_check_func(const struct bpf_prog *prog)
+{
+	return true;
+}
+
+unsigned int bpf_call_func(const struct bpf_prog *prog, const void *ctx)
+{
+	const struct bpf_binary_header *hdr = bpf_jit_binary_hdr(prog);
+
+	if (!IS_ENABLED(CONFIG_BPF_JIT_ALWAYS_ON) && !prog->jited)
+		return prog->bpf_func(ctx, prog->insnsi);
+
+	if (unlikely(hdr->magic != BPF_BINARY_HEADER_MAGIC ||
+		     !arch_bpf_jit_check_func(prog))) {
+		WARN(1, "attempt to jump to an invalid address");
+		return 0;
+	}
+
+	return prog->bpf_func(ctx, prog->insnsi);
+}
+EXPORT_SYMBOL_GPL(bpf_call_func);
+#endif
+
 struct bpf_binary_header *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
@@ -818,6 +842,7 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 	/* Fill space with illegal/arch-dep instructions. */
 	bpf_fill_ill_insns(hdr, size);
 
+	hdr->magic = BPF_BINARY_HEADER_MAGIC;
 	hdr->pages = pages;
 	hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)),
 		     PAGE_SIZE - sizeof(*hdr));
-- 
2.23.0.162.g0b9fbb3734-goog


^ permalink raw reply related

* [RFC PATCH net-next 0/2] more IPv4 unicast extensions
From: Dave Taht @ 2019-09-09 22:37 UTC (permalink / raw)
  To: netdev; +Cc: Dave Taht

Adding support in linux for the 240/4 and 0/8 ipv4 address ranges were
easy, no brainer removals of obsolete support for obsolete
specifications. Has anyone noticed yet?

The following two patches are intended as discussion points fot
my https://linuxplumbersconf.org/event/4/contributions/457/
talk today.

Dave Taht (2):
  Allow 225/8-231/8 as unicast
  Reduce localhost to a /16

 include/linux/in.h | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [RFC PATCH net-next 2/2] Reduce localhost to 127.0.0.0/16
From: Dave Taht @ 2019-09-09 22:37 UTC (permalink / raw)
  To: netdev; +Cc: Dave Taht
In-Reply-To: <1568068639-6511-1-git-send-email-dave.taht@gmail.com>

The 127 concept of "localhost" was  basically a straight port over from
original ARPANET behavior. At the time it was envisioned that
many services would exist in the mainframe that would need to be
individually addressable, and long predated alternative approaches
such as tipc, etc.

This reduces 127.0.0.0/8 to a /16, and opens up 127.1.0.0 and above
for use as real, unicast addresses either on the wire or internal
to containers, virtual machines, vpns, etc.

The only major uses of 127 to date have been (of
course) 127.0.0.1 and 127.0.1.1 (for dns), and various adhoc uses
for things like anti-spam daemons.

Linux also has a non-standard treatment of the entire 127 address
space - anything sent to any address there "just works". This patch
preserves that behavior for 127.0.0.0/16 only.

Other uses, such as ntp's 127.127 "handle" for it, doesn't actually
touch the stack.  We've seen 127.x used for chassis access and a few
other explicit uses in the field, but very little.

There is some small hope that we can preserve the original intent of
"localhost" in an age of stacked vms and containers, where instead of
using rfc1918 nat at each level, we can route, instead.

---
 include/linux/in.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 8665842a3589..6e4f37e5bf51 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -37,7 +37,9 @@ static inline int proto_ports_offset(int proto)

 static inline bool ipv4_is_loopback(__be32 addr)
 {
-	return (addr & htonl(0xff000000)) == htonl(0x7f000000);
+	if((addr & htonl(0xff000000)) == htonl(0x7f000000))
+		return( addr & htonl(0x00ff0000)) == 0;
+	return 0;
 }

 static inline bool ipv4_is_multicast(__be32 addr)
-- 
2.17.1

^ permalink raw reply related

* [RFC PATCH net-next 1/2] Allow 225/8-231/8 as unicast
From: Dave Taht @ 2019-09-09 22:37 UTC (permalink / raw)
  To: netdev; +Cc: Dave Taht
In-Reply-To: <1568068639-6511-1-git-send-email-dave.taht@gmail.com>

This patch converts the long "reserved for future use" multicast
address space, 225/8-231/8 - 120m addresses - for use as unicast
addresses instead.

In a comprehensive survey of all the open source code on the planet
we found no users of this range. We found some official and unofficial
usage of addresses in 224/8 and in 239/8 - both spaces at well under
50% allocation in the first place, so we anticipate no additional growth
for any reason, into the 225-231 spaces.

There will be some short term incompatabilities induced.

The principal flaw of converting this space to unicast involves
a non-uniext box, sending a packet to the formerly multicast address, 
and the reply coming back from that "formerly multicast" address
as unicast.

The return packet will be dropped because the source of the reply is unicast
(L2) with what the non-uniext box considers to be multicast (L3).

and, like all multicast packets sent anywhere, the attempt will still
flood all ports on the local switch.

A tcp attempt fails immediately due to the inherent IN_MULTICAST
check in the existing kernel. Some stacks (not linux) MAY do more 
of the wrong thing here.

As for userspace exposure...

We were only able to find 89 packages in fedora that used the IN_MULTICAST
macro. Currently the plan is not to kill IN_MULTICAST, (as doing it right
requires access to the big endian macros) but retire its usages in
the kernel (already done) and then the very few programs that use it userspace.

All the routing daemons we've inspected and modified don't use IN_MULTICAST.
The patches to them are trivial.

New users of multicast, seem to always pick something out of the 224/8
or 239/8 ranges, which are untouched by this patch.

Additional potential problems include: 

* hardware offloads that explicitly check for multicast
* binary firmware that explicitly checks for multicast
* a tiny cpu hit

Whether or not these problems are worth addressing to regain 120m
useful unicast addresses in the next decade is up for debate.

---
 include/linux/in.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1873ef642605..8665842a3589 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -42,7 +42,10 @@ static inline bool ipv4_is_loopback(__be32 addr)

 static inline bool ipv4_is_multicast(__be32 addr)
 {
-	return (addr & htonl(0xf0000000)) == htonl(0xe0000000);
+	if((addr & htonl(0xf0000000)) == htonl(0xe0000000))
+		return !((htonl(addr) >= 0xe1000000) &&
+			 (htonl(addr) < 0xe8000000));
+	return 0;
 }

 static inline bool ipv4_is_local_multicast(__be32 addr)
-- 
2.17.1

^ permalink raw reply related

* [net-next v2 02/15] ixgbevf: Link lost in VM on ixgbevf when restoring from freeze or suspend
From: Jeff Kirsher @ 2019-09-09 22:47 UTC (permalink / raw)
  To: davem; +Cc: Radoslaw Tyl, netdev, nhorman, sassmann, Andrew Bowers,
	Jeff Kirsher
In-Reply-To: <20190909224802.29595-1-jeffrey.t.kirsher@intel.com>

From: Radoslaw Tyl <radoslawx.tyl@intel.com>

This patch fixed issue in VM which shows no link when hypervisor is
restored from low-power state. The driver is responsible for re-enabling
any features of the device that had been disabled during suspend calls,
such as IRQs and bus mastering.

Signed-off-by: Radoslaw Tyl <radoslawx.tyl@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 8c011d4ce7a9..75e849a64db7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -2517,6 +2517,7 @@ void ixgbevf_reinit_locked(struct ixgbevf_adapter *adapter)
 		msleep(1);
 
 	ixgbevf_down(adapter);
+	pci_set_master(adapter->pdev);
 	ixgbevf_up(adapter);
 
 	clear_bit(__IXGBEVF_RESETTING, &adapter->state);
-- 
2.21.0


^ permalink raw reply related

* [net-next v2 03/15] e1000e: Make speed detection on hotplugging cable more reliable
From: Jeff Kirsher @ 2019-09-09 22:47 UTC (permalink / raw)
  To: davem; +Cc: Kai-Heng Feng, netdev, nhorman, sassmann, Aaron Brown,
	Jeff Kirsher
In-Reply-To: <20190909224802.29595-1-jeffrey.t.kirsher@intel.com>

From: Kai-Heng Feng <kai.heng.feng@canonical.com>

After hot plugging an 1Gbps Ethernet cable with 1Gbps link partner, the
MII_BMSR may report 10Mbps, renders the network rather slow.

The issue has much lower fail rate after commit 59653e6497d1 ("e1000e:
Make watchdog use delayed work"), which essentially introduces some
delay before running the watchdog task.

But there's still a chance that the hot plugging event and the queued
watchdog task gets run at the same time, then the original issue can be
observed once again.

So let's use mod_delayed_work() to add a deterministic 1 second delay
before running watchdog task, after an interrupt.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 8a3f035c3a5f..d7d56e42a6aa 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1780,8 +1780,8 @@ static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
 		}
 		/* guard against interrupt when we're going down */
 		if (!test_bit(__E1000_DOWN, &adapter->state))
-			queue_delayed_work(adapter->e1000_workqueue,
-					   &adapter->watchdog_task, 1);
+			mod_delayed_work(adapter->e1000_workqueue,
+					 &adapter->watchdog_task, HZ);
 	}
 
 	/* Reset on uncorrectable ECC error */
@@ -1861,8 +1861,8 @@ static irqreturn_t e1000_intr(int __always_unused irq, void *data)
 		}
 		/* guard against interrupt when we're going down */
 		if (!test_bit(__E1000_DOWN, &adapter->state))
-			queue_delayed_work(adapter->e1000_workqueue,
-					   &adapter->watchdog_task, 1);
+			mod_delayed_work(adapter->e1000_workqueue,
+					 &adapter->watchdog_task, HZ);
 	}
 
 	/* Reset on uncorrectable ECC error */
@@ -1907,8 +1907,8 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 		hw->mac.get_link_status = true;
 		/* guard against interrupt when we're going down */
 		if (!test_bit(__E1000_DOWN, &adapter->state))
-			queue_delayed_work(adapter->e1000_workqueue,
-					   &adapter->watchdog_task, 1);
+			mod_delayed_work(adapter->e1000_workqueue,
+					 &adapter->watchdog_task, HZ);
 	}
 
 	if (!test_bit(__E1000_DOWN, &adapter->state))
-- 
2.21.0


^ permalink raw reply related

* [net-next v2 05/15] Documentation: iavf: Update the Intel LAN driver doc for iavf
From: Jeff Kirsher @ 2019-09-09 22:47 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, Aaron Brown
In-Reply-To: <20190909224802.29595-1-jeffrey.t.kirsher@intel.com>

Update the LAN driver documentation to include the latest feature
implementation and driver capabilities.

Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
---
 .../networking/device_drivers/intel/iavf.rst  | 115 +++++++++++++-----
 1 file changed, 82 insertions(+), 33 deletions(-)

diff --git a/Documentation/networking/device_drivers/intel/iavf.rst b/Documentation/networking/device_drivers/intel/iavf.rst
index 2d0c3baa1752..cfc08842e32c 100644
--- a/Documentation/networking/device_drivers/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/intel/iavf.rst
@@ -10,11 +10,15 @@ Copyright(c) 2013-2018 Intel Corporation.
 Contents
 ========
 
+- Overview
 - Identifying Your Adapter
 - Additional Configurations
 - Known Issues/Troubleshooting
 - Support
 
+Overview
+========
+
 This file describes the iavf Linux* Base Driver. This driver was formerly
 called i40evf.
 
@@ -27,6 +31,7 @@ The guest OS loading the iavf driver must support MSI-X interrupts.
 
 Identifying Your Adapter
 ========================
+
 The driver in this kernel is compatible with devices based on the following:
  * Intel(R) XL710 X710 Virtual Function
  * Intel(R) X722 Virtual Function
@@ -50,9 +55,10 @@ Link messages will not be displayed to the console if the distribution is
 restricting system messages. In order to see network driver link messages on
 your console, set dmesg to eight by entering the following::
 
-  dmesg -n 8
+    # dmesg -n 8
 
-NOTE: This setting is not saved across reboots.
+NOTE:
+  This setting is not saved across reboots.
 
 ethtool
 -------
@@ -72,11 +78,11 @@ then requests from that VF to set VLAN tag stripping will be ignored.
 To enable/disable VLAN tag stripping for a VF, issue the following command
 from inside the VM in which you are running the VF::
 
-  ethtool -K <if_name> rxvlan on/off
+    # ethtool -K <if_name> rxvlan on/off
 
 or alternatively::
 
-  ethtool --offload <if_name> rxvlan on/off
+    # ethtool --offload <if_name> rxvlan on/off
 
 Adaptive Virtual Function
 -------------------------
@@ -91,21 +97,21 @@ additional features depending on what features are available in the PF with
 which the AVF is associated. The following are base mode features:
 
 - 4 Queue Pairs (QP) and associated Configuration Status Registers (CSRs)
-  for Tx/Rx.
-- i40e descriptors and ring format.
-- Descriptor write-back completion.
-- 1 control queue, with i40e descriptors, CSRs and ring format.
-- 5 MSI-X interrupt vectors and corresponding i40e CSRs.
-- 1 Interrupt Throttle Rate (ITR) index.
-- 1 Virtual Station Interface (VSI) per VF.
+  for Tx/Rx
+- i40e descriptors and ring format
+- Descriptor write-back completion
+- 1 control queue, with i40e descriptors, CSRs and ring format
+- 5 MSI-X interrupt vectors and corresponding i40e CSRs
+- 1 Interrupt Throttle Rate (ITR) index
+- 1 Virtual Station Interface (VSI) per VF
 - 1 Traffic Class (TC), TC0
 - Receive Side Scaling (RSS) with 64 entry indirection table and key,
-  configured through the PF.
-- 1 unicast MAC address reserved per VF.
-- 16 MAC address filters for each VF.
-- Stateless offloads - non-tunneled checksums.
-- AVF device ID.
-- HW mailbox is used for VF to PF communications (including on Windows).
+  configured through the PF
+- 1 unicast MAC address reserved per VF
+- 16 MAC address filters for each VF
+- Stateless offloads - non-tunneled checksums
+- AVF device ID
+- HW mailbox is used for VF to PF communications (including on Windows)
 
 IEEE 802.1ad (QinQ) Support
 ---------------------------
@@ -117,8 +123,8 @@ VLAN ID, among other uses.
 
 The following are examples of how to configure 802.1ad (QinQ)::
 
-  ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
-  ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
+    # ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
+    # ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
 
 Where "24" and "371" are example VLAN IDs.
 
@@ -133,6 +139,19 @@ specific application. This can reduce latency for the specified application,
 and allow Tx traffic to be rate limited per application. Follow the steps below
 to set ADq.
 
+Requirements:
+
+- The sch_mqprio, act_mirred and cls_flower modules must be loaded
+- The latest version of iproute2
+- If another driver (for example, DPDK) has set cloud filters, you cannot
+  enable ADQ
+- Depending on the underlying PF device, ADQ cannot be enabled when the
+  following features are enabled:
+
+  + Data Center Bridging (DCB)
+  + Multiple Functions per Port (MFP)
+  + Sideband Filters
+
 1. Create traffic classes (TCs). Maximum of 8 TCs can be created per interface.
 The shaper bw_rlimit parameter is optional.
 
@@ -141,9 +160,9 @@ to 1Gbit for tc0 and 3Gbit for tc1.
 
 ::
 
-  # tc qdisc add dev <interface> root mqprio num_tc 2 map 0 0 0 0 1 1 1 1
-  queues 16@0 16@16 hw 1 mode channel shaper bw_rlimit min_rate 1Gbit 2Gbit
-  max_rate 1Gbit 3Gbit
+    tc qdisc add dev <interface> root mqprio num_tc 2 map 0 0 0 0 1 1 1 1
+    queues 16@0 16@16 hw 1 mode channel shaper bw_rlimit min_rate 1Gbit 2Gbit
+    max_rate 1Gbit 3Gbit
 
 map: priority mapping for up to 16 priorities to tcs (e.g. map 0 0 0 0 1 1 1 1
 sets priorities 0-3 to use tc0 and 4-7 to use tc1)
@@ -162,6 +181,10 @@ Totals must be equal or less than port speed.
 For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
 monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
 
+NOTE:
+  Setting up channels via ethtool (ethtool -L) is not supported when the
+  TCs are configured using mqprio.
+
 2. Enable HW TC offload on interface::
 
     # ethtool -K <interface> hw-tc-offload on
@@ -171,16 +194,16 @@ monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
     # tc qdisc add dev <interface> ingress
 
 NOTES:
- - Run all tc commands from the iproute2 <pathtoiproute2>/tc/ directory.
- - ADq is not compatible with cloud filters.
+ - Run all tc commands from the iproute2 <pathtoiproute2>/tc/ directory
+ - ADq is not compatible with cloud filters
  - Setting up channels via ethtool (ethtool -L) is not supported when the TCs
-   are configured using mqprio.
+   are configured using mqprio
  - You must have iproute2 latest version
- - NVM version 6.01 or later is required.
+ - NVM version 6.01 or later is required
  - ADq cannot be enabled when any the following features are enabled: Data
-   Center Bridging (DCB), Multiple Functions per Port (MFP), or Sideband Filters.
+   Center Bridging (DCB), Multiple Functions per Port (MFP), or Sideband Filters
  - If another driver (for example, DPDK) has set cloud filters, you cannot
-   enable ADq.
+   enable ADq
  - Tunnel filters are not supported in ADq. If encapsulated packets do arrive
    in non-tunnel mode, filtering will be done on the inner headers.  For example,
    for VXLAN traffic in non-tunnel mode, PCTYPE is identified as a VXLAN
@@ -198,6 +221,16 @@ NOTES:
 Known Issues/Troubleshooting
 ============================
 
+Bonding fails with VFs bound to an Intel(R) Ethernet Controller 700 series device
+---------------------------------------------------------------------------------
+If you bind Virtual Functions (VFs) to an Intel(R) Ethernet Controller 700
+series based device, the VF slaves may fail when they become the active slave.
+If the MAC address of the VF is set by the PF (Physical Function) of the
+device, when you add a slave, or change the active-backup slave, Linux bonding
+tries to sync the backup slave's MAC address to the same MAC address as the
+active slave. Linux bonding will fail at this point. This issue will not occur
+if the VF's MAC address is not set by the PF.
+
 Traffic Is Not Being Passed Between VM and Client
 -------------------------------------------------
 You may not be able to pass traffic between a client system and a
@@ -215,13 +248,28 @@ Do not unload a port's driver if a Virtual Function (VF) with an active Virtual
 Machine (VM) is bound to it. Doing so will cause the port to appear to hang.
 Once the VM shuts down, or otherwise releases the VF, the command will complete.
 
+Using four traffic classes fails
+--------------------------------
+Do not try to reserve more than three traffic classes in the iavf driver. Doing
+so will fail to set any traffic classes and will cause the driver to write
+errors to stdout. Use a maximum of three queues to avoid this issue.
+
+Multiple log error messages on iavf driver removal
+--------------------------------------------------
+If you have several VFs and you remove the iavf driver, several instances of
+the following log errors are written to the log::
+
+    Unable to send opcode 2 to PF, err I40E_ERR_QUEUE_EMPTY, aq_err ok
+    Unable to send the message to VF 2 aq_err 12
+    ARQ Overflow Error detected
+
 Virtual machine does not get link
 ---------------------------------
 If the virtual machine has more than one virtual port assigned to it, and those
 virtual ports are bound to different physical ports, you may not get link on
 all of the virtual ports. The following command may work around the issue::
 
-  ethtool -r <PF>
+    # ethtool -r <PF>
 
 Where <PF> is the PF interface in the host, for example: p5p1. You may need to
 run the command more than once to get link on all virtual ports.
@@ -251,12 +299,13 @@ traffic.
 If you have multiple interfaces in a server, either turn on ARP filtering by
 entering::
 
-  echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
+    # echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
 
-NOTE: This setting is not saved across reboots. The configuration change can be
-made permanent by adding the following line to the file /etc/sysctl.conf::
+NOTE:
+  This setting is not saved across reboots. The configuration change can be
+  made permanent by adding the following line to the file /etc/sysctl.conf::
 
-  net.ipv4.conf.all.arp_filter = 1
+    net.ipv4.conf.all.arp_filter = 1
 
 Another alternative is to install the interfaces in separate broadcast domains
 (either in different switches or in a switch partitioned to VLANs).
-- 
2.21.0


^ permalink raw reply related

* [net-next v2 08/15] iavf: allow permanent MAC address to change
From: Jeff Kirsher @ 2019-09-09 22:47 UTC (permalink / raw)
  To: davem
  Cc: Mitch Williams, netdev, nhorman, sassmann, Andrew Bowers,
	Jeff Kirsher
In-Reply-To: <20190909224802.29595-1-jeffrey.t.kirsher@intel.com>

From: Mitch Williams <mitch.a.williams@intel.com>

Allow the VF to override the "permanent" MAC address set by the host.
This allows bonding to work in the case where the administrator has set
the VF MAC.

Note that the VF must still be set to Trusted on the host if this change
is to be accepted by the PF driver.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/iavf/iavf.h      | 1 -
 drivers/net/ethernet/intel/iavf/iavf_main.c | 4 ----
 2 files changed, 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h
index 9fc635d816d2..29de3ae96ef2 100644
--- a/drivers/net/ethernet/intel/iavf/iavf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -253,7 +253,6 @@ struct iavf_adapter {
 #define IAVF_FLAG_RESET_PENDING		BIT(4)
 #define IAVF_FLAG_RESET_NEEDED		BIT(5)
 #define IAVF_FLAG_WB_ON_ITR_CAPABLE		BIT(6)
-#define IAVF_FLAG_ADDR_SET_BY_PF		BIT(8)
 #define IAVF_FLAG_SERVICE_CLIENT_REQUESTED	BIT(9)
 #define IAVF_FLAG_CLIENT_NEEDS_OPEN		BIT(10)
 #define IAVF_FLAG_CLIENT_NEEDS_CLOSE		BIT(11)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 554aa619ff02..07f5541a0f01 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -790,9 +790,6 @@ static int iavf_set_mac(struct net_device *netdev, void *p)
 	if (ether_addr_equal(netdev->dev_addr, addr->sa_data))
 		return 0;
 
-	if (adapter->flags & IAVF_FLAG_ADDR_SET_BY_PF)
-		return -EPERM;
-
 	spin_lock_bh(&adapter->mac_vlan_list_lock);
 
 	f = iavf_find_filter(adapter, hw->mac.addr);
@@ -1811,7 +1808,6 @@ static int iavf_init_get_resources(struct iavf_adapter *adapter)
 		eth_hw_addr_random(netdev);
 		ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
 	} else {
-		adapter->flags |= IAVF_FLAG_ADDR_SET_BY_PF;
 		ether_addr_copy(netdev->dev_addr, adapter->hw.mac.addr);
 		ether_addr_copy(netdev->perm_addr, adapter->hw.mac.addr);
 	}
-- 
2.21.0


^ permalink raw reply related

* [net-next v2 06/15] fm10k: use a local variable for the frag pointer
From: Jeff Kirsher @ 2019-09-09 22:47 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, Andrew Bowers,
	Jeff Kirsher
In-Reply-To: <20190909224802.29595-1-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

In the function fm10k_xmit_frame_ring, we recently switched to using
the skb_frag_size accessor instead of directly using the size member of
the skb fragment.

This made the for loop slightly harder to read because it created a very
long line that is difficult to split up. Avoid this by using a local
variable in the for loop, so that we do not have to break the line on an
open parenthesis.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index e0a2be534b20..2be9222510e7 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -1073,9 +1073,11 @@ netdev_tx_t fm10k_xmit_frame_ring(struct sk_buff *skb,
 	 *       + 2 desc gap to keep tail from touching head
 	 * otherwise try next time
 	 */
-	for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
-		count += TXD_USE_COUNT(skb_frag_size(
-						&skb_shinfo(skb)->frags[f]));
+	for (f = 0; f < skb_shinfo(skb)->nr_frags; f++) {
+		skb_frag_t *frag = &skb_shinfo(skb)->frags[f];
+
+		count += TXD_USE_COUNT(skb_frag_size(frag));
+	}
 
 	if (fm10k_maybe_stop_tx(tx_ring, count + 3)) {
 		tx_ring->tx_stats.tx_busy++;
-- 
2.21.0


^ permalink raw reply related

* [net-next v2 04/15] igc: Remove useless forward declaration
From: Jeff Kirsher @ 2019-09-09 22:47 UTC (permalink / raw)
  To: davem; +Cc: Sasha Neftin, netdev, nhorman, sassmann, Aaron Brown,
	Jeff Kirsher
In-Reply-To: <20190909224802.29595-1-jeffrey.t.kirsher@intel.com>

From: Sasha Neftin <sasha.neftin@intel.com>

Move igc_phy_setup_autoneg, igc_wait_autoneg and igc_set_fc_watermarks
up to avoid forward declaration.
It is not necessary to forward declare these static methods.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igc/igc_mac.c |  73 +++++----
 drivers/net/ethernet/intel/igc/igc_phy.c | 192 +++++++++++------------
 2 files changed, 129 insertions(+), 136 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_mac.c b/drivers/net/ethernet/intel/igc/igc_mac.c
index ba4646737288..5eeb4c8caf4a 100644
--- a/drivers/net/ethernet/intel/igc/igc_mac.c
+++ b/drivers/net/ethernet/intel/igc/igc_mac.c
@@ -7,9 +7,6 @@
 #include "igc_mac.h"
 #include "igc_hw.h"
 
-/* forward declaration */
-static s32 igc_set_fc_watermarks(struct igc_hw *hw);
-
 /**
  * igc_disable_pcie_master - Disables PCI-express master access
  * @hw: pointer to the HW structure
@@ -74,6 +71,41 @@ void igc_init_rx_addrs(struct igc_hw *hw, u16 rar_count)
 		hw->mac.ops.rar_set(hw, mac_addr, i);
 }
 
+/**
+ * igc_set_fc_watermarks - Set flow control high/low watermarks
+ * @hw: pointer to the HW structure
+ *
+ * Sets the flow control high/low threshold (watermark) registers.  If
+ * flow control XON frame transmission is enabled, then set XON frame
+ * transmission as well.
+ */
+static s32 igc_set_fc_watermarks(struct igc_hw *hw)
+{
+	u32 fcrtl = 0, fcrth = 0;
+
+	/* Set the flow control receive threshold registers.  Normally,
+	 * these registers will be set to a default threshold that may be
+	 * adjusted later by the driver's runtime code.  However, if the
+	 * ability to transmit pause frames is not enabled, then these
+	 * registers will be set to 0.
+	 */
+	if (hw->fc.current_mode & igc_fc_tx_pause) {
+		/* We need to set up the Receive Threshold high and low water
+		 * marks as well as (optionally) enabling the transmission of
+		 * XON frames.
+		 */
+		fcrtl = hw->fc.low_water;
+		if (hw->fc.send_xon)
+			fcrtl |= IGC_FCRTL_XONE;
+
+		fcrth = hw->fc.high_water;
+	}
+	wr32(IGC_FCRTL, fcrtl);
+	wr32(IGC_FCRTH, fcrth);
+
+	return 0;
+}
+
 /**
  * igc_setup_link - Setup flow control and link settings
  * @hw: pointer to the HW structure
@@ -194,41 +226,6 @@ s32 igc_force_mac_fc(struct igc_hw *hw)
 	return ret_val;
 }
 
-/**
- * igc_set_fc_watermarks - Set flow control high/low watermarks
- * @hw: pointer to the HW structure
- *
- * Sets the flow control high/low threshold (watermark) registers.  If
- * flow control XON frame transmission is enabled, then set XON frame
- * transmission as well.
- */
-static s32 igc_set_fc_watermarks(struct igc_hw *hw)
-{
-	u32 fcrtl = 0, fcrth = 0;
-
-	/* Set the flow control receive threshold registers.  Normally,
-	 * these registers will be set to a default threshold that may be
-	 * adjusted later by the driver's runtime code.  However, if the
-	 * ability to transmit pause frames is not enabled, then these
-	 * registers will be set to 0.
-	 */
-	if (hw->fc.current_mode & igc_fc_tx_pause) {
-		/* We need to set up the Receive Threshold high and low water
-		 * marks as well as (optionally) enabling the transmission of
-		 * XON frames.
-		 */
-		fcrtl = hw->fc.low_water;
-		if (hw->fc.send_xon)
-			fcrtl |= IGC_FCRTL_XONE;
-
-		fcrth = hw->fc.high_water;
-	}
-	wr32(IGC_FCRTL, fcrtl);
-	wr32(IGC_FCRTH, fcrth);
-
-	return 0;
-}
-
 /**
  * igc_clear_hw_cntrs_base - Clear base hardware counters
  * @hw: pointer to the HW structure
diff --git a/drivers/net/ethernet/intel/igc/igc_phy.c b/drivers/net/ethernet/intel/igc/igc_phy.c
index 4c8f96a9a148..f4b05af0dd2f 100644
--- a/drivers/net/ethernet/intel/igc/igc_phy.c
+++ b/drivers/net/ethernet/intel/igc/igc_phy.c
@@ -3,10 +3,6 @@
 
 #include "igc_phy.h"
 
-/* forward declaration */
-static s32 igc_phy_setup_autoneg(struct igc_hw *hw);
-static s32 igc_wait_autoneg(struct igc_hw *hw);
-
 /**
  * igc_check_reset_block - Check if PHY reset is blocked
  * @hw: pointer to the HW structure
@@ -207,100 +203,6 @@ s32 igc_phy_hw_reset(struct igc_hw *hw)
 	return ret_val;
 }
 
-/**
- * igc_copper_link_autoneg - Setup/Enable autoneg for copper link
- * @hw: pointer to the HW structure
- *
- * Performs initial bounds checking on autoneg advertisement parameter, then
- * configure to advertise the full capability.  Setup the PHY to autoneg
- * and restart the negotiation process between the link partner.  If
- * autoneg_wait_to_complete, then wait for autoneg to complete before exiting.
- */
-static s32 igc_copper_link_autoneg(struct igc_hw *hw)
-{
-	struct igc_phy_info *phy = &hw->phy;
-	u16 phy_ctrl;
-	s32 ret_val;
-
-	/* Perform some bounds checking on the autoneg advertisement
-	 * parameter.
-	 */
-	phy->autoneg_advertised &= phy->autoneg_mask;
-
-	/* If autoneg_advertised is zero, we assume it was not defaulted
-	 * by the calling code so we set to advertise full capability.
-	 */
-	if (phy->autoneg_advertised == 0)
-		phy->autoneg_advertised = phy->autoneg_mask;
-
-	hw_dbg("Reconfiguring auto-neg advertisement params\n");
-	ret_val = igc_phy_setup_autoneg(hw);
-	if (ret_val) {
-		hw_dbg("Error Setting up Auto-Negotiation\n");
-		goto out;
-	}
-	hw_dbg("Restarting Auto-Neg\n");
-
-	/* Restart auto-negotiation by setting the Auto Neg Enable bit and
-	 * the Auto Neg Restart bit in the PHY control register.
-	 */
-	ret_val = phy->ops.read_reg(hw, PHY_CONTROL, &phy_ctrl);
-	if (ret_val)
-		goto out;
-
-	phy_ctrl |= (MII_CR_AUTO_NEG_EN | MII_CR_RESTART_AUTO_NEG);
-	ret_val = phy->ops.write_reg(hw, PHY_CONTROL, phy_ctrl);
-	if (ret_val)
-		goto out;
-
-	/* Does the user want to wait for Auto-Neg to complete here, or
-	 * check at a later time (for example, callback routine).
-	 */
-	if (phy->autoneg_wait_to_complete) {
-		ret_val = igc_wait_autoneg(hw);
-		if (ret_val) {
-			hw_dbg("Error while waiting for autoneg to complete\n");
-			goto out;
-		}
-	}
-
-	hw->mac.get_link_status = true;
-
-out:
-	return ret_val;
-}
-
-/**
- * igc_wait_autoneg - Wait for auto-neg completion
- * @hw: pointer to the HW structure
- *
- * Waits for auto-negotiation to complete or for the auto-negotiation time
- * limit to expire, which ever happens first.
- */
-static s32 igc_wait_autoneg(struct igc_hw *hw)
-{
-	u16 i, phy_status;
-	s32 ret_val = 0;
-
-	/* Break after autoneg completes or PHY_AUTO_NEG_LIMIT expires. */
-	for (i = PHY_AUTO_NEG_LIMIT; i > 0; i--) {
-		ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS, &phy_status);
-		if (ret_val)
-			break;
-		ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS, &phy_status);
-		if (ret_val)
-			break;
-		if (phy_status & MII_SR_AUTONEG_COMPLETE)
-			break;
-		msleep(100);
-	}
-
-	/* PHY_AUTO_NEG_TIME expiration doesn't guarantee auto-negotiation
-	 * has completed.
-	 */
-	return ret_val;
-}
-
 /**
  * igc_phy_setup_autoneg - Configure PHY for auto-negotiation
  * @hw: pointer to the HW structure
@@ -485,6 +387,100 @@ static s32 igc_phy_setup_autoneg(struct igc_hw *hw)
 	return ret_val;
 }
 
+/**
+ * igc_wait_autoneg - Wait for auto-neg completion
+ * @hw: pointer to the HW structure
+ *
+ * Waits for auto-negotiation to complete or for the auto-negotiation time
+ * limit to expire, which ever happens first.
+ */
+static s32 igc_wait_autoneg(struct igc_hw *hw)
+{
+	u16 i, phy_status;
+	s32 ret_val = 0;
+
+	/* Break after autoneg completes or PHY_AUTO_NEG_LIMIT expires. */
+	for (i = PHY_AUTO_NEG_LIMIT; i > 0; i--) {
+		ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS, &phy_status);
+		if (ret_val)
+			break;
+		ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS, &phy_status);
+		if (ret_val)
+			break;
+		if (phy_status & MII_SR_AUTONEG_COMPLETE)
+			break;
+		msleep(100);
+	}
+
+	/* PHY_AUTO_NEG_TIME expiration doesn't guarantee auto-negotiation
+	 * has completed.
+	 */
+	return ret_val;
+}
+
+/**
+ * igc_copper_link_autoneg - Setup/Enable autoneg for copper link
+ * @hw: pointer to the HW structure
+ *
+ * Performs initial bounds checking on autoneg advertisement parameter, then
+ * configure to advertise the full capability.  Setup the PHY to autoneg
+ * and restart the negotiation process between the link partner.  If
+ * autoneg_wait_to_complete, then wait for autoneg to complete before exiting.
+ */
+static s32 igc_copper_link_autoneg(struct igc_hw *hw)
+{
+	struct igc_phy_info *phy = &hw->phy;
+	u16 phy_ctrl;
+	s32 ret_val;
+
+	/* Perform some bounds checking on the autoneg advertisement
+	 * parameter.
+	 */
+	phy->autoneg_advertised &= phy->autoneg_mask;
+
+	/* If autoneg_advertised is zero, we assume it was not defaulted
+	 * by the calling code so we set to advertise full capability.
+	 */
+	if (phy->autoneg_advertised == 0)
+		phy->autoneg_advertised = phy->autoneg_mask;
+
+	hw_dbg("Reconfiguring auto-neg advertisement params\n");
+	ret_val = igc_phy_setup_autoneg(hw);
+	if (ret_val) {
+		hw_dbg("Error Setting up Auto-Negotiation\n");
+		goto out;
+	}
+	hw_dbg("Restarting Auto-Neg\n");
+
+	/* Restart auto-negotiation by setting the Auto Neg Enable bit and
+	 * the Auto Neg Restart bit in the PHY control register.
+	 */
+	ret_val = phy->ops.read_reg(hw, PHY_CONTROL, &phy_ctrl);
+	if (ret_val)
+		goto out;
+
+	phy_ctrl |= (MII_CR_AUTO_NEG_EN | MII_CR_RESTART_AUTO_NEG);
+	ret_val = phy->ops.write_reg(hw, PHY_CONTROL, phy_ctrl);
+	if (ret_val)
+		goto out;
+
+	/* Does the user want to wait for Auto-Neg to complete here, or
+	 * check at a later time (for example, callback routine).
+	 */
+	if (phy->autoneg_wait_to_complete) {
+		ret_val = igc_wait_autoneg(hw);
+		if (ret_val) {
+			hw_dbg("Error while waiting for autoneg to complete\n");
+			goto out;
+		}
+	}
+
+	hw->mac.get_link_status = true;
+
+out:
+	return ret_val;
+}
+
 /**
  * igc_setup_copper_link - Configure copper link settings
  * @hw: pointer to the HW structure
-- 
2.21.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox