Netdev List
* Re: phys_port_id in switchdev mode?
From: Or Gerlitz @ 2018-09-04 20:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Florian Fainelli, Simon Horman, Andy Gospodarek,
	mchan@broadcom.com, Jiri Pirko, Alexander Duyck, Frederick Botha,
	nick viljoen, Linux Netdev List
In-Reply-To: <20180904122057.46fce83a@cakuba>

On Tue, Sep 4, 2018 at 1:20 PM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> On Mon, 3 Sep 2018 12:40:22 +0300, Or Gerlitz wrote:
>> On Tue, Aug 28, 2018 at 9:05 PM, Jakub Kicinski wrote:
>> > Hi!
>>
>> Hi Jakub, and sorry for the late reply; this crazily hot summer refuses to die.
>>
>> Note I replied a couple of minutes ago but it didn't get to the list, so
>> let's take it from this one:
>>
>> > I wonder if we can use phys_port_id in switchdev to group together
>> > interfaces of a single PCI PF?  Here is the problem:
>> >
>> > With a mix of PF and VF interfaces it gets increasingly difficult to
>> > figure out which one corresponds to which PF.  We can identify which
>> > *representor* is which, by means of phys_port_name and devlink
>> > flavours.  But if the actual VF/PF interfaces are also present on the
>> > same host, it gets confusing when one tries to identify the PF they
>> > came from.  Generally one has to resort to matching between PCI DBDF of
>> > the PF and VFs or read relevant info out of ethtool -i.
>> >
>> > In multi host scenario this is particularly painful, as there seems to
>> > be no immediately obvious way to match PCI interface ID of a card (0,
>> > 1, 2, 3, 4...) to the DBDF we have connected.
>> >
>> > Another angle to this is legacy SR-IOV NDOs.  User space picks a netdev
>> > from /sys/bus/pci/$VF_DBDF/physfn/net/ to run the NDOs on in a somewhat
>> > random manner, which means we have to provide those for all devices with
>> > link to the PF (all reprs).  And we have to link them (a) because it's
>> > right (tm) and (b) to get correct naming.
>>
>> Wait, as you commented later, not only the mellanox vf reprs but also
>> the nfp vf reprs are not linked to the PF, because ip link output
>> grows quadratically.
>
> Right, correct.  If we set phys_port_id libvirt will reliably pick the
> correct netdev to run NDOs on (PF/PF repr) so we can remove them from
> the other netdevs and therefore limit the size of ip link show output.

Just to make sure: this is a suggested/future flow of libvirt, not the existing one?


> Ugh, you're right!  Libvirt is our primary target here.  IIUC we need
> phys_port_id on the actual VF and then *a* netdev linked to physfn in
> sysfs which will have the legacy NDOs.
>
> We can't set the phys_port_id on the VF reprs because then we're back
> to the problem of ip link output growing.  Perhaps we shouldn't set it
> on PF repr either?
>
> Let's make a table (assuming bare metal cloud scenario where Host0 is
> controlling the network, while Host1 is the actual server):

Yeah, this would be a superset of the non-SmartNIC case, where
we have only one host.



[...]


> With this libvirt on Host0 should easily find the actual PF0 netdev to
> run the NDO on, if it wants to use VFs:
>  - libvirt finds act VF0/0 to plug into the VM;
>  - reads its phys_port_id -> "PF0 SN";
>  - finds netdev with "PF0 SN" linked to physfn -> "act PF0";
>  - runs NDOs on "act PF0" for PF0's VF correctly.

What you describe here doesn't seem to be networking
configuration, as it deals only with VFs and the PF but not with reprs,
and hence, AFAIK, runs on Host1.

[...]

> Should Host0 in bare metal cloud have access to SR-IOV NDOs of Host1?

I need to think on that

^ permalink raw reply
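
The lookup steps Jakub lists above (read the VF's phys_port_id, then find the netdev under physfn with the same value) can be sketched in userspace C roughly as follows. This is only a minimal sketch under stated assumptions: phys_port_id is read from the per-netdev sysfs attribute of that name, the directory passed to find_pf_netdev() is whatever physfn/net directory the caller has resolved, and the helper names read_attr() and find_pf_netdev() are hypothetical, not taken from libvirt.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a sysfs attribute into buf, stripping the newline.
 * Returns 0 on success, -1 if the file is missing or unreadable (a driver
 * that does not implement phys_port_id makes the read fail). */
static int read_attr(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Scan the netdevs listed under physfn_net_dir (hypothetical argument: the
 * VF's physfn/net directory) and copy the name of the first one whose
 * phys_port_id equals want_id into out. Returns 0 if found, -1 otherwise. */
static int find_pf_netdev(const char *physfn_net_dir, const char *want_id,
			  char *out, size_t outlen)
{
	char path[512], id[128];
	struct dirent *de;
	DIR *d = opendir(physfn_net_dir);
	int ret = -1;

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), "%s/%s/phys_port_id",
			 physfn_net_dir, de->d_name);
		if (read_attr(path, id, sizeof(id)) == 0 &&
		    strcmp(id, want_id) == 0) {
			snprintf(out, outlen, "%s", de->d_name);
			ret = 0;
			break;
		}
	}
	closedir(d);
	return ret;
}
```

A caller would first use read_attr() on the VF netdev's phys_port_id and then pass that string as want_id; with the proposal in this thread only the actual PF netdev would match, so the legacy NDOs would be run on that one.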

* Re: [PATCH RFC net-next] net: Poptrie based routing table lookup
From: Md. Islam @ 2018-09-04 20:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, David Miller, David Ahern, Alexey Kuznetsov,
	alexei.starovoitov, Stephen Hemminger, makita.toshiaki, panda,
	yasuhiro.ohara, Eric Dumazet, john fastabend
In-Reply-To: <CAFgPn1AFUKgGdMArXtfCYQfHxO6nzOYcaPFgN-8ref4HBrMcuQ@mail.gmail.com>

On Tue, Sep 4, 2018 at 12:14 PM, Md. Islam <mislam4@kent.edu> wrote:
>
> On Tue, Sep 4, 2018, 6:53 AM Jesper Dangaard Brouer <brouer@redhat.com>
> wrote:
>>
>> Hi Md. Islam,
>>
>> People will start to ignore you, when you don't interact appropriately
>> with the community, and you ignore their advice, especially when it is
>> about how to interact with the community[1].
>>
>> You have not addressed any of my feedback on your patch in [1].
>>  [1]
>> http://www.mail-archive.com/search?l=mid&q=20180827173334.16ff0673@redhat.com
>
>
> Jesper,
>
> I actually addressed all the feedback on the previous patch except TOS,
> FIB metrics, etc. This is because I don't think they are relevant in
> this use case. Please let me know if I am wrong.
>
> Thanks

Jesper

Sorry, I missed your review in the first place. I will take a look and
resubmit the patch.

Thanks

>>
>>
>>
>> --
>> Best regards,
>>   Jesper Dangaard Brouer
>>   MSc.CS, Principal Kernel Engineer at Red Hat
>>   LinkedIn: http://www.linkedin.com/in/brouer
>>
>> p.s. also top-posting is bad, but I suspect you will not read my
>> response if I don't top-post.
>>
>>
>> On Tue, 4 Sep 2018 01:02:30 -0400 "Md. Islam" <mislam4@kent.edu> wrote:
>>
>> > This patch implements Poptrie based routing table
>> > lookup/insert/delete/flush. Currently many carrier routers use kernel
>> > bypass frameworks such as DPDK and VPP to implement the data plane.
>> > XDP along with this patch will enable Linux to work as such a router.
>> > Currently it supports up to 255 ports. Many real-world backbone routers
>> > have up to 233 ports (to the best of my knowledge), so it seems to be
>> > sufficient at this moment.
>> >
>> > I have also attached a draft paper to explain how it works (poptrie.pdf).
>> > Please set CONFIG_FIB_POPTRIE=y (default n) before testing the patch.
>> > Note that, poptrie_lookup() is not being called from anywhere. It will
>> > be used by XDP forwarding.
>> >
>> >
>> > From 3dc9683298ed896dd3080733503c35d68f05370e Mon Sep 17 00:00:00 2001
>> > From: tamimcse <tamim@csebuet.org>
>> > Date: Mon, 3 Sep 2018 23:56:43 -0400
>> > Subject: [PATCH] Poptrie based routing table lookup
>> >
>> > Signed-off-by: tamimcse <tamim@csebuet.org>
>> > ---
>> >  include/net/ip_fib.h   |  42 +++++
>> >  net/ipv4/Kconfig       |   4 +
>> >  net/ipv4/Makefile      |   1 +
>> >  net/ipv4/fib_poptrie.c | 483 +++++++++++++++++++++++++++++++++++++++++++++++++
>> >  net/ipv4/fib_trie.c    |  12 ++
>> >  5 files changed, 542 insertions(+)
>> >  create mode 100644 net/ipv4/fib_poptrie.c
>>
>> First order of business: you need to conform to the kernel's coding
>> standards!
>>
>> https://www.kernel.org/doc/html/v4.18/process/coding-style.html
>>
>> There is a script available to check this, called scripts/checkpatch.pl.
>> Its summary says:
>>  total: 139 errors, 238 warnings, 6 checks, 372 lines checked
>> (Not good, more errors+warnings than lines...)
>>
>> Please fix those up... else people will not even read your code!
>>
>

^ permalink raw reply

* Re: [RFC/PATCH] net: nixge: Add PHYLINK support
From: Andrew Lunn @ 2018-09-05  1:01 UTC (permalink / raw)
  To: Moritz Fischer
  Cc: netdev, davem, f.fainelli, alex.williams, moritz.fischer,
	linux-kernel
In-Reply-To: <20180905001535.19168-1-mdf@kernel.org>

> 3) I'm again not sure about the 'select PHYLINK', wouldn't
>    wanna break the build again...

Hi Moritz

I think it is safe. PHYLINK has no stated dependencies on OF. But I
suspect it is currently pretty useless without OF.

> @@ -1286,7 +1329,13 @@ static int nixge_probe(struct platform_device *pdev)
>  	priv->coalesce_count_rx = XAXIDMA_DFT_RX_THRESHOLD;
>  	priv->coalesce_count_tx = XAXIDMA_DFT_TX_THRESHOLD;
>  
> -	err = nixge_mdio_setup(priv, pdev->dev.of_node);
> +	mn = of_get_child_by_name(pdev->dev.of_node, "mdio");
> +	if (!mn) {
> +		dev_warn(&pdev->dev, "No \"mdio\" subnode found, defaulting to legacy\n");
> +		mn = pdev->dev.of_node;
> +	}
> +
> +	err = nixge_mdio_setup(priv, mn);

I would suggest making this a patch of its own.

Also, do you need the legacy behaviour? If there are no boards out in
the wild which this will break, just make the change.

Please also update the device tree binding documentation.

       Andrew

^ permalink raw reply

* Re: [RFC/PATCH] net: nixge: Add PHYLINK support
From: Florian Fainelli @ 2018-09-05  0:27 UTC (permalink / raw)
  To: Moritz Fischer, netdev
  Cc: davem, andrew, alex.williams, moritz.fischer, linux-kernel
In-Reply-To: <20180905001535.19168-1-mdf@kernel.org>

On 09/04/2018 05:15 PM, Moritz Fischer wrote:
> Add basic PHYLINK support to driver.
> 
> Suggested-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Moritz Fischer <mdf@kernel.org>
> ---
> 
> Hi all,
> 
> as Andrew suggested in order to enable SFP as
> well as fixed-link support add PHYLINK support.
> 
> A couple of questions are still open (hence the RFC):
> 
> 1) It seems odd to implement PHYLINK callbacks that
>    are all empty? If so, should we have generic empty
>    ones in drivers/net/phy/phylink.c like we have for
>    genphys?

Yes it is odd, the validate callback most certainly should not be empty,
neither should the mac_config and mac_link_{up,down}, but, with some
luck, you can get things to just work, typically with MDIO PHYs, since a
large amount of what they can do is discoverable.

If you had an existing adjust_link callback from PHYLIB, it's really
about breaking it down such that the MAC configuration of
speed/duplex/pause happens in mac_config, and the link setting (if
necessary), happens in mac_link_{up,down}, and that's pretty much it for
MLO_AN_PHY cases.

> 
> 2) If this is ok, then I'll go ahead rework this with
>    a DT binding update to deprecate the non-'mdio'-subnode
>    case (since there are no in-tree users we might just
>    change the binding)?
> 
> 3) I'm again not sure about the 'select PHYLINK', wouldn't
>    wanna break the build again...
> 
> Thanks again for your time!
> 
> Moritz
> 
> ---
>  drivers/net/ethernet/ni/Kconfig |   1 +
>  drivers/net/ethernet/ni/nixge.c | 115 +++++++++++++++++++++++---------
>  2 files changed, 83 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ni/Kconfig b/drivers/net/ethernet/ni/Kconfig
> index c73978474c4b..80cd72948551 100644
> --- a/drivers/net/ethernet/ni/Kconfig
> +++ b/drivers/net/ethernet/ni/Kconfig
> @@ -21,6 +21,7 @@ config NI_XGE_MANAGEMENT_ENET
>  	depends on HAS_IOMEM && HAS_DMA
>  	select PHYLIB
>  	select OF_MDIO if OF
> +	select PHYLINK
>  	help
>  	  Simple LAN device for debug or management purposes. Can
>  	  support either 10G or 1G PHYs via SFP+ ports.
> diff --git a/drivers/net/ethernet/ni/nixge.c b/drivers/net/ethernet/ni/nixge.c
> index 74cf52e3fb09..a0e790d07b1c 100644
> --- a/drivers/net/ethernet/ni/nixge.c
> +++ b/drivers/net/ethernet/ni/nixge.c
> @@ -11,6 +11,7 @@
>  #include <linux/of_mdio.h>
>  #include <linux/of_net.h>
>  #include <linux/of_platform.h>
> +#include <linux/phylink.h>
>  #include <linux/of_irq.h>
>  #include <linux/skbuff.h>
>  #include <linux/phy.h>
> @@ -165,7 +166,7 @@ struct nixge_priv {
>  	struct device *dev;
>  
>  	/* Connection to PHY device */
> -	struct device_node *phy_node;
> +	struct phylink *phylink;
>  	phy_interface_t		phy_mode;
>  
>  	int link;
> @@ -416,20 +417,6 @@ static void nixge_device_reset(struct net_device *ndev)
>  	netif_trans_update(ndev);
>  }
>  
> -static void nixge_handle_link_change(struct net_device *ndev)
> -{
> -	struct nixge_priv *priv = netdev_priv(ndev);
> -	struct phy_device *phydev = ndev->phydev;
> -
> -	if (phydev->link != priv->link || phydev->speed != priv->speed ||
> -	    phydev->duplex != priv->duplex) {
> -		priv->link = phydev->link;
> -		priv->speed = phydev->speed;
> -		priv->duplex = phydev->duplex;
> -		phy_print_status(phydev);
> -	}
> -}
> -
>  static void nixge_tx_skb_unmap(struct nixge_priv *priv,
>  			       struct nixge_tx_skb *tx_skb)
>  {
> @@ -859,17 +846,15 @@ static void nixge_dma_err_handler(unsigned long data)
>  static int nixge_open(struct net_device *ndev)
>  {
>  	struct nixge_priv *priv = netdev_priv(ndev);
> -	struct phy_device *phy;
>  	int ret;
>  
>  	nixge_device_reset(ndev);
>  
> -	phy = of_phy_connect(ndev, priv->phy_node,
> -			     &nixge_handle_link_change, 0, priv->phy_mode);
> -	if (!phy)
> -		return -ENODEV;
> +	ret = phylink_of_phy_connect(priv->phylink, priv->dev->of_node, 0);
> +	if (ret < 0)
> +		return ret;
>  
> -	phy_start(phy);
> +	phylink_start(priv->phylink);
>  
>  	/* Enable tasklets for Axi DMA error handling */
>  	tasklet_init(&priv->dma_err_tasklet, nixge_dma_err_handler,
> @@ -893,8 +878,7 @@ static int nixge_open(struct net_device *ndev)
>  err_rx_irq:
>  	free_irq(priv->tx_irq, ndev);
>  err_tx_irq:
> -	phy_stop(phy);
> -	phy_disconnect(phy);
> +	phylink_disconnect_phy(priv->phylink);
>  	tasklet_kill(&priv->dma_err_tasklet);
>  	netdev_err(ndev, "request_irq() failed\n");
>  	return ret;
> @@ -908,9 +892,9 @@ static int nixge_stop(struct net_device *ndev)
>  	netif_stop_queue(ndev);
>  	napi_disable(&priv->napi);
>  
> -	if (ndev->phydev) {
> -		phy_stop(ndev->phydev);
> -		phy_disconnect(ndev->phydev);
> +	if (priv->phylink) {
> +		phylink_stop(priv->phylink);
> +		phylink_disconnect_phy(priv->phylink);
>  	}
>  
>  	cr = nixge_dma_read_reg(priv, XAXIDMA_RX_CR_OFFSET);
> @@ -1076,13 +1060,31 @@ static int nixge_ethtools_set_phys_id(struct net_device *ndev,
>  	return 0;
>  }
>  
> +static int
> +nixge_ethtool_set_link_ksettings(struct net_device *ndev,
> +				 const struct ethtool_link_ksettings *cmd)
> +{
> +	struct nixge_priv *priv = netdev_priv(ndev);
> +
> +	return phylink_ethtool_ksettings_set(priv->phylink, cmd);
> +}
> +
> +static int
> +nixge_ethtool_get_link_ksettings(struct net_device *ndev,
> +				 struct ethtool_link_ksettings *cmd)
> +{
> +	struct nixge_priv *priv = netdev_priv(ndev);
> +
> +	return phylink_ethtool_ksettings_get(priv->phylink, cmd);
> +}
> +
>  static const struct ethtool_ops nixge_ethtool_ops = {
>  	.get_drvinfo    = nixge_ethtools_get_drvinfo,
>  	.get_coalesce   = nixge_ethtools_get_coalesce,
>  	.set_coalesce   = nixge_ethtools_set_coalesce,
>  	.set_phys_id    = nixge_ethtools_set_phys_id,
> -	.get_link_ksettings     = phy_ethtool_get_link_ksettings,
> -	.set_link_ksettings     = phy_ethtool_set_link_ksettings,
> +	.get_link_ksettings     = nixge_ethtool_get_link_ksettings,
> +	.set_link_ksettings     = nixge_ethtool_set_link_ksettings,
>  	.get_link		= ethtool_op_get_link,
>  };
>  
> @@ -1225,11 +1227,52 @@ static void *nixge_get_nvmem_address(struct device *dev)
>  	return mac;
>  }
>  
> +static void nixge_validate(struct net_device *ndev, unsigned long *supported,
> +			   struct phylink_link_state *state)
> +{
> +}
> +
> +static int nixge_mac_link_state(struct net_device *ndev,
> +				struct phylink_link_state *state)
> +{
> +	return 0;
> +}
> +
> +static void nixge_mac_config(struct net_device *ndev, unsigned int mode,
> +			     const struct phylink_link_state *state)
> +{
> +}
> +
> +static void nixge_mac_an_restart(struct net_device *ndev)
> +{
> +}
> +
> +static void nixge_mac_link_down(struct net_device *ndev, unsigned int mode,
> +				phy_interface_t interface)
> +{
> +}
> +
> +static void nixge_mac_link_up(struct net_device *ndev, unsigned int mode,
> +			      phy_interface_t interface,
> +			      struct phy_device *phy)
> +{
> +}
> +
> +static const struct phylink_mac_ops nixge_phylink_ops = {
> +	.validate = nixge_validate,
> +	.mac_link_state = nixge_mac_link_state,
> +	.mac_an_restart = nixge_mac_an_restart,
> +	.mac_config = nixge_mac_config,
> +	.mac_link_down = nixge_mac_link_down,
> +	.mac_link_up = nixge_mac_link_up,
> +};
> +
>  static int nixge_probe(struct platform_device *pdev)
>  {
>  	struct nixge_priv *priv;
>  	struct net_device *ndev;
>  	struct resource *dmares;
> +	struct device_node *mn;
>  	const u8 *mac_addr;
>  	int err;
>  
> @@ -1286,7 +1329,13 @@ static int nixge_probe(struct platform_device *pdev)
>  	priv->coalesce_count_rx = XAXIDMA_DFT_RX_THRESHOLD;
>  	priv->coalesce_count_tx = XAXIDMA_DFT_TX_THRESHOLD;
>  
> -	err = nixge_mdio_setup(priv, pdev->dev.of_node);
> +	mn = of_get_child_by_name(pdev->dev.of_node, "mdio");
> +	if (!mn) {
> +		dev_warn(&pdev->dev, "No \"mdio\" subnode found, defaulting to legacy\n");
> +		mn = pdev->dev.of_node;
> +	}
> +
> +	err = nixge_mdio_setup(priv, mn);
>  	if (err) {
>  		netdev_err(ndev, "error registering mdio bus");
>  		goto free_netdev;
> @@ -1299,10 +1348,10 @@ static int nixge_probe(struct platform_device *pdev)
>  		goto unregister_mdio;
>  	}
>  
> -	priv->phy_node = of_parse_phandle(pdev->dev.of_node, "phy-handle", 0);
> -	if (!priv->phy_node) {
> -		netdev_err(ndev, "not find \"phy-handle\" property\n");
> -		err = -EINVAL;
> +	priv->phylink = phylink_create(ndev, pdev->dev.fwnode, priv->phy_mode,
> +				       &nixge_phylink_ops);
> +	if (IS_ERR(priv->phylink)) {
> +		err = PTR_ERR(priv->phylink);
>  		goto unregister_mdio;
>  	}
>  
> 


-- 
Florian

^ permalink raw reply
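
The split Florian describes above (speed/duplex/pause programmed in mac_config, the link setting handled in mac_link_{up,down}) can be illustrated with a small userspace model. This is a sketch only: the struct and function names below are hypothetical stand-ins, not the kernel's phylink types or callback signatures, and nothing here reflects the actual nixge register layout.

```c
#include <stdbool.h>

/* Userspace stand-ins for illustration; these are NOT the kernel's
 * struct phylink_link_state or any nixge hardware state. */
struct model_link_state {
	int speed;		/* Mbit/s */
	bool full_duplex;
	bool pause;
};

struct model_mac {
	int speed;
	bool full_duplex;
	bool pause;
	bool link_enabled;
};

/* Plays the role of mac_config: program speed/duplex/pause, nothing else. */
static void model_mac_config(struct model_mac *mac,
			     const struct model_link_state *state)
{
	mac->speed = state->speed;
	mac->full_duplex = state->full_duplex;
	mac->pause = state->pause;
}

/* Plays the role of mac_link_up: only the link setting changes here. */
static void model_mac_link_up(struct model_mac *mac)
{
	mac->link_enabled = true;
}

/* Plays the role of mac_link_down: configuration is left untouched. */
static void model_mac_link_down(struct model_mac *mac)
{
	mac->link_enabled = false;
}
```

The point of the separation is that a link flap only toggles link_enabled, while the negotiated parameters are written once per configuration change, which is what an old monolithic adjust_link callback would have done in a single function.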

* [PATCH net-next v2 7/9] rtnetlink: s/IFLA_IF_NETNSID/IFLA_TARGET_NETNSID/g
From: Christian Brauner @ 2018-09-04 19:53 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, kuznet, yoshfuji, pombredanne, kstewart, gregkh, dsahern,
	fw, ktkhai, lucien.xin, jakub.kicinski, jbenc, nicolas.dichtel,
	Christian Brauner
In-Reply-To: <20180904195355.4695-1-christian@brauner.io>

IFLA_TARGET_NETNSID is the new alias for IFLA_IF_NETNSID. This commit
replaces all occurrences of IFLA_IF_NETNSID with the new alias to
indicate that this identifier is the preferred one.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Jiri Benc <jbenc@redhat.com>
---
v1->v2:
- patch added

v0->v1:
- patch not present
---
 net/core/rtnetlink.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index b36dab7507a0..67d7898db346 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1012,7 +1012,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4)  /* IFLA_NEW_NETNSID */
 	       + nla_total_size(4)  /* IFLA_NEW_IFINDEX */
 	       + nla_total_size(1)  /* IFLA_PROTO_DOWN */
-	       + nla_total_size(4)  /* IFLA_IF_NETNSID */
+	       + nla_total_size(4)  /* IFLA_TARGET_NETNSID */
 	       + nla_total_size(4)  /* IFLA_CARRIER_UP_COUNT */
 	       + nla_total_size(4)  /* IFLA_CARRIER_DOWN_COUNT */
 	       + nla_total_size(4)  /* IFLA_MIN_MTU */
@@ -1594,7 +1594,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	ifm->ifi_flags = dev_get_flags(dev);
 	ifm->ifi_change = change;
 
-	if (tgt_netnsid >= 0 && nla_put_s32(skb, IFLA_IF_NETNSID, tgt_netnsid))
+	if (tgt_netnsid >= 0 && nla_put_s32(skb, IFLA_TARGET_NETNSID, tgt_netnsid))
 		goto nla_put_failure;
 
 	if (nla_put_string(skb, IFLA_IFNAME, dev->name) ||
@@ -1733,7 +1733,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_XDP]		= { .type = NLA_NESTED },
 	[IFLA_EVENT]		= { .type = NLA_U32 },
 	[IFLA_GROUP]		= { .type = NLA_U32 },
-	[IFLA_IF_NETNSID]	= { .type = NLA_S32 },
+	[IFLA_TARGET_NETNSID]	= { .type = NLA_S32 },
 	[IFLA_CARRIER_UP_COUNT]	= { .type = NLA_U32 },
 	[IFLA_CARRIER_DOWN_COUNT] = { .type = NLA_U32 },
 	[IFLA_MIN_MTU]		= { .type = NLA_U32 },
@@ -1900,8 +1900,8 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 
 	if (nlmsg_parse(cb->nlh, hdrlen, tb, IFLA_MAX,
 			ifla_policy, NULL) >= 0) {
-		if (tb[IFLA_IF_NETNSID]) {
-			netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
+		if (tb[IFLA_TARGET_NETNSID]) {
+			netnsid = nla_get_s32(tb[IFLA_TARGET_NETNSID]);
 			tgt_net = rtnl_get_net_ns_capable(skb->sk, netnsid);
 			if (IS_ERR(tgt_net)) {
 				tgt_net = net;
@@ -1989,7 +1989,7 @@ EXPORT_SYMBOL(rtnl_link_get_net);
  *
  * 1. IFLA_NET_NS_PID
  * 2. IFLA_NET_NS_FD
- * 3. IFLA_IF_NETNSID
+ * 3. IFLA_TARGET_NETNSID
  */
 static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net,
 					       struct nlattr *tb[])
@@ -1999,10 +1999,10 @@ static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net,
 	if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD])
 		return rtnl_link_get_net(src_net, tb);
 
-	if (!tb[IFLA_IF_NETNSID])
+	if (!tb[IFLA_TARGET_NETNSID])
 		return get_net(src_net);
 
-	net = get_net_ns_by_id(src_net, nla_get_u32(tb[IFLA_IF_NETNSID]));
+	net = get_net_ns_by_id(src_net, nla_get_u32(tb[IFLA_TARGET_NETNSID]));
 	if (!net)
 		return ERR_PTR(-EINVAL);
 
@@ -2043,13 +2043,13 @@ static int rtnl_ensure_unique_netns(struct nlattr *tb[],
 		return -EOPNOTSUPP;
 	}
 
-	if (tb[IFLA_IF_NETNSID] && (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD]))
+	if (tb[IFLA_TARGET_NETNSID] && (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD]))
 		goto invalid_attr;
 
-	if (tb[IFLA_NET_NS_PID] && (tb[IFLA_IF_NETNSID] || tb[IFLA_NET_NS_FD]))
+	if (tb[IFLA_NET_NS_PID] && (tb[IFLA_TARGET_NETNSID] || tb[IFLA_NET_NS_FD]))
 		goto invalid_attr;
 
-	if (tb[IFLA_NET_NS_FD] && (tb[IFLA_IF_NETNSID] || tb[IFLA_NET_NS_PID]))
+	if (tb[IFLA_NET_NS_FD] && (tb[IFLA_TARGET_NETNSID] || tb[IFLA_NET_NS_PID]))
 		goto invalid_attr;
 
 	return 0;
@@ -2325,7 +2325,7 @@ static int do_setlink(const struct sk_buff *skb,
 	if (err < 0)
 		return err;
 
-	if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD] || tb[IFLA_IF_NETNSID]) {
+	if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD] || tb[IFLA_TARGET_NETNSID]) {
 		struct net *net = rtnl_link_get_net_capable(skb, dev_net(dev),
 							    tb, CAP_NET_ADMIN);
 		if (IS_ERR(net)) {
@@ -2768,8 +2768,8 @@ static int rtnl_dellink(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (tb[IFLA_IFNAME])
 		nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
 
-	if (tb[IFLA_IF_NETNSID]) {
-		netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
+	if (tb[IFLA_TARGET_NETNSID]) {
+		netnsid = nla_get_s32(tb[IFLA_TARGET_NETNSID]);
 		tgt_net = rtnl_get_net_ns_capable(NETLINK_CB(skb).sk, netnsid);
 		if (IS_ERR(tgt_net))
 			return PTR_ERR(tgt_net);
@@ -3178,8 +3178,8 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (err < 0)
 		return err;
 
-	if (tb[IFLA_IF_NETNSID]) {
-		netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
+	if (tb[IFLA_TARGET_NETNSID]) {
+		netnsid = nla_get_s32(tb[IFLA_TARGET_NETNSID]);
 		tgt_net = rtnl_get_net_ns_capable(NETLINK_CB(skb).sk, netnsid);
 		if (IS_ERR(tgt_net))
 			return PTR_ERR(tgt_net);
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 5/9] rtnetlink: move type calculation out of loop
From: Christian Brauner @ 2018-09-04 19:53 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, kuznet, yoshfuji, pombredanne, kstewart, gregkh, dsahern,
	fw, ktkhai, lucien.xin, jakub.kicinski, jbenc, nicolas.dichtel,
	Christian Brauner
In-Reply-To: <20180904195355.4695-1-christian@brauner.io>

I don't see how the type - which is one of
RTM_{GETADDR,GETROUTE,GETNETCONF} - can change. So do the message type
calculation once before entering the for loop.

Signed-off-by: Christian Brauner <christian@brauner.io>
---
v1->v2:
- unchanged

v0->v1:
- unchanged
---
 net/core/rtnetlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 30645d9a9801..b36dab7507a0 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3265,13 +3265,13 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	int idx;
 	int s_idx = cb->family;
+	int type = cb->nlh->nlmsg_type - RTM_BASE;
 
 	if (s_idx == 0)
 		s_idx = 1;
 
 	for (idx = 1; idx <= RTNL_FAMILY_MAX; idx++) {
 		struct rtnl_link **tab;
-		int type = cb->nlh->nlmsg_type-RTM_BASE;
 		struct rtnl_link *link;
 		rtnl_dumpit_func dumpit;
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next v2 3/9] ipv4: enable IFA_TARGET_NETNSID for RTM_GETADDR
From: Christian Brauner @ 2018-09-04 19:53 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, kuznet, yoshfuji, pombredanne, kstewart, gregkh, dsahern,
	fw, ktkhai, lucien.xin, jakub.kicinski, jbenc, nicolas.dichtel,
	Christian Brauner
In-Reply-To: <20180904195355.4695-1-christian@brauner.io>

- Backwards Compatibility:
  If userspace wants to determine whether ipv4 RTM_GETADDR requests
  support the new IFA_TARGET_NETNSID property it should verify that the
  reply includes the IFA_TARGET_NETNSID property. If it does not,
  userspace should assume that IFA_TARGET_NETNSID is not supported for
  ipv4 RTM_GETADDR requests on this kernel.
- From what I gather from current userspace tools that make use of
  RTM_GETADDR requests some of them pass down struct ifinfomsg when they
  should actually pass down struct ifaddrmsg. To not break existing
  tools that pass down the wrong struct we will do the same as for
  RTM_GETLINK | NLM_F_DUMP requests and not error out when the
  nlmsg_parse() fails.

- Security:
  Callers must have CAP_NET_ADMIN in the owning user namespace of the
  target network namespace.

Signed-off-by: Christian Brauner <christian@brauner.io>
---
v1->v2:
- rename from IFA_IF_NETNSID to IFA_TARGET_NETNSID

v0->v1:
- unchanged
---
 net/ipv4/devinet.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index ea4bd8a52422..5cb849300b81 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -100,6 +100,7 @@ static const struct nla_policy ifa_ipv4_policy[IFA_MAX+1] = {
 	[IFA_CACHEINFO]		= { .len = sizeof(struct ifa_cacheinfo) },
 	[IFA_FLAGS]		= { .type = NLA_U32 },
 	[IFA_RT_PRIORITY]	= { .type = NLA_U32 },
+	[IFA_TARGET_NETNSID]	= { .type = NLA_S32 },
 };
 
 #define IN4_ADDR_HSIZE_SHIFT	8
@@ -1584,7 +1585,8 @@ static int put_cacheinfo(struct sk_buff *skb, unsigned long cstamp,
 }
 
 static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa,
-			    u32 portid, u32 seq, int event, unsigned int flags)
+			    u32 portid, u32 seq, int event, unsigned int flags,
+			    int netnsid)
 {
 	struct ifaddrmsg *ifm;
 	struct nlmsghdr  *nlh;
@@ -1601,6 +1603,9 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa,
 	ifm->ifa_scope = ifa->ifa_scope;
 	ifm->ifa_index = ifa->ifa_dev->dev->ifindex;
 
+	if (netnsid >= 0 && nla_put_s32(skb, IFA_TARGET_NETNSID, netnsid))
+		goto nla_put_failure;
+
 	if (!(ifm->ifa_flags & IFA_F_PERMANENT)) {
 		preferred = ifa->ifa_preferred_lft;
 		valid = ifa->ifa_valid_lft;
@@ -1648,6 +1653,9 @@ static int inet_fill_ifaddr(struct sk_buff *skb, struct in_ifaddr *ifa,
 static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct net *net = sock_net(skb->sk);
+	struct nlattr *tb[IFA_MAX+1];
+	struct net *tgt_net = net;
+	int netnsid = -1;
 	int h, s_h;
 	int idx, s_idx;
 	int ip_idx, s_ip_idx;
@@ -1660,12 +1668,23 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	s_idx = idx = cb->args[1];
 	s_ip_idx = ip_idx = cb->args[2];
 
+	if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
+			ifa_ipv4_policy, NULL) >= 0) {
+		if (tb[IFA_TARGET_NETNSID]) {
+			netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);
+
+			tgt_net = rtnl_get_net_ns_capable(skb->sk, netnsid);
+			if (IS_ERR(tgt_net))
+				return PTR_ERR(tgt_net);
+		}
+	}
+
 	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
 		idx = 0;
-		head = &net->dev_index_head[h];
+		head = &tgt_net->dev_index_head[h];
 		rcu_read_lock();
-		cb->seq = atomic_read(&net->ipv4.dev_addr_genid) ^
-			  net->dev_base_seq;
+		cb->seq = atomic_read(&tgt_net->ipv4.dev_addr_genid) ^
+			  tgt_net->dev_base_seq;
 		hlist_for_each_entry_rcu(dev, head, index_hlist) {
 			if (idx < s_idx)
 				goto cont;
@@ -1680,9 +1699,10 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 				if (ip_idx < s_ip_idx)
 					continue;
 				if (inet_fill_ifaddr(skb, ifa,
-					     NETLINK_CB(cb->skb).portid,
-					     cb->nlh->nlmsg_seq,
-					     RTM_NEWADDR, NLM_F_MULTI) < 0) {
+						     NETLINK_CB(cb->skb).portid,
+						     cb->nlh->nlmsg_seq,
+						     RTM_NEWADDR, NLM_F_MULTI,
+						     netnsid) < 0) {
 					rcu_read_unlock();
 					goto done;
 				}
@@ -1698,6 +1718,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	cb->args[0] = h;
 	cb->args[1] = idx;
 	cb->args[2] = ip_idx;
+	if (netnsid >= 0)
+		put_net(tgt_net);
 
 	return skb->len;
 }
@@ -1715,7 +1737,7 @@ static void rtmsg_ifa(int event, struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 	if (!skb)
 		goto errout;
 
-	err = inet_fill_ifaddr(skb, ifa, portid, seq, event, 0);
+	err = inet_fill_ifaddr(skb, ifa, portid, seq, event, 0, -1);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in inet_nlmsg_size() */
 		WARN_ON(err == -EMSGSIZE);
-- 
2.17.1

^ permalink raw reply related
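
The backwards-compatibility check described in the commit message above (userspace verifies that the RTM_GETADDR reply carries IFA_TARGET_NETNSID) might look roughly like this on the userspace side. It is a minimal sketch, assuming the attribute value is 10 (it is added directly after IFA_RT_PRIORITY in the 2/9 patch); the function name reply_has_target_netnsid() and the SKETCH_ constant are hypothetical, chosen so the code also builds against headers that predate this series.

```c
#include <linux/if_addr.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>

/* Assumption: IFA_TARGET_NETNSID follows IFA_RT_PRIORITY (9) in the uapi
 * enum, as in the 2/9 patch. A private name avoids clashing with headers
 * that already define the enumerator. */
#define SKETCH_IFA_TARGET_NETNSID 10

/* Walk the attributes of one RTM_NEWADDR reply message. Returns 1 and
 * stores the id in *netnsid if the target-netnsid attribute is present,
 * 0 otherwise (the caller then treats the feature as unsupported). */
static int reply_has_target_netnsid(struct nlmsghdr *nlh, int *netnsid)
{
	struct ifaddrmsg *ifm;
	struct rtattr *rta;
	int len;

	if (nlh->nlmsg_type != RTM_NEWADDR)
		return 0;
	if (nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*ifm)))
		return 0;

	ifm = NLMSG_DATA(nlh);
	len = (int)(nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*ifm)));
	for (rta = IFA_RTA(ifm); RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
		if (rta->rta_type != SKETCH_IFA_TARGET_NETNSID)
			continue;
		if (RTA_PAYLOAD(rta) < (int)sizeof(*netnsid))
			return 0;
		memcpy(netnsid, RTA_DATA(rta), sizeof(*netnsid));
		return 1;
	}
	return 0;
}
```

A tool would send the dump request with the attribute set and run this over each reply message; if no reply carries the attribute, it should assume the kernel ignored the request's target namespace and fall back accordingly.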

* [PATCH net-next v2 2/9] if_addr: add IFA_TARGET_NETNSID
From: Christian Brauner @ 2018-09-04 19:53 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: davem, kuznet, yoshfuji, pombredanne, kstewart, gregkh, dsahern,
	fw, ktkhai, lucien.xin, jakub.kicinski, jbenc, nicolas.dichtel,
	Christian Brauner
In-Reply-To: <20180904195355.4695-1-christian@brauner.io>

This adds a new IFA_TARGET_NETNSID property to be used by address
families such as PF_INET and PF_INET6.
The IFA_TARGET_NETNSID property can be used to send a network namespace
identifier as part of a request. If an IFA_TARGET_NETNSID property is
identified it will be used to retrieve the target network namespace in
which the request is to be made.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Jiri Benc <jbenc@redhat.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
v1->v2:
- rename from IFA_IF_NETNSID to IFA_TARGET_NETNSID

v0->v1:
- unchanged
  Note, I did not change the property name to IFA_TARGET_NSID as there
  was no clear agreement what would be preferred. My personal preference
  is to keep the IFA_IF_NETNSID name because it aligns naturally with
  the IFLA_IF_NETNSID property for RTM_*LINK requests. Jiri seems to
  prefer this name too.
  However, if there is agreement that another property name makes more
  sense I'm happy to send a v2 that changes this.
---
 include/uapi/linux/if_addr.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h
index ebaf5701c9db..dfcf3ce0097f 100644
--- a/include/uapi/linux/if_addr.h
+++ b/include/uapi/linux/if_addr.h
@@ -34,6 +34,7 @@ enum {
 	IFA_MULTICAST,
 	IFA_FLAGS,
 	IFA_RT_PRIORITY,  /* u32, priority/metric for prefix route */
+	IFA_TARGET_NETNSID,
 	__IFA_MAX,
 };
 
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH net] net: phy: sfp: Handle unimplemented hwmon limits and alarms
From: David Miller @ 2018-09-04 19:23 UTC (permalink / raw)
  To: andrew; +Cc: netdev, rmk+kernel, f.fainelli
In-Reply-To: <1536027836-19913-1-git-send-email-andrew@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Tue,  4 Sep 2018 04:23:56 +0200

> Not all SFPs implement the registers containing sensor limits and
> alarms. Luckily, there is a bit indicating if they are implemented or
> not. Add checking for this bit, when deciding if the hwmon attributes
> should be visible.
> 
> Fixes: 1323061a018a ("net: phy: sfp: Add HWMON support for module sensors")
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Applied, thanks Andrew.

^ permalink raw reply

* Re: [PATCH net-next v2] net: sched: action_ife: take reference to meta module
From: David Miller @ 2018-09-04 19:21 UTC (permalink / raw)
  To: vladbu; +Cc: netdev, xiyou.wangcong, jhs, jiri
In-Reply-To: <1536011082-2043-1-git-send-email-vladbu@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>
Date: Tue,  4 Sep 2018 00:44:42 +0300

> Recent refactoring of add_metainfo() caused use_all_metadata() to add
> metainfo to ife action metalist without taking reference to module. This
> causes warning in module_put called from ife action cleanup function.
> 
> Implement add_metainfo_and_get_ops() function that returns with reference
> to module taken if metainfo was added successfully, and call it from
> use_all_metadata(), instead of calling __add_metainfo() directly.
> 
> Example warning:
 ...
> Fixes: 5ffe57da29b3 ("act_ife: fix a potential deadlock")
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> ---
> 
> Changes V1->V2:
> - fold constants into helper function

Applied to 'net'.

^ permalink raw reply

* Re: [Patch net] act_ife: fix a potential use-after-free
From: David Miller @ 2018-09-04 19:19 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, jhs
In-Reply-To: <20180903180815.32220-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Mon,  3 Sep 2018 11:08:15 -0700

> Immediately after module_put(), user could delete this
> module, so e->ops could be already freed before we call
> e->ops->release().
> 
> Fix this by moving module_put() after ops->release().
> 
> Fixes: ef6980b6becb ("introduce IFE action")
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied and queued up for -stable, thanks Cong.

^ permalink raw reply

* Re: [PATCH net] net/mlx5: Fix SQ offset in QPs with small RQ
From: David Miller @ 2018-09-04 19:18 UTC (permalink / raw)
  To: tariqt; +Cc: netdev, eranbe, saeedm, alaa
In-Reply-To: <1535987184-16417-1-git-send-email-tariqt@mellanox.com>

From: Tariq Toukan <tariqt@mellanox.com>
Date: Mon,  3 Sep 2018 18:06:24 +0300

> Correct the formula for calculating the RQ page remainder,
> which should be in byte granularity.  The result will be
> non-zero only for RQs smaller than PAGE_SIZE, as an RQ size
> is a power of 2.
> 
> Divide this by the SQ stride (MLX5_SEND_WQE_BB) to get the
> SQ offset in strides granularity.
> 
> Fixes: d7037ad73daa ("net/mlx5: Fix QP fragmented buffer allocation")
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/wq.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> Hi Dave,
> Please queue for -stable v4.18.

Applied and queued up for -stable, thanks.
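
The arithmetic in the quoted commit message can be sketched as follows.
This is a minimal user-space illustration, not the mlx5 code: the
4096-byte page size, the 64-byte stride standing in for
MLX5_SEND_WQE_BB, and the function name sq_strides_offset() are all
assumptions made for the example.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_SEND_WQE_BB	64u	/* stands in for MLX5_SEND_WQE_BB */

/*
 * Given the RQ size in bytes (a power of two), return the offset of the
 * SQ within the shared page, expressed in SQ strides.
 */
static unsigned int sq_strides_offset(unsigned int rq_bytes)
{
	unsigned int rq_pg_remainder;

	/* The commit message relies on the RQ size being a power of 2. */
	assert(rq_bytes && !(rq_bytes & (rq_bytes - 1)));

	/* Remainder in byte granularity: non-zero only if RQ < one page. */
	rq_pg_remainder = rq_bytes % SKETCH_PAGE_SIZE;

	/* Convert bytes to SQ strides. */
	return rq_pg_remainder / SKETCH_SEND_WQE_BB;
}
```

With these assumed constants, a 512-byte RQ leaves a 512-byte remainder
and an SQ offset of 8 strides, while any RQ of one page or more gives 0.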

^ permalink raw reply

* Re: [PATCH] neighbour: confirm neigh entries when ARP packet is received
From: David Miller @ 2018-09-04 18:57 UTC (permalink / raw)
  To: ihrachys
  Cc: vasilykh, roopa, adobriyan, jwestfall, stephen, anarsoul,
	keescook, w.bumiller, edumazet, netdev
In-Reply-To: <CAKwN9=A6YEU5QWf9cQjHwv6L_ssOH5K_rapE6hoZtF_A4gokuQ@mail.gmail.com>

From: Ihar Hrachyshka <ihrachys@redhat.com>
Date: Tue, 4 Sep 2018 11:31:23 -0700

> Of course, I also agree that the comment will need some adjustment to
> reflect the fact that now a single timestamp is being updated. Perhaps
> while at it, Vasily could also explicitly describe in a comment which
> scenario the "if" branch check is supposed to cover. (I should have
> done it myself, mea culpa.)

Yes, that would help a lot.

^ permalink raw reply

* Re: [Patch net] tipc: call start and done ops directly in __tipc_nl_compat_dumpit()
From: Cong Wang @ 2018-09-04 18:41 UTC (permalink / raw)
  To: Ying Xue; +Cc: Linux Kernel Network Developers, tipc-discussion, Jon Maloy
In-Reply-To: <941068fa-85c9-7e5b-f769-23800ca407fa@windriver.com>

On Tue, Sep 4, 2018 at 4:45 AM Ying Xue <ying.xue@windriver.com> wrote:
>
>
> On 09/04/2018 10:12 AM, Cong Wang wrote:
> > __tipc_nl_compat_dumpit() uses a netlink_callback on stack,
> > so the only way to align it with other ->dumpit() call path
> > is calling tipc_dump_start() and tipc_dump_done() directly
> > inside it. Otherwise ->dumpit() would always get NULL from
> > cb->args[0].
>
> Thank you for your fix Cong!
>
> Your solution is fine with me.
>
> When we align __tipc_nl_compat_dumpit() with ->dumpit() functions
> defined in tipc_genl_v2_ops[], cb->args[0] is used to save a
> rhashtable_iter object allocated in tipc_dump_start(), and the object
> will be retrieved with cb->args[0] in tipc_dump_done() and will be freed.
>
> But unfortunately cb->args[0] has been used for other purposes in
> tipc_nl_bearer_dump(), tipc_nl_node_dump_link(),
> tipc_nl_name_table_dump(), tipc_nl_node_dump() and tipc_nl_net_dump().
> It means cb->args[0] saved in __tipc_dump_start() will be overwritten in
> these ->dumpit() functions. As a consequence, not only the
> rhashtable_iter object allocated in tipc_dump_start() cannot be properly
> released in tipc_dump_done(), but also more kernel panics might be
> triggered in tipc_dump_done().

Ah, good catch!

The max utilization of cb->args is tipc_nl_name_table_dump():

net/tipc/name_table.c:  cb->args[0] = last_type;
net/tipc/name_table.c:  cb->args[1] = last_lower;
net/tipc/name_table.c:  cb->args[2] = last_key;
net/tipc/name_table.c:  cb->args[3] = done;

Looks like I should just use cb->args[4] for rhashtable iterator,
as we still have some room in cb->args[].
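
A minimal user-space sketch of that slot layout, assuming a stand-in
struct with a six-entry args[] array; struct fake_cb, the SLOT_* names
and the helper functions are hypothetical and only illustrate why slot 4
does not collide with the four slots listed above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's netlink callback state. */
#define CB_NARGS 6
struct fake_cb {
	long args[CB_NARGS];
};

/* Slots 0..3 mirror the usage quoted from net/tipc/name_table.c. */
enum {
	SLOT_LAST_TYPE = 0,
	SLOT_LAST_LOWER = 1,
	SLOT_LAST_KEY = 2,
	SLOT_DONE = 3,
	SLOT_ITER = 4,	/* proposed home for the rhashtable iterator */
};

/* Store the iterator where the dump resume state cannot overwrite it. */
static void cb_set_iter(struct fake_cb *cb, void *iter)
{
	assert(SLOT_ITER < CB_NARGS);
	cb->args[SLOT_ITER] = (long)(intptr_t)iter;
}

static void *cb_get_iter(const struct fake_cb *cb)
{
	return (void *)(intptr_t)cb->args[SLOT_ITER];
}

/* Models a ->dumpit() pass that records its resume state in slots 0..3. */
static void fake_name_table_dump(struct fake_cb *cb)
{
	cb->args[SLOT_LAST_TYPE] = 18888;
	cb->args[SLOT_LAST_LOWER] = 17;
	cb->args[SLOT_LAST_KEY] = 42;
	cb->args[SLOT_DONE] = 1;
}
```

Because the dump only writes slots 0..3, the pointer stored in slot 4
is still intact when the done handler fetches it.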

^ permalink raw reply

* Re: [PATCH v2 00/11] mscc: ocelot: add support for SerDes muxing configuration
From: Paul Burton @ 2018-09-04 23:03 UTC (permalink / raw)
  To: Quentin Schulz
  Cc: Alexandre Belloni, David Miller, andrew, ralf, jhogan, robh+dt,
	mark.rutland, kishon, f.fainelli, allan.nielsen, linux-mips,
	devicetree, linux-kernel, netdev, thomas.petazzoni
In-Reply-To: <20180904180006.d5th3jrbhr4vtahi@qschulz>

Hi Quentin,

On Tue, Sep 04, 2018 at 08:00:06PM +0200, Quentin Schulz wrote:
> On Tue, Sep 04, 2018 at 09:10:28AM -0700, Paul Burton wrote:
> > Hi Alexandre, Quentin, all,
> > 
> > On Tue, Sep 04, 2018 at 05:16:53PM +0200, Alexandre Belloni wrote:
> > > On 03/09/2018 22:09:10-0700, David Miller wrote:
> > > > From: Alexandre Belloni <alexandre.belloni@bootlin.com>
> > > > Date: Mon, 3 Sep 2018 15:45:22 +0200
> > > > 
> > > > > On 03/09/2018 15:34:15+0200, Andrew Lunn wrote:
> > > > >> > I suggest patches 1 and 8 go through MIPS tree, 2 to 5 and 11 go through
> > > > >> > net while the others (6, 7, 9 and 10) go through the generic PHY subsystem.
> > > > >> 
> > > > >> Hi Quentin
> > > > >> 
> > > > >> Are you expecting merge conflicts? If not, it might be simpler to get
> > > > >> ACKs from each maintainer, and then merge it through one tree.
> > > > > 
> > > > > There are some other DT changes for this cycle so those should probably
> > > > > go through MIPS.
> > > > 
> > > > No objection for this going through the MIPS tree, and from me:
> > > >
> > > > Acked-by: David S. Miller <davem@davemloft.net>
> > > 
> > > What I meant was that 1/11 and 8/11 should go through MIPS because of
> > > the potential conflicts. The other patches can go through net-next as
> > > that will make more sense. Maybe Quentin can split the series in two,
> > > one for MIPS and one for net if that makes it easier for you to apply.
> > 
> > I'd be happy to take the .dts changes through the MIPS tree, though
> > looking at them won't patch 1 break bisection?
> > 
> > Since you remove the hsio reg entry it looks to me like
> > mscc_ocelot_probe() will fail with -EINVAL (which comes from
> > devm_ioremap_resource() with res=NULL) until patch 3.
> > 
> 
> That's correct.
> 
> > I'd feel more comfortable merging this piecemeal if it doesn't result in
> > us breaking bisection for however long it takes for both the trees
> > involved to be merged.
> > 
> 
> How do you want to proceed then?

Well, it sounded like David is OK with this all going through the MIPS
tree, though we'd need an ack for the PHY parts.

Alternatively I'd be happy for the DT changes to go through the net-next
tree, which may make more sense given that the .dts changes are pretty
trivial in comparison with the driver changes. If David wants to do that
then for patches 1 & 8:

    Acked-by: Paul Burton <paul.burton@mips.com>

Either way there may be conflicts for ocelot.dtsi when it comes to
merging to master, but they should be simple to resolve. It seems
Wolfram already took your DT changes for I2C so there's probably going
to be multiple trees updating that file this cycle already anyway.

Ideally I'd say "don't break bisection" but that's sort of a separate
issue here since even if you restructure your series to do that it would
still need to go through one tree. For example you could adjust
mscc_ocelot_probe() to handle either the reg property or the syscon,
then adjust the DT to use the syscon, then remove the code dealing with
the reg property, and I'd consider that a good idea anyway but it would
still probably all need to go through one tree to make sure things get
merged in the right order & avoid breaking bisection.

Thanks,
    Paul

^ permalink raw reply

* Re: [PATCH net-next v2] net: sched: action_ife: take reference to meta module
From: Cong Wang @ 2018-09-04 18:32 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller
In-Reply-To: <1536011082-2043-1-git-send-email-vladbu@mellanox.com>

On Mon, Sep 3, 2018 at 2:44 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Recent refactoring of add_metainfo() caused use_all_metadata() to add
> metainfo to ife action metalist without taking reference to module. This
> causes warning in module_put called from ife action cleanup function.
>
> Implement add_metainfo_and_get_ops() function that returns with reference
> to module taken if metainfo was added successfully, and call it from
> use_all_metadata(), instead of calling __add_metainfo() directly.


Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

This one should go to -net too.

^ permalink raw reply

* Re: [PATCH] neighbour: confirm neigh entries when ARP packet is received
From: Ihar Hrachyshka @ 2018-09-04 18:31 UTC (permalink / raw)
  To: David Miller
  Cc: Vasiliy Khoruzhick, roopa, adobriyan, jwestfall, stephen,
	anarsoul, keescook, w.bumiller, edumazet, Networking
In-Reply-To: <20180901.165129.1631978540527112789.davem@davemloft.net>

On Sat, Sep 1, 2018 at 4:51 PM, David Miller <davem@davemloft.net> wrote:
> From: Vasily Khoruzhick <vasilykh@arista.com>
> Date: Tue, 28 Aug 2018 19:48:25 -0700
>
>> Update 'confirmed' timestamp when ARP packet is received. It shouldn't
>> affect locktime logic and anyway entry can be confirmed by any higher-layer
>> protocol. Thus it makes no sense not to confirm it when ARP packet is
>> received.
>>
>> Fixes: 77d7123342 ("neighbour: update neigh timestamps iff update is
>> effective")
>>
>> Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
>
> I'm not so sure.
>
> The comment above the code you are moving explains that the current
> behavior is intentional, and it explains why too.
>
> Even if your change is correct, you're now making that comment
> inaccurate, so you'd have to update it to match the new code.
>
> But I still think the current code is intentionally behaving that
> way, and for good reason.

Hi David,

(I am the one who put this comment there.)

I agree with the reasoning that Vasily provided for the change (we
should confirm the entry if e.g. ARP packet with identical
hwaddr/ipaddr pair arrives; just not mark it as updated). It was a
mistake of mine to put access to both updated and confirmed fields
under the "if" branch. Just leaving 'updated' there and moving
'confirmed' outside seems like the right thing to do.

The original intent was to not update the 'updated' field when no update
happens (because of consecutive ARP packets sent in a short span of
time). The fix by Vasily should not negatively affect this scenario.
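
A minimal sketch of the intended split, assuming a simplified model in
which an update is "effective" only when the link-layer address changes.
struct fake_neigh and fake_neigh_update() are hypothetical names and are
not the code in net/core/neighbour.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define FAKE_ALEN 6

/* Hypothetical, heavily reduced neighbour entry. */
struct fake_neigh {
	unsigned long confirmed;	/* last time reachability was confirmed */
	unsigned long updated;		/* last time the entry actually changed */
	unsigned char lladdr[FAKE_ALEN];
};

/*
 * Process a received ARP packet at time 'now'.
 * 'confirmed' is refreshed unconditionally; 'updated' only moves when
 * the update is effective (here: the hardware address differs).
 */
static bool fake_neigh_update(struct fake_neigh *n,
			      const unsigned char *lladdr, unsigned long now)
{
	bool changed;

	assert(n && lladdr);
	changed = memcmp(n->lladdr, lladdr, FAKE_ALEN) != 0;

	n->confirmed = now;		/* outside the "if" branch */

	if (changed) {
		memcpy(n->lladdr, lladdr, FAKE_ALEN);
		n->updated = now;	/* stays inside the "if" branch */
	}
	return changed;
}
```

A repeated ARP packet with the same address therefore refreshes
'confirmed' but leaves 'updated' untouched.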

Of course, I also agree that the comment will need some adjustment to
reflect the fact that now a single timestamp is being updated. Perhaps
while at it, Vasily could also explicitly describe in a comment which
scenario the "if" branch check is supposed to cover. (I should have
done it myself, mea culpa.)

I hope it helps,
Ihar

^ permalink raw reply

* Re: [PATCH net] net/sched: fix memory leak in act_tunnel_key_init()
From: Cong Wang @ 2018-09-04 18:31 UTC (permalink / raw)
  To: Davide Caratti
  Cc: David Miller, Simon Horman, Amir Vadai, Jamal Hadi Salim,
	Linux Kernel Network Developers
In-Reply-To: <d0a72d1371e790505a8141592c6af904c9b24031.1536079973.git.dcaratti@redhat.com>

On Tue, Sep 4, 2018 at 10:00 AM Davide Caratti <dcaratti@redhat.com> wrote:
>
> If users try to install act_tunnel_key 'set' rules with duplicate values
> of 'index', the tunnel metadata are allocated, but never released. Then,
> kmemleak complains as follows:

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

^ permalink raw reply

* [PATCH bpf-next 4/4] i40e: disallow changing the number of descriptors when AF_XDP is on
From: Björn Töpel @ 2018-09-04 18:11 UTC (permalink / raw)
  To: ast, daniel, netdev, jeffrey.t.kirsher, intel-wired-lan,
	jakub.kicinski
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson
In-Reply-To: <20180904181105.10983-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

When an AF_XDP UMEM is attached to any of the Rx rings, we do not allow
a user to change the number of descriptors via e.g. "ethtool -G IFNAME".

Otherwise, the size of the stash/reuse queue can grow unbounded, which
would result in OOM or leaking userspace buffers.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 .../net/ethernet/intel/i40e/i40e_ethtool.c    |  9 +++++++-
 .../ethernet/intel/i40e/i40e_txrx_common.h    |  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 22 +++++++++++++++++++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index d7d3974beca2..3cd2c88c72f8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -5,7 +5,7 @@
 
 #include "i40e.h"
 #include "i40e_diag.h"
-
+#include "i40e_txrx_common.h"
 #include "i40e_ethtool_stats.h"
 
 #define I40E_PF_STAT(_name, _stat) \
@@ -1493,6 +1493,13 @@ static int i40e_set_ringparam(struct net_device *netdev,
 	    (new_rx_count == vsi->rx_rings[0]->count))
 		return 0;
 
+	/* If there is an AF_XDP UMEM attached to any of the Rx rings,
+	 * disallow changing the number of descriptors -- regardless
+	 * if the netdev is running or not.
+	 */
+	if (i40e_xsk_any_rx_ring_enabled(vsi))
+		return -EBUSY;
+
 	while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
 		timeout--;
 		if (!timeout)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 8d46acff6f2e..09809dffe399 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -89,5 +89,6 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
 
 void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring);
 void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
+bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi);
 
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index e4b62e871afc..119f59ec7cc0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -944,3 +944,25 @@ void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
 	if (xsk_frames)
 		xsk_umem_complete_tx(umem, xsk_frames);
 }
+
+/**
+ * i40e_xsk_any_rx_ring_enabled - Checks whether any of the Rx rings
+ * has AF_XDP UMEM attached
+ * @vsi: vsi
+ *
+ * Returns true if any of the Rx rings has an AF_XDP UMEM attached
+ **/
+bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi)
+{
+	int i;
+
+	if (!vsi->xsk_umems)
+		return false;
+
+	for (i = 0; i < vsi->num_queue_pairs; i++) {
+		if (vsi->xsk_umems[i])
+			return true;
+	}
+
+	return false;
+}
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 2/4] net: xsk: add a simple buffer reuse queue
From: Björn Töpel @ 2018-09-04 18:11 UTC (permalink / raw)
  To: ast, daniel, netdev, jeffrey.t.kirsher, intel-wired-lan,
	jakub.kicinski
  Cc: magnus.karlsson, magnus.karlsson
In-Reply-To: <20180904181105.10983-1-bjorn.topel@gmail.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>

XSK UMEM is strongly single producer single consumer so reuse of
frames is challenging.  Add a simple "stash" of FILL packets to
reuse for drivers to optionally make use of.  This is useful
when a driver has to free (ndo_stop) or resize a ring with an active
AF_XDP ZC socket.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/net/xdp_sock.h | 43 +++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.c     |  2 ++
 net/xdp/xsk_queue.c    | 55 ++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h    |  3 +++
 4 files changed, 103 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 932ca0dad6f3..7b55206da138 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -14,6 +14,7 @@
 #include <net/sock.h>
 
 struct net_device;
+struct xdp_umem_fq_reuse;
 struct xsk_queue;
 
 struct xdp_umem_page {
@@ -37,6 +38,7 @@ struct xdp_umem {
 	struct page **pgs;
 	u32 npgs;
 	struct net_device *dev;
+	struct xdp_umem_fq_reuse *fq_reuse;
 	u16 queue_id;
 	bool zc;
 	spinlock_t xsk_list_lock;
@@ -139,4 +141,45 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 }
 #endif /* CONFIG_XDP_SOCKETS */
 
+struct xdp_umem_fq_reuse {
+	u32 nentries;
+	u32 length;
+	u64 handles[];
+};
+
+/* Following functions are not thread-safe in any way */
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+					  struct xdp_umem_fq_reuse *newq);
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
+
+/* Reuse-queue aware version of FILL queue helpers */
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+	if (!rq->length)
+		return xsk_umem_peek_addr(umem, addr);
+
+	*addr = rq->handles[rq->length - 1];
+	return addr;
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+	if (!rq->length)
+		xsk_umem_discard_addr(umem);
+	else
+		rq->length--;
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+	rq->handles[rq->length++] = addr;
+}
+
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index b3b632c5aeae..555427b3e0fe 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -165,6 +165,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
 		umem->cq = NULL;
 	}
 
+	xsk_reuseq_destroy(umem);
+
 	xdp_umem_unpin_pages(umem);
 
 	task = get_pid_task(umem->pid, PIDTYPE_PID);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 2dc1384d9f27..b66504592d9b 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -3,7 +3,9 @@
  * Copyright(c) 2018 Intel Corporation.
  */
 
+#include <linux/log2.h>
 #include <linux/slab.h>
+#include <linux/overflow.h>
 
 #include "xsk_queue.h"
 
@@ -62,3 +64,56 @@ void xskq_destroy(struct xsk_queue *q)
 	page_frag_free(q->ring);
 	kfree(q);
 }
+
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+	struct xdp_umem_fq_reuse *newq;
+
+	/* Check for overflow */
+	if (nentries > (u32)roundup_pow_of_two(nentries))
+		return NULL;
+	nentries = roundup_pow_of_two(nentries);
+
+	newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL);
+	if (!newq)
+		return NULL;
+	memset(newq, 0, offsetof(typeof(*newq), handles));
+
+	newq->nentries = nentries;
+	return newq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_prepare);
+
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+					  struct xdp_umem_fq_reuse *newq)
+{
+	struct xdp_umem_fq_reuse *oldq = umem->fq_reuse;
+
+	if (!oldq) {
+		umem->fq_reuse = newq;
+		return NULL;
+	}
+
+	if (newq->nentries < oldq->length)
+		return newq;
+
+	memcpy(newq->handles, oldq->handles,
+	       array_size(oldq->length, sizeof(u64)));
+	newq->length = oldq->length;
+
+	umem->fq_reuse = newq;
+	return oldq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_swap);
+
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq)
+{
+	kvfree(rq);
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_free);
+
+void xsk_reuseq_destroy(struct xdp_umem *umem)
+{
+	xsk_reuseq_free(umem->fq_reuse);
+	umem->fq_reuse = NULL;
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 82252cccb4e0..bcb5cbb40419 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -258,4 +258,7 @@ void xskq_set_umem(struct xsk_queue *q, u64 size, u64 chunk_mask);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q_ops);
 
+/* Executed by the core when the entire UMEM gets freed */
+void xsk_reuseq_destroy(struct xdp_umem *umem);
+
 #endif /* _LINUX_XSK_QUEUE_H */
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 3/4] i40e: clean zero-copy XDP Rx ring on shutdown/reset
From: Björn Töpel @ 2018-09-04 18:11 UTC (permalink / raw)
  To: ast, daniel, netdev, jeffrey.t.kirsher, intel-wired-lan,
	jakub.kicinski
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson
In-Reply-To: <20180904181105.10983-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Outstanding Rx descriptors are temporarily stored on a stash/reuse
queue. When/if the HW rings come up again, entries from the stash are
used to re-populate the ring.

The latter required some restructuring of the allocation scheme for
the AF_XDP zero-copy implementation. There is now a fast and a slow
allocation. The "fast allocation" is used from the fast-path and
obtains free buffers from the fill ring and the internal recycle
mechanism. The "slow allocation" is only used in ring setup, and
obtains buffers from the fill ring and the stash (if any).

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   4 +-
 .../ethernet/intel/i40e/i40e_txrx_common.h    |   1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 100 ++++++++++++++++--
 3 files changed, 96 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 7f85d4ba8b54..740ea58ba938 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1355,8 +1355,10 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 		rx_ring->skb = NULL;
 	}
 
-	if (rx_ring->xsk_umem)
+	if (rx_ring->xsk_umem) {
+		i40e_xsk_clean_rx_ring(rx_ring);
 		goto skip_free;
+	}
 
 	/* Free all the Rx ring sk_buffs */
 	for (i = 0; i < rx_ring->count; i++) {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 29c68b29d36f..8d46acff6f2e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -87,6 +87,7 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
 	}
 }
 
+void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring);
 void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
 
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 99116277c4d2..e4b62e871afc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -140,6 +140,7 @@ static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
 static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
 				u16 qid)
 {
+	struct xdp_umem_fq_reuse *reuseq;
 	bool if_running;
 	int err;
 
@@ -156,6 +157,12 @@ static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
 			return -EBUSY;
 	}
 
+	reuseq = xsk_reuseq_prepare(vsi->rx_rings[0]->count);
+	if (!reuseq)
+		return -ENOMEM;
+
+	xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
+
 	err = i40e_xsk_umem_dma_map(vsi, umem);
 	if (err)
 		return err;
@@ -353,16 +360,46 @@ static bool i40e_alloc_buffer_zc(struct i40e_ring *rx_ring,
 }
 
 /**
- * i40e_alloc_rx_buffers_zc - Allocates a number of Rx buffers
+ * i40e_alloc_buffer_slow_zc - Allocates an i40e_rx_buffer
  * @rx_ring: Rx ring
- * @count: The number of buffers to allocate
+ * @bi: Rx buffer to populate
  *
- * This function allocates a number of Rx buffers and places them on
- * the Rx ring.
+ * This function allocates an Rx buffer. The buffer can come from fill
+ * queue, or via the reuse queue.
  *
  * Returns true for a successful allocation, false otherwise
  **/
-bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count)
+static bool i40e_alloc_buffer_slow_zc(struct i40e_ring *rx_ring,
+				      struct i40e_rx_buffer *bi)
+{
+	struct xdp_umem *umem = rx_ring->xsk_umem;
+	u64 handle, hr;
+
+	if (!xsk_umem_peek_addr_rq(umem, &handle)) {
+		rx_ring->rx_stats.alloc_page_failed++;
+		return false;
+	}
+
+	handle &= rx_ring->xsk_umem->chunk_mask;
+
+	hr = umem->headroom + XDP_PACKET_HEADROOM;
+
+	bi->dma = xdp_umem_get_dma(umem, handle);
+	bi->dma += hr;
+
+	bi->addr = xdp_umem_get_data(umem, handle);
+	bi->addr += hr;
+
+	bi->handle = handle + umem->headroom;
+
+	xsk_umem_discard_addr_rq(umem);
+	return true;
+}
+
+static __always_inline bool __i40e_alloc_rx_buffers_zc(
+	struct i40e_ring *rx_ring, u16 count,
+	bool alloc(struct i40e_ring *rx_ring,
+		   struct i40e_rx_buffer *bi))
 {
 	u16 ntu = rx_ring->next_to_use;
 	union i40e_rx_desc *rx_desc;
@@ -372,7 +409,7 @@ bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count)
 	rx_desc = I40E_RX_DESC(rx_ring, ntu);
 	bi = &rx_ring->rx_bi[ntu];
 	do {
-		if (!i40e_alloc_buffer_zc(rx_ring, bi)) {
+		if (!alloc(rx_ring, bi)) {
 			ok = false;
 			goto no_buffers;
 		}
@@ -404,6 +441,38 @@ bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count)
 	return ok;
 }
 
+/**
+ * i40e_alloc_rx_buffers_zc - Allocates a number of Rx buffers
+ * @rx_ring: Rx ring
+ * @count: The number of buffers to allocate
+ *
+ * This function allocates a number of Rx buffers from the reuse queue
+ * or fill ring and places them on the Rx ring.
+ *
+ * Returns true for a successful allocation, false otherwise
+ **/
+bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count)
+{
+	return __i40e_alloc_rx_buffers_zc(rx_ring, count,
+					  i40e_alloc_buffer_slow_zc);
+}
+
+/**
+ * i40e_alloc_rx_buffers_fast_zc - Allocates a number of Rx buffers
+ * @rx_ring: Rx ring
+ * @count: The number of buffers to allocate
+ *
+ * This function allocates a number of Rx buffers from the fill ring
+ * or the internal recycle mechanism and places them on the Rx ring.
+ *
+ * Returns true for a successful allocation, false otherwise
+ **/
+static bool i40e_alloc_rx_buffers_fast_zc(struct i40e_ring *rx_ring, u16 count)
+{
+	return __i40e_alloc_rx_buffers_zc(rx_ring, count,
+					  i40e_alloc_buffer_zc);
+}
+
 /**
  * i40e_get_rx_buffer_zc - Return the current Rx buffer
  * @rx_ring: Rx ring
@@ -571,8 +640,8 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 
 		if (cleaned_count >= I40E_RX_BUFFER_WRITE) {
 			failure = failure ||
-				  !i40e_alloc_rx_buffers_zc(rx_ring,
-							    cleaned_count);
+				  !i40e_alloc_rx_buffers_fast_zc(rx_ring,
+								 cleaned_count);
 			cleaned_count = 0;
 		}
 
@@ -831,6 +900,21 @@ int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
 	return 0;
 }
 
+void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring)
+{
+	u16 i;
+
+	for (i = 0; i < rx_ring->count; i++) {
+		struct i40e_rx_buffer *rx_bi = &rx_ring->rx_bi[i];
+
+		if (!rx_bi->addr)
+			continue;
+
+		xsk_umem_fq_reuse(rx_ring->xsk_umem, rx_bi->handle);
+		rx_bi->addr = NULL;
+	}
+}
+
 /**
  * i40e_xsk_clean_xdp_ring - Clean the XDP Tx ring on shutdown
  * @xdp_ring: XDP Tx ring
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 1/4] i40e: clean zero-copy XDP Tx ring on shutdown/reset
From: Björn Töpel @ 2018-09-04 18:11 UTC (permalink / raw)
  To: ast, daniel, netdev, jeffrey.t.kirsher, intel-wired-lan,
	jakub.kicinski
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson
In-Reply-To: <20180904181105.10983-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

When the zero-copy enabled XDP Tx ring is torn down, due to
configuration changes, outstanding frames on the hardware descriptor
ring are queued on the completion ring.

The completion ring has a back-pressure mechanism that will guarantee
that there is sufficient space on the ring.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 17 +++++++----
 .../ethernet/intel/i40e/i40e_txrx_common.h    |  2 ++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 30 +++++++++++++++++++
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 37bd4e50ccde..7f85d4ba8b54 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -636,13 +636,18 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
 	unsigned long bi_size;
 	u16 i;
 
-	/* ring already cleared, nothing to do */
-	if (!tx_ring->tx_bi)
-		return;
+	if (ring_is_xdp(tx_ring) && tx_ring->xsk_umem) {
+		i40e_xsk_clean_tx_ring(tx_ring);
+	} else {
+		/* ring already cleared, nothing to do */
+		if (!tx_ring->tx_bi)
+			return;
 
-	/* Free all the Tx ring sk_buffs */
-	for (i = 0; i < tx_ring->count; i++)
-		i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+		/* Free all the Tx ring sk_buffs */
+		for (i = 0; i < tx_ring->count; i++)
+			i40e_unmap_and_free_tx_resource(tx_ring,
+							&tx_ring->tx_bi[i]);
+	}
 
 	bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
 	memset(tx_ring->tx_bi, 0, bi_size);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index b5afd479a9c5..29c68b29d36f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -87,4 +87,6 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
 	}
 }
 
+void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
+
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 2ebfc78bbd09..99116277c4d2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -830,3 +830,33 @@ int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
 
 	return 0;
 }
+
+/**
+ * i40e_xsk_clean_xdp_ring - Clean the XDP Tx ring on shutdown
+ * @xdp_ring: XDP Tx ring
+ **/
+void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
+{
+	u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
+	struct xdp_umem *umem = tx_ring->xsk_umem;
+	struct i40e_tx_buffer *tx_bi;
+	u32 xsk_frames = 0;
+
+	while (ntc != ntu) {
+		tx_bi = &tx_ring->tx_bi[ntc];
+
+		if (tx_bi->xdpf)
+			i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
+		else
+			xsk_frames++;
+
+		tx_bi->xdpf = NULL;
+
+		ntc++;
+		if (ntc >= tx_ring->count)
+			ntc = 0;
+	}
+
+	if (xsk_frames)
+		xsk_umem_complete_tx(umem, xsk_frames);
+}
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 0/4] i40e AF_XDP zero-copy buffer leak fixes
From: Björn Töpel @ 2018-09-04 18:11 UTC (permalink / raw)
  To: ast, daniel, netdev, jeffrey.t.kirsher, intel-wired-lan,
	jakub.kicinski
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson

From: Björn Töpel <bjorn.topel@intel.com>

This series addresses an AF_XDP zero-copy issue where buffers passed
from userspace to the kernel were leaked when the hardware descriptor
ring was torn down.

The patches fix the i40e AF_XDP zero-copy implementation.

Thanks to Jakub Kicinski for pointing this out!

Some background for folks who don't know the details: a zero-copy
capable driver picks buffers off the fill ring and places them on the
hardware Rx ring, to be completed at a later point when DMA is
complete. The Tx side is similar: the driver picks buffers off the Tx
ring and places them on the Tx hardware ring.

In the typical flow, the Rx buffer will be placed onto an Rx ring
(completed to the user), and the Tx buffer will be placed on the
completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), the userspace
buffers must not be leaked. They have to be reused or completed back
to userspace.

The implementation does the following:

* Outstanding Tx descriptors will be passed to the completion
  ring. The Tx code has a back-pressure mechanism in place, so that
  enough empty space in the completion ring is guaranteed.

* Outstanding Rx descriptors are temporarily stored on a stash/reuse
  queue. The reuse queue is based on Jakub's RFC. When/if the HW rings
  come up again, entries from the stash are used to re-populate the
  ring.

* When AF_XDP ZC is enabled, disallow changing the number of hardware
  descriptors via ethtool. Otherwise, the size of the stash/reuse
  queue can grow unbounded.
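
The teardown steps above can be sketched in plain C. This is only an
illustration under stated assumptions, not the i40e code: struct
hw_ring, struct reuse_queue, ring_next(), drain_tx() and stash_rx()
are hypothetical names, and RING_SIZE is an arbitrary descriptor
count.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical descriptor count */

/* Hypothetical stash of Rx buffer handles, sized like the ring. */
struct reuse_queue {
	uint64_t handles[RING_SIZE];
	uint32_t length;
};

/* Hypothetical hardware ring; handle[i] is the userspace buffer
 * currently owned by descriptor i.
 */
struct hw_ring {
	uint64_t handle[RING_SIZE];
	uint16_t next_to_clean;
	uint16_t next_to_use;
};

/* Advance an index, wrapping as soon as it reaches RING_SIZE. */
static uint16_t ring_next(uint16_t idx)
{
	idx++;
	return idx >= RING_SIZE ? 0 : idx;
}

/* Tx: count outstanding descriptors so that the caller can complete
 * that many frames back to userspace in one go.
 */
static uint32_t drain_tx(struct hw_ring *ring)
{
	uint32_t frames = 0;

	while (ring->next_to_clean != ring->next_to_use) {
		frames++;
		ring->next_to_clean = ring_next(ring->next_to_clean);
	}
	return frames;
}

/* Rx: move outstanding buffers onto the stash instead of dropping
 * them, so they can re-populate the ring later.
 */
static uint32_t stash_rx(struct hw_ring *ring, struct reuse_queue *rq)
{
	uint32_t stashed = 0;

	while (ring->next_to_clean != ring->next_to_use) {
		/* The stash only stays bounded if the ring size is fixed. */
		assert(rq->length < RING_SIZE);
		rq->handles[rq->length++] = ring->handle[ring->next_to_clean];
		ring->next_to_clean = ring_next(ring->next_to_clean);
		stashed++;
	}
	return stashed;
}
```

In this sketch the assert in stash_rx() stands in for the third point:
the stash can only be sized up front if the number of descriptors
cannot change underneath it.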

Going forward, introducing a "zero-copy allocator" analogous to Jesper
Brouer's page pool would be a more robust and reusable solution.

Jakub: I've made a minor checkpatch fix to your RFC prior to adding it
to this series.


Thanks!
Björn

Björn Töpel (3):
  i40e: clean zero-copy XDP Tx ring on shutdown/reset
  i40e: clean zero-copy XDP Rx ring on shutdown/reset
  i40e: disallow changing the number of descriptors when AF_XDP is on

Jakub Kicinski (1):
  net: xsk: add a simple buffer reuse queue

 .../net/ethernet/intel/i40e/i40e_ethtool.c    |   9 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  21 ++-
 .../ethernet/intel/i40e/i40e_txrx_common.h    |   4 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 152 +++++++++++++++++-
 include/net/xdp_sock.h                        |  43 +++++
 net/xdp/xdp_umem.c                            |   2 +
 net/xdp/xsk_queue.c                           |  55 +++++++
 net/xdp/xsk_queue.h                           |   3 +
 8 files changed, 273 insertions(+), 16 deletions(-)

-- 
2.17.1


* [iproute PATCH] ip-route: Fix segfault with many nexthops
From: Phil Sutter @ 2018-09-04 17:15 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

It was possible to crash ip-route by adding an IPv6 route with 37
nexthop statements. A simple reproducer is:

| for i in `seq 37`; do
| 	nhs="nexthop via 1111::$i "$nhs
| done
| ip -6 route add 3333::/64 $nhs

The related code was broken in multiple ways:

* parse_one_nh() assumed that rta points to 4kB of storage but the
  caller provided just 1kB. Fixed by passing a 'len' parameter with
  the correct value.

* Error checking of rta_addattr*() calls in parse_one_nh() and called
  functions was completely absent, so with the above fix in place an
  output flood would occur because the parser looped forever.
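
Both points can be illustrated with a minimal, self-contained sketch.
It does not use the libnetlink types or helpers: struct attr_buf,
attr_append() and append_many() are hypothetical names, and the 64
byte buffer and 16 byte payload are arbitrary assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified attribute container. */
struct attr_buf {
	size_t used;		/* bytes already filled in data[] */
	unsigned char data[64];
};

/* Append alen bytes without writing past maxlen.
 * Returns 0 on success, -1 if the payload does not fit.
 */
static int attr_append(struct attr_buf *buf, size_t maxlen,
		       const void *payload, size_t alen)
{
	/* The bound must describe the real storage, not a guess. */
	assert(maxlen <= sizeof(buf->data));

	if (buf->used > maxlen || alen > maxlen - buf->used)
		return -1;

	memcpy(buf->data + buf->used, payload, alen);
	buf->used += alen;
	return 0;
}

/* Caller that checks every append and stops at the first failure
 * instead of carrying on with a full buffer.
 */
static int append_many(struct attr_buf *buf, size_t maxlen, int count)
{
	static const unsigned char gw[16];	/* dummy payload */
	int i;

	for (i = 0; i < count; i++)
		if (attr_append(buf, maxlen, gw, sizeof(gw)))
			return -1;
	return 0;
}
```

With these sizes, four appends fill the buffer exactly and the fifth
is rejected, which the caller reports rather than ignores.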

Note that it is still not possible to add a route with more than 36
nexthops due to stack buffer sizes; this patch merely fixes the error
path.

Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 ip/iproute.c          |  41 ++++++++++------
 ip/iproute_lwtunnel.c | 108 +++++++++++++++++++++++++-----------------
 2 files changed, 91 insertions(+), 58 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 30833414a3f7f..9e5ae48c0715c 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -941,7 +941,7 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 }
 
 static int parse_one_nh(struct nlmsghdr *n, struct rtmsg *r,
-			struct rtattr *rta, struct rtnexthop *rtnh,
+			struct rtattr *rta, size_t len, struct rtnexthop *rtnh,
 			int *argcp, char ***argvp)
 {
 	int argc = *argcp;
@@ -962,11 +962,16 @@ static int parse_one_nh(struct nlmsghdr *n, struct rtmsg *r,
 			if (r->rtm_family == AF_UNSPEC)
 				r->rtm_family = addr.family;
 			if (addr.family == r->rtm_family) {
-				rta_addattr_l(rta, 4096, RTA_GATEWAY, &addr.data, addr.bytelen);
-				rtnh->rtnh_len += sizeof(struct rtattr) + addr.bytelen;
+				if (rta_addattr_l(rta, len, RTA_GATEWAY,
+						  &addr.data, addr.bytelen))
+					return -1;
+				rtnh->rtnh_len += sizeof(struct rtattr)
+						  + addr.bytelen;
 			} else {
-				rta_addattr_l(rta, 4096, RTA_VIA, &addr.family, addr.bytelen+2);
-				rtnh->rtnh_len += RTA_SPACE(addr.bytelen+2);
+				if (rta_addattr_l(rta, len, RTA_VIA,
+						  &addr.family, addr.bytelen + 2))
+					return -1;
+				rtnh->rtnh_len += RTA_SPACE(addr.bytelen + 2);
 			}
 		} else if (strcmp(*argv, "dev") == 0) {
 			NEXT_ARG();
@@ -988,13 +993,15 @@ static int parse_one_nh(struct nlmsghdr *n, struct rtmsg *r,
 			NEXT_ARG();
 			if (get_rt_realms_or_raw(&realm, *argv))
 				invarg("\"realm\" value is invalid\n", *argv);
-			rta_addattr32(rta, 4096, RTA_FLOW, realm);
+			if (rta_addattr32(rta, len, RTA_FLOW, realm))
+				return -1;
 			rtnh->rtnh_len += sizeof(struct rtattr) + 4;
 		} else if (strcmp(*argv, "encap") == 0) {
-			int len = rta->rta_len;
+			int old_len = rta->rta_len;
 
-			lwt_parse_encap(rta, 4096, &argc, &argv);
-			rtnh->rtnh_len += rta->rta_len - len;
+			if (lwt_parse_encap(rta, len, &argc, &argv))
+				return -1;
+			rtnh->rtnh_len += rta->rta_len - old_len;
 		} else if (strcmp(*argv, "as") == 0) {
 			inet_prefix addr;
 
@@ -1002,8 +1009,9 @@ static int parse_one_nh(struct nlmsghdr *n, struct rtmsg *r,
 			if (strcmp(*argv, "to") == 0)
 				NEXT_ARG();
 			get_addr(&addr, *argv, r->rtm_family);
-			rta_addattr_l(rta, 4096, RTA_NEWDST, &addr.data,
-				      addr.bytelen);
+			if (rta_addattr_l(rta, len, RTA_NEWDST,
+					  &addr.data, addr.bytelen))
+				return -1;
 			rtnh->rtnh_len += sizeof(struct rtattr) + addr.bytelen;
 		} else
 			break;
@@ -1036,15 +1044,18 @@ static int parse_nexthops(struct nlmsghdr *n, struct rtmsg *r,
 		memset(rtnh, 0, sizeof(*rtnh));
 		rtnh->rtnh_len = sizeof(*rtnh);
 		rta->rta_len += rtnh->rtnh_len;
-		if (parse_one_nh(n, r, rta, rtnh, &argc, &argv)) {
+		if (parse_one_nh(n, r, rta, 1024, rtnh, &argc, &argv)) {
 			fprintf(stderr, "Error: cannot parse nexthop\n");
 			exit(-1);
 		}
 		rtnh = RTNH_NEXT(rtnh);
 	}
 
+		return 0;
+
 	if (rta->rta_len > RTA_LENGTH(0))
-		addattr_l(n, 1024, RTA_MULTIPATH, RTA_DATA(rta), RTA_PAYLOAD(rta));
+		return addattr_l(n, 1024, RTA_MULTIPATH,
+				 RTA_DATA(rta), RTA_PAYLOAD(rta));
 	return 0;
 }
 
@@ -1484,8 +1495,8 @@ static int iproute_modify(int cmd, unsigned int flags, int argc, char **argv)
 		addattr_l(&req.n, sizeof(req), RTA_METRICS, RTA_DATA(mxrta), RTA_PAYLOAD(mxrta));
 	}
 
-	if (nhs_ok)
-		parse_nexthops(&req.n, &req.r, argc, argv);
+	if (nhs_ok && parse_nexthops(&req.n, &req.r, argc, argv))
+		return -1;
 
 	if (req.r.rtm_family == AF_UNSPEC)
 		req.r.rtm_family = AF_INET;
diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
index e604481142ec1..969a4763df71d 100644
--- a/ip/iproute_lwtunnel.c
+++ b/ip/iproute_lwtunnel.c
@@ -538,8 +538,9 @@ static int parse_encap_seg6(struct rtattr *rta, size_t len, int *argcp,
 
 	memcpy(tuninfo->srh, srh, srhlen);
 
-	rta_addattr_l(rta, len, SEG6_IPTUNNEL_SRH, tuninfo,
-		      sizeof(*tuninfo) + srhlen);
+	if (rta_addattr_l(rta, len, SEG6_IPTUNNEL_SRH, tuninfo,
+			  sizeof(*tuninfo) + srhlen))
+		return -1;
 
 	free(tuninfo);
 	free(srh);
@@ -611,6 +612,7 @@ static int parse_encap_seg6local(struct rtattr *rta, size_t len, int *argcp,
 	char segbuf[1024];
 	inet_prefix addr;
 	__u32 hmac = 0;
+	int ret = 0;
 
 	while (argc > 0) {
 		if (strcmp(*argv, "action") == 0) {
@@ -620,27 +622,28 @@ static int parse_encap_seg6local(struct rtattr *rta, size_t len, int *argcp,
 			action = read_action_type(*argv);
 			if (!action)
 				invarg("\"action\" value is invalid\n", *argv);
-			rta_addattr32(rta, len, SEG6_LOCAL_ACTION, action);
+			ret = rta_addattr32(rta, len, SEG6_LOCAL_ACTION,
+					    action);
 		} else if (strcmp(*argv, "table") == 0) {
 			NEXT_ARG();
 			if (table_ok++)
 				duparg2("table", *argv);
 			get_u32(&table, *argv, 0);
-			rta_addattr32(rta, len, SEG6_LOCAL_TABLE, table);
+			ret = rta_addattr32(rta, len, SEG6_LOCAL_TABLE, table);
 		} else if (strcmp(*argv, "nh4") == 0) {
 			NEXT_ARG();
 			if (nh4_ok++)
 				duparg2("nh4", *argv);
 			get_addr(&addr, *argv, AF_INET);
-			rta_addattr_l(rta, len, SEG6_LOCAL_NH4, &addr.data,
-				      addr.bytelen);
+			ret = rta_addattr_l(rta, len, SEG6_LOCAL_NH4,
+					    &addr.data, addr.bytelen);
 		} else if (strcmp(*argv, "nh6") == 0) {
 			NEXT_ARG();
 			if (nh6_ok++)
 				duparg2("nh6", *argv);
 			get_addr(&addr, *argv, AF_INET6);
-			rta_addattr_l(rta, len, SEG6_LOCAL_NH6, &addr.data,
-				      addr.bytelen);
+			ret = rta_addattr_l(rta, len, SEG6_LOCAL_NH6,
+					    &addr.data, addr.bytelen);
 		} else if (strcmp(*argv, "iif") == 0) {
 			NEXT_ARG();
 			if (iif_ok++)
@@ -648,7 +651,7 @@ static int parse_encap_seg6local(struct rtattr *rta, size_t len, int *argcp,
 			iif = ll_name_to_index(*argv);
 			if (!iif)
 				exit(nodev(*argv));
-			rta_addattr32(rta, len, SEG6_LOCAL_IIF, iif);
+			ret = rta_addattr32(rta, len, SEG6_LOCAL_IIF, iif);
 		} else if (strcmp(*argv, "oif") == 0) {
 			NEXT_ARG();
 			if (oif_ok++)
@@ -656,7 +659,7 @@ static int parse_encap_seg6local(struct rtattr *rta, size_t len, int *argcp,
 			oif = ll_name_to_index(*argv);
 			if (!oif)
 				exit(nodev(*argv));
-			rta_addattr32(rta, len, SEG6_LOCAL_OIF, oif);
+			ret = rta_addattr32(rta, len, SEG6_LOCAL_OIF, oif);
 		} else if (strcmp(*argv, "srh") == 0) {
 			NEXT_ARG();
 			if (srh_ok++)
@@ -691,6 +694,8 @@ static int parse_encap_seg6local(struct rtattr *rta, size_t len, int *argcp,
 		} else {
 			break;
 		}
+		if (ret)
+			return ret;
 		argc--; argv++;
 	}
 
@@ -705,14 +710,14 @@ static int parse_encap_seg6local(struct rtattr *rta, size_t len, int *argcp,
 		srh = parse_srh(segbuf, hmac,
 				action == SEG6_LOCAL_ACTION_END_B6_ENCAP);
 		srhlen = (srh->hdrlen + 1) << 3;
-		rta_addattr_l(rta, len, SEG6_LOCAL_SRH, srh, srhlen);
+		ret = rta_addattr_l(rta, len, SEG6_LOCAL_SRH, srh, srhlen);
 		free(srh);
 	}
 
 	*argcp = argc + 1;
 	*argvp = argv - 1;
 
-	return 0;
+	return ret;
 }
 
 static int parse_encap_mpls(struct rtattr *rta, size_t len,
@@ -730,8 +735,9 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len,
 		exit(1);
 	}
 
-	rta_addattr_l(rta, len, MPLS_IPTUNNEL_DST, &addr.data,
-		      addr.bytelen);
+	if (rta_addattr_l(rta, len, MPLS_IPTUNNEL_DST,
+			  &addr.data, addr.bytelen))
+		return -1;
 
 	argc--;
 	argv++;
@@ -745,7 +751,8 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len,
 				duparg2("ttl", *argv);
 			if (get_u8(&ttl, *argv, 0))
 				invarg("\"ttl\" value is invalid\n", *argv);
-			rta_addattr8(rta, len, MPLS_IPTUNNEL_TTL, ttl);
+			if (rta_addattr8(rta, len, MPLS_IPTUNNEL_TTL, ttl))
+				return -1;
 		} else {
 			break;
 		}
@@ -768,6 +775,7 @@ static int parse_encap_ip(struct rtattr *rta, size_t len,
 	int id_ok = 0, dst_ok = 0, tos_ok = 0, ttl_ok = 0;
 	char **argv = *argvp;
 	int argc = *argcp;
+	int ret = 0;
 
 	while (argc > 0) {
 		if (strcmp(*argv, "id") == 0) {
@@ -778,7 +786,7 @@ static int parse_encap_ip(struct rtattr *rta, size_t len,
 				duparg2("id", *argv);
 			if (get_be64(&id, *argv, 0))
 				invarg("\"id\" value is invalid\n", *argv);
-			rta_addattr64(rta, len, LWTUNNEL_IP_ID, id);
+			ret = rta_addattr64(rta, len, LWTUNNEL_IP_ID, id);
 		} else if (strcmp(*argv, "dst") == 0) {
 			inet_prefix addr;
 
@@ -786,8 +794,8 @@ static int parse_encap_ip(struct rtattr *rta, size_t len,
 			if (dst_ok++)
 				duparg2("dst", *argv);
 			get_addr(&addr, *argv, AF_INET);
-			rta_addattr_l(rta, len, LWTUNNEL_IP_DST,
-				      &addr.data, addr.bytelen);
+			ret = rta_addattr_l(rta, len, LWTUNNEL_IP_DST,
+					    &addr.data, addr.bytelen);
 		} else if (strcmp(*argv, "tos") == 0) {
 			__u32 tos;
 
@@ -796,7 +804,7 @@ static int parse_encap_ip(struct rtattr *rta, size_t len,
 				duparg2("tos", *argv);
 			if (rtnl_dsfield_a2n(&tos, *argv))
 				invarg("\"tos\" value is invalid\n", *argv);
-			rta_addattr8(rta, len, LWTUNNEL_IP_TOS, tos);
+			ret = rta_addattr8(rta, len, LWTUNNEL_IP_TOS, tos);
 		} else if (strcmp(*argv, "ttl") == 0) {
 			__u8 ttl;
 
@@ -805,10 +813,12 @@ static int parse_encap_ip(struct rtattr *rta, size_t len,
 				duparg2("ttl", *argv);
 			if (get_u8(&ttl, *argv, 0))
 				invarg("\"ttl\" value is invalid\n", *argv);
-			rta_addattr8(rta, len, LWTUNNEL_IP_TTL, ttl);
+			ret = rta_addattr8(rta, len, LWTUNNEL_IP_TTL, ttl);
 		} else {
 			break;
 		}
+		if (ret)
+			break;
 		argc--; argv++;
 	}
 
@@ -819,7 +829,7 @@ static int parse_encap_ip(struct rtattr *rta, size_t len,
 	*argcp = argc + 1;
 	*argvp = argv - 1;
 
-	return 0;
+	return ret;
 }
 
 static int parse_encap_ila(struct rtattr *rta, size_t len,
@@ -828,6 +838,7 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
 	__u64 locator;
 	int argc = *argcp;
 	char **argv = *argvp;
+	int ret = 0;
 
 	if (get_addr64(&locator, *argv) < 0) {
 		fprintf(stderr, "Bad locator: %s\n", *argv);
@@ -836,7 +847,8 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
 
 	argc--; argv++;
 
-	rta_addattr64(rta, 1024, ILA_ATTR_LOCATOR, locator);
+	if (rta_addattr64(rta, 1024, ILA_ATTR_LOCATOR, locator))
+		return -1;
 
 	while (argc > 0) {
 		if (strcmp(*argv, "csum-mode") == 0) {
@@ -849,8 +861,8 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
 				invarg("\"csum-mode\" value is invalid\n",
 				       *argv);
 
-			rta_addattr8(rta, 1024, ILA_ATTR_CSUM_MODE,
-				     (__u8)csum_mode);
+			ret = rta_addattr8(rta, 1024, ILA_ATTR_CSUM_MODE,
+					   (__u8)csum_mode);
 
 			argc--; argv++;
 		} else if (strcmp(*argv, "ident-type") == 0) {
@@ -863,8 +875,8 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
 				invarg("\"ident-type\" value is invalid\n",
 				       *argv);
 
-			rta_addattr8(rta, 1024, ILA_ATTR_IDENT_TYPE,
-				     (__u8)ident_type);
+			ret = rta_addattr8(rta, 1024, ILA_ATTR_IDENT_TYPE,
+					   (__u8)ident_type);
 
 			argc--; argv++;
 		} else if (strcmp(*argv, "hook-type") == 0) {
@@ -877,13 +889,15 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
 				invarg("\"hook-type\" value is invalid\n",
 				       *argv);
 
-			rta_addattr8(rta, 1024, ILA_ATTR_HOOK_TYPE,
-				     (__u8)hook_type);
+			ret = rta_addattr8(rta, 1024, ILA_ATTR_HOOK_TYPE,
+					   (__u8)hook_type);
 
 			argc--; argv++;
 		} else {
 			break;
 		}
+		if (ret)
+			break;
 	}
 
 	/* argv is currently the first unparsed argument,
@@ -893,7 +907,7 @@ static int parse_encap_ila(struct rtattr *rta, size_t len,
 	*argcp = argc + 1;
 	*argvp = argv - 1;
 
-	return 0;
+	return ret;
 }
 
 static int parse_encap_ip6(struct rtattr *rta, size_t len,
@@ -902,6 +916,7 @@ static int parse_encap_ip6(struct rtattr *rta, size_t len,
 	int id_ok = 0, dst_ok = 0, tos_ok = 0, ttl_ok = 0;
 	char **argv = *argvp;
 	int argc = *argcp;
+	int ret = 0;
 
 	while (argc > 0) {
 		if (strcmp(*argv, "id") == 0) {
@@ -912,7 +927,7 @@ static int parse_encap_ip6(struct rtattr *rta, size_t len,
 				duparg2("id", *argv);
 			if (get_be64(&id, *argv, 0))
 				invarg("\"id\" value is invalid\n", *argv);
-			rta_addattr64(rta, len, LWTUNNEL_IP6_ID, id);
+			ret = rta_addattr64(rta, len, LWTUNNEL_IP6_ID, id);
 		} else if (strcmp(*argv, "dst") == 0) {
 			inet_prefix addr;
 
@@ -920,8 +935,8 @@ static int parse_encap_ip6(struct rtattr *rta, size_t len,
 			if (dst_ok++)
 				duparg2("dst", *argv);
 			get_addr(&addr, *argv, AF_INET6);
-			rta_addattr_l(rta, len, LWTUNNEL_IP6_DST,
-				      &addr.data, addr.bytelen);
+			ret = rta_addattr_l(rta, len, LWTUNNEL_IP6_DST,
+					    &addr.data, addr.bytelen);
 		} else if (strcmp(*argv, "tc") == 0) {
 			__u32 tc;
 
@@ -930,7 +945,7 @@ static int parse_encap_ip6(struct rtattr *rta, size_t len,
 				duparg2("tc", *argv);
 			if (rtnl_dsfield_a2n(&tc, *argv))
 				invarg("\"tc\" value is invalid\n", *argv);
-			rta_addattr8(rta, len, LWTUNNEL_IP6_TC, tc);
+			ret = rta_addattr8(rta, len, LWTUNNEL_IP6_TC, tc);
 		} else if (strcmp(*argv, "hoplimit") == 0) {
 			__u8 hoplimit;
 
@@ -940,10 +955,13 @@ static int parse_encap_ip6(struct rtattr *rta, size_t len,
 			if (get_u8(&hoplimit, *argv, 0))
 				invarg("\"hoplimit\" value is invalid\n",
 				       *argv);
-			rta_addattr8(rta, len, LWTUNNEL_IP6_HOPLIMIT, hoplimit);
+			ret = rta_addattr8(rta, len, LWTUNNEL_IP6_HOPLIMIT,
+					   hoplimit);
 		} else {
 			break;
 		}
+		if (ret)
+			break;
 		argc--; argv++;
 	}
 
@@ -954,7 +972,7 @@ static int parse_encap_ip6(struct rtattr *rta, size_t len,
 	*argcp = argc + 1;
 	*argvp = argv - 1;
 
-	return 0;
+	return ret;
 }
 
 static void lwt_bpf_usage(void)
@@ -1021,6 +1039,7 @@ int lwt_parse_encap(struct rtattr *rta, size_t len, int *argcp, char ***argvp)
 	int argc = *argcp;
 	char **argv = *argvp;
 	__u16 type;
+	int ret = 0;
 
 	NEXT_ARG();
 	type = read_encap_type(*argv);
@@ -1037,37 +1056,40 @@ int lwt_parse_encap(struct rtattr *rta, size_t len, int *argcp, char ***argvp)
 	nest = rta_nest(rta, 1024, RTA_ENCAP);
 	switch (type) {
 	case LWTUNNEL_ENCAP_MPLS:
-		parse_encap_mpls(rta, len, &argc, &argv);
+		ret = parse_encap_mpls(rta, len, &argc, &argv);
 		break;
 	case LWTUNNEL_ENCAP_IP:
-		parse_encap_ip(rta, len, &argc, &argv);
+		ret = parse_encap_ip(rta, len, &argc, &argv);
 		break;
 	case LWTUNNEL_ENCAP_ILA:
-		parse_encap_ila(rta, len, &argc, &argv);
+		ret = parse_encap_ila(rta, len, &argc, &argv);
 		break;
 	case LWTUNNEL_ENCAP_IP6:
-		parse_encap_ip6(rta, len, &argc, &argv);
+		ret = parse_encap_ip6(rta, len, &argc, &argv);
 		break;
 	case LWTUNNEL_ENCAP_BPF:
 		if (parse_encap_bpf(rta, len, &argc, &argv) < 0)
 			exit(-1);
 		break;
 	case LWTUNNEL_ENCAP_SEG6:
-		parse_encap_seg6(rta, len, &argc, &argv);
+		ret = parse_encap_seg6(rta, len, &argc, &argv);
 		break;
 	case LWTUNNEL_ENCAP_SEG6_LOCAL:
-		parse_encap_seg6local(rta, len, &argc, &argv);
+		ret = parse_encap_seg6local(rta, len, &argc, &argv);
 		break;
 	default:
 		fprintf(stderr, "Error: unsupported encap type\n");
 		break;
 	}
+	if (ret)
+		return ret;
+
 	rta_nest_end(rta, nest);
 
-	rta_addattr16(rta, 1024, RTA_ENCAP_TYPE, type);
+	ret = rta_addattr16(rta, 1024, RTA_ENCAP_TYPE, type);
 
 	*argcp = argc;
 	*argvp = argv;
 
-	return 0;
+	return ret;
 }
-- 
2.18.0
