Netdev List
 help / color / mirror / Atom feed
* [PATCH v4 net-next 0/4]
From: Brad Mouring @ 2018-03-12 21:34 UTC (permalink / raw)
  To: Nicolas Ferre, Rob Herring, David S . Miller, Michael Grzeschik,
	Andrew Lunn
  Cc: Mark Rutland, netdev, Julia Cartwright, devicetree
In-Reply-To: <20180312175956.GS27783@lunn.ch>


Consider the situation where a macb netdev is connected through
a phydev that sits on a mii bus other than the one provided to
this particular netdev. This situation is what this patchset aims
to accomplish through the existing phy-handle optional binding.

This optional binding (as described in the ethernet DT bindings doc)
directs the netdev to the phydev to use. This is precisely the
situation this patchset aims to solve, so it makes sense to introduce
the functionality to this driver (where the physical layout discussed
was encountered).

The devicetree snippet would look something like this:

...
   ethernet@feedf00d {
           ...
           phy-handle = <&phy0> // the first netdev is physically wired to phy0
           ...
           phy0: phy@0 {
                   ...
                   reg = <0x0> // MDIO address 0
                   ...
           }
           phy1: phy@1 {
                   ...
                   reg = <0x1> // MDIO address 1
                   ...
           }
   ...
   }

   ethernet@deadbeef {
           ...
           phy-handle = <&phy1> // tells the driver to use phy1 on the
                                // first mac's mdio bus (it's wired thusly)
           ...
   }
...

The work done to add the phy_node in the first place (dacdbb4dfc1a1:
"net: macb: add fixed-link node support") will consume the
device_node (if found).

v2: Reorganization of mii probe/init functions, suggested by Andrew Lunn
v3: Moved some of the bus init code back into init (erroneously moved to probe)
    some style issues, and an unintialized variable warning addressed.
v4: Add Reviewed-by: tags
    Skip fallback code if phy-handle phandle is found

^ permalink raw reply

* [PATCH v4 net-next 2/4] net: macb: Remove redundant poll irq assignment
From: Brad Mouring @ 2018-03-12 21:34 UTC (permalink / raw)
  To: Nicolas Ferre, Rob Herring, David S . Miller, Michael Grzeschik,
	Andrew Lunn
  Cc: Mark Rutland, netdev, Julia Cartwright, devicetree, Brad Mouring
In-Reply-To: <20180312213435.115174-1-brad.mouring@ni.com>

In phy_device's general probe, this device will already be set for
phy register polling, rendering this code redundant.

Signed-off-by: Brad Mouring <brad.mouring@ni.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/cadence/macb_main.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 9b6195fbbf8e..db1dc301bed3 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -593,9 +593,6 @@ static int macb_mii_init(struct macb *bp)
 	if (np) {
 		err = of_mdiobus_register(bp->mii_bus, np);
 	} else {
-		for (i = 0; i < PHY_MAX_ADDR; i++)
-			bp->mii_bus->irq[i] = PHY_POLL;
-
 		if (pdata)
 			bp->mii_bus->phy_mask = pdata->phy_mask;
 
-- 
2.16.2

^ permalink raw reply related

* [PATCH v4 net-next 1/4] net: macb: Reorganize macb_mii bringup
From: Brad Mouring @ 2018-03-12 21:34 UTC (permalink / raw)
  To: Nicolas Ferre, Rob Herring, David S . Miller, Michael Grzeschik,
	Andrew Lunn
  Cc: Mark Rutland, netdev, Julia Cartwright, devicetree, Brad Mouring
In-Reply-To: <20180312213435.115174-1-brad.mouring@ni.com>

The macb mii setup (mii_probe() and mii_init()) previously was
somewhat interspersed, likely a result of organic growth and hacking.

This change moves mii bus registration into mii_init and probing the
bus for devices into mii_probe.

Signed-off-by: Brad Mouring <brad.mouring@ni.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/cadence/macb_main.c | 79 +++++++++++++++++---------------
 1 file changed, 41 insertions(+), 38 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index e84afcf1ecb5..9b6195fbbf8e 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -472,8 +472,42 @@ static int macb_mii_probe(struct net_device *dev)
 	struct macb *bp = netdev_priv(dev);
 	struct macb_platform_data *pdata;
 	struct phy_device *phydev;
-	int phy_irq;
-	int ret;
+	struct device_node *np;
+	int phy_irq, ret, i;
+
+	pdata = dev_get_platdata(&bp->pdev->dev);
+	np = bp->pdev->dev.of_node;
+	ret = 0;
+
+	if (np) {
+		if (of_phy_is_fixed_link(np)) {
+			if (of_phy_register_fixed_link(np) < 0) {
+				dev_err(&bp->pdev->dev,
+					"broken fixed-link specification\n");
+				return -ENODEV;
+			}
+			bp->phy_node = of_node_get(np);
+		} else {
+			/* fallback to standard phy registration if no phy were
+			 * found during dt phy registration
+			 */
+			if (!phy_find_first(bp->mii_bus)) {
+				for (i = 0; i < PHY_MAX_ADDR; i++) {
+					struct phy_device *phydev;
+
+					phydev = mdiobus_scan(bp->mii_bus, i);
+					if (IS_ERR(phydev) &&
+					    PTR_ERR(phydev) != -ENODEV) {
+						ret = PTR_ERR(phydev);
+						break;
+					}
+				}
+
+				if (ret)
+					return -ENODEV;
+			}
+		}
+	}
 
 	if (bp->phy_node) {
 		phydev = of_phy_connect(dev, bp->phy_node,
@@ -488,7 +522,6 @@ static int macb_mii_probe(struct net_device *dev)
 			return -ENXIO;
 		}
 
-		pdata = dev_get_platdata(&bp->pdev->dev);
 		if (pdata) {
 			if (gpio_is_valid(pdata->phy_irq_pin)) {
 				ret = devm_gpio_request(&bp->pdev->dev,
@@ -533,7 +566,7 @@ static int macb_mii_init(struct macb *bp)
 {
 	struct macb_platform_data *pdata;
 	struct device_node *np;
-	int err = -ENXIO, i;
+	int err;
 
 	/* Enable management port */
 	macb_writel(bp, NCR, MACB_BIT(MPE));
@@ -556,39 +589,9 @@ static int macb_mii_init(struct macb *bp)
 	dev_set_drvdata(&bp->dev->dev, bp->mii_bus);
 
 	np = bp->pdev->dev.of_node;
-	if (np) {
-		if (of_phy_is_fixed_link(np)) {
-			if (of_phy_register_fixed_link(np) < 0) {
-				dev_err(&bp->pdev->dev,
-					"broken fixed-link specification\n");
-				goto err_out_unregister_bus;
-			}
-			bp->phy_node = of_node_get(np);
-
-			err = mdiobus_register(bp->mii_bus);
-		} else {
-			/* try dt phy registration */
-			err = of_mdiobus_register(bp->mii_bus, np);
-
-			/* fallback to standard phy registration if no phy were
-			 * found during dt phy registration
-			 */
-			if (!err && !phy_find_first(bp->mii_bus)) {
-				for (i = 0; i < PHY_MAX_ADDR; i++) {
-					struct phy_device *phydev;
-
-					phydev = mdiobus_scan(bp->mii_bus, i);
-					if (IS_ERR(phydev) &&
-					    PTR_ERR(phydev) != -ENODEV) {
-						err = PTR_ERR(phydev);
-						break;
-					}
-				}
 
-				if (err)
-					goto err_out_unregister_bus;
-			}
-		}
+	if (np) {
+		err = of_mdiobus_register(bp->mii_bus, np);
 	} else {
 		for (i = 0; i < PHY_MAX_ADDR; i++)
 			bp->mii_bus->irq[i] = PHY_POLL;
@@ -610,10 +613,10 @@ static int macb_mii_init(struct macb *bp)
 
 err_out_unregister_bus:
 	mdiobus_unregister(bp->mii_bus);
-err_out_free_mdiobus:
-	of_node_put(bp->phy_node);
 	if (np && of_phy_is_fixed_link(np))
 		of_phy_deregister_fixed_link(np);
+err_out_free_mdiobus:
+	of_node_put(bp->phy_node);
 	mdiobus_free(bp->mii_bus);
 err_out:
 	return err;
-- 
2.16.2

^ permalink raw reply related

* Re: [PATCH net] kcm: lock lower socket in kcm_attach
From: Eric Biggers @ 2018-03-12 21:33 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Tom Herbert, David S. Miller, Linux Kernel Network Developers
In-Reply-To: <CALx6S35TJ44b1UM=OqqwfDMFrXFjDJj==shW6Z_8zM3VH9v6WA@mail.gmail.com>

On Mon, Mar 12, 2018 at 02:25:41PM -0700, Tom Herbert wrote:
> On Mon, Mar 12, 2018 at 2:09 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> > On Mon, Mar 12, 2018 at 02:04:12PM -0700, Tom Herbert wrote:
> >> Need to lock lower socket in order to provide mutual exclusion
> >> with kcm_unattach.
> >>
> >> Fixes: ab7ac4eb9832e32a09f4e804 ("kcm: Kernel Connection Multiplexor module")
> >> Signed-off-by: Tom Herbert <tom@quantonium.net>
> >> ---
> >
> > Is this fixing the syzbot-reported bug "KASAN: use-after-free Read in
> > get_work_pool"?  If so, please add:
> >
> > Reported-by: syzbot+ea75c0ffcd353d32515f064aaebefc5279e6161e@syzkaller.appspotmail.com
> 
> Yeah, I was looking for a "reported by". I didn't see it in the email
> from syzbot. Where is this found?
> 
> Thanks,
> Tom

This was an old bug report that was sent out before syzbot was updated to
suggest a Reported-by line.  But you can still use Reported-by for these old
bugs.  I took the bug ID from the From: header of the email.

Eric

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Alexei Starovoitov @ 2018-03-12 21:31 UTC (permalink / raw)
  To: Song Liu
  Cc: netdev@vger.kernel.org, ast@kernel.org, Peter Zijlstra,
	Daniel Borkmann, Kernel Team, hannes@cmpxchg.org, Teng Qin
In-Reply-To: <664505EF-F681-4641-B49A-6CF4F5CFB5C8@fb.com>

On 3/12/18 2:12 PM, Song Liu wrote:
>
>> On Mar 12, 2018, at 2:00 PM, Alexei Starovoitov <ast@fb.com> wrote:
>>
>> On 3/12/18 1:39 PM, Song Liu wrote:
>>> +	page = find_get_page(vma->vm_file->f_mapping, 0);
>>
>> did you test it with config_debug_atomic_sleep ?
>> it should have complained...
>
> Yeah, I have CONFIG_DEBUG_ATOMIC_SLEEP=y.
>
> I think find_get_page() will not sleep. The variation find_get_page_flags()
> may sleep with flag FGP_CREAT.

I see. gfp_mask == 0 and no locks. should work indeed.
curious how perf report looks like for heavy bpf_get_stackid() usage?

^ permalink raw reply

* Re: [net-next 0/6][pull request] 10GbE Intel Wired LAN Driver Updates 2018-03-12
From: David Miller @ 2018-03-12 21:30 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <20180312201215.2854-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Mon, 12 Mar 2018 13:12:09 -0700

> This series contains updates to ixgbe and ixgbevf only.

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH net] kcm: lock lower socket in kcm_attach
From: Tom Herbert @ 2018-03-12 21:25 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Tom Herbert, David S. Miller, Linux Kernel Network Developers
In-Reply-To: <20180312210941.GC230165@gmail.com>

On Mon, Mar 12, 2018 at 2:09 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> On Mon, Mar 12, 2018 at 02:04:12PM -0700, Tom Herbert wrote:
>> Need to lock lower socket in order to provide mutual exclusion
>> with kcm_unattach.
>>
>> Fixes: ab7ac4eb9832e32a09f4e804 ("kcm: Kernel Connection Multiplexor module")
>> Signed-off-by: Tom Herbert <tom@quantonium.net>
>> ---
>
> Is this fixing the syzbot-reported bug "KASAN: use-after-free Read in
> get_work_pool"?  If so, please add:
>
> Reported-by: syzbot+ea75c0ffcd353d32515f064aaebefc5279e6161e@syzkaller.appspotmail.com

Yeah, I was looking for a "reported by". I didn't see it in the email
from syzbot. Where is this found?

Thanks,
Tom

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Song Liu @ 2018-03-12 21:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev@vger.kernel.org, ast@kernel.org, Peter Zijlstra,
	Daniel Borkmann, Kernel Team, hannes@cmpxchg.org, Teng Qin
In-Reply-To: <97bb9fff-beda-6ace-8ed2-57f1f2de8c69@fb.com>


> On Mar 12, 2018, at 2:00 PM, Alexei Starovoitov <ast@fb.com> wrote:
> 
> On 3/12/18 1:39 PM, Song Liu wrote:
>> +	page = find_get_page(vma->vm_file->f_mapping, 0);
> 
> did you test it with config_debug_atomic_sleep ?
> it should have complained...

Yeah, I have CONFIG_DEBUG_ATOMIC_SLEEP=y. 

I think find_get_page() will not sleep. The variation find_get_page_flags()
may sleep with flag FGP_CREAT. 

Song

^ permalink raw reply

* Re: [PATCH net] kcm: lock lower socket in kcm_attach
From: Eric Biggers @ 2018-03-12 21:09 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <20180312210412.1875-1-tom@quantonium.net>

On Mon, Mar 12, 2018 at 02:04:12PM -0700, Tom Herbert wrote:
> Need to lock lower socket in order to provide mutual exclusion
> with kcm_unattach.
> 
> Fixes: ab7ac4eb9832e32a09f4e804 ("kcm: Kernel Connection Multiplexor module")
> Signed-off-by: Tom Herbert <tom@quantonium.net>
> ---

Is this fixing the syzbot-reported bug "KASAN: use-after-free Read in
get_work_pool"?  If so, please add:

Reported-by: syzbot+ea75c0ffcd353d32515f064aaebefc5279e6161e@syzkaller.appspotmail.com

^ permalink raw reply

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-03-12 21:08 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici
In-Reply-To: <4bfb0f35-cbb4-b9e1-b8b6-9fb723c29140@intel.com>

Mon, Mar 12, 2018 at 09:58:06PM CET, sridhar.samudrala@intel.com wrote:
>
>
>On 3/12/2018 1:12 PM, Jiri Pirko wrote:
>> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > netdev is present with the same MAC address. It allows live migration
>> > of a VM with a direct attached VF without the need to setup a bond/team
>> > between a VF and virtio net device in the guest.
>> > 
>> > The hypervisor needs to enable only one datapath at any time so that
>> > packets don't get looped back to the VM over the other datapath. When a VF
>> > is plugged, the virtio datapath link state can be marked as down. The
>> > hypervisor needs to unplug the VF device from the guest on the source host
>> > and reset the MAC filter of the VF to initiate failover of datapath to
>> > virtio before starting the migration. After the migration is completed,
>> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> > to the guest to switch over to VF datapath.
>> > 
>> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> > created that acts as a master device and tracks the state of the 2 lower
>> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> > passthru device with the same MAC is registered as 'active' netdev.
>> > 
>> > This patch is based on the discussion initiated by Jesse on this thread.
>> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
>> [...]
>> 
>> 
>> > +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>> > +{
>> > +	struct virtnet_bypass_info *vbi;
>> > +	struct net_device *dev;
>> > +	bool backup;
>> > +	int ret;
>> > +
>> > +	if (child_netdev->addr_len != ETH_ALEN)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	/* We will use the MAC address to locate the virtnet_bypass netdev
>> > +	 * to associate with the child netdev. If we don't find a matching
>> > +	 * bypass netdev, move on.
>> > +	 */
>> > +	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>> > +				       child_netdev->perm_addr);
>> > +	if (!dev)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	vbi = netdev_priv(dev);
>> > +	backup = (child_netdev->dev.parent == dev->dev.parent);
>> > +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > +			rtnl_dereference(vbi->active_netdev)) {
>> > +		netdev_info(dev,
>> > +			    "%s attempting to join bypass dev when %s already present\n",
>> > +			    child_netdev->name, backup ? "backup" : "active");
>> > +		return NOTIFY_DONE;
>> > +	}
>> > +
>> > +	/* Avoid non pci devices as active netdev */
>> > +	if (!backup && (!child_netdev->dev.parent ||
>> > +			!dev_is_pci(child_netdev->dev.parent)))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	ret = netdev_rx_handler_register(child_netdev,
>> > +					 virtnet_bypass_handle_frame, dev);
>> > +	if (ret != 0) {
>> > +		netdev_err(child_netdev,
>> > +			   "can not register bypass receive handler (err = %d)\n",
>> > +			   ret);
>> > +		goto rx_handler_failed;
>> > +	}
>> > +
>> > +	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>> > +	if (ret != 0) {
>> > +		netdev_err(child_netdev,
>> > +			   "can not set master device %s (err = %d)\n",
>> > +			   dev->name, ret);
>> > +		goto upper_link_failed;
>> > +	}
>> > +
>> > +	child_netdev->flags |= IFF_SLAVE;
>> > +
>> > +	if (netif_running(dev)) {
>> > +		ret = dev_open(child_netdev);
>> > +		if (ret && (ret != -EBUSY)) {
>> > +			netdev_err(dev, "Opening child %s failed ret:%d\n",
>> > +				   child_netdev->name, ret);
>> > +			goto err_interface_up;
>> > +		}
>> > +	}
>> Much of this function is copy of netvsc_vf_join, should be shared with
>> netvsc.
>
>Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified
>to handle registration of 2 child netdevs(backup virtio and active vf).� In case
>of netvsc, they only register VF. There is no backup netdev.
>I think trying to make this code shareable with netvsc will introduce additional
>checks and make it convoluted.

No problem.


>
>
>> 
>> 
>> > +
>> > +	/* Align MTU of child with master */
>> > +	ret = dev_set_mtu(child_netdev, dev->mtu);
>> > +	if (ret) {
>> > +		netdev_err(dev,
>> > +			   "unable to change mtu of %s to %u register failed\n",
>> > +			   child_netdev->name, dev->mtu);
>> > +		goto err_set_mtu;
>> > +	}
>> > +
>> > +	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>> > +
>> > +	netdev_info(dev, "registering %s\n", child_netdev->name);
>> > +
>> > +	dev_hold(child_netdev);
>> > +	if (backup) {
>> > +		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>> > +		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>> > +	} else {
>> > +		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>> > +		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>> > +		dev->min_mtu = child_netdev->min_mtu;
>> > +		dev->max_mtu = child_netdev->max_mtu;
>> > +	}
>> > +
>> > +	return NOTIFY_OK;
>> > +
>> > +err_set_mtu:
>> > +	dev_close(child_netdev);
>> > +err_interface_up:
>> > +	netdev_upper_dev_unlink(child_netdev, dev);
>> > +	child_netdev->flags &= ~IFF_SLAVE;
>> > +upper_link_failed:
>> > +	netdev_rx_handler_unregister(child_netdev);
>> > +rx_handler_failed:
>> > +	return NOTIFY_DONE;
>> > +}
>> > +
>> > +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>> > +{
>> > +	struct virtnet_bypass_info *vbi;
>> > +	struct net_device *dev, *backup;
>> > +
>> > +	dev = get_virtnet_bypass_byref(child_netdev);
>> > +	if (!dev)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	vbi = netdev_priv(dev);
>> > +
>> > +	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>> > +
>> > +	netdev_rx_handler_unregister(child_netdev);
>> > +	netdev_upper_dev_unlink(child_netdev, dev);
>> > +	child_netdev->flags &= ~IFF_SLAVE;
>> > +
>> > +	if (child_netdev->dev.parent == dev->dev.parent) {
>> > +		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>> > +	} else {
>> > +		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>> > +		backup = rtnl_dereference(vbi->backup_netdev);
>> > +		if (backup) {
>> > +			dev->min_mtu = backup->min_mtu;
>> > +			dev->max_mtu = backup->max_mtu;
>> > +		}
>> > +	}
>> > +
>> > +	dev_put(child_netdev);
>> > +
>> > +	return NOTIFY_OK;
>> > +}
>> > +
>> > +static int virtnet_bypass_update_link(struct net_device *child_netdev)
>> > +{
>> > +	struct net_device *dev, *active, *backup;
>> > +	struct virtnet_bypass_info *vbi;
>> > +
>> > +	dev = get_virtnet_bypass_byref(child_netdev);
>> > +	if (!dev || !netif_running(dev))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	vbi = netdev_priv(dev);
>> > +
>> > +	active = rtnl_dereference(vbi->active_netdev);
>> > +	backup = rtnl_dereference(vbi->backup_netdev);
>> > +
>> > +	if ((active && virtnet_bypass_xmit_ready(active)) ||
>> > +	    (backup && virtnet_bypass_xmit_ready(backup))) {
>> > +		netif_carrier_on(dev);
>> > +		netif_tx_wake_all_queues(dev);
>> > +	} else {
>> > +		netif_carrier_off(dev);
>> > +		netif_tx_stop_all_queues(dev);
>> > +	}
>> > +
>> > +	return NOTIFY_OK;
>> > +}
>> > +
>> > +static int virtnet_bypass_event(struct notifier_block *this,
>> > +				unsigned long event, void *ptr)
>> > +{
>> > +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> > +
>> > +	/* Skip our own events */
>> > +	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	/* Avoid non-Ethernet type devices */
>> > +	if (event_dev->type != ARPHRD_ETHER)
>> > +		return NOTIFY_DONE;
>> > +
>> > +	/* Avoid Vlan dev with same MAC registering as child dev */
>> > +	if (is_vlan_dev(event_dev))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	/* Avoid Bonding master dev with same MAC registering as child dev */
>> > +	if ((event_dev->priv_flags & IFF_BONDING) &&
>> > +	    (event_dev->flags & IFF_MASTER))
>> > +		return NOTIFY_DONE;
>> > +
>> > +	switch (event) {
>> > +	case NETDEV_REGISTER:
>> > +		return virtnet_bypass_register_child(event_dev);
>> > +	case NETDEV_UNREGISTER:
>> > +		return virtnet_bypass_unregister_child(event_dev);
>> > +	case NETDEV_UP:
>> > +	case NETDEV_DOWN:
>> > +	case NETDEV_CHANGE:
>> > +		return virtnet_bypass_update_link(event_dev);
>> > +	default:
>> > +		return NOTIFY_DONE;
>> > +	}
>> > +}
>> For example this function is 1:1 copy of netvsc, even with comments
>> and bugs (like IFF_BODING check).
>> 
>> This is also something that should be shared with netvsc.
>
>The problem is with the calls that are made from this function.
>Sharing would have been possible if both virtio and netvsc went with
>same netdev model.

ops? You can still share.


>
>> 
>> 
>> > +
>> > +static struct notifier_block virtnet_bypass_notifier = {
>> > +	.notifier_call = virtnet_bypass_event,
>> > +};
>> > +
>> > +static int virtnet_bypass_create(struct virtnet_info *vi)
>> > +{
>> > +	struct net_device *backup_netdev = vi->dev;
>> > +	struct device *dev = &vi->vdev->dev;
>> > +	struct net_device *bypass_netdev;
>> > +	int res;
>> > +
>> > +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> > +	 * that most devices being bonded won't have too many queues.
>> > +	 */
>> > +	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>> > +					  16);
>> > +	if (!bypass_netdev) {
>> > +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> > +		return -ENOMEM;
>> > +	}
>> > +
>> > +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> > +	SET_NETDEV_DEV(bypass_netdev, dev);
>> > +
>> > +	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>> > +	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>> > +
>> > +	/* Initialize the device options */
>> > +	bypass_netdev->flags |= IFF_MASTER;
>> > +	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
>> No clue why you set IFF_BONDING here...
>
>This is to indicate to the userspace that this is a master bond device.

This is not a master bond device! If you say this, it makes me really
worried....



>May be it is sufficient to just set IFF_MASTER.
>
>> 
>> 
>> 
>> 
>> 
>

^ permalink raw reply

* [PATCH net] kcm: lock lower socket in kcm_attach
From: Tom Herbert @ 2018-03-12 21:04 UTC (permalink / raw)
  To: davem; +Cc: netdev, ebiggers3, Tom Herbert

Need to lock lower socket in order to provide mutual exclusion
with kcm_unattach.

Fixes: ab7ac4eb9832e32a09f4e804 ("kcm: Kernel Connection Multiplexor module")
Signed-off-by: Tom Herbert <tom@quantonium.net>
---
 net/kcm/kcmsock.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index f297d53a11aa..34355fd19f27 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -1381,24 +1381,32 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
 		.parse_msg = kcm_parse_func_strparser,
 		.read_sock_done = kcm_read_sock_done,
 	};
-	int err;
+	int err = 0;
 
 	csk = csock->sk;
 	if (!csk)
 		return -EINVAL;
 
+	lock_sock(csk);
+
 	/* Only allow TCP sockets to be attached for now */
 	if ((csk->sk_family != AF_INET && csk->sk_family != AF_INET6) ||
-	    csk->sk_protocol != IPPROTO_TCP)
-		return -EOPNOTSUPP;
+	    csk->sk_protocol != IPPROTO_TCP) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
 
 	/* Don't allow listeners or closed sockets */
-	if (csk->sk_state == TCP_LISTEN || csk->sk_state == TCP_CLOSE)
-		return -EOPNOTSUPP;
+	if (csk->sk_state == TCP_LISTEN || csk->sk_state == TCP_CLOSE) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
 
 	psock = kmem_cache_zalloc(kcm_psockp, GFP_KERNEL);
-	if (!psock)
-		return -ENOMEM;
+	if (!psock) {
+		err = -ENOMEM;
+		goto out;
+	}
 
 	psock->mux = mux;
 	psock->sk = csk;
@@ -1407,7 +1415,7 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
 	err = strp_init(&psock->strp, csk, &cb);
 	if (err) {
 		kmem_cache_free(kcm_psockp, psock);
-		return err;
+		goto out;
 	}
 
 	write_lock_bh(&csk->sk_callback_lock);
@@ -1419,7 +1427,8 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
 		write_unlock_bh(&csk->sk_callback_lock);
 		strp_done(&psock->strp);
 		kmem_cache_free(kcm_psockp, psock);
-		return -EALREADY;
+		err = -EALREADY;
+		goto out;
 	}
 
 	psock->save_data_ready = csk->sk_data_ready;
@@ -1455,7 +1464,10 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
 	/* Schedule RX work in case there are already bytes queued */
 	strp_check_rcv(&psock->strp);
 
-	return 0;
+out:
+	release_sock(csk);
+
+	return err;
 }
 
 static int kcm_attach_ioctl(struct socket *sock, struct kcm_attach *info)
@@ -1507,6 +1519,7 @@ static void kcm_unattach(struct kcm_psock *psock)
 
 	if (WARN_ON(psock->rx_kcm)) {
 		write_unlock_bh(&csk->sk_callback_lock);
+		release_sock(csk);
 		return;
 	}
 
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Alexei Starovoitov @ 2018-03-12 21:00 UTC (permalink / raw)
  To: Song Liu, netdev, ast, peterz, daniel; +Cc: kernel-team, hannes, qinteng
In-Reply-To: <20180312203957.2025833-2-songliubraving@fb.com>

On 3/12/18 1:39 PM, Song Liu wrote:
> +	page = find_get_page(vma->vm_file->f_mapping, 0);

did you test it with config_debug_atomic_sleep ?
it should have complained...

^ permalink raw reply

* [PATCH iproute2] Revert "iproute: "list/flush/save default" selected all of the routes"
From: Stephen Hemminger @ 2018-03-12 21:03 UTC (permalink / raw)
  To: bluca, green; +Cc: netdev, Stephen Hemminger

This reverts commit 9135c4d6037ff9f1818507bac0049fc44db8c3d2.

Debian maintainer found that basic command:
	# ip route flush all
No longer worked as expected which breaks user scripts and
expectations. It no longer flushed all IPv4 routes.

Reported-by: Luca Boccassi <bluca@debian.org>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 ip/iproute.c | 65 ++++++++++++++++++------------------------------------------
 lib/utils.c  | 13 ++++++++++++
 2 files changed, 32 insertions(+), 46 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index bf886fda9d76..32c93ed5abd9 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -191,42 +191,20 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 		return 0;
 	if ((filter.tos^r->rtm_tos)&filter.tosmask)
 		return 0;
-	if (filter.rdst.family) {
-		if (r->rtm_family != filter.rdst.family ||
-		    filter.rdst.bitlen > r->rtm_dst_len)
-			return 0;
-	} else if (filter.rdst.flags & PREFIXLEN_SPECIFIED) {
-		if (filter.rdst.bitlen > r->rtm_dst_len)
-			return 0;
-	}
-	if (filter.mdst.family) {
-		if (r->rtm_family != filter.mdst.family ||
-		    (filter.mdst.bitlen >= 0 &&
-		     filter.mdst.bitlen < r->rtm_dst_len))
-			return 0;
-	} else if (filter.mdst.flags & PREFIXLEN_SPECIFIED) {
-		if (filter.mdst.bitlen >= 0 &&
-		    filter.mdst.bitlen < r->rtm_dst_len)
-			return 0;
-	}
-	if (filter.rsrc.family) {
-		if (r->rtm_family != filter.rsrc.family ||
-		    filter.rsrc.bitlen > r->rtm_src_len)
-			return 0;
-	} else if (filter.rsrc.flags & PREFIXLEN_SPECIFIED) {
-		if (filter.rsrc.bitlen > r->rtm_src_len)
-			return 0;
-	}
-	if (filter.msrc.family) {
-		if (r->rtm_family != filter.msrc.family ||
-		    (filter.msrc.bitlen >= 0 &&
-		     filter.msrc.bitlen < r->rtm_src_len))
-			return 0;
-	} else if (filter.msrc.flags & PREFIXLEN_SPECIFIED) {
-		if (filter.msrc.bitlen >= 0 &&
-		    filter.msrc.bitlen < r->rtm_src_len)
-			return 0;
-	}
+	if (filter.rdst.family &&
+	    (r->rtm_family != filter.rdst.family || filter.rdst.bitlen > r->rtm_dst_len))
+		return 0;
+	if (filter.mdst.family &&
+	    (r->rtm_family != filter.mdst.family ||
+	     (filter.mdst.bitlen >= 0 && filter.mdst.bitlen < r->rtm_dst_len)))
+		return 0;
+	if (filter.rsrc.family &&
+	    (r->rtm_family != filter.rsrc.family || filter.rsrc.bitlen > r->rtm_src_len))
+		return 0;
+	if (filter.msrc.family &&
+	    (r->rtm_family != filter.msrc.family ||
+	     (filter.msrc.bitlen >= 0 && filter.msrc.bitlen < r->rtm_src_len)))
+		return 0;
 	if (filter.rvia.family) {
 		int family = r->rtm_family;
 
@@ -243,9 +221,7 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 
 	if (tb[RTA_DST])
 		memcpy(&dst.data, RTA_DATA(tb[RTA_DST]), (r->rtm_dst_len+7)/8);
-	if (filter.rsrc.family || filter.msrc.family ||
-	    filter.rsrc.flags & PREFIXLEN_SPECIFIED ||
-	    filter.msrc.flags & PREFIXLEN_SPECIFIED) {
+	if (filter.rsrc.family || filter.msrc.family) {
 		if (tb[RTA_SRC])
 			memcpy(&src.data, RTA_DATA(tb[RTA_SRC]), (r->rtm_src_len+7)/8);
 	}
@@ -265,18 +241,15 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 			memcpy(&prefsrc.data, RTA_DATA(tb[RTA_PREFSRC]), host_len/8);
 	}
 
-	if ((filter.rdst.family || filter.rdst.flags & PREFIXLEN_SPECIFIED) &&
-	    inet_addr_match(&dst, &filter.rdst, filter.rdst.bitlen))
+	if (filter.rdst.family && inet_addr_match(&dst, &filter.rdst, filter.rdst.bitlen))
 		return 0;
-	if ((filter.mdst.family || filter.mdst.flags & PREFIXLEN_SPECIFIED) &&
+	if (filter.mdst.family && filter.mdst.bitlen >= 0 &&
 	    inet_addr_match(&dst, &filter.mdst, r->rtm_dst_len))
 		return 0;
 
-	if ((filter.rsrc.family || filter.rsrc.flags & PREFIXLEN_SPECIFIED) &&
-	    inet_addr_match(&src, &filter.rsrc, filter.rsrc.bitlen))
+	if (filter.rsrc.family && inet_addr_match(&src, &filter.rsrc, filter.rsrc.bitlen))
 		return 0;
-	if ((filter.msrc.family || filter.msrc.flags & PREFIXLEN_SPECIFIED) &&
-	    filter.msrc.bitlen >= 0 &&
+	if (filter.msrc.family && filter.msrc.bitlen >= 0 &&
 	    inet_addr_match(&src, &filter.msrc, r->rtm_src_len))
 		return 0;
 
diff --git a/lib/utils.c b/lib/utils.c
index 379739d61246..87b609f2a6bc 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -681,6 +681,19 @@ int get_prefix_1(inet_prefix *dst, char *arg, int family)
 	char *slash;
 	int err, bitlen, flags;
 
+	memset(dst, 0, sizeof(*dst));
+
+	if (strcmp(arg, "default") == 0 ||
+	    strcmp(arg, "any") == 0 ||
+	    strcmp(arg, "all") == 0) {
+		if ((family == AF_DECnet) || (family == AF_MPLS))
+			return -1;
+		dst->family = family;
+		dst->bytelen = 0;
+		dst->bitlen = 0;
+		return 0;
+	}
+
 	slash = strchr(arg, '/');
 	if (slash)
 		*slash = 0;
-- 
2.16.1

^ permalink raw reply related

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-03-12 20:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici
In-Reply-To: <20180312201231.GA17283@nanopsycho>



On 3/12/2018 1:12 PM, Jiri Pirko wrote:
> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> [...]
>
>
>> +static int virtnet_bypass_register_child(struct net_device *child_netdev)
>> +{
>> +	struct virtnet_bypass_info *vbi;
>> +	struct net_device *dev;
>> +	bool backup;
>> +	int ret;
>> +
>> +	if (child_netdev->addr_len != ETH_ALEN)
>> +		return NOTIFY_DONE;
>> +
>> +	/* We will use the MAC address to locate the virtnet_bypass netdev
>> +	 * to associate with the child netdev. If we don't find a matching
>> +	 * bypass netdev, move on.
>> +	 */
>> +	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>> +				       child_netdev->perm_addr);
>> +	if (!dev)
>> +		return NOTIFY_DONE;
>> +
>> +	vbi = netdev_priv(dev);
>> +	backup = (child_netdev->dev.parent == dev->dev.parent);
>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> +			rtnl_dereference(vbi->active_netdev)) {
>> +		netdev_info(dev,
>> +			    "%s attempting to join bypass dev when %s already present\n",
>> +			    child_netdev->name, backup ? "backup" : "active");
>> +		return NOTIFY_DONE;
>> +	}
>> +
>> +	/* Avoid non pci devices as active netdev */
>> +	if (!backup && (!child_netdev->dev.parent ||
>> +			!dev_is_pci(child_netdev->dev.parent)))
>> +		return NOTIFY_DONE;
>> +
>> +	ret = netdev_rx_handler_register(child_netdev,
>> +					 virtnet_bypass_handle_frame, dev);
>> +	if (ret != 0) {
>> +		netdev_err(child_netdev,
>> +			   "can not register bypass receive handler (err = %d)\n",
>> +			   ret);
>> +		goto rx_handler_failed;
>> +	}
>> +
>> +	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>> +	if (ret != 0) {
>> +		netdev_err(child_netdev,
>> +			   "can not set master device %s (err = %d)\n",
>> +			   dev->name, ret);
>> +		goto upper_link_failed;
>> +	}
>> +
>> +	child_netdev->flags |= IFF_SLAVE;
>> +
>> +	if (netif_running(dev)) {
>> +		ret = dev_open(child_netdev);
>> +		if (ret && (ret != -EBUSY)) {
>> +			netdev_err(dev, "Opening child %s failed ret:%d\n",
>> +				   child_netdev->name, ret);
>> +			goto err_interface_up;
>> +		}
>> +	}
> Much of this function is copy of netvsc_vf_join, should be shared with
> netvsc.

Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified
to handle registration of 2 child netdevs(backup virtio and active vf).  In case
of netvsc, they only register VF. There is no backup netdev.
I think trying to make this code shareable with netvsc will introduce additional
checks and make it convoluted.


>
>
>> +
>> +	/* Align MTU of child with master */
>> +	ret = dev_set_mtu(child_netdev, dev->mtu);
>> +	if (ret) {
>> +		netdev_err(dev,
>> +			   "unable to change mtu of %s to %u register failed\n",
>> +			   child_netdev->name, dev->mtu);
>> +		goto err_set_mtu;
>> +	}
>> +
>> +	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>> +
>> +	netdev_info(dev, "registering %s\n", child_netdev->name);
>> +
>> +	dev_hold(child_netdev);
>> +	if (backup) {
>> +		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>> +		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>> +	} else {
>> +		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>> +		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>> +		dev->min_mtu = child_netdev->min_mtu;
>> +		dev->max_mtu = child_netdev->max_mtu;
>> +	}
>> +
>> +	return NOTIFY_OK;
>> +
>> +err_set_mtu:
>> +	dev_close(child_netdev);
>> +err_interface_up:
>> +	netdev_upper_dev_unlink(child_netdev, dev);
>> +	child_netdev->flags &= ~IFF_SLAVE;
>> +upper_link_failed:
>> +	netdev_rx_handler_unregister(child_netdev);
>> +rx_handler_failed:
>> +	return NOTIFY_DONE;
>> +}
>> +
>> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>> +{
>> +	struct virtnet_bypass_info *vbi;
>> +	struct net_device *dev, *backup;
>> +
>> +	dev = get_virtnet_bypass_byref(child_netdev);
>> +	if (!dev)
>> +		return NOTIFY_DONE;
>> +
>> +	vbi = netdev_priv(dev);
>> +
>> +	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>> +
>> +	netdev_rx_handler_unregister(child_netdev);
>> +	netdev_upper_dev_unlink(child_netdev, dev);
>> +	child_netdev->flags &= ~IFF_SLAVE;
>> +
>> +	if (child_netdev->dev.parent == dev->dev.parent) {
>> +		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>> +	} else {
>> +		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>> +		backup = rtnl_dereference(vbi->backup_netdev);
>> +		if (backup) {
>> +			dev->min_mtu = backup->min_mtu;
>> +			dev->max_mtu = backup->max_mtu;
>> +		}
>> +	}
>> +
>> +	dev_put(child_netdev);
>> +
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static int virtnet_bypass_update_link(struct net_device *child_netdev)
>> +{
>> +	struct net_device *dev, *active, *backup;
>> +	struct virtnet_bypass_info *vbi;
>> +
>> +	dev = get_virtnet_bypass_byref(child_netdev);
>> +	if (!dev || !netif_running(dev))
>> +		return NOTIFY_DONE;
>> +
>> +	vbi = netdev_priv(dev);
>> +
>> +	active = rtnl_dereference(vbi->active_netdev);
>> +	backup = rtnl_dereference(vbi->backup_netdev);
>> +
>> +	if ((active && virtnet_bypass_xmit_ready(active)) ||
>> +	    (backup && virtnet_bypass_xmit_ready(backup))) {
>> +		netif_carrier_on(dev);
>> +		netif_tx_wake_all_queues(dev);
>> +	} else {
>> +		netif_carrier_off(dev);
>> +		netif_tx_stop_all_queues(dev);
>> +	}
>> +
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static int virtnet_bypass_event(struct notifier_block *this,
>> +				unsigned long event, void *ptr)
>> +{
>> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>> +
>> +	/* Skip our own events */
>> +	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>> +		return NOTIFY_DONE;
>> +
>> +	/* Avoid non-Ethernet type devices */
>> +	if (event_dev->type != ARPHRD_ETHER)
>> +		return NOTIFY_DONE;
>> +
>> +	/* Avoid Vlan dev with same MAC registering as child dev */
>> +	if (is_vlan_dev(event_dev))
>> +		return NOTIFY_DONE;
>> +
>> +	/* Avoid Bonding master dev with same MAC registering as child dev */
>> +	if ((event_dev->priv_flags & IFF_BONDING) &&
>> +	    (event_dev->flags & IFF_MASTER))
>> +		return NOTIFY_DONE;
>> +
>> +	switch (event) {
>> +	case NETDEV_REGISTER:
>> +		return virtnet_bypass_register_child(event_dev);
>> +	case NETDEV_UNREGISTER:
>> +		return virtnet_bypass_unregister_child(event_dev);
>> +	case NETDEV_UP:
>> +	case NETDEV_DOWN:
>> +	case NETDEV_CHANGE:
>> +		return virtnet_bypass_update_link(event_dev);
>> +	default:
>> +		return NOTIFY_DONE;
>> +	}
>> +}
> For example this function is 1:1 copy of netvsc, even with comments
> and bugs (like IFF_BODING check).
>
> This is also something that should be shared with netvsc.

The problem is with the calls that are made from this function.
Sharing would have been possible if both virtio and netvsc went with
same netdev model.

>
>
>> +
>> +static struct notifier_block virtnet_bypass_notifier = {
>> +	.notifier_call = virtnet_bypass_event,
>> +};
>> +
>> +static int virtnet_bypass_create(struct virtnet_info *vi)
>> +{
>> +	struct net_device *backup_netdev = vi->dev;
>> +	struct device *dev = &vi->vdev->dev;
>> +	struct net_device *bypass_netdev;
>> +	int res;
>> +
>> +	/* Alloc at least 2 queues, for now we are going with 16 assuming
>> +	 * that most devices being bonded won't have too many queues.
>> +	 */
>> +	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>> +					  16);
>> +	if (!bypass_netdev) {
>> +		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>> +	SET_NETDEV_DEV(bypass_netdev, dev);
>> +
>> +	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>> +	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>> +
>> +	/* Initialize the device options */
>> +	bypass_netdev->flags |= IFF_MASTER;
>> +	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |
> No clue why you set IFF_BONDING here...

This is to indicate to the userspace that this is a master bond device.
May be it is sufficient to just set IFF_MASTER.

>
>
>
>
>

^ permalink raw reply

* Re: de-indirect TCP congestion control
From: Stephen Hemminger @ 2018-03-12 20:51 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev
In-Reply-To: <20180312.160516.696486782939299055.davem@davemloft.net>

On Mon, 12 Mar 2018 16:05:16 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 12 Mar 2018 13:03:35 -0700
> 
> > 
> > 
> > On 03/12/2018 12:48 PM, Stephen Hemminger wrote:  
> >> On Mon, 12 Mar 2018 15:04:06 -0400 (EDT)
> >> David Miller <davem@davemloft.net> wrote:
> >>   
> >>> From: Stephen Hemminger <stephen@networkplumber.org>
> >>> Date: Mon, 12 Mar 2018 11:45:52 -0700
> >>>  
> >>>> Since indirect calls are expensive, and now even more so, perhaps we
> >>>> should figure out
> >>>> a way to make the default TCP congestion control hooks into direct
> >>>> calls.
> >>>> 99% of the users just use the single CC module compiled into the
> >>>> kernel.  
> >>>
> >>> Who is this magic user with only one CC algorithm enabled in their
> >>> kernel?  I want to know who this dude is?
> >>>
> >>> I don't think it's going to help much since people will have I think
> >>> at least two algorithms compiled into nearly everyone's tree.
> >>>
> >>> Distributions will enable everything.
> >>>
> >>> Google is going to have at least two algorithms enabled.
> >>>
> >>> etc. etc. etc.
> >>>
> >>> Getting rid of indirect calls is a fine goal, but the precondition you
> >>> are mentioning to achieve this doesn't seem practical at all.  
> >> What I meant is that kernels with N congestion controls, almost all
> >> traffic
> >> uses the default So that path can be optimized. The example I gave
> >> would
> >> have all the others doing the same indirect call.  
> > 
> > I do not understand. What is default_tcp_ops anyway ?
> > 
> > How changes to /proc/sys/net/ipv4/tcp_congestion_control will impact
> > this ?  
> 
> I'm also confused what is being suggested exactly and how this can
> work. :-)

Trying to directly call the fastpath TCP congestion operations for sockets
using the default.  The problem is how to do this and it gets messy with Kconfig
and CPP to deal with as raw tools. Maybe weak symbols could be used.

^ permalink raw reply

* [PATCH bpf-next v4 0/2] bpf stackmap with build_id+offset
From: Song Liu @ 2018-03-12 20:39 UTC (permalink / raw)
  To: netdev, ast, peterz, daniel; +Cc: kernel-team, hannes, qinteng, Song Liu

This is v4 of this work. Changes since v3:

1. Add fallback when build_id lookup failed. In this case, status is set
   to BPF_STACK_BUILD_ID_IP, and ip of this entry is saved.
2. Handle cases where vma is only part of the file (vma->vm_pgoff != 0).
   Thanks to Teng for helping me identify this issue!
3. Address feedbacks for previous versions.

Song Liu (2):
  bpf: extend stackmap to save binary_build_id+offset instead of address
  bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID

 include/uapi/linux/bpf.h                           |  22 ++
 kernel/bpf/stackmap.c                              | 251 +++++++++++++++++++--
 tools/include/uapi/linux/bpf.h                     |  22 ++
 tools/testing/selftests/bpf/Makefile               |  12 +-
 tools/testing/selftests/bpf/test_progs.c           | 164 +++++++++++++-
 .../selftests/bpf/test_stacktrace_build_id.c       |  60 +++++
 tools/testing/selftests/bpf/urandom_read.c         |  22 ++
 7 files changed, 529 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_stacktrace_build_id.c
 create mode 100644 tools/testing/selftests/bpf/urandom_read.c

--
2.9.5

^ permalink raw reply

* [PATCH bpf-next v4 2/2] bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID
From: Song Liu @ 2018-03-12 20:39 UTC (permalink / raw)
  To: netdev, ast, peterz, daniel; +Cc: kernel-team, hannes, qinteng, Song Liu
In-Reply-To: <20180312203957.2025833-1-songliubraving@fb.com>

test_stacktrace_build_id() is added. It accesses tracepoint urandom_read
with "dd" and "urandom_read" and gathers stack traces. Then it reads the
stack traces from the stackmap.

urandom_read is a statically link binary that reads from /dev/urandom.
test_stacktrace_build_id() calls readelf to read build ID of urandom_read
and compares it with build ID from the stackmap.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 tools/include/uapi/linux/bpf.h                     |  22 +++
 tools/testing/selftests/bpf/Makefile               |  12 +-
 tools/testing/selftests/bpf/test_progs.c           | 164 ++++++++++++++++++++-
 .../selftests/bpf/test_stacktrace_build_id.c       |  60 ++++++++
 tools/testing/selftests/bpf/urandom_read.c         |  22 +++
 5 files changed, 278 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_stacktrace_build_id.c
 create mode 100644 tools/testing/selftests/bpf/urandom_read.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index db6bdc3..1944d0a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -231,6 +231,28 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY		(1U << 3)
 #define BPF_F_WRONLY		(1U << 4)
 
+/* Flag for stack_map, store build_id+offset instead of pointer */
+#define BPF_F_STACK_BUILD_ID	(1U << 5)
+
+enum bpf_stack_build_id_status {
+	/* user space need an empty entry to identify end of a trace */
+	BPF_STACK_BUILD_ID_EMPTY = 0,
+	/* with valid build_id and offset */
+	BPF_STACK_BUILD_ID_VALID = 1,
+	/* couldn't get build_id, fallback to ip */
+	BPF_STACK_BUILD_ID_IP = 2,
+};
+
+#define BPF_BUILD_ID_SIZE 20
+struct bpf_stack_build_id {
+	__s32		status;
+	unsigned char	build_id[BPF_BUILD_ID_SIZE];
+	union {
+		__u64	offset;
+		__u64	ip;
+	};
+};
+
 union bpf_attr {
 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
 		__u32	map_type;	/* one of enum bpf_map_type */
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 8567a858..b0d29fd 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -13,6 +13,14 @@ endif
 CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include
 LDLIBS += -lcap -lelf -lrt -lpthread
 
+TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read
+all: $(TEST_CUSTOM_PROGS)
+
+$(TEST_CUSTOM_PROGS): urandom_read
+
+urandom_read: urandom_read.c
+	$(CC) -o $(TEST_CUSTOM_PROGS) -static $<
+
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user
@@ -21,7 +29,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
 	sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
 	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
-	sample_map_ret0.o test_tcpbpf_kern.o
+	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -74,3 +82,5 @@ $(OUTPUT)/%.o: %.c
 	$(CLANG) $(CLANG_FLAGS) \
 		 -O2 -target bpf -emit-llvm -c $< -o - |      \
 	$(LLC) -march=bpf -mcpu=$(CPU) -filetype=obj -o $@
+
+EXTRA_CLEAN := $(TEST_CUSTOM_PROGS)
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 27ad540..e9df48b 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -841,7 +841,8 @@ static void test_tp_attach_query(void)
 static int compare_map_keys(int map1_fd, int map2_fd)
 {
 	__u32 key, next_key;
-	char val_buf[PERF_MAX_STACK_DEPTH * sizeof(__u64)];
+	char val_buf[PERF_MAX_STACK_DEPTH *
+		     sizeof(struct bpf_stack_build_id)];
 	int err;
 
 	err = bpf_map_get_next_key(map1_fd, NULL, &key);
@@ -964,6 +965,166 @@ static void test_stacktrace_map()
 	return;
 }
 
+static int extract_build_id(char *build_id, size_t size)
+{
+	FILE *fp;
+	char *line = NULL;
+	size_t len = 0;
+
+	fp = popen("readelf -n ./urandom_read | grep 'Build ID'", "r");
+	if (fp == NULL)
+		return -1;
+
+	if (getline(&line, &len, fp) == -1)
+		goto err;
+	fclose(fp);
+
+	if (len > size)
+		len = size;
+	memcpy(build_id, line, len);
+	build_id[len] = '\0';
+	return 0;
+err:
+	fclose(fp);
+	return -1;
+}
+
+static void test_stacktrace_build_id(void)
+{
+	int control_map_fd, stackid_hmap_fd, stackmap_fd;
+	const char *file = "./test_stacktrace_build_id.o";
+	int bytes, efd, err, pmu_fd, prog_fd;
+	struct perf_event_attr attr = {};
+	__u32 key, previous_key, val, duration = 0;
+	struct bpf_object *obj;
+	char buf[256];
+	int i, j;
+	struct bpf_stack_build_id id_offs[PERF_MAX_STACK_DEPTH];
+	int build_id_matches = 0;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno))
+		goto out;
+
+	/* Get the ID for the sched/sched_switch tracepoint */
+	snprintf(buf, sizeof(buf),
+		 "/sys/kernel/debug/tracing/events/random/urandom_read/id");
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf),
+		  "read", "bytes %d errno %d\n", bytes, errno))
+		goto close_prog;
+
+	/* Open the perf event and attach bpf progrram */
+	attr.config = strtol(buf, NULL, 0);
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+	pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
+			 0 /* cpu 0 */, -1 /* group id */,
+			 0 /* flags */);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n",
+		  pmu_fd, errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n",
+		  err, errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n",
+		  err, errno))
+		goto disable_pmu;
+
+	/* find map fds */
+	control_map_fd = bpf_find_map(__func__, obj, "control_map");
+	if (CHECK(control_map_fd < 0, "bpf_find_map control_map",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	stackid_hmap_fd = bpf_find_map(__func__, obj, "stackid_hmap");
+	if (CHECK(stackid_hmap_fd < 0, "bpf_find_map stackid_hmap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	stackmap_fd = bpf_find_map(__func__, obj, "stackmap");
+	if (CHECK(stackmap_fd < 0, "bpf_find_map stackmap", "err %d errno %d\n",
+		  err, errno))
+		goto disable_pmu;
+
+	assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null")
+	       == 0);
+	assert(system("./urandom_read if=/dev/urandom of=/dev/zero count=4 2> /dev/null") == 0);
+	/* disable stack trace collection */
+	key = 0;
+	val = 1;
+	bpf_map_update_elem(control_map_fd, &key, &val, 0);
+
+	/* for every element in stackid_hmap, we can find a corresponding one
+	 * in stackmap, and vise versa.
+	 */
+	err = compare_map_keys(stackid_hmap_fd, stackmap_fd);
+	if (CHECK(err, "compare_map_keys stackid_hmap vs. stackmap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	err = compare_map_keys(stackmap_fd, stackid_hmap_fd);
+	if (CHECK(err, "compare_map_keys stackmap vs. stackid_hmap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	err = extract_build_id(buf, 256);
+
+	if (CHECK(err, "get build_id with readelf",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	err = bpf_map_get_next_key(stackmap_fd, NULL, &key);
+	if (CHECK(err, "get_next_key from stackmap",
+		  "err %d, errno %d\n", err, errno))
+		goto disable_pmu;
+
+	do {
+		char build_id[64];
+
+		err = bpf_map_lookup_elem(stackmap_fd, &key, id_offs);
+		if (CHECK(err, "lookup_elem from stackmap",
+			  "err %d, errno %d\n", err, errno))
+			goto disable_pmu;
+		for (i = 0; i < PERF_MAX_STACK_DEPTH; ++i)
+			if (id_offs[i].status == BPF_STACK_BUILD_ID_VALID &&
+			    id_offs[i].offset != 0) {
+				for (j = 0; j < 20; ++j)
+					sprintf(build_id + 2 * j, "%02x",
+						id_offs[i].build_id[j] & 0xff);
+				if (strstr(buf, build_id) != NULL)
+					build_id_matches = 1;
+			}
+		previous_key = key;
+	} while (bpf_map_get_next_key(stackmap_fd, &previous_key, &key) == 0);
+
+	CHECK(build_id_matches < 1, "build id match",
+	      "Didn't find expected build ID from the map");
+
+disable_pmu:
+	ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE);
+
+close_pmu:
+	close(pmu_fd);
+
+close_prog:
+	bpf_object__close(obj);
+
+out:
+	return;
+}
+
 int main(void)
 {
 	test_pkt_access();
@@ -976,6 +1137,7 @@ int main(void)
 	test_obj_name();
 	test_tp_attach_query();
 	test_stacktrace_map();
+	test_stacktrace_build_id();
 
 	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
 	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_stacktrace_build_id.c b/tools/testing/selftests/bpf/test_stacktrace_build_id.c
new file mode 100644
index 0000000..b755bd7
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_stacktrace_build_id.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+#ifndef PERF_MAX_STACK_DEPTH
+#define PERF_MAX_STACK_DEPTH         127
+#endif
+
+struct bpf_map_def SEC("maps") control_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") stackid_hmap = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 10000,
+};
+
+struct bpf_map_def SEC("maps") stackmap = {
+	.type = BPF_MAP_TYPE_STACK_TRACE,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_stack_build_id)
+		* PERF_MAX_STACK_DEPTH,
+	.max_entries = 128,
+	.map_flags = BPF_F_STACK_BUILD_ID,
+};
+
+/* taken from /sys/kernel/debug/tracing/events/random/urandom_read/format */
+struct random_urandom_args {
+	unsigned long long pad;
+	int got_bits;
+	int pool_left;
+	int input_left;
+};
+
+SEC("tracepoint/random/urandom_read")
+int oncpu(struct random_urandom_args *args)
+{
+	__u32 key = 0, val = 0, *value_p;
+
+	value_p = bpf_map_lookup_elem(&control_map, &key);
+	if (value_p && *value_p)
+		return 0; /* skip if non-zero *value_p */
+
+	/* The size of stackmap and stackid_hmap should be the same */
+	key = bpf_get_stackid(args, &stackmap, BPF_F_USER_STACK);
+	if ((int)key >= 0)
+		bpf_map_update_elem(&stackid_hmap, &key, &val, 0);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1; /* ignored by tracepoints, required by libbpf.a */
diff --git a/tools/testing/selftests/bpf/urandom_read.c b/tools/testing/selftests/bpf/urandom_read.c
new file mode 100644
index 0000000..4acfdeb
--- /dev/null
+++ b/tools/testing/selftests/bpf/urandom_read.c
@@ -0,0 +1,22 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdlib.h>
+
+#define BUF_SIZE 256
+int main(void)
+{
+	int fd = open("/dev/urandom", O_RDONLY);
+	int i;
+	char buf[BUF_SIZE];
+
+	if (fd < 0)
+		return 1;
+	for (i = 0; i < 4; ++i)
+		read(fd, buf, BUF_SIZE);
+
+	close(fd);
+	return 0;
+}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Song Liu @ 2018-03-12 20:39 UTC (permalink / raw)
  To: netdev, ast, peterz, daniel; +Cc: kernel-team, hannes, qinteng, Song Liu
In-Reply-To: <20180312203957.2025833-1-songliubraving@fb.com>

Currently, bpf stackmap store address for each entry in the call trace.
To map these addresses to user space files, it is necessary to maintain
the mapping from these virtual address to symbols in the binary. Usually,
the user space profiler (such as perf) has to scan /proc/pid/maps at the
beginning of profiling, and monitor mmap2() calls afterwards. Given the
cost of maintaining the address map, this solution is not practical for
system wide profiling that is always on.

This patch tries to solve this problem with a variation of stackmap. This
variation is enabled by flag BPF_F_STACK_BUILD_ID, and only works for
user stack. Instead of storing addresses, the variation stores ELF file
build_id + offset.

Build ID is a 20-byte unique identifier for ELF files. The following
command shows the Build ID of /bin/bash:

  [user@]$ readelf -n /bin/bash
  ...
    Build ID: XXXXXXXXXX
  ...

With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
for each entry in the call trace, and translate it into the following
struct:

  struct bpf_stack_build_id_offset {
          __s32           status;
          unsigned char   build_id[BPF_BUILD_ID_SIZE];
          union {
                  __u64   offset;
                  __u64   ip;
          };
  };

The search of build_id is limited to the first page of the file, and this
page should be in page cache. Otherwise, we fallback to store ip for this
entry (ip field in struct bpf_stack_build_id_offset). This requires the
build_id to be stored in the first page. A quick survey of binary and
dynamic library files in a few different systems shows that almost all
binary and dynamic library files have build_id in the first page.

User space can access struct bpf_stack_build_id_offset with bpf
syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
maintain mapping from build id to binary files. This mostly static
mapping is much easier to maintain than per process address maps.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/uapi/linux/bpf.h |  22 +++++
 kernel/bpf/stackmap.c    | 251 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 251 insertions(+), 22 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a66769..1e15d17 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -231,6 +231,28 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY		(1U << 3)
 #define BPF_F_WRONLY		(1U << 4)
 
+/* Flag for stack_map, store build_id+offset instead of pointer */
+#define BPF_F_STACK_BUILD_ID	(1U << 5)
+
+enum bpf_stack_build_id_status {
+	/* user space need an empty entry to identify end of a trace */
+	BPF_STACK_BUILD_ID_EMPTY = 0,
+	/* with valid build_id and offset */
+	BPF_STACK_BUILD_ID_VALID = 1,
+	/* couldn't get build_id, fallback to ip */
+	BPF_STACK_BUILD_ID_IP = 2,
+};
+
+#define BPF_BUILD_ID_SIZE 20
+struct bpf_stack_build_id {
+	__s32		status;
+	unsigned char	build_id[BPF_BUILD_ID_SIZE];
+	union {
+		__u64	offset;
+		__u64	ip;
+	};
+};
+
 union bpf_attr {
 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
 		__u32	map_type;	/* one of enum bpf_map_type */
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index b0ecf43..ec055ea 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -9,16 +9,19 @@
 #include <linux/filter.h>
 #include <linux/stacktrace.h>
 #include <linux/perf_event.h>
+#include <linux/elf.h>
+#include <linux/pagemap.h>
 #include "percpu_freelist.h"
 
-#define STACK_CREATE_FLAG_MASK \
-	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
+#define STACK_CREATE_FLAG_MASK					\
+	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY |	\
+	 BPF_F_STACK_BUILD_ID)
 
 struct stack_map_bucket {
 	struct pcpu_freelist_node fnode;
 	u32 hash;
 	u32 nr;
-	u64 ip[];
+	u64 data[];
 };
 
 struct bpf_stack_map {
@@ -29,6 +32,17 @@ struct bpf_stack_map {
 	struct stack_map_bucket *buckets[];
 };
 
+static inline bool stack_map_use_build_id(struct bpf_map *map)
+{
+	return (map->map_flags & BPF_F_STACK_BUILD_ID);
+}
+
+static inline int stack_map_data_size(struct bpf_map *map)
+{
+	return stack_map_use_build_id(map) ?
+		sizeof(struct bpf_stack_build_id) : sizeof(u64);
+}
+
 static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
 {
 	u32 elem_size = sizeof(struct stack_map_bucket) + smap->map.value_size;
@@ -68,8 +82,16 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
-	    value_size < 8 || value_size % 8 ||
-	    value_size / 8 > sysctl_perf_event_max_stack)
+	    value_size < 8 || value_size % 8)
+		return ERR_PTR(-EINVAL);
+
+	BUILD_BUG_ON(sizeof(struct bpf_stack_build_id) % sizeof(u64));
+	if (attr->map_flags & BPF_F_STACK_BUILD_ID) {
+		if (value_size % sizeof(struct bpf_stack_build_id) ||
+		    value_size / sizeof(struct bpf_stack_build_id)
+		    > sysctl_perf_event_max_stack)
+			return ERR_PTR(-EINVAL);
+	} else if (value_size / 8 > sysctl_perf_event_max_stack)
 		return ERR_PTR(-EINVAL);
 
 	/* hash table size must be power of 2 */
@@ -114,13 +136,175 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
+#define BPF_BUILD_ID 3
+/*
+ * Parse build id from the note segment. This logic can be shared between
+ * 32-bit and 64-bit system, because Elf32_Nhdr and Elf64_Nhdr are
+ * identical.
+ */
+static inline int stack_map_parse_build_id(void *page_addr,
+					   unsigned char *build_id,
+					   void *note_start,
+					   Elf32_Word note_size)
+{
+	Elf32_Word note_offs = 0, new_offs;
+
+	/* check for overflow */
+	if (note_start < page_addr || note_start + note_size < note_start)
+		return -EINVAL;
+
+	/* only supports note that fits in the first page */
+	if (note_start + note_size > page_addr + PAGE_SIZE)
+		return -EINVAL;
+
+	while (note_offs + sizeof(Elf32_Nhdr) < note_size) {
+		Elf32_Nhdr *nhdr = (Elf32_Nhdr *)(note_start + note_offs);
+
+		if (nhdr->n_type == BPF_BUILD_ID &&
+		    nhdr->n_namesz == sizeof("GNU") &&
+		    nhdr->n_descsz == BPF_BUILD_ID_SIZE) {
+			memcpy(build_id,
+			       note_start + note_offs +
+			       ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr),
+			       BPF_BUILD_ID_SIZE);
+			return 0;
+		}
+		new_offs = note_offs + sizeof(Elf32_Nhdr) +
+			ALIGN(nhdr->n_namesz, 4) + ALIGN(nhdr->n_descsz, 4);
+		if (new_offs <= note_offs)  /* overflow */
+			break;
+		note_offs = new_offs;
+	};
+	return -EINVAL;
+}
+
+/* Parse build ID from 32-bit ELF */
+static int stack_map_get_build_id_32(void *page_addr,
+				     unsigned char *build_id)
+{
+	Elf32_Ehdr *ehdr = (Elf32_Ehdr *)page_addr;
+	Elf32_Phdr *phdr;
+	int i;
+
+	/* only supports phdr that fits in one page */
+	if (ehdr->e_phnum >
+	    (PAGE_SIZE - sizeof(Elf32_Ehdr)) / sizeof(Elf32_Phdr))
+		return -EINVAL;
+
+	phdr = (Elf32_Phdr *)(page_addr + sizeof(Elf32_Ehdr));
+
+	for (i = 0; i < ehdr->e_phnum; ++i)
+		if (phdr[i].p_type == PT_NOTE)
+			return stack_map_parse_build_id(page_addr, build_id,
+					page_addr + phdr[i].p_offset,
+					phdr[i].p_filesz);
+	return -EINVAL;
+}
+
+/* Parse build ID from 64-bit ELF */
+static int stack_map_get_build_id_64(void *page_addr,
+				     unsigned char *build_id)
+{
+	Elf64_Ehdr *ehdr = (Elf64_Ehdr *)page_addr;
+	Elf64_Phdr *phdr;
+	int i;
+
+	/* only supports phdr that fits in one page */
+	if (ehdr->e_phnum >
+	    (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr))
+		return -EINVAL;
+
+	phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr));
+
+	for (i = 0; i < ehdr->e_phnum; ++i)
+		if (phdr[i].p_type == PT_NOTE)
+			return stack_map_parse_build_id(page_addr, build_id,
+					page_addr + phdr[i].p_offset,
+					phdr[i].p_filesz);
+	return -EINVAL;
+}
+
+/* Parse build ID of ELF file mapped to vma */
+static int stack_map_get_build_id(struct vm_area_struct *vma,
+				  unsigned char *build_id)
+{
+	Elf32_Ehdr *ehdr;
+	struct page *page;
+	void *page_addr;
+	int ret;
+
+	/* only works for page backed storage  */
+	if (!vma->vm_file)
+		return -EINVAL;
+
+	page = find_get_page(vma->vm_file->f_mapping, 0);
+	if (!page)
+		return -EFAULT;	/* page not mapped */
+
+	ret = -EINVAL;
+	page_addr = page_address(page);
+	ehdr = (Elf32_Ehdr *)page_addr;
+
+	/* compare magic x7f "ELF" */
+	if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
+		goto out;
+
+	/* only support executable file and shared object file */
+	if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN)
+		goto out;
+
+	if (ehdr->e_ident[EI_CLASS] == ELFCLASS32)
+		ret = stack_map_get_build_id_32(page_addr, build_id);
+	else if (ehdr->e_ident[EI_CLASS] == ELFCLASS64)
+		ret = stack_map_get_build_id_64(page_addr, build_id);
+out:
+	put_page(page);
+	return ret;
+}
+
+static void stack_map_get_build_id_offset(struct bpf_map *map,
+					  struct stack_map_bucket *bucket,
+					  u64 *ips, u32 trace_nr)
+{
+	int i;
+	struct vm_area_struct *vma;
+	struct bpf_stack_build_id *id_offs;
+
+	bucket->nr = trace_nr;
+	id_offs = (struct bpf_stack_build_id *)bucket->data;
+
+	if (!current || !current->mm ||
+	    down_read_trylock(&current->mm->mmap_sem) == 0) {
+		/* cannot access current->mm, fall back to ips */
+		for (i = 0; i < trace_nr; i++) {
+			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
+			id_offs[i].ip = ips[i];
+		}
+		return;
+	}
+
+	for (i = 0; i < trace_nr; i++) {
+		vma = find_vma(current->mm, ips[i]);
+		if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
+			/* per entry fall back to ips */
+			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
+			id_offs[i].ip = ips[i];
+			continue;
+		}
+		id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
+			- vma->vm_start;
+		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
+	}
+	up_read(&current->mm->mmap_sem);
+}
+
 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags)
 {
 	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
 	struct perf_callchain_entry *trace;
 	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
-	u32 max_depth = map->value_size / 8;
+	u32 max_depth = map->value_size / stack_map_data_size(map);
 	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
 	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
 	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
@@ -128,11 +312,16 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	bool user = flags & BPF_F_USER_STACK;
 	bool kernel = !user;
 	u64 *ips;
+	bool hash_matches;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
 		return -EINVAL;
 
+	/* build_id+offset stack map only supports user stack */
+	if (stack_map_use_build_id(map) && !user)
+		return -EINVAL;
+
 	trace = get_perf_callchain(regs, init_nr, kernel, user,
 				   sysctl_perf_event_max_stack, false, false);
 
@@ -156,24 +345,42 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	id = hash & (smap->n_buckets - 1);
 	bucket = READ_ONCE(smap->buckets[id]);
 
-	if (bucket && bucket->hash == hash) {
-		if (flags & BPF_F_FAST_STACK_CMP)
+	hash_matches = bucket && bucket->hash == hash;
+	/* fast cmp */
+	if (hash_matches && flags & BPF_F_FAST_STACK_CMP)
+		return id;
+
+	if (stack_map_use_build_id(map)) {
+		/* for build_id+offset, pop a bucket before slow cmp */
+		new_bucket = (struct stack_map_bucket *)
+			pcpu_freelist_pop(&smap->freelist);
+		if (unlikely(!new_bucket))
+			return -ENOMEM;
+		stack_map_get_build_id_offset(map, new_bucket, ips, trace_nr);
+		trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
+		if (hash_matches && bucket->nr == trace_nr &&
+		    memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
+			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
 			return id;
-		if (bucket->nr == trace_nr &&
-		    memcmp(bucket->ip, ips, trace_len) == 0)
+		}
+		if (bucket && !(flags & BPF_F_REUSE_STACKID)) {
+			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
+			return -EEXIST;
+		}
+	} else {
+		if (hash_matches && bucket->nr == trace_nr &&
+		    memcmp(bucket->data, ips, trace_len) == 0)
 			return id;
+		if (bucket && !(flags & BPF_F_REUSE_STACKID))
+			return -EEXIST;
+
+		new_bucket = (struct stack_map_bucket *)
+			pcpu_freelist_pop(&smap->freelist);
+		if (unlikely(!new_bucket))
+			return -ENOMEM;
+		memcpy(new_bucket->data, ips, trace_len);
 	}
 
-	/* this call stack is not in the map, try to add it */
-	if (bucket && !(flags & BPF_F_REUSE_STACKID))
-		return -EEXIST;
-
-	new_bucket = (struct stack_map_bucket *)
-		pcpu_freelist_pop(&smap->freelist);
-	if (unlikely(!new_bucket))
-		return -ENOMEM;
-
-	memcpy(new_bucket->ip, ips, trace_len);
 	new_bucket->hash = hash;
 	new_bucket->nr = trace_nr;
 
@@ -212,8 +419,8 @@ int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 	if (!bucket)
 		return -ENOENT;
 
-	trace_len = bucket->nr * sizeof(u64);
-	memcpy(value, bucket->ip, trace_len);
+	trace_len = bucket->nr * stack_map_data_size(map);
+	memcpy(value, bucket->data, trace_len);
 	memset(value + trace_len, 0, map->value_size - trace_len);
 
 	old_bucket = xchg(&smap->buckets[id], bucket);
-- 
2.9.5

^ permalink raw reply related

* Re: "wrong" ifindex on received VLAN tagged packet?
From: David Ahern @ 2018-03-12 20:26 UTC (permalink / raw)
  To: Lawrence Kreeger; +Cc: netdev
In-Reply-To: <CAOfa6T1QU+ukwTvujxN8xdSGjnEz=dXzMmGp1TQQGxaZU7ESaQ@mail.gmail.com>

On 3/6/18 5:27 PM, Lawrence Kreeger wrote:
> Using ETH_P_ALL instead of ETH_P_802_2, is causing mstpd to get 3
> copies of the same BPDU.  One from eth0, one from eth0.100, and
> another from vlan100 (the bridge).
> mstpd will drop the one from vlan100, but since there is also an
> instance of spanning tree running on the native VLAN, there is now no
> way to differentiate BPDUs coming in
> tagged vs untagged because they all show up with eth0.  So, there
> isn't some kernel knob to get the BPDUs to only come from eth0.100?

not that I am aware of. You could bind your socket or program to
eth0.100, but I suspect you actually have more than the one vlan
interface coming into the bridge that you want to snoop.

^ permalink raw reply

* Re: [PATCH 00/30] Netfilter/IPVS updates for net-next
From: Felix Fietkau @ 2018-03-12 20:22 UTC (permalink / raw)
  To: David Miller; +Cc: pablo, netfilter-devel, netdev
In-Reply-To: <20180312.160119.1610465393660409111.davem@davemloft.net>

On 2018-03-12 21:01, David Miller wrote:
> From: Felix Fietkau <nbd@nbd.name>
> Date: Mon, 12 Mar 2018 20:30:01 +0100
> 
>> It's not dead and useless. In its current state, it has a software fast
>> path that significantly improves nftables routing/NAT throughput,
>> especially on embedded devices.
>> On some devices, I've seen "only" 20% throughput improvement (along with
>> CPU usage reduction), on others it's quite a bit lot more. This is
>> without any extra drivers or patches aside from what's posted.
> 
> I wonder if this software fast path has the exploitability problems that
> things like the ipv4 routing cache and the per-cpu flow cache both had.
> And the reason for which both were removed.
> 
> I don't see how you can avoid this problem.
> 
> I'm willing to be shown otherwise :-)
I don't think it suffers from the same issues, and if it does, it's a
lot easier to mitigate. The ruleset can easily be configured to only
offload connections that transferred a certain amount of data, handling
only bulk flows.

It's easy to put an upper limit on the number of offloaded connections,
and there's nothing in the code that just creates an offload entry per
packet or per lookup or something like that.

If you have other concerns, I'm sure we can address them with follow-up
patches, but as it stands, I think the code is already quite useful.

- Felix

^ permalink raw reply

* [PATCH net-next 1/1] net sched actions: return explicit error when tunnel_key mode is not specified
From: Roman Mashak @ 2018-03-12 20:20 UTC (permalink / raw)
  To: davem; +Cc: netdev, jhs, amir, xiyou.wangcong, jiri, Roman Mashak

If set/unset mode of the tunnel_key action is not provided, ->init() still
returns 0, and the caller proceeds with bogus 'struct tc_action *' object,
this results in crash:

% tc actions add action tunnel_key src_ip 1.1.1.1 dst_ip 2.2.2.1 id 7 index 1

[   35.805515] general protection fault: 0000 [#1] SMP PTI
[   35.806161] Modules linked in: act_tunnel_key kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64
crypto_simd glue_helper cryptd serio_raw
[   35.808233] CPU: 1 PID: 428 Comm: tc Not tainted 4.16.0-rc4+ #286
[   35.808929] RIP: 0010:tcf_action_init+0x90/0x190
[   35.809457] RSP: 0018:ffffb8edc068b9a0 EFLAGS: 00010206
[   35.810053] RAX: 1320c000000a0003 RBX: 0000000000000001 RCX: 0000000000000000
[   35.810866] RDX: 0000000000000070 RSI: 0000000000007965 RDI: ffffb8edc068b910
[   35.811660] RBP: ffffb8edc068b9d0 R08: 0000000000000000 R09: ffffb8edc068b808
[   35.812463] R10: ffffffffc02bf040 R11: 0000000000000040 R12: ffffb8edc068bb38
[   35.813235] R13: 0000000000000000 R14: 0000000000000000 R15: ffffb8edc068b910
[   35.814006] FS:  00007f3d0d8556c0(0000) GS:ffff91d1dbc40000(0000)
knlGS:0000000000000000
[   35.814881] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   35.815540] CR2: 000000000043f720 CR3: 0000000019248001 CR4: 00000000001606a0
[   35.816457] Call Trace:
[   35.817158]  tc_ctl_action+0x11a/0x220
[   35.817795]  rtnetlink_rcv_msg+0x23d/0x2e0
[   35.818457]  ? __slab_alloc+0x1c/0x30
[   35.819079]  ? __kmalloc_node_track_caller+0xb1/0x2b0
[   35.819544]  ? rtnl_calcit.isra.30+0xe0/0xe0
[   35.820231]  netlink_rcv_skb+0xce/0x100
[   35.820744]  netlink_unicast+0x164/0x220
[   35.821500]  netlink_sendmsg+0x293/0x370
[   35.822040]  sock_sendmsg+0x30/0x40
[   35.822508]  ___sys_sendmsg+0x2c5/0x2e0
[   35.823149]  ? pagecache_get_page+0x27/0x220
[   35.823714]  ? filemap_fault+0xa2/0x640
[   35.824423]  ? page_add_file_rmap+0x108/0x200
[   35.825065]  ? alloc_set_pte+0x2aa/0x530
[   35.825585]  ? finish_fault+0x4e/0x70
[   35.826140]  ? __handle_mm_fault+0xbc1/0x10d0
[   35.826723]  ? __sys_sendmsg+0x41/0x70
[   35.827230]  __sys_sendmsg+0x41/0x70
[   35.827710]  do_syscall_64+0x68/0x120
[   35.828195]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   35.828859] RIP: 0033:0x7f3d0ca4da67
[   35.829331] RSP: 002b:00007ffc9f284338 EFLAGS: 00000246 ORIG_RAX:
000000000000002e
[   35.830304] RAX: ffffffffffffffda RBX: 00007ffc9f284460 RCX: 00007f3d0ca4da67
[   35.831247] RDX: 0000000000000000 RSI: 00007ffc9f2843b0 RDI: 0000000000000003
[   35.832167] RBP: 000000005aa6a7a9 R08: 0000000000000001 R09: 0000000000000000
[   35.833075] R10: 00000000000005f1 R11: 0000000000000246 R12: 0000000000000000
[   35.833997] R13: 00007ffc9f2884c0 R14: 0000000000000001 R15: 0000000000674640
[   35.834923] Code: 24 30 bb 01 00 00 00 45 31 f6 eb 5e 8b 50 08 83 c2 07 83 e2
fc 83 c2 70 49 8b 07 48 8b 40 70 48 85 c0 74 10 48 89 14 24 4c 89 ff <ff> d0 48
8b 14 24 48 01 c2 49 01 d6 45 85 ed 74 05 41 83 47 2c 
[   35.837442] RIP: tcf_action_init+0x90/0x190 RSP: ffffb8edc068b9a0
[   35.838291] ---[ end trace a095c06ee4b97a26 ]---

Fixes: d0f6dd8a914f ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 net/sched/act_tunnel_key.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 9169b7e78ada..0c0fbbfbc63b 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -153,6 +153,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
 		break;
 	default:
+		ret = -EINVAL;
 		goto err_out;
 	}
 
-- 
2.7.4

^ permalink raw reply related

* Re: [pci PATCH v5 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
From: Alexander Duyck @ 2018-03-12 20:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Duyck, Alexander H, linux-pci, virtio-dev, kvm,
	Netdev, Daly, Dan, LKML, linux-nvme, netanel, Maximilian Heyne,
	Wang, Liang-min, Rustad, Mark D, David Woodhouse,
	Christoph Hellwig, dwmw
In-Reply-To: <20180312182321.GG18494@localhost.localdomain>

On Mon, Mar 12, 2018 at 11:23 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, Mar 12, 2018 at 11:09:34AM -0700, Alexander Duyck wrote:
>> On Mon, Mar 12, 2018 at 10:40 AM, Keith Busch <keith.busch@intel.com> wrote:
>> > On Mon, Mar 12, 2018 at 10:21:29AM -0700, Alexander Duyck wrote:
>> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> >> index 024a1beda008..9cab9d0d51dc 100644
>> >> --- a/include/linux/pci.h
>> >> +++ b/include/linux/pci.h
>> >> @@ -1953,6 +1953,7 @@ static inline void pci_mmcfg_late_init(void) { }
>> >>  int pci_vfs_assigned(struct pci_dev *dev);
>> >>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>> >>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
>> >> +int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
>> >>  resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>> >>  void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
>> >>  #else
>> >
>> > I recommend stubbing 'pci_sriov_configure_simple' or defining it to
>> > NULL in the '#else' section here so you don't need to repeat the "#ifdef
>> > CONFIG_PCI_IOV" in each driver wishing to use this function. Otherwise
>> > looks fine to me.
>>
>> My concern with defining it as NULL is that somebody may end up
>> calling it in the future directly and that may end up causing issues.
>> One thought I have been debating is moving it to a different file. I
>> am just not sure where the best place to put something like this would
>> be. I could move this function to drivers/pci/pci.c if everyone is
>> okay with it and then I could just strip the contents out by wrapping
>> them in a #ifdef instead.
>
> Okay, instead of NULL, a stub implementation in the header file may
> suffice when CONFIG_PCI_IOV is not defined:
>
> static inline int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn)
> {
>         return -ENOSYS;
> }
>
> See pci_iov_virtfn_bus, pci_iov_virtfn_devfn, pci_iov_add_virtfn, or
> pci_enable_sriov for other examples.

No, I am aware of those. The problem is they aren't accessed as
function pointers. As such converting them to static inline functions
is easy. As I am sure you are aware an "inline" function doesn't
normally generate a function pointer.

Actually my original idea has been complicated further by the fact
that I realized my code is accessing functions that are static in the
iov.c file. I'll need to think about how to come up with a better
solution for this.

^ permalink raw reply

* Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-03-12 20:12 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtio-dev, jesse.brandeburg,
	alexander.h.duyck, kubakici
In-Reply-To: <1519934923-39372-3-git-send-email-sridhar.samudrala@intel.com>

Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF
>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to
>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>created that acts as a master device and tracks the state of the 2 lower
>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>passthru device with the same MAC is registered as 'active' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

[...]


>+static int virtnet_bypass_register_child(struct net_device *child_netdev)
>+{
>+	struct virtnet_bypass_info *vbi;
>+	struct net_device *dev;
>+	bool backup;
>+	int ret;
>+
>+	if (child_netdev->addr_len != ETH_ALEN)
>+		return NOTIFY_DONE;
>+
>+	/* We will use the MAC address to locate the virtnet_bypass netdev
>+	 * to associate with the child netdev. If we don't find a matching
>+	 * bypass netdev, move on.
>+	 */
>+	dev = get_virtnet_bypass_bymac(dev_net(child_netdev),
>+				       child_netdev->perm_addr);
>+	if (!dev)
>+		return NOTIFY_DONE;
>+
>+	vbi = netdev_priv(dev);
>+	backup = (child_netdev->dev.parent == dev->dev.parent);
>+	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>+			rtnl_dereference(vbi->active_netdev)) {
>+		netdev_info(dev,
>+			    "%s attempting to join bypass dev when %s already present\n",
>+			    child_netdev->name, backup ? "backup" : "active");
>+		return NOTIFY_DONE;
>+	}
>+
>+	/* Avoid non pci devices as active netdev */
>+	if (!backup && (!child_netdev->dev.parent ||
>+			!dev_is_pci(child_netdev->dev.parent)))
>+		return NOTIFY_DONE;
>+
>+	ret = netdev_rx_handler_register(child_netdev,
>+					 virtnet_bypass_handle_frame, dev);
>+	if (ret != 0) {
>+		netdev_err(child_netdev,
>+			   "can not register bypass receive handler (err = %d)\n",
>+			   ret);
>+		goto rx_handler_failed;
>+	}
>+
>+	ret = netdev_upper_dev_link(child_netdev, dev, NULL);
>+	if (ret != 0) {
>+		netdev_err(child_netdev,
>+			   "can not set master device %s (err = %d)\n",
>+			   dev->name, ret);
>+		goto upper_link_failed;
>+	}
>+
>+	child_netdev->flags |= IFF_SLAVE;
>+
>+	if (netif_running(dev)) {
>+		ret = dev_open(child_netdev);
>+		if (ret && (ret != -EBUSY)) {
>+			netdev_err(dev, "Opening child %s failed ret:%d\n",
>+				   child_netdev->name, ret);
>+			goto err_interface_up;
>+		}
>+	}

Much of this function is copy of netvsc_vf_join, should be shared with
netvsc.


>+
>+	/* Align MTU of child with master */
>+	ret = dev_set_mtu(child_netdev, dev->mtu);
>+	if (ret) {
>+		netdev_err(dev,
>+			   "unable to change mtu of %s to %u register failed\n",
>+			   child_netdev->name, dev->mtu);
>+		goto err_set_mtu;
>+	}
>+
>+	call_netdevice_notifiers(NETDEV_JOIN, child_netdev);
>+
>+	netdev_info(dev, "registering %s\n", child_netdev->name);
>+
>+	dev_hold(child_netdev);
>+	if (backup) {
>+		rcu_assign_pointer(vbi->backup_netdev, child_netdev);
>+		dev_get_stats(vbi->backup_netdev, &vbi->backup_stats);
>+	} else {
>+		rcu_assign_pointer(vbi->active_netdev, child_netdev);
>+		dev_get_stats(vbi->active_netdev, &vbi->active_stats);
>+		dev->min_mtu = child_netdev->min_mtu;
>+		dev->max_mtu = child_netdev->max_mtu;
>+	}
>+
>+	return NOTIFY_OK;
>+
>+err_set_mtu:
>+	dev_close(child_netdev);
>+err_interface_up:
>+	netdev_upper_dev_unlink(child_netdev, dev);
>+	child_netdev->flags &= ~IFF_SLAVE;
>+upper_link_failed:
>+	netdev_rx_handler_unregister(child_netdev);
>+rx_handler_failed:
>+	return NOTIFY_DONE;
>+}
>+
>+static int virtnet_bypass_unregister_child(struct net_device *child_netdev)
>+{
>+	struct virtnet_bypass_info *vbi;
>+	struct net_device *dev, *backup;
>+
>+	dev = get_virtnet_bypass_byref(child_netdev);
>+	if (!dev)
>+		return NOTIFY_DONE;
>+
>+	vbi = netdev_priv(dev);
>+
>+	netdev_info(dev, "unregistering %s\n", child_netdev->name);
>+
>+	netdev_rx_handler_unregister(child_netdev);
>+	netdev_upper_dev_unlink(child_netdev, dev);
>+	child_netdev->flags &= ~IFF_SLAVE;
>+
>+	if (child_netdev->dev.parent == dev->dev.parent) {
>+		RCU_INIT_POINTER(vbi->backup_netdev, NULL);
>+	} else {
>+		RCU_INIT_POINTER(vbi->active_netdev, NULL);
>+		backup = rtnl_dereference(vbi->backup_netdev);
>+		if (backup) {
>+			dev->min_mtu = backup->min_mtu;
>+			dev->max_mtu = backup->max_mtu;
>+		}
>+	}
>+
>+	dev_put(child_netdev);
>+
>+	return NOTIFY_OK;
>+}
>+
>+static int virtnet_bypass_update_link(struct net_device *child_netdev)
>+{
>+	struct net_device *dev, *active, *backup;
>+	struct virtnet_bypass_info *vbi;
>+
>+	dev = get_virtnet_bypass_byref(child_netdev);
>+	if (!dev || !netif_running(dev))
>+		return NOTIFY_DONE;
>+
>+	vbi = netdev_priv(dev);
>+
>+	active = rtnl_dereference(vbi->active_netdev);
>+	backup = rtnl_dereference(vbi->backup_netdev);
>+
>+	if ((active && virtnet_bypass_xmit_ready(active)) ||
>+	    (backup && virtnet_bypass_xmit_ready(backup))) {
>+		netif_carrier_on(dev);
>+		netif_tx_wake_all_queues(dev);
>+	} else {
>+		netif_carrier_off(dev);
>+		netif_tx_stop_all_queues(dev);
>+	}
>+
>+	return NOTIFY_OK;
>+}
>+
>+static int virtnet_bypass_event(struct notifier_block *this,
>+				unsigned long event, void *ptr)
>+{
>+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>+
>+	/* Skip our own events */
>+	if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops)
>+		return NOTIFY_DONE;
>+
>+	/* Avoid non-Ethernet type devices */
>+	if (event_dev->type != ARPHRD_ETHER)
>+		return NOTIFY_DONE;
>+
>+	/* Avoid Vlan dev with same MAC registering as child dev */
>+	if (is_vlan_dev(event_dev))
>+		return NOTIFY_DONE;
>+
>+	/* Avoid Bonding master dev with same MAC registering as child dev */
>+	if ((event_dev->priv_flags & IFF_BONDING) &&
>+	    (event_dev->flags & IFF_MASTER))
>+		return NOTIFY_DONE;
>+
>+	switch (event) {
>+	case NETDEV_REGISTER:
>+		return virtnet_bypass_register_child(event_dev);
>+	case NETDEV_UNREGISTER:
>+		return virtnet_bypass_unregister_child(event_dev);
>+	case NETDEV_UP:
>+	case NETDEV_DOWN:
>+	case NETDEV_CHANGE:
>+		return virtnet_bypass_update_link(event_dev);
>+	default:
>+		return NOTIFY_DONE;
>+	}
>+}

For example this function is 1:1 copy of netvsc, even with comments
and bugs (like IFF_BODING check).

This is also something that should be shared with netvsc.


>+
>+static struct notifier_block virtnet_bypass_notifier = {
>+	.notifier_call = virtnet_bypass_event,
>+};
>+
>+static int virtnet_bypass_create(struct virtnet_info *vi)
>+{
>+	struct net_device *backup_netdev = vi->dev;
>+	struct device *dev = &vi->vdev->dev;
>+	struct net_device *bypass_netdev;
>+	int res;
>+
>+	/* Alloc at least 2 queues, for now we are going with 16 assuming
>+	 * that most devices being bonded won't have too many queues.
>+	 */
>+	bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info),
>+					  16);
>+	if (!bypass_netdev) {
>+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>+		return -ENOMEM;
>+	}
>+
>+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>+	SET_NETDEV_DEV(bypass_netdev, dev);
>+
>+	bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops;
>+	bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops;
>+
>+	/* Initialize the device options */
>+	bypass_netdev->flags |= IFF_MASTER;
>+	bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT |

No clue why you set IFF_BONDING here...

^ permalink raw reply

* Re: [PATCH net-next v2] ibmvnic: Bail from ibmvnic_open if driver is already open
From: John Allen @ 2018-03-12 20:12 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Thomas Falcon, Nathan Fontenot
In-Reply-To: <20180312201002.GC31588@lunn.ch>

On 03/12/2018 03:10 PM, Andrew Lunn wrote:
>> The problem here is that our routine to change the mtu does a full reset on
>> the driver meaning that in the process we go from effectively "open" to
>> "closed" to "open" again.
>>
>> Consider the scenario where we change the mtu by running "ifdown <interface>",
>> editing the ifcfg file with the new mtu, and finally running "ifup <interface".
>> In this case, we call ibmvnic_close from the ifdown and as a result of the ifup,
>> we do the reset for the mtu change (which puts us back in the "open" state) and
>> call ibmvnic_open. After the reset, we are already in the "open" state and the
>> following call to open is redundant.
> 
> Hi John
> 
> So you are saying "ip link set mtu 4242 eth1" on a down interface will
> put it up. That i would say is broken. You should be fixing this,
> rather than papering over the cracks.

That's a good point. I'll work on a fix to address that.

-John

> 
>       Andrew
> 
> 

^ permalink raw reply

* [net-next 3/6] ixgbe: remove unneeded ipsec state free callback
From: Jeff Kirsher @ 2018-03-12 20:12 UTC (permalink / raw)
  To: davem; +Cc: Shannon Nelson, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20180312201215.2854-1-jeffrey.t.kirsher@intel.com>

From: Shannon Nelson <shannon.nelson@oracle.com>

With commit 7f05b467a735 ("xfrm: check for xdo_dev_state_free")
we no longer need to add an empty callback function
to the driver, so now let's remove the useless code.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 8623013e517e..f2254528dcfc 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -724,23 +724,10 @@ static bool ixgbe_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
 	return true;
 }
 
-/**
- * ixgbe_ipsec_free - called by xfrm garbage collections
- * @xs: pointer to transformer state struct
- *
- * We don't have any garbage to collect, so we shouldn't bother
- * implementing this function, but the XFRM code doesn't check for
- * existence before calling the API callback.
- **/
-static void ixgbe_ipsec_free(struct xfrm_state *xs)
-{
-}
-
 static const struct xfrmdev_ops ixgbe_xfrmdev_ops = {
 	.xdo_dev_state_add = ixgbe_ipsec_add_sa,
 	.xdo_dev_state_delete = ixgbe_ipsec_del_sa,
 	.xdo_dev_offload_ok = ixgbe_ipsec_offload_ok,
-	.xdo_dev_state_free = ixgbe_ipsec_free,
 };
 
 /**
-- 
2.14.3

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox